Breaking ntHash (to better fix it)
This post was initially written as part of my PhD thesis, but I thought it'd be interesting to share it on its own.
NtHash is a popular method for hashing k-mers in bioinformatics, yet it has some surprising flaws. In this post, I walk through a few of them, and show that they can arise naturally, without an adversarial setup.
Reminders on ntHash
NtHash (Mohamadi et al., 2016) is a family of rolling hashes specialized for hashing DNA sequences (with alphabet A/C/T/G). The hash of a sequence $x_0 \dots x_{k-1}$ is defined as $$ H(x_0 \dots x_{k-1}) = \bigoplus_{i=0}^{k-1} \operatorname{rol}^{k-1-i}(h(x_i)) $$ where $\oplus$ denotes XOR, $\operatorname{rol}$ denotes a bitwise left rotation and $h(x_i)$ is a precomputed seed value associated with the character $x_i$.
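The definition can be evaluated directly. Here is a minimal Python sketch; the seed values below are hypothetical 32-bit constants chosen for illustration (the real ntHash seeds are fixed 64-bit constants), but only the structure matters:

```python
# Hypothetical 32-bit seed values, one per base (real ntHash uses 64-bit constants).
SEEDS = {"A": 0x3C8B_FCB3, "C": 0x32B4_5075, "G": 0x2955_49F5, "T": 0x2032_3ED0}
MASK = 0xFFFF_FFFF

def rol(x, r):
    """Bitwise left rotation of a 32-bit value."""
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK

def nthash(s):
    """H(x_0...x_{k-1}) = XOR over i of rol^{k-1-i}(h(x_i))."""
    k = len(s)
    h = 0
    for i, c in enumerate(s):
        h ^= rol(SEEDS[c], k - 1 - i)
    return h
```

For a single character the hash is just its seed, and for two characters it is the first seed rotated once, XORed with the second.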
One of the main use cases of ntHash is to compute a hash over a sliding window with constant-time updates. Assuming we want to shift the current window from $x_0 \dots x_{k-1}$ to $x_1 \dots x_k$, we first need to remove the contribution of the first character: $$ H(x_1 \dots x_{k-1}) = H(x_0 \dots x_{k-1}) \oplus \operatorname{rol}^{k-1}(h(x_0)) $$ We then shift the remaining characters by applying a rotation and add the contribution of the new character: $$ H(x_1 \dots x_k) = \operatorname{rol}(H(x_1 \dots x_{k-1})) \oplus h(x_k) $$ This second equation can be extended to a more general merging property: given two sequences $u$ and $v$, $$ H(uv) = \operatorname{rol}^{|v|}(H(u)) \oplus H(v) $$
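Both the rolling update and the merging property are easy to check against a direct recomputation. A self-contained Python sketch (with hypothetical 32-bit seeds, not the real 64-bit ntHash constants):

```python
# Hypothetical 32-bit seeds for illustration.
SEEDS = {"A": 0x3C8B_FCB3, "C": 0x32B4_5075, "G": 0x2955_49F5, "T": 0x2032_3ED0}
MASK = 0xFFFF_FFFF

def rol(x, r):
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK

def nthash(s):
    """Direct (non-rolling) evaluation of the hash definition."""
    k = len(s)
    h = 0
    for i, c in enumerate(s):
        h ^= rol(SEEDS[c], k - 1 - i)
    return h

def roll(h, k, old, new):
    """Slide a k-length window one position: drop `old`, append `new`."""
    h ^= rol(SEEDS[old], k - 1)    # remove the outgoing character
    return rol(h, 1) ^ SEEDS[new]  # shift the rest, add the incoming one

def merge(hu, hv, len_v):
    """Merging property: H(uv) = rol^{|v|}(H(u)) XOR H(v)."""
    return rol(hu, len_v) ^ hv
```

Rolling across a sequence and recomputing each window from scratch give identical hashes, and concatenation matches the merge formula.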
A vectorized implementation of ntHash is available in the seq-hash Rust library that I wrote with Ragnar Groot Koerkamp.
Forging and propagating collisions
The first observation we can make is that it's very easy to create collisions when the window size $k$ is larger than the number of bits in the hash. Assuming we are using 32-bit hashes, any pair of bases that are 32 (or any multiple of 32) positions apart will end up with the same rotation (since rotating by 32 is equivalent to not rotating at all). Therefore, for any $\alpha, \beta \in \Sigma$ and $u \in \Sigma^{31}$, $$ H(\alpha \cdot u \cdot \beta) = H(\beta \cdot u \cdot \alpha) $$ and similarly $$ H(\alpha \cdot u \cdot \alpha) = H(\beta \cdot u \cdot \beta) = \operatorname{rol}(H(u)) $$ This issue is discussed in ntHash2 (Kazemi et al., 2022), where the authors suggest replacing the $\operatorname{rol}$ operation with $\operatorname{srol}$, which divides the 64-bit hash into a 31-bit and a 33-bit part and rotates them independently. While this operation is more expensive to compute, it has a period of $31 \times 33 = 1023$ instead of 64, thus limiting these specific collisions to $k \ge 1024$.
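These collisions are easy to reproduce, and they hold regardless of the seed values: swapping two characters 32 positions apart never changes a 32-bit hash. A self-contained Python sketch (hypothetical 32-bit seeds):

```python
# Hypothetical 32-bit seeds; the collisions below hold for ANY choice of seeds.
SEEDS = {"A": 0x3C8B_FCB3, "C": 0x32B4_5075, "G": 0x2955_49F5, "T": 0x2032_3ED0}
MASK = 0xFFFF_FFFF

def rol(x, r):
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK

def nthash(s):
    k = len(s)
    h = 0
    for i, c in enumerate(s):
        h ^= rol(SEEDS[c], k - 1 - i)
    return h

# A 33-mer: its first and last characters are 32 positions apart,
# so they receive the same rotation (rol^32 = rol^0 = identity).
u = ("ACGT" * 8)[:31]  # arbitrary 31-character middle part
```

Swapping the two end characters gives a collision, and when both ends carry the same character their contributions cancel entirely, leaving $\operatorname{rol}(H(u))$.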
Something more problematic is that, because of the merging property discussed earlier, if for some word size $n$ we have a collision for $u \ne v \in \Sigma^n$, it will create at least $|\Sigma|^{k-n}$ collisions for each word size $k \ge n$: $$ H(u) = H(v) \implies \forall x, y \in \Sigma^*, H(x \cdot u \cdot y) = H(x \cdot v \cdot y) $$
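We can watch the propagation happen: take a colliding pair of 33-mers obtained by swapping their end characters (32 positions apart, so same rotation in a 32-bit hash), and embed them in arbitrary shared contexts. A self-contained Python sketch with hypothetical 32-bit seeds:

```python
# Hypothetical 32-bit seeds; the propagation holds for any seeds.
SEEDS = {"A": 0x3C8B_FCB3, "C": 0x32B4_5075, "G": 0x2955_49F5, "T": 0x2032_3ED0}
MASK = 0xFFFF_FFFF

def rol(x, r):
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK

def nthash(s):
    k = len(s)
    h = 0
    for i, c in enumerate(s):
        h ^= rol(SEEDS[c], k - 1 - i)
    return h

# A colliding pair of 33-mers (end characters swapped).
mid = ("ACGT" * 8)[:31]
u, v = "A" + mid + "C", "C" + mid + "A"
```

Once $H(u) = H(v)$, surrounding both words with any common prefix $x$ and suffix $y$ keeps the hashes equal, exactly as the implication states.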
Therefore, it is especially important to avoid collisions for small words. Ragnar Groot Koerkamp described on his website a method to select the seed values $h(x_i)$ to avoid collisions in 64-bit hashes for $k \le 32$. His solution is essentially to define $h(\texttt{T})$ as $h(\texttt{A}) \oplus h(\texttt{C}) \oplus h(\texttt{G})$ and to adjust the other three values to obtain an invertible system.
Bias on the leading zeros
A very common use case of rolling hashes in bioinformatics is to hash every k-mer and select those with "small" hashes as representatives. For instance, this is extensively used in MinHash-based sketching techniques or in random minimizers.
These techniques generally assume that the hash values are independent and uniformly distributed. Under these assumptions, every k-mer has a probability $2^{-n}$ that its hash starts with at least $n$ zeros. So if we define "small" as starting with at least $n$ zeros, the distance between "small" k-mers roughly follows a geometric distribution (this is not exactly true because of repeated k-mers, though this is negligible for large $k$).
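As a baseline, the $2^{-n}$ probability is easy to confirm for truly uniform values. The sketch below uses a xorshift32 generator as a stand-in for uniform 32-bit hashes and measures the fraction of values with at least $n$ leading zeros:

```python
MASK = 0xFFFF_FFFF

def xorshift32(x):
    """Marsaglia's xorshift32 PRNG step; stands in for a uniform 32-bit hash."""
    x ^= (x << 13) & MASK
    x ^= x >> 17
    x ^= (x << 5) & MASK
    return x

def leading_zeros(x):
    """Number of leading zeros of a 32-bit value."""
    return 32 - x.bit_length()

def fraction_with_lz(n, samples=1 << 18, seed=0x1234_5678):
    """Fraction of pseudo-uniform 32-bit values with at least n leading zeros."""
    x, count = seed, 0
    for _ in range(samples):
        x = xorshift32(x)
        count += leading_zeros(x) >= n
    return count / samples
```

For uniform values the measured fraction stays close to $2^{-n}$; the point of the next experiments is that ntHash values over a sliding window do not behave like this.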
Unfortunately, I noticed that the empirical distribution I had in my experiments (documented here) was not quite geometric, and that the spacing of "small" k-mers was skewed. In particular, the number of leading zeros of a given k-mer's hash was strongly correlated with the number of leading zeros of the neighboring k-mers' hashes, invalidating both the independence and the uniformity assumptions.
To document this behavior, I computed the number of leading zeros of every k-mer hash in a large random sequence (4 billion bases), and measured the probability that two consecutive k-mers transition from $i$ leading zeros to $j$ leading zeros. The scripts for these experiments are available at https://github.com/imartayan/nthash-bias.
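A scaled-down version of this experiment fits in a few lines: hash a pseudo-random sequence with a rolling 32-bit hash and count leading-zero transitions between consecutive k-mers. The seeds below are hypothetical 32-bit values (the actual experiment uses the real 64-bit ntHash constants), so the exact pattern of the matrix may differ:

```python
# Hypothetical 32-bit seeds; the real experiment uses 64-bit ntHash constants.
SEEDS = {"A": 0x3C8B_FCB3, "C": 0x32B4_5075, "G": 0x2955_49F5, "T": 0x2032_3ED0}
MASK = 0xFFFF_FFFF

def rol(x, r):
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK

def nthash(s):
    k = len(s)
    h = 0
    for i, c in enumerate(s):
        h ^= rol(SEEDS[c], k - 1 - i)
    return h

def roll(h, k, old, new):
    """Constant-time window shift."""
    return rol(h ^ rol(SEEDS[old], k - 1), 1) ^ SEEDS[new]

def xorshift32(x):
    x ^= (x << 13) & MASK
    x ^= x >> 17
    x ^= (x << 5) & MASK
    return x

def lz_transitions(length, k=21, seed=1):
    """33x33 matrix counting leading-zero transitions between
    consecutive k-mer hashes of a pseudo-random DNA sequence."""
    x, seq = seed, []
    for _ in range(length):
        x = xorshift32(x)
        seq.append("ACGT"[x & 3])
    counts = [[0] * 33 for _ in range(33)]
    h = nthash(seq[:k])
    prev = 32 - h.bit_length()
    for i in range(1, length - k + 1):
        h = roll(h, k, seq[i - 1], seq[i + k - 1])
        cur = 32 - h.bit_length()
        counts[prev][cur] += 1
        prev = cur
    return counts
```

Normalizing each cell by the total count gives the transition probabilities plotted below.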
The plots below contain two visualizations of this transition probability. The first one shows the raw probability of each transition: each line corresponds to the current number of leading zeros, while each column corresponds to the next value. The second one shows the ratio between this probability and the expected probability for a geometric distribution: blue when less than expected, white when matching the expectation and red when more than expected.
The first thing that we can notice is that many of the cells of the first matrix are actually empty, meaning that the corresponding transition is simply not observed. While this is not very surprising for the bottom-right half of the matrix simply because of sample size (remember that a transition from $i$ to $j$ has a probability proportional to $2^{-(i+j)}$ in theory), it is much more surprising to see empty rows or columns interleaved with non-empty ones (see columns 3, 5, 6 and rows 4, 6, 7). In particular, this implies that after observing a hash with exactly 4 leading zeros, the hash next to it will always have at most 3 leading zeros. Conversely, a hash with exactly 3 leading zeros can only be observed after at most 4 leading zeros.
The second matrix allows us to put these probabilities into perspective with the expected ones. While the first rows and columns mostly match the expectations, the missing transitions in blue are compensated by overrepresented ones in red. Notably, after observing a hash with at least 9 leading zeros, the probability of having exactly 7 leading zeros next to it is 16× higher. This creates an interesting dynamic where small hashes with at least 9 leading zeros are likely followed by 7 leading zeros, which are then followed by at most 6 leading zeros.
Even though this phenomenon is difficult to explain, the offset of one between the missing rows and columns made me conjecture that it is related to the 1-bit rotations performed at each step. To verify this hypothesis, I introduced variations of ntHash where the 1-bit rotations are replaced by R-bit rotations. This does not cause any problem as long as R is odd (since R and 32 need to be coprime) and does not affect performance, but simply shuffles the rotation offsets at each step. The next plots show the results obtained for R = 7 and R = 13.
Interestingly, R does have an impact on the offset between biased rows and columns: for R = 7 we observe a pattern starting at $(i, j) = (8, 1)$ while for R = 13 we observe two patterns starting at $(i, j) = (14, 1)$ and $(i, j) = (0, 19)$ respectively (note that $19 = -13 \mod 32$), which confirms the $i - j = R \mod 32$ conjecture. One practical benefit of shifting the biased rows and columns is that their probability drastically drops, making them less concerning in practice and leading to a more regular transition matrix. As a result, I recommend using R = 13 or 15 by default to reduce the bias (and we implemented a similar fix in seq-hash).
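For reference, the R-rotation variant only changes the rotation amounts, and the constant-time update carries over with $\operatorname{rol}^R$ in place of $\operatorname{rol}$. A self-contained Python sketch with hypothetical 32-bit seeds:

```python
# Hypothetical 32-bit seeds for illustration.
SEEDS = {"A": 0x3C8B_FCB3, "C": 0x32B4_5075, "G": 0x2955_49F5, "T": 0x2032_3ED0}
MASK = 0xFFFF_FFFF

def rol(x, r):
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK

def nthash_r(s, r):
    """Variant where character i is rotated by r*(k-1-i) bits.
    r must be odd so that r is coprime with 32."""
    k = len(s)
    h = 0
    for i, c in enumerate(s):
        h ^= rol(SEEDS[c], r * (k - 1 - i))
    return h

def roll_r(h, k, old, new, r):
    """Constant-time window shift for the r-rotation variant."""
    h ^= rol(SEEDS[old], r * (k - 1))  # remove the outgoing character
    return rol(h, r) ^ SEEDS[new]      # rotate by r, add the incoming one
```

Rolling with any odd R still matches a full recomputation of each window, so the change is free apart from the different rotation constant.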
References
- Mohamadi et al., 2016 — ntHash: Recursive nucleotide hashing
- Kazemi et al., 2022 — ntHash2: Recursive spaced seed hashing for nucleotide sequences