<h1 id="average-distance-of-two-random-k-mers">Average distance of two random k-mers</h1>
<p><em>2024-03-16</em></p>
<p>Intuitively, I've always thought that the average distance between two random <em>k</em>-mers is equal to $4^k / 2$, but it turns out that this intuition is wrong.</p>
<p>To prove it, let's compute the average of $f : (x, y) \mapsto |x - y|$ over $[0, u]^2$ for a fixed $u$:
$$\begin{aligned}
& \frac{1}{u^2} \int_0^u \int_0^u |x - y| \,\mathrm{d}y \,\mathrm{d}x \\
&= \frac{1}{u^2} \int_0^u \left(\int_0^x x - y \,\mathrm{d}y\right) + \left(\int_x^u y - x \,\mathrm{d}y\right) \,\mathrm{d}x \\
&= \frac{1}{u^2} \int_0^u \frac{x^2}{2} + \frac{(u - x)^2}{2} \,\mathrm{d}x \\
&= \frac{1}{u^2} \int_0^u x^2 - ux + \frac{u^2}{2} \,\mathrm{d}x \\
&= \frac{1}{u^2} \int_0^u x^2 \,\mathrm{d}x \qquad \text{since } \int_0^u \frac{u^2}{2} - ux \,\mathrm{d}x = 0 \\
&= \frac{u}{3}
\end{aligned}$$</p>
<p>In particular, for $u = 4^k$ this gives an average distance of $4^k / 3$. (The integral is a continuous approximation, but the exact average for two independent uniform $k$-mers, $\frac{4^{2k} - 1}{3 \cdot 4^k}$, is indeed $\approx 4^k / 3$.)</p>
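<p>As a sanity check, here is a minimal Python simulation, assuming each $k$-mer is encoded as a uniform integer in $[0, 4^k)$ (e.g. its lexicographic rank):</p>
<pre><code>import random

k = 6
n = 4 ** k  # number of distinct k-mers over {A, C, G, T}
samples = 1_000_000

# average |x - y| over random pairs of k-mer codes
total = sum(abs(random.randrange(n) - random.randrange(n)) for _ in range(samples))
print(total / samples)  # close to n / 3 = 1365.33, not n / 2 = 2048
</code></pre>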
<p>Thanks to <a rel="noopener" target="_blank" href="https://lrobidou.github.io/">Lucas</a> for pointing this out!</p>
<h1 id="median-trick-and-sketching">Median trick and sketching</h1>
<p><em>2022-08-31</em></p>
<p>In this post, I'd like to give some intuition about a useful technique from statistics which has many applications for randomized and sketching algorithms: the <strong>median trick</strong>.</p>
<h1 id="boosting-probabilities-with-the-median-trick">Boosting probabilities with the median trick</h1>
<p>Consider a random variable $Y$ which gives a "good" estimate with probability $p > \frac{1}{2}$.</p>
<blockquote>
<h6>Side note</h6>
<p>For instance, if your goal is to approximate some value $x$, a good estimate could mean
$$(1 - \varepsilon) x \le Y \le (1 + \varepsilon) x$$</p>
</blockquote>
<p>The purpose of the median trick is to boost the probability of success up to $1 - \delta$, for some $\delta$ as small as you want.</p>
<p>In order to achieve this, all we need to do is maintain $r = C \ln \frac{1}{\delta}$ <em>independent</em> copies of $Y$ and compute their <em>median</em> $M$.</p>
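<p>As a minimal sketch of the procedure (the function name and interface are my own), the whole trick fits in a few lines of Python:</p>
<pre><code>import math
from statistics import median

def median_trick(estimator, delta, C):
    """Boost an estimator that is good with probability p > 1/2.
    As shown in the proof below, C >= 2 / (alpha^2 * p) with
    alpha = 1 - 1/(2p) suffices for a failure probability below delta."""
    # r = C ln(1/delta) copies, rounded up to an odd number
    r = 2 * math.ceil(C * math.log(1 / delta) / 2) + 1
    return median(estimator() for _ in range(r))
</code></pre>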
<blockquote>
<h6>Side note</h6>
<p>Computing the median quickly is an interesting problem in itself.
It can be done in worst-case linear time with an elegant algorithm (the <em>median of medians</em> algorithm) that I will not detail here.</p>
</blockquote><h2 id="reminders-on-concentration-inequalities">Reminders on concentration inequalities</h2>
<p>Before diving into the proof, let's take some time to review some important concentration inequalities.</p>
<p>First of all, one of the most famous concentration inequalities is <strong>Chebyshev's inequality</strong>.
It tells us that for any random variable $X$ with finite variance,
$$\mathbb{P}(|X - \mathbb{E}[X]| > \varepsilon) \le \frac{\mathbb{V}[X]}{\varepsilon^2}$$
In other words, taking $\varepsilon = \sqrt{k} \sigma$ with $\sigma = \sqrt{\mathbb{V}[X]}$,
$$\mathbb{E}[X] - \sqrt{k} \sigma \le X \le \mathbb{E}[X] + \sqrt{k} \sigma$$
with probability at least $1 - \frac{1}{k}$.</p>
<p>Now, suppose that $X$ is the sum of $n$ independent Bernoulli variables
$$X = \sum_{i=1}^n X_i$$
Under this assumption, the <strong>Chernoff bounds</strong> give us much better inequalities.
For any $\alpha > 0$,
$$\mathbb{P}\big(X > (1 + \alpha) \mathbb{E}[X]\big) \le e^{-\frac{\alpha^2 \mathbb{E}[X]}{2 + \alpha}}$$
$$\mathbb{P}\big(X < (1 - \alpha) \mathbb{E}[X]\big) \le e^{-\frac{\alpha^2 \mathbb{E}[X]}{2}}$$</p>
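<p>To get a feel for how much sharper the Chernoff bounds are, here is a small numeric check on a Binomial$(100, \frac{1}{2})$ variable (the numbers are an arbitrary example of mine):</p>
<pre><code>import math

n, p = 100, 0.5
mean = n * p
alpha = 0.8  # deviation of 80% above the mean

# exact upper tail: P(X > (1 + alpha) * mean) = P(X > 90)
exact = sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(91, n + 1))
chernoff = math.exp(-alpha ** 2 * mean / (2 + alpha))
chebyshev = n * p * (1 - p) / (alpha * mean) ** 2  # bound on P(|X - mean| > 40)

print(exact, chernoff, chebyshev)  # Chernoff is orders of magnitude tighter
</code></pre>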
<h2 id="proof-of-the-median-trick">Proof of the median trick</h2>
<p>First, denote by $Y_1, \dots, Y_r$ the $r$ copies of $Y$ and define $r$ random variables $Z_1, \dots, Z_r$ such that
$$Z_i = \begin{cases}1 & \text{if } Y_i \text{ gives a good estimate}\cr 0 & \text{otherwise}\end{cases}$$
and
$$Z = \sum_{i=1}^r Z_i$$
Since $Z$ counts the number of good estimates,
having $Z \ge \frac{r}{2}$ (taking $r$ odd, so that the median is one of the $Y_i$) guarantees that the median is a good estimate as well.
Moreover, since each copy succeeds with probability $p$,
$$\mathbb{E}[Z] = rp > \frac{r}{2}$$
Using Chernoff's lower bound with $\alpha = 1 - \frac{1}{2p}$ (so that $(1 - \alpha) rp = \frac{r}{2}$), we get
$$
\begin{aligned}
\mathbb{P}\left(Z < \frac{r}{2}\right) & = \mathbb{P}\big(Z < (1 - \alpha) rp\big)\cr
& \le e^{-\frac{\alpha^2 rp}{2}} = e^{-\frac{\alpha^2 Cp \ln \frac{1}{\delta}}{2}}\cr
& = \delta^{\frac{\alpha^2 Cp}{2}} \le \delta
\end{aligned}
$$
as long as $C \ge \frac{2}{\alpha^2 p}$, so that the exponent $\frac{\alpha^2 C p}{2}$ is at least $1$.</p>
<p>Therefore $M$ gives a good estimate with probability at least $1 - \delta$.</p>
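<p>For a concrete instance: with $p = \frac{3}{4}$ we get $\alpha = \frac{1}{3}$ and $C = \frac{2}{\alpha^2 p} = 24$, so reaching $\delta = 10^{-6}$ takes $r = 24 \ln 10^6 \approx 332$ copies.</p>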
<h1 id="application-to-sketching-ams-sketch">Application to sketching: AMS sketch</h1>
<p>Now that I have introduced the median trick, I would like to present an interesting application to sketching algorithms: the <strong>AMS sketch</strong>.</p>
<blockquote>
<h6>Side note</h6>
<p><em>AMS</em> stands for <em>Alon</em>, <em>Matias</em> and <em>Szegedy</em>, three famous computer scientists who received a <a rel="noopener" target="_blank" href="https://sigact.org/prizes/g%C3%B6del/2005.html">Gödel Prize</a> in 2005 for their work on sketching algorithms.</p>
</blockquote>
<p>Given a stream of values $\sigma = \langle \sigma_1, \dots, \sigma_m \rangle$ where each value belongs to $\llbracket 1, n \rrbracket$, our goal is to compute the second frequency moment
$$F_2 = \sum_{i=1}^n f_i^2$$
where $f_i$ denotes the frequency of $i$, i.e. its number of occurrences in the stream,
and we want to achieve this using very little space.
In particular, we cannot afford to store each $f_i$ in memory because it would require $\mathcal{O}(n \log m)$ space.</p>
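<p>As a baseline, the exact computation is straightforward when we can afford one counter per distinct value (a hypothetical reference implementation, not part of the sketch):</p>
<pre><code>from collections import Counter

def exact_f2(stream):
    # O(n log m) space: one counter per distinct value
    freq = Counter(stream)
    return sum(f * f for f in freq.values())

print(exact_f2([1, 2, 1, 3, 1]))  # f = (3, 1, 1), so F2 = 9 + 1 + 1 = 11
</code></pre>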
<h2 id="first-estimation">First estimation</h2>
<p>The main idea of the AMS sketch is to maintain a sum
$$S = \sum_{j=1}^m s(\sigma_j)$$
where $s$ is a <em>4-wise independent</em> hash function assigning a sign ($+1$ or $-1$) to each value.</p>
<article>
<h6>$k$-wise independence</h6>
<p>A family of hash functions $H$ is $k$-wise independent if for every distinct $x_1, \dots, x_k \in\llbracket 1, n \rrbracket$ and every $y_1, \dots, y_k \in \llbracket 1, l \rrbracket$,
$$\mathbb{P}_{h \in H} (\forall i, h(x_i) = y_i) = \frac{1}{l^k}$$</p>
<p>These hash families are very useful for randomized algorithms.</p>
<p>A simple way to generate such a family is to use a random polynomial of degree $k - 1$ modulo some prime $p$:
$$x \mapsto a_{k-1} x^{k-1} + \dots + a_0 \mod p$$
where $a_0, \dots, a_{k-1}$ are chosen uniformly at random: by polynomial interpolation, the values at any $k$ distinct points are uniform and independent.</p>
</article>
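<p>Here is a minimal sketch of such a sign function, using a random degree-3 polynomial modulo a prime (which is 4-wise independent); the prime and the interface are arbitrary choices of mine:</p>
<pre><code>import random

P = 2_147_483_647  # a Mersenne prime, assumed larger than n

class FourWiseSign:
    def __init__(self):
        # four random coefficients define a degree-3 polynomial mod P
        self.coeffs = [random.randrange(P) for _ in range(4)]

    def __call__(self, x):
        h = 0
        for a in self.coeffs:
            h = (h * x + a) % P  # Horner evaluation of the polynomial
        # fold the hash to a sign (with a negligible O(1/P) bias, P being odd)
        return 1 if h % 2 == 0 else -1
</code></pre>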
<p>At the end of the stream, we have
$$S = \sum_{j=1}^m s(\sigma_j) = \sum_{i=1}^n s(i) f_i$$
and we approximate $F_2$ using
$$X = S^2 = \sum_{i=1}^n f_i^2 + \sum_{i \neq j} s(i) s(j) f_i f_j$$
$$\mathbb{E}[X] = F_2 + \sum_{i \neq j} \underbrace{\mathbb{E}[s(i) s(j)]}_{\mathbb{E}[s(i)] \cdot \mathbb{E}[s(j)] = 0} f_i f_j = F_2$$
because $s(i)$ and $s(j)$ are independent (pairwise independence already suffices here; 4-wise independence is what makes the variance computation below work).
$$\mathbb{V}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \mathbb{E}[S^4] - F_2^2$$</p>
<p>and
$$
\begin{aligned}
\mathbb{E}[S^4] & = \sum_{i=1}^n f_i^4 + \binom{4}{2} \sum_{i < j} f_i^2 f_j^2\cr
& + \sum_{\text{other terms}} \underbrace{\mathbb{E}[s(i) s(j) s(k) s(l)]}_{0} f_i f_j f_k f_l
\end{aligned}
$$
where each remaining term contains some sign raised to an odd power, hence has expectation $0$ by 4-wise independence.</p>
<p>so
$$\mathbb{V}[X] = 4 \sum_{i < j} f_i^2 f_j^2 \le 2 F_2^2$$</p>
<p>As you can see, $X$ is an unbiased estimator of $F_2$, but its variance is still quite large.</p>
<h2 id="improving-the-precision">Improving the precision</h2>
<p>In order to improve the precision of our estimation, let us reduce the variance
by taking $t$ independent copies $X_1, \dots, X_t$ of $X$ and computing their mean
$$Y = \frac{1}{t} \sum_{i=1}^t X_i$$
While the average stays the same, the variance gets shrunk by a factor $t$:
$$\mathbb{V}[Y] = \frac{1}{t^2} \mathbb{V}\left[\sum_{i=1}^t X_i\right] = \frac{\mathbb{V}[X]}{t} \le \frac{2 F_2^2}{t}$$</p>
<p>What's more, by choosing $t = \frac{6}{\varepsilon^2}$, Chebyshev's inequality leads to
$$\mathbb{P}(|Y - F_2| > \varepsilon F_2) \le \frac{\mathbb{V}[Y]}{\varepsilon^2 F_2^2} \le \frac{2}{t \varepsilon^2} = \frac{1}{3}$$</p>
<p>Therefore, we know that
$$(1 - \varepsilon) F_2 \le Y \le (1 + \varepsilon) F_2$$
with probability at least $\frac{2}{3}$.</p>
<h2 id="wrapping-it-all-together">Wrapping it all together</h2>
<p>So we are left with an estimator that gives a good approximation of $F_2$ with probability at least $\frac{2}{3}$, and we would like to make it more reliable.
This sounds familiar, right?</p>
<p>As you might have guessed, it is time to make use of the median trick!</p>
<p>Using the method described in the first section, we obtain an estimator $M$ that satisfies
$$(1 - \varepsilon) F_2 \le M \le (1 + \varepsilon) F_2$$
with probability at least $1 - \delta$.</p>
<p>In the end, this algorithm requires $\mathcal{O}\big(\frac{1}{\varepsilon^2} \ln \frac{1}{\delta}\big)$ counters (each fitting in $\mathcal{O}(\log n + \log m)$ bits), which is very efficient!</p>
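<p>To close, here is a compact end-to-end implementation in Python, under the same assumptions as the snippets above and with the proof's conservative constants ($t = \frac{6}{\varepsilon^2}$, $C = 48$ for $p = \frac{2}{3}$); an illustrative sketch rather than an optimized one:</p>
<pre><code>import math
import random
from collections import Counter
from statistics import median

P = 2_147_483_647  # prime modulus for the 4-wise independent sign hash

def make_sign():
    # random degree-3 polynomial mod P, folded to a sign (as sketched above)
    coeffs = [random.randrange(P) for _ in range(4)]
    def sign(x):
        h = 0
        for a in coeffs:
            h = (h * x + a) % P
        return 1 if h % 2 == 0 else -1
    return sign

def ams_f2(stream, eps, delta):
    t = math.ceil(6 / eps ** 2)  # mean of t copies shrinks the variance
    # median of r groups, with C = 48 matching the proof's bound for p = 2/3
    r = 2 * math.ceil(48 * math.log(1 / delta) / 2) + 1
    signs = [[make_sign() for _ in range(t)] for _ in range(r)]
    sums = [[0] * t for _ in range(r)]
    for v in stream:  # single pass over the stream
        for group, row in zip(signs, sums):
            for j, s in enumerate(group):
                row[j] += s(v)
    means = [sum(S * S for S in row) / t for row in sums]
    return median(means)  # median trick

stream = [random.randrange(1, 100) for _ in range(1_000)]
print(ams_f2(stream, eps=0.5, delta=0.2))
print(sum(f * f for f in Counter(stream).values()))  # exact F2, for comparison
</code></pre>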