Median trick and sketching
In this post, I'd like to give some intuition about a useful technique from statistics with many applications in randomized and sketching algorithms: the median trick.
Boosting probabilities with the median trick
Consider a random variable $Y$ which gives a "good" estimation with probability $p > \frac{1}{2}$.
Side note
For instance, if your goal is to approximate some value $x$, having a good estimation could mean $$(1 - \varepsilon) x \le Y \le (1 + \varepsilon) x$$
The purpose of the median trick is to boost the probability of success up to $1 - \delta$, for some $\delta$ as small as you want.
In order to achieve this estimation, all we need is to maintain $r = C \ln \frac{1}{\delta}$ independent copies of $Y$ and to compute their median $M$.
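To make this concrete, here is a minimal Python sketch of the boosting step (my own illustration, not prescribed by the trick itself): `estimate` stands for any zero-argument callable returning a fresh independent estimate, and `C` is the constant analysed in the proof below.

```python
import math
import statistics

def boost_with_median(estimate, delta, C):
    """Median trick: run r = ceil(C * ln(1/delta)) independent copies of a
    randomized estimator and return their median.

    Assumes `estimate()` returns a fresh, independent estimate that is
    "good" with probability p > 1/2, and that C >= 2 / (alpha^2 * p)
    with alpha = 1 - 1/(2p), as derived in the proof below.
    """
    r = math.ceil(C * math.log(1 / delta))
    return statistics.median(estimate() for _ in range(r))
```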
Side note
Computing the median quickly is an interesting problem in itself. It can be achieved in linear time with an elegant algorithm (the median-of-medians selection algorithm) that I will not detail here.
Reminders on concentration inequalities
Before diving into the proof, let's take some time to review some important concentration inequalities.
First of all, one of the most famous concentration inequalities is Chebyshev's inequality. It tells us that for any random variable $X$ with finite variance, $$\mathbb{P}(|X - \mathbb{E}[X]| > \varepsilon) \le \frac{\mathbb{V}[X]}{\varepsilon^2}$$ In other words, $$\mathbb{E}[X] - \sqrt{k} \sigma \le X \le \mathbb{E}[X] + \sqrt{k} \sigma$$ with probability at least $1 - \frac{1}{k}$, where $\sigma = \sqrt{\mathbb{V}[X]}$ denotes the standard deviation.
Now, suppose that $X$ is the sum of $n$ independent Bernoulli variables $$X = \sum_{i=1}^n X_i$$ Under this assumption, the Chernoff bounds give us much better inequalities. For any $\alpha > 0$, $$\mathbb{P}\big(X \ge (1 + \alpha) \mathbb{E}[X]\big) \le e^{-\frac{\alpha^2 \mathbb{E}[X]}{2 + \alpha}}$$ $$\mathbb{P}\big(X \le (1 - \alpha) \mathbb{E}[X]\big) \le e^{-\frac{\alpha^2 \mathbb{E}[X]}{2}}$$
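As a quick sanity check (my own addition, with arbitrary parameters), here is a small simulation comparing the empirical upper tail of a sum of independent Bernoulli variables with the first Chernoff bound above:

```python
import math
import random

# Sum of n independent Bernoulli(q) variables, upper tail at (1 + alpha) * mean.
n, q, alpha, trials = 1000, 0.5, 0.1, 20_000
mean = n * q
threshold = (1 + alpha) * mean

hits = sum(
    sum(random.random() < q for _ in range(n)) >= threshold
    for _ in range(trials)
)
print("empirical tail:", hits / trials)
print("Chernoff bound:", math.exp(-alpha ** 2 * mean / (2 + alpha)))
```

The empirical tail should come out well below the bound, which illustrates that Chernoff gives a guaranteed upper bound rather than a tight estimate.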
Proof of the median trick
First, denote by $Y_1, \dots, Y_r$ the $r$ copies of $Y$ and define $r$ random variables $Z_1, \dots, Z_r$ such that $$Z_i = \begin{cases}1 & \text{if } Y_i \text{ gives a good estimation}\cr 0 & \text{otherwise}\end{cases}$$ and $$Z = \sum_{i=1}^r Z_i$$ Since $Z$ counts the number of good estimations, having $Z > \frac{r}{2}$ (strictly more than half of the copies being good) guarantees that the median is a good estimation as well: fewer than half of the copies can then fall on either side of the good interval. Moreover, $$\mathbb{E}[Z] = rp > \frac{r}{2}$$ Using Chernoff's lower-tail bound with $\alpha = 1 - \frac{1}{2p}$, so that $(1 - \alpha) rp = \frac{r}{2}$, we get $$ \begin{aligned} \mathbb{P}\left(Z \le \frac{r}{2}\right) & = \mathbb{P}\big(Z \le (1 - \alpha) rp\big)\cr & \le e^{-\frac{\alpha^2 rp}{2}} = e^{-\frac{\alpha^2 Cp \ln \frac{1}{\delta}}{2}}\cr & = \delta^{\frac{\alpha^2 Cp}{2}} \le \delta \end{aligned} $$ as long as $C \ge \frac{2}{\alpha^2 p}$.
Therefore $M$ gives a good estimation with probability at least $1 - \delta$.
Application to sketching: AMS sketch
Now that I have introduced the median trick, I would like to present an interesting application to sketching algorithms: the AMS sketch.
Side note
AMS stands for Alon, Matias and Szegedy, three famous computer scientists who received the Gödel Prize in 2005 for their work on sketching algorithms.
Given a stream of values $\sigma = \langle \sigma_1, \dots, \sigma_m \rangle$ where each value belongs to $\llbracket 1, n \rrbracket$, our goal is to compute the second frequency moment $$F_2 = \sum_{i=1}^n f_i^2$$ where $f_i$ denotes the frequency of $i$, and we want to achieve this using very little space. In particular, we cannot afford to store each $f_i$ in memory, because that would require $\mathcal{O}(n \log m)$ bits.
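For reference, a naive (non-streaming) baseline, which is my own illustration rather than anything from the AMS paper, would simply count every frequency before summing the squares, using exactly the kind of per-value memory we want to avoid:

```python
from collections import Counter

def exact_f2(stream):
    """Exact second frequency moment: one counter per distinct value."""
    freqs = Counter(stream)
    return sum(f * f for f in freqs.values())

# Example: <1, 2, 1, 3, 1> has f_1 = 3, f_2 = 1, f_3 = 1, so F_2 = 9 + 1 + 1 = 11.
assert exact_f2([1, 2, 1, 3, 1]) == 11
```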
First estimation
The main idea of the AMS sketch is to maintain, as the stream arrives, a single sum $$S = \sum_{j=1}^m s(\sigma_j)$$ where $s$ is a sign function (mapping each value to $+1$ or $-1$) drawn from a 4-wise independent family of hash functions.
$k$-wise independence
A family $H$ of hash functions from $\llbracket 1, n \rrbracket$ to $\llbracket 1, l \rrbracket$ is $k$-wise independent if for every distinct $x_1, \dots, x_k \in \llbracket 1, n \rrbracket$ and every $y_1, \dots, y_k \in \llbracket 1, l \rrbracket$, $$\mathbb{P}_{h \in H} (\forall i, h(x_i) = y_i) = \frac{1}{l^k}$$
These hash families are very useful for randomized algorithms.
A simple way to generate such a family is to use a random polynomial of degree $k - 1$ modulo some prime $p$: $$x \mapsto a_{k-1} x^{k-1} + \dots + a_0 \mod p$$ where $a_0, \dots, a_{k-1}$ are chosen uniformly at random.
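As an illustration, here is one possible Python implementation of such a sign function, using a random degree-3 polynomial modulo a large Mersenne prime and mapping the parity of the result to $\pm 1$; the choice of prime and the parity trick are mine, and the resulting signs are 4-wise independent only up to a bias of order $\frac{1}{p}$, which is negligible here.

```python
import random

P = (1 << 61) - 1  # a large Mersenne prime, so the parity bias is negligible

class FourWiseSignHash:
    """Sign function s: {1, ..., n} -> {-1, +1} drawn from a (nearly)
    4-wise independent family via a random degree-3 polynomial mod P."""

    def __init__(self, rng=random):
        # Four random coefficients a_3, a_2, a_1, a_0 define the polynomial.
        self.coeffs = [rng.randrange(P) for _ in range(4)]

    def __call__(self, x):
        # Evaluate a_3 x^3 + a_2 x^2 + a_1 x + a_0 mod P with Horner's rule,
        # then turn the parity of the result into a sign.
        value = 0
        for a in self.coeffs:
            value = (value * x + a) % P
        return 1 if value % 2 == 0 else -1
```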
At the end of the stream, we have $$S = \sum_{j=1}^m s(\sigma_j) = \sum_{i=1}^n s(i) f_i$$ and we approximate $F_2$ using $$X = S^2 = \sum_{i=1}^n f_i^2 + \sum_{i \neq j} s(i) s(j) f_i f_j$$ since $s(i)^2 = 1$. This estimator is unbiased: $$\mathbb{E}[X] = F_2 + \sum_{i \neq j} \underbrace{\mathbb{E}[s(i) s(j)]}_{\mathbb{E}[s(i)] \cdot \mathbb{E}[s(j)] = 0} f_i f_j = F_2$$ because $s(i)$ and $s(j)$ are independent for $i \neq j$ (pairwise independence would be enough here; 4-wise independence is needed for the variance bound below). As for the variance, $$\mathbb{V}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \mathbb{E}[S^4] - F_2^2$$
and, expanding $S^4$, every monomial in which some $s(i)$ appears an odd number of times has expectation $0$ by 4-wise independence, so $$\mathbb{E}[S^4] = \sum_{i=1}^n f_i^4 + \binom{4}{2} \sum_{i < j} f_i^2 f_j^2$$
Hence $$\mathbb{V}[X] = \sum_{i=1}^n f_i^4 + 6 \sum_{i < j} f_i^2 f_j^2 - F_2^2 = 4 \sum_{i < j} f_i^2 f_j^2 \le 2 F_2^2$$ since $F_2^2 = \sum_{i=1}^n f_i^4 + 2 \sum_{i < j} f_i^2 f_j^2$.
As you can see, $X$ estimates $F_2$ correctly on average (it is unbiased), but its variance is still quite large.
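Putting the pieces together so far, a single copy of this basic estimator could be computed in one pass as follows (a sketch reusing the hypothetical `FourWiseSignHash` above):

```python
def ams_basic_estimate(stream, sign=None):
    """One copy of the basic AMS estimator X = S^2 (unbiased, high variance)."""
    s = sign if sign is not None else FourWiseSignHash()
    S = 0
    for value in stream:  # one addition of +1 or -1 per stream element
        S += s(value)
    return S * S
```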
Improving the precision
In order to improve the precision of our estimation, let us try to reduce the variance. To do that, let us make $t$ independent copies of $X$ and compute their mean $$Y = \frac{1}{t} \sum_{i=1}^t X_i$$ While the average stays the same, the variance gets shrunk by a factor $t$: $$\mathbb{V}[Y] = \frac{1}{t^2} \mathbb{V}\left[\sum_{i=1}^t X_i\right] = \frac{\mathbb{V}[X]}{t} \le \frac{2 F_2^2}{t}$$
What's more, by choosing $t = \frac{6}{\varepsilon^2}$, Chebyshev's inequality leads to $$\mathbb{P}(|Y - F_2| > \varepsilon F_2) \le \frac{\mathbb{V}[Y]}{\varepsilon^2 F_2^2} \le \frac{2}{t \varepsilon^2} = \frac{1}{3}$$
Therefore, we know that $$(1 - \varepsilon) F_2 \le Y \le (1 + \varepsilon) F_2$$ with probability at least $\frac{2}{3}$.
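In code, this averaging step might look like the sketch below (again my own illustration, reusing the hypothetical `ams_basic_estimate` above); note that a real implementation would maintain all $t$ counters during a single pass rather than replaying the stream $t$ times.

```python
import math

def ams_mean_estimate(stream, eps):
    """Average t = ceil(6 / eps^2) independent basic AMS estimates of F_2."""
    data = list(stream)  # materialised only to keep this illustration short
    t = math.ceil(6 / eps ** 2)
    # Each call draws a fresh sign hash, so the t copies are independent.
    return sum(ams_basic_estimate(data) for _ in range(t)) / t
```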
Wrapping it all together
So we are left with an estimator that gives a good approximation of $F_2$ with probability $\frac{2}{3}$, and we would like to make it more reliable. This sounds familiar, right?
As you might have guessed, it is time to make use of the median trick!
Using the method described in the first section (here the success probability is $p = \frac{2}{3}$), we obtain an estimator $M$ that satisfies $$(1 - \varepsilon) F_2 \le M \le (1 + \varepsilon) F_2$$ with probability at least $1 - \delta$.
In the end, this algorithm requires $\mathcal{O}\big(\frac{1}{\varepsilon^2} \ln \frac{1}{\delta}\big)$ counters, that is $\mathcal{O}\big(\frac{1}{\varepsilon^2} \ln \frac{1}{\delta} (\log n + \log m)\big)$ bits, which is very efficient compared to the $\mathcal{O}(n \log m)$ bits needed to store every frequency!
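For completeness, here is a hedged end-to-end sketch of the whole pipeline: an $r \times t$ grid of basic counters is maintained in one pass, each row of $t$ counters is averaged, and the median of the $r$ row averages is returned. The helper names are mine, and the default `C = 48` simply comes from plugging $p = \frac{2}{3}$ (hence $\alpha = \frac{1}{4}$) into the condition $C \ge \frac{2}{\alpha^2 p}$ from the proof above.

```python
import math
import statistics

def ams_f2_estimate(stream, eps, delta, C=48):
    """(1 +/- eps)-approximation of F_2 with probability at least 1 - delta,
    combining averaging (variance reduction) and the median trick."""
    t = math.ceil(6 / eps ** 2)             # copies to average in each row
    r = math.ceil(C * math.log(1 / delta))  # rows whose averages are medianed
    signs = [[FourWiseSignHash() for _ in range(t)] for _ in range(r)]
    sums = [[0] * t for _ in range(r)]

    for value in stream:                    # single pass over the stream
        for i in range(r):
            for j in range(t):
                sums[i][j] += signs[i][j](value)

    averages = [sum(S * S for S in row) / t for row in sums]  # the Y's
    return statistics.median(averages)                        # the final M
```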