<h1 id="average-distance-of-two-random-k-mers">Average distance of two random k-mers</h1>
<p><em>2024-03-16</em></p>
<p>Intuitively, I've always thought that the average distance between two random <em>k</em>-mers is equal to $4^k / 2$, but it turns out that this intuition is wrong.</p>
<p>To prove it, let's compute the average of $f : (x, y) \mapsto |x - y|$ over $[0, u]^2$ for a fixed $u$:
$$\begin{aligned}
& \frac{1}{u^2} \int_0^u \int_0^u |x - y| \,\mathrm{d}y \,\mathrm{d}x \\
&= \frac{1}{u^2} \int_0^u \left(\int_0^x x - y \,\mathrm{d}y\right) + \left(\int_x^u y - x \,\mathrm{d}y\right) \,\mathrm{d}x \\
&= \frac{1}{u^2} \int_0^u \frac{x^2}{2} + \frac{(u - x)^2}{2} \,\mathrm{d}x \\
&= \frac{1}{u^2} \int_0^u x^2 - ux + \frac{u^2}{2} \,\mathrm{d}x \\
&= \frac{1}{u^2} \int_0^u x^2 \,\mathrm{d}x \qquad \text{since } \int_0^u \frac{u^2}{2} - ux \,\mathrm{d}x = 0 \\
&= \frac{u}{3}
\end{aligned}$$</p>
<p>In particular, for $u = 4^k$ this gives an average distance of $4^k / 3$. (The integral is a continuous approximation, but the exact average for two independent uniform $k$-mers, $\frac{4^{2k} - 1}{3 \cdot 4^k}$, is indeed $\approx 4^k / 3$.)</p>
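<p>As a sanity check, here is a minimal Python simulation, assuming each $k$-mer is encoded as a uniform integer in $[0, 4^k)$ (e.g. its lexicographic rank):</p>
<pre><code>import random

k = 6
n = 4 ** k  # number of distinct k-mers over {A, C, G, T}
samples = 1_000_000

# average |x - y| over random pairs of k-mer codes
total = sum(abs(random.randrange(n) - random.randrange(n)) for _ in range(samples))
print(total / samples)  # close to n / 3 = 1365.33, not n / 2 = 2048
</code></pre>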
<p>Thanks to <a rel="noopener" target="_blank" href="https://lrobidou.github.io/">Lucas</a> for pointing this out!</p>
<h1 id="median-trick-and-sketching">Median trick and sketching</h1>
<p><em>2022-08-31</em></p>
<p>In this post, I'd like to give some intuition about a useful technique from statistics which has many applications for randomized and sketching algorithms: the <strong>median trick</strong>.</p>
<h1 id="boosting-probabilities-with-the-median-trick">Boosting probabilities with the median trick</h1>
<p>Consider a random variable $Y$ which gives a "good" estimate with probability $p > \frac{1}{2}$.</p>
<blockquote>
<h6>Side note</h6>
<p>For instance, if your goal is to approximate some value $x$, a good estimate could mean
$$(1 - \varepsilon) x \le Y \le (1 + \varepsilon) x$$</p>
</blockquote>
<p>The purpose of the median trick is to boost the probability of success up to $1 - \delta$, for some $\delta$ as small as you want.</p>
<p>In order to achieve this, all we need to do is maintain $r = C \ln \frac{1}{\delta}$ <em>independent</em> copies of $Y$ and compute their <em>median</em> $M$.</p>
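<p>As a minimal sketch of the procedure (the function name and interface are my own), the whole trick fits in a few lines of Python:</p>
<pre><code>import math
from statistics import median

def median_trick(estimator, delta, C):
    """Boost an estimator that is good with probability p > 1/2.
    As shown in the proof below, C >= 2 / (alpha^2 * p) with
    alpha = 1 - 1/(2p) suffices for a failure probability below delta."""
    # r = C ln(1/delta) copies, rounded up to an odd number
    r = 2 * math.ceil(C * math.log(1 / delta) / 2) + 1
    return median(estimator() for _ in range(r))
</code></pre>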
<blockquote>
<h6>Side note</h6>
<p>Computing the median quickly is an interesting problem in itself.
It can be done in worst-case linear time with an elegant algorithm (the <em>median of medians</em> algorithm) that I will not detail here.</p>
</blockquote><h2 id="reminders-on-concentration-inequalities">Reminders on concentration inequalities</h2>
<p>Before diving into the proof, let's take some time to review some important concentration inequalities.</p>
<p>First of all, one of the most famous concentration inequalities is <strong>Chebyshev's inequality</strong>.
It tells us that for any random variable $X$ with finite variance,
$$\mathbb{P}(|X - \mathbb{E}[X]| > \varepsilon) \le \frac{\mathbb{V}[X]}{\varepsilon^2}$$
In other words, taking $\varepsilon = \sqrt{k} \sigma$ with $\sigma = \sqrt{\mathbb{V}[X]}$,
$$\mathbb{E}[X] - \sqrt{k} \sigma \le X \le \mathbb{E}[X] + \sqrt{k} \sigma$$
with probability at least $1 - \frac{1}{k}$.</p>
<p>Now, suppose that $X$ is the sum of $n$ independent Bernoulli variables
$$X = \sum_{i=1}^n X_i$$
Under this assumption, the <strong>Chernoff bounds</strong> give us much better inequalities.
For any $\alpha > 0$,
$$\mathbb{P}\big(X > (1 + \alpha) \mathbb{E}[X]\big) \le e^{-\frac{\alpha^2 \mathbb{E}[X]}{2 + \alpha}}$$
$$\mathbb{P}\big(X < (1 - \alpha) \mathbb{E}[X]\big) \le e^{-\frac{\alpha^2 \mathbb{E}[X]}{2}}$$</p>
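<p>To get a feel for how much sharper the Chernoff bounds are, here is a small numeric check on a Binomial$(100, \frac{1}{2})$ variable (the numbers are an arbitrary example of mine):</p>
<pre><code>import math

n, p = 100, 0.5
mean = n * p
alpha = 0.8  # deviation of 80% above the mean

# exact upper tail: P(X > (1 + alpha) * mean) = P(X > 90)
exact = sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(91, n + 1))
chernoff = math.exp(-alpha ** 2 * mean / (2 + alpha))
chebyshev = n * p * (1 - p) / (alpha * mean) ** 2  # bound on P(|X - mean| > 40)

print(exact, chernoff, chebyshev)  # Chernoff is orders of magnitude tighter
</code></pre>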
<h2 id="proof-of-the-median-trick">Proof of the median trick</h2>
<p>First, denote by $Y_1, \dots, Y_r$ the $r$ copies of $Y$ and define $r$ random variables $Z_1, \dots, Z_r$ such that
$$Z_i = \begin{cases}1 & \text{if } Y_i \text{ gives a good estimate}\cr 0 & \text{otherwise}\end{cases}$$
and
$$Z = \sum_{i=1}^r Z_i$$
Since $Z$ counts the number of good estimates,
having $Z \ge \frac{r}{2}$ (taking $r$ odd, so that the median is one of the $Y_i$) guarantees that the median is a good estimate as well.
Moreover, since each copy succeeds with probability $p$,
$$\mathbb{E}[Z] = rp > \frac{r}{2}$$
Using Chernoff's lower bound with $\alpha = 1 - \frac{1}{2p}$ (so that $(1 - \alpha) rp = \frac{r}{2}$), we get
$$
\begin{aligned}
\mathbb{P}\left(Z < \frac{r}{2}\right) & = \mathbb{P}\big(Z < (1 - \alpha) rp\big)\cr
& \le e^{-\frac{\alpha^2 rp}{2}} = e^{-\frac{\alpha^2 Cp \ln \frac{1}{\delta}}{2}}\cr
& = \delta^{\frac{\alpha^2 Cp}{2}} \le \delta
\end{aligned}
$$
as long as $C \ge \frac{2}{\alpha^2 p}$, so that the exponent $\frac{\alpha^2 C p}{2}$ is at least $1$.</p>
<p>Therefore $M$ gives a good estimate with probability at least $1 - \delta$.</p>
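<p>For a concrete instance: with $p = \frac{3}{4}$ we get $\alpha = \frac{1}{3}$ and $C = \frac{2}{\alpha^2 p} = 24$, so reaching $\delta = 10^{-6}$ takes $r = 24 \ln 10^6 \approx 332$ copies.</p>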
<h1 id="application-to-sketching-ams-sketch">Application to sketching: AMS sketch</h1>
<p>Now that I have introduced the median trick, I would like to present an interesting application to sketching algorithms: the <strong>AMS sketch</strong>.</p>
<blockquote>
<h6>Side note</h6>
<p><em>AMS</em> stands for <em>Alon</em>, <em>Matias</em> and <em>Szegedy</em>, three famous computer scientists who received a <a rel="noopener" target="_blank" href="https://sigact.org/prizes/g%C3%B6del/2005.html">Gödel Prize</a> in 2005 for their work on sketching algorithms.</p>
</blockquote>
<p>Given a stream of values $\sigma = \langle \sigma_1, \dots, \sigma_m \rangle$ where each value belongs to $\llbracket 1, n \rrbracket$, our goal is to compute the second frequency moment
$$F_2 = \sum_{i=1}^n f_i^2$$
where $f_i$ denotes the frequency of $i$, i.e. its number of occurrences in the stream,
and we want to achieve this using very little space.
In particular, we cannot afford to store each $f_i$ in memory because it would require $\mathcal{O}(n \log m)$ space.</p>
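<p>As a baseline, the exact computation is straightforward when we can afford one counter per distinct value (a hypothetical reference implementation, not part of the sketch):</p>
<pre><code>from collections import Counter

def exact_f2(stream):
    # O(n log m) space: one counter per distinct value
    freq = Counter(stream)
    return sum(f * f for f in freq.values())

print(exact_f2([1, 2, 1, 3, 1]))  # f = (3, 1, 1), so F2 = 9 + 1 + 1 = 11
</code></pre>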
<h2 id="first-estimation">First estimation</h2>
<p>The main idea of the AMS sketch is to maintain a sum
$$S = \sum_{j=1}^m s(\sigma_j)$$
where $s$ is a <em>4-wise independent</em> hash function assigning a sign ($+1$ or $-1$) to each value.</p>
<article>
<h6>$k$-wise independence</h6>
<p>A family of hash functions $H$ is $k$-wise independent if for every distinct $x_1, \dots, x_k \in\llbracket 1, n \rrbracket$ and every $y_1, \dots, y_k \in \llbracket 1, l \rrbracket$,
$$\mathbb{P}_{h \in H} (\forall i, h(x_i) = y_i) = \frac{1}{l^k}$$</p>
<p>These hash families are very useful for randomized algorithms.</p>
<p>A simple way to generate such a family is to use a random polynomial of degree $k - 1$ modulo some prime $p$:
$$x \mapsto a_{k-1} x^{k-1} + \dots + a_0 \mod p$$
where $a_0, \dots, a_{k-1}$ are chosen uniformly at random: by polynomial interpolation, the values at any $k$ distinct points are uniform and independent.</p>
</article>
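<p>Here is a minimal sketch of such a sign function, using a random degree-3 polynomial modulo a prime (which is 4-wise independent); the prime and the interface are arbitrary choices of mine:</p>
<pre><code>import random

P = 2_147_483_647  # a Mersenne prime, assumed larger than n

class FourWiseSign:
    def __init__(self):
        # four random coefficients define a degree-3 polynomial mod P
        self.coeffs = [random.randrange(P) for _ in range(4)]

    def __call__(self, x):
        h = 0
        for a in self.coeffs:
            h = (h * x + a) % P  # Horner evaluation of the polynomial
        # fold the hash to a sign (with a negligible O(1/P) bias, P being odd)
        return 1 if h % 2 == 0 else -1
</code></pre>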
<p>At the end of the stream, we have
$$S = \sum_{j=1}^m s(\sigma_j) = \sum_{i=1}^n s(i) f_i$$
and we approximate $F_2$ using
$$X = S^2 = \sum_{i=1}^n f_i^2 + \sum_{i \neq j} s(i) s(j) f_i f_j$$
$$\mathbb{E}[X] = F_2 + \sum_{i \neq j} \underbrace{\mathbb{E}[s(i) s(j)]}_{\mathbb{E}[s(i)] \cdot \mathbb{E}[s(j)] = 0} f_i f_j = F_2$$
because $s(i)$ and $s(j)$ are independent (pairwise independence already suffices here; 4-wise independence is what makes the variance computation below work).
$$\mathbb{V}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \mathbb{E}[S^4] - F_2^2$$</p>
<p>and
$$
\begin{aligned}
\mathbb{E}[S^4] & = \sum_{i=1}^n f_i^4 + \binom{4}{2} \sum_{i < j} f_i^2 f_j^2\cr
& + \sum_{\text{other terms}} \underbrace{\mathbb{E}[s(i) s(j) s(k) s(l)]}_{0} f_i f_j f_k f_l
\end{aligned}
$$
where each remaining term contains some sign raised to an odd power, hence has expectation $0$ by 4-wise independence.</p>
<p>so
$$\mathbb{V}[X] = 4 \sum_{i < j} f_i^2 f_j^2 \le 2 F_2^2$$</p>
<p>As you can see, $X$ is an unbiased estimator of $F_2$, but its variance is still quite large.</p>
<h2 id="improving-the-precision">Improving the precision</h2>
<p>In order to improve the precision of our estimation, let us reduce the variance
by taking $t$ independent copies $X_1, \dots, X_t$ of $X$ and computing their mean
$$Y = \frac{1}{t} \sum_{i=1}^t X_i$$
While the average stays the same, the variance gets shrunk by a factor $t$:
$$\mathbb{V}[Y] = \frac{1}{t^2} \mathbb{V}\left[\sum_{i=1}^t X_i\right] = \frac{\mathbb{V}[X]}{t} \le \frac{2 F_2^2}{t}$$</p>
<p>What's more, by choosing $t = \frac{6}{\varepsilon^2}$, Chebyshev's inequality leads to
$$\mathbb{P}(|Y - F_2| > \varepsilon F_2) \le \frac{\mathbb{V}[Y]}{\varepsilon^2 F_2^2} \le \frac{2}{t \varepsilon^2} = \frac{1}{3}$$</p>
<p>Therefore, we know that
$$(1 - \varepsilon) F_2 \le Y \le (1 + \varepsilon) F_2$$
with probability at least $\frac{2}{3}$.</p>
<h2 id="wrapping-it-all-together">Wrapping it all together</h2>
<p>So we are left with an estimator that gives a good approximation of $F_2$ with probability at least $\frac{2}{3}$, and we would like to make it more reliable.
This sounds familiar, right?</p>
<p>As you might have guessed, it is time to make use of the median trick!</p>
<p>Using the method described in the first section, we obtain an estimator $M$ that satisfies
$$(1 - \varepsilon) F_2 \le M \le (1 + \varepsilon) F_2$$
with probability at least $1 - \delta$.</p>
<p>In the end, this algorithm requires $\mathcal{O}\big(\frac{1}{\varepsilon^2} \ln \frac{1}{\delta}\big)$ counters (each fitting in $\mathcal{O}(\log n + \log m)$ bits), which is very efficient!</p>
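<p>To close, here is a compact end-to-end implementation in Python, under the same assumptions as the snippets above and with the proof's conservative constants ($t = \frac{6}{\varepsilon^2}$, $C = 48$ for $p = \frac{2}{3}$); an illustrative sketch rather than an optimized one:</p>
<pre><code>import math
import random
from collections import Counter
from statistics import median

P = 2_147_483_647  # prime modulus for the 4-wise independent sign hash

def make_sign():
    # random degree-3 polynomial mod P, folded to a sign (as sketched above)
    coeffs = [random.randrange(P) for _ in range(4)]
    def sign(x):
        h = 0
        for a in coeffs:
            h = (h * x + a) % P
        return 1 if h % 2 == 0 else -1
    return sign

def ams_f2(stream, eps, delta):
    t = math.ceil(6 / eps ** 2)  # mean of t copies shrinks the variance
    # median of r groups, with C = 48 matching the proof's bound for p = 2/3
    r = 2 * math.ceil(48 * math.log(1 / delta) / 2) + 1
    signs = [[make_sign() for _ in range(t)] for _ in range(r)]
    sums = [[0] * t for _ in range(r)]
    for v in stream:  # single pass over the stream
        for group, row in zip(signs, sums):
            for j, s in enumerate(group):
                row[j] += s(v)
    means = [sum(S * S for S in row) / t for row in sums]
    return median(means)  # median trick

stream = [random.randrange(1, 100) for _ in range(1_000)]
print(ams_f2(stream, eps=0.5, delta=0.2))
print(sum(f * f for f in Counter(stream).values()))  # exact F2, for comparison
</code></pre>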