Category Archives: expository

computer science expository math

Ask the Experts


Each day, you must make a binary decision: whether to buy or sell a stock, whether to bring an umbrella to work, whether to take surface streets or the freeway on your commute, etc. Each time you make the wrong decision, you are penalized. Fortunately, you have help in the form of \(n\) “expert” advisors, each of whom suggests a choice for you. However, some experts are more reliable than others, and you do not initially know which is the most reliable. How can you find the most reliable expert while incurring the fewest penalties? How few penalties can you incur compared to the best expert?

To make matters simple, first assume that the best expert never gives the wrong advice. How quickly can we find that expert?

Here is a simple procedure that allows us to find the best expert. Choose an expert \(a\) arbitrarily on the first day, and follow their advice. Continue following \(a\)’s advice each day until they make a mistake. After your chosen expert errs, choose a different expert and follow their advice until they err. Continue this process indefinitely. Since we’ve assumed that the best expert never errs, you will eventually follow their advice.

Observe that in the above procedure, each day we either get the correct advice, or we eliminate one of the errant experts. Thus, the total number of errors we make before finding the best expert is at most \(n - 1\). Is it possible to find the best expert with fewer penalties?

Consider the following “majority procedure.” Each day, look at all \(n\) experts’ advice and take the majority opinion (choosing arbitrarily if there is a tie). At the end of the day, fire all incorrect experts, and continue the next day with the remaining experts. To analyze this procedure, observe that each day, either the majority answer is correct (and you are not penalized) or you fire at least half of the remaining experts. Thus, the number of penalties you incur is at most \(\log n\) before you find the best expert. That is pretty good!
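To make the procedure concrete, here is a minimal Python sketch. The framing is my own, not from any particular library: each expert is modeled as a function from the day's observation to a binary prediction, and the days are a stream of (observation, correct answer) pairs.

    def majority_procedure(experts, days):
        """Follow the majority vote of the surviving experts each day;
        fire every expert who was wrong at the end of the day."""
        alive = list(experts)
        penalties = 0
        for x, truth in days:
            votes = [e(x) for e in alive]
            guess = max(set(votes), key=votes.count)  # majority, ties broken arbitrarily
            if guess != truth:
                penalties += 1
            survivors = [e for e in alive if e(x) == truth]
            if survivors:  # nonempty under the assumption that the best expert never errs
                alive = survivors
        return penalties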

Question 1 Is it possible to find the best expert while incurring fewer than \(\log n\) penalties?

To this point, things have been relatively easy because we assumed that the best expert never errs. Thus, as soon as an expert makes an incorrect prediction, we know they are not the best expert. In practice, however, it is unlikely that the best expert is infallible. What if we are only guaranteed that the best expert errs at most \(m\) times?

It is not hard to see that the majority procedure can be generalized to account for an expert that errs at most \(m\) times. The idea is to follow the majority opinion, but to allow an expert to err \(m\) times before firing them. We call the resulting procedure the “\(m + 1\) strikes majority procedure.” Each day, follow the majority opinion (again breaking ties arbitrarily). Every time an expert is wrong, they receive a strike. If an expert ever exceeds \(m\) strikes, they are immediately fired.
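Here is the same sketch adapted to the \(m + 1\) strikes rule, under the same hypothetical experts-as-functions interface as above.

    def strikes_majority(experts, days, m):
        """Follow the majority of experts still in the pool; an expert
        leaves the pool only upon their (m + 1)-st mistake."""
        strikes = [0] * len(experts)
        penalties = 0
        for x, truth in days:
            alive = [i for i in range(len(experts)) if strikes[i] <= m]
            votes = [experts[i](x) for i in alive]
            guess = max(set(votes), key=votes.count)  # ties broken arbitrarily
            if guess != truth:
                penalties += 1
            for i in alive:
                if experts[i](x) != truth:
                    strikes[i] += 1  # fired once this exceeds m
        return penalties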

To analyze the \(m + 1\) strikes procedure, observe that whenever the majority opinion is wrong, at least half of the remaining experts receive strikes. Suppose \(r\) experts currently remain. So long as more than \(r / 2\) of them survive, each majority error hands out at least \(r / 4\) strikes, while those \(r\) experts can absorb at most \(r (m + 1)\) strikes in total before being fired. Thus, after at most \(4 (m + 1)\) majority errors, the pool of remaining experts shrinks by at least half, so we find the best expert with penalty at most \(4 (m + 1) \log n\). (I suspect that a more clever analysis would give something like \((m + 1) \log n\), thus matching the majority procedure for \(m = 0\).) Can we do better than the \(m + 1\) strikes majority procedure?

One objection to the \(m + 1\) strikes procedure is that an expert must err \(m + 1\) times before any action is taken against them. Another reasonable approach would be to maintain a “confidence” parameter for each expert. Each time the expert errs, their confidence is decreased. When taking a majority decision, each expert’s prediction is then weighted by our confidence in that expert. If we adopt this strategy, we must specify how the confidence is decreased with each error.

Littlestone and Warmuth proposed the following “weighted majority algorithm” to solve this problem. Initially, each expert \(i\) is assigned a weight \(w_i = 1\), where a larger weight indicates a greater confidence in expert \(i\). Each time expert \(i\) errs, \(w_i\) is multiplied by \(1/2\). Each day, we take as our prediction the weighted majority: each expert \(i\)’s prediction carries \(w_i\) votes, and we choose the option receiving the greatest sum of votes.
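Here is a minimal sketch of the weighted majority algorithm for binary (True/False) advice, again under the hypothetical experts-as-functions interface from the earlier sketches.

    def weighted_majority(experts, days):
        """Littlestone-Warmuth: every expert starts with weight 1,
        and an errant expert's weight is halved."""
        w = [1.0] * len(experts)
        penalties = 0
        for x, truth in days:
            advice = [e(x) for e in experts]
            yes = sum(wi for wi, a in zip(w, advice) if a)
            guess = (yes >= sum(w) / 2)  # weighted vote, ties broken toward True
            if guess != truth:
                penalties += 1
            w = [wi / 2 if a != truth else wi for wi, a in zip(w, advice)]
        return penalties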

By way of analysis, consider what happens each time the majority answer is wrong. Let \(W_t\) denote the total weight of all experts after \(t\) penalties have been incurred. Observe that initially we have \(W_0 = n\). When we incur the \((t + 1)\)-st penalty, at least half of the \(W_t\) total votes were cast in favor of the wrong answer. Since the total errant weight is cut in half, we find that \(W_{t + 1} \leq (3/4) W_t\). Thus, \(W_t \leq (3/4)^t n\). On the other hand, we know that some expert \(i\) errs at most \(m\) times. Therefore, \(2^{-m} \leq w_i \leq W_t\) for all \(t\). Combining this with the previous expression gives
\[
2^{-m} \leq W_t \leq (3/4)^t n.
\]
Taking logs of both sides of this expression, we find that
\[
t \leq \frac{(m + \log n)}{\log(4/3)}.
\]
It seems incredible to me that this simple and elegant algorithm gives such a powerful result. Even if we initially have no idea which expert is the best, we can still perform nearly as well as the best expert as long as \(m \gg \log n\). (Note that \(1 / \log(4/3) \approx 2.4\).) Further, we do not even need to know \(m\) in advance to apply this procedure!

Question 2 Is it possible to improve upon the bound of the weighted majority algorithm?

computer science expository math

Testing Equality in Networks

Yesterday, I went to an interesting talk by Klim Efremenko about testing equality in networks. The talk was based on his joint paper with Noga Alon and Benny Sudakov. The basic problem is as follows. Suppose there is a network with \(k\) nodes, and each node \(v\) has an input in the form of an \(n\)-bit string \(M_v \in \{0, 1\}^n\). All of the nodes in the network want to verify that all of their strings are equal, i.e., that \(M_v = M_u\) for all nodes \(v\) and \(u\), and they may only communicate with their neighbors in the network. How many bits must be communicated in total?

For concreteness, suppose the network is a triangle. That is, there are three nodes–say played by Alice, Bob, and Carol–and each pair of nodes can communicate. Let’s first consider some upper bounds on the number of bits sent to check that all inputs are the same. The simplest thing to do would be for each player to send their input to the other two players. After this step, everyone knows everyone else’s input, so each player can check that all of the inputs are the same. The total communication required for this protocol is \(6 n\), as each player sends their entire input to the two other players.

It is not hard to see that this procedure is wasteful. It is enough that one player–say Alice–knows whether all of the inputs are the same, and she can tell the other players as much. Thus, we can make a more efficient procedure. Bob and Carol both send their inputs to Alice, and Alice checks if both inputs are the same as her input. She sends a single bit back to Bob and Carol indicating whether all inputs are equal. The total communication required for this procedure is \(2 n + 2\). So we’ve improved the communication cost by a factor of about \(3\)–not too shabby. But can we do better?

Let’s talk a little bit about lower bounds. To this end, consider the case where there are only two players, Alice and Bob. As in the previous paragraph, they can check that their inputs are equal using \(n + 1\) bits of communication. To see that \(n\) bits are in fact required, we can use the “fooling set” technique from communication complexity. Suppose to the contrary that Alice and Bob can check equality with \(b < n\) bits of communication. For any \(x \in \{0, 1\}^n\), consider the case where both Alice and Bob have input \(x\). Let \(m_x\) be the “transcript” of Alice and Bob’s conversation when they both have \(x\) as their input. By assumption, the transcript contains at most \(b < n\) bits. Therefore, there are at most \(2^b < 2^n\) distinct transcripts. Thus, by the pigeonhole principle, there are two values \(x \neq y\) that give the same transcript: \(m_x = m_y\). Now suppose Alice is given input \(x\) and Bob has input \(y\). In this case, when Alice and Bob communicate, they will generate the same transcript \(m_x (= m_y)\). Since the communication transcript determines whether Alice and Bob think they have the same input, they will both be convinced their inputs are the same, even though \(x \neq y\), contradicting the correctness of the protocol! Therefore, we must have \(b \geq n\): Alice and Bob need to communicate at least \(n\) bits to check equality.

To obtain lower bounds for three players, we can use the same ideas as the two player lower bound of \(n\) bits. Assume for a moment that Bob and Carol know they have the same input. How much must Alice communicate with Bob/Carol to verify that her input is the same as theirs? By the argument in the previous paragraph, Alice must exchange a total of \(n\) bits with Bob/Carol. Similarly, Bob must exchange \(n\) bits with Alice/Carol, and Carol must exchange \(n\) bits with Alice/Bob. So the total number of bits exchanged must be at least \(3 n / 2\). The factor of \(1/2\) occurs because we count each bit exchanged between, say Alice and Bob, twice: once when we consider the communication between Alice and Bob/Carol and once when we consider the communication between Bob and Alice/Carol.

As it stands we have a gap in the communication necessary to solve the problem: we have an upper bound of \(2 n + 2\) bits, and a lower bound of \(3 n / 2\) bits. Which of these bounds is correct? Or is the true answer somewhere in between? Up to this point, the techniques we’ve used to understand the problem are fairly routine (at least to those who have studied some communication complexity). In what follows–the contribution of Klim, Noga, and Benny–we will see that it is possible to match the lower bound of \(3 n / 2\) bits (up to lower-order terms) using a very clever encoding of the inputs.

The main tool that the authors employ is the existence of a certain family of graphs, which I will refer to as Ruzsa-Szemerédi graphs. Here is the key lemma:

Lemma (Ruzsa-Szemerédi, 1978). For every \(m\), there exists a tripartite graph \(H\) on \(3 m\) vertices which contains \(m^2 / 2^{O(\sqrt{\log m})}\) triangles such that no two triangles share a common edge.

The lemma says that there is a graph with a lot of triangles such that each triangle is determined by any one of its edges. To see how this lemma is helpful (or even relevant) to the problem of testing equality, consider the following modification of the original problem. Instead of being given a string in \(\{0, 1\}^n\), each player is given a triangle in the graph \(H\) from the lemma. Each triangle consists of \(3\) vertices, but the condition of the lemma ensures that each triangle is determined by any two of the three vertices–i.e., a single edge of the triangle. Thus, the three players can verify that they all have the same triangle by sharing information in the following way: Alice sends Bob the first vertex in her triangle; Bob sends Carol the second vertex in his triangle, and Carol sends Alice the third vertex in her triangle.
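Here is a small Python sketch of this exchange. The encoding is my own toy choice: each triangle is a triple with one vertex from each part of \(H\); soundness, of course, relies on all three triangles actually coming from the family guaranteed by the lemma.

    def triangle_equality_protocol(alice, bob, carol):
        """One round of the three-player protocol; each argument is a
        triple (first, second, third) of vertices of H.  Each player
        forwards one vertex and the recipient compares it to their own."""
        bob_accepts = (alice[0] == bob[0])      # Alice -> Bob: her first vertex
        carol_accepts = (bob[1] == carol[1])    # Bob -> Carol: his second vertex
        alice_accepts = (carol[2] == alice[2])  # Carol -> Alice: her third vertex
        return bob_accepts and carol_accepts and alice_accepts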

Miraculously, this procedure reveals enough information for Alice, Bob, and Carol to determine if they were all given the same triangle! Suppose, for example, that Alice and Bob have the same triangle \(T\), but Carol has a triangle \(T' \neq T\). By the condition of the lemma, \(T'\) and \(T\) can share at most one vertex, so they must differ in at least two vertices. In particular, it must be the case that the second or third (or both) vertices of \(T\) and \(T'\) differ. If the second vertices of \(T\) and \(T'\) differ, then Carol will see that her triangle is not the same as Bob’s triangle (since Bob sends Carol his second vertex). If the third vertices of \(T\) and \(T'\) differ, then Alice will see that her triangle differs from Carol’s, as Carol sends Alice her triangle’s third vertex.

Now consider the case where Alice, Bob, and Carol are all given different triangles \(T_a = \{u_a, v_a, w_a\}\), \(T_b = \{u_b, v_b, w_b\}\), and \(T_c = \{u_c, v_c, w_c\}\). Suppose that all three players accept the outcome of the communication protocol. This means that \(u_a = u_b\) (since Alice sends \(u_a\) to Bob), and similarly \(v_b = v_c\) and \(w_c = w_a\). In particular this implies that \(H\) contains the edges
\[
\{u_a, v_b\} \in T_b, \{v_b, w_a\} \in T_c, \{w_a, u_a\} \in T_a.
\]
Together these three edges form a triangle \(T'\) which is present in the graph \(H\). However, observe that \(T'\) shares an edge \(\{w_a, u_a\}\) with \(T_a\), contradicting the property of \(H\) guaranteed by the lemma. Therefore, Alice, Bob, and Carol cannot all accept if they are each given different triangles! Thus they can determine if they were all given the same triangle by each sending a single vertex of their triangle, which requires roughly \(\log m\) bits per player.

To solve the original problem–where each player is given an \(n\)-bit string–we encode each string as a triangle in a suitable graph \(H\). In order to make this work, we need to choose \(m\) (where \(H\) has \(3 m\) vertices) large enough that there is one triangle for each possible string. Since there are \(2^n\) possible strings, using the lemma, we need to take \(m\) sufficiently large that
\[
m^2 / 2^{O(\sqrt{\log m})} \geq 2^n.
\]
Taking the logarithm of both sides, we find that
\[
\log m - O(\sqrt{\log m}) \geq \frac 1 2 n
\]
is large enough. Thus, to send the identity of a single vertex of the triangle which encodes a player’s input, a player must send \(\log m\) bits, which is roughly \(\frac 1 2 n\). Therefore, the total communication in this protocol is roughly \(\frac 3 2 n\), which matches the lower bound. We summarize this result in the following theorem.

Theorem. Suppose \(3\) players each hold \(n\)-bit strings. Then
\[
\frac 3 2 n + O(\sqrt{n})
\]
bits of communication are necessary and sufficient to test if all three strings are equal.

The paper goes on to generalize this result, but it seems that all of the main ideas are already present in the \(3\) player case.

expository math teaching

Probability Primer


This post is a very brief introduction to some basic concepts in probability theory. We encounter uncertainty often in our everyday lives, for example, in the weather, games of chance (think of rolling dice or shuffling a deck of cards), financial markets, etc. Probability theory provides a language to quantify uncertainty, thereby allowing us to reason about future events whose outcomes are not yet known. In this note, we only consider events where the number of potential outcomes is finite.

The basic object of study in probability is a probability space. A (finite) probability space consists of a (finite) sample space \(\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}\) together with a function \(P : \Omega \to [0,1]\) which assigns probabilities to the outcomes \(\omega_i \in \Omega\). The probability function \(P\) satisfies \(P(\omega_i) \geq 0\) for all \(i\) and
\[
\sum_{i = 1}^n P(\omega_i) = 1.
\]
The interpretation is that \(P\) assigns probabilities or likelihoods to the outcomes in \(\Omega\).

Example. We model the randomness of tossing a coin. In this case, there are two possible outcomes of a coin toss, heads or tails, so we take our sample space to be \(\Omega = \{H, T\}\). Since heads and tails are equally likely, we have \(P(H) = P(T) = 1/2\).

Example. Rolling a standard (six-sided) die has six outcomes which are equally likely. Thus we take \(\Omega = \{1, 2, 3, 4, 5, 6\}\) and \(P(i) = 1/6\) for \(i = 1, 2, \ldots, 6\).

Definition. Let \(\Omega\) be a sample space and \(P\) a probability function on \(\Omega\). A random variable is a function \(X : \Omega \to \mathbf{R}\).

Example. Going back to the coin toss example above, we can define a random variable \(C\) on the coin toss probability space defined by \(C(H) = 1\) and \(C(T) = -1\). This random variable may arise as a simple game: two players toss a coin and bet on the outcome. If the outcome is heads, player 1 wins one dollar, while if the outcome is tails, player 1 loses one dollar.

Example. For the die rolling example, we can define the random variable \(R\) by \(R(i) = i\). The value of \(R\) is simply the value showing on the die after it is rolled.

Definition. A fundamental value assigned to a random variable is its expected value (also called its expectation or average), which is defined by the formula
\[
E(X) = \sum_{i = 1}^n X(\omega_i) P(\omega_i).
\]
The expected value quantifies the average behavior we expect to see if we repeat an experiment (e.g., a coin flip or die roll) many times over.

Example. Going back to the coin flip example, we can compute
\[
E(C) = C(H) P(H) + C(T) P(T) = 1 \cdot \frac{1}{2} + (-1) \cdot \frac{1}{2} = 0.
\]
This tells us that if we repeatedly play the coin flip betting game described above, neither player has an advantage. The players expect to win about as much as they lose.

For the die rolling example, we compute
\[
E(R) = R(1) P(1) + \cdots + R(6) P(6) = 1 \cdot \frac{1}{6} + \cdots + 6 \cdot \frac{1}{6} = 3.5.
\]
Thus an “average” die roll is \(3.5\) (even though \(3.5\) cannot be the outcome of a single die roll).
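As a quick sanity check, here is the same computation in a few lines of Python; the dictionary representation of the probability space is just one convenient choice.

    from fractions import Fraction

    # The die-roll probability space: outcomes 1..6, each with probability 1/6.
    P = {i: Fraction(1, 6) for i in range(1, 7)}

    def expectation(X):
        """E(X) = sum of X(omega) * P(omega) over all outcomes omega."""
        return sum(X(w) * p for w, p in P.items())

    R = lambda i: i           # the value showing on the die
    print(expectation(R))     # 7/2, i.e. 3.5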

Often, when speaking about random variables we omit reference to the underlying probability space. In this case, we speak only of the probability that a random variable \(X\) takes on various values. For the coin flipping example above, we could have just defined \(C\) by
\[
P(C = 1) = P(C = -1) = 1/2
\]
without saying anything about the underlying sample space \(\Omega\). The danger in this view is that if we don’t explicitly define \(C\) as a function on some probability space, comparing random variables can become difficult. To see an example of this, consider the random variable \(S\) defined on the die roll sample space \(\Omega = \{1,\ldots, 6\}\) by
\[
S(i) =
\begin{cases}
1 &\text{ if } i \text{ is even}\\
-1 &\text{ if } i \text{ is odd}.
\end{cases}
\]
Notice that, like our variable \(C\) defined for coin flips, we have \(P(S = 1) = P(S = -1) = 1/2\), so in some sense \(C\) and \(S\) are “the same.” However, they are defined on different sample spaces: \(C\) is defined on the sample space of coin flips, while \(S\) is defined on the sample space of die rolls.

Consider a game where the play is determined by a coin flip and a die roll. For the examples above, the random variable \(C\) depends only on the outcome of the coin flip, while \(R\) and \(S\) depend only on the outcome of the die roll. Since the outcome of the coin flip has no effect on the outcome of the die roll, the variables \(C\) and \(R\) are independent of one another, as are \(C\) and \(S\). However, \(R\) and \(S\) depend on the same outcome (the die roll) so their values may depend on each other. In fact, the value of \(S\) is completely determined by the value of \(R\)! So knowing the value of \(R\) allows us to determine the value of \(S\), and knowing the value of \(S\) tells us something about the value of \(R\) (namely whether \(R\) is even or odd).

Definition. Suppose \(X\) and \(Y\) are random variables defined on the same probability space (i.e., \(X, Y : \Omega \to \mathbf{R}\)). We say that \(X\) and \(Y\) are independent if for all possible values \(x\) of \(X\) and \(y\) of \(Y\) we have
\[
P(X = x \text{ and } Y = y) = P(X = x) P(Y = y).
\]

For our examples above with the coin flip and the die roll, \(C\) and \(R\) cannot be said to be independent because they are defined on different probability spaces. The variables \(R\) and \(S\) are both defined on the die roll sample space, so they can be compared. However, they are not independent. For example, we have \(P(R = 1) = 1/6\) and \(P(S = 1) = 1/2\). Since \(S(i) = 1\) only when \(i\) is even, we have
\[
P(R = 1 \text{ and } S = 1) = 0 \neq \frac{1}{6} \cdot \frac{1}{2}.
\]

Let \(W\) be the random variable on the die roll sample space defined by
\[
W(i) =
\begin{cases}
1 & \text{ if } i = 1, 4\\
2 & \text{ if } i = 2, 5\\
3 & \text{ if } i = 3, 6.
\end{cases}
\]
We claim that \(W\) and \(S\) are independent. This can be verified by brute force calculation. For example, note that we have \(S = 1\) and \(W = 1\) only when the outcome of the die roll is 4. Therefore,
\[
P(S = 1 \text{ and } W = 1) = \frac{1}{6} = P(S = 1) P(W = 1).
\]
Similar calculations show that the analogous equalities hold for all possible values of \(S\) and \(W\), hence these random variables are independent.
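Here is that brute force calculation as a short Python check; the encodings of \(S\) and \(W\) as functions on \(\{1, \ldots, 6\}\) are mine.

    from fractions import Fraction
    from itertools import product

    P = {i: Fraction(1, 6) for i in range(1, 7)}  # fair die

    S = lambda i: 1 if i % 2 == 0 else -1         # +1 on even rolls, -1 on odd
    W = lambda i: (i - 1) % 3 + 1                 # W(1)=W(4)=1, W(2)=W(5)=2, W(3)=W(6)=3

    def prob(event):
        """Probability that the predicate `event` holds for the die roll."""
        return sum(p for w, p in P.items() if event(w))

    # Check the product rule for every pair of values of S and W.
    print(all(
        prob(lambda i: S(i) == s and W(i) == t)
        == prob(lambda i: S(i) == s) * prob(lambda i: W(i) == t)
        for s, t in product([-1, 1], [1, 2, 3])
    ))  # True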

Given two random variables \(X\) and \(Y\) defined on the same probability space, we can define their sum \(X + Y\) and product \(X Y\).

Proposition. Suppose \(X, Y : \Omega \to \mathbf{R}\) are independent random variables. Then
\[
E(X + Y) = E(X) + E(Y) \quad\text{and}\quad E(X Y) = E(X) E(Y).
\]

Proof. For the first equality, by definition we compute
\[
E(X + Y) = \sum_{x, y} P(X = x \text{ and } Y = y) (x + y).
\]
Using the fact that \(X\) and \(Y\) are independent, we find
\[
\begin{align}
E(X + Y) &= \sum_{x, y} P(X = x \text{ and } Y = y) (x + y)\\
&= \sum_{x, y} P(X = x) P(Y = y) (x + y)\\
&= \sum_{x} P(X = x) x \sum_y P(Y = y) + \sum_y P(Y = y) y \sum_x P(X = x)\\
&= \sum_x P(X = x) x + \sum_y P(Y = y) y\\
&= E(X) + E(Y).
\end{align}
\]
The fourth equality holds because \(P\) satisfies \(\sum_x P(X = x) = 1\) and \(\sum_y P(Y = y) = 1\). Similarly, we compute
\[
\begin{align}
E(X Y) &= \sum_{x, y} P(X = x \text{ and } Y = y) x y\\
&= \sum_{x, y} P(X = x) P(Y = y) x y\\
&= \left(\sum_{x} P(X = x) x\right) \left(\sum_{y} P(Y = y) y\right)\\
&= E(X) E(Y).
\end{align}
\]
These equations give the desired results. ∎

The equation \(E(X + Y) = E(X) + E(Y)\) is satisfied even if \(X\) and \(Y\) are not independent. This fundamental fact about probability is known as the linearity of expectation. However, the equality \(E(X Y) = E(X) E(Y)\) can fail when \(X\) and \(Y\) are not independent.

Exercise. Prove that \(E(X + Y) = E(X) + E(Y)\) without assuming that \(X\) and \(Y\) are independent.

Exercise. Give an example of random variables \(X\) and \(Y\) for which \(E(X Y) \neq E(X) E(Y)\). (Note that \(X\) and \(Y\) cannot be independent.)

computer science expository math musings

John Nash, Cryptography, and Computational Complexity


Recently, the brilliant mathematician John Nash and his wife were killed in a car crash. While Nash was probably most famous for his pioneering work on game theory and his portrayal in Ron Howard’s popular film A Beautiful Mind, he also worked in the field of cryptography. Several years ago, the NSA declassified several letters from 1955 between Nash and the NSA wherein Nash describes some ideas for an encryption/decryption scheme. While the NSA was not interested in the particular scheme devised by Nash, it seems that Nash foresaw the importance of computational complexity in the field of cryptography. In the letters, Nash states:

Consider the enciphering process with a finite “key,” operating on binary messages. Specifically, we can assume the process [is] described by a function
\[
y_i = F(\alpha_1, \alpha_2, \ldots, \alpha_r; x_i, x_{i-1}, \ldots, x_{i-n})
\]
where the \(\alpha\)’s, \(x\)’s and \(y\)’s are mod 2 and if \(x_i\) is changed with the other \(x\)’s and \(\alpha\)’s left fixed then \(y_i\) is changed. The \(\alpha\)’s denote the “key” containing \(r\) bits of information. \(n\) is the maximum span of the “memory” of the process…

…We see immediately that in principle the enemy needs very little information to break down the process. Essentially, as soon as \(r\) bits of enciphered message have been transmitted the key is about determined. This is no security, for a practical key should not be too long. But this does not consider how easy or difficult it is for the enemy to make the computation determining the key. If this computation, although always possible in principle, were sufficiently long at best the process could still be secure in a practical sense.

Nash goes on to say that

…a logical way to classify the enciphering process is the way in which the computation length for the computation on the key increases with increasing length of the key… Now my general conjecture is as follows: For almost all sufficiently complex types of enciphering…the mean key computation length increases exponentially with the length of the key.

The significance of this general conjecture, assuming its truth, is easy to see. It means that it is quite feasible to design ciphers that are effectively unbreakable.

To my knowledge, Nash’s letter is the earliest reference to using computational complexity to achieve practical cryptography. The idea is that while it is theoretically possible to decrypt an encrypted message without the key, doing so requires a prohibitive amount of computational resources. Interestingly, Nash’s letter predates an (in)famous 1956 letter from Kurt Gödel to John von Neumann which is widely credited as being the first reference to the “P vs NP problem.” However, the essential idea of the P vs NP problem is nascent in Nash’s conjecture: there are problems whose solutions can be efficiently verified, but finding such a solution is computationally intractable. Specifically, a message can easily be decrypted if one knows the key, but finding the key to decrypt the message is hopelessly difficult.

The P vs NP problem was only formalized 16 years after Nash’s letters by Stephen Cook in his seminal paper, The complexity of theorem-proving procedures. In fact, Nash’s conjecture is strictly stronger than the P vs NP problem–its formulation is more akin to the exponential time hypothesis, which was only formulated in 1999!

Concerning Nash’s conjecture, he was certainly aware of the difficulty of its proof:

The nature of this conjecture is such that I cannot prove it, even for a special type of cipher. Nor do I expect it to be proven. But this does not destroy its significance. The probability of the truth of the conjecture can be guessed at on the basis of experience with enciphering and deciphering.

Indeed, the P vs NP problem remains among the most notorious open problems in mathematics and theoretical computer science.

PDF of Nash’s Letters to the NSA.

expository math teaching

Logic and Sets

I have just uploaded notes on Basic Logic and Naive Set Theory for math 115AH. Please let me know in the comments below if you notice typos or if anything is unclear.

computer science expository math

Shorter Certificates for Set Disjointness


Suppose two players, Alice and Bob, each hold equal sized subsets of \([n] = \{1, 2, \ldots, n\}\). A third party, Carole, wishes to convince Alice and Bob that their subsets are disjoint, i.e., their sets have no elements in common. How efficiently can Carole prove to Alice and Bob that their sets are disjoint?

To formalize the situation, suppose \(k\) is an integer with \(1 < k < n/2\). Alice holds a set \(A \subseteq [n]\) with \(|A| = k\) and similarly Bob holds \(B \subseteq [n]\) with \(|B| = k\). We assume that Carole sees the sets \(A\) and \(B\), but Alice and Bob have no information about each other’s sets. We allow Carole to send messages to Alice and Bob, and we allow Alice and Bob to communicate with each other. Our goal is to minimize the total amount of communication between Carole, Alice, and Bob. In particular, we will consider communication protocols of the following form:

  1. Carole produces a certificate or proof that \(A \cap B = \emptyset\) and sends this certificate to Alice and Bob.
  2. Alice and Bob individually verify that Carole’s certificate is valid with respect to their individual inputs. Alice and Bob then send each other (very short) messages saying whether or not they accept Carole’s proof. If they both accept the proof, they can be certain that their sets are disjoint.
In order for the proof system described above to be valid, it must satisfy the following two properties:

Completeness. If \(A \cap B = \emptyset\) then Carole can send a certificate that Alice and Bob both accept.

Soundness. If \(A \cap B \neq \emptyset\) then any certificate that Carole sends must be rejected by at least one of Alice and Bob.

Before giving a “clever” solution to this communication problem, we describe a naive solution. Since Carole sees \(A\) and \(B\), her proof of their disjointness could simply be to send Alice and Bob the (disjoint) pair \(C = (A, B)\). Then Alice verifies the validity of the certificate \(C\) by checking that the pair \(C\) is indeed disjoint and that her input \(A\) is equal to the first term in \(C\); similarly, Bob checks that \(B\) is equal to the second term. Clearly, if \(A\) and \(B\) are disjoint, Alice and Bob will both accept \(C\), while if \(A\) and \(B\) intersect, Alice or Bob will reject every certificate that Carole could send.

Let us quickly analyze the efficiency of this protocol. The certificate that Carole sends consists of a pair of \(k\)-subsets of \([n]\). The naive encoding of simply listing the elements of \(A\) and \(B\) requires \(2 k \log n\) bits — each \(i \in A \cup B\) requires \(\log n\) bits, and there are \(2 k\) such indices in the list. In fact, even if Carole is much cleverer in her encoding of the sets \(A\) and \(B\), she cannot compress the proof significantly for information theoretic reasons. Indeed, there are
$$
{n \choose k}{n-k \choose k}
$$
distinct certificates that Carole must be able to send, hence her message must be of length at least
$$
\log {n \choose k} \geq \log ((n/k)^k) = k \log n - k \log k.
$$
Is it possible for Carole, Alice, and Bob to devise a more efficient proof system for disjointness?
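For concreteness, here is a minimal Python sketch of the naive protocol. The function names are hypothetical, and the verifiers also check that the published pair is disjoint, as described above.

    def carole_certificate(A, B):
        """Carole's naive proof: publish the pair itself."""
        return (A, B)

    def alice_verifies(A, cert):
        CA, CB = cert
        return A == CA and not (CA & CB)  # her set matches and the pair is disjoint

    def bob_verifies(B, cert):
        CA, CB = cert
        return B == CB and not (CA & CB)

    A, B = {1, 3, 5}, {2, 4, 6}
    cert = carole_certificate(A, B)
    print(alice_verifies(A, cert) and bob_verifies(B, cert))  # True: disjoint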

    read more »

computer science expository math musings

The Aura of Interactive Proofs


In his essay The Work of Art in the Age of Mechanical Reproduction, Walter Benjamin introduces the idea that original artwork has an aura — some ineffable something that the work’s creator imbues into the work, but which is lost in reproductions made by mechanical means. There is something unique about an original work. Let us imagine that Walter is able to read the aura of a work of art, or sense its absence. Thus, he has the seemingly magical ability to tell original art from mechanical forgery.

Andy Warhol is skeptical of Walter’s claims. Andy doesn’t believe in the aura. And even if the aura does exist, he sincerely doubts that Walter is able to detect its presence. In an attempt to unmask Walter for the fraud he undoubtedly is, Andy hatches a cunning plan to catch Walter in a lie. By hand, he paints an original painting, then, using the most advanced technology available, makes a perfect replica of the original. Although the replica looks exactly like the original to the layman (and even to Andy himself), according to Walter, there is something missing in the replica.

Andy’s original plan was to present Walter with the two seemingly identical paintings and simply ask which is the original. He soon realized, however, that this approach would be entirely unsatisfactory for his peace of mind. If Walter picked the true original, Andy still couldn’t be entirely convinced of Walter’s powers: maybe Walter just guessed and happened to guess correctly! How can Andy change his strategy in order to (1) be more likely to catch Walter in his lie (if in fact he is lying) and (2) be more convinced of Walter’s supernatural abilities if indeed he is telling the truth?

read more »

computer science expository math

Factor Characterization of Matrix Rank


I am currently taking a course on communication complexity with Alexander Sherstov. Much of communication complexity involves matrix analysis, so yesterday we did a brief review of results from linear algebra. In the review, Sherstov gave the following definition for the rank of a matrix \(M \in \mathbf{F}^{n \times m}\):
\[
\mathrm{rank}(M) = \min\{k | M = A B, A \in \mathbf{F}^{n \times k}, B \in \mathbf{F}^{k \times m}\}
\]
Since this “factor” definition of rank appears very different from the standard definition given in a linear algebra course, I thought I would prove that the two definitions are equivalent.

The “standard” definition of rank is
\[
\mathrm{rank}(M) = \mathrm{dim}\ \mathrm{span} \{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_m\}
\]
where \(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_m\) are the columns of \(M\). Equivalently, \(\mathrm{rank}(M)\) is the number of linearly independent columns of \(M\).

An easy consequence of the standard definition of rank is that the rank of a product of matrices is at most the rank of the first matrix: \(\mathrm{rank}(A B) \leq \mathrm{rank}(A)\). Indeed, the columns of \(AB\) are linear combinations of the columns of \(A\), hence the span of the columns of \(AB\) is a subspace of the span of the columns of \(A\). Thus, if we can write \(M = A B\) with \(A \in \mathbf{F}^{n \times k}\) and \(B \in \mathbf{F}^{k \times m}\), we must have \(\mathrm{rank}(M) \leq k\).

It remains to show that if \(\mathrm{rank}(M) = k\) then we can factor \(M = A B\) with \(A \in \mathbf{F}^{n \times k}\) and \(B \in \mathbf{F}^{k \times m}\). To this end, assume without loss of generality that the first \(k\) columns of \(M\) are linearly independent. Write these columns as vectors
\(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k\). Denote the remaining columns by \(\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_{m-k}\). Since these vectors lie in the span of the first \(k\) columns of \(M\), we can write the \(\mathbf{w}_j\) as linear combinations of the \(\mathbf{v}_i\):
\[
\mathbf{w}_j = \sum_{i = 1}^k b_{i, j} \mathbf{v}_i \quad\text{for}\quad j = 1, 2, \ldots, m - k.
\]
Now define the matrices
\[
A = (\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_k)
\]
and
\[
B = (\mathbf{e}_1\ \mathbf{e}_2\ \cdots\ \mathbf{e}_k\ \mathbf{b}_1\ \mathbf{b}_2\ \cdots\ \mathbf{b}_{m - k})
\]
where \(\mathbf{e}_j\) is the \(j\)th standard basis vector in \(\mathbf{F}^k\) and
\[
\mathbf{b}_j =
\begin{pmatrix}
b_{1, j}\\
b_{2, j}\\
\vdots\\
b_{k, j}
\end{pmatrix}
\quad\text{for}\quad j = 1, 2, \ldots, m - k.
\]
It is straightforward to verify that we have \(M = A B\), which proves the equivalence of the two definitions of rank.
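As a quick numerical illustration (not part of the proof), here is a small numpy sketch of the factorization. The example matrix is mine, chosen so that its first \(k\) columns are linearly independent.

    import numpy as np

    # A rank-2 matrix: its third column is the sum of the first two.
    M = np.array([[1., 0., 1.],
                  [0., 1., 1.],
                  [1., 1., 2.]])

    k = np.linalg.matrix_rank(M)               # k = 2

    A = M[:, :k]                               # first k columns, independent here
    B, *_ = np.linalg.lstsq(A, M, rcond=None)  # solve A @ B = M (exact, since rank k)

    print(np.allclose(A @ B, M), A.shape, B.shape)  # True (3, 2) (2, 3)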

expository game theory

The Game Theory of (Anti) Vaccination


I recently read an article about the prevalence of parents not vaccinating their children in certain (read: affluent) LA communities. While I vehemently disagree with the anti-vaccination movement, there are certain game-theoretic incentives that might compel people not to vaccinate. The idea is very closely related to the prisoner’s dilemma–one of the most studied scenarios in game theory.

Imagine that you do believe that vaccines have the potential to cause harm. Maybe not everyone vaccinated is harmed by the vaccine, but you believe there is a chance that the vaccine itself will cause some malady. Let’s quantify the (perceived) harm caused by vaccination to be, say, \(4\) harm units. If an otherwise rational person chooses not to vaccinate, it is probably because the perceived harm done by the vaccine is greater than the perceived harm from the disease the vaccine is meant to prevent. So let’s quantify the (anticipated) harm caused by not vaccinating to be, say, \(1\) harm unit. In this (overly simplified and probably inaccurate) world, any reasonable person would choose no vaccine (\(1\) harm unit) over vaccination (\(4\) harm units).

Here is the problem: by not vaccinating your own children, you put the entire rest of the population at a greater risk of the disease. We can formalize this as follows. Assume there is a population of \(n\) people, each of whom chooses either a vaccine or no vaccine. A person who gets vaccinated incurs \(4\) harm units, but if a person chooses not to be vaccinated, the entire population of \(n\) people incurs \(1\) harm unit.

Bob is deciding whether or not to vaccinate. Regardless of Bob’s choice, some people will vaccinate, while others will not. Suppose \(k\) people choose not to vaccinate. Then every person incurs \(k\) harm units from those people. Thus, if Bob chooses to vaccinate, he will incur \(k + 4\) harm units (while each non-vaccinator incurs \(k\)), but if he chooses not to, he will only incur \(k + 1\) harm units (as will the other non-vaccinators). Thus, Bob has an incentive to not vaccinate. Bob is no martyr (in fact, he is downright selfish) so he decides not to vaccinate.
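Here is Bob's payoff computation as a tiny Python sketch of the toy model; the numbers are the made-up harm units from above.

    def harm(vaccinates, k):
        """Bob's harm when k *other* people refuse the vaccine: vaccinating
        costs a perceived 4 units; every refusal costs everyone 1 unit."""
        return k + 4 if vaccinates else k + 1

    # Refusing is better for Bob no matter what everyone else does...
    print(all(harm(False, k) < harm(True, k) for k in range(1000)))  # True
    # ...yet if all n people refuse, each incurs n units instead of the 4
    # units each would incur had everyone vaccinated.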

What is interesting about this scenario is that any individual can improve their own situation (i.e., incur less harm) by selfishly choosing not to vaccinate. But if everyone (or even just \(4\) people) chooses not to vaccinate, everyone is worse off than if they had all chosen to vaccinate. In fact, if everyone acts in their own best interest, everyone achieves the worst possible outcome!

expository math

Writing Natural Numbers as Sums of Primes


The prime numbers are defined in terms of the multiplicative structure of the natural numbers, \(\mathbf{N}\). Specifically, the definition of prime–that \(p \in \mathbf{N}\) is prime if it has no non-trivial divisors–only refers to multiplication, not addition. The additive structure of the prime numbers is much less well understood, and indeed, many of the most famous open problems and deepest theorems in number theory concern the additive structure of the primes. For example, see the twin primes conjecture, the Goldbach conjecture, the abc conjecture, and the Green-Tao theorem. In this post, we consider a much weaker variant of the Goldbach Conjecture:

Theorem 1. Every natural number \(m\) with \(1 < m < 2^{n + 1}\) can be written as a sum of at most \(n\) primes.

Before giving a rigorous proof of Theorem 1, we give the following intuitive motivation. Suppose we were asked to find relatively few primes \(p_1, p_2, \ldots, p_k\) whose sum is \(m\). A reasonable way to proceed would be to find, say, the largest prime \(p_1 < m - 1\). Then taking \(m_1 = m - p_1\), we need only find primes \(p_2, \ldots, p_k\) whose sum is \(m_1\). Thus, we have reduced the problem of writing \(m\) as a sum of primes to writing the strictly smaller \(m_1\) as a sum of primes. Iterating this procedure, we will eventually find \(p_1, p_2, \ldots, p_k\) with \(m = p_1 + p_2 + \cdots + p_k\). The only question is how large \(k\) must be. Theorem 1 suggests that we should always be able to have \(k \leq \lfloor \log_2 (m) \rfloor\). The key fact that we will use to prove that \(k \leq \lfloor \log_2 (m) \rfloor\) suffices is a statement about the density of the primes: Bertrand’s Postulate.

Bertrand’s Postulate. For every natural number \(m > 1\) there exists a prime \(p\) satisfying \(m < p < 2 m\).

Bertrand’s Postulate is precisely the key piece we need in order to use the heuristic described above to prove Theorem 1. Indeed, Bertrand’s Postulate shows that the prime \(p_1\) satisfies \(p_1 > (m - 1) / 2\). Thus, \(m_1 = m - p_1 \leq m / 2\). Since each iteration of our procedure for finding \(p_1, \ldots, p_k\) cuts the size of the problem in half, the process should terminate after about \(\log_2 m\) iterations. We are now ready to formalize this intuition with a proof.

Proof of Theorem 1. We argue by induction on \(n\). For the base case, \(n = 1\), the only natural numbers \(m\) with \(1 < m < 2^{n+1}\) are \(m = 2, 3\). Since both of these numbers are prime, the theorem holds.

For the inductive step, suppose Theorem 1 holds for some fixed \(n \geq 1\). We will show that Theorem 1 also holds for \(n+1\). Suppose \(m\) satisfies \(1 < m < 2^{n + 2}\). If \(m < 2^{n + 1}\) then the inductive hypothesis applies directly, so assume \(m \geq 2^{n + 1}\). Let \(p_1\) be the largest prime with \(p_1 < m - 1\). By Bertrand’s Postulate, \(p_1 > (m - 1) / 2\), thus \(m_1 = m - p_1\) satisfies \(m_1 < 2^{n + 1}\). By the inductive hypothesis, \(m_1\) is the sum of primes \(m_1 = p_2 + p_3 + \cdots + p_k\) with \(k \leq n + 1\). Thus, \(m = p_1 + m_1 = p_1 + p_2 + \cdots + p_k\) is the sum of at most \(n + 1\) primes, as desired. ∎
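For fun, here is a short Python transcription of the greedy procedure from the proof; it is a direct (and deliberately inefficient) rendering of the heuristic, not a serious algorithm.

    def is_prime(p):
        if p < 2:
            return False
        d = 2
        while d * d <= p:
            if p % d == 0:
                return False
            d += 1
        return True

    def prime_summands(m):
        """Greedy procedure from the proof: while m is not itself prime,
        strip off the largest prime p < m - 1 and continue with m - p."""
        assert m > 1
        summands = []
        while not is_prime(m):
            p = max(q for q in range(2, m - 1) if is_prime(q))
            summands.append(p)
            m -= p
        summands.append(m)
        return summands

    print(prime_summands(100))  # [97, 3]: well within log2(100) ~ 6.6 primes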