The Robustness of the Diversity Prediction Theorem I: generalizing the result


Pick a question to which the answer is a number. What is the population of Weston-super-Mare? How many ants are there in the world? Then go out into the street and ask the first ten people you meet for their answers. Look how far those answers lie from the truth. Now take the average of the answers---their mean. And look how far it lies from the truth. If you measure the distance between an answer and the true value in the way many statisticians do---that is, you take the difference and square it---then you'll notice that the distance of the average from the truth is less than the average distance from the truth. I can predict this with confidence (provided your ten respondents don't all give exactly the same answer) because it is no coincidence, nor even just a common but contingent outcome---it's a mathematical fact. And it remains a mathematical fact for many alternative ways of measuring the distance from an answer to the truth---essentially, all that's required is that the measure is strictly convex in its first argument (Larrick et al. 2012). That's a neat result! In expectation, you're better off going with the average answer than picking an answer at random and going with that.
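
To make that concrete, here's a minimal sketch in Python; the truth and the street answers are invented purely for illustration, and any numbers will do so long as the answers aren't all identical.

```python
# A quick numerical check of the claim above: under squared error, the mean
# answer is closer to the truth than the answers are on average, provided
# the answers aren't all equal. The numbers below are made up for illustration.

truth = 76_000                      # hypothetical true value
answers = [50_000, 120_000, 90_000, 30_000, 65_000,
           100_000, 80_000, 40_000, 150_000, 70_000]

def sq(x, y):
    """Squared difference: q(x, y) = (x - y)^2."""
    return (x - y) ** 2

mean_answer = sum(answers) / len(answers)

error_of_mean = sq(mean_answer, truth)
mean_of_errors = sum(sq(a, truth) for a in answers) / len(answers)

print(error_of_mean, mean_of_errors)
assert error_of_mean < mean_of_errors   # holds whenever the answers aren't all equal
```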

But, if we keep using squared difference, we get an even more interesting result. The distance the average answer lies from the truth is equal to the average distance the answers lie from the truth less a quantity that measures the diversity of the answers. Scott Page calls this the Diversity Prediction Theorem and uses it as part of his general argument that diversity is valuable in group reasoning and decision-making. The quantity that measures the diversity of the answers is the average distance the answers lie from the mean answer: the answers in a homogeneous set will, on average, lie close to their mean, while those in a heterogeneous set will, on average, lie far from their mean. This has an intriguing corollary: given two collections of answers with the same average distance from the truth, the mean of the more diverse one will be more accurate than the mean of the less diverse one. That's another neat result! It isn't quite as neat as people sometimes make out. Sometimes, they say it shows that more diverse sets of answers are more accurate; but that's only true when you're comparing sets with the same average distance from the truth, and increasing diversity can often increase the average distance from the truth. But it's still a neat result!
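
Here's a small numerical illustration of the corollary, again a sketch in Python with invented answers: two sets with exactly the same average squared error from the truth, where the more diverse set has the more accurate mean.

```python
# Two invented sets of answers with the *same* average squared error from
# the truth; the more diverse set has the more accurate mean, because the
# decomposition subtracts a larger diversity term.

truth = 10
diverse = [5, 15, 16]       # spread out around the truth
similar = [15, 15, 16]      # clustered together, all too high

def sq(x, y):
    return (x - y) ** 2

def summary(answers):
    mean = sum(answers) / len(answers)
    avg_error = sum(sq(a, truth) for a in answers) / len(answers)
    diversity = sum(sq(a, mean) for a in answers) / len(answers)
    error_of_mean = sq(mean, truth)
    # Diversity Prediction Theorem: error of mean = average error - diversity
    assert abs(error_of_mean - (avg_error - diversity)) < 1e-9
    return avg_error, diversity, error_of_mean

print(summary(diverse))   # same average error, high diversity, accurate mean
print(summary(similar))   # same average error, low diversity, inaccurate mean
```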


A natural question: is it a quirk of mean squared error? In fact, it's not. There are many, many ways of measuring the distance between answers, mean answers, and the truth for which the Diversity Prediction Theorem holds. As I'll show here, they are exactly the so-called Bregman divergences, which I'll introduce below.

That every Bregman divergence gives the Diversity Prediction Theorem was shown by David Pfau (2013) in an unpublished note. I haven't been able to find a proof that only those functions give it, but it follows reasonably straightforwardly from a characterization of Bregman divergences due to Banerjee et al. (2005).

Having talked informally through the results I want to present, let me now present them precisely.

First, the measure of distance between two numbers that is used in the standard version of the Diversity Prediction Theorem:

Definition 1 (Squared difference)$$q(x, y) := (x-y)^2$$

So our first result is this:

Proposition 1 Suppose $a_1, \ldots, a_n, t$ are real numbers with $a_i \neq a_j$ for some $i, j$. And let $a^\star = \frac{1}{n} \sum^n_{i=1} a_i$. Then$$q(a^\star, t) < \frac{1}{n} \sum^n_{i=1} q(a_i, t)$$

As I mentioned, this generalises to any measure of distance that is strictly convex in its first argument.

Definition 2 (Strictly convex function) Suppose $X$ is a convex subset of the real numbers and $d : X \times X \rightarrow \mathbb{R}$. Then $d$ is strictly convex in its first argument if, for any $x_1, x_2, y$ in $X$ and any $0 < \lambda < 1$,$$d(\lambda x_1 + (1-\lambda) x_2, y) < \lambda d(x_1, y) + (1-\lambda) d(x_2, y)$$

Proposition 2 Suppose $X$ is a convex subset of the real numbers and $d : X \times X \rightarrow [0, \infty]$ is strictly convex in its first argument. Suppose $a_1, \ldots, a_n, t$ are real numbers in $X$ with $a_i \neq a_j$ for some $i, j$. And let $a^\star = \frac{1}{n} \sum^n_{i=1} a_i$. Then$$d(a^\star, t) < \frac{1}{n} \sum^n_{i=1} d(a_i, t)$$

Proof. This follows immediately from Jensen's inequality applied to the strictly convex function $x \mapsto d(x, t)$: for any strictly convex function $f$, and any $a_1, \ldots, a_n$ with $a_i \neq a_j$ for some $i, j$,$$f\left (\frac{1}{n}\sum^n_{i=1} a_i \right ) < \frac{1}{n} \sum^n_{i=1} f(a_i).$$$\Box$
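
For a quick sanity check with a distance measure other than squared difference, here's a sketch in Python using $d(x, y) = (x - y)^4$, which is strictly convex in its first argument; the answers are the same invented ones as above. Note that Proposition 2 only promises the inequality for such a measure, not the exact decomposition we meet next.

```python
# Checking Proposition 2 with the quartic distance d(x, y) = (x - y)^4,
# which is strictly convex in its first argument. The answers and truth
# are invented for illustration.

truth = 76_000
answers = [50_000, 120_000, 90_000, 30_000, 65_000,
           100_000, 80_000, 40_000, 150_000, 70_000]

def quartic(x, y):
    return (x - y) ** 4

mean_answer = sum(answers) / len(answers)

distance_of_mean = quartic(mean_answer, truth)
mean_of_distances = sum(quartic(a, truth) for a in answers) / len(answers)

print(distance_of_mean, mean_of_distances)
assert distance_of_mean < mean_of_distances
```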

Next, we meet the Diversity Prediction Theorem:

Proposition 3 Suppose $a_1, \ldots, a_n, t$ are real numbers with $a_i \neq a_j$ for some $i, j$. And let $a^\star = \frac{1}{n} \sum^n_{i=1} a_i$. Then$$q(a^\star, t) = \frac{1}{n} \sum^n_{i=1} q(a_i, t) - \frac{1}{n} \sum^n_{i=1} q(a_i, a^\star)$$

I won't offer a proof, since it follows from the more general version below. To state that, we need the notion of a Bregman divergence:

Definition 3 (Bregman divergence) Suppose $X$ is a convex subset of the real numbers and $\varphi : X \rightarrow \mathbb{R}$ is a continuously differentiable, strictly convex function. Then, for $x, y$ in $X$, $$d_\varphi(x, y) = \varphi(x) - \varphi(y) - \varphi'(y)(x-y)$$We say that $d_\varphi$ is the Bregman divergence generated by $\varphi$.

To calculate the Bregman divergence from $x$ to $y$ generated by $\varphi$, you take the tangent to $\varphi$ at $y$ and calculate the difference between $\varphi$ at $x$ and this tangent at $x$. This is illustrated in Figure 1. Note that, for any $x, y$ in $X$, $d_\varphi(x, y) \geq 0$, with equality iff $x = y$.

Figure 1: $\varphi(x) = x\log(x)$ is plotted in blue; the tangent to $\varphi$ at $y$ is plotted in yellow; $d_\varphi(x, y)$ is the length of the dashed line.

Two examples of Bregman divergences, with the strictly convex functions that generate them:

Squared Euclidean distance  Suppose $\varphi(x) = x^2$. Then$$d_\varphi(x, y) = (x-y)^2$$

Generalized Kullback-Leibler divergence  Suppose $\varphi(x) = x\log(x)$. Then$$d_\varphi(x, y) = x\log \left ( \frac{x}{y} \right ) - x + y$$
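
The definition translates directly into code. Here's a minimal Python sketch of a Bregman divergence built from a generator $\varphi$ and its derivative, checked against the two examples just given; the helper name bregman is mine, not standard.

```python
import math

def bregman(phi, dphi):
    """Return the Bregman divergence generated by phi, given its derivative dphi."""
    def d(x, y):
        # phi at x, minus the tangent to phi at y evaluated at x
        return phi(x) - phi(y) - dphi(y) * (x - y)
    return d

# Squared Euclidean distance: phi(x) = x^2 gives d(x, y) = (x - y)^2
sq_euclid = bregman(lambda x: x ** 2, lambda x: 2 * x)

# Generalized Kullback-Leibler divergence: phi(x) = x log x gives
# d(x, y) = x log(x / y) - x + y
gen_kl = bregman(lambda x: x * math.log(x), lambda x: math.log(x) + 1)

x, y = 0.3, 0.7
print(sq_euclid(x, y), (x - y) ** 2)                 # the two should agree
print(gen_kl(x, y), x * math.log(x / y) - x + y)     # the two should agree
```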

Proposition 4 (Pfau 2013) Suppose $X$ is a convex subset of the real numbers and $\varphi : X \rightarrow \mathbb{R}$ is a continuously differentiable, strictly convex function. Suppose $a_1, \ldots, a_n, t$ are real numbers in $X$ with $a_i \neq a_j$ for some $i, j$. And let $a^\star = \frac{1}{n} \sum^n_{i=1} a_i$. Then$$d_\varphi(a^\star, t) = \frac{1}{n} \sum^n_{i=1} d_\varphi(a_i, t) - \frac{1}{n} \sum^n_{i=1} d_\varphi(a_i, a^\star)$$

Proof. The result follows easily from the following three equations: subtract the third from the second, and the $\frac{1}{n}\sum^n_{i=1}\varphi(a_i)$ terms cancel, leaving exactly $d_\varphi(a^\star, t)$ as given by the first.

\begin{eqnarray*}
d_\varphi(a^\star, t) & = & \varphi(a^\star) - \varphi(t) - \varphi'(t)(a^\star - t) \\
& & \\
\frac{1}{n}\sum^n_{i=1} d_\varphi(a_i, t) & = & \frac{1}{n} \sum^n_{i=1} \varphi(a_i) - \varphi(t) - \varphi'(t)\left (\frac{1}{n} \sum^n_{i=1} a_i - t \right ) \\
& = & \frac{1}{n} \sum^n_{i=1} \varphi(a_i) - \varphi(t) - \varphi'(t)\left (a^\star - t \right ) \\
& & \\
\frac{1}{n} \sum^n_{i=1} d_\varphi(a_i, a^\star) & = & \frac{1}{n}\sum^n_{i=1}\varphi(a_i) - \varphi(a^\star) - \varphi'(a^\star)\left (\frac{1}{n}\sum^n_{i=1} a_i - a^\star \right ) \\
& = & \frac{1}{n}\sum^n_{i=1}\varphi(a_i) - \varphi(a^\star)
\end{eqnarray*}

$\Box$
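
And here's a numerical spot-check of Proposition 4, a sketch using the generalized Kullback-Leibler divergence from above with some invented positive answers; by the theorem, the two printed quantities should agree up to floating-point error.

```python
import math

def gen_kl(x, y):
    # Bregman divergence generated by phi(x) = x log x
    return x * math.log(x / y) - x + y

# Invented positive answers and truth
answers = [0.2, 0.35, 0.5, 0.9, 0.65]
truth = 0.4
mean = sum(answers) / len(answers)

error_of_mean = gen_kl(mean, truth)
avg_error = sum(gen_kl(a, truth) for a in answers) / len(answers)
diversity = sum(gen_kl(a, mean) for a in answers) / len(answers)

# Diversity Prediction Theorem for this Bregman divergence
print(error_of_mean, avg_error - diversity)
assert abs(error_of_mean - (avg_error - diversity)) < 1e-12
```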

Next, we prove that the Bregman divergences are the only functions that give the Diversity Prediction Theorem. The proof relies on the following characterization of Bregman divergences.

Lemma 5 (Banerjee et al. 2005) Suppose $X$ is a convex subset of the real numbers and $d : X \times X \rightarrow [0, \infty]$. And suppose that, for any $a_1, \ldots, a_n$ in $X$, with $a^\star = \frac{1}{n}\sum^n_{i=1} a_i$, the function $t \mapsto \frac{1}{n} \sum^n_{i=1} d(a_i, t)$ is minimized at $t = a^\star$. Then there is a continuously differentiable and strictly convex function $\varphi$ on $X$ such that $d = d_\varphi$.

Proposition 6 Suppose $X$ is a convex subset of the real numbers and $d : X \times X \rightarrow [0, \infty]$ with $d(x, y) = 0$ iff $x = y$. And suppose that, for any $a_1, \ldots, a_n, t$ in $X$ with $a_i \neq a_j$ for some $i, j$, with $a^\star = \frac{1}{n} \sum^n_{i=1} a_i$, $$d(a^\star, t) = \frac{1}{n} \sum^n_{i=1} d(a_i, t) - \frac{1}{n} \sum^n_{i=1} d(a_i, a^\star)$$Then there is a continuously differentiable and strictly convex function $\varphi$ on $X$ such that $d = d_\varphi$.

Proof. Suppose that, for any $a_1, \ldots, a_n, t$ in $X$ with $a_i \neq a_j$ for some $i, j$, with $a^\star = \frac{1}{n} \sum^n_{i=1} a_i$, $$d(a^\star, t) = \frac{1}{n} \sum^n_{i=1} d(a_i, t) - \frac{1}{n} \sum^n_{i=1} d(a_i, a^\star)$$Then$$\frac{1}{n} \sum^n_{i=1} d(a_i, t) = d(a^\star, t) + \frac{1}{n} \sum^n_{i=1} d(a_i, a^\star)$$Since the second summand on the right does not depend on $t$, the function $t \mapsto \frac{1}{n} \sum^n_{i=1} d(a_i, t)$ is minimized exactly where $t \mapsto d(a^\star, t)$ is minimized; and since, for all $x, y$ in $X$, $d(x, y) \geq 0$ with equality iff $x = y$, $t \mapsto d(a^\star, t)$ is minimized at $t = a^\star$. So, by Lemma 5, there is a continuously differentiable and strictly convex function $\varphi$ on $X$ such that $d = d_\varphi$. $\Box$

References

Banerjee, A., Guo, X., & Wang, H. (2005). On the Optimality of Conditional Expectation as a Bregman Predictor. IEEE Transactions on Information Theory, 51, 2664–2669.

Larrick, R. P., Mannes, A. E., & Soll, J. B. (2012). The Social Psychology of the Wisdom of Crowds. In J. I. Krueger (Ed.), Social Judgment and Decision Making, chap. 13 (pp. 227–242). New York: Psychology Press.

Pfau, D. (2013). A Generalized Bias-Variance Decomposition for Bregman Divergences. Unpublished manuscript.
