Monday, 24 August 2020

Accuracy and explanation: thoughts on Douven

 For a PDF of this post, see here.

Igor has eleven coins in his pocket. The first has 0% chance of landing heads, the second 10% chance, the third 20%, and so on up to the tenth, which has 90% chance, and the eleventh, which has 100% chance. He picks one out without letting me know which, and he starts to toss it. After the first 10 tosses, it has landed tails 5 times. How confident should I be that the coin is fair? That is, how confident should I be that it is the sixth coin from Igor's pocket; the one with 50% chance of landing heads? According to the Bayesian, the answer is calculated as follows:$$P_E(H_5) = P(H_5 | E) = \frac{P(H_5)P(E | H_5)}{\sum^{10}_{i=0} P(H_i) P(E|H_i)}$$where

  • $E$ is my evidence, which says that 5 out of 10 of the tosses landed heads,
  • $P_E$ is my new posterior updating credence upon learning the evidence $E$,
  • $P$ is my prior,
  • $H_i$ is the hypothesis that the coin has $\frac{i}{10}$ chance of landing heads,
  • $P(H_0) = \ldots = P(H_{10}) = \frac{1}{11}$, since I know nothing about which coin Igor pulled from his pocket, and
  • $P(E | H_i) = \left ( \frac{i}{10} \right )^5 \left (\frac{10-i}{10} \right )^5$, by the Principal Principle, and since each coin toss is independent of each other one.

So, upon learning that the coin landed heads five times out of ten, my posterior should be:$$P_E(H_5) = P(H_5 | E) = \frac{P(H_5)P(E | H_5)}{\sum^{10}_{i=0} P(H_i) P(E|H_i)} = \frac{\frac{1}{11} \left ( \frac{5}{10} \right )^5\left ( \frac{5}{10} \right )^5}{\sum^{10}_{i=1}\frac{1}{11} \left ( \frac{i}{10} \right )^5 \left (\frac{10-i}{10} \right )^5 } \approx 0.2707$$But some philosophers have suggested that this is too low. The Bayesian calculation takes into account how likely the hypothesis in question makes the evidence, as well as how likely I thought the hypothesis in the first place, but it doesn't take into account that the hypothesis explains the evidence. We'll call these philosophers explanationists. Upon learning that the coin landed heads five times out of ten, the explanationist says, we should be most confident in $H_5$, the hypothesis that the coin is fair, and the Bayesian calculation does indeed give this. But we should be most confident in part because $H_5$ best explains the evidence, and the Bayesian calculation takes no account of this.

To accommodate the explanationist's demand, Igor Douven proposes the following alternative updating rule:$$P_E(H_k) = P(H_k | E) = \frac{P(H_k)P(E | H_k) + f(H_k, E)}{\sum^{10}_{i=0} (P(H_i) P(E|H_i) + f(H_i, E))}$$where $f$ gives a little boost to $H_k$ if it is the best explanation of $E$ and not if it isn't. Perhaps, for instance,

  • $f(H_k, E) = 0.1$, if the frequency of heads among the coin tosses that $E$ reports is uniquely closest to the chance of heads according to $H_k$, namely, $\frac{k}{10}$,
  • $f(H_k, E) = 0.05$, if the frequency of heads among the coin tosses that $E$ reports is equally closest to the chance of heads according to $H_k$ and another hypothesis,
  • $f(H_k, E) = 0$, otherwise.

Thus, according to this:$$P_E(H_5) = \frac{P(H_5)P(E | H_5) + 0.1}{\left (\sum^{10}_{i=0} P(H_i) P(E|H_i) \right ) + 0.1} = \frac{\frac{1}{11} \left ( \frac{5}{10} \right )^5\left ( \frac{5}{10} \right )^5 + 0.1}{\sum^{10}_{i=1}\frac{1}{11} \left ( \frac{i}{10} \right )^5 \left (\frac{10-i}{10} \right )^5 + 0.1 } \approx 0.9746$$So, as required, $H_5$ certainly gets a boost in posterior probability because it best explains the run of heads and tails we observe.

Before we move on, it's worth noting a distinctive feature of this case. In many cases where we wish to apply something like abduction or inference to the best explanation, we might think that we can record our enthusiasm for good explanations in the priors. For instance, suppose I have two scientific theories, $T_1$ and $T_2$, both of which predict the evidence I've collected. So, they both make the evidence equally likely. But I want to assign higher probability to $T_1$ upon receipt of that evidence because it provides a better explanation for the evidence. Then I should simply encode this in my prior. That is, I should assign $P(T_1) > P(T_2)$. But that sort of move isn't open to us in Douven's example. The reason is that none of the chance hypotheses are better explanations in themselves: none is simpler or more general or what have you. But rather, for each, there is evidence we might obtain such that it is a better explanation of that evidence. But before we obtain the evidence, we don't know which will prove the better explanation of it, and so can't accommodate our explanationist instincts by giving that hypothesis a boost in our prior.

Now let's return to the example. There are well known objections to updating in the explanationist way Douven suggests. Most famously, van Fraassen pointed out that we have good reasons to comply with the Bayesian method of updating, and the explanationist method deviates quite dramatically from that (Laws and Symmetry, chapter 6) . When he was writing, the most compelling argument was David Lewis' diachronic Dutch Book argument. If you plan to update as Douven suggests, by giving an extra-Bayesian boost to the hypothesis that best explains the evidence, then there is a series of bets you'll accept before you receive the evidence and another set you'll accept afterwards that, taken together, will lose you money for sure. Douven is unfazed. He first suggests that vulnerability to a Dutch Book does not impugn your epistemic rationality, but only your practical rationality. He notes Skyrms's claim that, in the case of synchronic Dutch Books, such vulnerability reveals an inconsistency in your assessment of the same bet presented in different ways, and therefore perhaps some epistemic failure, but notes that this cannot be extended to the diachronic case. In any case, he says, avoiding the machinations of malevolent bookies is only one practical concern that we have, and, let's be honest, not a very pressing one. What's more, he points out that, while updating in the Bayesian fashion serves one practical end, namely, making us immune to these sorts of diachronic sure losses, there are other practical ends it might not serve as well. For instance, he uses computer simulations to show that, if we update in his explanationist way, we'll tend to assign credence greater than 0.99 in the true hypothesis much more quickly than if we update in the Bayesian way. He admits that we'll also tend to assign credence greater than 0.99 in a false hypothesis much more quickly than if we use Bayesian updating. But he responds, again with the results of a computer simulation result: suppose we keep tossing the coin until one of the rules assigns more than 0.99 to a hypothesis; then award points to that rule if the hypothesis it becomes very confident in is true, and deduct them if it is false; then the explanationist updating rule will perform better on average than the Bayesian rule. So, if there is some practical decision that you will make only when your credence in a hypothesis exceeds 0.99 -- perhaps the choice is to administer a particular medical treatment, and you need to be very certain in your diagnosis before doing so -- then you will be better off on average updating as Douven suggests, rather than as the Bayesian requires.

So much for the practical implications of updating in one way or another. I am more interested in the epistemic implications, and so is Douven. He notes that, since van Fraassen gave his argument, there is a new way of justifying the Bayesian demand to update by conditioning on your evidence. These are the accuracy arguments. While Douven largely works with the argument for conditioning that Hannes Leitgeb and I gave, I think the better version of that argument is due to Hilary Greaves and David Wallace. The idea is that, as usual, we measure the inaccuracy of a credence function using a strictly proper inaccuracy measure $\mathfrak{I}$. That is, if $P$ is a probabilistic credence function and $w$ is a possible world, then $\mathfrak{I}(P, w)$ gives the inaccuracy of $P$ at $w$. And, if $P$ is a probabilistic credence function, $P$ expects itself to be least inaccurate. That is, $\sum_w P(w) \mathfrak{I}(P, w) < \sum_w P(w) \mathfrak{I}(Q, w)$, for any credence function $Q \neq P$. Then Greaves and Wallace ask us to consider how you might plan to update your credence function in response to different pieces of evidence you might receive. Thus, suppose you know that the evidence you'll receive will be one of the following propositions, $E_1, \ldots, E_m$, which form a partition. This is the situation you're in if you know that you're about to witness 10 tosses of a coin, for instance, as in Douven's example: $E_1$ might be $HHHHHHHHHH$, $E_2$ might be $HHHHHHHHHT$, and so on. Then suppose you plan how you'll respond to each. If you learn $E_i$, you'll adopt $P_i$. Then we'll call this updating plan $\mathcal{R}$ and write it $(P_1, \ldots, P_m)$. Then we can calculate the expected inaccuracy of a given updating plan. Its inaccuracy at a world is the inaccuracy of the credence function it recommends in response to learning the element of the partition that is true at that world. That is, for world $w$ at which $E_i$ is true,$$\mathfrak{I}(\mathcal{R}, w) = \mathfrak{I}(P_i, w)$$And Greaves and Wallace show that the updating rule your prior expects to be best is the Bayesian one. That is, if there is $E_i$ and $P(E_i) > 0$ and $P_i(-) \neq P(X|E_i)$, then there is an alternative updating rule $\mathcal{R}^\star = (P^\star_1, \ldots, P^\star_m)$ such that$$\sum_w P(w) \mathfrak{I}(\mathcal{R}^\star, w) < \sum_w P(w) \mathfrak{I}(\mathcal{R}, w)$$So, in particular, your prior expects the Bayesian rule to be more accurate than Douven's rule.

In response to this, Douven points out that there are many ways in which we might value the accuracy of our updating plans. For instance, the Greaves and Wallace argument considers only your accuracy at a single later point in time, after you've received a single piece of evidence and updated only on it. But, Douven argues, we might be interested not in the one-off inaccuracy of a single application of an updating rule, but rather in its inaccuracy in the long run. And we might be interested in different features of the long-run total inaccuracy of using that rule: we might be interested in just adding up all of the inaccuracies of the various credence functions you obtain from multiple applications of the rule; or we might be less interested in the inaccuracies of the interim credence functions and more interested in the inaccuracy of the final credence function you obtain after multiple updates. And, Douven claims, the accuracy arguments do not tell us anything about which performs better out of the Bayesian and explanationist approaches when viewed in these different ways.

However, that's not quite right. It turns out that we can, in fact, adapt the Greaves and Wallace argument to cover these cases. To see how, it's probably best to illustrate it with the simplest possible case, but it should be obvious how to scale up the idea. So suppose: 

  • my credences are defined over four worlds, $XY$, $X\overline{Y}$, $\overline{X}Y$, and $\overline{X}\overline{Y}$;
  • my prior at $t_0$ is $P$;
  • at $t_1$, I'll learn either $X$ or its negation $\overline{X}$, and I'll respond with $P_X$ or $P_{\overline{X}}$, respectively;
  • at $t_2$, I'll learn $XY$, $X\overline{Y}$, $\overline{X}Y$, or $\overline{X} \overline{Y}$, and I'll respond with $P_{XY}$, $P_{X\overline{Y}}$, $P_{\overline{X}Y}$, or $P_{\overline{X}\overline{Y}}$, respectively.

For instance, I might know that a coin is going to be tossed twice, once just before $t_1$ and once just before $t_2$. So $X$ is the proposition that it lands heads on the first toss, i.e., $X = \{HH, HT\}$, while $\overline{X}$ is the proposition it lands tails on the first toss $\overline{X} = \{TH, TT\}$. And then $Y$ is the proposition it lands heads on the second toss. So $XY = \{HH\}$, $X\overline{Y} = \{HT\}$, and so on.

Now, taken together, $P_X$, $P_{\overline{X}}$, $P_{XY}$, $P_{X\overline{Y}}$, $P_{\overline{X}Y}$, and $P_{\overline{X}\overline{Y}}$ constitute my updating plan---let's denote that $\mathcal{R}$. Now, how might be measure the inaccuracy of this plan $\mathcal{R}$? Well, we want to assign a weight to the inaccuracy of the credence function it demands after the first update -- let's call that $\alpha_1$; and we want a weight for the result of the second update -- let's call that $\alpha_2$. So, for instance, if I'm interested in the total inaccuracy obtained by following this rule, and each time is just as important as each other time, I just set $\alpha_1 = \alpha_2$; but if I care much more about my final inaccuracy, then I let $\alpha_1 \ll \alpha_2$. Then the inaccuracy of my updating rule is$$\begin{eqnarray*}
\mathfrak{I}(\mathcal{R}, XY) &  = & \alpha_1 \mathfrak{I}(P_X, XY) + \alpha_2\mathfrak{I}(P_{XY}, XY) \\
\mathfrak{I}(\mathcal{R}, X\overline{Y}) &  = & \alpha_1 \mathfrak{I}(P_X, X\overline{Y}) + \alpha_2\mathfrak{I}(P_{\overline{X}Y}, X\overline{Y}) \\
\mathfrak{I}(\mathcal{R}, \overline{X}Y) &  = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + \alpha_2\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y) \\
\mathfrak{I}(\mathcal{R}, \overline{X}\overline{Y}) &  = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y}) + \alpha_2\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})
\end{eqnarray*}$$Thus, the expected inaccuracy of $\mathcal{R}$ from the point of view of my prior $P$ is:

$P(XY)\mathfrak{I}(\mathcal{R}, XY) + P(X\overline{Y})\mathfrak{I}(\mathcal{R}, X\overline{Y}) + P(\overline{X}Y)\mathfrak{I}(\mathcal{R}, \overline{X}Y) + P(\overline{X} \overline{Y})\mathfrak{I}(\mathcal{R}, \overline{X}\overline{Y}) = $

$P(XY)[\alpha_1 \mathfrak{I}(P_X, XY) + \alpha_2\mathfrak{I}(P_{XY}, XY)] + $

$P(X\overline{Y})[\alpha_1 \mathfrak{I}(P_X, X\overline{Y}) + \alpha_2\mathfrak{I}(P_{X\overline{Y}}, X\overline{Y})] + $

$P(\overline{X}Y)[\alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + \alpha_2\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y)] + $

$P(\overline{X}\overline{Y})[\alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y}) + \alpha_2\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})]$

But it's easy to see that this is equal to:

$\alpha_1[P(XY)\mathfrak{I}(P_X, XY) + P(X\overline{Y})\mathfrak{I}(P_X, X\overline{Y}) + $

$P(\overline{X}Y)\mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + P(\overline{X}\overline{Y})\mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y})] + $

$\alpha_2[\mathfrak{I}(P_{XY}, XY) + P(X\overline{Y})\mathfrak{I}(P_{X\overline{Y}}, X\overline{Y}) + $

$P(\overline{X}Y)\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y) + P(\overline{X}\overline{Y})\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})]$

Now, this is the weighted sum of the expected inaccuracies of the two parts of my updating plan taken separately; the part that kicks in at $t_1$, and the part that kicks in at $t_2$. And, thanks to Greaves and Wallace's result, we know that each of those expected inaccuracies is minimized by the rule that demands you condition on your evidence. Now, we also know that conditioning $P$ on $XY$ is the same as conditioning $P(-|X)$ on $XY$, and so on. So a rule that tells you, at $t_2$, to update your $t_0$ credence function on your total evidence at $t_2$ is also one that tells you, at $t_2$, to update your $t_1$ credence function on your total evidence at $t_2$. So, of the updating rules that cover the two times $t_1$ and $t_2$, the one that minimizes expected inaccuracy is the one that results from conditioning at each time. That is, if the part of $\mathcal{R}$ that kicks in at $t_1$ doesn't demand I condition my prior on my evidence at $t_1$, or if the part of $\mathcal{R}$ that kicks in at $t_2$ doesn't demand I condition my credence function at $t_1$ on my evidence at $t_2$, then there is an alternative rule $\mathcal{R}^\star$, that $P$ expects to be more accurate: that is,$$\sum_w P(w)\mathfrak{I}(\mathcal{R}^\star, w) < \sum_w P(w)\mathfrak{I}(\mathcal{R}, w)$$And, as I mentioned above, it's clear how to generalize this to cover not just updating plans that cover two different times at which you receive evidence, but any finite number.

However, I think Douven would not be entirely moved by this. After all, while he is certainly interested in the long-run effects on inaccuracy of using one updating rule or another, he thinks that looking only to expected inaccuracy is a mistake. He thinks that we care about other features of updating rules. Indeed, he provides us with one, and uses computer simulations to show that, in the toy coin tossing case that we've been using, the explanationist account has that desirable feature to a greater degree than the Bayesian account.

For each possible bias value, we ran 1000 simulations of a sequence of 1000 tosses. As previously, the explanationist and the Bayesian updated their degrees of belief after each toss. We registered in how many of those 1000 simulations the explanationist incurred a lower penalty than the Bayesian at various reference points [100 tosses, 250, 500, 750, 1000], at which we calculated both Brier penalties and log score penalties. The outcomes [...] show that, on either measure of inaccuracy, IBE is most often the winner—it incurs the lowest penalty -- at each reference point. Hence, at least in the present kind of context, IBE seems a better choice than Bayes' rule. (page 439)
How can we square this with the Greaves and Wallace result? Well, as Douven goes on to explain: "[the explanationist rule] in general achieves greater accuracy than [the Bayesian], even if typically not much greater accuracy"; but "[the Bayesian rule] is less likely than [explanationist rule] to ever make one vastly inaccurate, even though the former typically makes one somewhat more inaccurate than the latter." So the explanationist is most often more accurate, but when it is more accurate, it's only a little more, while when it is less accurate, it's a lot less. So, in expectation, the Bayesian rule wins. Douven then argues that you might be more interested in being more likely to be more accurate, rather than being expectedly more accurate.

Perhaps. But in any case there's another accuracy argument for the Bayesian way of updating that doesn't assume that expected inaccuracy is the thing you want to minimize. This is an argument that Ray Briggs and I gave a couple of years ago. I'll illustrate it in the same setting we used above, where we have prior $P$, at $t_1$ we'll learn $X$ or $\overline{X}$, and at $t_2$ we'll learn $XY$, $X\overline{Y}$, $\overline{X}Y$, or $\overline{X} \overline{Y}$. And we measure the inaccuracy of an updating rule $\mathcal{R} = (P_X, P_{\overline{X}}, P_{XY}, P_{X\overline{Y}}, P_{\overline{X}Y}, P_{\overline{X}\overline{Y}})$ for this as follows:
\mathfrak{I}(\mathcal{R}, XY) &  = & \alpha_1 \mathfrak{I}(P_X, XY) + \alpha_2\mathfrak{I}(P_{XY}, XY) \\
\mathfrak{I}(\mathcal{R}, X\overline{Y}) &  = & \alpha_1 \mathfrak{I}(P_X, \overline{X}Y) + \alpha_2\mathfrak{I}(P_{\overline{X}Y}, X\overline{Y}) \\
\mathfrak{I}(\mathcal{R}, \overline{X}Y) &  = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + \alpha_2\mathfrak{I}(P_{X\overline{Y}}, \overline{X}Y) \\
\mathfrak{I}(\mathcal{R}, \overline{X}\overline{Y}) &  = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y}) + \alpha_2\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})
\end{eqnarray*}$$Then the following is true: if the part of my plan that kicks in at $t_1$ doesn't demand I condition my prior on my evidence at $t_1$, or if the part of my plan that kicks in at $t_2$ doesn't demand I condition my $t_1$ credence function on my evidence at $t_2$, then, for any $0 < \beta < 1$, there is an alternative prior $P^\star$ and its associated Bayesian updating rule $\mathcal{R}^\star$, such that, for all worlds $w$,$$\beta\mathfrak{I}(P^\star, w) + (1-\beta)\mathfrak{I}(\mathcal{R}^\star, w) < \beta \mathfrak{I}(P, w) + (1-\beta)\mathfrak{I}(\mathcal{R}, w)$$And, again, this result generalizes to cases that include any number of times at which we receive new evidence, and in which, at each time, the set of propositions we might receive as evidence forms a partition. So it certainly covers the case of the coin of unknown bias that we've been using throughout. So, if you plan to update in some way other than by Bayesian conditionalization starting with your prior, there is an alternative prior and plan that, taken together, is guaranteed to have greater accuracy than yours; that is, they will have greater total accuracy than yours however the world turns out.

How do we square this with Douven's simulation results? The key is that this dominance result includes the prior in it. It does not say that, if $\mathcal{R}$ requires you not to condition $P$ on your evidence at any point, then a rule that does require that is guaranteed to be better. It says that if $\mathcal{R}$ requires you not to condition $P$ on your evidence at any point, then there is an alternative prior $P^\star$ such that it, together with a rule that requires you to condition it on your evidence, are better than $P$ and $\mathcal{R}$ for sure. Douven's results compare the performance of conditioning on $P$ and performing the explanationist update on it. This shows that while conditioning might not always give a better result than the explanationist, there is an alternative prior such that conditioning on it is guaranteed to be better than retaining the original prior and performing the explanationist rule. And that, I think, is the reason we should prefer conditioning on our evidence to giving the little explanationist boosts that Douven suggests. If we update by conditioning, our prior and update rule, taken together, are never accuracy dominated; it we update using Douven's explanationist rule, our prior and update rule, taken together, are accuracy dominated.

Before wrapping up, it's worth mentioning that there's a little wrinkle to iron out. It might be that, while the original prior and the posteriors it generates at the various times all satisfy the Principal Principle, the dominating prior and updating rule don't. While being dominated is clearly bad, you might think that being dominated by something that is itself irrational -- because it violates the Principal Principle, or for other reasons -- isn't so bad. But in fact we can tweak things to avoid this situation. The following is true: if the part of my plan that kicks in at $t_1$ doesn't demand I condition my prior on my evidence at $t_1$, or if the part of my plan that kicks in at $t_2$ doesn't demand I condition my $t_1$ credence function on my evidence at $t_2$, then, for any $0 < \beta < 1$, there is an alternative prior $P^\star$ and its associated Bayesian updating rule $\mathcal{R}^\star$, such that, $P^\star$ obeys the Principal Principle and, for all possible objective chance functions $ch$,

$\beta\sum_{w} ch(w) \mathfrak{I}(P^\star, w) + (1-\beta)\sum_{w} ch(w) \mathfrak{I}(\mathcal{R}^\star, w) < $

$\beta \sum_{w} ch(w) \mathfrak{I}(P, w) + (1-\beta)\sum_{w} ch(w) \mathfrak{I}(\mathcal{R}, w)$

So I'm inclined to think that Douven's critique of the Dutch Book argument against the explanationist updating rule hits the mark; and I can see why he thinks the expected accuracy argument against it is also less than watertight; but I think the accuracy dominance argument against it is stronger. We shouldn't use that updating rule, with its extra boost for explanatory hypotheses, because if we do so, there will be an alternative prior such that applying the Bayesian updating rule to that prior is guaranteed to be more accurate than applying the explanationist rule to our actual prior.

Tuesday, 11 August 2020

The only symmetric inaccuracy measure is the Brier score

If you'd like a PDF of this post, see here. 

[UPDATE 1: I should have made this clear in the original post. The Normality condition makes the proof go through more easily, but it isn't really necessary. Suppose we simply assume instead that $$\mathfrak{I}(w^i, j)= \left \{ \begin{array}{ll} b & \mbox{if } i \neq j \\ a & \mbox{if } i = j \end{array} \right.$$Then we can show that, if $\mathfrak{I}$ is symmetric then, for any probabilistic credence function $p$ and any world $w_i$,$$\mathfrak{I}(p, i) = (b-a)\frac{1}{2} \left (1 - 2p_i + \sum_j p^2_j \right ) + a$$End Update 1.]

[UPDATE 2: There's something puzzling about the result below. Suppose $\mathcal{W} = \{w_1, \ldots, w_n\}$ is the set of possible worlds. And suppose $\mathcal{F}$ is the full algebra of propositions built out of those worlds. That is, $\mathcal{F}$ is the set of subsets of $\mathcal{W}$. Then there are two versions of the Brier score over a probabilistic credence function $p$ defined on $\mathcal{F}$. The first considers only the credences that $p$ assigns to the possible worlds. Thus,$$\mathfrak{B}(p, i) = \sum^n_{j=1} (w^i_j - p_j)^2 = 1 - 2p_i + \sum_j p^2_j$$But there is another that considers also the credences that $p$ assigns to the other propositions in $\mathcal{F}$. Thus,$$\mathfrak{B}^\star(p, i) = \sum_{X \in \mathcal{F}} (w_i(X) - p(X))^2$$Now, at first sight, these look related, but not very closely. However, notice that both are symmetric. Thus, by the extension of Selten's theorem below (plus update 1 above), if $\mathfrak{I}(w^i, j) = b$ for $i \neq j$ and 0 for $i = j$, then $\mathfrak{I}(p, i) = \frac{1}{2}b\mathfrak{B}(p, i)$. Now, $\mathfrak{B}(w^i, j) = 2$ for $i \neq j$, and $\mathfrak{B}(w^i, j) = 0$ for $i = j$, and so this checks out. But what about $\mathfrak{B}^\star$? Well, according to our extension of Selten's theorem, since $\mathfrak{B}^\star$ is symmetric, we can see that it is just a multiple of $\mathfrak{B}$, the factor determined by $\mathfrak{B}^\star(w^i, j)$. So what is this number? Well, it turns out that, if $i \neq j$, then$$\mathfrak{B}^\star(w^i, j) = 2\sum^{n-2}_{k=0} {n-2 \choose k}$$Thus, it follows that$$\mathfrak{B}^\star(p, i) = \sum^{n-2}_{k=0} {n-2 \choose k}\mathfrak{B}(p, i)$$And you can verify this by other means as well. This is quite a nice result independently of all this stuff about symmetry. After all, there doesn't seem any particular reason to favour $\mathfrak{B}$ over $\mathfrak{B}^\star$ or vice versa. This result shows that using one for the sorts of purposes we have in accuracy-first epistemology won't give different results from using the other. End update 2.]

So, as is probably obvious, I've been trying recently to find out what things look like in accuracy-first epistemology if you drop the assumption that the inaccuracy of a whole credal state is the sum of the inaccuracies of the individual credences that it comprises --- this assumption is sometimes called Additivity or Separability. In this post, I want to think about a result concerning additive inaccuracy measures that intrigued me in the past and on the basis of which I tried to mount an argument in favour of the Brier score. The result dates back to Reinhard Selten, the German economist who shared the 1994 Nobel prize with John Harsanyi and John Nash for his contributions to game theory. In this post, I'll show that the result goes through even if we don't assume additivity.

Suppose $\mathfrak{I}$ is an inaccuracy measure. Thus, if $c$ is a credence function defined on the full algebra built over the possible worlds $w_1, \ldots, w_n$, then $\mathfrak{I}(c, i)$ measures the inaccuracy of $c$ at world $w_i$. Then define the following function on pairs of probabilistic credence functions:$$\mathfrak{D}_\mathfrak{I}(p, q) = \sum_i p_i \mathfrak{I}(q, i) - \sum_i p_i\mathfrak{I}(p, i)$$$\mathfrak{D}_\mathfrak{I}$ measures how much more inaccurate $p$ expects $q$ to be than it expects itself to be; equivalently, how much more accurate $p$ expects itself to be than it expects $q$ to be. Now, if $\mathfrak{I}$ is strictly proper, $\mathfrak{D}_\mathfrak{I}$ is positive whenever $p$ and $q$ are different, and zero when they are the same, so in that case $\mathfrak{D}_\mathfrak{I}$ is a divergence. But we won't be assuming that here -- rather remarkably, we don't need to.

Now, it's not hard to see that $\mathfrak{D}_\mathfrak{I}$ is not necessarily symmetric. For instance, consider the log score$$\mathfrak{L}(p, i) = -\log p_i$$Then$$\mathfrak{D}_\mathfrak{L}(p, q) = p_i \log \frac{p_i}{q_i}$$This is the so-called Kullback-Leibler divergence and it is not symmetric. Nonetheless, it's equally easy to see that it is at least possible for $\mathfrak{D}_\mathfrak{I}$ to be symmetric. For instance, consider the Brier score$$\mathfrak{B}(p, i) = 1-2p_i + \sum_j p^2_j$$Then$$\mathfrak{D}_\mathfrak{B}(p, q) = \sum_i (p_i - q_i)^2$$So the natural question arises: how many inaccuracy measures are symmetric in this way? That is, how many generate symmetric divergences in the way that the Brier score does? It turns out: none, except the Brier score.

First, a quick bit of notation: Given a possible world $w_i$, we write $w^i$ for the probabilistic credence function that assigns credence 1 to world $w_i$ and 0 to any world $w_j$ with $j \neq i$. 

And two definitions:

Definition (Normal inaccuracy measure) An inaccuracy measure $\mathfrak{I}$ is normal if $$\mathfrak{I}(w^i, j) = \left \{ \begin{array}{ll} 1 & \mbox{if } i \neq j \\ 0 & \mbox{if } i = j \end{array} \right.$$

Definition (Symmetric inaccuracy measure) An inaccuracy measure is symmetric if $$\mathfrak{D}_\mathfrak{I}(p, q) = \mathfrak{D}_\mathfrak{I}(q, p)$$for all probabilistic credence functions $p$ and $q$.

Thus,  $\mathfrak{I}$ is symmetric if, for any probability functions $p$ and $q$, the loss of accuracy that $p$ expects to suffer by moving to $q$ is the same as the loss of accuracy that $q$ expects to suffer by moving to $p$.

Theorem The only normal and symmetric inaccuracy measure agrees with the Brier score for probabilistic credence functions.

Proof. (This just adapts Selten's proof in exactly the way you'd expect.) Suppose $\mathfrak{D}_\mathfrak{I}(p, q) = \mathfrak{D}_\mathfrak{I}(q, p)$ for all probabilistic $p$, $q$. Then, in particular, for any world $w_i$ and any probabilistic $p$,$$\sum_j w^i_j \mathfrak{I}(p, j) - \sum_j w^i_j \mathfrak{I}(w^i, j) = \sum_j p_j \mathfrak{I}(w^i, j) -\sum_j p_j \mathfrak{I}(p, j)$$So,$$\mathfrak{I}(p, i) = (1-p_i) - \sum_j p_j \mathfrak{I}(p, j)$$So,$$\sum_j p_j \mathfrak{I}(p, j) = 1 - \sum_j p^2_j- \sum_j p_j \mathfrak{I}(p, j)$$So,$$\sum_j p_j \mathfrak{I}(p, j) = \frac{1}{2}[1 - \sum_j p^2_j]$$So,$$\mathfrak{I}(p, i) = 1-p_i -\frac{1}{2}[1 - \sum_j p^2_j] = \frac{1}{2} \left (1 - 2p_i + \sum_j p^2_j \right )$$as required. $\Box$

There are a number of notable features of this result:

First, the theorem does not assume that the inaccuracy measure is strictly proper, but since the Brier score is strictly proper, it follows that symmetry entails strict propriety.

Second, the theorem does not assume additivity, but since the Brier score is additive, it follows that symmetry entails additivity.

Monday, 10 August 2020

The Accuracy Dominance Argument for Conditionalization without the Additivity assumption

 For a PDF of this post, see here.

Last week, I explained how you can give an accuracy dominance argument for Probabilism without assuming that your inaccuracy measures are additive -- that is, without assuming that the inaccuracy of a whole credence function is obtained by adding up the inaccuracy of all the individual credences that it assigns. The mathematical result behind that also allows us to give my chance dominance argument for the Principal Principle without assuming additivity, and ditto for my accuracy-based argument for linear pooling. In this post, I turn to another Bayesian norm, namely, Conditionalization. The first accuracy argument for this was given by Hilary Greaves and David Wallace, building on ideas developed by Graham Oddie. It was an expected accuracy argument, and it didn't assume additivity. More recently, Ray Briggs and I offered an accuracy dominance argument for the norm, and we did assume additivity. It's this latter argument I'd like to consider here. I'd like to show how it goes through even without assuming additivity. And indeed I'd like to generalise it at the same time. The generalisation is inspired by a recent paper by Michael Rescorla. In it, Rescorla notes that all the existing arguments for Conditionalization assume that, when your evidence comes in the form of a proposition learned with certainty, that proposition must be true. He then offers a Dutch Book argument for Conditionalization that doesn't make this assumption, and he issues a challenge for other sorts of arguments to do the same. Here, I take up that challenge. To do so, I will offer an argument for what I call the Weak Reflection Principle.

Weak Reflection Principle (WRP) Your current credence function should be a linear combination of the possible future credence functions that you endorse.

A lot might happen between now and tomorrow. I might see new sights, think new thoughts; I might forget things I know today, take mind-altering drugs that enhance or impair my thinking; and so on. So perhaps there is a set of credence functions I think I might have tomorrow. Some of those I'll endorse -- perhaps those that I'd get if I saw certain new things, or enhanced my cognition in various ways. And some of them I'll disavow -- perhaps those that I'd get if I forget certain things, or impaired my cognition. WRP asks you to separate out the wheat from the chaff, and once you've identified the ones you endorse, it tells you that your current credence function should lie within the span of those future ones; it should be in their convex hull; it should be a weighted sum or convex combination of them.

One nice thing about WRP is that it gives back Conditionalization in certain cases. Suppose $c^0$ is my current credence function. Suppose I know that between now and tomorrow I'll learn exactly one member of the partition $E_1, \ldots, E_m$ with certainty --- this is the situation that Greaves and Wallace envisage. And suppose I endorse credence function $c^1$ as a response to learning $E_1$, $c^2$ as a response to learning $E_2$, and so on. Then, if I satisfy WRP, and if $c^k(E_k) = 1$, since I did after all learn it with certainty, then it follows that, whenever $c^0(E_k) > 0$, $c^k(X) = c^0(X | E_k)$, which is exactly what Conditionalization asks of you. And notice that, at no point did we assume that if I learn $E_k$, then $E_k$ is true. So we've answered Rescorla's challenge if we can establish WRP.

To do that, we need Theorem 1 below. And to get there, we need to go via Lemmas 1 and 2. Just to remind ourselves of the framework:

  • $w_1, \ldots, w_n$ are the possible worlds;
  • credence functions are defined on the full algebra built on top of these possible worlds;
  • given a credence function $c$, we write $c_i$ for the credence that $c$ assigns to $w_i$.  

Lemma 1 If $c^0$ is not in the convex combination of $c^1, \ldots, c^m$, then $(c^0, c^1, \ldots, c^m)$ is not in the convex hull of $\mathcal{X}$, where$$\mathcal{X} := \{(w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m) : 1 \leq i \leq n\ \&\ 1 \leq k \leq m\}$$

Definition 1 Suppose $\mathfrak{I}$ is a continuous strictly proper inaccuracy measure. Then let$$\mathfrak{D}_\mathfrak{I}((p^0, p^1, \ldots, p^m), (c^0, c^1, \ldots, c^m)) = \sum^m_{k=0} \left ( \sum^n_{i=1} p^k_i \mathfrak{I}(c^k, i) - \sum^n_{i=1} p^k_i \mathfrak{I}(p^k, i) \right )$$

Lemma 2 Suppose $\mathfrak{I}$ is a continuous strictly proper inaccuracy measure. Suppose $\mathcal{X}$ is a closed convex set of $(n+1)$-tuples of probabilistic credence functions. And suppose $(c^0, c^1, \ldots, c^n)$ is not in $\mathcal{X}$. Then there is $(q^0, q^1, \ldots, q^m)$ in $\mathcal{X}$ such that 

(i) for all $(p^0, p^1, \ldots, p^m) \neq (q^0, q^1, \ldots, q^m)$ in $\mathcal{X}$,

$\mathfrak{D}_\mathfrak{I}((q^0, q^1, \ldots, q^m), (c^0, c^1, \ldots, c^m)) <$

$\mathfrak{D}_\mathfrak{I}((p^0, p^1, \ldots, p^m), (c^0, c^1, \ldots, c^m))$;

(ii) for all $(p^0, p^1, \ldots, p^n)$ in $\mathcal{X}$,

$\mathfrak{D}_\mathfrak{I}((p^0, p^1, \ldots, p^m), (c^0, c^1, \ldots, c^m)) \geq$

$\mathfrak{D}_\mathfrak{I}((p^0, p^1, \ldots, p^m), (q^0, q^1, \ldots, q^m))  +$

$\mathfrak{D}_\mathfrak{I}((q^0, q^1, \ldots, q^m), (c^0, c^1, \ldots, c^m))$.

Theorem 1 Suppose each $c^0, c^1, \ldots, c^n$ is a probabilistic credence function. If $c^0$ is not in the convex hull of $c^1, \ldots, c^m$, then there are probabilistic credence functions $q^0, q^1, \ldots, q^m$ such that for all worlds $w_i$ and $1 \leq k \leq m$,$$\mathfrak{I}(q^0, i) + \mathfrak{I}(q^k, i) < \mathfrak{I}(c^0, i) + \mathfrak{I}(c^k, i)$$ 

Let's keep the proofs on ice for a moment. What does this show exactly? It says that, if you don't do as WRP demands, there is some alternative current credence function and, for each of the possible future credence functions in the set you endorse, there is an alternative such that having your current credence function now and then one of your endorsed future credence functions later is guaranteed to make you less accurate overall than having the alternative to your current credence function now and the alternative to that endorsed future credence function later. This, I claim, establishes WRP.

Now for the proofs.

Proof of Lemma 1. We prove the contrapositive. Suppose $(c^0, c^1, \ldots, c^m)$ is in $\mathcal{X}$. Then there are $0 \leq \lambda_{i, k} \leq 1$ such that $\sum^n_{i=1}\sum^m_{k=1}  \lambda_{i, k} = 1$ and$$(c^0, c^1, \ldots, c^m) = \sum^n_{i=1} \sum^m_{k=1}  \lambda_{i, k} (w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m)$$Thus,$$c^0 = \sum^n_{i=1}\sum^m_{k=1}  \lambda_{i,k} w^i$$
and$$c^k = \sum^n_{i=1} \lambda_{i, k} w^i +  \sum^n_{i=1} \sum_{l \neq k} \lambda_{i, l} c^k$$So$$(\sum^n_{i=1} \lambda_{i, k}) c^k =  \sum^n_{i=1} \lambda_{i, k} w^i$$So let $\lambda_k =  \sum^n_{i=1} \lambda_{i, k}$. Then, for $1 \leq k \leq m$,$$\lambda_k c_k = \sum^n_{i=1} \lambda_{i, k} w^i$$And thus$$\sum^m_{k=1} \lambda^k c^k = \sum^m_{k=1} \sum^n_{i=1} \lambda_{i, k} w^i = c^0$$as required. $\Box$

Proof of Lemma 2. This proceeds exactly like the corresponding theorem from the previous blogpost. $\Box$

Proof of Theorem 1. So, if $c^0$ is not in the convex hull of $c^1, \ldots, c^m$, there is $(q^0, q^1, \ldots, q^m)$ such that, for all $(p^0, p^1, \ldots, p^m)$ in $\mathcal{X}$,$$\mathfrak{D}((p^0, p^1, \ldots, p^m), (q^0, q^1, \ldots, q^m)) < \mathfrak{D}((p^0, p^1, \ldots, p^m), (c^0, c^1, \ldots, c^m))$$In particular, for any world $w_i$ and $1 \leq k \leq m$,

$\mathfrak{D}((w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m), (q^0, q^1, \ldots, q^m)) <$

$\mathfrak{D}((w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m), (c^0, c^1, \ldots, c^m))$

& & \mathfrak{I}(q^0, i) + \mathfrak{I}(q^k, i) \\
& = & \mathfrak{D}(w^i, q^0) + \mathfrak{D}(w^i, q^k) \\
& \leq & \mathfrak{D}((w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m), (q^0, q^1, \ldots, q^m)) \\
& < & \mathfrak{D}((w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m), (c^0, c^1, \ldots, c^m)) \\
& = &  \mathfrak{D}(w^i, c^0) + \mathfrak{D}(w^i, c^k) \\
& = & \mathfrak{I}(c^0, i) + \mathfrak{I}(c^k, i)
\end{eqnarray*}$$as required.


Friday, 7 August 2020

The Accuracy Dominance Argument for Probabilism without the Additivity assumption

For a PDF of this post, see here.

One of the central arguments in accuracy-first epistemology -- the one that gets the project off the ground, I think -- is the accuracy-dominance argument for Probabilism. This started life in a more pragmatic guise in de Finetti's proof that, if your credences are not probabilistic, there are alternatives that would lose less than yours would if they were penalised using the Brier score, which levies a price of $(1-x)^2$ on every credence $x$ in a truth and $x^2$ on every credence $x$ in a falsehood. This was then adapted to an accuracy-based argument by Roger Rosencrantz, where he interpreted the Brier score as a measure of inaccuracy, not a penalty score. Interpreted thus, de Finetti's result says that any non-probabilistic credences are accuracy-dominated by some probabilistic credences. Jim Joyce then noted that this argument only establishes Probabilism if you have a further argument that inaccuracy should be measured by the Brier score. He thought there was no particular reason to think that's right, so he greatly generalized de Finetti's result to show that, relative to a much wider range of inaccuracy measures, all non-probabilistic credences are accuracy dominated. One problem with this, which Al Hájek pointed out, was that he didn't give a converse argument -- that is, he didn't show that, for each of his inaccuracy measures, each probabilistic credence function is not accuracy dominated. Joel Predd and his Princeton collaborators then addressed this concern and proved a very general result, namely, that for any additive, continuous, and strictly proper inaccuracy measure, any non-probabilistic credences are accuracy-dominated, while no probabilistic credences are.

That brings us to this blogpost. Additivity is a controversial claim. It says that the inaccuracy of a credence function is the (possibly weighted) sum of the inaccuracies of the credences it assigns. So the question arises: can we do without additivity? In this post, I'll give a quick proof of the accuracy-dominance argument that doesn't assume anything about the inaccuracy measures other than that they are continuous and strictly proper. Anyone familiar with the Predd, et al. paper will see that the proof strategy draws very heavily on theirs. But it bypasses out the construction of the Bregman divergence that corresponds to the strictly proper inaccuracy measure. For that, you'll have to wait for Jason Konek's forthcoming work...

  • $\mathcal{F}$ is a set of propositions;
  • $\mathcal{W} = \{w_1, \ldots, w_n\}$ be the set of possible worlds relative to $\mathcal{F}$;
  • $\mathcal{C}$ be the set of credence functions on $\mathcal{F}$;
  • $\mathcal{P}$ be the set of probability functions on $\mathcal{F}$. So, by de Finetti's theorem, $\mathcal{P} = \{v_w : w \in \mathcal{W}\}^+$. If $p$ is in $\mathcal{P}$, we write $p_i$ for $p(w_i)$.
Theorem Suppose $\mathfrak{I}$ is a strictly proper inaccuracy measure on the credence functions in $\mathcal{F}$. Then if $c$ is not in $\mathcal{P}$, there is $c^\star$ in $\mathcal{P}$ such that, for all $w_i$ in $\mathcal{W}$,
\mathfrak{I}(c^\star, w_i) < \mathfrak{I}(c, w_i)

Proof. We begin by defining a divergence $\mathfrak{D} : \mathcal{P} \times \mathcal{C} \rightarrow [0, \infty]$ that takes a probability function $p$ and a credence function $c$ and measures the divergence from the former to the latter:
\mathfrak{D}(p, c) = \sum_i p_i  \mathfrak{I}(c, w_i) - \sum_i p_i \mathfrak{I}(p, w_i)
Three quick points about $\mathfrak{D}$.

(1) $\mathfrak{D}$ is a divergence. Since $\mathfrak{I}$ is strictly proper, $\mathfrak{D}(p, c) \geq 0$ with equality iff $c = p$.

(2) $\mathfrak{D}(v_{w_i}, c) = \mathfrak{I}(c, w_i)$, for all $w_i$ in $\mathcal{W}$.

(3) $\mathfrak{D}$ is strictly convex in its first argument.  Suppose $p$ and $q$ are in $\mathcal{P}$, and suppose $0 < \lambda < 1$. Then let $r = \lambda p + \lambda q$. Then, since $\sum_i p_i\mathfrak{I}(c, w_i)$ is uniquely minimized, as a function of $c$, at $c = p$, and $\sum_i q_i\mathfrak{I}(c, w_i)$ is uniquely minimized, as a function of $c$, at $c = q$, we have$$\begin{eqnarray*}
\sum_i p_i \mathfrak{I}(c, w_i) & < & \sum_i p_i \mathfrak{I}(r, w_i) \\
\sum_i q_i \mathfrak{I}(c, w_i) & < & \sum_i q_i \mathfrak{I}(r, w_i)

 $\lambda [-\sum_i p_i \mathfrak{I}(p, w_i)] + (1-\lambda) [-\sum_i q_i \mathfrak{I}(q, w_i)] >$

$ \lambda [-\sum_i p_i \mathfrak{I}(r, w_i)] + (1-\lambda) [-\sum_i q_i \mathfrak{I}(r, w_i)] = $

$-\sum_i r_i \mathfrak{I}(r, w_i)$

Now, adding

$\lambda \sum_i p_i \mathfrak{I}(c, w_i) + (1-\lambda)\sum_i q_i\mathfrak{I}(c, w_i) =$

$\sum_i (\lambda p_i + (1-\lambda)q_i) \mathfrak{I}(c, w_i) = \sum_i r_i \mathfrak{I}(c, w_i)$

to both sides gives

$\lambda [\sum_i p_i \mathfrak{I}(c, w_i)-\sum_i p_i \mathfrak{I}(p, w_i)]+ $

$(1-\lambda) [\sum_i q_i\mathfrak{I}(c, w_i)-\sum_i q_i \mathfrak{I}(q, w_i)] > $

 $\sum_i r_i \mathfrak{I}(c, w_i)-\sum_i r_i \mathfrak{I}(r, w_i)$

That is,$$\lambda \mathfrak{D}(p, c) + (1-\lambda) \mathfrak{D}(q, c) > \mathfrak{D}(\lambda p + (1-\lambda)q, c)$$as required.

Now, suppose $c$ is not in $\mathcal{P}$. Then, since $\mathcal{P}$ is a closed convex set, there is a unique $c^\star$ in $\mathcal{P}$ that minimizes $\mathfrak{D}(x, c)$ as a function of $x$. Now, suppose $p$ is in $\mathcal{P}$. We wish to show that$$\mathfrak{D}(p, c) \geq \mathfrak{D}(p, c^\star) + \mathfrak{D}(c^\star, c)$$We can see that this holds iff$$\sum_i (p_i - c^\star_i) (\mathfrak{I}(c, w_i) - \mathfrak{I}(c^\star, w_i)) \geq 0$$After all,
& & \mathfrak{D}(p, c) - \mathfrak{D}(p, c^\star) - \mathfrak{D}(c^\star, c) \\
& = & [\sum_i p_i \mathfrak{I}(c, w_i) - \sum_i p_i \mathfrak{I}(p, w_i)] - \\
&& [\sum_i p_i \mathfrak{I}(c^\star, w_i) - \sum_i p_i \mathfrak{I}(p, w_i)] - \\
&& [\sum_i c^\star_i \mathfrak{I}(c, w_i) - \sum_i c^\star_i \mathfrak{I}(c^\star, w_i)] \\
& = & \sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(c^\star, w_i))
Now we prove this inequality. We begin by observing that, since $p$, $c^\star$ are in $\mathcal{P}$, since $\mathcal{P}$ is convex, and since $\mathfrak{D}(x, c)$ is minimized uniquely at $x = c^\star$, if $0 < \varepsilon < 1$, then$$\frac{1}{\varepsilon}[\mathfrak{D}(\varepsilon p + (1-\varepsilon) c^\star, c) - \mathfrak{D}(c^\star, c)] > 0$$Expanding that, we get

$\frac{1}{\varepsilon}[\sum_i (\varepsilon p_i + (1- \varepsilon) c^\star_i)\mathfrak{I}(c, w_i) -$

$\sum_i (\varepsilon p_i + (1-\varepsilon)c^\star_i)\mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i) - $

$\sum_i c^\star_i\mathfrak{I}(c, w_i) + \sum_i c^\star_i \mathfrak{I}(c^\star,  i)] > 0$\medskip


$\frac{1}{\varepsilon}[\sum_i ( c^\star_i + \varepsilon(p_i - c^\star_i))\mathfrak{I}(c, w_i) -$

$\sum_i ( c^\star_i + \varepsilon(p_i-c^\star_i))\mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i) - $

$\sum_i c^\star_i\mathfrak{I}(c, w_i) + \sum_i c^\star_i \mathfrak{I}(c^\star, w_i)] > 0 $\medskip


$\sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(\varepsilon p+ (1-\varepsilon) c^\star), w_i) +$

$ \frac{1}{\varepsilon}[\sum_i c^\star_i \mathfrak{I}(c^\star, w_i) - \sum_ic^\star_i \mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i)] > 0$\medskip

Now, since $\mathfrak{I}$ is strictly proper,
$$\frac{1}{\varepsilon}[\sum_i c^\star_i \mathfrak{I}(c^\star, w_i) - \sum_ic^\star_i \mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i)] < 0$$
So, for all $\varepsilon > 0$,$$\sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(\varepsilon p+ (1-\varepsilon) c^\star, w_i) > 0$$
So, since $\mathfrak{I}$ is continuous$$\sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(c^\star, w_i)) \geq 0$$which is what we wanted to show. So, by above,$$\mathfrak{D}(p,c) \geq \mathfrak{D}(p, c^\star) + \mathfrak{D}(c^\star, c) $$In particular, since each $w_i$ is in $\mathcal{P}$,$$\mathfrak{D}(v_{w_i}, c) \geq \mathfrak{D}(v_{w_i}, c^\star) + \mathfrak{D}(c^\star, c)$$But, since $c^\star$ is in $\mathcal{P}$ and $c$ is not, and since $\mathfrak{D}$ is a divergence, $\mathfrak{D}(c^\star, c) > 0$. So$$\mathfrak{I}(c, w_i) = \mathfrak{D}(v_{w_i}, c) > \mathfrak{D}(v_{w_i}, c^\star) = \mathfrak{I}(c^\star, w_i)$$as required. $\Box$