Igor has eleven coins in his pocket. The first has 0% chance of landing heads, the second 10% chance, the third 20%, and so on up to the tenth, which has 90% chance, and the eleventh, which has 100% chance. He picks one out without letting me know which, and he starts to toss it. After the first 10 tosses, it has landed tails 5 times. How confident should I be that the coin is fair? That is, how confident should I be that it is the sixth coin from Igor's pocket; the one with 50% chance of landing heads? According to the Bayesian, the answer is calculated as follows:$$P_E(H_5) = P(H_5 | E) = \frac{P(H_5)P(E | H_5)}{\sum^{10}_{i=0} P(H_i) P(E|H_i)}$$where

- $E$ is my evidence, which says that 5 out of 10 of the tosses landed heads,
- $P_E$ is my posterior credence function, the one I adopt upon learning the evidence $E$,
- $P$ is my prior,
- $H_i$ is the hypothesis that the coin has $\frac{i}{10}$ chance of landing heads,
- $P(H_0) = \ldots = P(H_{10}) = \frac{1}{11}$, since I know nothing about which coin Igor pulled from his pocket, and
- $P(E | H_i) = \left ( \frac{i}{10} \right )^5 \left (\frac{10-i}{10} \right )^5$, by the Principal Principle, and since each coin toss is independent of each other one.

So, upon learning that the coin landed heads five times out of ten, my posterior should be:$$P_E(H_5) = P(H_5 | E) = \frac{P(H_5)P(E | H_5)}{\sum^{10}_{i=0} P(H_i) P(E|H_i)} = \frac{\frac{1}{11} \left ( \frac{5}{10} \right )^5\left ( \frac{5}{10} \right )^5}{\sum^{10}_{i=0}\frac{1}{11} \left ( \frac{i}{10} \right )^5 \left (\frac{10-i}{10} \right )^5 } \approx 0.2707$$But some philosophers have suggested that this is too low. The Bayesian calculation takes into account how likely the hypothesis in question makes the evidence, as well as how likely I thought the hypothesis was in the first place, but it doesn't take into account that the hypothesis explains the evidence. We'll call these philosophers explanationists. Upon learning that the coin landed heads five times out of ten, the explanationist says, we should be most confident in $H_5$, the hypothesis that the coin is fair, and the Bayesian calculation does indeed give this. But we should be most confident in it in part because $H_5$ best explains the evidence, and the Bayesian calculation takes no account of this.
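The Bayesian figure can be checked numerically. The following is a minimal sketch using exact fractions; the function and variable names are mine, chosen only for illustration.

```python
from fractions import Fraction

# Chance of heads under H_0, ..., H_10.
biases = [Fraction(i, 10) for i in range(11)]
# Uniform prior over the eleven coins.
prior = [Fraction(1, 11)] * 11


def likelihood(bias, heads=5, tails=5):
    """Chance of one particular sequence with the given numbers of heads and tails."""
    return bias**heads * (1 - bias)**tails


def bayes_posterior(prior, biases, heads=5, tails=5):
    """Condition the prior on the observed sequence of tosses."""
    joint = [p * likelihood(b, heads, tails) for p, b in zip(prior, biases)]
    total = sum(joint)
    return [j / total for j in joint]


posterior = bayes_posterior(prior, biases)
print(round(float(posterior[5]), 4))  # 0.2707
```

Because the prior is uniform, the factor of $\frac{1}{11}$ cancels, and the posterior is just the normalized likelihood.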

To accommodate the explanationist's demand, Igor Douven proposes the following alternative updating rule:$$P_E(H_k) = \frac{P(H_k)P(E | H_k) + f(H_k, E)}{\sum^{10}_{i=0} (P(H_i) P(E|H_i) + f(H_i, E))}$$where $f$ gives a little boost to $H_k$ if it is the best explanation of $E$ and not if it isn't. Perhaps, for instance,

- $f(H_k, E) = 0.1$, if the frequency of heads among the coin tosses that $E$ reports is uniquely closest to the chance of heads according to $H_k$, namely, $\frac{k}{10}$,
- $f(H_k, E) = 0.05$, if the frequency of heads among the coin tosses that $E$ reports is equally closest to the chance of heads according to $H_k$ and another hypothesis,
- $f(H_k, E) = 0$, otherwise.

Thus, according to this:$$P_E(H_5) = \frac{P(H_5)P(E | H_5) + 0.1}{\left (\sum^{10}_{i=0} P(H_i) P(E|H_i) \right ) + 0.1} = \frac{\frac{1}{11} \left ( \frac{5}{10} \right )^5\left ( \frac{5}{10} \right )^5 + 0.1}{\sum^{10}_{i=0}\frac{1}{11} \left ( \frac{i}{10} \right )^5 \left (\frac{10-i}{10} \right )^5 + 0.1 } \approx 0.9976$$So, as required, $H_5$ certainly gets a boost in posterior probability because it best explains the run of heads and tails we observe.
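The explanationist figure can be checked in the same way. This is a sketch under the illustrative choice of $f$ given in the list above (0.1 for a unique best explanation, 0.05 for a tie, 0 otherwise); the names `bonus` and `explanationist_posterior` are hypothetical.

```python
from fractions import Fraction

biases = [Fraction(i, 10) for i in range(11)]
prior = [Fraction(1, 11)] * 11


def bonus(k, heads, tosses):
    """Illustrative f(H_k, E): 0.1 if uniquely closest, 0.05 on a tie, else 0."""
    freq = Fraction(heads, tosses)
    distances = [abs(b - freq) for b in biases]
    best = min(distances)
    if distances[k] != best:
        return Fraction(0)
    return Fraction(1, 10) if distances.count(best) == 1 else Fraction(1, 20)


def explanationist_posterior(prior, heads, tails):
    """Add the bonus to each P(H_k)P(E|H_k), then normalize."""
    tosses = heads + tails
    scores = [
        p * b**heads * (1 - b)**tails + bonus(k, heads, tosses)
        for k, (p, b) in enumerate(zip(prior, biases))
    ]
    total = sum(scores)
    return [s / total for s in scores]


posterior = explanationist_posterior(prior, heads=5, tails=5)
print(round(float(posterior[5]), 4))  # 0.9976
```

The bonus of 0.1 is roughly a thousand times larger than $P(H_5)P(E | H_5)$ here, which is why it swamps the Bayesian part of the calculation.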

Before we move on, it's worth noting a distinctive feature of this case. In many cases where we wish to apply something like abduction or inference to the best explanation, we might think that we can record our enthusiasm for good explanations in the priors. For instance, suppose I have two scientific theories, $T_1$ and $T_2$, both of which predict the evidence I've collected. So, they both make the evidence equally likely. But I want to assign higher probability to $T_1$ upon receipt of that evidence because it provides a better explanation for the evidence. Then I should simply encode this in my prior. That is, I should assign $P(T_1) > P(T_2)$. But that sort of move isn't open to us in Douven's example. The reason is that none of the chance hypotheses are better explanations in themselves: none is simpler or more general or what have you. But rather, for each, there is evidence we might obtain such that it is a better explanation of that evidence. But before we obtain the evidence, we don't know which will prove the better explanation of it, and so can't accommodate our explanationist instincts by giving that hypothesis a boost in our prior.

Now let's return to the example. There are well known objections to updating in the explanationist way Douven suggests. Most famously, van Fraassen pointed out that we have good reasons to comply with the Bayesian method of updating, and the explanationist method deviates quite dramatically from that (*Laws and Symmetry*, chapter 6). When he was writing, the most compelling argument was David Lewis' diachronic Dutch Book argument. If you plan to update as Douven suggests, by giving an extra-Bayesian boost to the hypothesis that best explains the evidence, then there is a series of bets you'll accept before you receive the evidence and another set you'll accept afterwards that, taken together, will lose you money for sure. Douven is unfazed. He first suggests that vulnerability to a Dutch Book does not impugn your epistemic rationality, but only your practical rationality. He notes Skyrms's claim that, in the case of synchronic Dutch Books, such vulnerability reveals an inconsistency in your assessment of the same bet presented in different ways, and therefore perhaps some epistemic failure, but notes that this cannot be extended to the diachronic case. In any case, he says, avoiding the machinations of malevolent bookies is only one practical concern that we have, and, let's be honest, not a very pressing one. What's more, he points out that, while updating in the Bayesian fashion serves one practical end, namely, making us immune to these sorts of diachronic sure losses, there are other practical ends it might not serve as well. For instance, he uses computer simulations to show that, if we update in his explanationist way, we'll tend to assign credence greater than 0.99 to the true hypothesis much more quickly than if we update in the Bayesian way. He admits that we'll also tend to assign credence greater than 0.99 to a false hypothesis much more quickly than if we use Bayesian updating.
But he responds, again with the results of a computer simulation: suppose we keep tossing the coin until one of the rules assigns more than 0.99 to a hypothesis; then award points to that rule if the hypothesis it becomes very confident in is true, and deduct them if it is false; then the explanationist updating rule will perform better on average than the Bayesian rule. So, if there is some practical decision that you will make only when your credence in a hypothesis exceeds 0.99 -- perhaps the choice is to administer a particular medical treatment, and you need to be very certain in your diagnosis before doing so -- then you will be better off on average updating as Douven suggests, rather than as the Bayesian requires.
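To give a feel for this kind of simulation, here is a rough sketch of how one might time each rule's approach to the 0.99 threshold. It is not Douven's code. I assume that both rules update after every single toss, that the bonus is computed from the frequency of heads among all tosses so far, and that $f$ takes the illustrative values listed earlier; the function names are hypothetical.

```python
import random

BIASES = [i / 10 for i in range(11)]


def bonuses(heads, tosses):
    """Illustrative f: 0.1 to a uniquely closest bias, 0.05 each on a tie."""
    # |k/10 - heads/tosses| is compared exactly via |k*tosses - 10*heads|.
    dist = [abs(k * tosses - 10 * heads) for k in range(11)]
    best = min(dist)
    share = 0.1 if dist.count(best) == 1 else 0.05
    return [share if d == best else 0.0 for d in dist]


def step(credence, landed_heads, use_bonus, heads, tosses):
    """Update on one toss; with use_bonus=False this is plain conditioning."""
    scores = [c * (b if landed_heads else 1 - b) for c, b in zip(credence, BIASES)]
    if use_bonus:
        scores = [s + f for s, f in zip(scores, bonuses(heads, tosses))]
    total = sum(scores)
    return [s / total for s in scores]


def first_confident(true_k, use_bonus, seed, threshold=0.99, max_tosses=5000):
    """Return (hypothesis, tosses) when some credence first exceeds the threshold."""
    rng = random.Random(seed)  # same seed gives the same toss sequence
    credence = [1 / 11] * 11
    heads = 0
    for n in range(1, max_tosses + 1):
        landed_heads = rng.random() < BIASES[true_k]
        heads += landed_heads
        credence = step(credence, landed_heads, use_bonus, heads, n)
        top = max(credence)
        if top > threshold:
            return credence.index(top), n
    return None, max_tosses


# Compare the two rules on the same simulated tosses of the fair coin.
print(first_confident(5, use_bonus=False, seed=0))
print(first_confident(5, use_bonus=True, seed=0))
```

Running both calls with the same seed feeds the two rules the same tosses, so any difference in the toss count or in the hypothesis reached is due to the bonus alone. Averaging over many seeds and biases would be needed before drawing any conclusion of the sort Douven reports.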

So much for the practical implications of updating in one way or another. I am more interested in the epistemic implications, and so is Douven. He notes that, since van Fraassen gave his argument, there is a new way of justifying the Bayesian demand to update by conditioning on your evidence. These are the accuracy arguments. While Douven largely works with the argument for conditioning that Hannes Leitgeb and I gave, I think the better version of that argument is due to Hilary Greaves and David Wallace. The idea is that, as usual, we measure the inaccuracy of a credence function using a strictly proper inaccuracy measure $\mathfrak{I}$. That is, if $P$ is a probabilistic credence function and $w$ is a possible world, then $\mathfrak{I}(P, w)$ gives the inaccuracy of $P$ at $w$. And, if $P$ is a probabilistic credence function, $P$ expects itself to be least inaccurate. That is, $\sum_w P(w) \mathfrak{I}(P, w) < \sum_w P(w) \mathfrak{I}(Q, w)$, for any credence function $Q \neq P$. Then Greaves and Wallace ask us to consider how you might plan to update your credence function in response to different pieces of evidence you might receive. Thus, suppose you know that the evidence you'll receive will be one of the following propositions, $E_1, \ldots, E_m$, which form a partition. This is the situation you're in if you know that you're about to witness 10 tosses of a coin, for instance, as in Douven's example: $E_1$ might be $HHHHHHHHHH$, $E_2$ might be $HHHHHHHHHT$, and so on. Then suppose you plan how you'll respond to each. If you learn $E_i$, you'll adopt $P_i$. Then we'll call this updating plan $\mathcal{R}$ and write it $(P_1, \ldots, P_m)$. Then we can calculate the expected inaccuracy of a given updating plan. Its inaccuracy at a world is the inaccuracy of the credence function it recommends in response to learning the element of the partition that is true at that world. 
That is, for world $w$ at which $E_i$ is true,$$\mathfrak{I}(\mathcal{R}, w) = \mathfrak{I}(P_i, w)$$And Greaves and Wallace show that the updating rule your prior expects to be best is the Bayesian one. That is, if there is an $E_i$ such that $P(E_i) > 0$ and $P_i(-) \neq P(-|E_i)$, then there is an alternative updating rule $\mathcal{R}^\star = (P^\star_1, \ldots, P^\star_m)$ such that$$\sum_w P(w) \mathfrak{I}(\mathcal{R}^\star, w) < \sum_w P(w) \mathfrak{I}(\mathcal{R}, w)$$So, in particular, your prior expects the Bayesian rule to be more accurate than Douven's rule.
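The comparison can be made concrete in a small version of the coin case. The sketch below makes several simplifying assumptions of my own: only four tosses, so that every sequence can be enumerated; the Brier score as the strictly proper inaccuracy measure; credences scored over the eleven chance hypotheses only; and the illustrative $f$ from earlier. It computes the prior's expectation of the inaccuracy of each plan.

```python
from itertools import product

BIASES = [i / 10 for i in range(11)]
PRIOR = [1 / 11] * 11
N_TOSSES = 4  # kept small so that all 2**4 sequences can be enumerated


def chance(k, seq):
    """Chance of the exact sequence seq (a tuple of 0s and 1s) under H_k."""
    heads = sum(seq)
    return BIASES[k] ** heads * (1 - BIASES[k]) ** (len(seq) - heads)


def plan(seq, bonus):
    """Posterior over H_0..H_10 after seeing seq; bonus=0.0 is conditioning."""
    heads, n = sum(seq), len(seq)
    dist = [abs(k * n - 10 * heads) for k in range(11)]
    best = min(dist)
    ties = dist.count(best)  # 1 or 2, so the boost is bonus or bonus / 2
    scores = [
        PRIOR[k] * chance(k, seq) + (bonus / ties if dist[k] == best else 0.0)
        for k in range(11)
    ]
    total = sum(scores)
    return [s / total for s in scores]


def brier(credence, true_k):
    """Brier inaccuracy of a credence function when H_true_k is true."""
    return sum((float(k == true_k) - c) ** 2 for k, c in enumerate(credence))


def expected_inaccuracy(bonus):
    """Prior expectation of the plan's inaccuracy over all (coin, sequence) worlds."""
    total = 0.0
    for seq in product((0, 1), repeat=N_TOSSES):
        posterior = plan(seq, bonus)
        for k in range(11):
            total += PRIOR[k] * chance(k, seq) * brier(posterior, k)
    return total


print(expected_inaccuracy(0.0))  # conditioning
print(expected_inaccuracy(0.1))  # explanationist boost
```

The first number printed is smaller than the second, as the Greaves and Wallace result says it must be for any plan that departs from conditioning.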

In response to this, Douven points out that there are many ways in which we might value the accuracy of our updating plans. For instance, the Greaves and Wallace argument considers only your accuracy at a single later point in time, after you've received a single piece of evidence and updated only on it. But, Douven argues, we might be interested not in the one-off inaccuracy of a single application of an updating rule, but rather in its inaccuracy in the long run. And we might be interested in different features of the long-run total inaccuracy of using that rule: we might be interested in just adding up all of the inaccuracies of the various credence functions you obtain from multiple applications of the rule; or we might be less interested in the inaccuracies of the interim credence functions and more interested in the inaccuracy of the final credence function you obtain after multiple updates. And, Douven claims, the accuracy arguments do not tell us anything about which performs better out of the Bayesian and explanationist approaches when viewed in these different ways.

However, that's not quite right. It turns out that we can, in fact, adapt the Greaves and Wallace argument to cover these cases. To see how, it's probably best to illustrate it with the simplest possible case, but it should be obvious how to scale up the idea. So suppose:

- my credences are defined over four worlds, $XY$, $X\overline{Y}$, $\overline{X}Y$, and $\overline{X}\overline{Y}$;
- my prior at $t_0$ is $P$;
- at $t_1$, I'll learn either $X$ or its negation $\overline{X}$, and I'll respond with $P_X$ or $P_{\overline{X}}$, respectively;
- at $t_2$, I'll learn $XY$, $X\overline{Y}$, $\overline{X}Y$, or $\overline{X} \overline{Y}$, and I'll respond with $P_{XY}$, $P_{X\overline{Y}}$, $P_{\overline{X}Y}$, or $P_{\overline{X}\overline{Y}}$, respectively.

For instance, I might know that a coin is going to be tossed twice, once just before $t_1$ and once just before $t_2$. So $X$ is the proposition that it lands heads on the first toss, i.e., $X = \{HH, HT\}$, while $\overline{X}$ is the proposition it lands tails on the first toss $\overline{X} = \{TH, TT\}$. And then $Y$ is the proposition it lands heads on the second toss. So $XY = \{HH\}$, $X\overline{Y} = \{HT\}$, and so on.

Now, taken together, $P_X$, $P_{\overline{X}}$, $P_{XY}$, $P_{X\overline{Y}}$, $P_{\overline{X}Y}$, and $P_{\overline{X}\overline{Y}}$ constitute my updating plan---let's denote that $\mathcal{R}$. Now, how might we measure the inaccuracy of this plan $\mathcal{R}$? Well, we want to assign a weight to the inaccuracy of the credence function it demands after the first update -- let's call that $\alpha_1$; and we want a weight for the result of the second update -- let's call that $\alpha_2$. So, for instance, if I'm interested in the total inaccuracy obtained by following this rule, and each time is just as important as each other time, I just set $\alpha_1 = \alpha_2$; but if I care much more about my final inaccuracy, then I let $\alpha_1 \ll \alpha_2$. Then the inaccuracy of my updating rule is$$\begin{eqnarray*}
\mathfrak{I}(\mathcal{R}, XY) & = & \alpha_1 \mathfrak{I}(P_X, XY) + \alpha_2\mathfrak{I}(P_{XY}, XY) \\
\mathfrak{I}(\mathcal{R}, X\overline{Y}) & = & \alpha_1 \mathfrak{I}(P_X, X\overline{Y}) + \alpha_2\mathfrak{I}(P_{X\overline{Y}}, X\overline{Y}) \\
\mathfrak{I}(\mathcal{R}, \overline{X}Y) & = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + \alpha_2\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y) \\
\mathfrak{I}(\mathcal{R}, \overline{X}\overline{Y}) & = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y}) + \alpha_2\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})
\end{eqnarray*}$$Thus, the expected inaccuracy of $\mathcal{R}$ from the point of view of my prior $P$ is:

$P(XY)\mathfrak{I}(\mathcal{R}, XY) + P(X\overline{Y})\mathfrak{I}(\mathcal{R}, X\overline{Y}) + P(\overline{X}Y)\mathfrak{I}(\mathcal{R}, \overline{X}Y) + P(\overline{X} \overline{Y})\mathfrak{I}(\mathcal{R}, \overline{X}\overline{Y}) = $

$P(XY)[\alpha_1 \mathfrak{I}(P_X, XY) + \alpha_2\mathfrak{I}(P_{XY}, XY)] + $

$P(X\overline{Y})[\alpha_1 \mathfrak{I}(P_X, X\overline{Y}) + \alpha_2\mathfrak{I}(P_{X\overline{Y}}, X\overline{Y})] + $

$P(\overline{X}Y)[\alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + \alpha_2\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y)] + $

$P(\overline{X}\overline{Y})[\alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y}) + \alpha_2\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})]$

But it's easy to see that this is equal to:

$\alpha_1[P(XY)\mathfrak{I}(P_X, XY) + P(X\overline{Y})\mathfrak{I}(P_X, X\overline{Y}) + $

$P(\overline{X}Y)\mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + P(\overline{X}\overline{Y})\mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y})] + $

$\alpha_2[P(XY)\mathfrak{I}(P_{XY}, XY) + P(X\overline{Y})\mathfrak{I}(P_{X\overline{Y}}, X\overline{Y}) + $

$P(\overline{X}Y)\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y) + P(\overline{X}\overline{Y})\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})]$

Now, this is the weighted sum of the expected inaccuracies of the two parts of my updating plan taken separately; the part that kicks in at $t_1$, and the part that kicks in at $t_2$. And, thanks to Greaves and Wallace's result, we know that each of those expected inaccuracies is minimized by the rule that demands you condition on your evidence. Now, we also know that conditioning $P$ on $XY$ is the same as conditioning $P(-|X)$ on $XY$, and so on. So a rule that tells you, at $t_2$, to update your $t_0$ credence function on your total evidence at $t_2$ is also one that tells you, at $t_2$, to update your $t_1$ credence function on your total evidence at $t_2$. So, of the updating rules that cover the two times $t_1$ and $t_2$, the one that minimizes expected inaccuracy is the one that results from conditioning at each time. That is, if the part of $\mathcal{R}$ that kicks in at $t_1$ doesn't demand I condition my prior on my evidence at $t_1$, or if the part of $\mathcal{R}$ that kicks in at $t_2$ doesn't demand I condition my credence function at $t_1$ on my evidence at $t_2$, then there is an alternative rule $\mathcal{R}^\star$, that $P$ expects to be more accurate: that is,$$\sum_w P(w)\mathfrak{I}(\mathcal{R}^\star, w) < \sum_w P(w)\mathfrak{I}(\mathcal{R}, w)$$And, as I mentioned above, it's clear how to generalize this to cover not just updating plans that cover two different times at which you receive evidence, but any finite number.
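The two-stage case is small enough to check by hand or by machine. The sketch below uses a prior I have made up for illustration, the Brier score as the inaccuracy measure, and a rival plan that differs from conditioning only at $t_1$; lower-case letters stand for negations, so `"Xy"` is the world $X\overline{Y}$.

```python
WORLDS = ["XY", "Xy", "xY", "xy"]  # lower case marks a negated proposition
PRIOR = {"XY": 0.4, "Xy": 0.2, "xY": 0.1, "xy": 0.3}  # hypothetical prior


def condition(prior, cell):
    """Condition the prior on the proposition given as a list of worlds."""
    mass = sum(prior[w] for w in cell)
    return {w: (prior[w] / mass if w in cell else 0.0) for w in prior}


def brier(credence, world):
    """Brier inaccuracy of a credence function over worlds at the given world."""
    return sum((float(w == world) - credence[w]) ** 2 for w in credence)


def expected_inaccuracy(stage1, stage2, a1, a2):
    """Prior expectation of a1 * (inaccuracy at t_1) + a2 * (inaccuracy at t_2)."""
    return sum(
        PRIOR[w] * (a1 * brier(stage1[w[0]], w) + a2 * brier(stage2[w], w))
        for w in WORLDS
    )


# Conditioning plan: P(-|X) or P(-|not-X) at t_1, then the true world at t_2.
bayes1 = {"X": condition(PRIOR, ["XY", "Xy"]), "x": condition(PRIOR, ["xY", "xy"])}
bayes2 = {w: condition(PRIOR, [w]) for w in WORLDS}

# Rival plan: like conditioning, but it overshoots towards XY on learning X.
rival1 = {"X": {"XY": 0.8, "Xy": 0.2, "xY": 0.0, "xy": 0.0}, "x": bayes1["x"]}

for a1, a2 in [(1.0, 1.0), (0.1, 10.0)]:
    print(a1, a2,
          expected_inaccuracy(bayes1, bayes2, a1, a2),
          expected_inaccuracy(rival1, bayes2, a1, a2))
```

For both choices of weights the conditioning plan has the lower expected inaccuracy. Since the expectation splits into an $\alpha_1$ part and an $\alpha_2$ part, any positive weight on the time at which the rival departs from conditioning is enough to make it lose.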

However, I think Douven would not be entirely moved by this. After all, while he is certainly interested in the long-run effects on inaccuracy of using one updating rule or another, he thinks that looking only to expected inaccuracy is a mistake. He thinks that we care about other features of updating rules. Indeed, he provides us with one, and uses computer simulations to show that, in the toy coin tossing case that we've been using, the explanationist account has that desirable feature to a greater degree than the Bayesian account.

> For each possible bias value, we ran 1000 simulations of a sequence of 1000 tosses. As previously, the explanationist and the Bayesian updated their degrees of belief after each toss. We registered in how many of those 1000 simulations the explanationist incurred a lower penalty than the Bayesian at various reference points [100 tosses, 250, 500, 750, 1000], at which we calculated both Brier penalties and log score penalties. The outcomes [...] show that, on either measure of inaccuracy, IBE is most often the winner -- it incurs the lowest penalty -- at each reference point. Hence, at least in the present kind of context, IBE seems a better choice than Bayes' rule. (page 439)

How can we square this with the Greaves and Wallace result? Well, as Douven goes on to explain: "[the explanationist rule] in general achieves greater accuracy than [the Bayesian], even if typically not much greater accuracy"; but "[the Bayesian rule] is less likely than [the explanationist rule] to ever make one vastly inaccurate, even though the former typically makes one somewhat more inaccurate than the latter." So the explanationist is most often more accurate, but when it is more accurate, it's only a little more, while when it is less accurate, it's a lot less. So, in expectation, the Bayesian rule wins. Douven then argues that you might be more interested in being more likely to be more accurate, rather than being expectedly more accurate.
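To see how both claims can hold at once, here is a toy calculation with invented numbers (they are not Douven's). Suppose rule $A$ is less inaccurate than rule $B$ by $0.01$ in 90% of runs, and more inaccurate by $0.2$ in the remaining 10%. Then the expected difference in inaccuracy between $A$ and $B$ is$$0.9 \times (-0.01) + 0.1 \times 0.2 = -0.009 + 0.02 = 0.011 > 0$$so $A$ wins nine times out of ten, yet $B$ has the lower expected inaccuracy. Someone who cares mainly about winning most often would still pick $A$, as Douven suggests one might.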

Perhaps. But in any case there's another accuracy argument for the Bayesian way of updating that doesn't assume that expected inaccuracy is the thing you want to minimize. This is an argument that Ray Briggs and I gave a couple of years ago. I'll illustrate it in the same setting we used above, where we have prior $P$, at $t_1$ we'll learn $X$ or $\overline{X}$, and at $t_2$ we'll learn $XY$, $X\overline{Y}$, $\overline{X}Y$, or $\overline{X} \overline{Y}$. And we measure the inaccuracy of an updating rule $\mathcal{R} = (P_X, P_{\overline{X}}, P_{XY}, P_{X\overline{Y}}, P_{\overline{X}Y}, P_{\overline{X}\overline{Y}})$ for this as follows:

$$\begin{eqnarray*}
\mathfrak{I}(\mathcal{R}, XY) & = & \alpha_1 \mathfrak{I}(P_X, XY) + \alpha_2\mathfrak{I}(P_{XY}, XY) \\
\mathfrak{I}(\mathcal{R}, X\overline{Y}) & = & \alpha_1 \mathfrak{I}(P_X, X\overline{Y}) + \alpha_2\mathfrak{I}(P_{X\overline{Y}}, X\overline{Y}) \\
\mathfrak{I}(\mathcal{R}, \overline{X}Y) & = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + \alpha_2\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y) \\
\mathfrak{I}(\mathcal{R}, \overline{X}\overline{Y}) & = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y}) + \alpha_2\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})
\end{eqnarray*}$$Then the following is true: if the part of my plan that kicks in at $t_1$ doesn't demand I condition my prior on my evidence at $t_1$, or if the part of my plan that kicks in at $t_2$ doesn't demand I condition my $t_1$ credence function on my evidence at $t_2$, then, for any $0 < \beta < 1$, there is an alternative prior $P^\star$ and its associated Bayesian updating rule $\mathcal{R}^\star$, such that, for all worlds $w$,$$\beta\mathfrak{I}(P^\star, w) + (1-\beta)\mathfrak{I}(\mathcal{R}^\star, w) < \beta \mathfrak{I}(P, w) + (1-\beta)\mathfrak{I}(\mathcal{R}, w)$$And, again, this result generalizes to cases that include any number of times at which we receive new evidence, and in which, at each time, the set of propositions we might receive as evidence forms a partition. So it certainly covers the case of the coin of unknown bias that we've been using throughout. So, if you plan to update in some way other than by Bayesian conditionalization starting with your prior, there is an alternative prior and plan that, taken together, is guaranteed to have greater accuracy than yours; that is, they will have greater total accuracy than yours however the world turns out.

How do we square this with Douven's simulation results? The key is that this dominance result includes the prior in it. It does not say that, if $\mathcal{R}$ requires you not to condition $P$ on your evidence at any point, then a rule that does require that is guaranteed to be better. It says that if $\mathcal{R}$ requires you not to condition $P$ on your evidence at any point, then there is an alternative prior $P^\star$ such that it, together with a rule that requires you to condition it on your evidence, are better than $P$ and $\mathcal{R}$ for sure. Douven's results compare the performance of conditioning on $P$ and performing the explanationist update on it. This shows that while conditioning might not always give a better result than the explanationist, there is an alternative prior such that conditioning on it is guaranteed to be better than retaining the original prior and performing the explanationist rule. And that, I think, is the reason we should prefer conditioning on our evidence to giving the little explanationist boosts that Douven suggests. If we update by conditioning, our prior and update rule, taken together, are never accuracy dominated; if we update using Douven's explanationist rule, our prior and update rule, taken together, are accuracy dominated.

Before wrapping up, it's worth mentioning that there's a little wrinkle to iron out. It might be that, while the original prior and the posteriors it generates at the various times all satisfy the Principal Principle, the dominating prior and updating rule don't. While being dominated is clearly bad, you might think that being dominated by something that is itself irrational -- because it violates the Principal Principle, or for other reasons -- isn't so bad. But in fact we can tweak things to avoid this situation. The following is true: if the part of my plan that kicks in at $t_1$ doesn't demand I condition my prior on my evidence at $t_1$, or if the part of my plan that kicks in at $t_2$ doesn't demand I condition my $t_1$ credence function on my evidence at $t_2$, then, for any $0 < \beta < 1$, there is an alternative prior $P^\star$ and its associated Bayesian updating rule $\mathcal{R}^\star$, such that, $P^\star$ obeys the Principal Principle and, for all possible objective chance functions $ch$,

$\beta\sum_{w} ch(w) \mathfrak{I}(P^\star, w) + (1-\beta)\sum_{w} ch(w) \mathfrak{I}(\mathcal{R}^\star, w) < $

$\beta \sum_{w} ch(w) \mathfrak{I}(P, w) + (1-\beta)\sum_{w} ch(w) \mathfrak{I}(\mathcal{R}, w)$

So I'm inclined to think that Douven's critique of the Dutch Book argument against the explanationist updating rule hits the mark; and I can see why he thinks the expected accuracy argument against it is also less than watertight; but I think the accuracy dominance argument against it is stronger. We shouldn't use that updating rule, with its extra boost for explanatory hypotheses, because if we do so, there will be an alternative prior such that applying the Bayesian updating rule to that prior is guaranteed to be more accurate than applying the explanationist rule to our actual prior.