Thursday, 3 September 2020

Accuracy and Explanation in a Social Setting: thoughts on Douven and Wenmackers

For a PDF version of this post, see here.

In this post, I want to continue my discussion of the part of van Fraassen's argument against inference to the best explanation (IBE) that turns on its alleged clash with Bayesian Conditionalization (BC). In the previous post, I looked at Igor Douven's argument that there are at least some ways of valuing accuracy on which updating by IBE comes out better than BC. I concluded that Douven's arguments don't save IBE; BC is still the only rational way to update.

The setting for Douven's arguments was individualist epistemology. That is, he considered only the single agent collecting evidence directly from the world and updating in the light of it. But of course we often receive evidence not directly from the world, but indirectly through the opinions of others. I learn how many positive SARS-CoV-2 tests there have been in my area in the past week not by inspecting the test results myself but by listening to the local health authority. In their 2017 paper, 'Inference to the Best Explanation versus Bayes’s Rule in a Social Setting', Douven joined with Sylvia Wenmackers to ask how IBE and BC fare in a context in which some of my evidence comes from the world and some from learning the opinions of others, where those others are also receiving some of their evidence from the world and some from others, and where one of those others from whom they're learning might be me. As in Douven's study of IBE vs BC in the individual setting, Douven and Wenmackers conclude in favour of IBE. Indeed, their conclusion in this case is considerably stronger than in the individual case:

The upshot will be that if agents not only update their degrees of belief on the basis of evidence, but also take into account the degrees of belief of their epistemic neighbours, then the noted advantage of Bayesian updating [from Douven's earlier paper] evaporates and IBE does better than Bayes’s rule on every reasonable understanding of inaccuracy minimization. (536-7)

As in the previous post, I want to stick up for BC. As in the individualist setting, I think this is the update rule we should use in the social setting.

Following van Fraassen's original discussion and the strategy pursued in Douven's solo piece, Douven and Wenmackers take the general and ill-specified question whether IBE is better than BC and make it precise by asking it in a very specific case. We imagine a group of individuals. Each has a coin. All coins have the same bias. No individual knows what this shared bias is, but they do know that it is the same bias for each coin, and they know that the options are given by the following bias hypotheses:

$B_0$: coin has 0% chance of landing heads

$B_1$: coin has 10% chance of landing heads

$\ldots$

$B_9$: coin has 90% chance of landing heads

$B_{10}$: coin has 100% chance of landing heads

Though they don't say so, I think Douven and Wenmackers assume that all individuals have the same prior over $B_0, \ldots, B_{10}$, namely, the uniform prior, and that each satisfies the Principal Principle, so that their credences in everything else follow from their credences in $B_0, \ldots, B_{10}$. As we'll see, we needn't assume that they all have the uniform prior over the bias hypotheses. In any case, they assume that things proceed as follows:

Step (i) Each member tosses their coin some fixed number of times. This produces their worldly evidence for this round.

Step (ii) Each then updates their credence function on this worldly evidence they've obtained. To do this, each member uses the same updating rule, either BC or a version of IBE. We'll specify these in more detail below.

Step (iii) Each then learns the updated credence functions of the others in the group. This produces their social evidence for this round.

Step (iv) They then update their own credence function by taking the average of their credence function and the other credence functions in the group that lie within a certain distance of theirs. The set of credence functions that lie within a certain distance of one's own, Douven and Wenmackers call one's bounded confidence interval.

They then repeat this cycle a number of times; in each new cycle, an individual begins with the credence function they reached at the end of the previous cycle.

Douven and Wenmackers use simulation techniques to see how this group of individuals perform for different updating rules used in step (ii) and different specifications of how close a credence function must lie to yours in order to be included in the average in step (iv). Here's the class of updating rules that they consider: if $P$ is your prior and $E$ is your evidence then your updated credence function should be$$P^c_E(B_i) = \frac{P(B_i)P(E|B_i) + f_c(B_i, E)}{\sum^{10}_{k=0} \left (P(B_k)P(E|B_k) + f_c(B_k, E) \right )}$$where$$f_c(B_i, E) = \left \{ \begin{array}{ll} c & \mbox{if } P(E | B_i) > P(E | B_j) \mbox{ for all } j \neq i \\ \frac{1}{2}c & \mbox{if } P(E | B_i) = P(E|B_j) > P(E | B_k) \mbox{ for all } k \neq j, i \\  0 & \mbox{otherwise} \end{array} \right. $$That is, for $c = 0$, this update rule is just BC, while for $c > 0$, it gives a little boost to whichever hypothesis best explains the evidence $E$, where providing the best explanation for a series of coin tosses amounts to making it most likely, and if two bias hypotheses make the evidence most likely, they split the boost between them. Douven and Wenmackers consider $c = 0, 0.1, \ldots, 0.9, 1$. For each rule, specified by $c$, they also consider different sizes of bounded confidence intervals. These are specified by the parameter $\varepsilon$. Your bounded confidence interval for $\varepsilon$ includes each credence function for which the average difference between the credences it assigns and the credences you assign is at most $\varepsilon$. Thus, $\varepsilon = 0$ is the most exclusive, and includes only your own credence function, while $\varepsilon = 1$ is the most inclusive, and includes all credence functions in the group. Again, Douven and Wenmackers consider $\varepsilon = 0, 0.1, \ldots, 0.9, 1$. Here are two of their main results:

  1. For each bias other than $p = 0.1$ or $0.9$, there is an explanationist rule (i.e. $c > 0$ and some specific $\varepsilon$) that gives rise to a lower average inaccuracy at the end of the process than all BC rules (i.e. $c = 0$ and any $\varepsilon$).
  2. There is an averaging explanationist rule (i.e. $c > 0$ and $\varepsilon > 0$) such that, for each bias other than $p = 0, 0.1, 0.9, 1$, it gives rise to lower average inaccuracy than all BC rules (i.e. $c = 0$ and any $\varepsilon$).

Inaccuracy is measured by the Brier score throughout.
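
To make steps (ii) and (iv) concrete, here is a minimal sketch in Python of the two update rules just described. It is only an illustration of the rule $P^c_E$ and of bounded-confidence averaging, not a reproduction of Douven and Wenmackers' simulation code, and the function names are mine.

```python
import numpy as np

BIASES = np.arange(11) / 10   # the chance of heads according to B_0, ..., B_10

def explanationist_update(prior, heads, tails, c=0.1):
    """One application of the rule P^c_E to a run of coin tosses;
    with c = 0 this is just Bayesian Conditionalization."""
    likelihood = BIASES**heads * (1 - BIASES)**tails
    bonus = np.zeros(11)
    best = np.flatnonzero(np.isclose(likelihood, likelihood.max()))
    if len(best) == 1:
        bonus[best] = c        # a unique best explainer gets the full boost
    elif len(best) == 2:
        bonus[best] = c / 2    # two best explainers split the boost
    unnormalized = prior * likelihood + bonus
    return unnormalized / unnormalized.sum()

def bounded_confidence_average(all_credences, me, epsilon):
    """Step (iv): average my credence function with every credence function
    in the group whose average distance from mine is at most epsilon."""
    distances = np.abs(all_credences - all_credences[me]).mean(axis=1)
    return all_credences[distances <= epsilon].mean(axis=0)
```

For instance, `explanationist_update(np.ones(11)/11, 1, 1, c=0)` returns, as floats, the $P(-|HT)$ row of the table in the example below.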

Now, you can ask whether these results are enough to tell so strongly in favour of IBE. But that isn't my concern here. Rather, I want to focus on a more fundamental problem: Douven and Wenmackers' argument doesn't really compare BC with IBE. They're comparing BC-for-worldly-data-plus-Averaging-for-social-data with IBE-for-worldly-data-plus-Averaging-for-social-data. So their simulation results don't really impugn BC, because the average inaccuracies that they attribute to BC don't really arise from it. They arise from using BC in step (ii), but something quite different in step (iv). Douven and Wenmackers ask the Bayesian to respond to the social evidence they receive using a non-Bayesian rule, namely, Averaging. And we can see just how far Averaging lies from BC by considering the following version of the example we have been using throughout.

Consider the biased coin case, and suppose there are just three members of the group. And suppose they all start with the uniform prior over the bias hypotheses. At step (i), they each toss their coin twice. The first individual's coin lands $HT$, the second's $HH$, and the third's $TH$. So, at step (ii), if they all use BC (i.e. $c = 0$), they update on this worldly evidence as follows, where $P$ is the shared prior:
$$\begin{array}{r|ccccccccccc}
& B_0 & B_1& B_2& B_3& B_4& B_5& B_6& B_7& B_8& B_9& B_{10} \\
\hline
&&&&&&&&&& \\
P & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} \\
&&&&&&&&&& \\
P(-|HT) & 0 & \frac{9}{165} & \frac{16}{165}& \frac{21}{165}& \frac{24}{165} & \frac{25}{165}& \frac{24}{165}& \frac{21}{165}& \frac{16}{165}& \frac{9}{165}& 0\\
&&&&&&&&&& \\
P(-|HH) & 0 &   \frac{1}{385} &  \frac{4}{385}&  \frac{9}{385}&  \frac{16}{385}&  \frac{25}{385}&  \frac{36}{385}&  \frac{49}{385}&  \frac{64}{385}&  \frac{81}{385}&  \frac{100}{385}\\
&&&&&&&&&& \\
P(-|TH) &  0 & \frac{9}{165} & \frac{16}{165}& \frac{21}{165}& \frac{24}{165} & \frac{25}{165}& \frac{24}{165}& \frac{21}{165}& \frac{16}{165}& \frac{9}{165}& 0\\
\end{array}$$
Now, at step (iii), they each learn the others' distributions. And they average on that. Let's suppose I'm the first individual. Then I have two choices for my bounded confidence interval. It either includes my own credence function $P(-|HT)$ and the third individual's $P(-|TH)$, which are identical, or it includes all three, $P(-|HT), P(-|HH), P(-|TH)$. Let's suppose it includes all three. Here is the outcome of averaging:$$\begin{array}{r|ccccccccccc}
& B_0 & B_1& B_2& B_3& B_4& B_5& B_6& B_7& B_8& B_9& B_{10} \\
\hline
&&&&&&&&&& \\
\mbox{Av} & 0 & \frac{129}{3465} & \frac{236}{3465}& \frac{321}{3465}& \frac{384}{3465}& \frac{425}{3465}& \frac{444}{3465}& \frac{441}{3465}& \frac{416}{3465}& \frac{369}{3465}& \frac{300}{3465}
\end{array}$$
And now compare that with what they would do if they updated at step (iv) using BC rather than Averaging. I learn the distributions of the second and third individuals. Now, since I know how many times they tossed their coin, and I know that they updated by BC at step (ii), I thereby learn something about how their coin landed. I know that it landed in such a way that would lead them to update to $P(-|HH)$ and $P(-|TH)$, respectively. Now what exactly does this tell me? In the case of the second individual, it tells me that their coin landed $HH$, since that's the only evidence that would lead them to update to $P(-|HH)$. In the case of the third individual, my evidence is not quite so specific. I learn that their coin either landed $HT$ or $TH$, since either of those, and only those, would lead them to update to $P(-|TH)$. In general, learning an individual's posteriors when you know their prior and the number of times they've tossed the coin will teach you how many heads they saw and how many tails, though it won't tell you the order in which they saw them. But that's fine. We can still update on that information using BC, and indeed BC will tell us to adopt the same credence as we would if we were to learn the more specific evidence of the order in which the coin tosses landed. If we do so in this case, we get:
$$\begin{array}{r|ccccccccccc}
& B_0 & B_1& B_2& B_3& B_4& B_5& B_6& B_7& B_8& B_9& B_{10} \\
\hline&&&&&&&&&& \\
\mbox{Bayes} & 0 & \frac{81}{95205} & \frac{1024}{95205} & \frac{3969}{95205} & \frac{9216}{95205} & \frac{15625}{95205} & \frac{20736}{95205} & \frac{21609}{95205} & \frac{16384}{95205} & \frac{6561}{95205} &0 \\
\end{array}
$$And this is pretty far from what I got by Averaging at step (iv).
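
These numbers can be checked with a few lines of exact arithmetic. Here is a sketch using Python's `fractions` module (the helper names are mine):

```python
from fractions import Fraction as F

biases = [F(i, 10) for i in range(11)]
uniform = [F(1, 11)] * 11

def bayes(prior, heads, tails):
    unnormalized = [p * b**heads * (1 - b)**tails for p, b in zip(prior, biases)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

post_HT = bayes(uniform, 1, 1)   # the first and third individuals
post_HH = bayes(uniform, 2, 0)   # the second individual

# Step (iv) by Averaging over all three members of my bounded confidence interval
averaged = [(2 * a + b) / 3 for a, b in zip(post_HT, post_HH)]

# Step (iv) by BC instead: the reported posteriors reveal 1 head and 1 tail,
# 2 heads, and 1 head and 1 tail, so I condition on 4 heads and 2 tails in all
pooled = bayes(uniform, 4, 2)

print(averaged[5], pooled[5])    # 85/693 (= 425/3465) and 3125/19041 (= 15625/95205)
```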

So updating using BC is very different from averaging. Why, then, do Douven and Wenmackers use Averaging rather than BC for step (iv)? Here is their motivation:

[T]aking a convex combination of the probability functions of the individual agents in a group is the best studied method of forming social probability functions. Authors concerned with social probability functions have mostly considered assigning different weights to the probability functions of the various agents, typically in order to reflect agents’ opinions about other agents’ expertise or past performance. The averaging part of our update rule is in some regards simpler and in others less simple than those procedures. It is simpler in that we form probability functions from individual probability functions by taking only straight averages of individual probability functions, and it is less simple in that we do not take a straight average of the probability functions of all given agents, but only of those whose probability function is close enough to that of the agent whose probability is being updated. (552)

In some sense, they're right. Averaging or linear pooling or taking a convex combination of individual credence functions is indeed the best studied method of forming social credence functions. And there are good justifications for it: János Aczél and Carl Wagner and, independently, Kevin J. McConway, give a neat axiomatic characterization; and I've argued that there are accuracy-based reasons to use it in particular cases. The problem is that our situation in step (iv) is not the sort of situation in which you should use Averaging. Arguments for Averaging concern those situations in which you have a group of individuals, possibly experts, and each has a credence function over the same set of propositions, and you want to produce a single credence function that could be called the group's collective credence function. Thus, for instance, if I wish to give the SAGE group's collective credence that there will be a safe and effective SARS-CoV-2 vaccine by March 2021, I might take the average of their individual credences. But this is quite a different task from the one that faces me as the first individual when I reach step (iv) of Douven and Wenmackers' process. There, I already have credences in the propositions in question. What's more, I know how the other individuals update and the sort of evidence they will have received, even if I don't know which particular evidence of that sort they have. And that allows me to infer from their credences after the update at step (ii) a lot about the evidence they receive. And I have opinions about the propositions in question conditional on the different evidence my fellow group members received. And so, in this situation, I'm not trying to summarise our individual opinions as a single opinion. Rather, I'm trying to use their opinions as evidence to inform my own. And, in that case, BC is better than Averaging. So, in order to show that IBE is superior to BC in some respect, it doesn't help to compare BC at step (ii) + Averaging at step (iv) with IBE at (ii) + Averaging at (iv). It would be better to compare BC at (ii) and (iv) with IBE at (ii) and (iv).

So how do things look if we do that? Well, it turns out that we don't need simulations to answer the question. We can simply appeal to the mathematical results we mentioned in the previous post: first, Hilary Greaves and David Wallace's expected accuracy argument; and second, the accuracy dominance argument that Ray Briggs and I gave. Or, more precisely, we use the slight extensions of those results to multiple learning experiences that I sketched in the previous post. For both of those results, the background framework is the same. We begin with a prior, which we hold at $t_0$, before we begin gathering evidence. And we then look forward to a series of times $t_1, \ldots, t_n$ at each of which we will learn some evidence. And, for each time, we know the possible pieces of evidence we might receive, and we plan, for each time, which credence function we would adopt in response to each of the pieces of evidence we might learn at that time. Thus, formally, for each $t_i$ there is a partition from which our evidence at $t_i$ will come. For each $t_{i+1}$, the partition is a fine-graining of the partition at $t_i$. That is, our evidence gets more specific as we proceed. In the case we've been considering, at $t_1$, we'll learn the outcome of our own coin tosses; at $t_2$, we'll add to that our fellow group members' credence functions at $t_1$, from which we can derive a lot about the outcome of their first run of coin tosses; at $t_3$, we'll add to that the outcome of our next run of our own coin tosses; at $t_4$, we'll add the outcomes of the other group members' coin tosses by learning their credences at $t_3$; and so on. The results are then as follows:

Theorem (Extended Greaves and Wallace) For any strictly proper inaccuracy measure, the updating rule that minimizes expected inaccuracy from the point of view of the prior is BC.

Theorem (Extended Briggs and Pettigrew) For any continuous and strictly proper inaccuracy measure, if your updating rule is not BC, then there is an alternative prior and alternative updating rule that accuracy dominates your prior and your updating rule.

Now, these results immediately settle one question: if you are an individual in the group, and you know which update rules the others have chosen to use, then you should certainly choose BC for yourself. After all, if you have picked your prior, then it expects picking BC to minimize your inaccuracy, and thus expects picking BC to minimize the total inaccuracy of the group that includes you; and if you have not picked your prior, then if you consider a prior together with something other than BC as your updating rule, there's some other combination you could choose instead that is guaranteed to do better, and thus some other combination you could choose that is guaranteed to improve the total accuracy of the group. But Douven and Wenmackers don't set up the problem like this. Rather, they assume that all members of the group use the same updating rule. So the question is whether everyone picking BC is better than everyone picking something else. Fortunately, at least in the case of the coin tosses, this does follow. As we'll see, things could get more complicated with other sorts of evidence.

If you know the updating rules that others will use, then you pick your updating rule simply on the basis of its ability to get you the best accuracy possible; the others have made their choices and you can't affect that. But if you are picking an updating rule for everyone to use, you must consider not only its properties as an updating rule for the individual, but also its properties as a means of signalling to the other members what evidence you have. Thus, prior to considering the details of this, you might think that there could be an updating rule that is very good at producing accurate responses to evidence, but poor at producing a signal to others of the evidence you've received---there might be a wide range of different pieces of evidence you could receive that would lead you to update to the same posterior using this rule, and in that case, learning your posterior would give little information about your evidence. If that were so, we might prefer an updating rule that does not produce such accurate updates, but does signal very clearly what evidence is received. For, in that situation, each individual would produce a less accurate update at step (ii), but would then receive a lot more evidence at step (iv), because the update at step (ii) would signal the evidence that the other members of the group received much more clearly. However, in the coin toss setup that Douven and Wenmackers consider, this isn't an issue. In the coin toss case, learning someone's posterior when you know their prior and how many coin tosses they have observed allows you to learn exactly how many heads and how many tails they observed. It doesn't tell you the order in which they were observed, but knowing that further information wouldn't affect how you would update anyway, either on the BC rule or on the IBE rule---learning $HT \vee TH$ leads to the same update as learning $HT$ for both Bayesian and IBEist. So when we are comparing them, we can consider the information learned at step (ii) and step (iv) both to be worldly information. Both give us information about the tosses of the coin that our peers witnessed. So, in comparing them, we needn't take into account how good they are at signalling the evidence you have. They are both equally good and both very good. So, when choosing a single rule that each member of the group must use, we need only compare their accuracy as update rules. And the theorems above indicate that BC wins out on that measure.

Monday, 24 August 2020

Accuracy and explanation: thoughts on Douven

 For a PDF of this post, see here.

Igor has eleven coins in his pocket. The first has 0% chance of landing heads, the second 10% chance, the third 20%, and so on up to the tenth, which has 90% chance, and the eleventh, which has 100% chance. He picks one out without letting me know which, and he starts to toss it. After the first 10 tosses, it has landed tails 5 times. How confident should I be that the coin is fair? That is, how confident should I be that it is the sixth coin from Igor's pocket; the one with 50% chance of landing heads? According to the Bayesian, the answer is calculated as follows:$$P_E(H_5) = P(H_5 | E) = \frac{P(H_5)P(E | H_5)}{\sum^{10}_{i=0} P(H_i) P(E|H_i)}$$where

  • $E$ is my evidence, which says that 5 out of 10 of the tosses landed heads,
  • $P_E$ is my posterior credence function upon learning the evidence $E$,
  • $P$ is my prior,
  • $H_i$ is the hypothesis that the coin has $\frac{i}{10}$ chance of landing heads,
  • $P(H_0) = \ldots = P(H_{10}) = \frac{1}{11}$, since I know nothing about which coin Igor pulled from his pocket, and
  • $P(E | H_i) = \left ( \frac{i}{10} \right )^5 \left (\frac{10-i}{10} \right )^5$, by the Principal Principle, and since each coin toss is independent of each other one.

So, upon learning that the coin landed heads five times out of ten, my posterior should be:$$P_E(H_5) = P(H_5 | E) = \frac{P(H_5)P(E | H_5)}{\sum^{10}_{i=0} P(H_i) P(E|H_i)} = \frac{\frac{1}{11} \left ( \frac{5}{10} \right )^5\left ( \frac{5}{10} \right )^5}{\sum^{10}_{i=0}\frac{1}{11} \left ( \frac{i}{10} \right )^5 \left (\frac{10-i}{10} \right )^5 } \approx 0.2707$$But some philosophers have suggested that this is too low. The Bayesian calculation takes into account how likely the hypothesis in question makes the evidence, as well as how likely I thought the hypothesis was in the first place, but it doesn't take into account that the hypothesis explains the evidence. We'll call these philosophers explanationists. Upon learning that the coin landed heads five times out of ten, the explanationist says, we should be most confident in $H_5$, the hypothesis that the coin is fair, and the Bayesian calculation does indeed give this. But we should be most confident in part because $H_5$ best explains the evidence, and the Bayesian calculation takes no account of this.
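
Before we look at the explanationist's alternative, here is a quick check of that number (just a sketch; the variable names are mine):

```python
priors = [1 / 11] * 11
likelihoods = [(i / 10)**5 * ((10 - i) / 10)**5 for i in range(11)]
posterior_H5 = priors[5] * likelihoods[5] / sum(p * l for p, l in zip(priors, likelihoods))
print(round(posterior_H5, 4))   # 0.2707
```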

To accommodate the explanationist's demand, Igor Douven proposes the following alternative updating rule:$$P_E(H_k) = P(H_k | E) = \frac{P(H_k)P(E | H_k) + f(H_k, E)}{\sum^{10}_{i=0} (P(H_i) P(E|H_i) + f(H_i, E))}$$where $f$ gives a little boost to $H_k$ if it is the best explanation of $E$ and not if it isn't. Perhaps, for instance,

  • $f(H_k, E) = 0.1$, if the frequency of heads among the coin tosses that $E$ reports is uniquely closest to the chance of heads according to $H_k$, namely, $\frac{k}{10}$,
  • $f(H_k, E) = 0.05$, if the frequency of heads among the coin tosses that $E$ reports is equally closest to the chance of heads according to $H_k$ and another hypothesis,
  • $f(H_k, E) = 0$, otherwise.

Thus, according to this:$$P_E(H_5) = \frac{P(H_5)P(E | H_5) + 0.1}{\left (\sum^{10}_{i=0} P(H_i) P(E|H_i) \right ) + 0.1} = \frac{\frac{1}{11} \left ( \frac{5}{10} \right )^5\left ( \frac{5}{10} \right )^5 + 0.1}{\sum^{10}_{i=0}\frac{1}{11} \left ( \frac{i}{10} \right )^5 \left (\frac{10-i}{10} \right )^5 + 0.1 } \approx 0.9976$$So, as required, $H_5$ certainly gets a boost in posterior probability because it best explains the run of heads and tails we observe.
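
Again, a quick check of the arithmetic (a sketch; the variable names are mine). Note that the likelihood terms are tiny compared with the constant boost of 0.1, which is why the boost so thoroughly dominates the result.

```python
priors = [1 / 11] * 11
likelihoods = [(i / 10)**5 * ((10 - i) / 10)**5 for i in range(11)]
# Only H_5 (chance 0.5) is closest to the observed frequency of heads, 0.5,
# so only it receives the boost f(H_5, E) = 0.1
boosts = [0.1 if i == 5 else 0.0 for i in range(11)]
unnormalized = [p * l + f for p, l, f in zip(priors, likelihoods, boosts)]
print(round(unnormalized[5] / sum(unnormalized), 4))   # 0.9976
```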

Before we move on, it's worth noting a distinctive feature of this case. In many cases where we wish to apply something like abduction or inference to the best explanation, we might think that we can record our enthusiasm for good explanations in the priors. For instance, suppose I have two scientific theories, $T_1$ and $T_2$, both of which predict the evidence I've collected. So, they both make the evidence equally likely. But I want to assign higher probability to $T_1$ upon receipt of that evidence because it provides a better explanation for the evidence. Then I should simply encode this in my prior. That is, I should assign $P(T_1) > P(T_2)$. But that sort of move isn't open to us in Douven's example. The reason is that none of the chance hypotheses are better explanations in themselves: none is simpler or more general or what have you. But rather, for each, there is evidence we might obtain such that it is a better explanation of that evidence. But before we obtain the evidence, we don't know which will prove the better explanation of it, and so can't accommodate our explanationist instincts by giving that hypothesis a boost in our prior.

Now let's return to the example. There are well known objections to updating in the explanationist way Douven suggests. Most famously, van Fraassen pointed out that we have good reasons to comply with the Bayesian method of updating, and the explanationist method deviates quite dramatically from that (Laws and Symmetry, chapter 6). When he was writing, the most compelling argument was David Lewis' diachronic Dutch Book argument. If you plan to update as Douven suggests, by giving an extra-Bayesian boost to the hypothesis that best explains the evidence, then there is a series of bets you'll accept before you receive the evidence and another set you'll accept afterwards that, taken together, will lose you money for sure. Douven is unfazed. He first suggests that vulnerability to a Dutch Book does not impugn your epistemic rationality, but only your practical rationality. He notes Skyrms's claim that, in the case of synchronic Dutch Books, such vulnerability reveals an inconsistency in your assessment of the same bet presented in different ways, and therefore perhaps some epistemic failure, but argues that this cannot be extended to the diachronic case. In any case, he says, avoiding the machinations of malevolent bookies is only one practical concern that we have, and, let's be honest, not a very pressing one. What's more, he points out that, while updating in the Bayesian fashion serves one practical end, namely, making us immune to these sorts of diachronic sure losses, there are other practical ends it might not serve as well. For instance, he uses computer simulations to show that, if we update in his explanationist way, we'll tend to assign credence greater than 0.99 in the true hypothesis much more quickly than if we update in the Bayesian way. He admits that we'll also tend to assign credence greater than 0.99 in a false hypothesis much more quickly than if we use Bayesian updating. But he responds, again with the results of a computer simulation: suppose we keep tossing the coin until one of the rules assigns more than 0.99 to a hypothesis; then award points to that rule if the hypothesis it becomes very confident in is true, and deduct them if it is false; then the explanationist updating rule will perform better on average than the Bayesian rule. So, if there is some practical decision that you will make only when your credence in a hypothesis exceeds 0.99 -- perhaps the choice is to administer a particular medical treatment, and you need to be very certain in your diagnosis before doing so -- then you will be better off on average updating as Douven suggests, rather than as the Bayesian requires.
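
Here is a rough sketch of the kind of simulation being described. It is my reconstruction rather than Douven's code: I update toss by toss and, in line with the definition of $f$ above, give the bonus to the hypothesis (or split it between the two hypotheses) whose chance is closest to the frequency of heads observed so far.

```python
import random

BIASES = [i / 10 for i in range(11)]

def race(true_bias, c=0.1, threshold=0.99, max_tosses=1000):
    """Toss a coin with the given bias, updating a Bayesian and an
    explanationist posterior after every toss; return the first rule to
    exceed `threshold`, and whether the hypothesis it favours is true."""
    bayes = [1 / 11] * 11
    ibe = [1 / 11] * 11
    heads_so_far = 0
    for n in range(1, max_tosses + 1):
        heads = random.random() < true_bias
        heads_so_far += heads
        frequency = heads_so_far / n
        likelihood = [b if heads else 1 - b for b in BIASES]
        # Bayesian update on this toss
        bayes = [p * l for p, l in zip(bayes, likelihood)]
        total = sum(bayes)
        bayes = [p / total for p in bayes]
        # Explanationist update: the same likelihoods plus the bonus
        gaps = [abs(b - frequency) for b in BIASES]
        best = [i for i, g in enumerate(gaps) if g == min(gaps)]
        ibe = [p * l + (c / len(best) if i in best else 0)
               for i, (p, l) in enumerate(zip(ibe, likelihood))]
        total = sum(ibe)
        ibe = [p / total for p in ibe]
        for name, dist in (("IBE", ibe), ("Bayes", bayes)):
            if max(dist) > threshold:
                return name, BIASES[dist.index(max(dist))] == true_bias
    return "neither", None

# e.g. tally, over many runs with a fair coin, which rule wins the race
# and how often the winner's favoured hypothesis is the true one
results = [race(0.5) for _ in range(1000)]
```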

So much for the practical implications of updating in one way or another. I am more interested in the epistemic implications, and so is Douven. He notes that, since van Fraassen gave his argument, there is a new way of justifying the Bayesian demand to update by conditioning on your evidence. These are the accuracy arguments. While Douven largely works with the argument for conditioning that Hannes Leitgeb and I gave, I think the better version of that argument is due to Hilary Greaves and David Wallace. The idea is that, as usual, we measure the inaccuracy of a credence function using a strictly proper inaccuracy measure $\mathfrak{I}$. That is, if $P$ is a probabilistic credence function and $w$ is a possible world, then $\mathfrak{I}(P, w)$ gives the inaccuracy of $P$ at $w$. And, if $P$ is a probabilistic credence function, $P$ expects itself to be least inaccurate. That is, $\sum_w P(w) \mathfrak{I}(P, w) < \sum_w P(w) \mathfrak{I}(Q, w)$, for any credence function $Q \neq P$. Then Greaves and Wallace ask us to consider how you might plan to update your credence function in response to different pieces of evidence you might receive. Thus, suppose you know that the evidence you'll receive will be one of the following propositions, $E_1, \ldots, E_m$, which form a partition. This is the situation you're in if you know that you're about to witness 10 tosses of a coin, for instance, as in Douven's example: $E_1$ might be $HHHHHHHHHH$, $E_2$ might be $HHHHHHHHHT$, and so on. Then suppose you plan how you'll respond to each. If you learn $E_i$, you'll adopt $P_i$. Then we'll call this updating plan $\mathcal{R}$ and write it $(P_1, \ldots, P_m)$. Then we can calculate the expected inaccuracy of a given updating plan. Its inaccuracy at a world is the inaccuracy of the credence function it recommends in response to learning the element of the partition that is true at that world. That is, for world $w$ at which $E_i$ is true,$$\mathfrak{I}(\mathcal{R}, w) = \mathfrak{I}(P_i, w)$$And Greaves and Wallace show that the updating rule your prior expects to be best is the Bayesian one. That is, if there is $E_i$ with $P(E_i) > 0$ such that $P_i(-) \neq P(-|E_i)$, then there is an alternative updating rule $\mathcal{R}^\star = (P^\star_1, \ldots, P^\star_m)$ such that$$\sum_w P(w) \mathfrak{I}(\mathcal{R}^\star, w) < \sum_w P(w) \mathfrak{I}(\mathcal{R}, w)$$So, in particular, your prior expects the Bayesian rule to be more accurate than Douven's rule.
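
To see the point concretely in the coin example, here is a sketch of the calculation (mine, not Greaves and Wallace's): it scores only the credences in the eleven bias hypotheses, uses the Brier score, and takes the evidence partition to be the possible courses of ten tosses, grouped by the number of heads, which is all either plan is sensitive to.

```python
from math import comb

BIASES = [i / 10 for i in range(11)]
PRIOR = [1 / 11] * 11

def brier(credences, true_i):
    return sum((int(j == true_i) - x)**2 for j, x in enumerate(credences))

def update(heads, tails, c=0.0):
    """Response to a run with the given numbers of heads and tails:
    Bayesian Conditionalization if c = 0, Douven's boosted rule otherwise."""
    likelihood = [b**heads * (1 - b)**tails for b in BIASES]
    best = [i for i, l in enumerate(likelihood) if l == max(likelihood)]
    bonus = [(c if len(best) == 1 else c / 2 if len(best) == 2 else 0) if i in best else 0
             for i in range(11)]
    unnormalized = [p * l + f for p, l, f in zip(PRIOR, likelihood, bonus)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def expected_inaccuracy(c, n=10):
    """Prior-expected Brier inaccuracy of the plan that responds to each
    possible number of heads in n tosses with the c-boosted update."""
    return sum(PRIOR[i]
               * comb(n, h) * BIASES[i]**h * (1 - BIASES[i])**(n - h)
               * brier(update(h, n - h, c), i)
               for i in range(11) for h in range(n + 1))

print(expected_inaccuracy(0.0), expected_inaccuracy(0.1))   # the Bayesian plan is lower
```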

In response to this, Douven points out that there are many ways in which we might value the accuracy of our updating plans. For instance, the Greaves and Wallace argument considers only your accuracy at a single later point in time, after you've received a single piece of evidence and updated only on it. But, Douven argues, we might be interested not in the one-off inaccuracy of a single application of an updating rule, but rather in its inaccuracy in the long run. And we might be interested in different features of the long-run total inaccuracy of using that rule: we might be interested in just adding up all of the inaccuracies of the various credence functions you obtain from multiple applications of the rule; or we might be less interested in the inaccuracies of the interim credence functions and more interested in the inaccuracy of the final credence function you obtain after multiple updates. And, Douven claims, the accuracy arguments do not tell us anything about which performs better out of the Bayesian and explanationist approaches when viewed in these different ways.

However, that's not quite right. It turns out that we can, in fact, adapt the Greaves and Wallace argument to cover these cases. To see how, it's probably best to illustrate it with the simplest possible case, but it should be obvious how to scale up the idea. So suppose: 

  • my credences are defined over four worlds, $XY$, $X\overline{Y}$, $\overline{X}Y$, and $\overline{X}\overline{Y}$;
  • my prior at $t_0$ is $P$;
  • at $t_1$, I'll learn either $X$ or its negation $\overline{X}$, and I'll respond with $P_X$ or $P_{\overline{X}}$, respectively;
  • at $t_2$, I'll learn $XY$, $X\overline{Y}$, $\overline{X}Y$, or $\overline{X} \overline{Y}$, and I'll respond with $P_{XY}$, $P_{X\overline{Y}}$, $P_{\overline{X}Y}$, or $P_{\overline{X}\overline{Y}}$, respectively.

For instance, I might know that a coin is going to be tossed twice, once just before $t_1$ and once just before $t_2$. So $X$ is the proposition that it lands heads on the first toss, i.e., $X = \{HH, HT\}$, while $\overline{X}$ is the proposition it lands tails on the first toss $\overline{X} = \{TH, TT\}$. And then $Y$ is the proposition it lands heads on the second toss. So $XY = \{HH\}$, $X\overline{Y} = \{HT\}$, and so on.

Now, taken together, $P_X$, $P_{\overline{X}}$, $P_{XY}$, $P_{X\overline{Y}}$, $P_{\overline{X}Y}$, and $P_{\overline{X}\overline{Y}}$ constitute my updating plan---let's denote that $\mathcal{R}$. Now, how might we measure the inaccuracy of this plan $\mathcal{R}$? Well, we want to assign a weight to the inaccuracy of the credence function it demands after the first update -- let's call that $\alpha_1$; and we want a weight for the result of the second update -- let's call that $\alpha_2$. So, for instance, if I'm interested in the total inaccuracy obtained by following this rule, and each time is just as important as each other time, I just set $\alpha_1 = \alpha_2$; but if I care much more about my final inaccuracy, then I let $\alpha_1 \ll \alpha_2$. Then the inaccuracy of my updating rule is$$\begin{eqnarray*}
\mathfrak{I}(\mathcal{R}, XY) &  = & \alpha_1 \mathfrak{I}(P_X, XY) + \alpha_2\mathfrak{I}(P_{XY}, XY) \\
\mathfrak{I}(\mathcal{R}, X\overline{Y}) &  = & \alpha_1 \mathfrak{I}(P_X, X\overline{Y}) + \alpha_2\mathfrak{I}(P_{X\overline{Y}}, X\overline{Y}) \\
\mathfrak{I}(\mathcal{R}, \overline{X}Y) &  = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + \alpha_2\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y) \\
\mathfrak{I}(\mathcal{R}, \overline{X}\overline{Y}) &  = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y}) + \alpha_2\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})
\end{eqnarray*}$$Thus, the expected inaccuracy of $\mathcal{R}$ from the point of view of my prior $P$ is:

$P(XY)\mathfrak{I}(\mathcal{R}, XY) + P(X\overline{Y})\mathfrak{I}(\mathcal{R}, X\overline{Y}) + P(\overline{X}Y)\mathfrak{I}(\mathcal{R}, \overline{X}Y) + P(\overline{X} \overline{Y})\mathfrak{I}(\mathcal{R}, \overline{X}\overline{Y}) = $

$P(XY)[\alpha_1 \mathfrak{I}(P_X, XY) + \alpha_2\mathfrak{I}(P_{XY}, XY)] + $

$P(X\overline{Y})[\alpha_1 \mathfrak{I}(P_X, X\overline{Y}) + \alpha_2\mathfrak{I}(P_{X\overline{Y}}, X\overline{Y})] + $

$P(\overline{X}Y)[\alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + \alpha_2\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y)] + $

$P(\overline{X}\overline{Y})[\alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y}) + \alpha_2\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})]$

But it's easy to see that this is equal to:

$\alpha_1[P(XY)\mathfrak{I}(P_X, XY) + P(X\overline{Y})\mathfrak{I}(P_X, X\overline{Y}) + $

$P(\overline{X}Y)\mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + P(\overline{X}\overline{Y})\mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y})] + $

$\alpha_2[P(XY)\mathfrak{I}(P_{XY}, XY) + P(X\overline{Y})\mathfrak{I}(P_{X\overline{Y}}, X\overline{Y}) + $

$P(\overline{X}Y)\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y) + P(\overline{X}\overline{Y})\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})]$

Now, this is the weighted sum of the expected inaccuracies of the two parts of my updating plan taken separately; the part that kicks in at $t_1$, and the part that kicks in at $t_2$. And, thanks to Greaves and Wallace's result, we know that each of those expected inaccuracies is minimized by the rule that demands you condition on your evidence. Now, we also know that conditioning $P$ on $XY$ is the same as conditioning $P(-|X)$ on $XY$, and so on. So a rule that tells you, at $t_2$, to update your $t_0$ credence function on your total evidence at $t_2$ is also one that tells you, at $t_2$, to update your $t_1$ credence function on your total evidence at $t_2$. So, of the updating rules that cover the two times $t_1$ and $t_2$, the one that minimizes expected inaccuracy is the one that results from conditioning at each time. That is, if the part of $\mathcal{R}$ that kicks in at $t_1$ doesn't demand I condition my prior on my evidence at $t_1$, or if the part of $\mathcal{R}$ that kicks in at $t_2$ doesn't demand I condition my credence function at $t_1$ on my evidence at $t_2$, then there is an alternative rule $\mathcal{R}^\star$, that $P$ expects to be more accurate: that is,$$\sum_w P(w)\mathfrak{I}(\mathcal{R}^\star, w) < \sum_w P(w)\mathfrak{I}(\mathcal{R}, w)$$And, as I mentioned above, it's clear how to generalize this to cover not just updating plans that cover two different times at which you receive evidence, but any finite number.
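
For a brute-force check of the two-step case, here is a sketch under stated assumptions (the Brier score, a randomly chosen prior over the four worlds, and equal weights $\alpha_1 = \alpha_2$; all the names are mine):

```python
import random

WORLDS = ["XY", "Xy", "xY", "xy"]      # lower case means the proposition is false

def brier(credences, w):
    return sum((int(v == w) - credences[v])**2 for v in WORLDS)

def normalize(d):
    total = sum(d.values())
    return {w: v / total for w, v in d.items()}

def condition(p, event):               # event is a set of worlds
    return normalize({w: (p[w] if w in event else 0) for w in WORLDS})

random.seed(0)
prior = normalize({w: random.random() for w in WORLDS})
a1, a2 = 1, 1                          # the weights alpha_1 and alpha_2

def expected_plan_inaccuracy(plan_t1, plan_t2):
    """plan_t1 maps the t_1 evidence ('X' or 'x') to a credence function;
    plan_t2 maps the t_2 evidence (a single world) to a credence function."""
    return sum(prior[w] * (a1 * brier(plan_t1[w[0]], w) + a2 * brier(plan_t2[w], w))
               for w in WORLDS)

# The plan that conditions on the total evidence at both times
bc_t1 = {"X": condition(prior, {"XY", "Xy"}), "x": condition(prior, {"xY", "xy"})}
bc_t2 = {w: condition(prior, {w}) for w in WORLDS}
bc_score = expected_plan_inaccuracy(bc_t1, bc_t2)

# The prior never expects a perturbed, non-conditionalizing plan to do better
for _ in range(1000):
    alt_t1 = {e: normalize({w: x + random.uniform(0, 0.3) for w, x in f.items()})
              for e, f in bc_t1.items()}
    alt_t2 = {w: normalize({v: x + random.uniform(0, 0.3) for v, x in f.items()})
              for w, f in bc_t2.items()}
    assert bc_score <= expected_plan_inaccuracy(alt_t1, alt_t2)
```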

However, I think Douven would not be entirely moved by this. After all, while he is certainly interested in the long-run effects on inaccuracy of using one updating rule or another, he thinks that looking only to expected inaccuracy is a mistake. He thinks that we care about other features of updating rules. Indeed, he provides us with one, and uses computer simulations to show that, in the toy coin tossing case that we've been using, the explanationist account has that desirable feature to a greater degree than the Bayesian account.

For each possible bias value, we ran 1000 simulations of a sequence of 1000 tosses. As previously, the explanationist and the Bayesian updated their degrees of belief after each toss. We registered in how many of those 1000 simulations the explanationist incurred a lower penalty than the Bayesian at various reference points [100 tosses, 250, 500, 750, 1000], at which we calculated both Brier penalties and log score penalties. The outcomes [...] show that, on either measure of inaccuracy, IBE is most often the winner—it incurs the lowest penalty -- at each reference point. Hence, at least in the present kind of context, IBE seems a better choice than Bayes' rule. (page 439)
How can we square this with the Greaves and Wallace result? Well, as Douven goes on to explain: "[the explanationist rule] in general achieves greater accuracy than [the Bayesian], even if typically not much greater accuracy"; but "[the Bayesian rule] is less likely than [explanationist rule] to ever make one vastly inaccurate, even though the former typically makes one somewhat more inaccurate than the latter." So the explanationist is most often more accurate, but when it is more accurate, it's only a little more, while when it is less accurate, it's a lot less. So, in expectation, the Bayesian rule wins. Douven then argues that you might be more interested in being more likely to be more accurate, rather than being expectedly more accurate.

Perhaps. But in any case there's another accuracy argument for the Bayesian way of updating that doesn't assume that expected inaccuracy is the thing you want to minimize. This is an argument that Ray Briggs and I gave a couple of years ago. I'll illustrate it in the same setting we used above, where we have prior $P$, at $t_1$ we'll learn $X$ or $\overline{X}$, and at $t_2$ we'll learn $XY$, $X\overline{Y}$, $\overline{X}Y$, or $\overline{X} \overline{Y}$. And we measure the inaccuracy of an updating rule $\mathcal{R} = (P_X, P_{\overline{X}}, P_{XY}, P_{X\overline{Y}}, P_{\overline{X}Y}, P_{\overline{X}\overline{Y}})$ for this as follows:
$$\begin{eqnarray*}
\mathfrak{I}(\mathcal{R}, XY) &  = & \alpha_1 \mathfrak{I}(P_X, XY) + \alpha_2\mathfrak{I}(P_{XY}, XY) \\
\mathfrak{I}(\mathcal{R}, X\overline{Y}) &  = & \alpha_1 \mathfrak{I}(P_X, X\overline{Y}) + \alpha_2\mathfrak{I}(P_{X\overline{Y}}, X\overline{Y}) \\
\mathfrak{I}(\mathcal{R}, \overline{X}Y) &  = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + \alpha_2\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y) \\
\mathfrak{I}(\mathcal{R}, \overline{X}\overline{Y}) &  = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y}) + \alpha_2\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})
\end{eqnarray*}$$Then the following is true: if the part of my plan that kicks in at $t_1$ doesn't demand I condition my prior on my evidence at $t_1$, or if the part of my plan that kicks in at $t_2$ doesn't demand I condition my $t_1$ credence function on my evidence at $t_2$, then, for any $0 < \beta < 1$, there is an alternative prior $P^\star$ and its associated Bayesian updating rule $\mathcal{R}^\star$, such that, for all worlds $w$,$$\beta\mathfrak{I}(P^\star, w) + (1-\beta)\mathfrak{I}(\mathcal{R}^\star, w) < \beta \mathfrak{I}(P, w) + (1-\beta)\mathfrak{I}(\mathcal{R}, w)$$And, again, this result generalizes to cases that include any number of times at which we receive new evidence, and in which, at each time, the set of propositions we might receive as evidence forms a partition. So it certainly covers the case of the coin of unknown bias that we've been using throughout. So, if you plan to update in some way other than by Bayesian conditionalization starting with your prior, there is an alternative prior and plan that, taken together, is guaranteed to have greater accuracy than yours; that is, they will have greater total accuracy than yours however the world turns out.

How do we square this with Douven's simulation results? The key is that this dominance result includes the prior in it. It does not say that, if $\mathcal{R}$ requires you not to condition $P$ on your evidence at any point, then a rule that does require that is guaranteed to be better. It says that if $\mathcal{R}$ requires you not to condition $P$ on your evidence at any point, then there is an alternative prior $P^\star$ such that it, together with a rule that requires you to condition it on your evidence, are better than $P$ and $\mathcal{R}$ for sure. Douven's results compare the performance of conditioning on $P$ with the performance of applying the explanationist update to $P$. The dominance result shows that, while conditioning on $P$ might not always give a better result than applying the explanationist rule to $P$, there is an alternative prior such that conditioning on it is guaranteed to be better than retaining the original prior and applying the explanationist rule to that. And that, I think, is the reason we should prefer conditioning on our evidence to giving the little explanationist boosts that Douven suggests. If we update by conditioning, our prior and update rule, taken together, are never accuracy dominated; if we update using Douven's explanationist rule, our prior and update rule, taken together, are accuracy dominated.

Before wrapping up, it's worth mentioning that there's a little wrinkle to iron out. It might be that, while the original prior and the posteriors it generates at the various times all satisfy the Principal Principle, the dominating prior and updating rule don't. While being dominated is clearly bad, you might think that being dominated by something that is itself irrational -- because it violates the Principal Principle, or for other reasons -- isn't so bad. But in fact we can tweak things to avoid this situation. The following is true: if the part of my plan that kicks in at $t_1$ doesn't demand I condition my prior on my evidence at $t_1$, or if the part of my plan that kicks in at $t_2$ doesn't demand I condition my $t_1$ credence function on my evidence at $t_2$, then, for any $0 < \beta < 1$, there is an alternative prior $P^\star$ and its associated Bayesian updating rule $\mathcal{R}^\star$, such that, $P^\star$ obeys the Principal Principle and, for all possible objective chance functions $ch$,

$\beta\sum_{w} ch(w) \mathfrak{I}(P^\star, w) + (1-\beta)\sum_{w} ch(w) \mathfrak{I}(\mathcal{R}^\star, w) < $

$\beta \sum_{w} ch(w) \mathfrak{I}(P, w) + (1-\beta)\sum_{w} ch(w) \mathfrak{I}(\mathcal{R}, w)$

So I'm inclined to think that Douven's critique of the Dutch Book argument against the explanationist updating rule hits the mark; and I can see why he thinks the expected accuracy argument against it is also less than watertight; but I think the accuracy dominance argument against it is stronger. We shouldn't use that updating rule, with its extra boost for explanatory hypotheses, because if we do so, there will be an alternative prior such that applying the Bayesian updating rule to that prior is guaranteed to be more accurate than applying the explanationist rule to our actual prior.

Tuesday, 11 August 2020

The only symmetric inaccuracy measure is the Brier score

If you'd like a PDF of this post, see here. 

[UPDATE 1: I should have made this clear in the original post. The Normality condition makes the proof go through more easily, but it isn't really necessary. Suppose we simply assume instead that $$\mathfrak{I}(w^i, j)= \left \{ \begin{array}{ll} b & \mbox{if } i \neq j \\ a & \mbox{if } i = j \end{array} \right.$$Then we can show that, if $\mathfrak{I}$ is symmetric then, for any probabilistic credence function $p$ and any world $w_i$,$$\mathfrak{I}(p, i) = (b-a)\frac{1}{2} \left (1 - 2p_i + \sum_j p^2_j \right ) + a$$End Update 1.]

[UPDATE 2: There's something puzzling about the result below. Suppose $\mathcal{W} = \{w_1, \ldots, w_n\}$ is the set of possible worlds. And suppose $\mathcal{F}$ is the full algebra of propositions built out of those worlds. That is, $\mathcal{F}$ is the set of subsets of $\mathcal{W}$. Then there are two versions of the Brier score over a probabilistic credence function $p$ defined on $\mathcal{F}$. The first considers only the credences that $p$ assigns to the possible worlds. Thus,$$\mathfrak{B}(p, i) = \sum^n_{j=1} (w^i_j - p_j)^2 = 1 - 2p_i + \sum_j p^2_j$$But there is another that considers also the credences that $p$ assigns to the other propositions in $\mathcal{F}$. Thus,$$\mathfrak{B}^\star(p, i) = \sum_{X \in \mathcal{F}} (w_i(X) - p(X))^2$$Now, at first sight, these look related, but not very closely. However, notice that both are symmetric. Thus, by the extension of Selten's theorem below (plus update 1 above), if $\mathfrak{I}(w^i, j) = b$ for $i \neq j$ and 0 for $i = j$, then $\mathfrak{I}(p, i) = \frac{1}{2}b\mathfrak{B}(p, i)$. Now, $\mathfrak{B}(w^i, j) = 2$ for $i \neq j$, and $\mathfrak{B}(w^i, j) = 0$ for $i = j$, and so this checks out. But what about $\mathfrak{B}^\star$? Well, according to our extension of Selten's theorem, since $\mathfrak{B}^\star$ is symmetric, we can see that it is just a multiple of $\mathfrak{B}$, the factor determined by $\mathfrak{B}^\star(w^i, j)$. So what is this number? Well, it turns out that, if $i \neq j$, then$$\mathfrak{B}^\star(w^i, j) = 2\sum^{n-2}_{k=0} {n-2 \choose k}$$Thus, it follows that$$\mathfrak{B}^\star(p, i) = \sum^{n-2}_{k=0} {n-2 \choose k}\mathfrak{B}(p, i)$$And you can verify this by other means as well. This is quite a nice result independently of all this stuff about symmetry. After all, there doesn't seem any particular reason to favour $\mathfrak{B}$ over $\mathfrak{B}^\star$ or vice versa. This result shows that using one for the sorts of purposes we have in accuracy-first epistemology won't give different results from using the other. End update 2.]
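
Here is a quick numerical check of that last identity (my own sketch; only the two definitions of the Brier score are taken from the discussion above):

```python
import itertools
import random

n = 4                                  # number of possible worlds
worlds = range(n)
p = [random.random() for _ in worlds]
total = sum(p)
p = [x / total for x in p]             # a random probabilistic credence function

def brier(p, i):                       # B: scores only the credences in the worlds
    return sum((int(j == i) - p[j])**2 for j in worlds)

def brier_star(p, i):                  # B*: scores every proposition in the algebra
    total = 0.0
    for r in range(n + 1):
        for X in itertools.combinations(worlds, r):
            credence = sum(p[j] for j in X)
            total += (int(i in X) - credence)**2
    return total

for i in worlds:
    print(brier_star(p, i), 2**(n - 2) * brier(p, i))   # the two columns agree
```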

So, as is probably obvious, I've been trying recently to find out what things look like in accuracy-first epistemology if you drop the assumption that the inaccuracy of a whole credal state is the sum of the inaccuracies of the individual credences that it comprises --- this assumption is sometimes called Additivity or Separability. In this post, I want to think about a result concerning additive inaccuracy measures that intrigued me in the past and on the basis of which I tried to mount an argument in favour of the Brier score. The result dates back to Reinhard Selten, the German economist who shared the 1994 Nobel prize with John Harsanyi and John Nash for his contributions to game theory. In this post, I'll show that the result goes through even if we don't assume additivity.

Suppose $\mathfrak{I}$ is an inaccuracy measure. Thus, if $c$ is a credence function defined on the full algebra built over the possible worlds $w_1, \ldots, w_n$, then $\mathfrak{I}(c, i)$ measures the inaccuracy of $c$ at world $w_i$. Then define the following function on pairs of probabilistic credence functions:$$\mathfrak{D}_\mathfrak{I}(p, q) = \sum_i p_i \mathfrak{I}(q, i) - \sum_i p_i\mathfrak{I}(p, i)$$$\mathfrak{D}_\mathfrak{I}$ measures how much more inaccurate $p$ expects $q$ to be than it expects itself to be; equivalently, how much more accurate $p$ expects itself to be than it expects $q$ to be. Now, if $\mathfrak{I}$ is strictly proper, $\mathfrak{D}_\mathfrak{I}$ is positive whenever $p$ and $q$ are different, and zero when they are the same, so in that case $\mathfrak{D}_\mathfrak{I}$ is a divergence. But we won't be assuming that here -- rather remarkably, we don't need to.

Now, it's not hard to see that $\mathfrak{D}_\mathfrak{I}$ is not necessarily symmetric. For instance, consider the log score$$\mathfrak{L}(p, i) = -\log p_i$$Then$$\mathfrak{D}_\mathfrak{L}(p, q) = \sum_i p_i \log \frac{p_i}{q_i}$$This is the so-called Kullback-Leibler divergence and it is not symmetric. Nonetheless, it's equally easy to see that it is at least possible for $\mathfrak{D}_\mathfrak{I}$ to be symmetric. For instance, consider the Brier score$$\mathfrak{B}(p, i) = 1-2p_i + \sum_j p^2_j$$Then$$\mathfrak{D}_\mathfrak{B}(p, q) = \sum_i (p_i - q_i)^2$$So the natural question arises: how many inaccuracy measures are symmetric in this way? That is, how many generate symmetric divergences in the way that the Brier score does? It turns out: none, except the Brier score.
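
For concreteness, here is a quick numerical illustration of the two cases (a sketch; the helper names are mine):

```python
import random
from math import log

random.seed(1)
n = 5

def random_probability():
    xs = [random.random() for _ in range(n)]
    total = sum(xs)
    return [x / total for x in xs]

p, q = random_probability(), random_probability()

def brier(r, i):
    return 1 - 2 * r[i] + sum(x * x for x in r)

def log_score(r, i):
    return -log(r[i])

def divergence(inaccuracy, p, q):
    return (sum(p[i] * inaccuracy(q, i) for i in range(n))
            - sum(p[i] * inaccuracy(p, i) for i in range(n)))

print(divergence(brier, p, q), divergence(brier, q, p))          # equal
print(divergence(log_score, p, q), divergence(log_score, q, p))  # generally unequal
```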

First, a quick bit of notation: Given a possible world $w_i$, we write $w^i$ for the probabilistic credence function that assigns credence 1 to world $w_i$ and 0 to any world $w_j$ with $j \neq i$. 

And two definitions:

Definition (Normal inaccuracy measure) An inaccuracy measure $\mathfrak{I}$ is normal if $$\mathfrak{I}(w^i, j) = \left \{ \begin{array}{ll} 1 & \mbox{if } i \neq j \\ 0 & \mbox{if } i = j \end{array} \right.$$

Definition (Symmetric inaccuracy measure) An inaccuracy measure is symmetric if $$\mathfrak{D}_\mathfrak{I}(p, q) = \mathfrak{D}_\mathfrak{I}(q, p)$$for all probabilistic credence functions $p$ and $q$.

Thus,  $\mathfrak{I}$ is symmetric if, for any probability functions $p$ and $q$, the loss of accuracy that $p$ expects to suffer by moving to $q$ is the same as the loss of accuracy that $q$ expects to suffer by moving to $p$.

Theorem The only normal and symmetric inaccuracy measure agrees with half the Brier score (and so with the Brier score up to a positive multiplicative constant) for probabilistic credence functions.

Proof. (This just adapts Selten's proof in exactly the way you'd expect.) Suppose $\mathfrak{D}_\mathfrak{I}(p, q) = \mathfrak{D}_\mathfrak{I}(q, p)$ for all probabilistic $p$, $q$. Then, in particular, for any world $w_i$ and any probabilistic $p$,$$\sum_j w^i_j \mathfrak{I}(p, j) - \sum_j w^i_j \mathfrak{I}(w^i, j) = \sum_j p_j \mathfrak{I}(w^i, j) -\sum_j p_j \mathfrak{I}(p, j)$$So, by Normality,$$\mathfrak{I}(p, i) = (1-p_i) - \sum_j p_j \mathfrak{I}(p, j)$$So, multiplying both sides by $p_i$ and summing over $i$,$$\sum_j p_j \mathfrak{I}(p, j) = 1 - \sum_j p^2_j- \sum_j p_j \mathfrak{I}(p, j)$$So,$$\sum_j p_j \mathfrak{I}(p, j) = \frac{1}{2}[1 - \sum_j p^2_j]$$So,$$\mathfrak{I}(p, i) = 1-p_i -\frac{1}{2}[1 - \sum_j p^2_j] = \frac{1}{2} \left (1 - 2p_i + \sum_j p^2_j \right )$$as required. $\Box$

There are a number of notable features of this result:

First, the theorem does not assume that the inaccuracy measure is strictly proper, but since the Brier score is strictly proper, it follows that symmetry entails strict propriety.

Second, the theorem does not assume additivity, but since the Brier score is additive, it follows that symmetry entails additivity.



Monday, 10 August 2020

The Accuracy Dominance Argument for Conditionalization without the Additivity assumption

 For a PDF of this post, see here.

Last week, I explained how you can give an accuracy dominance argument for Probabilism without assuming that your inaccuracy measures are additive -- that is, without assuming that the inaccuracy of a whole credence function is obtained by adding up the inaccuracy of all the individual credences that it assigns. The mathematical result behind that also allows us to give my chance dominance argument for the Principal Principle without assuming additivity, and ditto for my accuracy-based argument for linear pooling. In this post, I turn to another Bayesian norm, namely, Conditionalization. The first accuracy argument for this was given by Hilary Greaves and David Wallace, building on ideas developed by Graham Oddie. It was an expected accuracy argument, and it didn't assume additivity. More recently, Ray Briggs and I offered an accuracy dominance argument for the norm, and we did assume additivity. It's this latter argument I'd like to consider here. I'd like to show how it goes through even without assuming additivity. And indeed I'd like to generalise it at the same time. The generalisation is inspired by a recent paper by Michael Rescorla. In it, Rescorla notes that all the existing arguments for Conditionalization assume that, when your evidence comes in the form of a proposition learned with certainty, that proposition must be true. He then offers a Dutch Book argument for Conditionalization that doesn't make this assumption, and he issues a challenge for other sorts of arguments to do the same. Here, I take up that challenge. To do so, I will offer an argument for what I call the Weak Reflection Principle.

Weak Reflection Principle (WRP) Your current credence function should be a linear combination of the possible future credence functions that you endorse.

A lot might happen between now and tomorrow. I might see new sights, think new thoughts; I might forget things I know today, take mind-altering drugs that enhance or impair my thinking; and so on. So perhaps there is a set of credence functions I think I might have tomorrow. Some of those I'll endorse -- perhaps those that I'd get if I saw certain new things, or enhanced my cognition in various ways. And some of them I'll disavow -- perhaps those that I'd get if I forget certain things, or impaired my cognition. WRP asks you to separate out the wheat from the chaff, and once you've identified the ones you endorse, it tells you that your current credence function should lie within the span of those future ones; it should be in their convex hull; it should be a weighted sum or convex combination of them.

One nice thing about WRP is that it gives back Conditionalization in certain cases. Suppose $c^0$ is my current credence function. Suppose I know that between now and tomorrow I'll learn exactly one member of the partition $E_1, \ldots, E_m$ with certainty --- this is the situation that Greaves and Wallace envisage. And suppose I endorse credence function $c^1$ as a response to learning $E_1$, $c^2$ as a response to learning $E_2$, and so on. Then, if I satisfy WRP, and if each $c^k(E_k) = 1$ (since I will, after all, have learned $E_k$ with certainty), it follows that, whenever $c^0(E_k) > 0$, $c^k(X) = c^0(X | E_k)$, which is exactly what Conditionalization asks of you. And notice that at no point did we assume that if I learn $E_k$, then $E_k$ is true. So we've answered Rescorla's challenge if we can establish WRP.
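
To spell out the reasoning behind that claim: suppose $c^0 = \sum^m_{j=1} \lambda_j c^j$ with each $\lambda_j \geq 0$ and $\sum^m_{j=1} \lambda_j = 1$, suppose each $c^j(E_j) = 1$, and recall that $E_1, \ldots, E_m$ form a partition, so that $c^j(E_k) = 0$ for $j \neq k$. Then$$\begin{eqnarray*}
c^0(E_k) & = & \sum^m_{j=1} \lambda_j c^j(E_k) = \lambda_k \\
c^0(X \cap E_k) & = & \sum^m_{j=1} \lambda_j c^j(X \cap E_k) = \lambda_k c^k(X \cap E_k) = \lambda_k c^k(X)
\end{eqnarray*}$$where the last step holds because $c^k(E_k) = 1$, so $c^k$ gives no credence to the part of $X$ outside $E_k$. So, whenever $c^0(E_k) = \lambda_k > 0$,$$c^0(X | E_k) = \frac{\lambda_k c^k(X)}{\lambda_k} = c^k(X)$$which is exactly the Conditionalization requirement.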

To do that, we need Theorem 1 below. And to get there, we need to go via Lemmas 1 and 2. Just to remind ourselves of the framework:

  • $w_1, \ldots, w_n$ are the possible worlds;
  • credence functions are defined on the full algebra built on top of these possible worlds;
  • given a credence function $c$, we write $c_i$ for the credence that $c$ assigns to $w_i$.  

Lemma 1 If $c^0$ is not in the convex hull of $c^1, \ldots, c^m$, then $(c^0, c^1, \ldots, c^m)$ is not in the convex hull of $\mathcal{X}$, where$$\mathcal{X} := \{(w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m) : 1 \leq i \leq n\ \&\ 1 \leq k \leq m\}$$

Definition 1 Suppose $\mathfrak{I}$ is a continuous strictly proper inaccuracy measure. Then let$$\mathfrak{D}_\mathfrak{I}((p^0, p^1, \ldots, p^m), (c^0, c^1, \ldots, c^m)) = \sum^m_{k=0} \left ( \sum^n_{i=1} p^k_i \mathfrak{I}(c^k, i) - \sum^n_{i=1} p^k_i \mathfrak{I}(p^k, i) \right )$$

Lemma 2 Suppose $\mathfrak{I}$ is a continuous strictly proper inaccuracy measure. Suppose $\mathcal{X}$ is a closed convex set of $(m+1)$-tuples of probabilistic credence functions. And suppose $(c^0, c^1, \ldots, c^m)$ is not in $\mathcal{X}$. Then there is $(q^0, q^1, \ldots, q^m)$ in $\mathcal{X}$ such that 

(i) for all $(p^0, p^1, \ldots, p^m) \neq (q^0, q^1, \ldots, q^m)$ in $\mathcal{X}$,

$\mathfrak{D}_\mathfrak{I}((q^0, q^1, \ldots, q^m), (c^0, c^1, \ldots, c^m)) <$

$\mathfrak{D}_\mathfrak{I}((p^0, p^1, \ldots, p^m), (c^0, c^1, \ldots, c^m))$;

(ii) for all $(p^0, p^1, \ldots, p^n)$ in $\mathcal{X}$,

$\mathfrak{D}_\mathfrak{I}((p^0, p^1, \ldots, p^m), (c^0, c^1, \ldots, c^m)) \geq$

$\mathfrak{D}_\mathfrak{I}((p^0, p^1, \ldots, p^m), (q^0, q^1, \ldots, q^m))  +$

$\mathfrak{D}_\mathfrak{I}((q^0, q^1, \ldots, q^m), (c^0, c^1, \ldots, c^m))$.

Theorem 1 Suppose each of $c^0, c^1, \ldots, c^m$ is a probabilistic credence function. If $c^0$ is not in the convex hull of $c^1, \ldots, c^m$, then there are probabilistic credence functions $q^0, q^1, \ldots, q^m$ such that for all worlds $w_i$ and $1 \leq k \leq m$,$$\mathfrak{I}(q^0, i) + \mathfrak{I}(q^k, i) < \mathfrak{I}(c^0, i) + \mathfrak{I}(c^k, i)$$ 

Let's keep the proofs on ice for a moment. What exactly does this show? It says that, if you violate WRP, there is an alternative to your current credence function and, for each of the future credence functions you endorse, an alternative to it, such that you are guaranteed to be less accurate overall (your accuracy now plus your accuracy later) having your current credence function followed by one of your endorsed future credence functions than you would be having the alternative to your current credence function followed by the alternative to that endorsed future credence function. This, I claim, establishes WRP.

Now for the proofs.

Proof of Lemma 1. We prove the contrapositive. Suppose $(c^0, c^1, \ldots, c^m)$ is in the convex hull of $\mathcal{X}$. Then there are $0 \leq \lambda_{i, k} \leq 1$ such that $\sum^n_{i=1}\sum^m_{k=1}  \lambda_{i, k} = 1$ and$$(c^0, c^1, \ldots, c^m) = \sum^n_{i=1} \sum^m_{k=1}  \lambda_{i, k} (w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m)$$Thus,$$c^0 = \sum^n_{i=1}\sum^m_{k=1}  \lambda_{i,k} w^i$$
and$$c^k = \sum^n_{i=1} \lambda_{i, k} w^i +  \sum^n_{i=1} \sum_{l \neq k} \lambda_{i, l} c^k$$So$$\left(\sum^n_{i=1} \lambda_{i, k}\right) c^k =  \sum^n_{i=1} \lambda_{i, k} w^i$$So let $\lambda_k =  \sum^n_{i=1} \lambda_{i, k}$, and note that $\sum^m_{k=1}\lambda_k = 1$. Then, for $1 \leq k \leq m$,$$\lambda_k c^k = \sum^n_{i=1} \lambda_{i, k} w^i$$And thus$$\sum^m_{k=1} \lambda_k c^k = \sum^m_{k=1} \sum^n_{i=1} \lambda_{i, k} w^i = c^0$$So $c^0$ is in the convex hull of $c^1, \ldots, c^m$, as required. $\Box$

Proof of Lemma 2. This proceeds exactly like the corresponding theorem from the previous blogpost. $\Box$

Proof of Theorem 1. Suppose $c^0$ is not in the convex hull of $c^1, \ldots, c^m$. Then, by Lemma 1, $(c^0, c^1, \ldots, c^m)$ is not in the convex hull of $\mathcal{X}$, which is a closed convex set of $(m+1)$-tuples of probabilistic credence functions. So, by Lemma 2, there is $(q^0, q^1, \ldots, q^m)$ in that convex hull such that, for all $(p^0, p^1, \ldots, p^m)$ in it,$$\mathfrak{D}((p^0, p^1, \ldots, p^m), (q^0, q^1, \ldots, q^m)) < \mathfrak{D}((p^0, p^1, \ldots, p^m), (c^0, c^1, \ldots, c^m))$$This follows from Lemma 2(ii), since $\mathfrak{D}$ is a divergence and $(c^0, c^1, \ldots, c^m)$ lies outside the hull, so that $\mathfrak{D}((q^0, \ldots, q^m), (c^0, \ldots, c^m)) > 0$. In particular, for any world $w_i$ and $1 \leq k \leq m$,

$$\mathfrak{D}((w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m), (q^0, q^1, \ldots, q^m)) < \mathfrak{D}((w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m), (c^0, c^1, \ldots, c^m))$$

But$$\begin{eqnarray*}
& & \mathfrak{I}(q^0, i) + \mathfrak{I}(q^k, i) \\
& = & \mathfrak{D}(w^i, q^0) + \mathfrak{D}(w^i, q^k) \\
& \leq & \mathfrak{D}((w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m), (q^0, q^1, \ldots, q^m)) \\
& < & \mathfrak{D}((w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m), (c^0, c^1, \ldots, c^m)) \\
& = &  \mathfrak{D}(w^i, c^0) + \mathfrak{D}(w^i, c^k) \\
& = & \mathfrak{I}(c^0, i) + \mathfrak{I}(c^k, i)
\end{eqnarray*}$$as required.

 

Friday, 7 August 2020

The Accuracy Dominance Argument for Probabilism without the Additivity assumption

For a PDF of this post, see here.

One of the central arguments in accuracy-first epistemology -- the one that gets the project off the ground, I think -- is the accuracy-dominance argument for Probabilism. This started life in a more pragmatic guise in de Finetti's proof that, if your credences are not probabilistic, there are alternatives that would lose less than yours would if they were penalised using the Brier score, which levies a price of $(1-x)^2$ on every credence $x$ in a truth and $x^2$ on every credence $x$ in a falsehood. This was then adapted into an accuracy-based argument by Roger Rosenkrantz, who interpreted the Brier score as a measure of inaccuracy, not a penalty score. Interpreted thus, de Finetti's result says that any non-probabilistic credences are accuracy-dominated by some probabilistic credences. Jim Joyce then noted that this argument only establishes Probabilism if you have a further argument that inaccuracy should be measured by the Brier score. He thought there was no particular reason to think that's right, so he greatly generalized de Finetti's result to show that, relative to a much wider range of inaccuracy measures, all non-probabilistic credences are accuracy dominated. One problem with this, which Al Hájek pointed out, was that he didn't give a converse argument -- that is, he didn't show that, for each of his inaccuracy measures, each probabilistic credence function is not accuracy dominated. Joel Predd and his Princeton collaborators then addressed this concern and proved a very general result, namely, that for any additive, continuous, and strictly proper inaccuracy measure, any non-probabilistic credences are accuracy-dominated, while no probabilistic credences are.
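To see the Brier-score version of this in action, here's a tiny sketch; the non-probabilistic credence function and its probabilistic rival are my own choices, and the rival happens to be the Euclidean projection of the credences onto the probability simplex:

```python
import numpy as np

# Brier inaccuracy of credence function c (over a partition) at world i
def brier(c, i):
    truth = np.eye(len(c))[i]
    return float(np.sum((truth - c) ** 2))

c = np.array([0.6, 0.6])        # non-probabilistic: credences in X and in not-X sum to 1.2
c_star = np.array([0.5, 0.5])   # a probabilistic alternative

for i in range(len(c)):
    print(brier(c_star, i), "<", brier(c, i))   # 0.5 < 0.52 at both worlds
```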

That brings us to this blogpost. Additivity is a controversial claim. It says that the inaccuracy of a credence function is the (possibly weighted) sum of the inaccuracies of the credences it assigns. So the question arises: can we do without additivity? In this post, I'll give a quick proof of the accuracy-dominance argument that doesn't assume anything about the inaccuracy measures other than that they are continuous and strictly proper. Anyone familiar with the Predd, et al. paper will see that the proof strategy draws very heavily on theirs. But it bypasses the construction of the Bregman divergence that corresponds to the strictly proper inaccuracy measure. For that, you'll have to wait for Jason Konek's forthcoming work...

Suppose:
  • $\mathcal{F}$ is a set of propositions;
  • $\mathcal{W} = \{w_1, \ldots, w_n\}$ is the set of possible worlds relative to $\mathcal{F}$;
  • $\mathcal{C}$ is the set of credence functions on $\mathcal{F}$;
  • $\mathcal{P}$ is the set of probability functions on $\mathcal{F}$. So, by de Finetti's theorem, $\mathcal{P} = \{v_w : w \in \mathcal{W}\}^+$, the convex hull of the omniscient credence functions $v_w$. If $p$ is in $\mathcal{P}$, we write $p_i$ for $p(w_i)$.
Theorem Suppose $\mathfrak{I}$ is a continuous strictly proper inaccuracy measure defined on the credence functions on $\mathcal{F}$. Then if $c$ is not in $\mathcal{P}$, there is $c^\star$ in $\mathcal{P}$ such that, for all $w_i$ in $\mathcal{W}$,
$$
\mathfrak{I}(c^\star, w_i) < \mathfrak{I}(c, w_i)
$$

Proof. We begin by defining a divergence $\mathfrak{D} : \mathcal{P} \times \mathcal{C} \rightarrow [0, \infty]$ that takes a probability function $p$ and a credence function $c$ and measures the divergence from the former to the latter:
$$
\mathfrak{D}(p, c) = \sum_i p_i  \mathfrak{I}(c, w_i) - \sum_i p_i \mathfrak{I}(p, w_i)
$$
Three quick points about $\mathfrak{D}$.

(1) $\mathfrak{D}$ is a divergence. Since $\mathfrak{I}$ is strictly proper, $\mathfrak{D}(p, c) \geq 0$ with equality iff $c = p$.

(2) $\mathfrak{D}(v_{w_i}, c) = \mathfrak{I}(c, w_i) - \mathfrak{I}(v_{w_i}, w_i)$, for all $w_i$ in $\mathcal{W}$. If we assume, as is standard, that the omniscient credence function is perfectly accurate at its own world, so that $\mathfrak{I}(v_{w_i}, w_i) = 0$, this is just $\mathfrak{I}(c, w_i)$. (Nothing below turns on this assumption: without it, the constant $\mathfrak{I}(v_{w_i}, w_i)$ simply cancels in the final comparison.)

(3) $\mathfrak{D}$ is strictly convex in its first argument.  Suppose $p$ and $q$ are in $\mathcal{P}$ with $p \neq q$, and suppose $0 < \lambda < 1$. Then let $r = \lambda p + (1-\lambda) q$. Since $\mathfrak{I}$ is strictly proper, $\sum_i p_i\mathfrak{I}(x, w_i)$ is uniquely minimized, as a function of $x$, at $x = p$, and $\sum_i q_i\mathfrak{I}(x, w_i)$ is uniquely minimized, as a function of $x$, at $x = q$. And $r \neq p$ and $r \neq q$. So we have$$\begin{eqnarray*}
\sum_i p_i \mathfrak{I}(p, w_i) & < & \sum_i p_i \mathfrak{I}(r, w_i) \\
\sum_i q_i \mathfrak{I}(q, w_i) & < & \sum_i q_i \mathfrak{I}(r, w_i)
\end{eqnarray*}$$Thus$$\lambda \left[-\sum_i p_i \mathfrak{I}(p, w_i)\right] + (1-\lambda) \left[-\sum_i q_i \mathfrak{I}(q, w_i)\right] > \lambda \left[-\sum_i p_i \mathfrak{I}(r, w_i)\right] + (1-\lambda) \left[-\sum_i q_i \mathfrak{I}(r, w_i)\right] = -\sum_i r_i \mathfrak{I}(r, w_i)$$Now, adding$$\lambda \sum_i p_i \mathfrak{I}(c, w_i) + (1-\lambda)\sum_i q_i\mathfrak{I}(c, w_i) = \sum_i (\lambda p_i + (1-\lambda)q_i) \mathfrak{I}(c, w_i) = \sum_i r_i \mathfrak{I}(c, w_i)$$to both sides gives$$\lambda \left[\sum_i p_i \mathfrak{I}(c, w_i)-\sum_i p_i \mathfrak{I}(p, w_i)\right]+ (1-\lambda) \left[\sum_i q_i\mathfrak{I}(c, w_i)-\sum_i q_i \mathfrak{I}(q, w_i)\right] > \sum_i r_i \mathfrak{I}(c, w_i)-\sum_i r_i \mathfrak{I}(r, w_i)$$That is,$$\lambda \mathfrak{D}(p, c) + (1-\lambda) \mathfrak{D}(q, c) > \mathfrak{D}(\lambda p + (1-\lambda)q, c)$$as required.
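Here's a quick numerical spot-check of (1) and (3), offered as a sanity check rather than anything probative: a sketch that uses the Brier score as the strictly proper inaccuracy measure and randomly sampled credence functions.

```python
import numpy as np

# Brier inaccuracy of credence function c (over the worlds) at world i
def brier(c, i):
    truth = np.eye(len(c))[i]
    return float(np.sum((truth - c) ** 2))

# D(p, c) = sum_i p_i I(c, w_i) - sum_i p_i I(p, w_i)
def D(p, c):
    return sum(p[i] * (brier(c, i) - brier(p, i)) for i in range(len(p)))

rng = np.random.default_rng(0)
for _ in range(10000):
    p, q = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))  # probabilistic
    c = rng.random(3)                                            # arbitrary credences
    lam = rng.uniform(0.01, 0.99)
    assert D(p, c) >= -1e-12                                                             # point (1)
    assert lam * D(p, c) + (1 - lam) * D(q, c) >= D(lam * p + (1 - lam) * q, c) - 1e-12  # point (3)
print("no counterexamples found")
```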

Now, suppose $c$ is not in $\mathcal{P}$. Then, since $\mathcal{P}$ is compact and convex, and since $\mathfrak{D}(x, c)$ is continuous and strictly convex as a function of $x$, there is a unique $c^\star$ in $\mathcal{P}$ that minimizes $\mathfrak{D}(x, c)$ as a function of $x$. Now, suppose $p$ is in $\mathcal{P}$. We wish to show that$$\mathfrak{D}(p, c) \geq \mathfrak{D}(p, c^\star) + \mathfrak{D}(c^\star, c)$$We can see that this holds iff$$\sum_i (p_i - c^\star_i) (\mathfrak{I}(c, w_i) - \mathfrak{I}(c^\star, w_i)) \geq 0$$After all,
$$\begin{eqnarray*}
& & \mathfrak{D}(p, c) - \mathfrak{D}(p, c^\star) - \mathfrak{D}(c^\star, c) \\
& = & [\sum_i p_i \mathfrak{I}(c, w_i) - \sum_i p_i \mathfrak{I}(p, w_i)] - \\
&& [\sum_i p_i \mathfrak{I}(c^\star, w_i) - \sum_i p_i \mathfrak{I}(p, w_i)] - \\
&& [\sum_i c^\star_i \mathfrak{I}(c, w_i) - \sum_i c^\star_i \mathfrak{I}(c^\star, w_i)] \\
& = & \sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(c^\star, w_i))
\end{eqnarray*}$$
Now we prove this inequality. If $p = c^\star$, it holds trivially, so suppose $p \neq c^\star$. We begin by observing that, since $p$ and $c^\star$ are in $\mathcal{P}$, since $\mathcal{P}$ is convex, and since $\mathfrak{D}(x, c)$ is minimized uniquely on $\mathcal{P}$ at $x = c^\star$, if $0 < \varepsilon < 1$, then$$\frac{1}{\varepsilon}[\mathfrak{D}(\varepsilon p + (1-\varepsilon) c^\star, c) - \mathfrak{D}(c^\star, c)] > 0$$Expanding that, we get$$\frac{1}{\varepsilon}\left[\sum_i (\varepsilon p_i + (1- \varepsilon) c^\star_i)\mathfrak{I}(c, w_i) - \sum_i (\varepsilon p_i + (1-\varepsilon)c^\star_i)\mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i) - \sum_i c^\star_i\mathfrak{I}(c, w_i) + \sum_i c^\star_i \mathfrak{I}(c^\star, w_i)\right] > 0$$

So$$\frac{1}{\varepsilon}\left[\sum_i ( c^\star_i + \varepsilon(p_i - c^\star_i))\mathfrak{I}(c, w_i) - \sum_i ( c^\star_i + \varepsilon(p_i-c^\star_i))\mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i) - \sum_i c^\star_i\mathfrak{I}(c, w_i) + \sum_i c^\star_i \mathfrak{I}(c^\star, w_i)\right] > 0$$

So$$\sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(\varepsilon p+ (1-\varepsilon) c^\star, w_i)) + \frac{1}{\varepsilon}\left[\sum_i c^\star_i \mathfrak{I}(c^\star, w_i) - \sum_ic^\star_i \mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i)\right] > 0$$

Now, since $\mathfrak{I}$ is strictly proper,
$$\frac{1}{\varepsilon}[\sum_i c^\star_i \mathfrak{I}(c^\star, w_i) - \sum_ic^\star_i \mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i)] < 0$$
So, for all $0 < \varepsilon < 1$,$$\sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(\varepsilon p+ (1-\varepsilon) c^\star, w_i)) > 0$$
So, since $\mathfrak{I}$ is continuous, letting $\varepsilon \rightarrow 0$ gives$$\sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(c^\star, w_i)) \geq 0$$which is what we wanted to show. So, by the above,$$\mathfrak{D}(p,c) \geq \mathfrak{D}(p, c^\star) + \mathfrak{D}(c^\star, c) $$In particular, since each $v_{w_i}$ is in $\mathcal{P}$,$$\mathfrak{D}(v_{w_i}, c) \geq \mathfrak{D}(v_{w_i}, c^\star) + \mathfrak{D}(c^\star, c)$$But, since $c^\star$ is in $\mathcal{P}$ and $c$ is not, and since $\mathfrak{D}$ is a divergence, $\mathfrak{D}(c^\star, c) > 0$. So$$\mathfrak{I}(c, w_i) = \mathfrak{D}(v_{w_i}, c) > \mathfrak{D}(v_{w_i}, c^\star) = \mathfrak{I}(c^\star, w_i)$$as required. $\Box$
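For what it's worth, the recipe in the proof (take the $c^\star$ that minimizes $\mathfrak{D}(x, c)$ over $\mathcal{P}$, then check that it accuracy-dominates $c$) is easy to run numerically. Here's a minimal sketch: the additive log score and the particular non-probabilistic $c$ are choices of mine for illustration, and I lean on scipy's SLSQP routine to do the constrained minimization.

```python
import numpy as np
from scipy.optimize import minimize

# Additive log score: inaccuracy of credence function c (over three worlds) at world i
def log_inacc(c, i):
    truth = np.eye(len(c))[i]
    return float(np.sum(-truth * np.log(c) - (1 - truth) * np.log(1 - c)))

# D(p, c) = sum_i p_i I(c, w_i) - sum_i p_i I(p, w_i)
def D(p, c):
    return sum(p[i] * (log_inacc(c, i) - log_inacc(p, i)) for i in range(len(p)))

c = np.array([0.7, 0.5, 0.2])   # non-probabilistic: credences in the worlds sum to 1.4
n = len(c)

res = minimize(lambda x: D(x, c), x0=np.full(n, 1 / n),
               bounds=[(1e-6, 1 - 1e-6)] * n,
               constraints=[{'type': 'eq', 'fun': lambda x: np.sum(x) - 1}])
c_star = res.x

for i in range(n):
    print(round(log_inacc(c_star, i), 4), "<", round(log_inacc(c, i), 4))
# at every world, c_star's inaccuracy should come out strictly lower than c's
```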




Sunday, 26 July 2020

Decomposing Bregman divergences

For a PDF of this post, see here.

Here are a couple of neat little results about Bregman divergences that I just happened upon. They might help to prove some more decomposition theorems along the lines of this classic result by Morris DeGroot and Stephen Fienberg and, more recently, this paper by my colleagues in the computer science department at Bristol. I should say that a lot is known about Bregman divergences because of their role in information geometry, so these results are almost certainly known already, but I don't know where.

Refresher on Bregman divergences


First up, what's a divergence? It's essentially a generalization of the notion of a measure of distance from one point to another. The points live in some closed convex subset $\mathcal{X} \subseteq \mathbb{R}^n$. A divergence is a function $D : \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty]$ such that
  • $D(x, y) \geq 0$, for all $x$, $y$ in $\mathcal{X}$, and
  • $D(x, y) = 0$ iff $x = y$.
Note: We do not assume that a divergence is symmetric. So the distance from $x$ to $y$ need not be the same as the distance from $y$ to $x$. That is, we do not assume $D(x, y) = D(y, x)$ for all $x$, $y$ in $\mathcal{X}$. Indeed, among the family of divergences that we will consider -- the Bregman divergences -- only one is symmetric -- the squared Euclidean distance. And we do not assume the triangle inequality. That is, we don't assume that the divergence from $x$ to $z$ is at most the sum of the divergence from $x$ to $y$ and the divergence from $y$ to $z$. That is, we do not assume $D(x, z) \leq D(x, y) + D(y, z)$. Indeed, the conditions under which $D(x, z) = D(x, y) + D(y, z)$ for a Bregman divergence $D$ will be our concern here.

So, what's a Bregman divergence? $D : \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty]$ is a Bregman divergence if there is a strictly convex function $\Phi : \mathcal{X} \rightarrow \mathbb{R}$ that is differentiable on the interior of $\mathcal{X}$ such that$$D(x, y) = \Phi(x) - \Phi(y) - \nabla \Phi(y) (x-y)$$In other words, to find the divergence from $x$ to $y$, you go to $y$, find the tangent to $\Phi$ at $y$. Then hop over to $x$ and subtract the value at $x$ of the tangent you just drew at $y$ from the value at $x$ of $\Phi$. That is, you subtract $\nabla \Phi(y) (x-y) + \Phi(y)$ from $\Phi(x)$. Because $\Phi$ is convex, it is always curving away from the tangent, and so $\nabla \Phi(y) (x-y) + \Phi(y)$, the value at $x$ of the tangent you drew at $y$, is always less than $\Phi(x)$, the value at $x$ of $\Phi$.

The two most famous Bregman divergences are:
  • Squared Euclidean distance. Let $\Phi(x) = ||x||^2 = \sum_i x_i^2$, in which case$$D(x, y) = ||x-y||^2 = \sum_i (x_i - y_i)^2$$
  • Generalized Kullback-Leibler divergence. Let $\Phi(x) = \sum_i x_i \log x_i$, in which case$$D(x, y) = \sum_i \left( x_i\log\frac{x_i}{y_i} - x_i + y_i \right)$$
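Here's a small sketch of these two, just to make the definitions concrete; the points $x$ and $y$ are arbitrary choices of mine with positive coordinates (as the generalized Kullback-Leibler divergence requires):

```python
import numpy as np

# Squared Euclidean distance: the Bregman divergence generated by Phi(x) = sum_i x_i^2
def sq_euclid(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum((x - y) ** 2))

# Generalized Kullback-Leibler: the Bregman divergence generated by Phi(x) = sum_i x_i log x_i
def gen_kl(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum(x * np.log(x / y) - x + y))

x, y = [0.2, 0.5, 0.9], [0.4, 0.4, 0.4]
print(sq_euclid(x, y), sq_euclid(y, x))   # symmetric
print(gen_kl(x, y), gen_kl(y, x))         # non-negative, zero only if x = y, but not symmetric
```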
Bregman divergences are convex in the first argument. Thus, for $z$ in $\mathcal{X}$ and a closed convex subset $C \subseteq \mathcal{X}$, we can define the $D$-projection of $z$ into $C$ to be the point $\pi_{z, C}$ in $C$ at which $D(y, z)$ is minimized, as a function of $y$. Now, we have the following theorem about Bregman divergences, due to Imre Csiszár:

Theorem (Generalized Pythagorean Theorem) If $C \subseteq \mathcal{X}$ is closed and convex, $z$ is in $\mathcal{X}$, and $x$ is in $C$, then$$D(x, \pi_{z, C}) + D(\pi_{z, C}, z) \leq D(x, z)$$

Decomposing Bregman divergences


This invites the question: when does equality hold? The following result gives a particular class of cases, and in doing so provides us with a recipe for creating decompositions of Bregman divergences into their component parts. Essentially, it says that the above inequality is an equality if $C$ is a plane in $\mathbb{R}^n$.

Theorem 1  Suppose $r$ is in $\mathbb{R}$ and $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\sum_i \alpha_i = 1$. Then let $C := \{(x_1, \ldots, x_n) : \sum_i \alpha_ix_i = r\}$. Then if $z$ is in $\mathcal{X}$ and $x$ is in $C$,$$D_\Phi(x, z) = D_\Phi(x, \pi_{z, C}) + D_\Phi(\pi_{z, C}, z)$$

Proof of Theorem 1.  We begin by showing:

Lemma 1 For any $x$, $y$, $z$ in $\mathcal{X}$,$$D_\Phi(x, z) = D_\Phi(x, y) + D_\Phi(y, z) \Leftrightarrow (\nabla \Phi(y) - \nabla \Phi(z))(x-y) = 0$$

Proof of Lemma 1.  $$D_\Phi(x, z) = D_\Phi(x, y) + D_\Phi(y, z)$$iff$$\Phi(x) - \Phi(z) - \nabla \Phi(z)(x-z) = \Phi(x) - \Phi(y) - \nabla \Phi(y)(x-y) + \Phi(y) - \Phi(z) - \nabla \Phi(z)(y-z)$$iff$$(\nabla \Phi(y) - \nabla \Phi(z))(x-y) = 0$$as required. $\Box$

Return to Proof of Theorem 1. Now we show that if $x$ is in $C$, then$$(\nabla \Phi(\pi_{z, C}) - \nabla \Phi(z))(x-\pi_{z, C}) = 0$$We know that $D(y, z)$ is minimized on $C$, as a function of $y$, at $y = \pi_{z, C}$. Thus, let $y = \pi_{z, C}$. And let $h(x) := \sum_i \alpha_ix_i - r$. Then $\frac{\partial}{\partial x_i} h(x) = \alpha_i$. So, by the KKT conditions, there is $\lambda$ such that$$\nabla \Phi(y) - \nabla \Phi(z) + (\lambda \alpha_1, \ldots, \lambda \alpha_n) = (0, \ldots, 0)$$Thus,$$\frac{\partial}{\partial y_i} \Phi(y) - \frac{\partial}{\partial z_i} \Phi(z) = -\lambda \alpha_i$$for all $i = 1, \ldots, n$.

Thus, finally, 
\begin{eqnarray*}
& &(\nabla \Phi(y) - \nabla \Phi(z))(x-y) \\
& = & \sum_i \left (\frac{\partial}{\partial y_i} \Phi(y) - \frac{\partial}{\partial z_i} \Phi(z)\right )(x_i-y_i) \\
& = &  \sum_i (-\lambda \alpha_i) (x_i - y_i) \\
& = & -\lambda \left (\sum_i \alpha_i x_i - \sum_i \alpha_i y_i\right ) \\
& = & -\lambda (r-r) \\
& = & 0
\end{eqnarray*}
as required. $\Box$
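Here's a small numerical illustration of Theorem 1, using the generalized Kullback-Leibler divergence from above. The weights, the value of $r$, and the points $z$ and $x$ are arbitrary choices of mine, and I use scipy to compute the $D$-projection; the two printed numbers should agree up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

# Generalized KL divergence, the Bregman divergence generated by Phi(x) = sum_i x_i log x_i
def gkl(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum(x * np.log(x / y) - x + y))

# The plane C = {x : sum_i alpha_i x_i = r}, and a point z to project into it
alpha = np.array([0.5, 0.3, 0.2])
r = 0.4
z = np.array([0.8, 0.1, 0.6])

# D-projection of z into C: minimize gkl(y, z) over y in C
res = minimize(lambda y: gkl(y, z), x0=np.full(3, r),
               bounds=[(1e-6, None)] * 3,
               constraints=[{'type': 'eq', 'fun': lambda y: alpha @ y - r}])
pi = res.x

# Pick some x in C (start anywhere and shift along alpha until the constraint holds)
x = np.array([0.3, 0.5, 0.9])
x = x + (r - alpha @ x) / (alpha @ alpha) * alpha

print(gkl(x, z))                 # D(x, z)
print(gkl(x, pi) + gkl(pi, z))   # D(x, pi_{z,C}) + D(pi_{z,C}, z): should match, up to tolerance
```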

Theorem 2  Suppose $1 \leq k \leq n$. Let $C := \{(x_1, \ldots, x_n) : x_1 = x_2 = \ldots = x_k\}$. Then if $z$ is in $\mathcal{X}$ and $x$ is in $C$,$$D_\Phi(x, z) = D_\Phi(x, \pi_{z, C}) + D_\Phi(\pi_{z, C}, z)$$

Proof of Theorem 2. We know that $D(y, z)$ is minimized on $C$, as a function of $y$, at $y = \pi_{z, C}$. Thus, let $y = \pi_{z, C}$. And let $h_i(x) := x_{i+1} - x_i$, for $i = 1, \ldots, k-1$. Then$$\frac{\partial}{\partial x_j} h_i(x) = \left \{ \begin{array}{ll} 1 & \mbox{if } i+1 = j \\ -1 & \mbox{if } i = j \\ 0 & \mbox{otherwise}\end{array} \right.$$ So, by the KKT conditions, there are $\lambda_1, \ldots, \lambda_{k-1}$ such that

$$\nabla \Phi(y) - \nabla \Phi(z) + (-\lambda_1, \lambda_1, 0, \ldots, 0) + (0, -\lambda_2, \lambda_2, 0, \ldots, 0) + \ldots + (0, \ldots, 0, -\lambda_{k-1}, \lambda_{k-1}, 0, \ldots, 0) = (0, \ldots, 0)$$

Thus,$$\begin{eqnarray*}\frac{\partial}{\partial y_1} \Phi(y) - \frac{\partial}{\partial z_1} \Phi(z) & = & - \lambda_1 \\ \frac{\partial}{\partial y_2} \Phi(y) - \frac{\partial}{\partial z_2} \Phi(z) & = & \lambda_1 - \lambda_2 \\ \vdots & \vdots & \vdots \\ \frac{\partial}{\partial y_{k-1}} \Phi(y) - \frac{\partial}{\partial z_{k-1}} \Phi(z) & = & \lambda_{k-2}- \lambda_{k-1} \\ \frac{\partial}{\partial y_k} \Phi(y) - \frac{\partial}{\partial z_k} \Phi(z) & = & \lambda_{k-1} \\ \frac{\partial}{\partial y_{k+1}} \Phi(y) - \frac{\partial}{\partial z_{k+1}} \Phi(z) & = & 0 \\ \vdots & \vdots & \vdots \\ \frac{\partial}{\partial y_n} \Phi(y) - \frac{\partial}{\partial z_n} \Phi(z) & = & 0 \end{eqnarray*}$$

Thus, finally, 
\begin{eqnarray*}
& &(\nabla \Phi(y) - \nabla \Phi(z))(x-y) \\
& = & \sum_i \left (\frac{\partial}{\partial y_i} \Phi(y) - \frac{\partial}{\partial z_i} \Phi(z)\right )(x_i-y_i) \\
& = & -\lambda_1(x_1-y_1) + (\lambda_1 - \lambda_2)(x_2-y_2) + \ldots \\
&& + (\lambda_{k-2} - \lambda_{k-1})(x_{k-1}-y_{k-1}) + \lambda_{k-1}(x_k-y_k) \\
&& + 0(x_{k+1} - y_{k+1}) + \ldots + 0 (x_n - y_n) \\
& = & \sum^{k-1}_{i=1} \lambda_i (x_{i+1} - x_i) + \sum^{k-1}_{i=1} \lambda_i (y_i - y_{i+1})\\
& = & 0
\end{eqnarray*}
where the last step follows because $x$ and $y = \pi_{z, C}$ are both in $C$, so that $x_{i+1} = x_i$ and $y_{i+1} = y_i$ for $i = 1, \ldots, k-1$, as required. $\Box$

DeGroot and Fienberg's calibration and refinement decomposition


To obtain these two decomposition results, we needed to assume nothing more than that $D$ is a Bregman divergence. The classic result by DeGroot and Fienberg requires a little more. We can see this by considering a very special case of it. Suppose $(X_1, \ldots, X_n)$ is a sequence of propositions that forms a partition. And suppose $w$ is a possible world. Then we can represent $w$ as the vector $w = (0, \ldots, 0, 1, 0, \ldots, 0)$, which takes value 1 at the proposition that is true in $w$ and 0 everywhere else. Now suppose $c = (c, \ldots, c)$ is an assignment of the same credence to each proposition. Then one very particular case of DeGroot and Fienberg's result says that, if $(0, \ldots, 0, 1, 0, \ldots, 0)$ is the world at which $X_i$ is true, then

$$D((0, \ldots, 0, 1, 0, \ldots, 0), (c, \ldots, c)) = D((0, \ldots, 0, 1, 0, \ldots, 0), (\frac{1}{n}, \ldots, \frac{1}{n})) + D((\frac{1}{n}, \ldots, \frac{1}{n}), (c, \ldots, c))$$

Now, we know from Lemma 1 that this is true iff$$(\nabla \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \nabla \Phi(c, \ldots, c))((0, \ldots, 0, 1, 0, \ldots, 0) - (\frac{1}{n}, \ldots, \frac{1}{n})) = 0$$which is true iff

$$\frac{\partial}{\partial x_i} \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \frac{\partial}{\partial x_i} \Phi(c, \ldots, c) = \frac{1}{n} \sum^n_{j=1} \left ( \frac{\partial}{\partial x_j} \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \frac{\partial}{\partial x_j} \Phi(c, \ldots, c) \right )$$

and that is true iff

$$\frac{\partial}{\partial x_i} \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \frac{\partial}{\partial x_i} \Phi(c, \ldots, c) = \frac{\partial}{\partial x_j} \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \frac{\partial}{\partial x_j} \Phi(c, \ldots, c)$$

for all $1 \leq i, j \leq n$, which is true iff, for any $x$ and $1 \leq i, j \leq n$,$$\frac{\partial}{\partial x_i} \Phi(x, \ldots, x) = \frac{\partial}{\partial x_j} \Phi(x, \ldots, x)$$Now, this is true if $\Phi(x_1, \ldots, x_n) = \sum^n_{i=1} \varphi(x_i)$ for some $\varphi$. That is, it is true if $D$ is an additive Bregman divergence. But it is also true for certain non-additive Bregman divergences, such as the one generated from the log-sum-exp function:

Definition (log-sum-exp) Suppose $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\sum^n_{i=1} \alpha_i = 1$. Then let $$\Phi^A(x_1, \ldots, x_n) = \log(1 + \alpha_1e^{x_1} + \ldots + \alpha_ne^{x_n})$$Then
$$D(x, y) = \log (1 + \sum_i \alpha_ie^{x_i}) - \log(1 + \sum_i \alpha_ie^{y_i}) - \sum_k \frac{\alpha_k(x_k - y_k)e^{y_k}}{1 + \sum_i \alpha_ie^{y_i}}$$

Now$$\frac{\partial}{\partial x_i} \Phi^A(x_1, \ldots, x_n) = \frac{\alpha_i e^{x_i}}{1 + \alpha_1 e^{x_1} + \ldots + \alpha_ne^{x_n}}$$So, if $\alpha_i = \alpha_j$ for all $1 \leq i, j \leq n$, then$$\frac{\partial}{\partial x_i} \Phi^A(x, \ldots, x) = \frac{\alpha e^x}{1 + e^x} = \frac{\partial}{\partial x_j} \Phi^A(x, \ldots, x)$$But if $\alpha_i \neq \alpha_j$ for some $1 \leq i, j \leq n$, then$$\frac{\partial}{\partial x_i} \Phi^A(x, \ldots, x) = \frac{\alpha_ie^x}{1 + e^x} \neq \frac{\alpha_je^x}{1 + e^x} = \frac{\partial}{\partial x_j} \Phi^A(x, \ldots, x)$$
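Here's a quick numerical check of that, offered only as a sketch: $n = 3$, the world at which the second proposition in the partition is true, and the credence $c = 0.7$ are all arbitrary choices of mine.

```python
import numpy as np

# The Bregman divergence generated by the log-sum-exp function Phi^A with weights alpha
def lse_divergence(x, y, alpha):
    x, y, alpha = (np.asarray(v, float) for v in (x, y, alpha))
    phi = lambda v: np.log(1 + np.sum(alpha * np.exp(v)))
    grad_y = alpha * np.exp(y) / (1 + np.sum(alpha * np.exp(y)))
    return float(phi(x) - phi(y) - grad_y @ (x - y))

n = 3
w = np.array([0.0, 1.0, 0.0])      # the world at which the second cell of the partition is true
c = np.full(n, 0.7)                # the same credence in every cell
u = np.full(n, 1.0 / n)            # the calibration point (1/n, ..., 1/n)

for alpha in ([1/3, 1/3, 1/3], [0.5, 0.3, 0.2]):
    lhs = lse_divergence(w, c, alpha)
    rhs = lse_divergence(w, u, alpha) + lse_divergence(u, c, alpha)
    print(alpha, round(lhs, 6), round(rhs, 6))
# with equal weights the two numbers should coincide; with unequal weights they should come apart
```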

And indeed, the result even fails if we have a merely semi-additive Bregman divergence, that is, one generated by $\Phi(x) = \sum^n_{i=1} \phi_i(x_i)$ where the $\phi_i$ are not all the same. For instance, suppose $\phi_1(x) = x^2$ and $\phi_2(x) = x\log x$ and $\Phi(x, y) = \phi_1(x) +  \phi_2(y) = x^2 + y\log y$. Then$$\frac{\partial}{\partial x_1} \Phi(x, x) = 2x \neq 1 + \log x = \frac{\partial}{\partial x_2} \Phi(x, x)$$

Proving the Generalized Pythagorean Theorem


In this section, I really just spell out in more detail the proof that Predd, et al. give of the Generalized Pythagorean Theorem, which is their Proposition 3. But that proof contains some important general facts that might be helpful for people working with Bregman divergences. I collect these together here into one lemma.

Lemma 2  Suppose $D$ is a Bregman divergence generated from $\Phi$. And suppose $x, y, z \in \mathcal{X}$. Then$$\begin{eqnarray*} & & D(x, z) - [D(x, y) + D(y, z)] \\ & = & (\nabla \Phi(y) - \nabla \Phi(z))(x - y) \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [D(y + \varepsilon (x - y), z) - D(y, z)] \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z)] \end{eqnarray*}$$

We can then prove the Generalized Pythagorean Theorem easily. Suppose $x$ is in a closed convex set $C$ and $y$ is the point in $C$ that minimizes $D(y', z)$ as a function of $y'$. Then, for all $0 \leq \varepsilon \leq 1$, $\varepsilon x + (1-\varepsilon)y$ is in $C$. And since $y$ minimizes,$$D(\varepsilon x + (1-\varepsilon)y, z) \geq D(y, z)$$So $D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z) \geq 0$. So $$\lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon}[D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z)] \geq 0$$So, by Lemma 2,$$D(x, z) \geq D(x, y) + D(y, z)$$

Proof of Lemma 2.  $$\begin{eqnarray*} && \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z)] \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [(\Phi(\varepsilon x + (1-\varepsilon)y) - \Phi(z) - \nabla \Phi(z)(\varepsilon x + (1-\varepsilon)y - z)) - \\ & & (\Phi(y) - \Phi(z) - \nabla\Phi(z)(y-z))]\\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [\Phi(\varepsilon x + (1-\varepsilon)y) - \Phi(y) - \varepsilon\nabla \Phi(z)(x -y)] \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [\Phi(\varepsilon x + (1-\varepsilon)y) - \Phi(y)]  - \nabla \Phi(z)(x -y) \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [\Phi(y + \varepsilon (x -y)) - \Phi(y)]  - \nabla \Phi(z)(x -y) \\ & = & \nabla \Phi(y)(x -y) - \nabla \Phi(z)(x -y) \\ & = & (\nabla \Phi(y) - \nabla \Phi(z))(x -y)\end{eqnarray*}$$ $\Box$

Thursday, 23 July 2020

Epistemic risk and permissive rationality (part I): an overview

I got interested in epistemic risk again, after a hiatus of four or five years, by thinking about the debate in epistemology between permissivists and impermissivists about epistemic rationality. Roughly speaking, according to the impermissivist, every body of evidence you might obtain mandates a unique rational set of attitudes in response --- this is sometimes called the uniqueness thesis. According to the permissivist, there is evidence you might obtain that doesn't mandate a unique rational set of attitudes in response --- there are, instead, multiple rational responses.

I want to argue for permissivism. And I want to do it by appealing to the sorts of claims about how to set your priors and posteriors that I've been developing over this series of blogposts (here and here). In the first of those blogposts, I argued that we should pick priors using a decision rule called the generalized Hurwicz criterion (GHC). That is, we should see our choice of priors as a decision we must make; and we should make that decision using a particular decision rule -- namely, GHC -- where we take the available acts to be the different possible credence functions and the utility of an act at a world to be a measure of its accuracy at that world.

Now, GHC is, in fact, not a single decision rule, but a family of rules, each specified by some parameters that I call the Hurwicz weights. These encode different attitudes to risk -- they specify the weight you assign to the best-case scenario, the weight you assign to the second-best, and so on down to the weight you assign to the second-worst scenario, and the weight you assign to the worst. And, what's more, many different attitudes to risk are permissible; and therefore many different Hurwicz weights are permissible; and so many versions of GHC are legitimate decision rules to adopt when picking priors. So different permissible attitudes to risk determine different Hurwicz weights; and different Hurwicz weights mandate different rational priors; and different rational priors mandate different rational posteriors given the same evidence. Epistemic rationality, therefore, is permissive. That's the argument in brief.

With this post, I'd like to start a series of posts in which I explore how this view plays out in the permissivism debate. If there are many different rationally permissible responses to the same piece of evidence because there are many different rationally permissible attitudes to risk, how does that allow us to answer the various objections to permissivism that have been raised?

In this post, I want to do four things: first, run through a taxonomy of varieties of permissivism that slightly expands on one due to Elizabeth Jackson; second, explain my motivation for offering this argument for permissivism; third, discuss an earlier risk-based argument for the position due to Thomas Kelly; finally, situate within Jackson's taxonomy the version of permissivism that follows from my own risk-based approach to setting priors and posteriors.

Varieties of permissivism


Let's start with the taxonomy of permissivisms. I suspect it's not complete; there are likely other dimensions along which permissivists will differ. But it's quite useful for our purposes.

First, there are different versions of permissivism for different sorts of doxastic attitudes we might have in response to evidence. So there are versions for credences, beliefs, imprecise probabilities, comparative confidences, ranking functions, and so on. For instance, on the credal version of permissivism, there is evidence that doesn't determine a unique set of credences that rationality requires us to have in response to that evidence. And for any two different sorts of doxastic attitude, you can be permissive with respect to one but not the other: permissive about beliefs but not about credences, for instance, or vice versa.

Second, permissivism comes in interpersonal and intrapersonal versions. According to interpersonal permissivism, it is possible for different individuals to have the same evidence, but different attitudes in response, and yet both be rational. According to the intrapersonal version, there is evidence a single individual might have, and different sets of attitudes such that whichever they have, they'll still be rational. Most people who hold to intrapersonal permissivism for a certain sort of doxastic attitude also hold to the interpersonal version, but there are many who think intrapersonal permissivism is mistaken but interpersonal permissivism is correct.

Third, it comes in wide and narrow versions. This is determined by how many different attitudes are permitted in response to a piece of evidence, and how much variation there is between them. On narrow versions, there are not so many different rational responses and they do not vary too widely; on wide versions, there are many and they vary greatly.

Fourth, it comes in common and rare versions. On the first, most evidence is permissive; on the latter, permissive evidence is rare.

I'll end up defending two versions of permissivism: (i) a wide common version of interpersonal permissivism about credences; and (ii) a narrow common version of intrapersonal permissivism about credences.

Why argue for permissivism?


Well, because it's true, mainly. But there's another motivation for adding to the already crowded marketplace of arguments for the position. Many philosophers defend permissivism for negative reasons. They look at two very different sorts of evidence and give reasons to be pessimistic about the prospects of identifying a unique rational credal response to them. They are: very sparse evidence and very complex evidence. In the first, they say, our evidence constrains us too little. There are too many credal states that respect it. If there is a single credal response that rationality mandates to this sparse evidence, there must be some way to whittle down the vast set of states that respect it and leave us with only one. For instance, some philosophers claim that, among this vast set of states, we should pick the one that has lowest informational content, since any other will go beyond what is warranted by the evidence. But it has proven extremely difficult to identify that credal state in many cases, such as von Mises' water-wine example, Bertrand's paradox, and van Fraassen's cube factory. Despairing of finding a way to pick a single credal state from this vast range, many philosophers have become permissivist. In the second sort of case, at the other extreme, where our evidence is very complex rather than very sparse, our evidence points in too many directions at once. In such cases, you might hope to identify a unique way in which to weigh the different sources of evidence and the direction in which they point to give the unique credal state that rationality mandates. And yet again, it has proven difficult to find a principled way of assigning these weights. Despairing, philosophers have become permissivist in these cases too.

I'd like to give a positive motivation for permissivism---one that doesn't motivate it by pointing to the difficulty of establishing its negation. My account will be based within accuracy-first epistemology, and it will depend crucially on the notion of epistemic risk. Rationality permits a variety of attitudes to risk in the practical sphere. Faced with the same risky choice, you might be willing to gamble because you are risk-seeking, and I might be unwilling because I am risk-averse, but we are both rational and neither more rational than the other. On my account, rationality also permits different attitudes to risk in the epistemic sphere. And different attitudes to epistemic risk warrant different credal attitudes in response to a body of evidence. Therefore, permissivism.

Epistemic risk encoded in epistemic utility


It is worth noting that this is not the first time that the notion of epistemic risk has entered the permissivism debate. In an early paper on the topic, Thomas Kelly appeals to William James' distinction between the two goals that we have when we have beliefs---believing truths and avoiding errors. When we have a belief, it gives us a chance of being right, but it also runs the risk of being wrong. In contrast, when we withhold judgment on a proposition, we run no risk of being wrong, but we give ourselves no chance of being right. Kelly then notes that whether you should believe on the basis of some evidence depends on how strongly you want to believe truths and how strongly you don't want to believe falsehoods. Using an epistemic utility framework introduced independently by Kenny Easwaran and Kevin Dorst, we can make this precise. Suppose:
  1. I assign a positive epistemic utility of $R > 0$ to believing a truth or disbelieving a falsehood;
  2. I assign a negative epistemic utility (or positive epistemic disutility) of $-W < 0$ to believing a falsehood or disbelieving a truth; and
  3. I assign a neutral epistemic utility of 0 to withholding judgment.
And suppose $W > R$. And suppose further that there is some way to measure, for each proposition, how likely or probable my evidence makes that proposition---that is, we assume there is a unique evidential probability function of the sort that J. M. Keynes, E. T. Jaynes, and Timothy Williamson envisaged. Then, if $r$ is how likely my evidence makes the proposition $X$, then:
  1. the expected value of believing $X$ is $rR + (1-r)(-W)$,
  2. the expected value of disbelieving $X$ is $r(-W) + (1-r)R$, and
  3. the expected value of withholding judgment is $0$.
A quick calculation shows that believing uniquely maximises expected utility when $r > \frac{W}{R+W}$, disbelieving uniquely maximises when $r < \frac{R}{R+W}$, and withholding uniquely maximises if $\frac{R}{R +W} < r < \frac{W}{R+W}$. What follows is that the more you disvalue being wrong, the stronger the evidence will have to be in order to make it rational to believe. Now, Kelly assumes that various values of $R$ and $W$ are rationally permissible---it is permissible to disvalue believing falsehoods a lot more than you value believing truths, and it is permissible to disvalue that just a little more. And, if that is the case, different individuals might have the same evidence while rationality requires of them different doxastic attitudes---a belief for one of them, who disvalues being wrong only a little more than they value being right, and no belief for the other, where the difference between their disvalue for false belief and value for true belief is much greater. Kelly identifies the values you pick for $R$ and $W$ with your attitudes to epistemic risk. So different doxastic attitudes are permissible in the face of the same evidence because different attitudes to epistemic risk are permissible.
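Here's that calculation in a few lines; the particular values of $r$, $R$, and $W$ are just illustrative choices of mine.

```python
# Expected epistemic utility of each attitude to X, when the evidential probability of X is r
def best_attitude(r, R, W):
    values = {
        "believe": r * R + (1 - r) * (-W),
        "disbelieve": r * (-W) + (1 - r) * R,
        "withhold": 0.0,
    }
    return max(values, key=values.get), values

r = 0.8
print(best_attitude(r, R=1, W=2))   # W/(R+W) = 2/3 < 0.8, so believing maximises
print(best_attitude(r, R=1, W=9))   # W/(R+W) = 0.9 > 0.8, so withholding maximises
```

Two agents facing the same evidence (the same $r$) but with different permissible values of $W$ can thus be rationally required to adopt different attitudes.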

Now, there are a number of things worth noting here before I pass to my own alternative approach to epistemic risk.

First, note that Kelly manages to show that epistemic rationality might be permissive even if there is a unique evidential probability measure. So even those who think we can solve the problem of identifying the probabilities demanded by the very sparse evidence and the very complex evidence described above should still countenance a form of epistemic permissivism, provided they agree that there are different permissible values for $R$ and $W$.

Second, it might seem at first that Kelly's argument gives interpersonal permissivism at most. After all, for fixed $R$ and $W$, and a unique evidential probability $r$ for $X$ given your evidence, it might seem that there is always a single attitude---belief in $X$, disbelief in $X$, or judgment withheld about $X$---that maximises expected epistemic value. But this isn't always true. After all, if $r = \frac{R}{R + W}$, then it turns out that disbelieving and withholding have the same expected epistemic value, and if $r = \frac{W}{R+W}$, then believing and withholding have the same expected epistemic value. And in those cases, it would be rationally permissible for an individual to pick either.

Third, and relatedly, it might seem that Kelly's argument gives only narrow permissivism, since it allows for cases in which believing and withholding are both rational, and it allows for cases in which disbelieving and withholding are both rational, but it doesn't allow for cases in which all three are rational. But that again is a mistake. If you value believing truths exactly as much as you disvalue believing falsehoods, so that $R = W$, and if the objective evidential probability of $X$ given your evidence is $r = \frac{1}{2}$, then believing, disbelieving, and withholding judgment are all permissible. Having said that, there is some reason to say that it is not rationally permissible to set $R = W$. After all, if you do, and if $r = \frac{1}{2}$, then it is permissible to both believe $X$ and believe $\overline{X}$ at the same time, and that seems wrong.

Fourth, and most importantly for my purposes, Kelly's argument works for beliefs, but not for credences. The problem, briefly stated, is this: suppose $r$ is how likely my evidence makes proposition $X$. And suppose $\mathfrak{s}(1, x)$ is the accuracy of credence $x$ in a truth, while $\mathfrak{s}(0, x)$ is the accuracy of credence $x$ in a falsehood. Then the expected accuracy of credence $x$ in $X$ is
\begin{equation}\label{eeu}
r\mathfrak{s}(1, x) + (1-r)\mathfrak{s}(0, x)\tag{*}
\end{equation}
But nearly all advocates of epistemic utility theory for credences agree that rationality requires that $\mathfrak{s}$ is a strictly proper scoring rule. And that means that (*) is maximized, as a function of $x$, at $x = r$. So differences in how you value epistemic utility don't give rise to differences in what credences you should have. Your credences should always match the objective evidential probability of $X$ given your evidence. Epistemic permissivism about credences would therefore be false.
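For instance (my example, not Kelly's): take the negative Brier score, so that $\mathfrak{s}(1, x) = -(1-x)^2$ and $\mathfrak{s}(0, x) = -x^2$. Then$$r\mathfrak{s}(1, x) + (1-r)\mathfrak{s}(0, x) = -r(1-x)^2 - (1-r)x^2 = -(x - r)^2 + r^2 - r$$which is uniquely maximized, as a function of $x$, at $x = r$.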

I think Kelly's observation, supplemented with Easwaran's precise formulation of epistemic value, furnishes a strong argument for permissivism about beliefs. But I think we can appeal to epistemic risk to give something more, namely, two versions of permissivism about credences: first, a wide common interpersonal version, and second a narrow common intrapersonal version.

Epistemic risk encoded in decision rules


To take the first step towards these versions of permissivism for credences, let's begin with the observation that there are two ways in which risk enters into the rational evaluation of a set of options. First, risk might be encoded in the utility function, which measures the value of each option at each possible world; or, second, it might be encoded in the choice rule, which takes in various features of the options, including their utilities at different worlds, and spits out the set of options that are rationally permissible.

Before we move to the epistemic case, let's look at how this plays out in the practical case. I am about to flip a fair coin. I make you an offer: pay me £30, and I will pay you £100 if the coin lands heads and nothing if it lands tails. You reject my offer. There are two ways to rationalise your decision. On the first, you choose using expected utility theory, which is a risk-neutral decision rule. However, because the utility you assign to an outcome is a sufficiently concave function of the money you get in that outcome, and your current wealth is sufficiently small, the expected utility of accepting my offer is less than the expected utility of rejecting it. For instance, perhaps your utility for an outcome in which your total wealth is £$n$ is $\log n$. And perhaps your current wealth is £$40$. Then your expected utility for accepting my offer is $\frac{1}{2}\log 110 + \frac{1}{2} \log 10 \approx 3.502$ while your expected utility for rejecting it is $\log 40 \approx 3.689$. So you are rationally required to reject. On this way of understanding your choice, your risk-aversion is encoded in your utility function, while your decision rule is risk-neutral. On the second way of understanding your choice, it is the other way around. Instead of expected utility theory, you choose using a risk-sensitive decision rule, such as Wald's Maximin, the Hurwicz criterion, the generalized Hurwicz criterion, Quiggin's rank-dependent utility theory, or Buchak's risk-weighted expected utility theory. According to Maximin, for instance, you are required to choose an option whose worst-case outcome is best. The worst case if you accept the offer is the one in which the coin lands tails and I pay you back nothing, in which case you end up £$30$ down, whereas the worst case if you refuse my offer is the status quo in which you end up with exactly as much as you had before. So, providing you prefer more money to less, the worst-case outcome of accepting the offer is worse than the worst-case outcome of refusing it, so Maximin will lead you to refuse the offer. And it will lead you to do that even if, for instance, you value money linearly. Thus, there is no need to reflect your attitude to risk in your utility function at all, because it is encoded in your decision rule.
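Here's the same example in a few lines, a sketch using the numbers from the story (a £30 price, a £100 prize, £40 of current wealth, and log utility for the first way of rationalising the choice):

```python
import math

wealth, price, prize = 40, 30, 100

# First way: risk encoded in the utility function (log utility), risk-neutral expected utility
eu_accept = 0.5 * math.log(wealth - price + prize) + 0.5 * math.log(wealth - price)
eu_reject = math.log(wealth)
print(round(eu_accept, 3), round(eu_reject, 3))   # roughly 3.502 vs 3.689, so reject

# Second way: risk encoded in the decision rule (Maximin), with utility linear in money
worst_if_accept = wealth - price    # tails: you're down £30
worst_if_reject = wealth            # the status quo
print(worst_if_accept < worst_if_reject)          # Maximin also says reject, no concavity needed
```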

I take the lesson of the Allais paradox to be that there is rational risk-sensitive behaviour that we cannot capture entirely using the first method here. That is, there are rational preferences that we cannot recover within expected utility theory by making the utility function concave in money, or applying some other tweak. We must instead permit risk-sensitive choice rules. Now, there are two sorts of such rules: those that require credences among their inputs and those that don't. In the first camp, perhaps the most sophisticated is Lara Buchak's risk-weighted expected utility theory. In the second, we've already met the most famous example, namely, Maximin, which is maximally risk-averse. But there is also Maximax, which is maximally risk-seeking. And there is the Hurwicz criterion, which strikes a balance between the two. And there's my generalization of the Hurwicz criterion, which I'll abbreviate GHC. As I've discussed over the last few blogposts, I favour the latter in the case of picking priors. (For an alternative approach to epistemic risk, see Boris Babic's recent paper here.)

To see what happens when you use GHC to pick priors, let's give a quick example in a situation in which there are just three possible states of the world to which you assign credences, $w_1$, $w_2$, $w_3$, and we write $(p_1, p_2, p_3)$ for a credence function $p$ that assigns $p_i$ to world $w_i$. Suppose your Hurwicz weights are these: $\alpha_1$ for the best case, $\alpha_2$ for the second-best (and second-worst) case, and $\alpha_3$ for the worst case. And your inaccuracy measure is $\mathfrak{I}$. Then we're looking for the credence functions that minimize your Hurwicz score, which is$$H^A_{\mathfrak{I}}(p) = \alpha_1\mathfrak{I}(p, w_{i_1}) + \alpha_2\mathfrak{I}(p, w_{i_2}) + \alpha_3\mathfrak{I}(p, w_{i_3})$$when$$\mathfrak{I}(p, w_{i_1}) \leq \mathfrak{I}(p, w_{i_2}) \leq \mathfrak{I}(p, w_{i_3})$$so that the best-case weight attaches to the lowest inaccuracy and the worst-case weight to the highest. Now suppose for our example that $\alpha_1 \geq \alpha_2 \geq \alpha_3$. Then the credence functions that minimize $H^A_{\mathfrak{I}}$ are$$\begin{array}{ccc} (\alpha_1, \alpha_2, \alpha_3) & (\alpha_1, \alpha_3, \alpha_2) & (\alpha_2, \alpha_1, \alpha_3) \\ (\alpha_2, \alpha_3, \alpha_1) & (\alpha_3, \alpha_1, \alpha_2) & (\alpha_3, \alpha_2, \alpha_1) \end{array}$$
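Here's a crude numerical check of that claim, offered only as a sketch: it uses the Brier score for $\mathfrak{I}$, the weights $(0.5, 0.3, 0.2)$, and a grid search over probabilistic credence functions; the minimizers it finds should be, up to grid resolution, permutations of the weights themselves.

```python
import numpy as np

# Brier inaccuracy of credence function p (over three worlds) at world i
def brier(p, i):
    truth = np.eye(3)[i]
    return float(np.sum((truth - p) ** 2))

# Generalized Hurwicz score: best-case weight times lowest inaccuracy, and so on down
def ghc_score(p, weights):
    return sum(w * x for w, x in zip(weights, sorted(brier(p, i) for i in range(3))))

weights = (0.5, 0.3, 0.2)

grid = [round(v, 2) for v in np.linspace(0, 1, 101)]
candidates = [np.array([p1, p2, round(1 - p1 - p2, 2)])
              for p1 in grid for p2 in grid if p1 + p2 <= 1]
best = min(candidates, key=lambda p: ghc_score(p, weights))
print(best, round(ghc_score(best, weights), 4))   # expect a permutation of (0.5, 0.3, 0.2)
```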

With that example in hand, and a little insight into how GHC works when you use it to select priors, let's work through Elizabeth Jackson's taxonomy of permissivism from above.

First, since the attitudes we are considering are credences, it's a credal version of permissivism that follows from this risk-based approach in accuracy-first epistemology.

Second, we obtain both an interpersonal and an intrapersonal permissivism. A particular person will have risk attitudes represented by specific Hurwicz weights. And yet, even once those are fixed, there will usually be a number of different permissible priors. That is, rationality will permit a number of different credal states in the absence of evidence. For instance, if my Hurwicz weights are $\alpha_1 = 0.5$, $\alpha_2 = 0.3$, $\alpha_3 = 0.2$, then rationality allows me to assign 0.5 to world $w_1$, 0.3 to $w_2$ and 0.2 to $w_3$, but it also permits me to assign $0.3$ to $w_1$, $0.2$ to $w_2$, and $0.5$ to $w_3$.

So there is intrapersonal credal permissivism, but it is reasonably narrow---there are only six rationally permissible credence functions for someone with the Hurwicz weights just specified, for instance. On the other hand, the interpersonal permissivism we obtain is very wide. Indeed, it is as wide as the range of permissible attitudes to risk. As we noted in a previous post, for any probabilistic credence function over a space of possible worlds, there are Hurwicz weights that will render those credences permissible. So providing those weights are rationally permissible, so are the credences.

Finally, is the permissivism we get from this risk-based approach common or rare? So far, we've just considered it in the case of priors. That is, we've only established permissivism in the case in which you have no evidence. But of course, once it's established there, it's also established for many other bodies of evidence, since we obtain the rational credences given a body of evidence by looking to what we obtain by updating rational priors by conditioning on that evidence. And, providing a body of evidence isn't fully informative, if there are multiple rational priors, they will give rise to multiple rational posteriors when we condition them on that evidence. So the wide interpersonal credal permissivism we obtain is common, and so is the narrow intrapersonal credal permissivism.