Thursday, 3 September 2020

Accuracy and Explanation in a Social Setting: thoughts on Douven and Wenmackers

For a PDF version of this post, see here.

In this post, I want to continue my discussion of the part of van Fraassen's argument against inference to the best explanation (IBE) that turns on its alleged clash with Bayesian Conditionalization (BC). In the previous post, I looked at Igor Douven's argument that there are at least some ways of valuing accuracy on which updating by IBE comes out better than BC. I concluded that Douven's arguments don't save IBE; BC is still the only rational way to update.

The setting for Douven's arguments was individualist epistemology. That is, he considered only the single agent collecting evidence directly from the world and updating in the light of it. But of course we often receive evidence not directly from the world, but indirectly through the opinions of others. I learn how many positive SARS-CoV-2 tests there have been in my area in the past week not by inspecting the test results myself but by listening to the local health authority. In their 2017 paper, 'Inference to the Best Explanation versus Bayes’s Rule in a Social Setting', Douven joined with Sylvia Wenmackers to ask how IBE and BC fare in a context in which some of my evidence comes from the world and some from learning the opinions of others, where those others are also receiving some of their evidence from the world and some from others, and where one of those others from whom they're learning might be me. As in Douven's study of IBE vs BC in the individual setting, Douven and Wenmackers conclude in favour of IBE. Indeed, their conclusion in this case is considerably stronger than in the individual case:

The upshot will be that if agents not only update their degrees of belief on the basis of evidence, but also take into account the degrees of belief of their epistemic neighbours, then the noted advantage of Bayesian updating [from Douven's earlier paper] evaporates and IBE does better than Bayes’s rule on every reasonable understanding of inaccuracy minimization. (536-7)

As in the previous post, I want to stick up for BC. As in the individualist setting, I think this is the update rule we should use in the social setting.

Following van Fraassen's original discussion and the strategy pursued in Douven's solo piece, Douven and Wenmackers take the general and ill-specified question whether IBE is better than BC and make it precise by asking it in a very specific case. We imagine a group of individuals. Each has a coin. All coins have the same bias. No individual knows what this shared bias is, but they do know that it is the same bias for each coin, and they know that the options are given by the following bias hypotheses:

$B_0$: coin has 0% chance of landing heads

$B_1$: coin has 10% chance of landing heads

$\ldots$

$B_9$: coin has 90% chance of landing heads

$B_{10}$: coin has 100% chance of landing heads

Though they don't say so, I think Douven and Wenmackers assume that all individuals have the same prior over $B_0, \ldots, B_{10}$, namely, the uniform prior; and each satisfies the Principal Principle, and so their credences in everything else follow from their credences in $B_0, \ldots, B_{10}$. As we'll see, we needn't assume that they all have the uniform prior over the bias hypotheses. In any case, they assume that things proceed as follows:

Step (i) Each member tosses their coin some fixed number of times. This produces their worldly evidence for this round.

Step (ii) Each then updates their credence function on this worldly evidence they've obtained. To do this, each member uses the same updating rule, either BC or a version of IBE. We'll specify these in more detail below.

Step (iii) Each then learns the updated credence functions of the others in the group. This produces their social evidence for this round.

Step (iv) They then update their own credence function by taking the average of their credence function and the other credence functions in the group that lie within a certain distance of theirs. The set of credence functions that lie within a certain distance of one's own, Douven and Wenmackers call one's bounded confidence interval.

They then repeat this cycle a number of times; each time, an individual begins with the credence function they reached at the end of the previous cycle.

Douven and Wenmackers use simulation techniques to see how this group of individuals perform for different updating rules used in step (ii) and different specifications of how close a credence function must lie to yours in order to be included in the average in step (iv). Here's the class of updating rules that they consider: if $P$ is your prior and $E$ is your evidence then your updated credence function should be$$P^c_E(B_i) = \frac{P(B_i)P(E|B_i) + f_c(B_i, E)}{\sum^{10}_{k=0} \left (P(B_k)P(E|B_k) + f_c(B_k, E) \right )}$$where$$f_c(B_i, E) = \left \{ \begin{array}{ll} c & \mbox{if } P(E | B_i) > P(E | B_j) \mbox{ for all } j \neq i \\ \frac{1}{2}c & \mbox{if } P(E | B_i) = P(E|B_j) > P(E | B_k) \mbox{ for all } k \neq j, i \\  0 & \mbox{otherwise} \end{array} \right. $$That is, for $c = 0$, this update rule is just BC, while for $c > 0$, it gives a little boost to whichever hypothesis best explains the evidence $E$, where providing the best explanation for a series of coin tosses amounts to making it most likely, and if two bias hypotheses make the evidence most likely, they split the boost between them. Douven and Wenmackers consider $c = 0, 0.1, \ldots, 0.9, 1$. For each rule, specified by $c$, they also consider different sizes of bounded confidence intervals. These are specified by the parameter $\varepsilon$. Your bounded confidence interval for $\varepsilon$ includes each credence function for which the average difference between the credences it assigns and the credences you assign is at most $\varepsilon$. Thus, $\varepsilon = 0$ is the most exclusive, and includes only your own credence function, while $\varepsilon = 1$ is the most inclusive, and includes all credence functions in the group. Again, Douven and Wenmackers consider $\varepsilon = 0, 0.1, \ldots, 0.9, 1$. Here are two of their main results:

  1. For each bias other than $p = 0.1$ or $0.9$, there is an explanationist rule (i.e. $c > 0$ and some specific $\varepsilon$) that gives rise to a lower average inaccuracy at the end of the process than all BC rules (i.e. $c = 0$ and any $\varepsilon$).
  2. There is an averaging explanationist rule (i.e. $c > 0$ and $\varepsilon > 0$) such that, for each bias other than $p = 0, 0.1, 0.9, 1$, it gives rise to lower average inaccuracy than all BC rules (i.e. $c = 0$ and any $\varepsilon$).

Inaccuracy is measured by the Brier score throughout.
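To make the family of rules concrete, here is a minimal Python sketch of $P^c_E$ as defined above. The function name `explanationist_update`, the constants `BIASES` and `UNIFORM`, and the choice to pass the evidence as a pair of head and tail counts are my own illustrative assumptions, not Douven and Wenmackers' code. Exact fractions are used so the output can be checked by hand.

```python
from fractions import Fraction

# Bias hypotheses B_0, ..., B_10: B_k says the chance of heads is k/10.
BIASES = [Fraction(k, 10) for k in range(11)]
UNIFORM = [Fraction(1, 11)] * 11  # assumed shared prior


def explanationist_update(prior, heads, tails, c=Fraction(0)):
    """Return P^c_E over B_0..B_10, where E is a toss sequence with these counts."""
    # P(E | B_k), fixed by the Principal Principle.
    likes = [b**heads * (1 - b)**tails for b in BIASES]
    # f_c: the full bonus c to a unique best explainer, c/2 each to two tied
    # best explainers, and nothing otherwise.
    best = max(likes)
    winners = [k for k, like in enumerate(likes) if like == best]
    share = {1: Fraction(c), 2: Fraction(c) / 2}.get(len(winners), Fraction(0))
    unnorm = [
        p * like + (share if k in winners else 0)
        for k, (p, like) in enumerate(zip(prior, likes))
    ]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Two heads in two tosses: B_10 is the unique best explanation.
bc = explanationist_update(UNIFORM, heads=2, tails=0)  # c = 0 is plain BC
ibe = explanationist_update(UNIFORM, heads=2, tails=0, c=Fraction(1, 10))
print(bc[10], ibe[10])  # 20/77 14/33
```

With $c = 0$ the bonus vanishes and the rule is just conditionalization; with $c = 0.1$ the credence in the best-explaining hypothesis $B_{10}$ rises from $20/77$ to $14/33$.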

Now, you can ask whether these results are enough to tell so strongly in favour of IBE. But that isn't my concern here. Rather, I want to focus on a more fundamental problem: Douven and Wenmackers' argument doesn't really compare BC with IBE. They're comparing BC-for-worldly-data-plus-Averaging-for-social-data with IBE-for-worldly-data-plus-Averaging-for-social-data. So their simulation results don't really impugn BC, because the average inaccuracies that they attribute to BC don't really arise from it. They arise from using BC in step (ii), but something quite different in step (iv). Douven and Wenmackers ask the Bayesian to respond to the social evidence they receive using a non-Bayesian rule, namely, Averaging. And we can see just how far Averaging lies from BC by considering the following version of the example we have been using throughout.

Consider the biased coin case, and suppose there are just three members of the group. And suppose they all start with the uniform prior over the bias hypotheses. At step (i), they each toss their coin twice. The first individual's coin lands $HT$, the second's $HH$, and the third's $TH$. So, at step (ii), if they all use BC (i.e. $c = 0$), they update on this worldly evidence as follows, where $P$ is the shared prior:
$$\begin{array}{r|ccccccccccc}
& B_0 & B_1& B_2& B_3& B_4& B_5& B_6& B_7& B_8& B_9& B_{10} \\
\hline
&&&&&&&&&& \\
P & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} \\
&&&&&&&&&& \\
P(-|HT) & 0 & \frac{9}{165} & \frac{16}{165}& \frac{21}{165}& \frac{24}{165} & \frac{25}{165}& \frac{24}{165}& \frac{21}{165}& \frac{16}{165}& \frac{9}{165}& 0\\
&&&&&&&&&& \\
P(-|HH) & 0 &   \frac{1}{385} &  \frac{4}{385}&  \frac{9}{385}&  \frac{16}{385}&  \frac{25}{385}&  \frac{36}{385}&  \frac{49}{385}&  \frac{64}{385}&  \frac{81}{385}&  \frac{100}{385}\\
&&&&&&&&&& \\
P(-|TH) &  0 & \frac{9}{165} & \frac{16}{165}& \frac{21}{165}& \frac{24}{165} & \frac{25}{165}& \frac{24}{165}& \frac{21}{165}& \frac{16}{165}& \frac{9}{165}& 0\\
\end{array}$$
Now, at step (iii), they each learn the others' distributions. And, at step (iv), they average. Let's suppose I'm the first individual. Then I have two choices for my bounded confidence interval. It either includes my own credence function $P(-|HT)$ and the third individual's $P(-|TH)$, which are identical, or it includes all three, $P(-|HT), P(-|HH), P(-|TH)$. Let's suppose it includes all three. Here is the outcome of averaging:$$\begin{array}{r|ccccccccccc}
& B_0 & B_1& B_2& B_3& B_4& B_5& B_6& B_7& B_8& B_9& B_{10} \\
\hline
&&&&&&&&&& \\
\mbox{Av} & 0 & \frac{129}{3465} & \frac{236}{3465}& \frac{321}{3465}& \frac{384}{3465}& \frac{425}{3465}& \frac{444}{3465}& \frac{441}{3465}& \frac{416}{3465}& \frac{369}{3465}& \frac{300}{3465}
\end{array}$$
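The averaging step can be checked with a short Python sketch. It assumes the uniform prior and reads "average difference between the credences" as the mean absolute difference across the eleven hypotheses; that reading, and the names `bc_posterior`, `mean_abs_diff` and `average_within`, are my assumptions rather than anything taken from Douven and Wenmackers' paper.

```python
from fractions import Fraction

BIASES = [Fraction(k, 10) for k in range(11)]
UNIFORM = [Fraction(1, 11)] * 11


def bc_posterior(prior, heads, tails):
    """Conditionalize on a toss sequence with the given counts."""
    unnorm = [p * b**heads * (1 - b)**tails for p, b in zip(prior, BIASES)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def mean_abs_diff(p, q):
    """Assumed distance: average absolute difference in credence."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


def average_within(own, others, eps):
    """Straight average of my credences and those within eps of mine."""
    pool = [own] + [q for q in others if mean_abs_diff(own, q) <= eps]
    return [sum(column) / len(pool) for column in zip(*pool)]


p_ht = bc_posterior(UNIFORM, 1, 1)  # first individual (me)
p_hh = bc_posterior(UNIFORM, 2, 0)  # second individual
p_th = bc_posterior(UNIFORM, 1, 1)  # third individual

print(float(mean_abs_diff(p_ht, p_hh)))
averaged = average_within(p_ht, [p_hh, p_th], eps=Fraction(1, 10))
# Numerators over the common denominator 3465, for comparison with the Av row.
print([int(x * 3465) for x in averaged])
```

On this reading the distance between $P(-|HT)$ and $P(-|HH)$ is roughly $0.088$, so $\varepsilon = 0.1$ already pulls all three credence functions into the average, while $\varepsilon = 0$ leaves me averaging only with the third individual, whose credences match mine.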
And now compare that with what they would do if they updated at step (iv) using BC rather than Averaging. I learn the distributions of the second and third individuals. Now, since I know how many times they tossed their coin, and I know that they updated by BC at step (ii), I thereby learn something about how their coin landed. I know that it landed in such a way that would lead them to update to $P(-|HH)$ and $P(-|TH)$, respectively. Now what exactly does this tell me? In the case of the second individual, it tells me that their coin landed $HH$, since that's the only evidence that would lead them to update to $P(-|HH)$. In the case of the third individual, my evidence is not quite so specific. I learn that their coin either landed $HT$ or $TH$, since either of those, and only those, would lead them to update to $P(-|TH)$. In general, learning an individual's posteriors when you know their prior and the number of times they've tossed the coin will teach you how many heads they saw and how many tails, though it won't tell you the order in which they saw them. But that's fine. We can still update on that information using BC, and indeed BC will tell us to adopt the same credence as we would if we were to learn the more specific evidence of the order in which the coin tosses landed. If we do so in this case, we get:
$$\begin{array}{r|ccccccccccc}
& B_0 & B_1& B_2& B_3& B_4& B_5& B_6& B_7& B_8& B_9& B_{10} \\
\hline&&&&&&&&&& \\
\mbox{Bayes} & 0 & \frac{81}{95205} & \frac{1024}{95205} & \frac{3969}{95205} & \frac{9216}{95205} & \frac{15625}{95205} & \frac{20736}{95205} & \frac{21609}{95205} & \frac{16384}{95205} & \frac{6561}{95205} &0 \\
\end{array}
$$And this is pretty far from what I got by Averaging at step (iv).
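The Bayes row can be reproduced in a few lines of Python. The sketch assumes the uniform prior and treats my total evidence as four heads and two tails: my own $HT$, the second individual's $HH$, and one head and one tail from the third. The helper name `bc_posterior` is illustrative.

```python
from fractions import Fraction

BIASES = [Fraction(k, 10) for k in range(11)]
UNIFORM = [Fraction(1, 11)] * 11


def bc_posterior(prior, heads, tails):
    """Conditionalize on toss evidence with the given counts.

    Learning only the counts (say, HT or TH) multiplies every likelihood by
    the same constant, which cancels on normalizing, so the order is idle.
    """
    unnorm = [p * b**heads * (1 - b)**tails for p, b in zip(prior, BIASES)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# All at once: four heads and two tails across the three coins.
pooled = bc_posterior(UNIFORM, heads=4, tails=2)

# Or in two stages: my own HT at step (ii), then the others' tosses at step (iv).
after_own = bc_posterior(UNIFORM, heads=1, tails=1)
after_social = bc_posterior(after_own, heads=3, tails=1)

print(after_social == pooled)  # True
# Numerators over the common denominator 95205.
print([int(x * 95205) for x in pooled])
# [0, 81, 1024, 3969, 9216, 15625, 20736, 21609, 16384, 6561, 0]
```

The two-stage and all-at-once updates agree, which is just the familiar fact that conditionalizing in sequence is the same as conditionalizing on the conjunction.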

So updating using BC is very different from averaging. Why, then, do Douven and Wenmackers use Averaging rather than BC for step (iv)? Here is their motivation:

[T]aking a convex combination of the probability functions of the individual agents in a group is the best studied method of forming social probability functions. Authors concerned with social probability functions have mostly considered assigning different weights to the probability functions of the various agents, typically in order to reflect agents’ opinions about other agents’ expertise or past performance. The averaging part of our update rule is in some regards simpler and in others less simple than those procedures. It is simpler in that we form probability functions from individual probability functions by taking only straight averages of individual probability functions, and it is less simple in that we do not take a straight average of the probability functions of all given agents, but only of those whose probability function is close enough to that of the agent whose probability is being updated. (552)

In some sense, they're right. Averaging or linear pooling or taking a convex combination of individual credence functions is indeed the best studied method of forming social credence functions. And there are good justifications for it: János Aczél and Carl Wagner and, independently, Kevin J. McConway, give a neat axiomatic characterization; and I've argued that there are accuracy-based reasons to use it in particular cases. The problem is that our situation in step (iv) is not the sort of situation in which you should use Averaging. Arguments for Averaging concern those situations in which you have a group of individuals, possibly experts, and each has a credence function over the same set of propositions, and you want to produce a single credence function that could be called the group's collective credence function. Thus, for instance, if I wish to give the SAGE group's collective credence that there will be a safe and effective SARS-CoV-2 vaccine by March 2021, I might take the average of their individual credences. But this is quite a different task from the one that faces me as the first individual when I reach step (iv) of Douven and Wenmackers' process. There, I already have credences in the propositions in question. What's more, I know how the other individuals update and the sort of evidence they will have received, even if I don't know which particular evidence of that sort they have. And that allows me to infer from their credences after the update at step (ii) a lot about the evidence they received. And I have opinions about the propositions in question conditional on the different evidence my fellow group members received. And so, in this situation, I'm not trying to summarise our individual opinions as a single opinion. Rather, I'm trying to use their opinions as evidence to inform my own. And, in that case, BC is better than Averaging.

So, in order to show that IBE is superior to BC in some respect, it doesn't help to compare BC at step (ii) + Averaging at step (iv) with IBE at (ii) + Averaging at (iv). It would be better to compare BC at (ii) and (iv) with IBE at (ii) and (iv).

So how do things look if we do that? Well, it turns out that we don't need simulations to answer the question. We can simply appeal to the mathematical results we mentioned in the previous post: first, Hilary Greaves and David Wallace's expected accuracy argument; and second, the accuracy dominance argument that Ray Briggs and I gave. Or, more precisely, we use the slight extensions of those results to multiple learning experiences that I sketched in the previous post. For both of those results, the background framework is the same. We begin with a prior, which we hold at $t_0$, before we begin gathering evidence. And we then look forward to a series of times $t_1, \ldots, t_n$ at each of which we will learn some evidence. And, for each time, we know the possible pieces of evidence we might receive, and we plan, for each time, which credence function we would adopt in response to each of the pieces of evidence we might learn at that time. Thus, formally, for each $t_i$ there is a partition from which our evidence at $t_i$ will come. For each $t_{i+1}$, the partition is a fine-graining of the partition at $t_i$. That is, our evidence gets more specific as we proceed. In the case we've been considering, at $t_1$, we'll learn the outcome of our own coin tosses; at $t_2$, we'll add to that our fellow group members' credence functions at $t_1$, from which we can derive a lot about the outcome of their first run of coin tosses; at $t_3$, we'll add to that the outcome of our next run of our own coin tosses; at $t_4$, we'll add the outcomes of the other group members' next run of coin tosses by learning their credences at $t_3$; and so on. The results are then as follows:

Theorem (Extended Greaves and Wallace) For any strictly proper inaccuracy measure, the updating rule that minimizes expected inaccuracy from the point of view of the prior is BC.

Theorem (Extended Briggs and Pettigrew) For any continuous and strictly proper inaccuracy measure, if your updating rule is not BC, then there is an alternative prior and alternative updating rule that accuracy dominates your prior and your updating rule.
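The first theorem can be illustrated numerically in the simplest case: one learning experience, two tosses, the uniform prior, and the Brier score. The Python sketch below is only a spot check under those assumptions, not a proof, and the function names (`update`, `brier`, `expected_inaccuracy`) are mine. It compares the prior's expectation of the inaccuracy of BC ($c = 0$) with that of one explanationist rule ($c = 0.1$).

```python
from fractions import Fraction
from math import comb

BIASES = [Fraction(k, 10) for k in range(11)]
UNIFORM = [Fraction(1, 11)] * 11


def update(prior, heads, tails, c):
    """The rule P^c_E from earlier in the post; c = 0 gives BC."""
    likes = [b**heads * (1 - b)**tails for b in BIASES]
    best = max(likes)
    winners = [k for k, like in enumerate(likes) if like == best]
    share = {1: c, 2: c / 2}.get(len(winners), Fraction(0))
    unnorm = [
        p * like + (share if k in winners else 0)
        for k, (p, like) in enumerate(zip(prior, likes))
    ]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def brier(credences, true_index):
    """Brier inaccuracy over the eleven bias hypotheses when B_true_index holds."""
    return sum(
        (q - (1 if j == true_index else 0)) ** 2 for j, q in enumerate(credences)
    )


def expected_inaccuracy(prior, n_tosses, c):
    """Prior expectation of the posterior's Brier score after n_tosses tosses."""
    total = Fraction(0)
    for h in range(n_tosses + 1):
        posterior = update(prior, h, n_tosses - h, c)
        for k, b in enumerate(BIASES):
            # Chance, given B_k, of seeing exactly h heads in n_tosses tosses.
            chance = comb(n_tosses, h) * b**h * (1 - b) ** (n_tosses - h)
            total += prior[k] * chance * brier(posterior, k)
    return total


bc_score = expected_inaccuracy(UNIFORM, 2, Fraction(0))
ibe_score = expected_inaccuracy(UNIFORM, 2, Fraction(1, 10))
print(float(bc_score), float(ibe_score))
print(bc_score < ibe_score)  # True
```

Grouping the evidence by number of heads rather than by exact sequence is harmless here, since both rules respond to a sequence only through its counts.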

Now, these results immediately settle one question: if you are an individual in the group, and you know which update rules the others have chosen to use, then you should certainly choose BC for yourself. After all, if you have picked your prior, then it expects picking BC to minimize your inaccuracy, and thus expects picking BC to minimize the total inaccuracy of the group that includes you; and if you have not picked your prior, then if you consider a prior together with something other than BC as your updating rule, there's some other combination you could choose instead that is guaranteed to do better, and thus some other combination you could choose that is guaranteed to improve the total accuracy of the group. But Douven and Wenmackers don't set up the problem like this. Rather, they assume that all members of the group use the same updating rule. So the question is whether everyone picking BC is better than everyone picking something else. Fortunately, at least in the case of the coin tosses, this does follow. As we'll see, things could get more complicated with other sorts of evidence.

If you know the updating rules that others will use, then you pick your updating rule simply on the basis of its ability to get you the best accuracy possible; the others have made their choices and you can't affect that. But if you are picking an updating rule for everyone to use, you must consider not only its properties as an updating rule for the individual, but also its properties as a means of signalling to the other members what evidence you have. Thus, prior to considering the details of this, you might think that there could be an updating rule that is very good at producing accurate responses to evidence, but poor at producing a signal to others of the evidence you've received: there might be a wide range of different pieces of evidence you could receive that would lead you to update to the same posterior using this rule, and in that case, learning your posterior would give little information about your evidence. If that were so, we might prefer an updating rule that does not produce such accurate updates, but does signal very clearly what evidence is received. For, in that situation, each individual would produce a less accurate update at step (ii), but would then receive a lot more evidence at step (iv), because the update at step (ii) would signal the evidence that the other members of the group received much more clearly. However, in the coin toss setup that Douven and Wenmackers consider, this isn't an issue. In the coin toss case, learning someone's posterior when you know their prior and how many coin tosses they have observed allows you to learn exactly how many heads and how many tails they observed. It doesn't tell you the order in which they observed them, but knowing that further information wouldn't affect how you would update anyway, either on the BC rule or on the IBE rule: learning $HT \vee TH$ leads to the same update as learning $HT$ for both Bayesian and IBEist.

So when we are comparing them, we can consider the information learned at step (ii) and step (iv) both to be worldly information. Both give us information about the tosses of the coin that our peers witnessed. So we needn't take into account how good the two rules are at signalling the evidence you have: they are equally good, and both very good. In choosing a single rule that each member of the group must use, then, we need only compare the accuracy of using them as update rules. And the theorems above indicate that BC wins out on that measure.
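The signalling claim can be spot-checked in Python: for a fixed prior and a fixed number of tosses, different head counts should lead to different posteriors, so that a posterior reveals the count behind it. The sketch assumes the uniform prior and checks only a handful of toss numbers and values of $c$; the names `update` and `reveals_head_count` are illustrative.

```python
from fractions import Fraction

BIASES = [Fraction(k, 10) for k in range(11)]
UNIFORM = [Fraction(1, 11)] * 11


def update(prior, heads, tails, c):
    """The rule P^c_E from earlier in the post; c = 0 gives BC."""
    likes = [b**heads * (1 - b)**tails for b in BIASES]
    best = max(likes)
    winners = [k for k, like in enumerate(likes) if like == best]
    share = {1: c, 2: c / 2}.get(len(winners), Fraction(0))
    unnorm = [
        p * like + (share if k in winners else 0)
        for k, (p, like) in enumerate(zip(prior, likes))
    ]
    total = sum(unnorm)
    return tuple(u / total for u in unnorm)  # tuple, so it can go in a set


def reveals_head_count(n_tosses, c):
    """True if each head count out of n_tosses yields a distinct posterior."""
    posteriors = {
        update(UNIFORM, h, n_tosses - h, c) for h in range(n_tosses + 1)
    }
    return len(posteriors) == n_tosses + 1


for c in (Fraction(0), Fraction(1, 10), Fraction(1)):
    print(float(c), all(reveals_head_count(n, c) for n in range(1, 6)))
```

For each value of $c$ tried, every toss number from one to five gives a different posterior for each head count, which is what the argument needs: the posterior is as good a signal of the counts under the explanationist rules as under BC.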