Wednesday, 6 January 2021

Life on the edge: a response to Schultheis' challenge to epistemic permissivism about credences

In their 2018 paper, 'Living on the Edge', Ginger Schultheis issues a powerful challenge to epistemic permissivism about credences, the view that there are bodies of evidence in response to which there are a number of different credence functions it would be rational to adopt. The heart of the argument is the claim that a certain sort of situation is impossible. Schultheis thinks that all motivations for permissivism must render situations of this sort possible. Therefore, permissivism must be false, or at least these motivations for it must be wrong.

Here's the situation, where we write $R_E$ for the set of credence functions that it is rational to have when your total evidence is $E$. 

  • Our agent's total evidence is $E$.
  • There is $c$ in $R_E$ that our agent knows is a rational response to $E$.
  • There is $c'$ in $R_E$ that our agent does not know is a rational response to $E$.

Schultheis claims that the permissivist must take this to be possible, whereas in fact it is impossible. Here are a couple of specific examples that the permissivist will typically take to be possible.

Example 1: we might have a situation in which the credences it is rational to assign to a proposition $X$ in response to evidence $E$ form the interval $[0.4, 0.7]$. But we might not be sure of quite the extent of the interval. For all we know, it might be $[0.41, 0.7]$ or $[0.39, 0.71]$. Or it might be $[0.4, 0.7]$. So we are sure that $0.5$ is a rational credence in $X$, but we're not sure whether $0.4$ is a rational credence in $X$. In this case, $c(X) = 0.5$ and $c'(X) = 0.4$.

Example 2: you know that Probablism is a rational requirement on credence functions, and you know that satisfying the Principle of Indifference is rationally permitted, but you don't know whether or not it is also rationally required. In this case, $c$ is the uniform distribution required by the Principle of Indifference, but $c'$ is any other probability function.

Schultheis then appeals to a principle called Weak Rationality Dominance. We say that one credence function $c$ rationally dominates another $c'$ if $c$ is rational in all worlds in which $c'$ is rational, and also rational in some worlds in which $c'$ is not rational. Weak Rationality Dominance says that it is irrational to adopt a rationally dominated credence function. The important consequence of this for Schultheis' argument is that, if you know that $c$ is rational, but you don't know whether $c'$ is, then $c'$ is irrational. As a result, in our example above, $c'$ is not rational, contrary to what the permissivist claims, because it is rationally dominated by $c$. So permissivism must be false.

If Weak Rationality Dominance is correct, then, it follows that the permissivist must say that, for any body of evidence $E$ and set $R_E$ of rational responses, the agent with evidence $E$ either must know of each credence function in $R_E$ that it is in $R_E$, or they must not know of any credence function in $R_E$ that it is in $R_E$. If they know of some credence functions in $R_E$ that they are in $R_E$ and not know of others in $R_E$ that they are in $R_E$, then they clash with Weak Rationality Dominance. But, whatever your reason for being a permissivist, it seems very likely that it will entail situations in which there are some credence functions that are rational responses to your evidence and that you know are such responses, while you are unsure about other credence functions that are, in fact, rational responses whether or not they are, in fact, rational responses. This is Schultheis' challenge.

I'd like to explore a response to Schultheis' argument that takes issue with Weak Rationality Dominance (WRD). I'll spell out the objection in general to begin with, and then see how it plays out for a specific motivation for permissivism, namely, the Jamesian motivation I sketched in this previous blogpost

One worry about WRD is that it seems to entail a deference principle of exactly the sort that I objected to in this blogpost. According to such deference principles, for certain agents in certain situations, if they learn of a credence function that it is rational, they should adopt it. For instance, Ben Levinstein claims that, if you are certain that you are irrational, and you learn that $c$ is rational, then you should adopt $c$ -- or at least you should have the conditional credences that would lead you to do this if you were to apply conditionalization. We might slightly strengthen Levinstein's version of the deference principle as follows: if you are unsure whether you are rational or not, and you learn that $c$ is rational, then you should adopt $c$. WRD entails this deference principle. After all, suppose you have credence function $c'$, and you are unsure whether or not it is rational. And suppose you learn that $c$ is rational (and don't thereby learn that $c'$ is as well). Then, according to Schultheis' principle, you are irrational if you stick with $c'$.

In the previous blogpost, I objected to Levinstein's deference principle, and others like it, because it relies on the assumption that all rational credence functions are better than all irrational credence functions. I think that's false. I think there are certain sorts of flaw that render you irrational, and lacking those flaws renders you rational. But lacking those flaws doesn't ensure that you're going to be better than someone who has those flaws. Consider, for instance, the extreme subjective Bayesian who justifies their position using an accuracy dominance argument of the sort pioneered by Jim Joyce. That is, they say that accuracy is the sole epistemic good for credence functions. And they say that non-probabilistic credence functions are irrational because, for any such credence function, there are probabilistic ones that accuracy dominate them; and all probabilistic credence functions are rational because, for any such credence function, there is no probabilistic one that accuracy dominates it. Now, suppose I have credence $0.91$ in $X$ and $0.1$ in $\overline{X}$. And suppose I am either sure that this is irrational, or I'm uncertain it is. I then learn that assigning credence $0.1$ to $X$ and $0.9$ to $\overline{X}$ is rational. What should I do? It isn't at all obvious to me that I should move from my credence function to the one I've learned is rational. After all, even from my slightly incoherent standpoint, it's possible to see that the rational one is going to be a lot less accurate than mine if $X$ is true, and I'm very confident that it is. 

So I think that the rational deference principle is wrong, and therefore any version of WRD that entails it is also wrong. But perhaps there is a more restricted version of WRD that is right. And one that is nonetheless capable of sinking permissivism. Consider, for instance, a restricted version of WRD that applies only to agents who have no credence function --- that is, it applies to your initial choice of a credence function; it does not apply when you have a credence function and you are deciding whether to adopt a new one. This makes a difference. The problem with a version that applies when you already have a credence function $c'$ is that, even if it is irrational, it might nonetheless be better than the rational credence function $c$ in some situation, and it might be that $c'$ assigns a lot of credence to that situation. So it's hard to see how to motivate the move from $c'$ to $c$. However, in a situation in which you have no credence function, and you are unsure whether $c'$ is rational (even though it is) and you're certain that $c$ is rational (and indeed it is), WRD's demand that you should not pick $c'$ seems more reasonable. You occupy no point of view such that $c'$ is less of a depature from that point of view than $c$ is. You know only that $c$ lacks the flaws for sure, whereas $c'$ might have them. Better, then, to go for $c$, is it not? And if it is, this is enough to defeat permissivism.

I think it's not quite that simple. I noted above that Levinstein's deference principle relies on the assumption that all rational credence functions are better than all irrational credence functions. Schultheis' WRD seems to rely on something even stronger, namely, the assumption that all rational credence functions are equally good in all situations. For suppose they are not. You might then be unsure whether $c'$ is rational (though it is) and sure that $c$ is rational (and it is), but nonetheless rationally opt for $c'$ because you know that $c'$ has some good feature that you know $c$ lacks and you're willing to take the risk of having an irrational credence function in order to open the possibility of having that good feature.

Here's an example. You are unsure whether it is rational to assign $0.7$ to $X$ and $0.3$ to $\overline{X}$. It turns out that it is, but you don't know that. On the other hand, you do know that it is rational to assign 0.5 to each proposition. But the first assignment and the second are not equally good in all situations. The second has the same accuracy whether $X$ is true or false; the first, in constrast, is better than the first if $X$ is true and worse than the first if $X$ is false. The second does not open up the possibility of high accuracy that the first does; though, to compensate, it also precludes the possibility of low accuracy, which the first doesn't. Surveying the situation, you think that you will take the risk. You'll adopt the first, even though you aren't sure whether or not it is rational. And you'll do this because you want the possibility of being rational and having that higher accuracy. This seems a rational thing to do. So, it seems to me, WRD is false.

Although I think this objection to WRD works, I think it's helpful to see how it might play out for a particular motivation for permissivism. Here's the motivation: Some credence functions offer the promise of great accuracy -- for instance, assigning 0.9 to $X$ and 0.1 to $\overline{X}$ will be very accurate if $X$ is true. However, those that do so also open the possibility of great inaccuracy -- if $X$ is false, the credence function just considered is very inaccurate. Other credence functions neither offer great accuracy nor risk great inaccuracy. For instance, assigning 0.5 to both $X$ and $\overline{X}$ guarantees the same inaccuracy whether or not $X$ is true. You might say that you are more risk-averse the lower is the maximum possible inaccuracy you are willing to risk. Thus, the options that are rational for you are those undominated options with maximum inaccuracy at most whatever the threshold is that you set. Now, suppose you use the Brier score to measure your inaccuracy -- so that the inaccuracy of the credence function $c(X) = p$ and $c(\overline{X}) = 1-p$ is $2(1-p)^2$ if $X$ is true and $2p^2$ if $X$ is false. And suppose you are willing to tolerate a maximum possible inaccuracy of $0.5$, which also gives you a mininum inaccuracy of $0.5$. In that case, only $c(X) = 0.5 = c(\overline{X})$ will be rational from the point of view of your risk attitudes --- since $2(1-0.5)^2 = 0.5 = 2(0.5^2)$. On the other hand, suppose you are willing to tolerate a maximum inaccuracy of $0.98$, which also gives you a minimum inaccuracy of $0.18$. In that case, any credence function $c$ with $0.3 \leq c(X) \leq 0.7$ and $c(\overline{X}) = 1-c(X)$ is rational from the point of view of your risk attitudes.

Now, suppose that you are in the sort of situation that Schultheis imagines. You are uncertain of the extent of the set $R_E$ of rational responses to your evidence $E$. On the account we're considering, this must be because you are uncertain of your own attitudes to epistemic risk. Let's say that the threshold of maximum inaccuracy that you're willing to tolerate is $0.98$, but you aren't certain of that --- you think it might be anything between $0.72$ and $1.28$. So you're sure that it's rational to assign anything between 0.4 and 0.6 to $X$, but unsure whether it's rational to assign $0.7$ to $X$ --- if your threshold turns out to be less than 0.98, then assigning $0.7$ to $X$ would be irrational, because it risks inaccuracy of $0.98$. In this situation, is it rational to assign $0.7$ to $X$? I think it is. Among the credence functions that you know for sure are rational, the ones that give you the lowest possible inaccuracy are the one that assigns 0.4 to $X$ and the one that assigns 0.6 to $X$. They have maximum inaccuracy of 0.72, and they open up the possibility of an inaccuracy of 0.32, which is lower than the lowest possible inaccuracy opened up by any others that you know to be rational. On the other hand, assigning 0.7 to $X$ opens up the possibility of an inaccuracy of 0.18, which is considerably lower. As a result, it doesn't seem irrational to assign 0.7 to $X$, even though you don't know whether it is rational from the point of view of your attitudes to risk, and you do know that assigning 0.6 is rational. 

There is another possible response to Schultheis' challenge for those who like this sort of motivation for permissivism. You might simply say that, if your attitudes to risk are such that you will tolerate a maximum inaccuracy of at most $t$, then regardlesss of whether you know this fact, indeed regardless of your level of uncertainty about it, the rational credence functions are precisely those that have maximum inaccuracy of at most $t$. This sort of approach is familiar from expected utility theory. Suppose I have credences in $X$ and in $\overline{X}$. And suppose I face two options whose utility is determined by whether or not $X$ is true or false. Then, regardless of what I believe about my credences in $X$ and $\overline{X}$, I should choose whichever option maximises expected utility from the point of view of my actual credences. The point is this: if what it is rational for you to believe or to do is determined by some feature of you, whether it's your credences or your attitudes to risk, being uncertain about those features doesn't change what it is rational for you to do. This introduces a certain sort of externalism to our notion of rationality. There are features of ourselves -- our credences or our attitudes to risk -- that determine what it is rational for us to believe or do, which are nonetheless not luminous to us. But I think this is inevitable. Of course, we might might move up a level and create a version of expected utility theory that appeals not to our first-order credences but to our credences concerning those first-order credences -- perhaps you use the higher-order credences to define a higher-order expected value for the first-order expected utilities, and you maximize that. But it simply pushes the problem back a step. For your higher-order credences are no more luminous than your first-order ones. And to stop the regress, you must fix some level at which the credences at that level simply determine the expectation that rationality requires you to maximize, and any uncertainty concerning those does not affect rationality. And the same goes in this case. So, given this particular motivation for permissivism, which appeals to your attitudes to epistemic risk, it seems that there is another reason why WRD is false. If $c$ is in $R_E$, then it is rational for you, regardless of your epistemic attitude to its rationality.

Monday, 4 January 2021

Using a generalized Hurwicz criterion to pick your priors

Over the summer, I got interested in the problem of the priors again. Which credence functions is it rational to adopt at the beginning of your epistemic life? Which credence functions is it rational to have before you gather any evidence? Which credence functions provide rationally permissible responses to the empty body of evidence? As is my wont, I sought to answer this in the framework of epistemic utility theory. That is, I took the rational credence functions to be those declared rational when the appropriate norm of decision theory is applied to the decision problem in which the available acts are all the possible credence functions, and where the epistemic utility of a credence function is measured by a strictly proper measure. I considered a number of possible decision rules that might govern us in this evidence-free situation: Maximin, the Principle of Indifference, and the Hurwicz criterion. And I concluded in favour of a generalized version of the Hurwicz criterion, which I axiomatised. I also described which credence functions that decision rule would render rational in the case in which there are just three possible worlds between which we divide our credences. In this post, I'd like to generalize the results from that treatment to the case in which there any finite number of possible worlds.

Here's the decision rule (where $a(w_i)$ is the utility of $a$ at world $w_i$).

Generalized Hurwicz Criterion  Given an option $a$ and a sequence of weights $0 \leq \lambda_1, \ldots, \lambda_n \leq 1$ with $\sum^n_{i=1} \lambda_i = 1$, which we denote $\Lambda$, define the generalized Hurwicz score of $a$ relative to $\Lambda$ as follows: if $$a(w_{i_1}) \geq a(w_{i_2}) \geq \ldots \geq a(w_{i_n})$$ then $$H^\Lambda(a) := \lambda_1a(w_{i_1}) + \ldots + \lambda_na(w_{i_n})$$That is, $H^\Lambda(a)$ is the weighted average of all the possible utilities that $a$ receives, where $\lambda_1$ weights the highest utility, $\lambda_2$ weights the second highest, and so on.

The Generalized Hurwicz Criterion says that you should order options by their generalized Hurwicz score relative to a sequence $\Lambda$ of weightings of your choice. Thus, given $\Lambda$,$$a \preceq^\Lambda_{ghc} a' \Leftrightarrow H^\Lambda(a) \leq H^\Lambda(a')$$And the corresponding decision rule says that you should pick your Hurwicz weights $\Lambda$ and then, having done that, it is irrational to choose $a$ if there is $a'$ such that $a \prec^\Lambda_{ghc} a'$.

Now, let $\mathfrak{U}$ be an additive strictly proper epistemic utility measure. That is, it is generated by a strictly proper scoring rule. A strictly proper scoring rule is a function $\mathfrak{s} : \{0, 1\} \times [0, 1] \rightarrow [-\infty, 0]$ such that, for any $0 \leq p \leq 1$, $p\mathfrak{s}(1, x) + (1-p)\mathfrak{s}(0, x)$ is maximized, as a function of $x$, uniquely at $x = p$. And an epistemic utility measure is generated by $\mathfrak{s}$ if, for any credence function $C$ and world $w_i$,$$\mathfrak{U}(C, w_i) = \sum^n_{j=1} \mathfrak{s}(w^j_i, c_j)$$where

  • $c_j = C(w_j)$, and
  • $w^j_i = 1$ if $j=i$ and $w^j_i = 0$ if $j \neq i$

In what follows, we write the sequence $(c_1, \ldots, c_n)$ to represent the credence function $C$.

Also, given a sequence $(\alpha_1, \ldots, \alpha_k)$ of numbers, let$$\mathrm{Av}((\alpha_1, \ldots, \alpha_k)) := \frac{\alpha_1 + \ldots  + \alpha_k}{k}$$That is, $\mathrm{av}(A)$ is the average of the numbers in $A$. And given $1 \leq k \leq n$, let $A|_k = (a_1, \ldots, a_k)$. That is, $A|_k$ is the truncation of the sequence $A$ that omits all terms after $a_k$. Then we say that $A$ does not exceed its average if, for each $1 \leq k \leq n$,$$\mathrm{av}(A) \geq \mathrm{av}(A|_k)$$That is, at no point in the sequence does the average of the numbers up to that point exceed the average of all the numbers in the sequence.

Theorem 1 Suppose $\Lambda = (\lambda_1, \ldots, \lambda_n)$ is a sequence of generalized Hurwicz weights. Then there is a sequence of subsequences $\Lambda_1, \ldots, \Lambda_m$ of $\Lambda$ such that

  1. $\Lambda = \Lambda_1 \frown \ldots \frown \Lambda_m$
  2. $\mathrm{av}(\Lambda_1) \geq \ldots \geq \mathrm{av} (\Lambda_m)$
  3. each $\Lambda_i$ does not exceed its average

Then, the credence function$$(\underbrace{\mathrm{av}(\Lambda_1), \ldots, \mathrm{av}(\Lambda_1)}_{\text{length of $\Lambda_1$}}, \underbrace{\mathrm{av}(\Lambda_2), \ldots, \mathrm{av}(\Lambda_2)}_{\text{length of $\Lambda_2$}}, \ldots, \underbrace{\mathrm{av}(\Lambda_m), \ldots, \mathrm{av}(\Lambda_m)}_{\text{length of $\Lambda_m$}})$$maximizes $H^\Lambda(\mathfrak{U}(-))$ among credence functions $C = (c_1, \ldots, c_n)$ for which $c_1 \geq \ldots \geq c_n$.

This is enough to give us all of the credence functions that maximise $H^\Lambda(\mathfrak{U}(-))$: they are the credence function mentioned together with any permutation of it --- that is, any credence function obtained from that one by switching around the credences assigned to the worlds.

Proof of Theorem 1. Suppose $\mathfrak{U}$ is a measure of epistemic value that is generated by the strictly proper scoring rule $\mathfrak{s}$. And suppose that $\Lambda$ is the following sequence of generalized Hurwicz weights $0 \leq \lambda_1, \ldots, \lambda_n \leq 1$ with $\sum^n_{i=1} \lambda_i = 1$.

First, due to a theorem that originates in Savage and is stated and proved fully by Predd, et al., if $C$ is not a probability function---that is, if $c_1 + \ldots + c_n \neq 1$---then there is a probability function $P$ such that $\mathfrak{U}(P, w_i) > \mathfrak{U}(C, w_i)$ for all worlds $w_i$. Thus, since GHC satisfies Strong Dominance, whatever maximizes $H^\Lambda(\mathfrak{U}(-))$ will be a probability function.

Now, since $\mathfrak{U}$ is generated by a strictly proper scoring rule, it is also truth-directed. That is, if $c_i > c_j$, then $\mathfrak{U}(C, w_i) > \mathfrak{U}(C, w_j)$. Thus, if $c_1 \geq c_2 \geq \ldots \geq c_n$, then$$H^\Lambda(\mathfrak{U}(C)) = \lambda_1\mathfrak{U}(C, w_1) + \ldots + \lambda_n\mathfrak{U}(C, w_n)$$This is what we seek to maximize. But notice that this is just the expectation of $\mathfrak{U}(C)$ from the point of view of the probability distribution $\Lambda = (\lambda_1, \ldots, \lambda_n)$.

Now, Savage also showed that, if $\mathfrak{s}$ is strictly proper and continuous, then there is a differentiable and strictly convex function $\varphi$ such that, if $P, Q$ are probabilistic credence functions, then
\mathfrak{D}_\mathfrak{s}(P, Q) & = & \sum^n_{i=1} \varphi(p_i) - \sum^n_{i=1} \varphi(q_i) - \sum^n_{i=1} \varphi'(q_i)(p_i - q_i) \\
& = & \sum^n_{i=1} p_i\mathfrak{U}(P, w_i) - \sum^n_{i=1} p_i\mathfrak{U}(Q, w_i)
So $C$ maximizes $H^\Lambda(\mathfrak{U}(-))$ among credence functions $C$ with $c_1 \geq \ldots \geq c_n$ iff $C$ minimizes $\mathfrak{D}_\mathfrak{s}(\Lambda, -)$ among credence functions $C$ with $c_1 \geq \ldots \geq c_n$. We now use the KKT conditions to calculate which credence functions minimize $\mathfrak{D}_\mathfrak{s}(\Lambda, -)$ among credence functions $C$ with $c_1 \geq \ldots \geq c_n$.

Thus, if we write $x_n$ for $1 - x_1 - \ldots - x_{n-1}$, then
f(x_1, \ldots, x_{n-1}) = \mathfrak{D}((\lambda_1, \ldots, \lambda_n), (x_1, \ldots, x_n)) = \\
\sum^n_{i=1} \varphi(\lambda_i) - \sum^n_{i=1} \varphi(x_i) - \sum^n_{i=1} \varphi'(x_i)(\lambda_i - x_i)
\nabla f = \langle \varphi''(x_1) (x_1 - \lambda_1) - \varphi''(x_n)(x_n - \lambda_n), \\
\varphi''(x_2) (x_2 - \lambda_2) - \varphi''(x_n)(x_n - \lambda_n), \ldots \\
\varphi''(x_{n-1}) (x_{n-1} - \lambda_{n-1}) - \varphi''(x_n)(x_n - \lambda_n) )\rangle

Let $$\begin{array}{rcccl}
g_1(x_1, \ldots, x_{n-1}) & = & x_2 - x_1&  \leq & 0\\
g_2(x_1, \ldots, x_{n-1}) & = & x_3 - x_2&  \leq & 0\\
\vdots & \vdots & \vdots & \vdots & \vdots \\
g_{n-2}(x_1, \ldots, x_{n-1}) & = & x_{n-1} - x_{n-2}&  \leq & 0 \\
g_{n-1}(x_1, \ldots, x_{n-1}) & = & 1 - x_1 - \ldots - x_{n-2} - 2x_{n-1} & \leq & 0
\nabla g_1 & = & \langle -1, 1, 0, \ldots, 0 \rangle \\
\nabla g_2 & = & \langle 0, -1, 1, 0, \ldots, 0 \rangle \\
\vdots & \vdots & \vdots \\
\nabla g_{n-2} & = & \langle 0, \ldots, 0, -1, 1 \rangle \\
\nabla g_{n-1} & = & \langle -1, -1, -1, \ldots, -1,  -2 \rangle \\
So the KKT theorem says that $x_1, \ldots, x_n$ is a minimizer iff there are $0 \leq \mu_1, \ldots, \mu_{n-1}$ such that$$\nabla f(x_1, \ldots, x_{n-1}) + \sum^{n-1}_{i=1} \mu_i \nabla g_i(x_1, \ldots, x_{n-1}) = 0$$That is, iff there are $0 \leq \mu_1, \ldots, \mu_{n-1}$ such that
\varphi''(x_1) (x_1 - \lambda_1) - \varphi''(x_n)(x_n - \lambda_n) - \mu_1 - \mu_{n-1} & = & 0 \\
\varphi''(x_2) (x_2 - \lambda_2) - \varphi''(x_n)(x_n - \lambda_n) + \mu_1 - \mu_2 - \mu_{n-1} & = & 0 \\
\vdots & \vdots & \vdots \\
\varphi''(x_{n-2}) (x_{n-2} - \lambda_{n-2}) - \varphi''(x_n)(x_n - \lambda_n) + \mu_{n-3} - \mu_{n-2} - \mu_{n-1}& = & 0 \\
\varphi''(x_{n-1}) (x_{n-1} - \lambda_{n-1}) - \varphi''(x_n)(x_n - \lambda_n)+\mu_{n-2} - 2\mu_{n-1} & = & 0
By summing these identities, we get:
\mu_{n-1} &  = & \frac{1}{n} \sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{n-1}{n} \varphi''(x_n)(x_n - \lambda_n) \\
&= & \frac{1}{n} \sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \varphi''(x_n)(x_n - \lambda_n) \\
& = & \sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{n-1}{n}\sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i)
So, for $1 \leq k \leq n-2$,
\mu_k & = & \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - k\varphi''(x_n)(x_n - \lambda_n) - \\
&& \hspace{20mm} \frac{k}{n}\sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) + k\frac{n-1}{n} \varphi''(x_n)(x_n - \lambda_n) \\
& = & \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{k}{n}\sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) -\frac{k}{n} \varphi''(x_n)(x_n - \lambda_n) \\
&= & \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{k}{n}\sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i)
So, for $1 \leq k \leq n-1$,
$$\mu_k = \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{k}{n}\sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i)$$
Now, suppose that there is a sequence of subsequences $\Lambda_1, \ldots, \Lambda_m$ of $\Lambda$ such that

  1. $\Lambda = \Lambda_1 \frown \ldots \frown \Lambda_m$
  2. $\mathrm{av}(\Lambda_1) \geq \ldots \geq \mathrm{av}(\Lambda_m)$
  3. each $\Lambda_i$ does not exceed its average.

And let $$P = (\underbrace{\mathrm{av}(\Lambda_1), \ldots, \mathrm{av}(\Lambda_1)}_{\text{length of $\Lambda_1$}}, \underbrace{\mathrm{av}(\Lambda_2), \ldots, \mathrm{av}(\Lambda_2)}_{\text{length of $\Lambda_2$}}, \ldots, \underbrace{\mathrm{av}(\Lambda_m), \ldots, \mathrm{av}(\Lambda_m)}_{\text{length of $\Lambda_m$}})$$Then we write $i \in \Lambda_j$ if $\lambda_i$ is in the subsequence $\Lambda_j$. So, for $i \in \Lambda_j$, $p_i = \mathrm{av}(\Lambda_j)$. Then$$\frac{k}{n}\sum^n_{i=1} \varphi''(p_i)(p_i - \lambda_i) = \frac{k}{n} \sum^m_{j = 1} \sum_{i \in \Lambda_j} \varphi''(\mathrm{av}(\Lambda_j))(\mathrm{av}(\Lambda_j) - \lambda_i) = 0 $$
Now, suppose $k$ is in $\Lambda_j$. Then
\mu_k = \sum^k_{i=1} \varphi''(p_i)(p_i - \lambda_i) = \\
\sum_{i \in \Lambda_1} \varphi''(p_i)(p_i - \lambda_i) + \sum_{i \in \Lambda_2} \varphi''(p_i)(p_i - \lambda_i) + \ldots + \\
\sum_{i \in \Lambda_{j-1}} \varphi''(p_i)(p_i - \lambda_i) + \sum_{i \in \Lambda_j|_k} \varphi''(p_i)(p_i - \lambda_i) = \\
\sum_{i \in \Lambda_j|_k} \varphi''(p_i)(p_i - \lambda_i) = \sum_{i \in \Lambda_j|_k} \varphi''(\mathrm{av}(\Lambda_j)(\mathrm{av}(\Lambda_j) - \lambda_i)
So, if $|\Lambda|$ is the length of the sequence $\Lambda$,$$\mu_k \geq 0 \Leftrightarrow |\Lambda_j|_k|\mathrm{av}(\Lambda_j) - \sum_{i \in \Lambda_j|_k} \lambda_i \geq 0 \Leftrightarrow \mathrm{av}(\Lambda_j) \geq \mathrm{av}(\Lambda_j|_k)$$But, by assumption, this is true for all $1 \leq k \leq n-1$. So $P$ minimizes $H^\Lambda(\mathfrak{U}(-))$, as required.

We now show that there is always a series of subsequences that satisfy (1), (2), (3) from above.  We proceed by induction. 

Base Case  $n = 1$. Then it is clearly true with the subsequence $\Lambda_1 = \Lambda$.

Inductive Step  Suppose it is true for all sequences $\Lambda = (\lambda_1, \ldots, \lambda_n)$ of length $n$. Now consider a sequence $(\lambda_1, \ldots, \lambda_n, \lambda_{n+1})$. Then, by the inductive hypothesis, there is a sequence of sequences $\Lambda_1, \ldots, \Lambda_m$ such that

  1. $\Lambda \frown (\lambda_{n+1}) = \Lambda_1 \frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})$
  2. $\mathrm{av}(\Lambda_1) \geq \ldots \geq \mathrm{av} (\Lambda_m)$
  3. each $\Lambda_i$ does not exceed its average.

Now, first, suppose $\mathrm{av}(\Lambda_m) \geq \lambda_{n+1}$. Then let $\Lambda_{m+1} = (\lambda_{n+1})$ and we're done.

So, second, suppose $\mathrm{av}(\Lambda_m) < \lambda_{n+1}$. Then we find the greatest $k$ such that$$\mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1}))$$Then we let $\Lambda^*_{k+1} = \Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})$. Then we can show that

  1. $(\lambda_1, \ldots, \lambda_n, \lambda_{n+1}) = \Lambda_1 \frown \Lambda_2 \frown \ldots \frown \Lambda_k \frown \Lambda^*_{k+1}$.
  2. Each $\Lambda_1, \ldots, \Lambda_k, \Lambda^*_{k+1}$ does not exceed average.
  3. $\mathrm{av}(\Lambda_1) \geq \mathrm{av}(\Lambda_2) \geq \ldots \geq \mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda^*_{k+1})$.

(1) and (3) are obvious. So we prove (2). In particular, we show that $\Lambda^*_{k+1}$ does not exceed average. We assume that each subsequence $\Lambda_j$ starts with $\Lambda_{i_j+1}$

  • Suppose $i \in \Lambda_{k+1}$. Then, since $\Lambda_{k+1}$ does not exceed average, $$\mathrm{av}(\Lambda_{k+1}) \geq \mathrm{av}(\Lambda_{k+1}|_i)$$But, since $k$ is the greatest number such that$$\mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1}))$$We know that$$\mathrm{av}(\Lambda_{k+2}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1}|_i)$$
  • Suppose $i \in \Lambda_{k+2}$. Then, since $\Lambda_{k+2}$ does not exceed average, $$\mathrm{av}(\Lambda_{k+2}) \geq \mathrm{av}(\Lambda_{k+2}|_i)$$But, since $k$ is the greatest number such that$$\mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1}))$$We know that$$\mathrm{av}(\Lambda_{k+3}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+2})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+2}|_i)$$But also, from above,$$ \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1} \frown \Lambda_{k+2}|_i)$$
  • And so on.

This completes the proof. $\Box$

Friday, 1 January 2021

How permissive is rationality? Horowitz's value question for moderate permissivism

Rationality is good; irrationality is bad. Most epistemologists would agree with this rather unnuanced take, regardless of their view of what exactly constitutes rationality and its complement. Granted this, a good test of a thesis in epistemology is whether it can explain why these two claims are true. Can it answer the value question: Why is rationality valuable and irrationality not? And indeed Sophie Horowitz gives an extremely illuminating appraisal of different degrees of epistemic permissivism and impermissivism by asking of each what answer it might give. Her conclusion is that the extreme permissivist -- played in her paper by the extreme subjective Bayesian, who thinks that satisfying Probabilism and being certain of your evidence is necessary and sufficient for rationality -- can give a satisfying answer to this question, or, at least, an answer that is satisfying from their own point of view. And the extreme impermissivist -- played here by the objective Bayesian, who thinks that rationality requires something like the maximum entropy distribution relative to your evidence -- can do so too. But, Horowitz argues, the moderate permissivist -- played by the moderate Bayesian, who thinks rationality imposes requirements more stringent than merely Probabilism, but who does not think they're stringent enough to pick out a unique credence function -- cannot. In this post, I'd like to raise some problems for Horowitz's assessment, and try to offer my own answer to the value question on behalf of the moderate Bayesian. (Full disclosure: If I'm honest, I think I lean towards extreme permissivism, but I'd like to show that moderate permissivism can defend itself against Horowitz's objection.)

Let's begin with the accounts that Horowitz gives on behalf of the extreme permissivist and the impermissivist.

The extreme permissivist -- the extreme subjective Bayesian, recall -- can say that only by being rational can you have a credence function that is immodest -- where a credence function is immodest if it uniquely maximizes expected epistemic utility from its own point of view. This is because Horowitz, like others in the epistemic utility theory literature, assume that epistemic utility is measured by strictly proper measures, so that, every probabilistic credence function expects itself to be better than any alternative credence function. From this, we can conclude that, on the extreme permissivist view, rationality is sufficient for immodesty. It's trickier to show that it is also necessary, since it isn't clear what we mean by the expected epistemic utility of a credence function from the point of view of a non-probabilistic credence function -- the usual definitions of expectation make sense only for probabilistic credence functions. Fortunately, however, we don't have to clarify this much. We need only say that, at the very least, if one credence function is epistemically better than another at all possible worlds -- that is, in decision theory parlance, the first dominates the second -- then any credence function, probabilistic or not, will expect the first to be better than the second. We then combine this with the result that, if epistemic utility is measured by a stricty proper measure, then, for each non-probabilistic credence function, there is a probabilistic credence function that dominates it, while for each probabilistic credence function, there is no such dominator (this result traces back to Savage's 1971 paper; Predd, et al. give the proof in detail when the measure is additive; I then generalised it to remove the additivity assumption). This then shows that being rational is necessary for being immodest. So, according to Horowitz's answer on behalf of the extreme permissivist, being rational is good and being irrational is bad because being rational is necessary and sufficient for being immodest; and it's good to be immodest and bad to be modest.

On the other hand, the impermissivist can say that, by being rational, you are maximizing expected accuracy from the point of view of the one true rational credence function. That's their answer to the value question, according to Horowitz.

We'll return to the question of whether these answers are satisfying below. But first I want to turn to Horowitz's claim that the moderate Bayesian cannot give a satisfactory answer. I'll argue that, if the two answers just given on behalf of the extreme permissivist and extreme impermissivist are satisfactory, then there is a satisfactory answer that the moderate permissivist can give. Then I'll argue that, in fact, these answers aren't very satisfying. And I'll finish by sketching my preferred answer on behalf of the moderate permissivist. This is inspired by William James' account of epistemic risks in The Will to Believe, which leads me to discuss another Horowitz paper.

Horowitz's strategy is to show that the moderate permissivist cannot find a good epistemic feature of credence functions that belongs to all that they count as rational, but does not belong to any they count as irrational. The extreme permissivist can point to immodesty; the extreme impermissivist can point to maximising expected epistemic utility from the point of view of the sole rational credence function. But, for the moderate, there's nothing. Or so Horowitz argues.

For instance, Horowitz initially considers the suggestion that rational credence functions guarantee you a minimum amount of epistemic utility. As she notes, the problem with this is that either it leads to impermissivism, or it fails to include all and only the credence functions the moderate considers rational. Let's focus on the case in which we have opinions about a proposition and its negation -- the point generalizes to richer sets of propositions. We'll represent the credence functions as pairs $(c(X), c(\overline{X}))$. And let's measure epistemic utility using the Brier score. So, when $X$ is true, the epistemic utility of $(x, y)$ is $-(1-x)^2 - y^2$, and when $X$ is false, it is $-x^2 - (1-y)^2$. Then, for $r > -0.25$, there is no credence function that guarantees you at least epistemic value $-0.25$ -- if you have at least that epistemic value at one world, you have less than that epistemic value at a different world. For $r = 0.25$, there is exactly one credence function that guarantees you at least epistemic value $-0.25$ -- it is the uniform credence function $(0.5, 0.5)$. And for $r < -0.25$, there are both probabilistic and non-probabilistic credence functions that guarantee you at least epistemic utility $r$. So, Horowitz concludes, a certain level of guaranteed epistemic utility can't be what separates the rational from the irrational for the moderate permissivist, since for any level, either no credence function guarantees it, exactly one does, or there are both credence functions the moderate considers rational and credence functions they consider irrational that guarantee it.

She identifies a similar problem if we think not about guaranteed accuracy but about expected accuracy. Suppose, as the moderate permissivist urges, that some but not all probability functions are rationally permissible. Then for many rational credence functions, there will be irrational ones that they expect to be better than they expect some rational credence functions to be. Horowitz gives the example of a case in which the rational credence in $X$ is between 0.6 and 0.8 inclusive. Then someone with credence 0.8 will expect the irrational credence 0.81 to be better than it expects the rational credence 0.7 to be -- at least according to many many strictly proper measures of epistemic utility. So, Horowitz concludes, whatever separates the rational from the irrational, it cannot be considerations of expected epistemic utility.

I'd like to argue that, in fact, Horowitz should be happy with appeals to guaranteed or expected epistemic utility. Let's take guaranteed utility first. All that the moderate permissivist needs to say to answer the value question is that there are two valuable things that you obtain by being rational: immodesty and a guaranteed level of epistemic utility. Immodesty rules out all non-probabilistic credence functions, while the guaranteed level of epistemic utility narrows further -- how narrow depends on how much epistemic utility you wish to guarantee. So, for instance, suppose we say that the rational credence functions are exactly those $(x, 1-x)$ with $0.4 \leq x \leq 0.6$. Then each is immodest. And each has a guaranteed epistemic utility of at least $-(1-0.4)^2 - 0.6^2 = -0.72$. If Horowitz is satisfied with the immodesty answer to the value question when the extreme permissivist gives it, I think she should also be satisfied with it when the moderate permissivist combines it with a requirement not to risk certain low epistemic utilities (in this case, utilities below $-0.72$). And this combination of principles rules in all of the credence functions that the moderate counts as rational and rules out all they count as irrational.

Next, let's think about expected epistemic utility. Suppose that the set of credence functions that the moderate permissivist counts as rational is a closed convex set. For instance, perhaps the set of rational credence function is $$R = \{c : \{X, \overline{X}\} \rightarrow [0, 1] : 0.6 \leq c(X) \leq 0.8\ \&\ c(\overline{X}) = 1- c(X)\}$$ Then we can prove the following: if a credence function is not in $R$, then there is $c^*$ in $R$ such that each $p$ in $R$ expects $c^*$ to be better than it expects $c$ to be (for the proof strategy, see Section 3.2 here, but replace the possible chance functions with the rational credence functions). Thus, just as the extreme impermissivist answers the value question by saying that, if you're irrational, there's a credence function the unique rational credence function prefers to yours, while if you're rational, there isn't, the moderate permissivist can say that, if you're irrational, there is a credence function that all the rational credence functions prefer to yours, while if you're rational, there isn't. 

Of course, you might think that it is still a problem for moderate permissivists that there are rational credence functions that expect some irrational credence functions to be better than some alternative rational ones. But I don't think Horowitz will have this worry. After all, the same problem affects extreme permissivism, and she doesn't take issue with this -- at least, not in the paper we're considering. For any two probabilistic credence functions $p_1$ and $p_2$, there will be some non-probabilistic credence function $p'_1$ that $p_1$ will expect to be better than it expects $p_2$ to be -- $p'_1$ is just a very slight perturbation of $p_1$ that makes it incoherent; a perturbation small enough to ensure it lies closer to $p_1$ than $p_2$ does.

A different worry about the account of the value of rationality that I have just offered on behalf of the moderate permissivist is that it seems to do no more than push the problem back a step. It says that all irrational credence functions have a flaw that all rational credence functions lack. The flaw is this: there is an alternative preferred by all rational credence functions. But to assume that this is indeed a flaw seems to presuppose that we should care how rational credence functions evaluate themselves and other credence functions. But isn't the reason for caring what they say exactly what we have been asking for? Isn't the person who posed the value question in the first place simply going to respond: OK, but what's so great about all the rational credence functions expecting something else to be better, when the question on the table is exactly why rational credence functions are so good?

This is a powerful objection, but note that it applies equally well to Horowitz's response to the value question on behalf of the impermissivist. There, she claims that what is good about being rational is that you thereby maximise expected accuracy from the point of view of the unique rational credence function. But without an account of what's so good about being rational, I think we equally lack an account of what's so good about maximizing expected accuracy from the point of view of the rational credence functions.

So, in the end, I think Horowitz's answer to the value question on behalf of the impermissivist and my proposed expected epistemic utility answer on behalf of the moderate permissivist are ultimately unsatisfying.

What's more, Horowitz's answer on behalf of the extreme permissivist is also a little unsatisfying. The answer turns on the claim that immodesty is a virtue, together with the fact that precisely those credence functions identified as rational by subjective Bayesianism have that virtue. But is it a virtue? Just as arrogance in a person might seem excusable if they genuinely are very competent, but not if they are incompetent, so immodesty in a credence function only seems virtuous if the credence function itself is good. If the credence function is bad, then evaluating itself as uniquely the best seems just another vice to add to its collection. 

So I think Horowitz's answer to the value question on behalf of the extreme permissivist is a little unsatisfactory. But it lies very close to an answer I find compelling. That answer appeals not to immodesty, but to non-dominance. Having a credence function that is dominated is bad. It leaves free epistemic utility on the table in just the same way that a dominated action in practical decision theory leaves free pragmatic utility on the table. For the extreme permissivist, what is valuable about rationality is that it ensures that you don't suffer from this flaw. 

One noteworthy feature of this answer is the conception of rationality to which it appeals. On this conception, the value of rationality does not derive fundamentally from the possession of a positive feature, but from the lack of a negative feature. Ultimately, the primary notion here is irrationality. A credence function is irrational if it exhibits certain flaws, which are spelled out in terms of its success in the pursuit of epistemic utility. You are rational if you are free of these flaws. Thus, for the extreme permissivist, there is just one such flaw -- being dominated. So the rational credences are simply those that lack that flaw -- and the maths tells us that those are precisely the probabilistic credence functions.

We can retain this conception of rationality, motivate moderate permissivism, and answer the value question for it. In fact, there are at least two ways to do this. We have met something very close to one of these ways when we tried to rehabilitate the moderate permissivist's appeal to guaranteed epistemic utility above. There, we said that what makes rationality good is that it ensures that you are immodest and also ensures a certain guaranteed level of accuracy. But, a few paragraphs back, we argued that immodesty is no virtue. So that answer can't be quite right. But we can replace the appeal to immodesty with an appeal to non-dominance, and then the answer will be more satisfying. Thus, the moderate permissivist who says that the rational credence functions are exactly those $(x, 1-x)$ with $0.4 \leq x \leq 0.6$ can say that being rational is valuable for two reasons: (i) if you're rational, you aren't dominated; (ii) if you're rational you are guaranteed to have epistemic utility at least $-0.72$; (iii) only if you are rational will (i) and (ii) both hold. This answers the value question by appealing to how well credence functions promote epistemic utility, and it separates out the rational from the irrational precisely.

To explain the second way we might do this, we invoke William James. Famously, in The Will to Believe, James said that we have two goals when we believe: to believe truth, and to avoid error. But these pull in different directions. If we pursue the first by believing something, we open ourselves up to the possibility of error. If we pursue the second by suspending judgment on something, we foreclose the possibility of believing the truth about it. Thus, to govern our epistemic life, we must balance these two goals. James held that how we do this is a subjective matter of personal judgment, and a number of different ways of weighing them are permissible. Thomas Kelly has argued that this can motivate permissivism in the case of full beliefs. Suppose the epistemic utility you assign to getting things right -- that is, believing truths and disbelieving falsehoods -- is $R > 0$. And suppose you assign epistemic utility $-W < 0$ to getting things wrong -- that is, disbelieving truths and believing falsehoods. And suppose you assign $0$ to suspending judgment. And suppose $W > R$. Then, as Kenny Easwaran and Kevin Dorst have independently pointed out, if $r$ is the evidential probability of $X$, believing $X$ maximises expected epistemic utility from its point of view iff $\frac{W}{R + W} \leq r$, while suspending on $X$ maximises expected epistemic utility iff $\frac{R}{W+R} \leq r \leq \frac{W}{R+W}$. If William James is right, different values for $R$ and $W$ are permissible. The more you value believing truths, the greater will be $R$. The more you value avoiding falsehoods, the greater will be $W$ (and the lower will be $-W$). Thus, there will be a possible evidential probability $r$ for $X$, as well as permissible values $R$, $R'$ for getting things right and permissible values $W$, $W'$ for getting things wrong such that $$\frac{W}{R+W} < r < \frac{W'}{R'+W'}$$So, for someone with epistemic utilities characterised by $R$, $W$, it is rational to suspend judgment on $X$, while for someone with $W'$, $R'$, it is rational to believe $X$. Hence, permissivism about full beliefs.

As Horowitz points out, however, the same trick won't work for credences. After all, as we've seen, all legitimate measures of epistemic utility for credences are strictly proper measures. And thus, if $r$ is the evidential probability of $X$, then credence $r$ in $X$ uniquely maximises expected epistemic utility relative to any one of those measures. So, a Jamesian permissivism about measures of epistemic value gives permissivism about doxastic states in the case of full belief, but not in the case of credence.

Nonetheless, I think we can derive permissivism about credences from James' insight. The key is to encode our attitudes towards James' two great goals for belief not in our epistemic utilities but in the rule we adopt when we use those epistemic utilities to pick our credences. Here's one suggestion, which I pursued at greater length in this paper a few years ago, and that I generalised in some blog posts over the summer -- I won't actually present the generalization here, since it's not required to make the basic point. James recognised that, by giving yourself the opportunity to be right about something, you thereby run the risk of being wrong. In the credal case, by giving yourself the opportunity to be very accurate about something, you thereby run the risk of being very inaccurate. In the full belief case, to avoid that risk completely, you must never commit on anything. It was precisely this terror of being wrong that he lamented in Clifford. By ensuring he could never be wrong, there were true beliefs to which Clifford closed himself off. James believed that the extent to which you are prepared to take these epistemic risks is a passional matter -- that is, a matter of subjective preference. We might formalize it using a decision rule called the Hurwicz criterion. This rule was developed by Leonid Hurwicz for situations in which no probabilities are not available to guide our decisions, so it is ideally suited for the situation in which we must pick our prior credences. 

Maximin is the rule that says you should pay attention only to the worst-case scenario and choose a credence function that does best there -- you should maximise your minimum possible utility. Maximax is the rule that says you should pay attention only to the best-case scenario and choose a credence function that does best there -- you should maximise your maximum possible utility. The former is maximally risk averse, the latter maximally risk seeking. As I showed here, if you measure epistemic utility in a standard way, maximin demands that you adopt the uniform credence function -- its worst case is best. And almost however you measure epistemic utility, maximax demands that you pick a possible world and assign maximal credence to all propositions that are true there and minimal credence to all propositions that are false there -- its best case, which obviously occurs at the world you picked, is best, because it is perfect there. 

The Hurwicz criterion is a continuum of decision rules with maximin at one end and maximax at the other. You pick a weighting $0 \leq \lambda \leq 1$ that measures how risk-seeking you are and you define the Hurwicz score of an option $a$, with utility $a(w)$ at world $w$, to be$$H^\lambda(a) = \lambda \max \{a(w) : w \in W\} + (1-\lambda) \min \{a(w) : w \in W\}$$And you pick an option with the highest Hurwicz score.

Let's see how this works out in the simplest case, namely, that in which you have credences only in $X$ and $\overline{X}$. As before, we write credence functions defined on these two propositions as $(c(X), c(\overline{X})$. Then, if $\lambda \leq \frac{1}{2}$ --- that is, if you give at least as much weight to the worst case as to the best case --- then the uniform distribution $(\frac{1}{2}, \frac{1}{2})$ maximises the Hurwicz score relative to any strictly proper measure. And if $\lambda > \frac{1}{2}$ --- that is, if you are risk seeking and give more weight to the best case than the worst --- then $(\lambda, 1 - \lambda)$ and $(1-\lambda, \lambda)$ both maximise the Hurwicz score.

Now, if any $0 \leq \lambda \leq 1$ is permissible, then so is any credence function $(x, 1-x)$, and we get extreme permissivism. But I think we're inclined to say that there are extreme attitudes to risk that are not rationally permissible, just as there are preferences relating the scratching of one's finger and the destruction of the world that are not rationally permissible. I think we're inclined to think there is some range from $a$ to $b$ with $0 \leq a < b \leq 1$ such that the only rational attitudes to risk are precisely those encoded by the Hurwicz weights that lie between $a$ and $b$. If that's the case, we obtain moderate permissivism.

To be a bit more precise, this gives us both moderate interpersonal and intrapersonal permissivism. It gives us moderate interpersonal permissivism if $\frac{1}{2} < b < 1$ -- that is, if we are permitted to give more than half our weight to the best case epistemic utility. For then, since $a < b$, there is $b'$ such that $\frac{1}{2} < b' < b$, and then both $(b, 1-b)$ and $(b', 1-b')$ are both rationally permissible. But there is also $b < b'' < 1$, and for any such $b''$, $(b'', 1-b'')$ is not rationally permissible. It also gives us moderate intrapersonal permissivism under the same condition. For if $\frac{1}{2} < b$ and $b$ is your Hurwicz weight, then for you, both $(b, 1-b)$ and $(1-b, b)$ are different, but both are rationally permissible.

How does this motivation for moderate permissivism fare with respect to the value question? I think it fares as well as the non-dominance-based answer I sketched above for the extreme permissivist. There, I appealed to a single flaw that a credence function might have: it might be dominated by another. Here, I introduced another flaw. It might be rationalised only by Jamesian attitudes to epistemic risk that are too extreme or otherwise beyond the pale. Like being dominated, this is a flaw that relates to the pursuit of epistemic utility. If you exhibit it, you are irrational. And to be rational is to be free of such flaws. The moderate permissivist can thereby answer the value question that Horowitz poses.

Tuesday, 15 December 2020

Deferring to rationality -- does it preclude permissivism?

Permissivism about epistemic rationality is the view that there are bodies of evidence in response to which rationality permits a number of different doxastic attitudes. I'll be thinking here about the case of credences. Credal permissivism says: there are bodies of evidence in response to which rationality permits a number of different credence functions.

Over the past year, I've watched friends on social media adopt remarkably different credence functions based on the same information about aspects of the COVID-19 pandemic, the outcome of the US election, and the withdrawal of the UK from European Union. And while I watch them scream at each other, cajole each other, and sometimes simply ignore each other, I can't shake the feeling that they are all taking rational stances. While they disagree dramatically, and while some will end up closer to the truth than others when it is finally revealed, it seems to me that all are responding rationally to their shared evidence, their opponents' protestations to the contrary. So permissivism is a very timely epistemic puzzle for 2020. What's more, this wonderful piece by Rachel Fraser made me see how my own William James-inspired approach to epistemology connects with a central motivation for believing in conspiracy theories, another major theme of this unloveable year.

One type of argument against credal permissivism turns on the claim that rationality is worthy of deference. The argument begins with a precise version of this claim, stated as a norm that governs credences. It proceeds by showing that, if epistemic rationality is permissive, then it is sometimes impossible to meet the demands of this norm. Taking this to be a reductio, the argument concludes that rationality cannot be permissive. I know of two versions of the argument, one due to Daniel Greco and Brian Hedden, and one due to Ben Levinstein. I'll mainly consider Levinstein's, since it fixes some problems with Greco and Hedden's. I'll consider David Thorstad's response to Greco and Hedden's argument, which would also work against Levinstein's argument were it to work at all. But I'll conclude that, while it provides a crucial insight, it doesn't quite work, and I'll offer my own alternative response.

Roughly speaking, you defer to someone on an issue if, upon learning their attitude to that issue, you adopt it as your own. So, for instance, if you ask me what I'd like to eat for dinner tonight, and I say that I defer to you on that issue, I'm saying that I will want to eat whatever I learn you would like to eat. That's a case of deferring to someone else's preferences---it's a case where we defer conatively to them. Here, we are interested in cases in which we defer to someone else's beliefs---that is, where we defer doxastically to them. Thus, I defer doxastically to my radiographer on the issue of whether I've got a broken finger if I commit to adopting whatever credence they announce in that diagnosis. By analogy, we sometimes say that we defer doxastically to a feature of the world if we commit to setting our credence in some way that is determined by that feature of the world. Thus, I might defer doxastically to a particular computer simulation model of sea level change on the issue of sea level rise by 2030 if I commit to setting my credence in a rise of 10cm to whatever probability that model reports when I run it repeatedly while perturbing its parameters and initial conditions slightly around my best estimate of their true values.

In philosophy, there are a handful of well-known theses that turn on the claim that we are required to defer doxastically to this individual or that feature of the world---and we're required to do it on all matters. For instance, van Fraassen's Reflection Principle says that you should defer doxastically to your future self on all matters. That is, for any proposition $X$, conditional on your future self having credence $r$ in $X$, you should have credence $r$ in $X$. In symbols:$$c(X\, |\, \text{my credence in $X$ at future time $t$ is $r$}) = r$$And the Principal Principle says that you should defer to the objective chances on all doxastic matters by setting your credences to match the probabilities that they report. That is, for any proposition $X$, conditional on the objective chance of $X$ being $r$, you should have credence $r$ in $X$. In symbols:$$c(X\, |\, \text{the objective chance of $X$ now is $r$}) = r$$Notice that, in both cases, there is a single expert value to which you defer on the matter in question. At time $t$, you have exactly one credence in $X$, and the Reflection Principle says that, upon learning that single value, you should set your credence in $X$ to it. And there is exactly one objective chance of $X$ now, and the Principal Principle says that, upon learning it, you should set your credence in $X$ equal to it. You might be uncertain about what that single value is, but it is fixed and unique. So this account of deference does not cover cases in which there is more than one expert. For instance, it doesn't obviously apply if I defer not to a specific climate model, but to a group of them. In those cases, there is usually no fixed, unique value that is the credence they all assign to a proposition. So principles of the same form as the Reflection or Principal Principle do not say what to do if you learn one of those values, or some of them, or all of them. This problem lies at the heart of the deference argument against permissivism. Those who make the argument think that deference to groups should work in one way; those who defend permissivism against it think it should work in some different way.

As I mentioned above, the deference argument begins with a specific, precise norm that is said to govern the deference we should show to rationality. The argument continues by claiming that, if rationality is permissive, then it is not possible to satisfy this norm. Here is the norm as Levinstein states it, where $c \in R_E$ means that $c$ is in the set $R_E$ of rational responses to evidence $E$:

Deference to Rationality Suppose:

  1. $c$ is your credence function;
  2. $E$ is your total evidence;
  3. $c(c \in R_E) = 0$;
  4. $c'$ is a probabilistic credence function;
  5. $c(c' \in R_E) > 0$;

then rationality requires$$c(-|c' \in R_E) = c'(-|c' \in R_E)$$That is, if you are certain that your credence function is not a rational response to your total evidence, then, conditional on some alternative probabilistic credence function being a rational response to that evidence, you should set your credences in line with that alternative once you've brought it up to speed with your new evidence that it is a rational response to your original total evidence.

Notice, first, that Levinstein's principle is quite weak. It does not say of just anyone that they should defer to rationality. It says only that, if you are in the dire situation of being certain that you are yourself irrational, then you should defer to rationality. If you are sure you're irrational, then your conditional credences should be such that, were you to learn of a credence function that it's a rational response to your evidence, you should fall in line with the credences that it assigns conditional on that same assumption that it is rational. Restricting its scope in this way makes it more palatable to permissivists who will typically not think that someone who is already pretty sure that they are rational must switch credences when they learn that there are alternative rational responses out there.

Notice also that you need only show such deference to rational credence functions that satisfy the probability axioms. This restriction is essential, for otherwise (DtR) will force you to violate the probability axioms yourself. After all, if $c(-)$ is probabilistic, then so is $c(-|X)$ for any $X$ with $c(X) > 0$. Thus, if $c'(-|c' \in R_E)$ is not probabilistic, and $c$ defers to $c'$ in the way Levinstein describes, then $c(-|c' \in R_E)$ is not probabilistic, and thus neither is $c$.

Now, suppose:

  • $c$ is your credence function;
  • $E$ is your total evidence;
  • $c'$ and $c''$ are probabilistic credence functions with$$c'(-|c' \in R_E\ \&\ c'' \in R_E) \neq c''(-|c' \in R_E\ \&\ c'' \in R_E)$$That is, $c'$ and $c''$ are distinct and remain distinct even once they become aware that both are rational responses to $E$;
  • $c(c' \in R_E\ \&\ c'' \in R_E) > 0$. That is, you give some credence to both of them being rational responses to $E$;
  • $c(c \in R_E) = 0$. That is, you are certain that your own credence function is not a rational response to $E$.

Then, by (DtR),

  • $c(-|c' \in R_E) = c'(-|c' \in R_E)$
  • $c(-|c'' \in R_E) = c''(-|c'' \in R_E)$ 

Thus, conditioning both sides of the first identity on $c'' \in R_E$ and both sides of the second identity on $c' \in R_E$, we obtain

  • $c(-|c' \in R_E\ \&\ c'' \in R_E) = c'(-|c' \in R_E\ \&\ c'' \in R_E)$ 
  • $c(-|c'' \in R_E\ \&\ c' \in R_E) = c''(-|c' \in R_E\ \&\ c'' \in R_E)$

But, by assumption, $c'(-| c' \in R_E\ \&\ c'' \in R_E) \neq c''(-|c' \in R_E\ \&\ c'' \in R_E)$. So (DtR) cannot be satisfied.

One thing to note about this argument: if it works, it establishes not only that there can be no two different rational responses to the same evidence, but that it is irrational to be anything less than certain of this. After all, what is required to derive the contradiction from DtR is not that there are two probabilistic credence functions $c'$ and $c''$ such that $c'(-|c' \in R_E\ \&\ c'' \in R_E) \neq c''(-|c' \in R_E\ \&\ c'' \in R_E)$ that are both rational responses to $E$. Rather, what is required is only that there are two probabilistic credence functions $c'$ and $c''$ with $c'(-|c' \in R_E\ \&\ c'' \in R_E) \neq c''(-|c' \in R_E\ \&\ c'' \in R_E)$ that you think might both be rational responses to $E$---that is, $c(c' \in R_E\ \&\ c'' \in R_E) > 0$. The conclusion that it is irrational to even entertain permissivism strikes me as too strong, but perhaps those who reject permissivism will be happy to accept it.

Let's turn, then, to a more substantial worry, given compelling voice by David Thorstad: (DtR) is too strong because the deontic modality that features in it is too strong. As I hinted above, the point is that the form of the deference principles that Greco & Hedden and Levinstein use is borrowed from cases---such as the Reflection Principle and the Principal Principle---in which there is just one expert value, though it might be unknown to you. In those cases, it is appropriate to say that, upon learning the single value and nothing more, you are required to set your credence in line with it. But, unless we simply beg the question against permissivism and assume there is a single rational response to every body of evidence, this isn't our situation. Rather, it's more like the case where you defer to a group of experts, such as a group of climate models. And in this case, Thorstad says, it is inappropriate to demand that you set your credence in line with an expert's credence when you learn what it is. Rather, it is at most appropriate to permit you to do that. That is, Levinstein's principle should not say that rationality requires your credence function to assign the conditional credences stated in its consequent; it should say instead that rationality allows it.

Thorstad motivates his claim by drawing an analogy with a moral case that he describes. Suppose you see two people drowning. They're called John and James, and you know that you will be able to save at most one. So the actions available to you are: save John, save James, save neither. And the moral actions are: save John, save James. But now consider a deference principle governing this situation that is analogous to (DtR): it demands that, upon learning that it is moral to save James, you must do that; and upon learning that it is moral to save John, you must do that. From this, we can derive a contradiction in a manner somewhat analogous to that in which we derived the contradiction from (DtR) above: if you learn both that it is moral to save John and moral to save James, you should do both; but that isn't an available action; so moral permissivism must be false. But I take it no moral theory will tolerate that in this case. So, Thorstad argues, there must be something wrong with the moral deference principle; and, by analogy, there must be something wrong with the analogous doxastic principle (DtR).

Thorstad's diagnosis is this: the correct deference principle in the moral case should say: upon learning that it is moral to save James, you may do that; upon learning that it is moral to save John, you may do that. You thereby avoid the contradiction, and moral permissivism is safe. Similarly, the correct doxastic deference principle is this: upon learning that a credence function is rational, it is permissible to defer to it. In Levinstein's framework, the following is rationally permissible, not rationally mandated:$$c(-|c' \in R_E) = c'(-|c' \in R_E)$$

I think Thorstad's example is extremely illuminating, but for reasons rather different from his. Recall that a crucial feature of Levinstein's version of the deference argument against permissivism is that it applies only to people who are certain that their current credences are irrational. If we add the analogous assumption to Thorstad's case, his verdict is less compelling. Suppose, for instance, you are currently committed to saving neither John nor James from drowning; that's what you plan to do; it's the action you have formed an intention to perform. What's more, you're certain that this action is not moral. But you're uncertain whether either of the other two available actions are moral. And let's add a further twist to drive home the point. Suppose, furthermore, that you are certain that you are just about learn, of exactly one of them, that it is permissible. And add to that the fact that, immediately after you learn, of exactly one of them, that it is moral, you must act---failing to do so will leave both John and James to drown. In this case, I think, it's quite reasonable to say that, upon learning that saving James is permissible, you are not only morally permitted to drop your intention to save neither and replace it with the intention to save James, but you are also morally required to do so; and the same should you learn that it is permissible to save John. It would, I think, be impermissible to save neither, since you're certain that's immoral and you know of an alternative that is moral; and it would be impermissible to save John, since you are still uncertain about the moral status of that action, while you are certain that saving James is moral; and it would be morally required to save James, since you are certain of that action alone that it is moral. Now, Levinstein's principle might seem to holds for individuals in an analogous situation. Suppose you're certain that your current credences are irrational. And suppose you will learn of only one credence function that it is rationally permissible. At least in this situation, it might seem that it is rationally required that you adopt the credence function you learn is rationally permissible, just as you are morally required to perform the single act you learn is moral. So, is Levinstein's argument rehabilitated?

I think not. Thorstad's example is useful, but not because the case of rationality and morality are analogous; rather, precisely because it draws attention to the fact that they are disanalogous. After all, all moral actions are better than all immoral ones. So, if you are committed to an action you know is immoral, and you learn of another that it is moral, and you know you'll learn nothing more about morality, you must commit to perform the action you've learned is moral. Doing so is the only way you know how to improve the action you'll perform for sure. But this is not the case for rational attitudes. It is not the case that all rational attitudes are better than all irrational attitudes. Let's see a few examples.

Suppose my preferences over a set of acts $a_1, \ldots, a_N$ are as follows, where $N$ is some very large number:$$a_1 \prec a_2 \prec a_3 \prec \ldots \prec a_{N-3} \prec a_{N-2} \prec a_{N-1} \prec a_N \prec a_{N-2}$$This is irrational, because, if the ordering is irreflexive, then it is not transitive: $a_{N-2} \prec a_{N-1} \prec a_N \prec a_{N-2}$, but $a_{N-2} \not \prec a_{N-2}$. And suppose I learn that the following preferences are rational:$$a_1 \succ a_2 \succ a_3 \succ \ldots \succ a_{N-3} \succ a_{N-2} \succ a_{N-1} \succ a_N$$Then surely it is not rationally required of me to adopt these alternative preferences. (Indeed, it seems to me that rationality might even prohibit me from transitioning from the first irrational set to the second rational set, but I don't need that stronger claim.) In the end, my original preferences are irrational because of a small, localised flaw. But they nonetheless express coherent opinions about a lot of comparisons. And, concerning all of those comparisons, the alternative preferences take exactly the opposite view. Moving to the latter in order to avoid having preferences that are flawed in the way that the original set are flawed does not seem rationally required, and indeed might seem irrational.

Something similar happens in the credal case, at least according to the accuracy-first epistemologist. Suppose I have credence $0.1$ in $X$ and $1$ in $\overline{X}$. And suppose the single legitimate measure of inaccuracy is the Brier score. I don't know this, but I do know a few things: first, I know that accuracy is the only fundamental epistemic value, and I know that a credence function's accuracy scores at different possible worlds determine its rationality at this world; furthermore, I know that my credences are accuracy dominated and therefore irrational, but I don't know what dominates them. Now suppose I learn that the following credences are rational: $0.95$ in $X$ and $0.05$ in $\overline{X}$. It seems that I am not required to adopt these credences (and, again, it seems that I am not even rationally permitted to do so, though again this latter claim is stronger than I need). While my old credences are irrational, they do nonetheless encode something like a point of view. And, from that point of view, the alternative credences look much much worse than staying put. While I know that mine are irrational and accuracy dominated, though I don't know what by, I also know that, from my current, slightly incoherent point of view, the rational ones look a lot less accurate than mine. And indeed they will be much less accurate than mine if $X$ turns out to be false.

So, even in the situation in which Levinstein's principle is most compelling, namely, when you are certain you're irrational and you will learn of only one credence function that it is rational, still it doesn't hold. It is possible to be sure that your credence function is an irrational response to your evidence, sure that an alternative is a rational response, and yet not be required to adopt the alternative because learning that the alternative is rational does not teach you that it's better than your current irrational credence function for sure---it might be much worse. This is different from the moral case. So, as stated, Levinstein's principle is false.

However, to make the deference argument work, Levinstein's principle need only hold in a single case. Levinstein describes a family of cases---those in which you're certain you're irrational---and claims that it holds in all of those. Thorstad's objection shows that it doesn't. Responding on Levinstein's behalf, I narrowed the family of cases to avoid Thorstad's objection---perhaps Levinstein's principle holds when you're certain you're irrational and know you'll only learn of one credence function that it's rational. After all, the analogous moral principle holds in those cases. But we've just seen that the doxastic version doesn't always hold there, because learning that an alternative credence function is rational does not teach you that it is better than your irrational credence function in the way that learning an act is moral teaches you that it's better than the immoral act you intend to perform. But perhaps we can narrow the range of cases yet further to find one in which the principle does hold.

Suppose, for instance, you are certain you're irrational, you know you'll learn of just one credence function that it's rational, and moreover you know you'll learn that it is better than yours. Thus, in the accuracy-first framework, suppose you'll learn that it accuracy dominates you. Then surely Levinstein's principle holds here? And this would be sufficient for Levinstein's argument, since each non-probabilistic credence function is accuracy dominated by many different probabilistic credence functions; so we could find the distinct $c'$ and $c''$ we need for the reductio.

Not so fast, I think. How you should respond when you learn that $c'$ is rational depends on what else you think about what determines the rationality of a credence function. Suppose, for instance, you think that a credence function is rational just in case it is not accuracy dominated, but you don't know which are the legitimate measures of accuracy. Perhaps you think there is only one legitimate measure of accuracy, and you know it's either the Brier score---$\mathfrak{B}(c, i) = \sum_{X \in \mathcal{F}} |w_i(X) - c(X)|^2$---or the absolute value score---$\mathfrak{A}(c, i) = \sum_{X \in \mathcal{F}} |w_i(X) - c(X)|^2$---but you don't know which. And suppose your credence function is $c(X) = 0.1$ and $c(\overline{X}) = 1$, as above. Now you learn that $c'(X) = 0.05$ and $c'(\overline{X}) = 0.95$ is rational and an accuracy dominator. So you learn that $c'$ is more accurate than $c$ at all worlds, and, since $c'$ is rational, there is nothing that is more accurate than $c'$ at all worlds. Then you thereby learn that the Brier score is the only legitimate measure of accuracy. After all, according to the absolute value score, $c'$ does not accuracy dominate $c$; in fact, $c$ and $c'$ have exactly the same absolute value score at both worlds. You thereby learn that the credence functions that accuracy dominate you without themselves being accuracy dominated are those for which $c(X)$ lies strictly between the solution of $(1-x)^2 + (1-x)^2 = (1-0.05)^2 + (0-1)^2$ that lies in $[0, 1]$ and the solution of $(0-x)^2 + (1-(1-x))^2 = (0-0.05)^2 + (1-1)^2$ that lies in $[0, 1]$, and $c(\overline{X}) = 1 - c(X)$. You are then permitted to pick any one of them---they are all guaranteed to be better than yours. You are not obliged to pick $c'$ itself.

The crucial point is this: learning that $c'$ is rational teaches you something about the features of a credence function that determine whether it is rational---it teaches you that they render $c'$ rational! And that teaches you a bit about the set of rational credence functions---you learn it contains $c'$, of course, but you also learn other normative facts, such as the correct measure of inaccuracy, perhaps, or the correct decision principle to apply with the correct measure of inaccuracy to identify the rational credence functions. And learning those things may well shift your current credences, but you are not compelled to adopt $c'$.

Indeed, you might be compelled to adopt something other than $c'$. An example: suppose that, instead of learning that $c'$ is rational and accuracy dominates $c$, you learn that $c''$ is rational and accuracy dominates $c$, where $c''$ is a probability function that Brier dominates $c$, and $c'' \neq c'$. Then, as before, you learn that the Brier score and not the absolute value score is the correct measure of inaccuracy, and thereby learn the set of credence functions that accuracy dominates yours. Perhaps rationality then requires you to fix up your credence function so that it is rational, but in a way that minimizes the amount by which you change your current credences. How to measure this? Well, perhaps you're required to pick an undominated dominator $c^*$ such that the expected inaccuracy of $c$ from the point of view of $c^*$ is minimal. That is, you pick the credence function that dominates you and isn't itself dominated and which thinks most highly of your original credence function. Measuring accuracy using the Brier score, this turns out to be the credence function $c'$ described above. Thus, given this reasonable account of how to respond when you learn what the rational credence functions are, upon learning that $c''$ is rational, rationality then requires you to adopt $c'$.

In sum: For someone certain their credence function $c$ is irrational, learning only that $c'$ is rational is not enough to compel them to move to $c'$, nor indeed to change their credences at all, since they've no guarantee that doing so will improve their situation. To compel them to change their credences, you must teach them how to improve their epistemic situation. But when you teach them that doing a particular thing will improve their epistemic situation, that usually teaches them normative facts of which they were uncertain before---how to measure epistemic value, or the principles for choosing credences once you've fixed how to measure epistemic value---and doing that will typically teach them other ways to improve their epistemic situation besides the one you've explicitly taught them. Sometimes there will be nothing to tell between all the ways they've learned to improve their epistemic situation, and so all will be permissible, as Thorstad imagines; and sometimes there will be reason to pick just one of those ways, and so that will be mandated, even if epistemic rationality is permissive. In either case, Levinstein's argument does not go through. The deference principle on which it is based is not true.

Thursday, 3 September 2020

Accuracy and Explanation in a Social Setting: thoughts on Douven and Wenmackers

For a PDF version of this post, see here.

In this post, I want to continue my discussion of the part of van Fraassen's argument against inference to the best explanation (IBE) that turns on its alleged clash with Bayesian Conditionalization (BC). In the previous post, I looked at Igor Douven's argument that there are at least some ways of valuing accuracy on which updating by IBE comes out better than BC. I concluded that Douven's arguments don't save IBE; BC is still the only rational way to update.

The setting for Douven's arguments was individualist epistemology. That is, he considered only the single agent collecting evidence directly from the world and updating in the light of it. But of course we often receive evidence not directly from the world, but indirectly through the opinions of others. I learn how many positive SARS-CoV-2 tests there have been in my area in the past week not my inspecting the test results myself but by listening to the local health authority. In their 2017 paper, 'Inference to the Best Explanation versus Bayes’s Rule in a Social Setting', Douven joined with Sylvia Wenmackers to ask how IBE and BC fare in a context in which some of my evidence comes from the world and some from learning the opinions of others, where those others are also receiving some of their evidence from the world and some from others, and where one of those others from whom they're learning might be me. Like Douven's study of IBE vs BC in the individual setting, Douven and Wenmackers conclude in favour of IBE. Indeed, their conclusion in this case is considerably stronger than in the individual case:

The upshot will be that if agents not only update their degrees of belief on the basis of evidence, but also take into account the degrees of belief of their epistemic neighbours, then the noted advantage of Bayesian updating [from Douven's earlier paper] evaporates and IBE does better than Bayes’s rule on every reasonable understanding of inaccuracy minimization. (536-7)

As in the previous post, I want to stick up for BC. As in the individualist setting, I think this is the update rule we should use in the social setting.

Following van Fraassen's original discussion and the strategy pursued in Douven's solo piece, Douven and Wenmackers take the general and ill-specified question whether IBE is better than BC and make it precise by asking it in a very specific case. We imagine a group of individuals. Each has a coin. All coins have the same bias. No individual knows what this shared bias is, but they do know that it is the same bias for each coin, and they know that the options are given by the following bias hypotheses:

$B_0$: coin has 0% chance of landing heads

$B_1$: coin has 10% chance of landing heads


$B_9$: coin has 90% chance of landing heads

$B_{10}$: coin has 100% chance of landing heads

Though they don't say so, I think Douven and Wenmackers assume that all individuals have the same prior over $B_0, \ldots, B_{10}$, namely, the uniform prior; and each satisfies the Principal Principle, and so their credences in everything else follows from their credences in $B_0, \ldots, B_{10}$. As we'll see, we needn't assume that they all have the uniform prior over the bias hypotheses. In any case, they assume that things proceed as follows:

Step (i) Each member tosses their coin some fixed number of times. This produces their worldly evidence for this round.

Step (ii) Each then updates their credence function on this worldly evidence they've obtained. To do this, each member uses the same updating rule, either BC or a version of IBE. We'll specify these in more detail below.

Step (iii) Each then learns the updated credence functions of the others in the group. This produces their social evidence for this round.

Step (iv) They then update their own credence function by taking the average of their credence function and the other credence functions in the group that lie within a certain distance of theirs. The set of credence functions that lie within a certain distance of one's own, Douven and Wenmackers call one's bounded confidence interval.

They then repeat this cycle a number of times, each time an individual begins with the credence function they reached at the end of the previous cycle.

Douven and Wenmackers use simulation techniques to see how this group of individuals perform for different updating rules used in step (ii) and different specifications of how close a credence function must lie to yours in order to be included in the average in step (iv). Here's the class of updating rules that they consider: if $P$ is your prior and $E$ is your evidence then your updated credence function should be$$P^c_E(B_i) = \frac{P(B_i)P(E|B_i) + f_c(B_i, E)}{\sum^{10}_{k=0} \left (P(B_k)P(E|B_k) + f_c(B_k, E) \right )}$$where$$f_c(B_i, E) = \left \{ \begin{array}{ll} c & \mbox{if } P(E | B_i) > P(E | B_j) \mbox{ for all } j \neq i \\ \frac{1}{2}c & \mbox{if } P(E | B_i) = P(E|B_j) > P(E | B_k) \mbox{ for all } k \neq j, i \\  0 & \mbox{otherwise} \end{array} \right. $$That is, for $c = 0$, this update rule is just BC, while for $c > 0$, it gives a little boost to whichever hypothesis best explains the evidence $E$, where providing the best explanation for a series of coin tosses amounts to making it most likely, and if two bias hypotheses make the evidence most likely, they split the boost between them. Douven and Wenmackers consider $c = 0, 0.1, \ldots, 0.9, 1$. For each rule, specified by $c$, they also consider different sizes of bounded confidence intervals. These are specified by the parameter $\varepsilon$. Your bounded confidence interval for $\varepsilon$ includes each credence function for which the average difference between the credences it assigns and the credences you assign is at most $\varepsilon$. Thus, $\varepsilon = 0$ is the most exclusive, and includes only your own credence function, while $\varepsilon = 1$ is the most inclusive, and includes all credence functions in the group. Again, Douven and Wenmackers consider $\varepsilon = 0, 0.1, \ldots, 0.9, 1$. Here are two of their main results:

  1. For each bias other than $p = 0.1$ or $0.9$, there is an explanationist rule (i.e. $c > 0$ and some specific $\varepsilon$) that gives rise to a lower average inaccuracy at the end of the process than all BC rules (i.e. $c = 0$ and any $\varepsilon$).
  2. There is an averaging explanationist rule (i.e. $c > 0$ and $\varepsilon > 0$) such that, for each bias other than $p = 0, 0.1, 0.9, 1$, it gives rise to lower average inaccuracy than all BC rules (i.e. $c = 0$ and any $\varepsilon$).

Inaccuracy is measured by the Brier score throughout.

Now, you can ask whether these results are enough to tell so strongly in favour of IBE. But that isn't my concern here. Rather, I want to focus on a more fundamental problem: Douven and Wenmackers' argument doesn't really compare BC with IBE. They're comparing BC-for-worldly-data-plus-Averaging-for-social-data with IBE-for-worldly-data-plus-Averaging-for-social-data. So their simulation results don't really impugn BC, because the average inaccuracies that they attribute to BC don't really arise from it. They arise from using BC in step (ii), but something quite different in step (iv). Douven and Wenmackers ask the Bayesian to respond to the social evidence they receive using a non-Bayesian rule, namely, Averaging. And we can see just how far Averaging lies from BC by considering the following version of the example we have been using throughout.

Consider the biased coin case, and suppose there are just three members of the group. And suppose they all start with the uniform prior over the bias hypotheses. At step (i), they each toss their coin twice. The first individual's coin lands $HT$, the second's $HH$, and the third's $TH$. So, at step (ii), if they all use BC (i.e. $c = 0$), they update on this worldly evidence as follows, where $P$ is the shared prior:
& B_0 & B_1& B_2& B_3& B_4& B_5& B_6& B_7& B_8& B_9& B_{10} \\
&&&&&&&&&& \\
P & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} \\
&&&&&&&&&& \\
P(-|HT) & 0 & \frac{9}{165} & \frac{16}{165}& \frac{21}{165}& \frac{24}{165} & \frac{25}{165}& \frac{24}{165}& \frac{21}{165}& \frac{16}{165}& \frac{9}{165}& 0\\
&&&&&&&&&& \\
P(-|HH) & 0 &   \frac{1}{385} &  \frac{4}{385}&  \frac{9}{385}&  \frac{16}{385}&  \frac{25}{385}&  \frac{36}{385}&  \frac{49}{385}&  \frac{64}{385}&  \frac{81}{385}&  \frac{100}{385}\\
&&&&&&&&&& \\
P(-|TH) &  0 & \frac{9}{165} & \frac{16}{165}& \frac{21}{165}& \frac{24}{165} & \frac{25}{165}& \frac{24}{165}& \frac{21}{165}& \frac{16}{165}& \frac{9}{165}& 0\\
Now, at step (iii), they each learn the other's distribution. And they average on that. Let's suppose I'm the first individual. Then I have two choices for my BCI. It either includes my own credence function $P(-|HT)$ and the third individual's $P(-|TH)$, which are identical, or it includes all three, $P(-|HT), P(-|HH), P(-|TH)$. Let's suppose it includes all three. Here is the outcome of averaging:$$\begin{array}{r|ccccccccccc}
& B_0 & B_1& B_2& B_3& B_4& B_5& B_6& B_7& B_8& B_9& B_{10} \\
&&&&&&&&&& \\
\mbox{Av} & 0 & \frac{129}{3465} & \frac{236}{3465}& \frac{321}{3465}& \frac{384}{3465}& \frac{425}{3465}& \frac{444}{3465}& \frac{441}{3465}& \frac{416}{3465}& \frac{369}{3465}& \frac{243}{3465}
And now compare that with what they would do if they updated at step (iv) using BC rather than Averaging. I learn the distributions of the second and third individuals. Now, since I know how many times they tossed their coin, and I know that they updated by BC at step (ii), I thereby learn something about how their coin landed. I know that it landed in such a way that would lead them to update to $P(-|HH)$ and $P(-|TH)$, respectively. Now what exactly does this tell me? In the case of the second individual, it tells me that their coin landed $HH$, since that's the only evidence that would lead them to update to $P(-|HH)$. In the case of the third individual, my evidence is not quite so specific. I learn that their coin either landed $HT$ or $TH$, since either of those, and only one of those, would lead them to update to $P(-|TH)$. In general, learning an individual's posteriors when you know their prior and the number of times they've tossed the coin will teach you how many heads they saw and how many tails, though it won't tell you the order in which they saw them. But that's fine. We can still update on that information using BC, and indeed BC will tell us to adopt the same credence as we would if we were to learn the more specific evidence of the order in which the coin tosses landed. If we do so in this case, we get:
& B_0 & B_1& B_2& B_3& B_4& B_5& B_6& B_7& B_8& B_9& B_{10} \\
\hline&&&&&&&&&& \\
\mbox{Bayes} & 0 & \frac{81}{95205} & \frac{1024}{95205} & \frac{3969}{95205} & \frac{9216}{95205} & \frac{15625}{95205} & \frac{20736}{95205} & \frac{21609}{95205} & \frac{16384}{95205} & \frac{6561}{95205} &0 \\
$$And this is pretty far from what I got by Averaging at step (iv).

So updating using BC is very different from averaging. Why, then, do Douven and Wenmackers use Averaging rather than BC for step (iv)? Here is their motivation:

[T]aking a convex combination of the probability functions of the individual agents in a group is the best studied method of forming social probability functions. Authors concerned with social probability functions have mostly considered assigning different weights to the probability functions of the various agents, typically in order to reflect agents’ opinions about other agents’ expertise or past performance. The averaging part of our update rule is in some regards simpler and in others less simple than those procedures. It is simpler in that we form probability functions from individual probability functions by taking only straight averages of individual probability functions, and it is less simple in that we do not take a straight average of the probability functions of all given agents, but only of those whose probability function is close enough to that of the agent whose probability is being updated. (552)

In some sense, they're right. Averaging or linear pooling or taking a convex combination of individual credence functions is indeed the best studied method of forming social credence functions. And there are good justifications for it: János Aczél and Carl Wagner and, independently, Kevin J. McConway, give a neat axiomatic characterization; and I've argued that there are accuracy-based reasons to use it in particular cases. The problem is that our situation in step (iv) is not the sort of situation in which you should use Averaging. Arguments for Averaging concern those situations in which you have a group of individuals, possibly experts, and each has a credence function over the same set of propositions, and you want to produce a single credence function that could be called the group's collective credence function. Thus, for instance, if I wish to give the SAGE group's collective credence that there will be a safe and effective SARS-CoV-2 vaccine by March 2021, I might take the average of their individual credences. But this is quite a different task from the one that faces me as the first individual when I reach step (iv) of Douven and Wenmackers' process. There, I already have credences in the propositions in question. What's more, I know how the other individuals update and the sort of evidence they will have received, even if I don't know which particular evidence of that sort they have. And that allows me to infer from their credences after the update at step (ii) a lot about the evidence they receive. And I have opinions about the propositions in question conditional on the different evidence my fellow group members received. And so, in this situation, I'm not trying to summarise our individual opinions as a single opinion. Rather, I'm trying to use their opinions as evidence to inform my own. And, in that case, BC is better than Averaging. So, in order to show that IBE is superior to BC in some respect, it doesn't help to compare BC at step (ii) + Averaging at step (iv) with IBE at (ii) + Averaging at (iv). It would be better to compare BC at (ii) and (iv) with IBE at (ii) and (iv).

So how do things look if we do that? Well, it turns out that we don't need simulations to answer the question. We can simply appeal to the mathematical results we mentioned in the previous post: first, Hilary Greaves and David Wallace's expected accuracy argument; and second, the accuracy dominance argument that Ray Briggs and I gave. Or, more precisely, we use the slight extensions of those results to multiple learning experiences that I sketched in the previous post. For both of those results, the background framework is the same. We begin with a prior, which we hold at $t_0$, before we begin gathering evidence. And we then look forward to a series of times $t_1, \ldots, t_n$ at each of which we will learn some evidence. And, for each time, we know the possible pieces of evidence we might receive, and we plan, for each time, which credence function we would adopt in response to each of the pieces of evidence we might learn at that time. Thus, formally, for each $t_i$ there is a partition from which our evidence at $t_i$ will come. For each $t_{i+1}$, the partition is a fine-graining of the partition at $t_i$. That is, our evidence gets more specific as we proceed. In the case we've been considering, at $t_1$, we'll learn the outcome of our own coin tosses; at $t_2$, we'll add to that our fellow group members' credence functions at $t_1$, from which we can derive a lot about the outcome of their first run of coin tosses; at $t_3$, we'll add to that the outcome of our next run of our own coin tosses; at $t_4$, we'll add our outcomes of the other group members' coin tosses by learning their credences at $t_3$; and so on. The results are then as follows: 

Theorem (Extended Greaves and Wallace) For any strictly proper inaccuracy measure, the updating rule that minimizes expected inaccuracy from the point of view of the prior is BC.

Theorem (Extended Briggs and Pettigrew) For any continuous and strictly proper inaccuracy measure, if your updating rule is not BC, then there is an alternative prior and alternative updating rule that accuracy dominates your prior and your updating rule.

Now, these results immediately settle one question: if you are an individual in the group, and you know which update rules the others have chosen to use, then you should certainly choose BC for yourself. After all, if you have picked your prior, then it expects picking BC to minimize your inaccuracy, and thus expects picking BC to minimize the total inaccuracy of the group that includes you; and if you have not picked your prior, then if you consider a prior together with something other than BC as your updating rule, there's some other combination you could chose instead that is guaranteed to do better, and thus some other combination you could choose that is guaranteed to improve the total accuracy of the group. But Douven and Wenmackers don't set up the problem like this. Rather, they assume that all members of the group use the same updating rule. So the question is whether everyone picking BC is better than everyone picking something else. Fortunately, at least in the case of the coin tosses, this does follow. As we'll see, things could get more complicated with other sorts of evidence.

If you know the updating rules that others will use, then you pick your updating rule simply on the basis of its ability to get you the best accuracy possible; the others have made their choices and you can't affect that. But if you are picking an updating rule for everyone to use, you must consider not only its properties as an updating rule for the individual, but also its properties as a means of signalling to the other members what evidence you have. Thus, prior to considering the details of this, you might think that there could be an updating rule that is very good at producing accurate responses to evidence, but poor at producing a signal to others of the evidence you've received---there might be a wide range of different pieces of evidence you could receive that would lead you to update to the same posterior using this rule, and in that case, learning your posterior would give little information about your evidence. If that were so, we might prefer an updating rule that does not produce such accurate updates, but does signal very clearly what evidence is received. For, in that situation, each individual would produce a less accurate update at step (ii), but would then receive a lot more evidence at step (iv), because the update at step (ii) would signal the evidence that the other members of the group received much more clearly. However, in the coin toss set up that Douven and Wenmackers consider, this isn't an issue. In the coin toss case, learning someone's posterior when you know their prior and how many coin tosses they have observed allows you to learn exactly how many heads and how many tails they observed. It doesn't tell you the order in which you learned them, but knowing that further information wouldn't affect how you would update anyway, either on the BC rule or on the IBE rule---learning $HT \vee TH$ leads to the same update as learning $HT$ for both Bayesian and IBEist. So when we are comparing them, we can consider the information learned at step (ii) and step (iv) both to be worldly information. Both give us information about the tosses of the coin that our peers witnessed. So when we are comparing them, we needn't take into account how good they are at signalling the evidence you have. They are both equally good and both very good. So comparing them when choosing a single rule that each member of the group must use, we need only compare the accuracy of using them as update rules. And the theorems above indicate that BC wins out on that measure.