M-Phi

Discount code for Bertrand's Paradox and the Principle of Indifference by Nicholas Shackel

2024-05-23T08:34:00.001+01:00

Nicholas Shackel (Cardiff) has a new book out that might be of interest to readers, and there is a discount code available that Nick has asked me to pass on. Details below.

Bertrand’s Paradox and the Principle of Indifference

Nicholas Shackel

This book casts a new light on Bertrand's Paradox, giving original analyses of the paradox, its possible solutions, the source of the paradox, the philosophical errors we make in attempting to solve it and what the paradox proves for the philosophy of probability. Bertrand's Paradox and the Principle of Indifference will appeal to scholars and advanced students working in the philosophy of mathematics, epistemology, philosophy of science, probability theory, and mathematical physics.

“This is a very useful resource for graduate students and researchers interested in one of the most challenging puzzles in the theory of probability.”

Hykel Hosni, University of Milan, Italy

“This is essential reading for anyone seriously interested in Bertrand’s chord paradox or the broader epistemic issue of the status of the principle of indifference (and the maximum entropy principle).”

Darrell P. Rowbottom, Lingnan University, Hong Kong

Discount code: 20% Discount with code AFL04

www.routledge.com/9781032597935

On contractualism, reasonable compromise, and the source of priority for the worst-off

2023-08-22T07:28:00.000+01:00

Different policies introduced by a social planner, whether the government of a country or the head of an institution, lead to situations in which different people's lives go better or worse. That is, in the jargon of this area, they lead to different distributions of welfare across the individuals they affect. If we allow the unfettered accumulation of private wealth, that will lead to one distribution of welfare across the people in the country where the policy is adopted; if we cap such wealth, tax it progressively, or prohibit it altogether, those policies will lead to different distributions. The question I want to think about in this post is a central question of social choice theory: How should we choose between such policies? Again using the jargon of the area, I want to sketch a particular sort of social contract argument for a version of the prioritarian's answer to this question, and to show that this answer avoids an objection to the standard version of prioritarianism raised by Alex Voorhoeve and Mike Otsuka. But for those unfamiliar with this jargon, all will hopefully become clear.

Disclaimer: I've only newly arrived in ethics and social choice theory, so while I've tried to find versions of this argument and failed, it's quite possible, indeed quite likely, that it already exists. Part of my hope in writing this post is that someone points me towards it!

A cat with whom there is no reasonable compromise

1. Two approaches to social planning: axiological and contractual

There are two sorts of situation in which the social planner might find themselves: in the first, they are certain of the consequences of the policies they might adopt; in the second, they are not. Throughout the post, I'll assume there are just two people in the population that the social planner's choices will affect, Ada and Bab, and I'll write $(u, v)$ for the welfare distribution in which Ada has welfare level $u$ and Bab has $v$. The following table represents a situation in which the social planner knows the world is in state $s$, and they have two options, $o_1$ and $o_2$, where the first gives Ada $1$ unit of welfare and Bab $9$, while the second option gives Ada and Bab $4$ units each. $$\begin{array}{r|c} & s \\ \hline o_1 & (1, 9) \\ o_2 & (4,4)\end{array}$$And this table represents a situation in which the social planner is uncertain whether the world is in state $s_1$ or $s_2$, and the welfare distributions are as indicated: $$\begin{array}{r|cc} & s_1 & s_2 \\ \hline o_1 & (2, 10) & (7,7) \\ o_2 & (4,4) & (1,20)\end{array}$$

There are (at least) two ways to approach the social planner's choice: axiological and contractual.

An axiologist provides a recipe that takes a distribution of welfare at a state of the world, such as $(1, 9)$, and aggregates it in some way to give a measure of the social welfare of that distribution, which we might think of as the group's welfare given that option at that state of the world. If there is no uncertainty, then the social planner ranks the options by their social welfare; if there is uncertainty, then the social planner uses standard decision theory to choose, using the social welfare levels as the utilities---so, for instance, they might choose by maximizing expected social welfare.

Average and total utilitarians are axiologists; so are average and total prioritarians.

The average utilitarian takes the social welfare to be the average welfare, so that the social welfare of the distribution $(1, 9)$ is $\frac{1+9}{2} = 5$;

The total utilitarian takes it to be the total, i.e., $1+9=10$.

The average prioritarian takes each level of welfare in the distribution, transforms it by applying a concave function, and then takes the average of these transformed welfare levels. The idea is that, like utilitarianism, increasing an individual's welfare while keeping everything else fixed should increase social welfare, but, unlike utilitarianism, increasing a worse-off person's welfare by a given amount should increase social welfare more than increasing a better-off person's welfare by the same amount; put another way, increasing an individual's welfare should have diminishing marginal moral value, just as we sometimes say that increasing an individual's monetary wealth has diminishing marginal welfare value. So, for instance, if they use the logarithmic function $\log(x)$ to transform the levels of welfare, the social welfare of $(1, 9)$ is $\frac{\log(1) + \log(9)}{2} \approx 1.099$. Notice that, if I increase Bab's 9 units of welfare to 10 and keep Ada's fixed at 1, the average prioritarian value goes from $\frac{\log(1) + \log(9)}{2} \approx 1.099$ to $\frac{\log(1) + \log(10)}{2} \approx 1.151$, whereas if I increase Ada's 1 unit to 2 and leave Bab's fixed at 9, the average prioritarian value goes from $\frac{\log(1) + \log(9)}{2} \approx 1.099$ to $\frac{\log(2) + \log(9)}{2} \approx 1.445$.

The total prioritarian takes the social welfare to be the total of these transformed welfare levels. So, if our concave function is the logarithmic function, the social welfare of $(1,9)$ is $\log(1) + \log(9) \approx 2.197$.
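To make the four recipes concrete, here is a minimal sketch in Python. The function names are my own labels rather than standard terminology, and the logarithm is just the example transform used above; any concave function could be passed in instead.

```python
import math

def average_utilitarian(dist):
    """Social welfare as the mean of the individual welfare levels."""
    return sum(dist) / len(dist)

def total_utilitarian(dist):
    """Social welfare as the sum of the individual welfare levels."""
    return sum(dist)

def average_prioritarian(dist, transform=math.log):
    """Mean of the welfare levels after applying a concave transform."""
    return sum(transform(w) for w in dist) / len(dist)

def total_prioritarian(dist, transform=math.log):
    """Sum of the welfare levels after applying a concave transform."""
    return sum(transform(w) for w in dist)

dist = (1, 9)  # Ada has 1 unit of welfare, Bab has 9
for rule in (average_utilitarian, total_utilitarian,
             average_prioritarian, total_prioritarian):
    print(rule.__name__, round(rule(dist), 3))
```

With a fixed population the average and total versions always rank distributions the same way, since one is just the other divided by the population size.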

The second approach to the social planner's choice appeals to social contracts. For instance, Harsanyi, Rawls, and Buchak all think the social planner should choose as if they are a member of the society for whom they are choosing, and should do so with complete ignorance of whom, within that society, they are. These theorists differ only in what decision rule is appropriate behind such a veil of ignorance. Others think you should choose a policy that can be justified to each member of the society, where what that entails can be spelled out in a number of ways, such as minimizing the worst legitimate complaints members of the affected population might make against your decision, or minimizing the total legitimate complaints they might make, and where there are different ways to measure the legitimate complaints an individual might make.

2. The Voorhoeve-Otsuka Objection: social planning for individuals

One of the purposes of this blogpost is to bring the axiologists and contractualists together by showing that a certain version of contractualism leads to a certain axiological approach that resembles prioritarianism, but avoids an objection that has troubled that position. Let me spell out that objection, which was raised originally by Alex Voorhoeve and Mike Otsuka. In it, they ask us to imagine that the social planner is choosing for a population that contains just a single person, Cal. For the sake of concreteness, let's say they face the following choice: $$\begin{array}{r|cc} & 50\% & 50\% \\ & s_1 & s_2 \\ \hline o_1 & (2) & (3) \\ o_2 & (1) & (5)\end{array}$$Then prioritarianism tells the social planner to maximize expected social welfare: for $o_1$, this is $\frac{1}{2}\times \log(2) + \frac{1}{2} \times \log(3) \approx 0.896$; for $o_2$, it is $\frac{1}{2}\times \log(1) + \frac{1}{2} \times \log(5) \approx 0.805$. But standard decision theory says that Cal themselves should maximize their expected welfare: for $o_1$, this is $\frac{1}{2} \times 2 + \frac{1}{2} \times 3 = 2.5$; for $o_2$, it is $\frac{1}{2} \times 1 + \frac{1}{2} \times 5 = 3$. So the social planner will choose $o_1$, while Cal will choose $o_2$. That is, according to the prioritarian, morality requires the social planner to choose against Cal's wishes. But, Voorhoeve and Otsuka contend, that can't be right.
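The disagreement between the prioritarian planner and Cal can be checked in a few lines. This is only a sketch of the arithmetic above; the helper `expected` and the dictionary layout are my own illustrative choices.

```python
import math

probs = [0.5, 0.5]                      # probabilities of s_1 and s_2
options = {"o1": [2, 3], "o2": [1, 5]}  # Cal's welfare in each state

def expected(values, probs, transform=lambda w: w):
    """Expectation of transform(welfare) across the states."""
    return sum(p * transform(v) for p, v in zip(probs, values))

# The log-prioritarian planner maximizes expected transformed welfare.
planner = {name: expected(ws, probs, math.log) for name, ws in options.items()}
# Cal maximizes expected welfare itself.
cal = {name: expected(ws, probs) for name, ws in options.items()}

planner_choice = max(planner, key=planner.get)
cal_choice = max(cal, key=cal.get)
print(planner_choice, cal_choice)
```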

3. Justifying compromises to each

Finally, then, let me spell out my argument. It begins with a version of the social contract approach on which the social planner must be able to justify their choice to each person affected. When a policy affects different people differently, you can't reasonably expect that the social planner will make the choice by maximizing your own personal expected welfare. You must realise that you'll have to tolerate some degree of compromise with the welfare functions of others who are affected. So we seek a measure of social welfare that the social planner can use in their decision-making that effects a compromise between the welfare functions of the various individuals.

But of course compromises can be more or less reasonable; and some compromises an individual might reasonably reject. For instance, take the welfare distribution $(2, 7)$, and suppose I proffer a social welfare function that assigns this a social welfare of $10$. This is not reasonable! Why not? Well, a natural thing to say is that, while a compromise between two competing welfare functions will inevitably lie some distance from at least one of them, a compromise is unreasonable if it lies further than necessary from both, and this one does. After all, consider an alternative social welfare function that assigns a social welfare of $7$ instead of $10$. Then this lies closer to Ada's individual welfare, which is $2$, and to Bab's, which is $7$.

This suggests that one way to justify a compromise to each person affected is to show that the welfare function used to make a decision does not lie unreasonably far from the individual welfare functions. There are many ways to spell this out, but let me describe just two---if this project has any mileage, the major work will be in saying why one of these is the right way to go.

So what we need first is a measure of how far an individual's welfare lies from the social welfare. For the moment, I won't say what this is, but I'll present two alternatives below. Given such a measure, we might select our compromise social welfare to be the one that minimizes the total distance to the individual welfares. This seems like a compromise we could easily justify to all affected parties. "We took you all into account," we might say, "and each equally. As you'll understand," we might continue, "the resulting social welfare had to lie some distance from at least some of your individual welfares. But we've minimized the total distance summed over all of you."

So suppose that works; suppose that is sufficient to justify to each person in the population the social welfare function we'll use to choose between policies. Which measure of distance from a candidate social welfare to an individual welfare should we use when we sum up those distances and pick a candidate in a way that minimizes that sum? Again, the choice here requires some justification, but let me describe two that are popular in a range of contexts in which we must measure how far one number lies from another. There are a bunch of results that characterize each as the unique function with certain apparently desirable properties, but let's leave the question of picking between them aside and see what they say:

The first is known as squared Euclidean distance (SED):$$\mathrm{SED}(a, b) = |a-b|^2.$$ So the distance from $a$ to $b$ is just the square of the difference between them.

The second is known as generalized Kullback-Leibler divergence (GKL): $$\mathrm{GKL}(a, b) = a\log \left ( \frac{a}{b} \right ) - a + b.$$

Both SED and GKL are divergences: that is, the distance from one number to another is always non-negative; it is zero if both numbers are the same; it is positive otherwise.

So now suppose $(u, v)$ is the welfare distribution over Ada and Bab. For each of these two measures of distance, which is the social welfare that minimizes total distance to the individual welfares?

For SED, it is the arithmetic mean of $u$ and $v$, that is, $\frac{u+v}{2}$. That is, $\mathrm{SED}(x, u) + \mathrm{SED}(x, v)$ is minimized, as a function of $x$, at $x = \frac{u+v}{2}$. And, in general, $\mathrm{SED}(x, u_1) + \ldots + \mathrm{SED}(x, u_n)$ is minimized, as a function of $x$, at $x = \frac{u_1 + \ldots + u_n}{n}$.

For GKL, it is the geometric mean of $u$ and $v$, that is, $\sqrt{uv}$. That is, $\mathrm{GKL}(x, u) + \mathrm{GKL}(x, v)$ is minimized, as a function of $x$, at $x = \sqrt{uv}$. And, in general, $\mathrm{GKL}(x, u_1) + \ldots + \mathrm{GKL}(x, u_n)$ is minimized, as a function of $x$, at $x = \sqrt[n]{u_1 \times \ldots \times u_n}$.
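These two minimization claims are easy to check numerically. The sketch below uses a simple ternary search, which works here because both total-distance functions are convex in $x$; the search routine and the three sample welfare levels are my own illustrative choices.

```python
import math

def sed(a, b):
    """Squared Euclidean distance from a to b."""
    return (a - b) ** 2

def gkl(a, b):
    """Generalized Kullback-Leibler divergence from a to b (a, b > 0)."""
    return a * math.log(a / b) - a + b

def argmin_convex(f, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

welfares = [1.0, 4.0, 16.0]  # hypothetical welfare levels
lo, hi = min(welfares), max(welfares)

sed_min = argmin_convex(lambda x: sum(sed(x, u) for u in welfares), lo, hi)
gkl_min = argmin_convex(lambda x: sum(gkl(x, u) for u in welfares), lo, hi)

arithmetic_mean = sum(welfares) / len(welfares)
geometric_mean = math.prod(welfares) ** (1 / len(welfares))
print(sed_min, arithmetic_mean)
print(gkl_min, geometric_mean)
```

The analytic reason is short: the derivative of $\mathrm{SED}(x, u)$ with respect to $x$ is $2(x - u)$, and the derivative of $\mathrm{GKL}(x, u)$ is $\log(x) - \log(u)$, so setting the summed derivatives to zero gives the arithmetic and geometric means respectively.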

So, if we use SED, we recover average utilitarianism, while if we use GKL, we introduce a new(-ish) way to form social welfare functions from individual ones. I'll call this new(-ish) way geometric compromise.

Let's see average utilitarianism and geometric compromise at work in the case of choice under certainty from the introduction: $$\begin{array}{r|c} & s \\ \hline o_1 & (1, 9) \\ o_2 & (4,4)\end{array}$$The average utilitarian takes the social welfare of $(1, 9)$ to be $5$, and the social welfare of $(4, 4)$ to be $4$, while the geometric compromiser takes the social welfare of $(1, 9)$ to be $\sqrt{1\times 9} = 3$, and the social welfare of $(4, 4)$ to be $4$. So, while the average utilitarian will choose $o_1$, the geometric compromiser will choose $o_2$, just as the prioritarian will.

In fact, this agreement between geometric compromiser and prioritarian is no coincidence. In situations in which the welfare distribution delivered by the different options is known, and when the prioritarian transforms individual welfares by taking their logarithm before summing them to give the social welfare, these will always agree. That's because the geometric mean of a sequence of numbers is a strictly increasing function of the average of the logarithms of those numbers: in symbols,$$\sqrt{u \times v} = e^{\frac{\log(u) + \log(v)}{2}},$$ and $e^x$ is a strictly increasing function of $x$.

However, geometric compromise and prioritarianism can come apart when there is uncertainty about the outcome of a policy. That is because the strictly increasing function that transforms the average prioritarian's account of social welfare into the geometric compromise is not a linear one.
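Here is a small sketch of both points: agreement under certainty, and possible disagreement under uncertainty. The certain case is the one from the table above; the uncertain case, with options I've labelled `a` and `b`, is a hypothetical example of my own, chosen only because the numbers come out cleanly.

```python
import math

def geometric_compromise(dist):
    """Geometric mean of the welfare levels."""
    return math.prod(dist) ** (1 / len(dist))

def log_prioritarian(dist):
    """Average prioritarian social welfare with the log transform."""
    return sum(math.log(w) for w in dist) / len(dist)

def expected_social_welfare(outcomes, probs, social_welfare):
    """Expected social welfare of an option across states."""
    return sum(p * social_welfare(d) for p, d in zip(probs, outcomes))

# Certainty: the two rules rank the options the same way.
certain = {"o1": (1, 9), "o2": (4, 4)}
geo_choice = max(certain, key=lambda o: geometric_compromise(certain[o]))
pri_choice = max(certain, key=lambda o: log_prioritarian(certain[o]))

# Uncertainty (hypothetical): two equiprobable states.
probs = [0.5, 0.5]
risky = {"a": [(4, 4), (4, 4)], "b": [(1, 1), (9, 9)]}
geo_risky = max(risky, key=lambda o: expected_social_welfare(
    risky[o], probs, geometric_compromise))
pri_risky = max(risky, key=lambda o: expected_social_welfare(
    risky[o], probs, log_prioritarian))

print(geo_choice, pri_choice)  # same option
print(geo_risky, pri_risky)    # different options
```

In the uncertain case the geometric compromiser compares expected social welfare $4$ with $\frac{1}{2} \times 1 + \frac{1}{2} \times 9 = 5$, while the prioritarian compares $\log(4)$ with $\frac{1}{2}\log(9) = \log(3)$, so they pull in opposite directions.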

But a significant advantage of geometric compromise over prioritarianism is that, when there is only a single person affected by a policy, the social welfare function coincides with that individual's welfare function. After all, our contractualist approach says that you should take the social welfare of a distribution to be the value such that total distance from that value to the welfare levels of the individuals is minimized. When there is just one individual, the value that minimizes this is simply that individual's welfare level, since GKL is a divergence, as mentioned above. So, when deciding under uncertainty on behalf of just one person, the social planner will choose by maximizing expected social welfare, which is just maximizing expected individual welfare, and there will be no tension between what the social planner chooses on the individual's behalf and what the individual would have chosen on their own behalf. We thereby avoid the objection from Voorhoeve and Otsuka.
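Applied to Cal's choice from the Voorhoeve-Otsuka example, this can be sketched as follows. One-person distributions are written as one-element tuples, which is just my own encoding.

```python
import math

def geometric_compromise(dist):
    """Geometric mean of the welfare levels."""
    return math.prod(dist) ** (1 / len(dist))

probs = [0.5, 0.5]
# Cal is the only person affected, so each distribution has one entry.
options = {"o1": [(2,), (3,)], "o2": [(1,), (5,)]}

# With one individual, social welfare is just that individual's welfare.
single_person_identity = all(
    math.isclose(geometric_compromise(d), d[0])
    for outcomes in options.values() for d in outcomes
)

expected = {
    name: sum(p * geometric_compromise(d) for p, d in zip(probs, outcomes))
    for name, outcomes in options.items()
}
planner_choice = max(expected, key=expected.get)
print(single_person_identity, expected, planner_choice)
```

The planner's expected social welfare for each option is then Cal's own expected welfare, so the planner picks the option Cal would pick.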

4. Properties of Geometric Compromise

What properties does geometric compromise have, and how do they compare with the properties that utilitarianism and prioritarianism have?

Welfarism: like utilitarianism, prioritarianism, and egalitarianism, according to geometric compromise, when there is no uncertainty about the outcomes of different policies, the social planner's ranking of those policies depends only on the welfare distributions to which they give rise.

Anonymity: like utilitarianism, prioritarianism, and egalitarianism, according to geometric compromise, if one welfare distribution is obtained from another by changing only the identity of the individuals who receive the different levels of welfare, then both distributions have the same social welfare. That is because $\sqrt{u \times v} = \sqrt{v \times u}$.

Pigou-Dalton: like prioritarianism and egalitarianism, but unlike utilitarianism, according to geometric compromise, if one welfare distribution is obtained from another by taking a particular amount of welfare from a better-off individual and giving it to a worse-off individual in a way that still leaves the recipient worse off than the donor, then the resulting distribution has higher social welfare. That is because, if $\varepsilon > 0$ and $u + \varepsilon < v - \varepsilon$, then $\sqrt{u \times v} < \sqrt{(u+\varepsilon) \times (v - \varepsilon)}$.

Person Separability: like prioritarianism and utilitarianism, but unlike egalitarianism, according to geometric compromise, the order of the social welfare of two distributions depends only on the welfare of the individuals who have different welfare in those two distributions. That is because $\sqrt{u \times v} \leq \sqrt{u \times v'}$ iff $\sqrt{v} \leq \sqrt{v'}$.
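A quick numerical spot-check of three of these properties, as a sketch only: the particular values of $u$, $v$, $v'$ and $\varepsilon$ are hypothetical, and checking one instance is of course no substitute for the one-line proofs above.

```python
import math

def geometric_compromise(dist):
    """Geometric mean of the welfare levels."""
    return math.prod(dist) ** (1 / len(dist))

u, v, eps = 2.0, 10.0, 3.0   # chosen so that u + eps < v - eps
v_prime = 12.0

# Anonymity: swapping who gets what leaves social welfare unchanged.
anonymity = math.isclose(geometric_compromise((u, v)),
                         geometric_compromise((v, u)))

# Pigou-Dalton: a transfer from better-off to worse-off raises social welfare.
pigou_dalton = (geometric_compromise((u, v))
                < geometric_compromise((u + eps, v - eps)))

# Average utilitarianism, by contrast, is indifferent to the transfer.
utilitarian_indifferent = (u + v) / 2 == ((u + eps) + (v - eps)) / 2

# Person separability: with Ada fixed at u, the ranking tracks Bab's welfare.
separability = ((geometric_compromise((u, v))
                 <= geometric_compromise((u, v_prime))) == (v <= v_prime))

print(anonymity, pigou_dalton, utilitarian_indifferent, separability)
```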

5. Generalizing geometric compromise

Prioritarianism says that, to obtain the social welfare of a distribution, you should take each individual's welfare, transform it by a concave function, and add up the transformations. As we've seen, if the concave function is the logarithmic function this orders distributions exactly as geometric compromise orders them; and geometric compromise is the compromise that minimizes total distance to the individual welfares when that distance is given by GKL. But suppose the concave function isn't the logarithmic function but something else---let's call it $f$. Is there a measure of distance such that minimizing total distance to the individual welfares using that gives a compromise social welfare that agrees with prioritarianism-with-$f$ on the ordering of distributions, and again avoids the Voorhoeve-Otsuka objection? Happy news: there is! What follows in this section gets a little technical, so do feel free to skip---it's just an explicit construction of the measure of distance we need.

There is a reasonably well-studied class of functions known as the Bregman divergences, which we can use to measure the distance from one number to another. They all have the following form: take a strictly convex, differentiable function $\varphi$ and define $d_\varphi$ as follows:$$d_\varphi(x, y) = \varphi(x) - \varphi(y) - \varphi'(y)(x-y).$$Now, let $F$ be an anti-derivative of $f$, that is, $F' = f$. Then, if $f$ is differentiable and strictly increasing with $f' > 0$, $F$ is differentiable and strictly convex, since $F'' = f' > 0$. Then take the measure of distance used to assess compromises to be $d_F$. The derivative of $d_F(x, u_i)$ with respect to $x$ is $F'(x) - F'(u_i) = f(x) - f(u_i)$, so $d_F(x, u_1) + \ldots + d_F(x, u_n)$ is minimized, as a function of $x$, where $f(x)$ equals the average of the $f(u_i)$, that is, at $$f^{-1}\left (\frac{f(u_1) + \ldots + f(u_n)}{n} \right ).$$ (When $f$ is the logarithmic function, $d_F$ is GKL and this is the geometric mean, as before.) And $f^{-1}$ is strictly increasing. So, the social welfare of $(u, v)$ according to average-prioritarianism-with-$f$ is greater than the social welfare of $(u',v')$ according to average-prioritarianism-with-$f$ iff the social welfare assigned to $(u, v)$ by the compromise social welfare produced using $d_F$ is greater than the social welfare assigned to $(u', v')$ by the compromise social welfare produced using $d_F$.

What's more, if there is just one individual, the social welfare of the distribution $(u)$ is $f^{-1}(f(u)) = u$, which is just the individual's welfare. So, again, we avoid Voorhoeve and Otsuka's objection.
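To see the construction at work with a transform other than the logarithm, here is a sketch that takes $f$ to be the square root, a choice I've made purely for illustration. Differentiating $d_F(x, u_i)$ with respect to $x$ gives $f(x) - f(u_i)$, so the minimizer of the total distance should be the point where $f(x)$ equals the average of the $f(u_i)$; the code checks that numerically with a ternary search.

```python
import math

def f(w):
    """Concave, strictly increasing transform (hypothetical choice)."""
    return math.sqrt(w)

def f_inverse(y):
    """Inverse of f on the non-negative reals."""
    return y ** 2

def F(w):
    """An antiderivative of f, so F' = f; strictly convex for w > 0."""
    return (2.0 / 3.0) * w ** 1.5

def d_F(x, y):
    """Bregman divergence generated by F."""
    return F(x) - F(y) - f(y) * (x - y)

def argmin_convex(g, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if g(m1) < g(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

welfares = [1.0, 9.0]
numeric = argmin_convex(lambda x: sum(d_F(x, u) for u in welfares),
                        min(welfares), max(welfares))
closed_form = f_inverse(sum(f(u) for u in welfares) / len(welfares))
single = f_inverse(f(7.0))  # one-person population with welfare 7

print(numeric, closed_form, single)
```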

6. Problems

Let me round this off by mentioning two closely related issues with the proposal. First, geometric compromise isn't defined when the utilities are negative; second, the orderings to which it gives rise aren't invariant under positive linear transformation of the welfare values. Let's start with the second. According to many accounts of how we measure welfare numerically, if one set of numbers adequately represents welfare levels, then so does any positive linear transformation of it: that is, you can multiply the numbers by a positive constant and add or subtract any constant, and you'll end up with a different but equally adequate representation of the levels of welfare. This is sometimes put by saying that there is no privileged zero or unit for welfare. However, while the ordering placed on welfare distributions by geometric compromise is invariant under multiplication by a positive constant (since $\sqrt{ku \times kv} = k\sqrt{u\times v}$), it is not invariant under addition or subtraction of a constant. This means that, in order for our proposal to make sense, there must be a privileged zero for welfare. However, that doesn't seem so implausible. After all, we might just take the minimum welfare level and let that be zero. That would also solve the first problem, since then it would make no sense to ascribe a negative welfare level and we would no longer have to worry about defining geometric compromise for such levels. I suspect this is the right way to go. I can imagine some complaining that there is no minimum welfare level, but I rather doubt that is true.
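The failure of invariance under adding a constant shows up already in the two distributions from the introduction. In the sketch below, the scale factor and the shift of $10$ are arbitrary illustrative values.

```python
import math

def geometric_compromise(dist):
    """Geometric mean of the welfare levels."""
    return math.prod(dist) ** (1 / len(dist))

def prefers_second(d1, d2):
    """True if geometric compromise ranks d2 strictly above d1."""
    return geometric_compromise(d2) > geometric_compromise(d1)

def scale(dist, k):
    return tuple(k * w for w in dist)

def shift(dist, c):
    return tuple(w + c for w in dist)

a, b = (1, 9), (4, 4)

base = prefers_second(a, b)                           # compares 3 with 4
scaled = prefers_second(scale(a, 10), scale(b, 10))   # compares 30 with 40
shifted = prefers_second(shift(a, 10), shift(b, 10))  # sqrt(209) vs 14

print(base, scaled, shifted)
```

Multiplying by $10$ preserves the ranking, but adding $10$ to every welfare level reverses it, since $\sqrt{11 \times 19} = \sqrt{209} > 14$.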

Teru Thomas on the Veil of Ignorance

2023-08-10T17:27:00.006+01:00

Before you is a range of options: perhaps they are different laws you might implement in the country you govern, or institutions you might inaugurate; perhaps they are public health measures you might implement, or strategies for combating the climate crisis. Whatever they are, there is a population they will affect, and there is some uncertainty about how well-off each option will leave each person in that population. For each way the world might be, you know how well-off each option will leave each person if it is that way; and you have probabilities for each way the world might be. How should you choose between the options?

An old idea is that you should choose for this population as you would if you were choosing for yourself from behind a veil of ignorance. That is, you should reduce this social choice scenario to an individual choice scenario as follows: assume you are in the population, but completely uncertain who in the population you are; and then you choose as a rational individual would choose in that situation. That is, the correct social choice is the correct individual choice made behind the veil of ignorance. Rawls adopts this idea, and thinks that the correct individual choice in that situation is the one whose worst-case scenario is best; Harsanyi, on the other hand, thinks the correct individual choice is the one that maximises expected utility; and Lara Buchak thinks it's the one that maximises risk-weighted expected utility by the lights of the most risk-averse reasonable risk attitudes.

Bluebell hiding beneath the veil of ignorance

It's a nice idea, and intuitively it captures something of what we mean by picking the fairest option. But is there anything inevitable about it? Drawing on his earlier work with David McCarthy and Kalle Mikkola, Teru Thomas has offered an argument that there is. It's an extremely elegant argument, so I thought it might be worth doing a post to advertise it, and to walk through the essential steps.

The structure of the argument is this: Thomas lays down three principles and shows that it follows from those alone that you should choose in the social case as you would choose in the individual case from behind the veil of ignorance. The three principles are all invariance principles: that is, they say that when two social choice situations are similar in a certain way, you should choose similarly in both. I'll illustrate the argument with the simplest example that can illustrate the reasoning: that is, one where there are just three individuals in the population, Ada, Bab, and Cam, only two possible states of the world, $s_1$ and $s_2$, and two options, $o_1$ and $o_2$, from which we must choose. We will transform the original social choice problem into the associated individual choice problem behind the veil of ignorance in three steps; and, at each step, one of the invariance principles will tell us that the options that are permissible in the transformed choice problem are those that correspond to the options that are permissible in the original choice problem.

So let's start with the original social choice problem, which we write in the following payoff matrix, where $(u, v, w)$ is the outcome in which Ada gets utility $u$, Bab gets $v$, and Cam gets $w$, and where the bottom row gives the probabilities of the states of the world. $$\begin{array}{r|cc} & s_1 & s_2 \\ \hline o_1 & (u_{11}, v_{11}, w_{11}) & (u_{12}, v_{12}, w_{12}) \\o_2 & (u_{21}, v_{21},w_{21}) & (u_{22}, v_{22}, w_{22}) \\ \hline P & p_1 & p_2 \end{array}$$

We begin by describing the states of the world in a more fine-grained way that doesn't affect how well-off each person is at that state of the world. Indeed, we simply assume that, at each state of the world, there is an ordered list of the people in the population. In our case, where the population comprises just Ada, Bab, and Cam, there are six such lists: Ada-Bab-Cam; Ada-Cam-Bab; ...; Cam-Ada-Bab; Cam-Bab-Ada. So there are twelve fine-grained states of the world:

a version of $s_1$ that contains the list Ada-Bab-Cam, which we write $s^{ABC}_1$;
a version of $s_1$ that contains the list Ada-Cam-Bab, which we write $s^{ACB}_1$;
and so on.

As we said, these lists make no difference to how well-off Ada, Bab, or Cam is at any state of the world, and we consider each equally likely given a state of the world, and so the new pay-off matrix is this:$$\begin{array}{r|cccccc} & s^{ABC}_1 & s^{ACB}_1 & s^{BAC}_1 & s^{BCA}_1& s^{CAB}_1 & s^{CBA}_1 \\ \hline o'_1 & (u_{11}, v_{11}, w_{11}) & (u_{11}, v_{11}, w_{11})& (u_{11}, v_{11}, w_{11}) & (u_{11}, v_{11}, w_{11})& (u_{11}, v_{11}, w_{11}) & (u_{11}, v_{11}, w_{11}) \\ o'_2 & (u_{21}, v_{21}, w_{21}) & (u_{21}, v_{21}, w_{21}) & (u_{21}, v_{21}, w_{21}) & (u_{21}, v_{21}, w_{21})& (u_{21}, v_{21}, w_{21}) & (u_{21}, v_{21}, w_{21}) \\ \hline P & p_1/6 & p_1/6& p_1/6& p_1/6& p_1/6& p_1/6\end{array}$$

$$\begin{array}{r|cccccc} & s^{ABC}_2 & s^{ACB}_2 & s^{BAC}_2 & s^{BCA}_2& s^{CAB}_2 & s^{CBA}_2 \\ \hline o'_1 & (u_{12}, v_{12}, w_{12}) & (u_{12}, v_{12}, w_{12}) & (u_{12}, v_{12}, w_{12}) & (u_{12}, v_{12}, w_{12})& (u_{12}, v_{12}, w_{12}) & (u_{12}, v_{12}, w_{12}) \\ o'_2 & (u_{22}, v_{22}, w_{22}) & (u_{22}, v_{22}, w_{22}) & (u_{22}, v_{22}, w_{22}) & (u_{22}, v_{22}, w_{22})& (u_{22}, v_{22}, w_{22}) & (u_{22}, v_{22}, w_{22})\\ \hline P & p_2/6 & p_2/6& p_2/6& p_2/6& p_2/6& p_2/6 \end{array}$$

According to Thomas's first invariance principle, option $o'_i$ is permissible in this new choice situation iff $o_i$ is permissible in the old one. And in general, any fine-graining of the states that does not affect the utilities each individual gets should not affect which options are permissible. Thomas calls this refinement invariance.

The next thing to do is to provide a choice problem in which these lists do make a difference. Indeed, they determine who gets which utilities at a state of the world. So if, at state $s_i$ when the list is Ada-Bab-Cam, Ada gets utility $u$, Bab gets $v$, and Cam gets $w$, then, at that state when the list is Bab-Cam-Ada, for instance, Ada gets utility $v$, Bab gets $w$, and Cam gets $u$: the list names, in order, whose original utilities Ada, Bab, and Cam receive. That is, we have a new choice problem, where the probabilities remain the same as before:$$\begin{array}{r|cccccc} & s^{ABC}_1 & s^{ACB}_1 & s^{BAC}_1 & s^{BCA}_1& s^{CAB}_1 & s^{CBA}_1 \\ \hline o''_1 & (u_{11}, v_{11}, w_{11}) & (u_{11}, w_{11}, v_{11})& (v_{11}, u_{11}, w_{11}) & (v_{11}, w_{11}, u_{11})& (w_{11}, u_{11}, v_{11}) & (w_{11}, v_{11}, u_{11}) \\ o''_2 & (u_{21}, v_{21}, w_{21}) & (u_{21}, w_{21}, v_{21}) & (v_{21}, u_{21}, w_{21}) & (v_{21}, w_{21}, u_{21})& (w_{21}, u_{21}, v_{21}) & (w_{21}, v_{21}, u_{21}) \\ \hline P & p_1/6 & p_1/6& p_1/6& p_1/6& p_1/6& p_1/6\end{array}$$

$$\begin{array}{r|cccccc} & s^{ABC}_2 & s^{ACB}_2 & s^{BAC}_2 & s^{BCA}_2& s^{CAB}_2 & s^{CBA}_2 \\ \hline o''_1 & (u_{12}, v_{12}, w_{12}) & (u_{12}, w_{12}, v_{12}) & (v_{12}, u_{12}, w_{12}) & (v_{12}, w_{12}, u_{12})& (w_{12}, u_{12}, v_{12}) & (w_{12}, v_{12}, u_{12}) \\ o''_2 & (u_{22}, v_{22}, w_{22}) & (u_{22}, w_{22}, v_{22}) & (v_{22}, u_{22}, w_{22}) & (v_{22}, w_{22}, u_{22})& (w_{22}, u_{22}, v_{22}) & (w_{22}, v_{22}, u_{22})\\ \hline P & p_2/6 & p_2/6& p_2/6& p_2/6& p_2/6& p_2/6 \end{array}$$

According to Thomas's second invariance principle, option $o''_i$ is permissible in this new choice situation iff $o'_i$ is permissible in the previous one. To state the general version, we need a little terminology: given a choice problem, an individual's predicament in a state of the world relative to that choice problem is just the list of utilities they would get from the different options available in that choice problem at that state of the world. This second invariance principle says that, if two choice problems have the same populations and the same states of the world, and if at a given state of the world the number of people in a given predicament is the same in the two choice problems, then the same options should be permissible. Thomas calls this statewise invariance. The utilities obtained for the individuals by $o'_i$ differ from those obtained by $o''_i$ only in who gets what, but this preserves how many are in a given predicament, and so statewise invariance secures the conclusion.
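The first two steps can be sketched in code, along with a check that the condition of statewise invariance is met. The concrete utilities are hypothetical stand-ins for the $u_{ij}$, $v_{ij}$, $w_{ij}$, and the function names are my own. In the permuted problem I let each ordering say whose original utility each person receives; since all six orderings appear, the count of people in each predicament comes out the same whichever way the reassignment is read.

```python
from collections import Counter
from itertools import permutations

# Hypothetical utilities: original[option][state] = (Ada, Bab, Cam)
original = {
    "o1": {"s1": (1, 2, 3), "s2": (4, 5, 6)},
    "o2": {"s1": (7, 8, 9), "s2": (3, 1, 2)},
}
states = ("s1", "s2")
orderings = list(permutations(range(3)))  # the six ordered lists
fine_states = [(s, perm) for s in states for perm in orderings]

def refined(problem):
    """Step 1: split each state by ordering; utilities are unchanged."""
    return {o: {(s, perm): problem[o][s] for (s, perm) in fine_states}
            for o in problem}

def permuted(problem):
    """Step 2: person k receives the utility originally given to perm[k]."""
    return {o: {(s, perm): tuple(problem[o][s][perm[k]] for k in range(3))
                for (s, perm) in fine_states}
            for o in problem}

def predicament_counts(problem, fine_state):
    """How many people face each predicament (utilities across options)."""
    options = sorted(problem)
    return Counter(tuple(problem[o][fine_state][k] for o in options)
                   for k in range(3))

step1, step2 = refined(original), permuted(original)
statewise_ok = all(
    predicament_counts(step1, fs) == predicament_counts(step2, fs)
    for fs in fine_states
)
print(len(fine_states), statewise_ok)
```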

The final move is to compare the social choice problem we've just described with the individual choice problem behind the veil of ignorance that is extracted from the original social choice problem. In this problem, our chooser is not only ignorant of the state of the world, but also ignorant of who they are. We'll call our chooser Deb. So there are six states:

they're Ada in $s_1$, which we write $s^A_1$;
they're Bab in $s_1$, which we write $s^B_1$;
they're Cam in $s_1$, which we write $s^C_1$;
and so on.

And given a state of the world, it is equally likely our chooser is Ada, Bab, or Cam. So the choice problem is this:$$\begin{array}{r|cccccc} & s^A_1 & s^B_1 & s^C_1 & s^A_2 & s^B_2 & s^C_2 \\ \hline o^*_1 & u_{11} & v_{11} & w_{11}& u_{12} & v_{12} & w_{12} \\o^*_2 & u_{21} & v_{21} & w_{21} & u_{22} & v_{22} &w_{22} \\ \hline P & p_1/3& p_1/3& p_1/3& p_2/3& p_2/3& p_2/3 \end{array}$$

According to Thomas's third and final invariance principle, option $o^*_i$ is permissible iff $o''_i$ is permissible. Take two social choice problems, possibly with different populations and states, but with the same number of options. And now suppose that, for any given predicament, the probability of being in a state at which that is your predicament is the same for any individual in one population and any individual in the other. Then the same options should be permissible. Thomas calls this personwise invariance.

Let's see why this holds between the veil of ignorance decision between $o^*_1$ and $o^*_2$, and the social choice between $o''_1$ and $o''_2$. From the first decision, take the only individual, namely, our chooser Deb. From the second decision, take Ada, for instance. Now take a predicament that Deb might be in: that is, a utility that $o^*_1$ will give her and a utility $o^*_2$ will give her; and suppose she'll get that in state $s_1$ if she's Ada, and $s_2$ if she's Cam. So the probability she's in that predicament is $p_1/3 + p_2/3$. But then Ada faces that predicament in $s^{ABC}_1$ and $s^{ACB}_1$, since Deb faces it as Ada in state $s_1$, and Ada faces that predicament in $s^{CAB}_2$ and $s^{CBA}_2$, since Deb faces it as Cam in state $s_2$. So the probability Ada is in that predicament is $p_1/6 + p_1/6 + p_2/6 + p_2/6 = p_1/3 + p_2/3$. And so the probability that Deb faces that predicament is the same as the probability Ada does. And so on for other individuals and other predicaments. And so personwise invariance secures the conclusion.
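The same probability bookkeeping can be done mechanically for every individual and every predicament at once. In this sketch the utilities and the state probabilities are hypothetical, and exact fractions are used so the comparison isn't affected by rounding.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations

# Hypothetical state probabilities and utilities (Ada, Bab, Cam).
p = {"s1": Fraction(1, 4), "s2": Fraction(3, 4)}
original = {
    "o1": {"s1": (1, 2, 3), "s2": (4, 5, 6)},
    "o2": {"s1": (7, 8, 9), "s2": (3, 1, 2)},
}
options = ("o1", "o2")

def deb_distribution():
    """Deb's probability of each predicament behind the veil of ignorance."""
    dist = defaultdict(Fraction)
    for s in p:
        for k in range(3):  # Deb is equally likely to be each person
            predicament = tuple(original[o][s][k] for o in options)
            dist[predicament] += p[s] / 3
    return dict(dist)

def person_distribution(person):
    """A person's probability of each predicament in the permuted problem."""
    dist = defaultdict(Fraction)
    for s in p:
        for perm in permutations(range(3)):  # six equally likely orderings
            predicament = tuple(original[o][s][perm[person]] for o in options)
            dist[predicament] += p[s] / 6
    return dict(dist)

personwise_ok = all(person_distribution(k) == deb_distribution()
                    for k in range(3))
print(personwise_ok)
```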

Stringing these three steps together, we get the conclusion that $o_i$ is permissible iff $o'_i$ is permissible iff $o''_i$ is permissible iff $o^*_i$ is permissible. That is, what it is permissible to choose in the social choice case is precisely what it's permissible to choose in the corresponding individual choice case in which you are behind the veil of ignorance, completely uncertain of who you are.

Of course, the invariance principles will not be to everyone's taste. For instance, consider the following two social choice problems. In both the population comprises Ada and Bab. In the first, the pay-off matrix is this:

$$\begin{array}{r|cc} & s_1 & s_2 \\ \hline o_1 & (4, 4) & (4, 4) \\ o_2 & (6, 3) & (6, 3)\end{array}$$In the second, it is this:

$$\begin{array}{r|cc} & s_1 & s_2 \\ \hline o'_1 & (4, 4) & (4, 4) \\ o'_2 & (6, 3) & (3, 6)\end{array}$$I think it's plausible that we would prefer $o_1$ to $o_2$ and $o'_2$ to $o'_1$, which goes against the requirement of statewise invariance, since, for each state and each predicament, the number of people who face that predicament in that state is the same. We prefer $o_1$ to $o_2$ because, while $o_2$ gives greater total welfare, it gives Bab no chance of being the better off person, and the increase in total welfare isn't sufficient to compensate for that unfairness. And we prefer $o'_2$ to $o'_1$ because it gives greater total welfare, and it gives both a chance of being the better off person. Of course, equality considerations might tell in favour of the first option in both choice problems, but we might assume that $3$ is a pretty high level of welfare and some modest inequality with that as the minimum level of welfare is acceptable.

Reviving an old argument for Conditionalization

2023-08-03T16:31:00.003+01:00

As a dear departed friend used to tell me, only a fool never changes their mind. So, endeavouring not to be a fool, I have changed my mind about something. In fact, I've changed it back, having changed it once before. The matter is a certain argument that Hannes Leitgeb and I proposed for the Bayesian norm of Conditionalization, which says that you should update your credences in response to evidence by conditionalizing them on it. A suitably-reconstructed and generalized version of the argument has three premises:

Credal Veritism The sole fundamental source of epistemic value for credences is their accuracy.
Strict Propriety Every legitimate measure of the accuracy of a credence function at a world is strictly proper. (That is, every probabilistic credence function expects itself to be most accurate.)
Maximize Evidentially-Truncated Subjective Expected Utility You should choose an option that maximizes a quantity we might call evidentially-truncated subjective expected utility, which is just subjective expected utility but calculated not over all the possible states of the world, but only over those that are compatible with your evidence.

Sapphire stubbornly refusing to receive any visual evidence

It is then straightforward to derive Conditionalization as a conclusion by showing that the credence function that maximizes evidentially-truncated expected accuracy from the point of view of your prior credence function and using a strictly proper measure of accuracy is the one obtained from your prior by conditionalizing on the evidence. First, suppose $C$ is your prior and suppose $C$ is probabilistic. Then, if $E$ is your evidence and $C(E) > 0$, then $C(-\mid E)$ is also probabilistic. And so, if $\mathfrak{A}$ is a strictly proper accuracy measure, then$$\sum_{w \in W} C(w \mid E) \mathfrak{A}(C(-\mid E), w) > \sum_{w \in W} C(w \mid E) \mathfrak{A}(C', w)$$for any $C' \neq C(-\mid E)$. But $C(w \mid E) = 0$ if $E$ is false at $w$ and $C(w \mid E) = \frac{C(w)}{C(E)}$ if $E$ is true at $w$, and so we have$$\sum_{w \in E} C(w)\mathfrak{A}(C(-\mid E), w) > \sum_{w \in E} C(w) \mathfrak{A}(C', w)$$as required.
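As a rough numerical illustration of the derivation above, the following Python sketch uses the negative Brier score as the accuracy measure (one strictly proper choice, picked here only for convenience) and checks the final inequality on a three-world example. The prior, the evidence, the randomly sampled rivals and all function names are assumptions made for illustration; none of them comes from the original argument.

```python
import random

def accuracy(credence, world):
    """Negative Brier score: minus the squared distance from the truth at `world`."""
    return -sum(((1.0 if w == world else 0.0) - c) ** 2 for w, c in credence.items())

def truncated_expected_accuracy(prior, evidence, candidate):
    """Prior-weighted accuracy of `candidate`, summed over the worlds in the evidence only."""
    return sum(prior[w] * accuracy(candidate, w) for w in evidence)

def conditionalize(prior, evidence):
    """C(- | E): renormalize inside E, assign 0 outside it."""
    total = sum(prior[w] for w in evidence)
    return {w: (prior[w] / total if w in evidence else 0.0) for w in prior}

def random_credence(worlds):
    """A randomly sampled probabilistic credence function over `worlds`."""
    raw = [random.random() for _ in worlds]
    return {w: r / sum(raw) for w, r in zip(worlds, raw)}

prior = {"w1": 0.2, "w2": 0.3, "w3": 0.5}   # hypothetical prior C
evidence = {"w1", "w2"}                      # hypothetical evidence E
posterior = conditionalize(prior, evidence)
best = truncated_expected_accuracy(prior, evidence, posterior)

random.seed(0)  # fixed seed so the check is repeatable
rivals = [random_credence(list(prior)) for _ in range(1000)]
beaten = all(truncated_expected_accuracy(prior, evidence, r) < best for r in rivals)
print(beaten)
```

A sampled check like this is of course no substitute for the proof; it only shows the inequality holding for the rivals that happen to be drawn, including, if you test it separately, the unrevised prior itself.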

Later, I came to worry about the third premise of the argument. Dmitri Gallow has formulated two of the main concerns well. First, he notes that the standard official justifications for maximizing expected utility do not support maximizing evidentially-truncated expected utility--for one thing, the credences by which we weight the utilities of an option at different states of the world when we calculate this quantity don't sum to 1. Second, he argues that the intuitive motivation for that principle relies on considerations that are not available to the credal veritist--specifically, it relies on evidential considerations.

I was convinced by these arguments as well as Gallow's remedy and I adopted that in my book, Epistemic Risk and the Demands of Rationality--Gallow thinks you should maximize expected utility, but your measure of accuracy should change when you receive evidence so that it assigns the same neutral constant value to any credence function at any world at which your evidence is not true; and doing this also favours conditionalizing. But I've come to think I was wrong to do this; I've come to think that Hannes and I were right all along. In this post, I try to answer Gallow's worries.

1. Justifying our decision rule

To answer Gallow's concerns, I'll adapt work by Martin Peterson and Kenny Easwaran to give a direct argument for premise 3, the decision-theoretic norm, Maximize Evidentially-Truncated Subjective Expected Utility.

Peterson and Easwaran both set out to provide what Peterson calls an ex post justification of Maximize Subjective Expected Utility. This contrasts with the more familiar ex ante justifications furnished by Savage's representation theorem. An ex ante justification begins with preferences over options, places constraints on those preferences (as well as the space of options), and proves that, for any preferences that satisfy them, there is a unique probability function and a utility function unique up to positive linear transformation such that you weakly prefer one option to another iff the expected utility of the first is at least as great as the expected utility of the second. An ex post justification, on the other hand, begins with probabilistic credences and utilities already given, places constraints on how you combine them to build preferences over options, and proves that preferences so built order options by their subjective expected utility. I will adapt their constraints so that the preferences so built order options instead by their evidentially-truncated subjective expected utility.

So, suppose we have a finite set of states of the world $W$, and a probability function $P$ over $W$. And suppose the options are functions that take each world in $W$ and return a real number that measures the utility of that option at that world. Then we introduce a few related definitions:

A fine-graining of $W$ is a finite set of worlds $W'$ together with a surjective function $h: W' \rightarrow W$. We write it $W'_h$.
Given an option $o$ defined on $W$, a fine-graining $W'_h$, and an option $o'$ defined on $W'$, $o'$ is the fine-graining of $o$ relative to $h$ if, for all $w'$ in $W'$, $o'(w') = o(h(w'))$. We write it $o'_h$.
Given a probability function $P$ defined on $W$, a fine-graining $W'_h$, and a probability function $P'$ defined on $W'$, $P'$ is a fine-graining of $P$ relative to $h$ if, for all $w$ in $W$,$$P(w)= \sum_{\substack{w' \in W' \\ h(w') = w}} P'(w')$$We write it $P'_h$.
Given a proposition $E \subseteq W$, a fine-graining $W'_h$, and a proposition $E' \subseteq W'$, $E'$ is the fine-graining of $E$ relative to $h$ if, for all $w'$ in $W'$, $w'$ is in $E'$ iff $h(w')$ is in $E$. We write it $E'_h$.
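These four definitions can be made concrete in a few lines of Python. This is only a minimal sketch: the worlds, the map `h`, the numbers and the helper names are all hypothetical, and the particular way each coarse world's probability is split among its fine-grained worlds is just one of many admissible fine-grainings of $P$.

```python
from fractions import Fraction as F

W = ["rain", "dry"]  # coarse-grained worlds (illustrative)

# A surjective map h from the fine-grained worlds W' onto W.
h = {"rain_cold": "rain", "rain_warm": "rain", "dry_cold": "dry", "dry_warm": "dry"}

def fine_grain_option(o, h):
    """The fine-graining of option o: o'(w') = o(h(w')) for every w' in W'."""
    return {w_fine: o[h[w_fine]] for w_fine in h}

def fine_grain_proposition(E, h):
    """The fine-graining of proposition E: w' is in E' iff h(w') is in E."""
    return {w_fine for w_fine in h if h[w_fine] in E}

def is_fine_graining_of(P_fine, P, h):
    """Check that each P(w) equals the sum of P'(w') over the w' that h sends to w."""
    return all(
        sum(p for w_fine, p in P_fine.items() if h[w_fine] == w) == P[w]
        for w in P
    )

P = {"rain": F(3, 5), "dry": F(2, 5)}
P_fine = {"rain_cold": F(1, 5), "rain_warm": F(2, 5),
          "dry_cold": F(1, 10), "dry_warm": F(3, 10)}
o = {"rain": 1, "dry": 4}

o_fine = fine_grain_option(o, h)
E_fine = fine_grain_proposition({"rain"}, h)
print(is_fine_graining_of(P_fine, P, h))
```

Note that the fine-grainings of an option and of a proposition are fixed by `h`, whereas a probability function typically has many fine-grainings, which is why the sketch checks the condition rather than constructing a unique $P'$.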

And suppose $E \subseteq W$ is a proposition that represents our total evidence at the time in question. We will lay down constraints on weak preference orderings $\preceq_{P'_h, W'_h}$ over options defined on $W'$, where $W'_h$ is a fine-graining of $W$ and $P'_h$ is a fine-graining of $P$ relative to $h$. Note that this imposes constraints on a weak preference order of options defined on $W$ itself, since the identity function on $W$ defines a fine-graining of $W$, and $P$ is a fine-graining of itself relative to this function.

Reflexivity $\preceq_{P'_h, W'_h}$ is reflexive.

Transitivity $\preceq_{P'_h, W'_h}$ is transitive.

These are pretty natural assumptions.

Dominance For any options $o$, $o'$ defined on $W'$,

If $o(w') \leq o'(w')$, for all $w'$ in $E'_h$, then $o \preceq_{P'_h, W'_h} o'$;
If $o(w') < o'(w')$, for all $w'$ in $E'_h$, then $o \prec_{P'_h, W'_h} o'$.

This says that, if one option is at least as good as another at every world compatible with my evidence, I weakly prefer the first to the second; and if one option is strictly better than another at every world compatible with my evidence, I strongly prefer the first to the second.

Grain Invariance $o_h \preceq_{P'_h, W'_h} o^*_h$ iff $o \preceq_{P, W} o^*$.

This says that the grain at which I describe them shouldn't make any difference to my ordering of two options.

Trade-Off Indifference If, for two possible worlds $w'_i, w'_j$ in $E'_h$,

$P'_h(w'_i) = P'_h(w'_j)$,
$o_h(w'_i) - o^*_h(w'_i) = o^*_h(w'_j) - o_h(w'_j)$
$o_h(w'_k) = o^*_h(w'_k)$, for all $w'_k \neq w'_i, w'_j$,

then $o_h \sim_{P'_h, W'_h} o^*_h$.

This says that, when two options differ only in their utilities at two equiprobable worlds compatible with my evidence, and the first is better than the second at one world and the second is better than the first at the other world, and by the same amount, then you should be indifferent between them.

Then we have the following theorem:

Main Theorem (Peterson 2004; Easwaran 2014) If, for every fine-graining $W'_h$ of $W$ and every fine-graining $P'_h$ of $P$ relative to $h$, $\preceq_{P'_h, W'_h}$ satisfies Reflexivity, Transitivity, Dominance, Grain Invariance, and Trade-Off Indifference, then, for any two options $o$, $o^*$ defined on $W$,$$o \preceq_{P, W} o^* \Leftrightarrow \sum_{w \in E} P(w)o(w) \leq \sum_{w \in E} P(w)o^*(w)$$
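To see what the ordering on the right-hand side of the theorem looks like in practice, here is a small Python sketch that ranks options by the truncated sum. The worlds, probabilities, utilities and function names are made up for illustration and are not part of Peterson's or Easwaran's presentation.

```python
from fractions import Fraction as F

def truncated_eu(P, E, o):
    """Evidentially-truncated expected utility: sum of P(w) * o(w) over w in E only."""
    return sum(P[w] * o[w] for w in E)

def at_most_as_good(P, E, o, o_star):
    """True when o is ranked no higher than o_star, as in the theorem's ordering."""
    return truncated_eu(P, E, o) <= truncated_eu(P, E, o_star)

P = {"w1": F(1, 4), "w2": F(1, 4), "w3": F(1, 2)}
E = {"w1", "w2"}  # the evidence rules out w3

# A trade-off between two equiprobable worlds in E:
# o beats o_star by 2 at w1 and loses by 2 at w2; they agree elsewhere.
o = {"w1": 5, "w2": 1, "w3": 0}
o_star = {"w1": 3, "w2": 3, "w3": 0}
indifferent = at_most_as_good(P, E, o, o_star) and at_most_as_good(P, E, o_star, o)

# Utilities at worlds outside E make no difference to the ranking.
o_big_outside = {"w1": 3, "w2": 3, "w3": 100}
same_value = truncated_eu(P, E, o_star) == truncated_eu(P, E, o_big_outside)

print(indifferent, same_value)
```

In this example the weights used in the sum add up to $P(E) = 1/2$ rather than 1, which is the feature of the rule that Gallow's first concern draws attention to.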

Hopefully this argument goes some way to answer Gallow's first concern, namely, that there's no principled reason to choose between options in the way Hannes Leitgeb and I imagined, that is, by maximizing evidentially-truncated subjective expected utility.

2. Does it appeal to non-veritist considerations?

Does this argument address Gallow's second worry? Here it is in Gallow's words:

"Why should you stop regarding the worlds outside of $E$ as epistemically possible, and thereby completely discount the accuracy of your credences at worlds outside of E? The natural answer to that question is: 'because those worlds are incompatible with your evidence.' This answer relies upon a norm like 'do not value accuracy at a world if it is incompatible with your evidence'. But this is a distinctively evidential norm. And we have done nothing to explain why someone who pursues accuracy alone, and cares not at all about evidence per se except insofar as it helps them attain their goal of accuracy, will have reason to abide by this evidential norm." (Emphasis in the original; page 10)

An appealing consequence of justifying Maximize Evidentially-Truncated Subjective Expected Utility using Peterson's and Easwaran's strategy is that we can see exactly where in our argument we make reference to the evidence. It is in Dominance and in Trade-Off Indifference. And, when we look at these, I think we can see how to meet Gallow's demand for an explanation. We can explain why someone whose only goal is the pursuit of accuracy should have preferences that obey both of these principles. The reason is that they care about actual accuracy; when we say they care about accuracy, we mean that they care about how accurate their credence function actually is. And what evidence does is rule out some worlds, telling us they're not actual. And so, if the accuracy of one credence function is greater than the accuracy of another at every world compatible with my evidence, then we should prefer the first to the second, since it is now certain that the actual accuracy of the first is greater than the actual accuracy of the second. It doesn't matter if the first is less accurate at some world that's incompatible with my evidence, because my evidence has ruled out that world; it's told me I'm not at it, and I care only about my accuracy at the world I inhabit. So that's how we motivate Dominance. In some sense, of course, it's an evidential norm, since what it demands depends on our evidence, and our motivation of course had to talk of evidence; but it is also a motivation that a credal veritist should find compelling.

And similarly for Trade-Off Indifference. Suppose two options differ only in that the first is better than the second at a world incompatible with the evidence and the second is better than the first by the same amount at an equiprobable world compatible with the evidence. Then this should not necessarily lead to indifference, since the second might be the actual world, but the first, our evidence tells us, is not, and so the betterness of the first at the world where it is better is no compensation for its poorer performance at the world where the second is better. So we can motivate the restriction of Trade-Off Indifference again by pointing to the fact that the credal veritist cares about their actual accuracy, and the evidence tells them something about that, and what it tells us motivates Dominance and Trade-Off Indifference.

3. Resistance to evidence revisited

Just to tie this in to the previous blogpost, it's worth repeating that this argument gives a curious parallel to the Good-Myrvold approach to gathering evidence that I've been thinking about recently. On that approach, you're considering an evidential situation, and you're deciding whether or not to place yourself in it. Such a situation is characterized by what proposition you'll learn at each different way the world might be. If you also know how you'll respond to whatever proposition you learn, you can use measures of accuracy, or epistemic utility functions more generally, to calculate the expected epistemic utility of putting yourself in the evidential situation.

However, in the cases of resistance to evidence that interest Mona Simion, you've already gathered the evidence, and you're evaluating what to do in response. At that point, the argument Hannes and I gave kicks in. And it tells you to conditionalize. Doing so has greater evidentially-truncated expected utility than not doing so. And so, while there might be cases in which you shouldn't put yourself in the evidential situation, because doing so doesn't maximize expected epistemic utility, if you do nonetheless find yourself in that situation and you learn some evidence, whether because you chose irrationally or because you were placed in it against your will, then you should update by conditionalizing on that evidence.

Mona Simion on resistance to evidence

2023-07-31T13:36:00.002+01:00

More preparation for the formal (or fine-grained) epistemology meets mainstream (or coarse-grained) epistemology conference! In my previous post, I asked what a certain Bayesian approach might say about certain questions from the recent mainstream epistemology literature on inquiry. Now I want to ask what that same approach might say about what Mona Simion calls 'epistemic duties to believe', and particularly what it says about the sorts of violations of those norms that Simion gathers together under the heading of 'resistance to evidence'.

This is a natural application of the approach I described in the previous post, since that approach grows out of Good's justification of the Principle of Total Evidence (PTE), which seems like a credal analogue of the norm Simion calls The Duty to Believe (DTB). Here's the Principle of Total Evidence:

PTE: At any given time, a subject's credence function should be acquired from her previous ones by conditionalizing on her total evidence (that is, the strongest proposition she has as evidence).

Here's Simion's norm:

DTB: A subject $S$ has an epistemic duty to form a belief that $p$ if there is sufficient and undefeated evidence for $S$ supporting $p$.

Bluebell fulfilling her positive epistemic duties by attending to all available evidence out the window

1. The Good-Myrvold approach to gathering and incorporating evidence

In the previous post, I presented Good's approach as concerning inquiry. I presented him as interested in the following question: suppose (i) you know you'll respond to any evidence you get by conditionalizing on it; (ii) you know that a certain evidence-gathering episode will teach you which proposition from a certain partition is true; (iii) that evidence-gathering episode will cost you nothing; then, when should you choose to engage in it? And his answer is: in expectation, it's always better to engage in it than not. (As Brian Weatherson pointed out to me after my post, this result is actually already there in David Blackwell's 'Comparisons of Experiments' paper; and, as with so much in formal epistemology, it turns out Frank Ramsey was there before us with his unpublished note, 'Weight, and the Value of Knowledge'.) Good was interested in the pragmatic utility of evidence-gathering, but Wayne Myrvold has shown how to adapt his argument to concern purely epistemic utility. But, while I presented the approach as concerning when to gather evidence, it can equally be applied to tell us when we should incorporate evidence we've already gathered. Indeed, it likely applies better in the latter case, since that is much more likely to be cost-free. And so, Good's theorem goes, in expectation, it is always pragmatically better to incorporate evidence than not, and so the Principle of Total Evidence follows, as Good wanted; and, replacing pragmatic value with epistemic value, it is always epistemically better to incorporate evidence than not, and so the Principle of Total Evidence follows from epistemic considerations as well.

It's maybe worth saying here that this argument for the Principle of Total Evidence relies on a particular model of what's going on when you have evidence but you haven't incorporated it. It suggests that, at the point at which you are deciding whether or not to incorporate it, you remain uncertain which evidence you ended up gathering, and so it still makes sense to make the decision by maximizing expected utility from this position of uncertainty. It's as if you gathered the evidence, popped it in a cupboard, and forgot everything except the evidential situation in which you gathered it. But there are other ways of modelling the situation. According to one of them, the proposition you learned is there in full view, but you haven't changed your credences in response to it. In Section 6, I'll ask what happens if we use a model more faithful to that way of understanding things. In the end, not much turns on this for the situations that Simion considers; but it does make a difference when we ask how general the Principle of Total Evidence is: using the original model, resistance to evidence is sometimes rational, as we'll see in Section 5; using the alternative model from Section 6, it never is.

I won't rehearse all of Good's framework here, nor Myrvold's adaptation, but David Papineau pointed out to me after the last post that Good's result assumes that we update by conditionalizing on our evidence. Now, for Good's purposes, this is fine. All he really needs to show is that there is some available way to respond to the evidence, should you gather it, that is better, in expectation, than not gathering it and sticking with the credence function you currently have. But it's a legitimate question whether the best available way to respond is to conditionalize. And indeed it turns out that we can extend Good's theorem by showing that gathering-and-conditionalizing is the best combined available option; and we can extend Myrvold's version in the same way. This is essentially what Peter M. Brown does in his pragmatic argument for conditionalization; and it follows from Hilary Greaves and David Wallace's epistemic argument for conditionalization. Brown's point is this: Good defines the pragmatic utility of a credence function relative to a decision to be the utility it will acquire for you if you use it to choose between the options available in that decision. So we can then take the pragmatic utility of gathering-evidence-and-updating-in-a-particular-way to be the pragmatic utility of the credence function you'll end up with if you do gather the evidence and update in that way. And then it's possible to show that, when evidence is available and cost-free, the thing that maximizes expected pragmatic utility is gathering the evidence and updating by conditionalizing. And similarly for Myrvold's approach: measure the epistemic utility of a credence function by a strictly proper epistemic utility function, then take the epistemic utility of gathering-evidence-and-updating-in-a-particular-way to be the epistemic utility of the credence function you'll end up with if you do gather the evidence and update in that way. And again it's possible to show that, when evidence is available, the thing that maximizes expected epistemic utility is gathering the evidence and updating by conditionalizing.
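The epistemic version of this claim can be spot-checked numerically. The Python sketch below assumes the evidence comes from a two-cell partition and measures epistemic utility by the negative Brier score (one strictly proper measure, chosen here for simplicity); the prior, the partition, the sampled rival updating rules and the function names are all hypothetical choices made for illustration.

```python
import random

def brier_accuracy(credence, world):
    """Negative Brier score of `credence` at `world`."""
    return -sum(((1.0 if w == world else 0.0) - c) ** 2 for w, c in credence.items())

def conditionalize(prior, cell):
    """Renormalize the prior inside `cell` and assign 0 outside it."""
    total = sum(prior[w] for w in cell)
    return {w: (prior[w] / total if w in cell else 0.0) for w in prior}

def expected_accuracy_of_rule(prior, partition, rule):
    """Expected accuracy of learning the true cell and then adopting rule(cell)."""
    return sum(prior[w] * brier_accuracy(rule(cell), w)
               for cell in partition for w in cell)

prior = {"a": 0.1, "b": 0.5, "c": 0.4}
partition = [frozenset({"a"}), frozenset({"b", "c"})]

cond_value = expected_accuracy_of_rule(
    prior, partition, lambda cell: conditionalize(prior, cell))
stay_value = expected_accuracy_of_rule(prior, partition, lambda cell: prior)

def random_rule():
    """An arbitrary rival rule: a randomly sampled posterior for each cell."""
    table = {}
    for cell in partition:
        raw = [random.random() for _ in prior]
        table[cell] = {w: r / sum(raw) for w, r in zip(prior, raw)}
    return lambda cell: table[cell]

random.seed(1)  # fixed seed so the check is repeatable
rival_values = [expected_accuracy_of_rule(prior, partition, random_rule())
                for _ in range(500)]
print(cond_value > stay_value, all(v < cond_value for v in rival_values))
```

Under these assumptions, gathering and conditionalizing beats both sticking with the prior and every sampled rival rule. The check depends on the evidence forming a partition; it says nothing about evidential situations that lack that structure.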

2. Simion's examples

So Myrvold's version of Good's approach gives a rather different route to the sort of positive epistemic norm concerning respect for evidence that Simion wants: where Simion appeals to an account of the proper functioning of our cognitive processes, this appeals to epistemic value and the best means to take in the pursuit of it. I'm not sure whether I want to say that Myrvold's result tells us that we have an epistemic duty to maximize expected epistemic utility, and therefore incorporate all the evidence we've gathered, but I certainly think it tells us we should incorporate that evidence, from an epistemic point of view, and we're irrational if we don't.

So let's ask now how this approach treats the examples that Simion uses to motivate her norm and her justification for it. Here's the first one:

Case #1: Testimonial Injustice. Anna is an extremely reliable testifier and an expert in the geography of Glasgow. She tells George that Glasgow Central is to the right. George believes women are not to be trusted, and therefore fails to form the corresponding belief.

I think the Bayesian's assessment of this case is a little different from Simion's, since the Bayesian deals with the agent's subjective prior credences, while Simion works with a notion of evidential probability. On perhaps the most natural reading, it isn't really a case of resistance to evidence, but rather a case of irrational prior credences. After all, let's take the evidence that George obtains in this situation to be that Anna says Glasgow Central is to the right. He might well incorporate that evidence exactly as the Bayesian says he should and yet retain a low or middling credence that Glasgow Central is to the right. After all, Simion says that George believes women aren't to be trusted, and so this is something that is encoded in the credence function he has when he meets Anna and hears her testimony. The Bayesian says he should conditionalize his prior on this evidence, but doing so will lead him to have something pretty close to his previous middling credence about the direction of Glasgow Central, since he'll think Anna's testimony is a poor indicator of the truth. So, for the Bayesian, George is certainly flawed, but it's not because he is resistant to the evidence Anna gives him in the sense that he fails to incorporate it, but because he has an irrational prior that leads him to have an irrational posterior after he does incorporate it in the required way. Of course, his irrational prior might be the result of resistance to evidence in the past. There are at least two ways George might have ended up with that prior.
On the first, his ur-prior, the credence function he has at the beginning of his epistemic life, might have assigned very low credence to the reliability of women's testimony, and that will be judged irrational since it's taking an extreme stand on a proposition about which George had no evidence at that time, and if he assigns higher credence to the reliability of men's testimony, say, we will judge it further irrational because it differentiates between two cases when he has no evidence to justify such different treatment. On the second way he might have arrived at his irrational prior, his ur-prior might have assigned middling credence to the reliability of women's testimony, just as it did to the reliability of everyone else's testimony, but then as he went through life he incorporated any evidence he received that told against women's reliability and failed to incorporate any evidence he received in its favour, leaving him with the biased credence function he has when he meets Anna and hears her testimony. In that latter case, he showed genuine resistance to evidence, and the Principle of Total Evidence and its epistemic utility justification tells us what was wrong with him.

In the second half of the paper, Simion considers two alternative approaches to evidence to see whether they can explain her motivating cases, and she finds that, when you tweak them so that they can account for George's case, they end up also judging Alvin Goldman's benighted cognizer Ben negatively. As Simion tells his sad tale:

"This fellow lives on a secluded island where he's been taught that reading astrology is an excellent way to form beliefs, and where he has no access to any clue to the contrary. Plausibly, there is no evidence available to Ben for p: 'Astrology is an unreliable way to form beliefs,' nor is he in a position to know it."

How does our account treat such a case? Like Simion's, it doesn't judge Ben negatively. You can only be criticized for failing to update on evidence that is available to you, and no evidence against his view of astrology is available to Ben, since his island is secluded and contains no such evidence.

So it looks like the Myrvold-inspired account can justify the sort of positive epistemic norm that Simion wants. And similar stories can be told about her other motivating cases. In each case, either: (i) the subject has ignored evidence that is available to them; or (ii) they've incorporated it in a way that is irrational given their priors--they've failed to conditionalize on it; or (iii) they've incorporated it by conditionalizing, but their priors were irrational, and again this can be because (iiia) their ur-priors were irrational, or (iiib) because they ignored evidence in the past, or (iiic) because they incorporated evidence incorrectly in the past.

3. When is evidence available to me?

As Simion notes, her approach needs an account of when evidence is available to a subject. She enumerates three ways in which it might be unavailable:

Qualitative Availability: You might not be the sort of subject who can access or process this sort of evidence. Simion gives the example of a three year old child who can receive certain sorts of evidence but cannot process them to draw the right conclusion. And presumably there are also cases in which we don't have the necessary sensory apparatus to access the evidence, such as if there is a sound above the range of the subject's hearing or a colour beyond their visual perception or something occluded behind an opaque barrier.

Quantitative Availability: You might be able to process various pieces of evidence individually, but there might be too many for you to attend to and process them all. Simion gives the example of things within one's visual field: each is available in some sense, but no subject could adequately process all of them.

Environmental Availability: Your social and physical environment might ensure that, while the evidence is present in some sense, it is not available to you. Simion gives the example of a letter under a doormat and a newspaper on the table, and says that, because we can't read everything, we have to pick the source with the most helpful evidence, which she takes to be the newspaper on the table--I'm definitely more intrigued by the letter under the doormat, but that doesn't seem like a substantial disagreement!

One advantage of the Good-Myrvold approach is that we can give a reasonably unified account of when evidence is sufficiently available that we should gather or incorporate it. You should gather evidence, or incorporate evidence already gathered, if (i) doing so is an available action for you in the sense of decision theory, that is, if you choose to do it, there's a high probability you will in fact do it; and (ii) gathering it or incorporating it is cost-effective--that is, the expected pragmatic or epistemic gain of doing so isn't outweighed by the cost.

I think this speaks to a worry that Simion mentions Tim Williamson raised for her account (footnote 16). The problem is that many different pieces of evidence I might gather or incorporate will be equally available to me, and yet gathering or incorporating all of them will be unavailable, or too costly, and so it can't be quite right that we have an epistemic duty to incorporate any evidence that is available to us. In the decision-theoretic approach we've been developing here, this conclusion falls out naturally: there might be lots of different evidence-gathering or evidence-incorporating episodes that have equal expected pragmatic or epistemic utility, and each might be cost-effective, but that means only that it's rationally permissible to pick any of them, not that it's rationally required to pick all of them; compare: if going to the Isle of Skye on holiday and going to the Isle of Mull on holiday both maximize expected utility, I'm permitted to do either, but I'm certainly not required to do both.

4. Gathering evidence and incorporating it: are these so different?

A point of disagreement with Simion, though I can't tell how serious it is. When considering whether an approach based on Tim Williamson's E=K account of evidence can deliver the sort of norm she seeks, Simion points to "one important distinction between epistemic shoulds [that such an approach would miss]: that between the synchronic 'should' of epistemic justification and the diachronic 'should' of responsibility in inquiry." On the account I've been sketching, there is no significant difference in kind between the two cases. Both involve considering different options you might take: in the inquiry case, it's a combination of gathering evidence and updating on whatever it teaches you in a particular way; in the incorporating case, it's just whether to update on the evidence you already have in a particular way. And in both cases we judge the options by the pragmatic or epistemic value of the credence functions they'll land you with.

5. Is it ever rational to resist evidence?

It's worth noting that, because of the sort of case I mentioned in the previous post, which John Geanakoplos and Nilanjan Das have studied in the pragmatic case, resistance to evidence is sometimes rational. I'll just give a brief reminder here.

Suppose there are just three possibilities: the weather is currently dry outside, or it's raining lightly, or it's raining heavily. As my friend returns from outside, he tells me which it is. But, growing up in Scotland as he did, my friend doesn't pay much attention to how wet it is, since he pretty much always just prepares for it raining heavily. I know that, if it's dry outside, he'll tell me, 'It's dry or raining lightly, not sure which'; and if it's raining lightly or heavily, he'll tell me, 'It's raining lightly or heavily, not sure which'. And suppose I start off with these credences before I hear his testimony: I have credence 1/10 it's dry, 1/2 it's raining lightly, and 2/5 it's raining heavily. Then, at least if we measure epistemic utility using the negative Brier score, ignoring this evidence is better, in expectation, than incorporating it and conditionalizing on it. The reason is that, if I conditionalize on the evidence I'll get if it's dry, updating to a credence of 1/6 that it's dry and 5/6 that it's raining lightly, the Brier score considers that misleading evidence, because it actually makes my credence function less accurate at that world; and while if I incorporate the evidence I'll get if it's raining lightly or if it's raining heavily, updating to a credence of 5/9 that it's raining lightly and 4/9 that it's raining heavily in both cases, the Brier score will count this as good evidence at both of those worlds because it increases my accuracy, it doesn't increase it by enough in either case to counteract the decrease in accuracy I'll suffer from conditionalizing on my friend's testimony if it's dry. So, in this case, resistance to evidence is rational. I don't choose to gather the evidence: my friend simply walks in and tells me. But, having been presented with it, I've epistemic reason not to incorporate it; I've reason to resist it.
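The arithmetic behind this example can be checked with a short Python sketch using exact fractions. The credences and the friend's testimony are taken from the example; the variable and function names are my own, and inaccuracy is measured by the Brier score summed over the three worlds, which is one common way of setting it up.

```python
from fractions import Fraction as F

WORLDS = ("dry", "light", "heavy")
prior = {"dry": F(1, 10), "light": F(1, 2), "heavy": F(2, 5)}

# What the friend reports at each world. The two reports overlap on "light",
# so the possible pieces of evidence do not form a partition.
testimony = {
    "dry": {"dry", "light"},
    "light": {"light", "heavy"},
    "heavy": {"light", "heavy"},
}

def brier_inaccuracy(credence, world):
    """Sum of squared differences between the credences and the truth values at `world`."""
    return sum(((1 if w == world else 0) - credence[w]) ** 2 for w in WORLDS)

def conditionalize(credence, proposition):
    """Renormalize inside the proposition and assign 0 outside it."""
    total = sum(credence[w] for w in proposition)
    return {w: (credence[w] / total if w in proposition else F(0)) for w in WORLDS}

# Expected inaccuracy, by the lights of the prior, of each policy.
expected_if_ignore = sum(prior[w] * brier_inaccuracy(prior, w) for w in WORLDS)
expected_if_update = sum(
    prior[w] * brier_inaccuracy(conditionalize(prior, testimony[w]), w)
    for w in WORLDS
)

print(expected_if_ignore)  # 29/50
print(expected_if_update)  # 7/12
```

Since 7/12 (about 0.583) is greater than 29/50 (0.58), updating has the higher expected inaccuracy here, which is the sense in which ignoring the testimony does better in expectation.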

6. Modelling resistance to evidence

As I mentioned above, the approach I've been sketching involves a particular modelling choice. On it, we assume that the person who is resistant to evidence has the evidence, but is still uncertain what it is and is choosing as if they haven't actually gathered it yet. I think this is a perfectly legitimate way to model the situation, but it does have the peculiar feature that they are treating themselves as if they still haven't gathered the evidence; or at least as if they have, and then they've put it into safe storage and forgotten what it is. But what if we model things a different way? What if we assume they can see the evidence they gathered and are choosing whether to incorporate it, and how? Then we might turn to a proposal that Hannes Leitgeb and I made in our 2010 papers on accuracy arguments for epistemic norms: when you have a credence function and a proposition that gives your total evidence, and you're figuring out what to do with the proposition, we argued that you should pick a posterior credence function by maximizing expected epistemic utility, but not in the standard way where you sum the credence-weighted epistemic utilities across all of the ways the world might be, but rather across all ways the world might be that are compatible with your evidence; that is, all the ways the world might be on which your evidence is true. And then, if you do that, we showed, your posterior should always be obtained from your prior by conditionalizing on your total evidence. And so, if we go this way, we have an argument for an unrestricted version of the positive epistemic norm given by the Principle of Total Evidence. In this case, we needn't worry about the sorts of case that Das considers.
While it is still true that, if you have to choose whether or not to gather the evidence about whether it's dry, raining lightly, or raining heavily, you should choose not to, once you've in fact gathered that evidence, or had it imposed upon you, you should update on it by conditionalizing, as the Principle of Total Evidence tells you to.
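To illustrate the restricted-expectation idea, here is a rough numerical sketch rather than the argument from the 2010 papers. It assumes the Brier score over the three atomic weather propositions, takes the evidence to be 'raining lightly or heavily', and searches a coarse grid of candidate posteriors (the grid step of 1/9 is my choice, picked so that the conditionalized posterior lies on the grid). All names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

WORLDS = ("dry", "light", "heavy")
prior = {"dry": F(1, 10), "light": F(1, 2), "heavy": F(2, 5)}
evidence = {"light", "heavy"}  # the proposition I have learned

def brier(credence, actual):
    """Brier score at the actual world over the three atomic propositions."""
    return sum((credence[w] - (1 if w == actual else 0)) ** 2 for w in credence)

def restricted_expected_brier(candidate):
    """Prior-weighted Brier score of a candidate posterior, summed only over
    the worlds at which the evidence is true."""
    return sum(prior[w] * brier(candidate, w) for w in evidence)

# Candidate posteriors: every probability assignment with denominators of 9.
STEP = 9
grid = [
    dict(zip(WORLDS, (F(a, STEP), F(b, STEP), F(STEP - a - b, STEP))))
    for a, b in product(range(STEP + 1), repeat=2)
    if a + b <= STEP
]

# The candidate that minimizes restricted expected inaccuracy.
best = min(grid, key=restricted_expected_brier)

# The result of conditionalizing the prior on the evidence.
total = sum(prior[w] for w in evidence)
conditionalized = {w: (prior[w] / total if w in evidence else F(0)) for w in WORLDS}

print(best == conditionalized)  # True
```

On this grid, the minimizer is the conditionalized posterior, which assigns 0 to dry, 5/9 to light rain and 4/9 to heavy rain. A grid search is of course no proof; it only shows the general result at work in one case.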

Taking a Good look at the recent literature on inquiry

2023-07-27T08:15:00.004+01:00

In August, I'll speak at a conference that aims to bring together what we might call 'mainstream epistemology' and 'formal epistemology'. I always find it hard to draw the line between these, since what people usually call mainstream epistemology often uses formal tools, such as those developed in epistemic logic or in modal semantics, and formal epistemology quite frequently uses none. But they do tend to differ in the way they represent their objects of study, namely, beliefs. Mainstream epistemology tends to favour a coarse-grained representation: you can believe a proposition, suspend on it, and perhaps disbelieve it, but those are usually the only options considered. Some recent work also talks of individuals thinking a proposition is true, being sure it is, holding it to be true, or accepting it, but even if we add all of those, that's still fairly few sorts of attitude. Formal epistemology, on the other hand, born as it was out of the foundations of statistics, the epistemology of science, and theory of rational choice under uncertainty, favours a more fine-grained approach: in the standard version, you can believe a proposition to a particular degree, which is represented by a real number at least 0, which is minimal belief, and at most 1, which is maximal, yielding a continuum of possible attitudes to each proposition; these degrees of belief we call your credences--they're the states you report when you say 'I'm 80% sure there's milk in the fridge' or 'I'm 50-50 whether it's going to rain in the next hour'. The theory of hyperreal credences and the various theories of imprecise credences, from sets of probability functions to sets of desirable gambles, provide even more fine-grained representations, but I'm no expert in those, so I'll set them aside here.

Coming as I do from fine-grained epistemology, for my contribution to the conference, I'd like to give the view from there of a topic that has been discussed a great deal recently in coarse-grained epistemology, hoping of course that there will be something I can offer that is useful to them. The topic is inquiry.

An inquiring mind

1. Recent work on inquiry

In the recent literature on inquiry, mainstream epistemologists point out that epistemology has often begun at the point at which you already have your evidence, and it has then focussed on identifying the beliefs for which that evidence provides justification or which count as knowledge for someone with that evidence. Yet we are not mere passive recipients of the evidence we have. We often actively collect it. We often choose to put ourselves in positions from which we'll gather some pieces of evidence but not others: we'll move to a position from which we'll see or hear or smell how the world is in one respect but miss how it is in another; we'll prod the world in one way to see how it responds but we won't prod it in another; and so on. As many in the recent literature point out, this has long been recognised, but typically mainstream epistemologists have taken the norms that govern inquiry to be pragmatic or practical ones, not epistemic ones--we inquire in order to find out things that inform our practical decisions, and so the decision what to find out is governed by practical considerations, and epistemologists leave well alone. One upshot of this is that, where norms of inquiry might appear to clash with norms of belief, mainstream epistemologists have not found this troubling, since they're well aware that the pragmatic and the epistemic can pull in different directions. But, contends much of the recent literature, norms of inquiry, norms of evidence-gathering, and so on, are often epistemic norms not practical ones. And that renders clashes between those norms and norms of belief more troubling, since both are now epistemic.

What I'd like to do in my paper at this conference, and in this blogpost, is give a primer on a framework for thinking about norms of inquiry that is reasonably well-established in fine-grained epistemology, and then turn to some of the questions from the recent debate about inquiry and ask what this framework has to say about them. Questions will include: When should we initiate an inquiry, when should we continue it, when should we conclude it, and when should we reopen it or double-check our findings? Are there purely epistemic norms that govern these actions, as Carolina Flores and Elise Woodard contend? How do epistemic norms of inquiry relate to epistemic norms of belief or credence, and can they conflict, as Jane Friedman contends? How should we resolve the apparent puzzle raised by Jane Friedman's example of counting the windows in the Chrysler Building? How should we understand Julia Staffel's distinction between transitional attitudes and terminal attitudes (here, here, here)?

I'm not sure much, if anything, I'll say will be new. I'm really drawing out an insight that dates back to I. J. Good's famous Value of Information Theorem, and has been pursued in Wayne Myrvold's version of that theorem within epistemic utility theory, some further developments of Myrvold's ideas by Alejandro Pérez Carballo, and some recent work on generalizing Good's result that begins with the economist John Geanakoplos and has been developed by Nilanjan Das, Miriam Schoenfield, and Kevin Dorst, among others.

2. Representing an individual as having credences

Throughout, we'll represent an individual's doxastic state by their credence function. Collect into a set all the propositions about which the individual has an opinion and call it their agenda. Their credence function is then a function that takes each proposition in their agenda and returns a real number, at least 0 and at most 1, that measures how strongly they believe that proposition. So when I say 'Chris is 65% sure it's going to rain', I'm ascribing to him a credence of 0.65 in the proposition that it is going to rain.

We'll assume throughout that our individual's credence function at any time is probabilistic. That is, it assigns 1 to all necessary truths, 0 to all necessary falsehoods, and the credence it assigns to a disjunction of two mutually exclusive propositions is the sum of the credences it assigns to the disjuncts. (I present this more formally in Appendix A.1.)

We'll also assume that, when they do receive evidence, our individual will respond by conditionalizing their credences on it. That is, their new unconditional credence in a proposition will be their old conditional credence in that proposition given the proposition learned as evidence; that is, their new unconditional credence in a proposition will be the proportion of their old credence in the proposition they've now learned that they then also assigned to the proposition in question.

3. Good on the pragmatic value of gathering evidence

While most discussion of Good's 1967 paper, 'On the Principle of Total Evidence', focuses on what has become known as the Value of Information Theorem, the real contribution lies in his account of the pragmatic value of gathering evidence.

This account begins with an account of the pragmatic value of a credence function. Suppose you will face a particular decision between a range of options (an option is defined by how much utility it has at each possible state of the world, and the utility of an option at a world is a real number that measures how much you value that option at that world). Then the standard theory of choice under uncertainty says that you should pick an option with maximal expected utility from the point of view of the credence function you have when you face the decision. So let's assume you'll do this. Then we define the pragmatic value for you, at a particular state of the world, of having a particular credence function when faced with a particular decision: it is the utility, at that state of the world, of the option that this credence function will lead you to pick from those available in the decision. This will be one of the options that maximizes expected utility from the point of view of that credence function; but there might be more than one that maximizes that, so we assume you have a way of breaking ties between them.

So, for instance, suppose I have to walk to the shops and I must decide whether or not to take an umbrella with me. And suppose I have credences concerning whether or not it will rain as I walk there. Let's suppose first that taking the umbrella uniquely maximizes expected value from the point of view of those credences. Then the pragmatic value of those credences at a world at which it does rain is the utility of walking to the shops in the rain with an umbrella, while their pragmatic value at a world at which it doesn't rain is the utility of walking to the shops with no rain carrying an umbrella. And similarly if leaving without the umbrella uniquely maximizes expected utility from the point of view of those credences, then their pragmatic value at a rainy world is the utility of walking to the shops in the rain without an umbrella, and their pragmatic value at a dry world is the utility of walking to the shops with no rain and no umbrella. And if they both maximize expected utility from the point of view of the credences, then the pragmatic value of the credences will depend on how I break ties.

Now, with the pragmatic value of a credence function defined relative to a particular decision you'll face, Good can define the pragmatic value of a particular episode of evidence-gathering relative to such a decision. We represent such an episode as follows: for each state of the world, we specify the strongest proposition you'll learn as evidence at that state of the world. Then the pragmatic value, at a particular world, of an episode of evidence-gathering is the pragmatic value, at that world, of the credence function you'll have after learning whatever evidence you'll gather at that world and updating on it by conditionalizing. So, holding fixed the decision problem you'll face, the pragmatic value of a credence function is the utility of the option it'll lead you to pick, and the pragmatic value of gathering evidence is the pragmatic value of the credence function it will lead you to have.

So, for instance, suppose I have to walk to the shops later and, at that point, I'll have to decide whether or not to take an umbrella with me. And suppose that, between now and then, I can gather evidence by looking at the weather forecast. If I do, I'll learn one of two things: rain is forecast, or rain is not forecast. And updating on that evidence, if I choose to gather it, will change my credences concerning whether or not it will rain on my way to the shops. Then what is the value, at a particular state of the world, of gathering evidence by looking at the forecast? Consider a world at which (i) rain is not forecast but (ii) it does rain; and suppose that, upon learning that rain is not forecast, I'll drop my credence in rain low enough that I'll not take my umbrella. Then the value of gathering evidence at that world is the utility of walking to the shops in the rain without an umbrella. In contrast, consider a world at which (i) rain is forecast but (ii) it doesn't rain; and suppose that, upon learning that rain is forecast, I raise my credence in rain high enough that I take the umbrella. Then the value of gathering evidence at that world is the utility of walking to the shops with no rain but carrying an umbrella. And so on.

This is Good's account of the pragmatic value, at a particular world, of a particular episode of evidence-gathering. With this in hand, we can now define the expected pragmatic value of such an episode, and we can also define the expected pragmatic value of not gathering evidence at all, since that is just the degenerate case of evidence-gathering in which you simply learn a tautology at every state of the world. Good's Value of Information Theorem then runs as follows: Fix a decision problem you'll face at a later time; and fix the way you break ties between a set of options when they all maximize expected utility; now suppose that, for no cost, you may gather evidence that will teach you which element of a particular partition is true; then the expected pragmatic value, from the point of view of your current credences, of gathering that evidence is at least as great as the expected pragmatic value, from the point of view of your current credences, of not gathering it; and, if you assign some positive credence to a state of the world in which the evidence you'll learn will change how you make the decision you'll face, then the expected pragmatic value of gathering the evidence is strictly greater than the expected pragmatic value of not gathering it. (I run through this using formal notation in Appendix A.2.)
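To make this concrete, here is a minimal sketch of the umbrella case. The credences and utilities are numbers I've invented purely for illustration, and the names (`choose`, `pragmatic_value`, `check_forecast` and so on) are hypothetical. Ties are broken by the order in which the options are listed.

```python
from fractions import Fraction as F

# Worlds are (forecast, weather) pairs; the credences are invented for illustration.
prior = {
    ("rain forecast", "rain"): F(3, 10),
    ("rain forecast", "dry"): F(1, 10),
    ("no rain forecast", "rain"): F(1, 10),
    ("no rain forecast", "dry"): F(1, 2),
}

# Invented utilities of each option, depending only on the weather.
utility = {
    "take umbrella": {"rain": 8, "dry": 6},
    "leave umbrella": {"rain": 0, "dry": 10},
}
OPTIONS = ("take umbrella", "leave umbrella")  # earlier options win ties

def conditionalize(credence, proposition):
    """Condition a credence function on a proposition (a set of worlds)."""
    total = sum(credence[w] for w in proposition)
    return {w: (credence[w] / total if w in proposition else F(0)) for w in credence}

def choose(credence):
    """Pick an option that maximizes expected utility by the lights of credence."""
    def expected_utility(option):
        return sum(credence[w] * utility[option][w[1]] for w in credence)
    return max(OPTIONS, key=expected_utility)

def pragmatic_value(credence, world):
    """Utility, at this world, of the option the credence function leads me to pick."""
    return utility[choose(credence)][world[1]]

def check_forecast(world):
    """Evidence gathered at a world: the partition cell fixed by the forecast."""
    return {w for w in prior if w[0] == world[0]}

def learn_nothing(world):
    """The degenerate episode: learn a tautology at every world."""
    return set(prior)

def expected_pragmatic_value(episode):
    """Expected value, by the lights of the prior, of an evidence-gathering episode."""
    return sum(
        prior[w] * pragmatic_value(conditionalize(prior, episode(w)), w)
        for w in prior
    )

value_of_checking = expected_pragmatic_value(check_forecast)
value_of_not_checking = expected_pragmatic_value(learn_nothing)
print(value_of_checking, value_of_not_checking)  # 8 34/5
```

With these made-up numbers, I take the umbrella if I don't check, for an expected utility of 34/5 = 6.8. If I check, I take it only when rain is forecast, for an expected utility of 8. Checking changes what I'd do at the worlds where rain is not forecast, so the strict inequality in the theorem applies.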

4. Myrvold on the epistemic value of gathering evidence

Good's theorem tells us something about when you have practical reason to engage in a certain sort of evidence-gathering. When it will teach you which element of a partition is true, when it costs nothing, and when you consider it possible it will change how you'll choose, then you should do it. But, as Wayne Myrvold shows, building on work by Graham Oddie and Hilary Greaves & David Wallace, there is also a version that tells us something about when you have epistemic reason to gather evidence. Alejandro Pérez Carballo has extended Myrvold's approach in various ways.

Recall: Good's insight is that the pragmatic value of a credence function is the utility of the option it leads you to choose, and the pragmatic value of an episode of evidence-gathering is the pragmatic value of the credence function it will lead you to have. But credence functions don't just have pragmatic value; we don't use them only to guide our decisions. We also use them to represent the world, and their purely epistemic value derives from how well they do that, regardless of whether we need them to help us choose.

Many ways of measuring this purely epistemic value have been proposed, but by far the most popular characterization of the legitimate epistemic utility functions says that they are all strictly proper, where this means that, if we measure epistemic utility in this way, any probabilistic credence function expects itself to have strictly greater epistemic utility than it expects any alternative credence function to have; that is, it thinks of itself as uniquely best from the epistemic point of view. (Jim Joyce defends this view here, and Robbie Williams and I have recently defended it here.)

Perhaps the most well-known strictly proper epistemic utility function is the so-called negative Brier score: given a proposition, we say that the omniscient credence in it is 1 if it's true and 0 if it's false; the Brier score of a credence function at a world is then obtained by taking each proposition to which it assigns a credence, taking the difference between the credence it assigns to that proposition and the omniscient credence in that proposition at that world, squaring that difference, and then summing these squared differences; the negative Brier score is then, as the name suggests, merely the negative of the Brier score. In the negative Brier score, each proposition is given equal weight in the sum, but we can also give greater weight to some propositions than others in order to record that we consider them more important. This gives a weighted negative Brier score. This is important in the current context, since it allows us to explain why it is better, epistemically speaking, to engage in some evidence-gathering episodes rather than others, even when the latter will improve certain credences more than the former will improve others; the explanation is that the credences the latter will improve are less important to us. So, one evidence-gathering episode might, in expectation, greatly improve the accuracy of my credences concerning how many blades of grass there are on my neighbour's lawn, while another might, in expectation, only slightly improve the accuracy of my credences about the fundamental nature of reality, and yet I might favour the latter because the propositions it concerns are more important to me.
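For readers who like to see the definition spelled out, here is a small sketch of the weighted negative Brier score. The two propositions, the truth values and the weights are invented for illustration, and the function name is hypothetical.

```python
from fractions import Fraction as F

def weighted_negative_brier(credence, truth, weight=None):
    """Negative of the weighted sum of squared differences between each
    credence and the omniscient credence (1 if true, 0 if false).
    With no weights supplied, every proposition gets weight 1."""
    score = F(0)
    for proposition, degree in credence.items():
        omniscient = 1 if truth[proposition] else 0
        importance = 1 if weight is None else weight[proposition]
        score += importance * (degree - omniscient) ** 2
    return -score

# I'm 80% sure there's milk in the fridge and 50-50 on rain in the next hour.
credence = {"milk": F(4, 5), "rain": F(1, 2)}
# Assumed state of the world: there is milk, and it doesn't rain.
truth = {"milk": True, "rain": False}

unweighted = weighted_negative_brier(credence, truth)
# Suppose I care three times as much about the milk proposition.
weighted = weighted_negative_brier(credence, truth, {"milk": 3, "rain": 1})

print(unweighted, weighted)  # -29/100 -37/100
```

The weights leave each squared difference untouched and only change how much each one counts towards the total.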

So now we have a way of assigning epistemic value to a credence function at a world. And so we can simply appeal to Good's insight to say that the epistemic value, at a world, of gathering evidence is the epistemic value of the credence function you'll end up with when you update on the evidence you'll get at that world. And now we can state Myrvold's epistemic version of Good's theorem: suppose you may gather evidence that will teach you which element of a particular partition is true, and suppose your epistemic utility function is strictly proper; then the expected epistemic value of gathering the evidence, from the point of view of your current credences, is always at least as great as the expected epistemic value of not gathering the evidence, from the same point of view; and, if you give some positive credence to a state of the world at which what you will learn will lead you to change your credences, then the expected epistemic value of gathering the evidence is strictly greater than the expected epistemic value of not doing so. (I present a more formal version in Appendix A.3.)
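Here is a sketch of an instance of the epistemic version, using the forecast partition from the umbrella example. It assumes the unweighted negative Brier score taken over the four (forecast, weather) worlds, and the starting credences are invented. It also checks the weak inequality on a small grid of priors with positive credence in every world; that grid check is an illustration, not a proof.

```python
from fractions import Fraction as F
from itertools import product

WORLDS = [
    ("rain forecast", "rain"),
    ("rain forecast", "dry"),
    ("no rain forecast", "rain"),
    ("no rain forecast", "dry"),
]

def conditionalize(credence, proposition):
    """Condition a credence function on a proposition (a set of worlds)."""
    total = sum(credence[w] for w in proposition)
    return {w: (credence[w] / total if w in proposition else F(0)) for w in credence}

def negative_brier(credence, actual):
    """Negative Brier score at the actual world, scoring one proposition per world."""
    return -sum((credence[w] - (1 if w == actual else 0)) ** 2 for w in credence)

def cell(world):
    """The element of the forecast partition that is true at this world."""
    return {w for w in WORLDS if w[0] == world[0]}

def expected_epistemic_value(prior, gather):
    """Expected negative Brier score, by the lights of the prior, of the credence
    function I end up with if I do, or don't, check the forecast."""
    total = F(0)
    for w in WORLDS:
        posterior = conditionalize(prior, cell(w)) if gather else prior
        total += prior[w] * negative_brier(posterior, w)
    return total

# Invented starting credences.
prior = dict(zip(WORLDS, (F(3, 10), F(1, 10), F(1, 10), F(1, 2))))
value_if_not_gathering = expected_epistemic_value(prior, gather=False)
value_if_gathering = expected_epistemic_value(prior, gather=True)
print(value_if_not_gathering, value_if_gathering)  # -16/25 -19/60

# Check the weak inequality for every prior with entries in tenths, all positive.
holds_on_grid = all(
    expected_epistemic_value(p, True) >= expected_epistemic_value(p, False)
    for p in (
        dict(zip(WORLDS, (F(a, 10), F(b, 10), F(c, 10), F(d, 10))))
        for a, b, c, d in product(range(1, 10), repeat=4)
        if a + b + c + d == 10
    )
)
print(holds_on_grid)  # True
```

For the invented prior, the expected negative Brier score rises from -16/25 to -19/60 if I check the forecast.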

4. Expanding the approach

What Good and Myrvold offer is almost what we need to investigate both pragmatic and epistemic reasons for gathering evidence and the norms to which they give rise. But we should make them a little more general before we move on.

First, gathering evidence is rarely cost-free, and so in any evaluation of whether to do so, we must include not only Good's account of its pragmatic value but also its cost; but of course that's easy to do. So the true pragmatic value of an evidence-gathering episode at a world is not just the utility at that world of the option you'll choose using the credence function it will lead you to have; it's that utility minus the cost of gathering the evidence.

Second, Good assumes that you know for sure which decision you'll face using your credences, but of course you might be uncertain of this. But again, it's easy to incorporate this: we simply ensure that our possible worlds specify not only the truth values of the propositions to which we assign credences, but also which decision we'll face with our credences; we then ensure that we assign credences to these; and, having done all this, we can define the pragmatic value of a credence function at a world to be the utility at that world of the option it would lead us to choose from the decision we face at that world; and then the pragmatic value of an evidence-gathering episode is again the pragmatic value of the credence function you'll end up with after gathering the evidence and updating on it. And Good's theorem still goes through with this amendment.

Third, Good's theorem and Myrvold's epistemic version only tell you how a particular evidence-gathering episode compares with gathering no evidence at all. But our options are rarely so limited. Often, we can choose between various different evidence-gathering episodes. Perhaps there are different partitions from which we can learn the true element, such as when I choose which weather app to open to learn their forecast, or when I choose to measure the weight of a chemical sample rather than its reactive properties. Good's insight and Myrvold's epistemic version allow us to compare these as well, as Pérez Carballo notes: we can simply compare the expected pragmatic or epistemic value of the different available evidence-gathering episodes and pick one that is maximal.

Fourth, Good's theorem and Myrvold's epistemic version only cover cases in which the evidence-gathering episode will teach you which element of a partition is true. This is very idealized, but it is true to a certain way in which we gather evidence in science. When I measure the weight of a chemical sample, or when I ask how many organisms in a given population are infected after exposure to a particular pathogen, there is a fixed partition from which my evidence will come: I'll learn the sample is this weight or that weight or another one; I'll learn the number of infected organisms was zero or one or two or...up to the size of the population. But of course there are many cases in which our evidence-gathering will not be partitional or even factive in this way; at one world I might learn it will rain tomorrow, at another, I might learn it will rain or snow; and, on some non-factive conceptions of evidence, I might learn it will snow at a world at which it won't. What happens to Good's theorem and Myrvold's epistemic version in these cases? The answer is that it depends. Building on work by the economist John Geanakoplos, Nilanjan Das has brought some order to the pragmatic case, but it would be interesting to see what happens in the epistemic case. (I discuss this a little in Appendix A.3.)

In any case, here's just a small example to give a flavour. It will either not rain tomorrow, rain lightly, or rain heavily. I have the opportunity to check the weather forecast. But it never commits fully, and it is typically a bit pessimistic. So, in the world where it won't rain, it'll report: no rain or light rain. In the world where it will rain lightly, it'll report: light rain or heavy rain. And in the world where it will rain heavily, it will also report: light rain or heavy rain. Should I check the weather forecast? In the practical case, it depends on the decision I'll face, but the important point is that, whatever regular credences I have in the three possibilities, there is a decision I might face such that I expect myself to face it better with my current credences than with the credences I'll acquire after gathering evidence from the forecast. The reason is that, given the way the evidence overlaps, I know that my credence in light rain will rise regardless of which evidence I obtain and update on--Das calls such evidence-gathering episodes biased inquiries. So consider a bet that pays out a pound if there is light rain and nothing if there isn't. Then there's a price I'll pay for that bet after gathering the evidence that I won't pay for it now, since I'll be more confident in it then for sure. So, from my current point of view, I shouldn't update on it. What about the epistemic case? Interestingly, in this case, at least if we measure epistemic utility using the negative Brier score, there are prior credences I might have from whose point of view gathering the evidence is the best thing to do, and prior credences from whose point of view not gathering the evidence is the best thing to do. So it's a mixed bag. And if we use another well-known epistemic utility function, namely, the negative log score, every prior credence function expects gathering the evidence to be best; and indeed this is true for any factive evidence-gathering episode.
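The claims in this example can be spot-checked numerically. The sketch below assumes that both scores are computed over the three atomic weather propositions only, and the two priors are ones I've picked for illustration: a uniform prior, and one skewed towards rain. Names such as `REPORT` and `expected` are hypothetical.

```python
import math

WORLDS = ("none", "light", "heavy")

# The forecast's report at each world, as the set of worlds it leaves open.
REPORT = {
    "none": {"none", "light"},
    "light": {"light", "heavy"},
    "heavy": {"light", "heavy"},
}

def conditionalize(credence, proposition):
    """Condition a credence function on a proposition (a set of worlds)."""
    total = sum(credence[w] for w in proposition)
    return {w: (credence[w] / total if w in proposition else 0.0) for w in credence}

def negative_brier(credence, actual):
    """Negative Brier score over the three atomic propositions."""
    return -sum((credence[w] - (1.0 if w == actual else 0.0)) ** 2 for w in credence)

def negative_log(credence, actual):
    """Negative log score: the log of the credence in the true world."""
    return math.log(credence[actual])

def expected(prior, score, gather):
    """Expected epistemic utility, by the lights of the prior, of checking
    the forecast and conditionalizing (gather=True) or of not checking."""
    return sum(
        prior[w] * score(conditionalize(prior, REPORT[w]) if gather else prior, w)
        for w in WORLDS
    )

uniform = {"none": 1 / 3, "light": 1 / 3, "heavy": 1 / 3}
skewed = {"none": 0.1, "light": 0.5, "heavy": 0.4}

# Biased inquiry: credence in light rain rises whichever report comes in.
light_always_rises = all(
    conditionalize(p, REPORT[w])["light"] > p["light"]
    for p in (uniform, skewed)
    for w in WORLDS
)

# Does each prior expect gathering to beat not gathering, under each score?
prefers_gathering = {
    (name, score.__name__): expected(p, score, True) > expected(p, score, False)
    for name, p in (("uniform", uniform), ("skewed", skewed))
    for score in (negative_brier, negative_log)
}
print(light_always_rises)
print(prefers_gathering)
```

On these assumptions, the uniform prior expects gathering to do better under the negative Brier score, the skewed prior expects it to do worse, and both priors expect gathering to do better under the negative log score.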

One response to this sort of case is to change how we update in response to evidence. For instance, Miriam Schoenfield proposes that we should update by conditionalizing not on our evidence itself but on the fact that we learned it. And, if we do this, Good's theorem and Myrvold's epistemic version are restored. The debate then centres on whether Schoenfield's way of updating is really available to us (see Gallow).

5. Questions from the recent literature

So the framework inspired by Good's insight, and given an epistemic twist by Myrvold, is very general. It provides very general norms for when to gather evidence. So it's natural to ask what these norms say about questions from the recent literature on inquiry.

5.1 Flores and Woodard on epistemic norms of evidence-gathering

Carolina Flores and Elise Woodard offer two compelling arguments that there are purely epistemic norms on inquiry: the first is defensive, showing that standard state-centred or evidentialist sources of resistance to such norms are misguided; the second is offensive, showing that we criticize one another in epistemic ways for our evidence-gathering practices, and arguing that this gives reason to think there are epistemic norms that govern those practices. Myrvold's adaptation of Good's insight adds a third argument. Here is a norm:

The Epistemic Norm of Inquiry Gather evidence so as to maximize the expected epistemic utility of your future credence function.

It governs evidence-gathering and it is epistemic. It captures the idea that we care about our beliefs not just for their instrumental pragmatic value as guides to action, but also for their epistemic value as representations of the world we inhabit. And, as a result, there are purely epistemic norms that govern us when we can take steps to change them. Of course, as Flores and Woodard note, these are pro tanto norms--they can be overridden by other considerations, epistemic or moral or both. But they are norms all the same. And, of course, any pragmatic norm about evidence-gathering, such as that you should gather evidence so as to maximize the expected pragmatic utility of your future credence function given the decision you face with it, is also pro tanto--it can be overridden by moral or epistemic considerations. There is no primacy of the pragmatic.

What's more, the Epistemic Norm of Inquiry helps to make sense of Flores and Woodard's examples. Their first, Cloistered Claire, gets all of her evidence from a single source; their second, Gullible Gabe, gets his information from a source he believes to be reliable, but that his existing evidence suggests is not reliable; and their third, Lazy Larry, gets his evidence from a good source, but only attends to part of the evidence that source provides due to laziness. In each case, Flores and Woodard submit, we want to criticize their evidence-gathering practices on purely epistemic grounds. Let's discuss them in turn.

The case of Claire doesn't seem obvious to me. It seems that we can sometimes be in situations in which it's best to stick with a single source; that is, contra Flores and Woodard, it's not always best to diversify one's sources of evidence. For instance, if all but one of the news sources available to you are owned by people with a vested interest in a particular policy, you might reasonably stick with the only independent outlet. The explanation might be that gathering evidence from the others is a biased inquiry in Nilanjan Das's sense from above: that is, you can know in advance that it will raise your credence in some proposition, and so it is something your current credences tell you not to do from an epistemic point of view. So I think the devil will be in the details here. Flores and Woodard are certainly correct that it is sometimes best to diversify your sources of evidence, and in these cases you'll violate the Epistemic Norm of Inquiry by failing to do so; but sometimes it's best to stick with just one, and in those cases again the Epistemic Norm of Inquiry will entail that. How we fill in the details around Claire's case will determine which of these sorts of situations she's in.

The cases of Gabe and Larry are clearer. In Gabe's case, it might not be the evidence-gathering that is ultimately at fault, but his high credence that the source is reliable. From the point of view of a high credence that the source is reliable, the expected pragmatic or epistemic value of gathering evidence from it is likely to be high, and gathering it is therefore rationally required, given there are few costs. But that original high credence itself might be irrational because it isn't a good response to Gabe's evidence concerning the source's reliability.

In Larry's case, it's important that he ignores the further evidence his source can provide through laziness and not simply a lack of time. If he had only a little time with his source and couldn't attend to all the evidence it provides, then there might be nothing wrong with his evidence-gathering--he did the best he could under the constraints placed on him! But, as Flores and Woodard describe the case, he could have gathered more evidence, but he failed to do so. In that case, it's likely that the cost of gathering that extra evidence and updating on it is considerably less than the pragmatic or epistemic value he expects to get from it. And this explains why he is criticizable: he violates the Epistemic Norm of Inquiry.

5.2 Willard-Kyle on the conclusion of inquiry

So far, we have just been talking about evidence-gathering episodes, and not inquiry. But an inquiry is simply a sequence of such episodes, and we usually embark upon one with the aim of answering some question. When is an inquiry complete? When should you cease inquiring further? Some say when you have knowledge of the answer to the question at which the inquiry was aimed; some say when you have true belief in it; and so on. Christopher Willard-Kyle argues that none of these answers can be right because, even after you've achieved any of these, it's always possible to improve your epistemic situation. For instance, you might obtain better knowledge of the correct answer: you might obtain a safer belief, even though your current belief is sufficiently safe to count as knowledge; or you might obtain the belief you currently have, but using an even more reliable process, even though your current belief was formed by a sufficiently reliable process. Good's insight and Myrvold's epistemic version shed light on this.

In fact, your pragmatic reasons for further inquiry can just run out, and from that point of view it can be irrational to pursue that inquiry any further. This happens if you care only about the pragmatic value of your credence function as a guide to action in the face of the decision you know you'll face with it. At some point, you come to know that all further evidence-gathering episodes that are actually available to you either won't change your mind about what to choose when faced with the decision, or that any that will change your mind are too costly. At this point, further inquiry is irrational from this myopic pragmatic point of view. While you might continue to improve your credence function from an epistemic point of view, you achieve no further gains from a pragmatic point of view.

In the epistemic case, however, things are different. Unless you somehow acquire certainty about the correct answer to the question at which your inquiry aims, there will always be some evidence-gathering episode that you'll expect to improve your credence function from a purely epistemic point of view, though of course that episode may not be available to you. Indeed, you will rarely acquire such certainty. After all, for most inquiries, the evidence-gathering episodes don't give definitive answers to the target question; they give definitive answers to related questions that bear on the target question, such as when I gather evidence about what the weather forecast says as part of my inquiry into whether or not it will rain tomorrow. So this vindicates Willard-Kyle's claim. There will nearly always be room for improvement from an epistemic point of view.

5.3 Staffel on transitional and terminal attitudes

This last point casts doubt on Julia Staffel's distinction between transitional and terminal attitudes in inquiry. On her account, during the course of an inquiry, we form transitional versions of the attitudes we seek--outright beliefs, perhaps, or precise credences. Only when the inquiry is complete do we form terminal versions of those attitudes. So, for instance, a detective who is methodically working her way through the body of evidence her team has amassed forms transitional credences concerning the identity of the culprit, and only after she has surveyed all this evidence does she form terminal credences on that matter. Staffel says that what distinguishes these attitudes is what we're prepared to do with them: for one thing, we are prepared to act on terminal attitudes but not on transitional ones. The views I've been describing here can shed light on Staffel's view.

Here's an argument that there can be no transitional attitudes that answer to Staffel's description. If I face a decision in the midst of my inquiry that I only expected to face at the end, it seems that I have no choice but to choose using the credences I have at that point, which have been obtained from my credences at the beginning of the inquiry by updating on the evidence I've received during its course to date. What else is available to me? Of course, there are my credences at the beginning of my inquiry. Should I use those instead? The problem with that suggestion is that those credences themselves don't think I should use them, at least if the evidence-gathering episodes I've embarked on so far are ones that the prior credences expect to have greater pragmatic value than not embarking on them, such as if those evidence-gathering episodes have the features that make Good's Value of Information Theorem applicable. Sure, my prior credences would have liked it even more if I'd got to complete my inquiry, but the world has prevented that and I must act now. So, if I must act either on my priors or on my current credences, which I hold mid-inquiry, I should act on my current ones, which suggests they're not transitional.

But the argument isn't quite right. If an inquiry is made up of a series of evidence-gathering episodes, each of which satisfies the conditions that make Good's Value of Information Theorem hold, then the argument works: there can be no transitional attitudes during such an inquiry. But not all inquiries are like that. Sometimes the whole sequence of evidence-gathering episodes is such that we expect our credence function to be better after they're all completed, but there are points in the course of the investigation at which we expect our credence function will be worse. This might happen, for instance, if we string together a bunch of biased inquiries, where those in the first stretch are biased in one direction and those in the second are biased in the other, but taken together, they aren't biased in either direction. For instance, suppose our detective divides up the evidence her team has collected into that which suggests the first suspect is guilty and that which suggests the second suspect is guilty. She plans to work through the first set first and the second set second. Then, while her prior credences expect the credences she'll have once she's worked through both sets to be better than they are, they also expect the credences she'll have once she's only worked through the first set to be worse. And so, if she's interrupted just as she completes the first set and suddenly has to make a decision she was hoping to make only at the end, she might well decide not to use her current credences. And in that sense they are transitional. Staffel considers a case very much like this in her recent book manuscript.

Of course, you might wonder how it could be rational to embark upon a series of evidence-gathering episodes with the expectation that, at various points along the way, you'll be doing worse than you were doing at the beginning. But that's a pretty standard thing we do: we commit to a sequence of actions that, if completed, will bring about benefits, but, if left half done, will leave us worse off; and we simply have to weigh our credence that we'll get to complete the sequence against the benefits if we do and the detriments if we don't and see whether it's worth it. And the same goes in the epistemic case.

Nonetheless, while I think the argument against transitional attitudes that I gave above fails to show there are no such attitudes, I think it might show that they're rather rarer than Staffel imagines. Most inquiries involve a sequence of evidence-gathering episodes each of which improves your credences in expectation; the minority are like the detective running through a series of individually biased but collectively unbiased inquiries.

5.4 Friedman's example of the Chrysler building

Let's turn now to an example that motivates much of Jane Friedman's recent contributions to the literature. I'll quote at length:

"I want to know how many windows the Chrysler Building in Manhattan has (say I’m in the window business). I decide that the best way to figure this out is to head down there myself and do a count. [...] Say it takes me an hour of focused work to get the count done and figure out how many windows that building has. [...] Now think about the hour during which I’m doing my counting. During that hour there are many other ways I could make epistemic gains. [...] First, I’m a typical epistemic subject and so I arrive at Grand Central with an extensive store of evidence: the body of total evidence, relevant to all sorts of topics and subject matters, that I’ve acquired over my lifetime. Second, I’m standing outside Grand Central Station for that hour and so the amount of perceptual information available to me is absolutely vast. [...] However, during my hour examining the Chrysler Building I barely do any of that. I need to get my count right, and to do that I really have to stay focused on the task. Given this, during that hour I don’t extend my current stores of knowledge by drawing inferences that aren’t relevant to my counting task, and I do my best to ignore everything else going on around me. And this seems to be exactly what I should be doing during that hour if I want to actually succeed in the inquiry I’m engaged in. [...] There is an important sense in which I succeed in inquiry by failing to respect my evidence for some stretch of time. It’s not that my success in this case comes by believing things my evidence doesn’t support, but it does come by ignoring a lot of my evidence and failing to come to know a great deal of what I’m in a position to know."

I think the natural thing to say here is that, as Friedman faces the Chrysler Building, she faces a choice between a number of different evidence-gathering episodes she might undertake. Some of them are the ones that form the inquiry she is there to undertake, namely, determining the number of windows in the building; some involve attending to sensory information and perhaps testimony that is available at the spot where she's ended up, but which is irrelevant to her inquiry; and some involve drawing inferences from the store of memories and other evidence she's previously collected, which is again irrelevant to her inquiry.

Of course, it's rather unusual to think of these last episodes as involving evidence-gathering. After all, you already have the evidence, and you're simply drawing conclusions from it that you haven't drawn before. But I think it's reasonable to view logical reasoning as doing something similar to what gathering empirical evidence does. In both cases, they are ruling out states of the world that are in some sense possible. When I see that it is raining, I rule out those states of the world in which it is not. And when I reason to the conclusion that one proposition entails another, I rule out those states of the world in which the former is true and the latter false. Now these latter states of the world were never truly possible; or at least, they were never logically possible. But until I did this reasoning, they were epistemically or personally possible for me, to borrow Ian Hacking's term. And we can give a version of credal epistemology on which they are the states of the world, sets of those represent the propositions to which we assign credences, suitably adapted probability axioms hold of them, and conditionalization is defined: building on Hacking's insights, I do this here.

So, having seen this we can understand the logical reasoning that Friedman doesn't do when she's in front of the Chrysler Building as just another sort of evidence she doesn't gather, just as she doesn't gather the evidence she might do if she were to attend to the conversation between the two commuters standing to her left, say. And once we do that, we can say that Friedman does the right thing by continuing with her window-counting inquiry so long as, at each stage, the evidence-gathering episode that comes next in that inquiry is the one that maximizes expected pragmatic or epistemic value among those episodes that are available to her. And if we see things in this way, there is no clash between an epistemic norm and a zetetic one. There are just two norms: gather evidence in the way that maximizes expected utility; and respond to that evidence by conditionalizing. And they govern what Friedman should do in front of the Chrysler Building.

6. Conclusion

One of the attractions of the picture I've been sketching here is that it is unified. It tells you there are pragmatic norms for inquiry and epistemic ones and all-things-considered norms as well. Each arises from a different source of value. The pragmatic norms arise from focussing only on the pragmatic value of credences as guides to action, and the pragmatic value of evidence-gathering as the pragmatic value of the credences it leads to. The norm is then: Gather evidence so as to maximize the expected pragmatic value of your credences. The epistemic norms arise from focussing only on the epistemic value of credences as representations of the world, and the epistemic value of evidence-gathering as the epistemic value of the credences it leads to. The norm is then: Gather evidence so as to maximize the expected epistemic value of your credences. But we can also combine the two sources of value into a measure of all-things-considered value, and then an analogous all-things-considered norm emerges.

Appendices

Appendix A.1: the framework of precise credences

Let $W$ be a set of possible worlds.
Let $\mathcal{F}$ be the set of all subsets of $W$; these represent propositions.
A credence function on $\mathcal{F}$ is a function $C : \mathcal{F} \rightarrow [0, 1]$.
A credence function is probabilistic if (i) $C(\emptyset) = 0$, (ii) $C(W) = 1$, and (iii) $C(X \cup Y) = C(X) + C(Y)$ for $X \cap Y = \emptyset$.
A credence function $C$ on $\mathcal{F}$ is regular if $C(w) > 0$, for all $w$ in $W$.
Given a regular probabilistic credence function $C$ on $\mathcal{F}$, and a proposition $E$ in $\mathcal{F}$, we define $C_E$ as follows: for each $X$ in $\mathcal{F}$,$$C_E(X) = C(X \mid E) = \frac{C(X \cap E)}{C(E)}$$

Appendix A.2: Good's Pragmatic Value of Information Theorem

An option on $W$ is a function $o : W \rightarrow \mathbb{R}$.
A decision problem on $W$ is a set of options on $W$.
Given a probabilistic credence function $C$ and an option $o$, the expected utility of $o$ from the point of view of $C$ is $\sum_{w \in W} C(w)o(w)$.
A tie-breaker function $t$ takes a set of options and returns a single option from among them.
An evidence-gathering episode is represented by an evidence function $E$, which takes each possible world and returns a proposition.
An evidence function is partitional if $\{E(w) \mid w \in W\}$ is a partition.
An evidence function is factive if $w$ is in $E(w)$ for all $w$ in $W$.
Given a probabilistic credence function $C$, a decision problem $D$, and a tie-breaker function $t$, if $O^D_C$ is the set of options in $D$ that maximize expected utility from the point of view of $C$, then let $o^{D,t}_C = t(O^D_C)$.
Given a probabilistic credence function $C$, a decision problem $D$, a tie-breaker function $t$, and a world $w$, the pragmatic value at $w$ of $C$ for someone who faces $D$ is defined as follows: $PU^{D,t}(C, w) =o^{D,t}_C(w)$.

Good's Pragmatic Value of Information Theorem Given a regular probabilistic credence function $C$, a partitional and factive evidence function $E$, a decision problem $D$, and a tie-breaker function $t$, if $o^{D,t}_C \neq o^{D,t}_{C_{E(w)}}$, for some $w$, then $$\sum_{w \in W} C(w)PU^{D, t}(C,w) < \sum_{w \in W} C(w)PU^{D, t}(C_{E(w)}, w)$$
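To see the theorem at work on a toy case, here is a minimal Python sketch. The two-world set, the credences, the decision problem, and the alphabetical tie-breaker are all illustrative assumptions rather than anything fixed by the definitions above; the evidence function simply reveals the true world, so it is both partitional and factive.

```python
from fractions import Fraction as F

# Worlds and a regular probabilistic credence function C (illustrative values).
W = ["heads", "tails"]
C = {"heads": F(1, 2), "tails": F(1, 2)}

# A decision problem: each option maps worlds to utilities.
D = {
    "Sure Thing": {"heads": 2, "tails": 2},
    "Gamble": {"heads": 5, "tails": 0},
}

# A partitional, factive evidence function: it reveals the true world.
E = {"heads": frozenset({"heads"}), "tails": frozenset({"tails"})}


def conditionalize(C, evidence):
    """Return C_E, the result of conditionalizing C on the proposition `evidence`."""
    total = sum(C[w] for w in evidence)
    return {w: (C[w] / total if w in evidence else F(0)) for w in C}


def expected_utility(C, option):
    return sum(C[w] * option[w] for w in C)


def choose(C, D):
    """Pick an expected-utility maximizer; ties go to the alphabetically first name."""
    best = max(expected_utility(C, o) for o in D.values())
    return min(name for name, o in D.items() if expected_utility(C, o) == best)


def pragmatic_value(C_used, D, w):
    """PU(C_used, w): the utility at w of the option that C_used would choose."""
    return D[choose(C_used, D)][w]


# Expected pragmatic value, by the lights of C, of choosing without and with the evidence.
value_without = sum(C[w] * pragmatic_value(C, D, w) for w in W)
value_with = sum(C[w] * pragmatic_value(conditionalize(C, E[w]), D, w) for w in W)
print(value_without, value_with)  # 5/2 7/2
```

Because the evidence changes the choice at the tails world, the inequality is strict here, as the theorem says it should be.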

Appendix A.3: Myrvold's Epistemic Value of Information Theorem

An epistemic utility function $EU$ takes a credence function $C$ on $\mathcal{F}$ and a world $w$ in $W$ and returns $EU(C, w)$, which measures the epistemic value of $C$ at $w$.
$EU$ is strictly proper if, for any probabilistic credence function $C$ and any credence function $C'$, if $C \neq C'$, then $\sum_{w \in W} C(w)EU(C, w) > \sum_{w \in W} C(w)EU(C', w)$.

Myrvold's Epistemic Value of Information Theorem Given a regular probabilistic credence function $C$, a partitional and factive evidence function $E$, and a strictly proper epistemic utility function, if $C \neq C_{E(w)}$, for some $w$, then$$\sum_{w \in W} C(w)EU(C,w) < \sum_{w \in W} C(w)EU(C_{E(w)}, w)$$

Some strictly proper epistemic utility functions:

The negative Brier score of $C$ at $w$ is $$-\sum_{X \in \mathcal{F}} |V_w(X) - C(X)|^2,$$where $V_w(X) = 0$ if $X$ is false at $w$ and $V_w(X) = 1$ if $X$ is true at $w$.
Given a weighting $0 < \lambda_X$ for each $X$ in $\mathcal{F}$, the weighted negative Brier score of $C$ at $w$ relative to those weights is $$-\sum_{X \in \mathcal{F}} \lambda_X|V_w(X) - C(X)|^2$$
The negative log score of $C$ at $w$ is $\log(C(w))$.

Consider the evidence function described in the main text:

$W = \{w_1, w_2, w_3\}$
$E(w_1) = \{w_1, w_2\}$, $E(w_2) = \{w_2, w_3\}$, $E(w_3) = \{w_2, w_3\}$

Then, if you use the negative Brier score, the verdict depends on the credence function you start with:

This credence function expects evidence-gathering to be better than not: $(10/39, 5/13, 14/39)$.
This credence function expects evidence-gathering to be worse than not: $(1/16, 7/16, 1/2)$.

But if you use the negative log score, every credence function expects evidence-gathering to be better than not. And indeed, for any factive evidence function, every credence function expects evidence-gathering to be better than not, if you use the log score. That's because, for any such evidence function, learning it will raise your credence in the true world and that's all the negative log score cares about.
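These claims can be checked numerically. The following Python sketch uses the evidence function and the two credence functions given above; the function names (`expected_gain`, `neg_brier`, `neg_log`) are illustrative choices, and the negative Brier score is summed over all eight propositions, as in the definition.

```python
from fractions import Fraction as F
from itertools import combinations
from math import log

W = ["w1", "w2", "w3"]
# The factive but non-partitional evidence function from this appendix.
E = {
    "w1": frozenset({"w1", "w2"}),
    "w2": frozenset({"w2", "w3"}),
    "w3": frozenset({"w2", "w3"}),
}
# Every subset of W counts as a proposition.
PROPS = [frozenset(c) for r in range(len(W) + 1) for c in combinations(W, r)]


def conditionalize(C, evidence):
    total = sum(C[w] for w in evidence)
    return {w: (C[w] / total if w in evidence else F(0)) for w in C}


def neg_brier(C, w):
    """Negative Brier score of C at w, summed over all propositions."""
    return -sum(((1 if w in X else 0) - sum(C[v] for v in X)) ** 2 for X in PROPS)


def neg_log(C, w):
    """Log of the credence assigned to the true world."""
    return log(C[w])


def expected_gain(C, score):
    """Expected score after gathering evidence minus expected score of staying put."""
    stay = sum(C[w] * score(C, w) for w in W)
    gather = sum(C[w] * score(conditionalize(C, E[w]), w) for w in W)
    return gather - stay


C1 = dict(zip(W, [F(10, 39), F(5, 13), F(14, 39)]))
C2 = dict(zip(W, [F(1, 16), F(7, 16), F(1, 2)]))
print(expected_gain(C1, neg_brier) > 0)  # True
print(expected_gain(C2, neg_brier) > 0)  # False
print(expected_gain(C1, neg_log) > 0, expected_gain(C2, neg_log) > 0)  # True True
```

The Brier comparison uses exact fractions, so the signs do not depend on floating-point rounding; only the log score is computed in floats.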

Utilitarianism and risk: a reply to Mogensen

2023-06-01T13:03:00.001+01:00

Like Lara Buchak, I think rationality permits many different attitudes to risk: without ever falling into irrationality, you might be extremely risk-averse, quite risk-averse, just a little risk-averse, risk-neutral, a teeny bit risk-inclined, very risk-inclined, exorbitantly risk-inclined, and many points in between. To illustrate, suppose you are offered a choice between two units of utility for sure (Sure Thing in the pay-off table below) and a gamble that gives a 50% chance of five units of utility and 50% chance of none (Gamble below). Then I think rationality permits you to be risk-averse and choose the sure thing, and it permits you to be risk-neutral and choose the gamble, in line with expected utility theory.

$$\begin{array}{r|cc}& \text{Heads} & \text{Tails} \\ \hline \text{Sure Thing} & 2 & 2 \\ \text{Gamble} & 5 & 0 \end{array}$$

This raises an interesting question for utilitarians. In what follows, I'll focus on total utilitarianism, in particular. This says the best action is the one that produces the greatest total aggregate utility. However, this is not a complete theory of moral action, since we are often uncertain which action produces the greatest total utility. Instead, we assign probabilities to the different outcomes of the different possible actions. But how are we to use these, together with the total aggregate utility in each outcome, to determine which action to choose? Standardly, utilitarians appeal to expected utility theory: the morally right action is the one that produces the greatest total aggregate utility in expectation; that is, we take each possible action, take each of its outcomes, take the total utility of that outcome and weight it by the probability the action brings about the outcome, sum up these weighted utilities to give the expectation of the action's total utility, and then pick an action whose expectation is maximal. But, if many different attitudes to risk are permitted, and rationality does not require you to maximise expected utility, then why should morality require this? But if morality doesn't require this, what does it require? Is there a particular attitude to risk you should adopt when choosing morally? Or is any rationally permitted attitude also morally permitted?

Local risk-taker

Again like Lara Buchak, I think that, when choosing an action that will affect a particular group of people, the attitude to risk you should use is determined by their attitudes to risk. In a straightforward case, for instance, if they all share the same attitudes to risk and you know this, then you should also use those attitudes to risk when choosing your action. Let's call this the Risk Principle.

I have recently appealed to something like the Risk Principle to argue that, contrary to the claims of longtermists, total utilitarianism demands not that we devote our resources to ensuring a long future for humanity, but rather that we should use them to hasten human extinction---the argument is essentially that extinction is the less risky option, since a long future for humanity could contain vast amounts of pleasure and happiness, but it could also contain unfathomable amounts of pain and suffering; and we should choose using risk-averse preferences since these are predominant in our current society, and likely to continue to be predominant. Andreas Mogensen has since responded to my argument by highlighting a tension between the way in which Buchak argues for the Risk Principle and the version of total utilitarianism that results from adding the principle to it.

Mogensen notes that Buchak argues for the principle by appealing to considerations that are more typically associated with Scanlon's version of contractualism. He writes:

"[The principle] is motivated by the thought that when choosing for others, we should err against subjecting people to risks we’re not sure they would take on their own behalf. Thus, Buchak holds that 'we cannot choose a more-than-minimally risky gamble for another person unless we have some reason to think that he would take that gamble himself' (Buchak 2019: 74). The ideal of justifiability to each individual is also taken to support [the principle], in the form of the idea that we should 'take only the risks that no one could reasonably reject.' (ibid.)"

And then, at least as I understand him, he makes two claims.

First, if a consequentialist appeals to contractualist considerations, such as whether you can justify your decision to each of the people it affects, then this must be because they must value something beyond the welfare of those people; they must also assign value to being able to justify the decision to them. Call this the Extra Source of Value Objection.

Second, he notes that the version of total utilitarianism we obtain when we append the Risk Principle to it in fact runs contrary to the contractualist norms we used to motivate that principle. Call this the Self-Undermining Objection.

Let's take them in turn. Of course, it's true that consequentialists, and certainly total utilitarians, don't usually think there is any component of an action that must be justifiable to the individuals it will affect before it can count as moral. But that's because, for the most part, they've worked in a framework that hasn't really reckoned with permissivism about rationality, and so there has not really been any room for it. For even before we reckon with permissivism about rational risk attitudes, a problem arises for utilitarians because of permissivism about rational belief.

Suppose we must choose between the two options, Sure Thing and Gamble, which will affect two individuals, Asa and Bea. The total aggregate utility of the options is given in the table below. The outcome of Gamble is determined by whether the FTSE 100 stock index rises or not.

$$\begin{array}{r|cc}& \text{FTSE rises} & \text{FTSE doesn't rise} \\ \hline \text{Sure Thing} & 2 & 2 \\ \text{Gamble} & 5 & 0 \end{array}$$

We will choose by maximising expected total aggregate utility. Asa thinks it is 50% likely the FTSE will rise, and 50% it won't, and so prefers Gamble to Sure Thing. Bea is less bullish, thinking it is only 25% likely the stock index will rise, and 75% it won't, and so prefers Sure Thing to Gamble. They disagree about how likely the two states of the world are, but this is not because they have different evidence: they don't. Rather, it's because they began their epistemic lives with different prior probabilities. What's more, for both of them, the priors with which they began were rationally permitted. Now it is our job to choose between the two options on their behalf. How should we choose? Utilitarianism is silent. It tells us how to choose once we've fixed our probabilities over the future of the stock index. But it doesn't tell us how to do this. No considerations of total welfare tell us what to do here, because this isn't about the ends we seek, which utilitarianism specifies clearly, but rather about the means we take to achieve them, about which it says nothing. At this point, then, we might bring in considerations more usually associated with contractualism. We might say that, in our calculation of the expected utilities, we should use probabilities that we can justify to each of the individuals affected: perhaps some sort of aggregate of their probabilities. So, for instance, if they were all to agree on the probabilities they assign, while I, as decision maker, assign different ones, then providing theirs are rationally permissible and based on all the evidence available, we should use theirs, since only by doing this can we justify our choice of probabilities to them. Nothing here has added anything to our assessment of the value of the outcomes: those are still valued precisely at their total aggregate welfare.
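For readers who want to check the arithmetic behind Asa's and Bea's preferences, here is a small Python sketch. The pay-offs and credences are those given above; the dictionary layout and the name `expected_utility` are just illustrative.

```python
from fractions import Fraction as F

# Total aggregate utility of each option in each state (from the pay-off table).
options = {
    "Sure Thing": {"rises": 2, "doesn't rise": 2},
    "Gamble": {"rises": 5, "doesn't rise": 0},
}
# Each person's credences over the states, as given in the text.
credences = {
    "Asa": {"rises": F(1, 2), "doesn't rise": F(1, 2)},
    "Bea": {"rises": F(1, 4), "doesn't rise": F(3, 4)},
}


def expected_utility(credence, option):
    return sum(credence[s] * option[s] for s in credence)


for person, credence in credences.items():
    eus = {name: expected_utility(credence, o) for name, o in options.items()}
    # Report the option with the highest expected total utility for this person.
    print(person, "prefers", max(eus, key=eus.get))
```

By Asa's lights Gamble has expected total utility 5/2, which beats 2; by Bea's lights it has 5/4, which does not.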

The same might be said, suitably adapted, about moral decisions in the face of permissivism about rational attitudes to risk. We might imagine instead that Asa and Bea agree that it's 50-50 whether the FTSE will rise or not, but Asa is risk-neutral, and thus prefers Gamble to Sure Thing, while Bea is sufficiently risk-averse that she prefers Sure Thing to Gamble. Again, we must ask: how should we choose? And again utilitarianism is silent because we are asking not about the ends we seek, which total utilitarianism fixes for us, but about the means by which we pursue those ends, which it doesn't. And so again we might ask which risk attitudes we can justify using to all affected parties without creating any tension with the core commitments of total utilitarianism. So I think the Extra Source of Value Objection fails.

Let's turn now to the Self-Undermining Objection. Suppose we are again choosing between two options on behalf of Asa and Bea. This time, the pay-off table looks like this, where we specify not only the total aggregate utility, but also the individual utilities that we add together to give it:

$$\begin{array}{r|cc}& \text{FTSE rises} & \text{FTSE doesn't rise} \\ \hline \text{Sure Thing} & \text{Asa:} 2 & \text{Asa:} 2 \\ & \text{Bea:} 2 & \text{Bea:} 2 \\ \text{Gamble} & \text{Asa:} 5 & \text{Asa:} 0 \\ & \text{Bea:} 0 & \text{Bea:} 5 \end{array}$$

Both Asa and Bea are risk-averse, and so, thinking purely of their own utility, they prefer Sure Thing to Gamble. However, from the point of view of total utility, Gamble strictly dominates Sure Thing: it obtains five units of utility in total, regardless of how the FTSE performs, while Sure Thing obtains only four. So, from the point of view of total utility, whichever attitudes to risk we use, we will pick Gamble, going against the unanimous preferences of those affected. But surely this is not a choice that can be justified to each affected party?
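The structure of this case can be made concrete with a short Python sketch. The individual pay-offs come from the table above. The text does not say exactly how risk-averse Asa and Bea are, so the 50-50 credences and the risk function $R(p) = p^2$ used for the individual calculation are assumptions made purely for illustration.

```python
from fractions import Fraction as F

# Individual utilities from the pay-off table: option -> state -> person -> utility.
payoffs = {
    "Sure Thing": {"rises": {"Asa": 2, "Bea": 2}, "doesn't rise": {"Asa": 2, "Bea": 2}},
    "Gamble": {"rises": {"Asa": 5, "Bea": 0}, "doesn't rise": {"Asa": 0, "Bea": 5}},
}
states = ["rises", "doesn't rise"]

# Total utility per state: Gamble beats Sure Thing in every state.
totals = {o: {s: sum(payoffs[o][s].values()) for s in states} for o in payoffs}
gamble_dominates = all(totals["Gamble"][s] > totals["Sure Thing"][s] for s in states)


# Assumption for illustration: 50-50 credences and risk function R(p) = p**2.
def reu_two_outcomes(worse, better, prob_better):
    """Risk-weighted expected utility of a two-outcome gamble under R(p) = p**2."""
    return worse + prob_better ** 2 * (better - worse)


# From Asa's own point of view, Gamble yields 0 or 5; Bea's case is symmetric.
individual_reu_gamble = reu_two_outcomes(F(0), F(5), F(1, 2))
print(gamble_dominates, individual_reu_gamble)  # True 5/4
```

Under these assumptions each individual's risk-weighted expected utility for Gamble is 5/4, below the 2 they get from Sure Thing, even though Gamble yields more total utility in every state.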

But while I agree that some of what Buchak writes in defence of the Risk Principle can be read in a way that seems in tension with this conclusion, I think there's another way to think about it. As noted in response to the Extra Source of Value Objection, total utilitarianism gives an account of what is valuable, but, at least if you think that there are many rational prior probabilities or many rational attitudes to risk, it doesn't give an account of how to choose morally when you're uncertain how the world is, because it doesn't tell you which probabilities or which attitudes to risk to use when you make your decision. It is the choice of these that we should be able to justify to the individuals affected, not the choice itself, or at least not primarily. If we can do that, then we justify the choice we make to them by saying: I think the value of an outcome is its total utility; and I needed to fix on a single probability function and a single set of attitudes to risk in order to choose between options given this account of final value; and I used these probabilities and these attitudes to risk, which I have justified to you. Now, they might retort: But you've chosen one option when we all preferred the other! But to that we can respond: Ah, true, but you were not considering the same decision problem I was. For you, the value of each outcome was the utility it obtained for you; for me, it was the total utility it obtained for the population. Now, it is of course surprising that these two things come apart: on the one hand, choosing when you value an outcome for its total welfare; on the other, preserving unanimous preferences. But that's just the way things have to shake out if we think rationality is permissive.

So, in the end, I think both of Mogensen's objections can be answered once we understand better how we appeal to the contractualist idea to justify something like the Risk Principle.

Transformative experiences and choosing when you don't know what you value

2023-04-23T18:08:00.002+01:00

There is a PDF version of the blogpost here.

There are some things whose value for us we can't know until we have them. Love might be one of them, becoming a parent another. In many such cases, we can't know how much we value that thing until we have it because only by having it can we know what it's like. The nature of the experience of having it---how it feels, its phenomenal character---plays a crucial role in determining the value that we assign to it. What it will feel like to be in love is part of what determines its value for the person who will experience it. The phenomenal quality of the bliss when that love is reciprocated, what it's like for you to feel the pain when it's not---together with the other aspects of being in love, these determine its value for you. And, you might reasonably think, you can't know what it will be like until you experience it. Let's follow Laurie Paul in calling these epistemically transformative experiences. In this post, I'd like to consider Paul's argument in Transformative Experience that they pose a problem for our theories of rational choice. I used to think those theories had the resources to meet this challenge, but now I'm not so hopeful.

Sapphire, uncertain of her utilities, as of so much

To understand Paul's challenge, let me describe it in a particular case. Currently, I'm not a parent. Now, suppose I'm trying to decide whether or not to adopt a child. I know that many factors contribute to the value I assign to becoming a parent, but one of them is what the experience will be like for me. I might enjoy it enormously; I might find it on balance agreeable, but often exhausting and upsetting; or I might hate it. Looking around at my friends and hearing of others' experience of adoption, I come to the conclusion that there are two sorts of person. For one of them, the character of their child doesn't make any difference to their experience of being a parent---they enjoy the experience well enough regardless, and a bit more than not being a parent. For others, it makes an enormous difference---if the child's character is like their own, their experience of parenting is wonderful; if not, it's awful.

To put some numbers on this, suppose you conclude that you are one of these two types of person, but you don't know which, and the two types assign utilities as follows:$$\begin{array}{r|cc} U_1 & \textit{Similar} & \textit{Different} \\ \hline \textit{Child-free} & 0 & 0 \\ \textit{Parent} & 1 & 1 \end{array} \hspace{10mm} \begin{array}{r|cc} U_2 & \textit{Similar} & \textit{Different} \\ \hline \textit{Child-free} & 0 & 0 \\ \textit{Parent} & 25 & -7 \end{array}$$How should you choose? Paul argues that our theories of rational choice can't help us, since, to apply them, we must know our utilities. To that challenge, I responded in Choosing for Changing Selves that you should simply incorporate your uncertainty about your utilities into your representation of the decision problem. That is, you should not take the states of the world to be simply Similar or Different, since you don't know your utilities for the different options at those states of the world. Rather, you should take them to be Similar & my utilities are given by $U_1$, Similar & my utilities are given by $U_2$, Different & my utilities are given by $U_1$, and Different & my utilities are given by $U_2$, since you do know the utilities of the options at these states of the world: the utility of being a parent at Similar & my utilities are given by $U_2$ is 25, for instance. So the decision now looks like this:$$\begin{array}{r|cccc} & \textit{Similar}\ \&\ U_1 & \textit{Different}\ \&\ U_1 & \textit{Similar}\ \&\ U_2 & \textit{Different}\ \&\ U_2 \\ \hline \textit{Child-free} & 0 & 0 & 0 & 0\\ \textit{Parent} & 1 & 1 & 25 & -7 \end{array}$$And you should assign probabilities to these states and then apply your theory of rational choice to this newly-specified decision. I called this the fine-graining response to Paul's challenge.

So, for instance, suppose that you think Similar and Different are equally likely, and $U_1$ and $U_2$ are equally likely, and they're independent of one another; then you should assign credence $\frac{1}{4}$ to each of these states. Suppose that your theory of rational choice tells you to maximise expected utility. Then you should become a parent, because its expected utility is $\frac{1}{4}\times 1 + \frac{1}{4}\times 1 + \frac{1}{4}\times 25 + \frac{1}{4}\times (-7) = 5$, while the expected utility of the alternative is 0.
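The fine-graining calculation can be sketched in a few lines of Python. The utilities and probabilities are the ones just given; the variable names are illustrative, and the probability of each fine-grained state is obtained by multiplying, which is where the independence assumption enters.

```python
from fractions import Fraction as F
from itertools import product

# Utility of becoming a parent under each candidate utility function.
parent = {
    "U1": {"Similar": 1, "Different": 1},
    "U2": {"Similar": 25, "Different": -7},
}
p_state = {"Similar": F(1, 2), "Different": F(1, 2)}
p_util = {"U1": F(1, 2), "U2": F(1, 2)}

# Fine-grained states pair a state of the world with a hypothesis about your utilities.
# Independence means each pair gets the product of the two probabilities.
fine = {(s, u): p_state[s] * p_util[u] for s, u in product(p_state, p_util)}

eu_parent = sum(p * parent[u][s] for (s, u), p in fine.items())
eu_child_free = F(0)  # remaining child-free has utility 0 in every fine-grained state
print(eu_parent, eu_child_free)  # 5 0
```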

Now, notice something about that solution in the case of expected utility theory. Suppose that, whether your utilities are given by $U_1$ or by $U_2$, your expected utility for one option is greater than your expected utility for another. That is, while you don't know what your utilities are, you do know how they'd order those two options. Then, at least assuming that you consider the original states of the world independent of the facts about your utilities, the fine-graining strategy I described above will also give higher expected utility to the first option than to the second.*

But I've become convinced over the past few years that expected utility theory isn't the correct theory of rational choice. Rather, it's something closer to Lara Buchak's risk-weighted expected utility theory, which builds on John Quiggin's rank-dependent utility theory. According to this theory, we don't choose between different options by comparing their expected utility; instead, we compare their risk-weighted expected utility, which is calculated using our own personal risk function.

A risk function is a function $R : [0, 1] \rightarrow [0, 1]$ that is continuous, strictly increasing, and for which $R(0) = 0$ and $R(1) = 1$. Given a risk function $R$, and an option $o$ such that $U(o, s_1) \leq U(o, s_2) \leq \ldots \leq U(o, s_n)$, and where the probability of state $s_i$ is $p_i$, the risk-weighted expected utility of $o$ is

$REU_R(U(o)) = U(o, s_1) + $

$R(p_2 + \ldots + p_n)(U(o, s_2) - U(o, s_1)) + $

$R(p_3 + \ldots + p_n)(U(o, s_3) - U(o, s_2)) + \ldots +$

$R(p_{n-1} + p_n)(U(o, s_{n-1}) - U(o, s_{n-2})) + $

$R(p_n)(U(o, s_n) - U(o, s_{n-1}))$

One natural family of risk functions is $R_n(p) = p^n$ for $n>0$. If $0 < n < 1$, then $R_n$ is risk-inclined; if $1 < n$, then $R_n$ is risk-averse; and if $n = 1$, then $R_n$ is risk-neutral.
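For readers who like to check such definitions by machine, here is a minimal Python sketch of risk-weighted expected utility. The function name `reu` and the representation of an option as a list of (probability, utility) pairs are my own illustrative choices, not part of Buchak's presentation. The example gamble is the fine-grained Parent option from earlier, with credence $\frac{1}{4}$ in each state.

```python
from fractions import Fraction

def reu(outcomes, risk):
    """Risk-weighted expected utility of a single option.

    outcomes: list of (probability, utility) pairs, one per state
              (an illustrative representation, assumed for this sketch).
    risk:     the risk function R, mapping [0, 1] to [0, 1].
    """
    # Order the states from worst utility to best, as the definition requires.
    ordered = sorted(outcomes, key=lambda pu: pu[1])
    total = ordered[0][1]  # start from the worst-case utility
    for i in range(1, len(ordered)):
        # Probability of ending up in the i-th ranked state or a better one.
        tail = sum(p for p, _ in ordered[i:])
        total += risk(tail) * (ordered[i][1] - ordered[i - 1][1])
    return total

quarter = Fraction(1, 4)
parent = [(quarter, 1), (quarter, 1), (quarter, 25), (quarter, -7)]

neutral = reu(parent, lambda p: p)          # n = 1: ordinary expected utility
averse = reu(parent, lambda p: p ** 2)      # n = 2: lies below the expectation
inclined = reu(parent, lambda p: p ** 0.5)  # n = 1/2: lies above the expectation

print(neutral, averse < neutral < inclined)
```

With $n = 1$ the function returns the expected utility of 5 computed earlier, which is a useful sanity check on the telescoping form of the definition.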

Now suppose we consider my decision whether to become a parent again from the point of view of this theory of rational choice. And let's suppose I'm quite risk-averse, with a risk function $R(p) = p^2$. Then let's begin by looking at the decision from the point of view of $U_1$ and $U_2$ separately. If your utilities are given by $U_1$, then the risk-weighted expected utility of remaining child-free is 0, and of becoming a parent is 1; if $U_2$, then the risk-weighted expected utility of remaining child-free is 0 again, but of becoming a parent is $-7+R(1/2)\times 32 = 1$. So, either way, you'll prefer to become a parent. But now consider the fine-grained version of the decision:$$\begin{array}{r|cccc} & \textit{Similar}\ \&\ U_1 & \textit{Different}\ \&\ U_1 & \textit{Similar}\ \&\ U_2 & \textit{Different}\ \&\ U_2 \\ \hline \textit{Child-free} & 0 & 0 & 0 & 0\\ \textit{Parent} & 1 & 1 & 25 & -7 \end{array}$$Then the risk-weighted expected utility of remaining child-free is still 0, but of becoming a parent is $-7 + R(3/4)\times 8 + R(1/4)\times 24 = -1$. So you prefer remaining child-free.
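The reversal can be reproduced in a few lines of Python. This is only a sketch under the assumptions already stated in the example: Similar and Different are equally likely, $U_1$ and $U_2$ are equally likely and independent of them, and $R(p) = p^2$. The helper `reu` and the variable names are hypothetical.

```python
from fractions import Fraction

def reu(outcomes, risk):
    # outcomes: list of (probability, utility) pairs; sorted worst to best.
    ordered = sorted(outcomes, key=lambda pu: pu[1])
    total = ordered[0][1]
    for i in range(1, len(ordered)):
        tail = sum(p for p, _ in ordered[i:])  # chance of doing at least this well
        total += risk(tail) * (ordered[i][1] - ordered[i - 1][1])
    return total

def square(p):
    # The risk-averse risk function R(p) = p^2 used in the text.
    return p ** 2

half, quarter = Fraction(1, 2), Fraction(1, 4)

# Parent, evaluated separately by each candidate utility function.
parent_u1 = [(half, 1), (half, 1)]
parent_u2 = [(half, 25), (half, -7)]
# Parent, evaluated over the four fine-grained states.
parent_fine = [(quarter, 1), (quarter, 1), (quarter, 25), (quarter, -7)]

reu_u1 = reu(parent_u1, square)      # 1
reu_u2 = reu(parent_u2, square)      # 1
reu_fine = reu(parent_fine, square)  # -1

# Child-free has utility 0 everywhere, so its value is 0 on every framing.
print(reu_u1 > 0, reu_u2 > 0, reu_fine > 0)  # True True False
```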

In general, it turns out, if you use risk-weighted expected utility theory with a risk function that isn't risk-neutral, there will be cases like this, where you are uncertain of your utilities, know that whichever are your utilities they'll prefer one option to another, but when you fine-grain the states of the world and make the decision like that, you'll prefer the second option to the first.

It seems to me, then, that epistemically transformative experiences do pose a problem for theories of rational choice like Buchak's. It's true that the fine-graining strategy allows us to apply our theory of rational choice even when we are uncertain about our own utilities. But adopting that strategy only pushes the problem elsewhere, for it then turns out that there will be cases in which I know I prefer one option to another---since I know that, whichever of the possible utility functions I actually have, it prefers that option---but I also prefer the other option to the first---since, when I apply the fine-graining strategy to determine which option I prefer, I prefer the other. And this, it seems to me, is an untenable situation.

----------------------------------

* Suppose $U_1, \ldots, U_m$ are the possible utilities you might have, and $q_1, \ldots, q_m$ are the probabilities that you have them; and suppose $s_1, \ldots, s_n$ are the possible states of the world, and $p_1, \ldots, p_n$ are their probabilities; and suppose $U_j(o, s_i)$ is the utility of $o$ at state $s_i$ by the lights of utilities $U_j$. Then, the expected utility of an option $o$ from the point of view of the fine-grained set of states, $s_i\ \&\ U_j$, is $$\sum^m_{j=1} \sum^n_{i=1} q_jp_i U(o, s_i\ \&\ U_j)= \sum^m_{j=1} \sum^n_{i=1} q_jp_i U_j(o, s_i) = \sum^m_{j=1} q_j \sum^n_{i=1} p_i U_j(o, s_i) $$So, if $$\sum^n_{i=1} p_i U_j(o, s_i) < \sum^n_{i=1} p_i U_j(o', s_i) $$for all $j= 1, \ldots, m$, then $$\sum^m_{j=1} \sum^n_{i=1} q_jp_i U(o, s_i\ \&\ U_j) < \sum^m_{j=1} \sum^n_{i=1} q_jp_i U(o', s_i\ \&\ U_j)$$

Choosing for others when you don't know their attitudes to risk

2023-04-18T07:24:00.002+01:00

A PDF of this blogpost is available here.

[This blogpost is closely related to the one from last week. In this one, I ask how someone should choose on behalf of a group of people when she doesn't know their attitudes to risk; in the previous one, I asked how someone should choose for themselves when one of their decisions might cause them to change their attitudes to risk. Many of the same considerations carry over from the previous post to this one, and so I apologise for some repetition.]

A new virus has emerged and it is spreading at speed through the population of your country. As the person charged with public health policy, it falls to you to choose which measures to take. Yet there is much you don't know about the virus at this early stage, and therefore much you don't know about the effects of the different possible measures you might take. You don't know how severe the virus is, for instance. It might typically cause mild illness and very few deaths, but it might be much more dangerous than that. Initial data suggests it's mild, but it isn't definitive.

As well as your ignorance of these medical facts about the virus, there are also facts you don't know about the population that will be affected by whatever measures you impose. You might not know, for instance, the value that each person in the country assigns to the different possible outcomes of those measures. You might not know how much people value being able to gather in groups during the winter months, either to chat with friends, look in on vulnerable family, stave off loneliness, or attend communal worship as part of their religion. You might not know how much people disvalue feeling ill, suffering the long-term effects of a virus, or seeing their children's education stall and their anxieties worsen.

Is he risk-averse or risk-neutral? Answer unknown.

If this is the only source of your uncertainty, then a natural solution presents itself, at least in principle: you add your uncertainty about the public's utilities for the various outcomes of the various possible measures into your specification of the decision you face, and you maximise expected utility. Simplifying the choice greatly, we might present it as follows: you think it's 25% likely that the virus is severe, and 75% likely it's mild; you have two public health measures at your disposal, impose restrictions or don't; if the virus is severe, restrictions work pretty well at stemming its spread, and if you impose none it's a disaster; if the virus is mild, however, restrictions create problems and things go badly, though they aren't a disaster, and if you impose none, then everything goes pretty well. Perhaps you know that the utilities the public assigns to the four possible outcomes are given either by the first table below (U1), or by the second (U2):$$\begin{array}{r|cc} \textbf{U1} & \text{Severe} & \text{Mild} \\ & \textit{Probability = 1/4} & \textit{Probability = 3/4} \\ \hline \text{Restrictions} & 10 & 7\\ \text{No Restrictions} & 2 & 10 \end{array}$$

$$\begin{array}{r|cc} \textbf{U2} & \text{Severe} & \text{Mild} \\ & \textit{Probability = 1/4} & \textit{Probability = 3/4} \\ \hline \text{Restrictions} & 12 & 7\\ \text{No Restrictions} & 0 & 10 \end{array}$$

If it's the former, then you maximise expected utility by imposing no restrictions; if it's the latter, you do so by imposing restrictions. So how to choose? Well, you incorporate the uncertainty about the public's utilities into the decision problem. Perhaps you think each is as likely as the other.$$ \begin{array}{r|cccc} & \text{Severe + U1} & \text{Mild + U1} & \text{Severe + U2} & \text{Mild + U2} \\ & \textit{1/8} & \textit{3/8} & \textit{1/8} & \textit{3/8} \\ \hline \text{Restrictions} & 10 & 7 & 12 & 7\\ \text{No Restrictions} & 2 & 10 & 0 & 10 \end{array}$$And again we compare the expected utilities of the two measures: and this time the restrictions win out.
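The three comparisons just described can be checked with a short Python sketch. It follows the prose setup, on which the virus is severe with probability 1/4 and mild with probability 3/4, and on which U1 and U2 are equally likely and independent of the severity of the virus. The dictionary layout and function names are illustrative assumptions of mine.

```python
from fractions import Fraction

# Probabilities of the states of the world, as in the prose description.
p_state = {"severe": Fraction(1, 4), "mild": Fraction(3, 4)}
# Assumed credences in the two candidate utility tables.
p_util = {"U1": Fraction(1, 2), "U2": Fraction(1, 2)}

utilities = {
    "U1": {"restrictions": {"severe": 10, "mild": 7},
           "no restrictions": {"severe": 2, "mild": 10}},
    "U2": {"restrictions": {"severe": 12, "mild": 7},
           "no restrictions": {"severe": 0, "mild": 10}},
}

def expected_utility(option, table):
    # Expected utility of an option when the utility table is known.
    return sum(p_state[s] * table[option][s] for s in p_state)

def fine_grained_eu(option):
    # Expected utility over the four states "severity + utility table".
    return sum(p_util[u] * p_state[s] * utilities[u][option][s]
               for u in p_util for s in p_state)

for option in ("restrictions", "no restrictions"):
    print(option,
          expected_utility(option, utilities["U1"]),
          expected_utility(option, utilities["U2"]),
          fine_grained_eu(option))
```

On these numbers, no restrictions wins under U1 (8 against 7.75), restrictions win under U2 (8.25 against 7.5), and restrictions win on the fine-grained framing (8 against 7.75).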

But let's suppose there is no uncertainty about how the population values the possible outcomes: you're sure it's as described in the first table above (U1). Nonetheless, there is another relevant fact you don't know about the attitudes in the population. You don't know their attitudes to risk.

Notice that the option on which you impose no restrictions is a risky one. It's true that, in the most probable state of the world, where the virus is mild, this turns out very well. But in the other state, where the virus is severe, it turns out very badly. Notice that, as a result, the expected utility of imposing no restrictions only just exceeds the expected utility of imposing some; reduce the utility of no restrictions in the severe case by just one unit of utility and the expectations are equal; reduce it by two, and the restrictions win out.

This is relevant because some people are risk-averse. That is, they give more weight to the worst-case outcomes than expected utility says they should. If offered the choice between a single unit of utility for sure, and the toss of a fair coin that gives three units if heads and none if tails, people will often choose the guaranteed single unit, even though the coin toss has higher expected utility. Plausibly, this is because the worst-case scenario looms larger in their calculation than expected utility theory allows. When assessing the value of the coin toss, the case in which it lands tails and they receive nothing gets more weight than the case in which it lands heads and they receive three units of utility, even though those two cases have the same probability. What's more, many think such behaviour is perfectly rational---and if it is, then it is something that a public health policy-maker should take into account.

Let me pause briefly to describe one way in which we might incorporate our own attitudes to risk into our personal decision-making in a rational way. It's due to Lara Buchak, following John Quiggin's earlier work. In expected utility theory, we take the value of an option to be its expected utility and we choose the most valuable option; in Buchak's risk-weighted expected utility theory, we take the value of an option to be its risk-weighted expected utility and we choose the most valuable option. I'll describe how that works for an option $o$ that is defined on just two states of the world, $s_1$ and $s_2$, where the probability of $s_i$ is $p_i$, and the utility of $o$ at $s_i$ is $u_i$. The expected utility of $o$ is $$EU(o) = p_1u_1 + p_2u_2$$If $u_1 \leq u_2$, this can also be written as $$EU(o) = u_1 + p_2(u_2-u_1)$$That is, to calculate the expected utility of $o$, you take the utility it will gain for you in the worst-case scenario (in this case, $u_1$) and add to it the extra utility you will gain from it in the best-case scenario, weighted by the probability you're in the best-case scenario. Similarly, if $u_2 \leq u_1$, it can be written as $$EU(o) = u_2 + p_1(u_1-u_2)$$You calculate the risk-weighted expected utility in the same way, except that you transform the probability of the best-case scenario using a function, $R$, which Buchak calls your risk function, which encodes your attitudes to risk. So, if $u_1 \leq u_2$, then $$REU_R(o) = u_1 + R(p_2)(u_2 - u_1)$$ And, if $u_2 \leq u_1$, then $$REU_R(o) = u_2 + R(p_1)(u_1-u_2)$$

To illustrate with the choice between restrictions and none:$$\begin{array}{rcl} REU(\text{restrictions}) & = & 7 + R(1/4)(10-7) \\ REU(\text{no restrictions}) & = & 2 + R(3/4)(10-2) \end{array} $$So suppose you are risk-averse, and so wish to place less weight on the best-case scenarios than expected utility requires. That is, $R(1/4) < 1/4$ and $R(3/4) < 3/4$. Perhaps, for instance, $R(1/4) = 1/8$ and $R(3/4) = 5/8$. Then$$\begin{array}{rcl} REU(\text{restrictions}) & = & 7 + (1/8)(10-7) = 7.375\\ REU(\text{no restrictions}) & = & 2 + (5/8)(10-2) = 7 \end{array} $$So you prefer the restrictions. On the other hand, if you're risk-neutral, and follow expected utility theory, in which case your risk function is just $R(p) = p$ for all $0 \leq p \leq 1$, then you prefer no restrictions.

One of the deepest lessons of Buchak's analysis is that two people who agree exactly on how likely each state of the world is, and agree exactly on how valuable each outcome would be, can disagree rationally about what to do. They can do this because they have different attitudes to risk, those different attitudes can be rational, and those attitudes lead them to combine the relevant probabilities and utilities in different ways to give the value of the options under consideration.

Why is this an important lesson? Because ignoring it leads to a breakdown in debate and public deliberation. Often, during the COVID-19 pandemic, participants on both sides of debates about the wisdom of restrictions seemed to assume that any disagreement about what to do must arise from disagreement about the probabilities or about the utilities or simply from the irrationality of their interlocutors. If they felt it was a disagreement over probabilities, they often dismissed their opponents as stupid or ignorant; if they felt it was a disagreement about utilities, they concluded their opponent was selfish or callous. But there was always a further possibility, namely, attitudes to risk, and it seemed to me that often this was really the source of the disagreement.

But it's not my purpose here to diagnose a common failure of pandemic discussions. Instead, I want to draw attention to a problem that arises when we don't know the risk attitudes of the people who will be affected by our decision. To make the presentation simple, let's idealise enormously and suppose that everyone in the population agrees on the probabilities, the utilities, and their attitudes to risk. We know their probabilities and their utilities---they're the ones given in the following table (U1):$$\begin{array}{r|cc} & \text{Severe} & \text{Mild} \\ & \textit{Probability = 1/4} & \textit{Probability = 3/4} \\ \hline \text{Restrictions} & 10 & 7\\ \text{No Restrictions} & 2 & 10 \end{array}$$But we don't know their shared attitudes to risk. They're either risk-neutral, choosing by maximizing expected utility, in which case they prefer no restrictions to restrictions; or they're risk-averse to the extent that they prefer the restrictions. In such a case, how should we choose?

A natural first thought is that we might do as I described above when I asked how to choose when you're unsure of the utilities assigned to the outcomes by those affected. In that case, you simply incorporate your uncertainty about the utilities into your decision problem. And so here we might hope to incorporate your uncertainty about the risk attitudes into your decision problem.

The problem is that, at least on Buchak's account, your attitudes to risk determine the ways you think probabilities and utilities should be combined to give an evaluation of an option. Even if we include our uncertainty about the attitudes to risk into our specification of the decision, so that each state of the world specifies not only whether the virus is mild or severe, but also whether the population is risk-averse or risk-neutral, in order to evaluate an option, you must use a risk function to bring the probabilities of these more fine-grained states of the world together with the utilities of an option at those states to give a value for the option. And which option is evaluated as better depends on which risk attitudes you use to do this.

An analogy might be helpful at this point. Consider a classic case of decision-making under normative uncertainty: you don't know whether utilitarianism or Kantianism is the true moral theory. You're 80% sure it's the former and 20% the latter. Now imagine you're faced with a binary choice and utilitarianism says one option is morally required, while Kantianism says it's the other. How should you proceed? One natural proposal: if we can ask each moral theory how much value they assign to each option at each state of the world, then we can calculate our expected value for each option, taking into account both uncertainty about the world and uncertainty about the normative facts. The problem is that, while utilitarianism would likely think this is the morally right way to make this meta-decision, Kantianism wouldn't. The problem is structurally the same as in the case of uncertainty about risk attitudes: in both cases, one of the things we're uncertain about is the right way to make the sort of decision in question.

This suggests that the literature on normative uncertainty would be a good place to look for a solution to this problem, and indeed you can easily see how putative solutions to that problem might translate into putative solutions to ours: for instance, we might go with the My Favourite Theory proposal and use the risk attitudes that we consider it most likely the population has. But the problems that arise for that proposal arise here as well. In any case, I'll leave this line of investigation until I'm better acquainted with that literature. For the time being, let me wrap up by noting a putative solution that I think won't work.

Inspired by Pietro Cibinel's intriguing recent paper, we might try a contractualist approach to the problem. The idea is that, when someone else chooses on your behalf, depending on what they choose and how things turn out, you might have a legitimate complaint about the choice they make, and it seems a reasonable principle to try to choose in a way that minimizes such complaints across the population for whom you're choosing. To apply this idea, we have to say when you have a legitimate complaint about a decision made on your behalf, and how strong it is. Here is Cibinel's account, which I find compelling: you don't have a legitimate complaint when you'd have chosen the same option; and you don't have one when you'd have chosen a different option, but that different option would have left you worse off than the one that was chosen; but you do have a legitimate complaint when you'd have chosen differently, and your choice would have left you better off than the choice that was made---furthermore, the strength of the complaint is proportional to how much better off you'd have been had they chosen the option you favoured. Based on this, we have the following set of complaints in the different possible states of the world, where 'R-a' means 'risk-averse', and 'R-n' means 'risk-neutral':$$ \begin{array}{r|cccc} & \text{Severe + R-a} & \text{Mild + R-a} & \text{Severe + R-n} & \text{Mild + R-n} \\ & \textit{1/8} & \textit{3/8} & \textit{1/8} & \textit{3/8} \\ \hline \text{Restrictions} & 0 & 0 & 0 & 3\\ \text{No Restrictions} & 8 & 0 & 0 & 0 \end{array}$$ In this situation, how should we choose?

One initial thought might be that you should choose the option with the lowest worst complaint: that is, we should choose the restrictions because, in its worst-case scenario, it generates a complaint of strength 3, while in its worst-case scenario, imposing no restrictions generates a complaint of strength 8. The problem with this is that it isn't sensitive to probabilities: yes, imposing no restrictions might generate a worse complaint than imposing the restrictions might, but it would do so with considerably lower probability. But if we are to choose in a way that is sensitive to the probabilities, our problem arises again: we must combine the probabilities of different states of the world with something like the utilities of choosing particular options at those states---the utility of a complaint is the negative of its strength, say---and different risk attitudes demand different ways of doing that. If we rank the options by the risk-weighted expectation of the legitimate complaints they generate, we see that a risk-neutral person would prefer no restrictions, while a sufficiently risk-averse person will favour restrictions.$$ \begin{array}{rcl} REU(\text{restrictions}) & = & -3 + R(5/8)\times 3\\ REU(\text{no restrictions}) & = & -8 + R(7/8) \times 8 \end{array} $$ So, if $R(5/8) = 4/8$ and $R(7/8) = 6/8$, then restrictions exceeds no restrictions, but if $R(5/8) = 5/8$ and $R(7/8) = 7/8$, then no restrictions wins out.
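To make the complaint-based calculation concrete, here is a rough Python sketch. It derives the complaint strengths from the U1 utilities using the rule described above, then ranks the options by the risk-weighted expectation of the negated complaints. Several things here are my own assumptions for illustration: the severe state is the one with probability 1/4, as in the original setup; the two risk types are equally likely; the risk-averse function is specified only at the two points the text mentions; and all names are hypothetical.

```python
from fractions import Fraction

utility = {"restrictions": {"severe": 10, "mild": 7},
           "no restrictions": {"severe": 2, "mild": 10}}
p_state = {"severe": Fraction(1, 4), "mild": Fraction(3, 4)}
p_type = {"R-a": Fraction(1, 2), "R-n": Fraction(1, 2)}  # assumed credences
favoured = {"R-a": "restrictions", "R-n": "no restrictions"}

def complaint(chosen, risk_type, state):
    # A complaint arises only if the favoured option would have done better.
    better_by = utility[favoured[risk_type]][state] - utility[chosen][state]
    return max(0, better_by)

def complaint_outcomes(chosen):
    # (probability, utility) pairs; the utility of a complaint is minus its strength.
    return [(p_type[t] * p_state[s], -complaint(chosen, t, s))
            for t in p_type for s in p_state]

def reu(outcomes, risk):
    ordered = sorted(outcomes, key=lambda pu: pu[1])
    total = ordered[0][1]
    for i in range(1, len(ordered)):
        gain = ordered[i][1] - ordered[i - 1][1]
        if gain:  # R is only consulted where the utility actually rises
            tail = sum(p for p, _ in ordered[i:])
            total += risk(tail) * gain
    return total

# Risk-averse function, tabulated only at the two points given in the text.
averse_table = {Fraction(5, 8): Fraction(4, 8), Fraction(7, 8): Fraction(6, 8)}

def averse(p):
    return averse_table[p]

def neutral(p):
    return p

for name, risk in (("risk-averse", averse), ("risk-neutral", neutral)):
    print(name,
          reu(complaint_outcomes("restrictions"), risk),
          reu(complaint_outcomes("no restrictions"), risk))
```

The risk-averse evaluation gives $-3/2$ for restrictions against $-2$ for no restrictions, while the risk-neutral evaluation gives $-9/8$ against $-1$, so the ranking flips with the risk function, as claimed.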

What to do? I don't know. It's an instance of a more general problem. When you face a decision under uncertainty, you are trying to choose the best means to your ends. But there might be reasonable disagreement about how to do this. Different people might have different favoured ways of evaluating the available means to their ends, even when the ends themselves are shared. If you are charged with making a decision on behalf of a group of people in which there is such disagreement, or in which there is in fact agreement but you are uncertain of which approach they all favour, you must choose how to choose on their behalf. But the different ways of choosing means to ends might also give different conclusions about how to make the meta-decision about how to choose on their behalf. And then it is unclear how you should proceed. This is a social choice problem that often receives less attention.

What to do when a transformative experience might change your attitudes to risk

2023-04-03T15:40:00.003+01:00

A PDF of this blogpost is available here.

A transformative experience is one that changes something significant about you. In her book, Laurie Paul distinguishes two sorts of transformative experience, one epistemic and one personal. An epistemically transformative experience changes you by providing knowledge that you couldn't have acquired other than by having the experience: you eat a durian fruit and learn what it's like to do so; you become a parent and learn what it's like to be one. A personally transformative experience changes other aspects of you, such as your values: you move to another country and come to adopt more of the value system that is dominant there; you become a parent and come to value the well-being of your child over other concerns you previously held dear.

Both types of experience seem to challenge our standard approach to decision theory. Often, I must know what an experience is like before I can know how valuable it is to have it---the phenomenal character of the experience is part of what determines its value. So, if I can only know what it is like to be a parent by becoming one, I can't know in advance the utility I assign to becoming one. Yet this utility seems to be an essential ingredient I must feed into my theory of rational choice to make the decision whether or not to become a parent. What's more, if my values will change when I have the experience, I must decide whether to use my current values or my future values to make the choice. Yet neither seems privileged.

Elsewhere (here and here), I've argued that these problems are surmountable. To avoid the first, we simply include our uncertainty about the utilities we assign to the different outcomes in the framing of the decision problem, so that the states of the world over which the different available options are defined specify not only what the world is like, but also what utilities I assign to those states. To avoid the second, we set the global utility of a state of the world to be some aggregate of the local utilities I'll have at different times within that state of the world, and we use our global utilities in our decision-making. I used to think these solutions settled the matter: transformative experience does not pose a problem for decision theory. But then I started thinking about risk, and I've convinced myself that there is a serious problem lurking. In this blogpost, I'll try to convince you of the same.

Sapphire takes a risky step

The problem of choosing for changing risk attitudes

I'm going to start with a case that's a little different from the usual transformative experience cases. Usually, in those cases, it is you making the decision on your own behalf; in our example, it is someone else making the decision on your behalf. At the very end of the post, I'll return to the standard question about how you should choose for yourself.

So suppose you arrive at a hospital unconscious and in urgent need of medical attention. The doctor who admits you quickly realises it's one of two ailments causing your problems, but they don't know which one: they think there's a 25% chance it's the first and a 75% chance it's the second. There are two treatments available: the first targets the first possible ailment perfectly and restores you to perfect health if you have that, as well as doing a decent but not great job if you have the second possible ailment; the second treatment targets the second ailment perfectly and restores you to perfect health if you have that, but does very poorly against the first ailment. Which treatment should the doctor choose? Let's put some numbers to this decision to make it explicit. There are two options: Treatment 1 and Treatment 2. There are two states: Ailment 1 and Ailment 2. Here are the probabilities and utilities:

$$\begin{array}{r|cc} \textit{Probabilities} & \text{Ailment 1} & \text{Ailment 2} \\ \hline P & 1/4 & 3/4 \end{array}$$

$$\begin{array}{r|cc} \textit{Utilities} & \text{Ailment 1} & \text{Ailment 2} \\ \hline \text{Treatment 1} & 10 & 7\\ \text{Treatment 2} & 2 & 10 \end{array}$$

How should the doctor choose? A natural thing to think is that, were they to know what you would choose when faced with this decision, they should just go with that. For instance, they might talk to your friends and family and discover that you're a risk-neutral person who makes all of their choices in line with the dictates of expected utility theory. That is, you evaluate each option by its expected utility---the sum of the utilities of its possible outcomes, each weighted by its probability of occurring---and pick an option among those with the highest evaluation. So, in this case, you'd pick Treatment 2, whose expected utility is $(1/4 \times 2) + (3/4 \times 10) = 8$, over Treatment 1, whose expected utility is $(1/4 \times 10) + (3/4 \times 7) = 7.75$.

On the other hand, when they talk to those close to you they might learn instead that you're rather risk-averse and make all of your choices in line with the dictates of Lara Buchak's risk-weighted expected utility theory with a specific risk function. Let me pause here briefly to introduce that theory and what it says. I won't explain it in full; I'll just say how it tells you to choose when the options available are defined over just two states of the world, as they are in the decision that faces the doctor when you arrive at the hospital.

It's pretty straightforward to see that the expected utility of an option over two states can be calculated as follows: you take the minimum utility it might obtain for you, that is, the utility it gives in the worst-case scenario; then you add to that the extra utility you would get in the best-case scenario but you weight that extra utility by the probability that you'll get it. So, for instance, the expected utility of Treatment 1 is $7 + (1/4)\times(10-7)$, since that is equal to $(1/4 \times 10) + (3/4 \times 7)$. The risk-weighted expected utility of an option is calculated in the same way except that, instead of weighting the extra utility you'll obtain in the best-case scenario by the probability you'll obtain it, you weight it by that probability after it's been transformed by your risk function, which is a formal element of Buchak's theory that represents your attitudes to risk in the same way that your probabilities represent your doxastic attitudes. Let's write $R$ for that risk function---it takes a real number between 0 and 1 and returns a real number between 0 and 1. Then the risk-weighted expected utility of Treatment 1 is $7 + R(1/4)\times(10-7)$. And the risk-weighted expected utility of Treatment 2 is $2 + R(3/4)\times(10-2)$. So suppose for instance that $R(p) = p^2$. That is, your risk function takes your probability and squares it. Then the risk-weighted expected utility of Treatment 1 is $7+R(1/4)\times (10-7) = 7.1875$. And the risk-weighted expected utility of Treatment 2 is $2+R(3/4)\times (10-2) = 6.5$. So, in this case, you'd pick Treatment 1 over Treatment 2, and the doctor should too.
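Here is a small Python sketch of the two-state calculation, for anyone who wants to vary the numbers. The function name `reu_two_states` and its argument order are my own illustrative choices; the probabilities, utilities, and the risk function $R(p) = p^2$ are the ones from the example.

```python
from fractions import Fraction

def reu_two_states(p1, u1, p2, u2, risk):
    """Risk-weighted expected utility of an option defined on two states."""
    if u1 <= u2:
        # State 1 is the worst case; weight the gain in state 2 by R(p2).
        return u1 + risk(p2) * (u2 - u1)
    # Otherwise state 2 is the worst case; weight the gain in state 1 by R(p1).
    return u2 + risk(p1) * (u1 - u2)

p_ailment_1, p_ailment_2 = Fraction(1, 4), Fraction(3, 4)

def neutral(p):
    return p        # risk-neutral: recovers expected utility

def averse(p):
    return p ** 2   # the risk-averse function from the text

treatment_1 = {r.__name__: reu_two_states(p_ailment_1, 10, p_ailment_2, 7, r)
               for r in (neutral, averse)}
treatment_2 = {r.__name__: reu_two_states(p_ailment_1, 2, p_ailment_2, 10, r)
               for r in (neutral, averse)}

print(treatment_1)  # neutral: 31/4 (7.75), averse: 115/16 (7.1875)
print(treatment_2)  # neutral: 8, averse: 13/2 (6.5)
```

So the risk-neutral evaluation favours Treatment 2, while the risk-averse one favours Treatment 1.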

So far, so simple. But let's suppose that what the doctor actually discovers when they try to find out about your attitudes to risk is that, while you're currently risk-averse---and indeed choose in line with risk-weighted expected utility theory with risk function $R(p) = p^2$---your attitudes to risk might change depending on which option is chosen and what outcome eventuates. In particular, if Treatment 2 is chosen, you'll retain your risk-averse risk function for sure; and if Treatment 1 is chosen and things go well, so that you get the excellent outcome, then again you'll retain it; but if Treatment 1 is chosen and things go poorly, then you'll shift from being risk-averse to being risk-neutral---perhaps you see that being risk-averse hasn't saved you from a bad outcome, and you'd have been better off going risk-neutral and picking Treatment 2. How then should the doctor choose on your behalf?

Let me begin by explaining why we can't deal with this problem in the way I proposed to deal with the case of personally transformative experiences above. I'll begin by saying briefly how I did propose to deal with those cases. Suppose that, instead of changing your attitudes to risk, Treatment 1 might change your utilities---what I call your local utilities; the ones you have at a particular time. Let's suppose the first table below gives the local utilities you assign before receiving treatment, as well as after receiving Treatment 2, or after receiving Treatment 1 if you have Ailment 1; the second table gives the local utilities you assign after receiving Treatment 1 if you have Ailment 2.

$$ \begin{array}{r|cc} \textit{Utilities} & \text{Ailment 1} & \text{Ailment 2} \\ \hline \text{Treatment 1} & 10 & 7\\ \text{Treatment 2} & 2 & 10 \end{array}$$

$$\begin{array}{r|cc} \textit{Utilities} & \text{Ailment 1} & \text{Ailment 2} \\ \hline \text{Treatment 1} & 8 & 1\\ \text{Treatment 2} & 4 & 4 \end{array} $$

Then I proposed that we take your global utility for a treatment at a state of the world to be some weighted average of the local utilities you'd assign to that treatment at that state of the world, at the different times, were you to choose that treatment. So the global utility of Treatment 1 if you have Ailment 2 would be a weighted average of its local utility before treatment, which is 7, and its local utility after treatment, which is 1---perhaps we'd give equal weights to both selves, and settle on 4 as the global utility. So the following table would give the global utilities:

$$ \begin{array}{r|cc} \textit{Utilities} & \text{Ailment 1} & \text{Ailment 2} \\ \hline \text{Treatment 1} & 10 & 4\\ \text{Treatment 2} & 2 & 10 \end{array}$$

We'd then apply our favoured decision theory with those global utilities.
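As a rough illustration, the aggregation just described can be sketched in Python. The dictionary layout, the function names, and the equal weighting are assumptions made for this sketch; only the numbers come from the tables above.

```python
# Local utilities from the two tables above, indexed by (treatment, ailment).
# `before` is held before treatment and in every post-treatment situation
# except Treatment 1 with Ailment 2, where `after_t1_a2` takes over.
before = {("T1", "A1"): 10, ("T1", "A2"): 7, ("T2", "A1"): 2, ("T2", "A2"): 10}
after_t1_a2 = {("T1", "A1"): 8, ("T1", "A2"): 1, ("T2", "A1"): 4, ("T2", "A2"): 4}


def later_utilities(treatment, ailment):
    """Local utility function held after `treatment` in state `ailment`."""
    changed = (treatment, ailment) == ("T1", "A2")
    return after_t1_a2 if changed else before


def global_utility(treatment, ailment, weight_before=0.5):
    """Weighted average of the earlier and later local utilities of one outcome."""
    key = (treatment, ailment)
    u_before = before[key]
    u_after = later_utilities(treatment, ailment)[key]
    return weight_before * u_before + (1 - weight_before) * u_after


global_table = {
    (t, a): global_utility(t, a) for t in ("T1", "T2") for a in ("A1", "A2")
}
print(global_table)
```

With equal weights, only the Treatment 1 and Ailment 2 entry moves, from 7 to 4; a different choice of `weight_before` would give a different global utility there.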

Surely we could use something similar in the present case, where the utilities don't change, but the attitudes to risk do? Surely we can aggregate your risk function at the earlier time and at the later time and use that to make your decisions? Again, maybe a weighted average would work? But the problem is that, at least if you take Treatment 1, which risk function you have at the later time depends on whether you have Ailment 1 or Ailment 2. That's not a problem when we aggregate local utilities to give global utilities, because we just need the utility of a treatment at a specific state of the world generated by the local utilities you'll have at that state of the world if you undergo that treatment. But risk functions don't figure into our evaluations of acts in that way. They specify a way that the probabilities and utilities of different outcomes should be brought together to give an overall evaluation of the option. And that means we can't aggregate risk functions held at different times within one state of the world to give the risk function we should use to evaluate an option, since an option is defined at different states of the world.

Suppose, for instance, we were to aggregate the risk function you have now and the risk function you'll have if you undergo Treatment 1 and have Ailment 1: they're both the same, so the aggregate will probably be the same. Now suppose we were to aggregate the risk function you have now and the risk function you'll have if you undergo Treatment 1 and have Ailment 2: they're different, so the aggregate is probably some third risk function. Now suppose we come to evaluate Treatment 1 using risk-weighted expected utility theory: which of these two aggregate risk functions should we use? There is no non-arbitrary way to choose.

Choosing to limit legitimate complaints

Is this the only route we can try for a solution? Not quite, I think, but it will turn out that the other two I've considered also lead to problems. The first, which I'll introduce in this section, is based on a fascinating paper by Pietro Cibinel. I won't explain the purpose for which he introduces his account, but in that paper he proposes that, when we are choosing for others with different attitudes to risk, we should use a complaints-centred approach. That is, when we evaluate the different options available, we should ask: if we were to choose this, what complaints would the affected individuals have in the different states of the world? And we should try to choose options that minimise the legitimate complaints they'll generate.

Cibinel offers what seems to me the right account of what complaints an individual might legitimately make when an option is chosen on their behalf. He calls it the Hybrid Competing Claims Model. Here's how it goes: If you would have chosen the same option that the decision-maker chooses, you have no legitimate complaints regardless of how things turn out; if you would not have chosen the same option the decision-maker chooses, but the option they choose is better for you, in the state you actually inhabit, than the option you would have chosen, then you have no legitimate complaint; but if you would not have chosen the same option the decision-maker chooses and the option they choose is worse for you, in the state you actually inhabit, than the option you would have chosen, then you do have a legitimate complaint---and, what's more, we measure its strength to be how much more utility you'd have received at the state you inhabit from the option you'd have chosen than you receive from the option chosen for you.

Let's see that work out in the case of Treatment 1 and Treatment 2. Suppose you prefer Treatment 1 to Treatment 2: if the doctor chooses Treatment 1, you have no complaint, whether you have Ailment 1 or Ailment 2; if the doctor chooses Treatment 2 and you have Ailment 1, then you have a complaint, since Treatment 1 gives 10 units of utility while Treatment 2 gives only 2, and you preferred Treatment 1; what's more, your complaint has strength $10-2=8$; if the doctor chooses Treatment 2 and you have Ailment 2, then you have no complaint, since you preferred Treatment 1, but Treatment 2 in fact leaves you better off in this state.

So let's now look at the legitimate complaints that would be made by your different possible selves in the four different treatment/ailment pairs. The first table below reminds us of the risk attitudes you'll have in those four situations, while the second table records the preferences between the two treatments that those risk attitudes entail, when combined with the probabilities---25% chance of Ailment 1; 75% chance of Ailment 2.

$$ \begin{array}{r|cc} \textit{Risk} & \text{Ailment 1} & \text{Ailment 2} \\ \hline \text{Treatment 1} & R(p) = p^2 & R(p) = p\\ \text{Treatment 2} & R(p) = p^2 & R(p) = p^2 \end{array} $$

$$\begin{array}{r|cc} \textit{Preferences} & \text{Ailment 1} & \text{Ailment 2} \\ \hline \text{Treatment 1} & T_2 \prec T_1 & T_1 \prec T_2\\ \text{Treatment 2} & T_2 \prec T_1 & T_2 \prec T_1 \end{array} $$

And the following tables give the utilities and the strengths of the complaints you'd legitimately make in each situation (the probabilities, as before, are 25% for Ailment 1 and 75% for Ailment 2):

$$\begin{array}{r|cc} \textit{Utilities} & \text{Ailment 1} & \text{Ailment 2} \\ \hline \text{Treatment 1} & 10 & 7\\ \text{Treatment 2} & 2 & 10 \end{array}$$

$$ \begin{array}{r|cc} \textit{Complaints} & \text{Ailment 1} & \text{Ailment 2} \\ \hline \text{Treatment 1}& 0 & 3\\ \text{Treatment 2} & 8 & 0 \end{array}$$
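The complaints table can be reproduced mechanically. What follows is a minimal sketch, not Cibinel's own formulation: the function names and data layout are invented here, and `reu` implements the standard risk-weighted expected utility formula for a finite set of states.

```python
def reu(utilities, probs, risk):
    """Risk-weighted expected utility of an option.

    `utilities` and `probs` are parallel lists over states; `risk` is R.
    """
    outcomes = sorted(zip(utilities, probs))  # worst outcome first
    value = outcomes[0][0]
    for i in range(1, len(outcomes)):
        prob_at_least = sum(p for _, p in outcomes[i:])
        value += risk(prob_at_least) * (outcomes[i][0] - outcomes[i - 1][0])
    return value


PROBS = {"A1": 0.25, "A2": 0.75}
UTILS = {"T1": {"A1": 10, "A2": 7}, "T2": {"A1": 2, "A2": 10}}


def averse(p):
    return p ** 2


def neutral(p):
    return p


def risk_after(treatment, ailment):
    """Risk function you end up with in each treatment/ailment situation."""
    return neutral if (treatment, ailment) == ("T1", "A2") else averse


def preferred(risk):
    """Treatment that a self with risk function `risk` would have chosen."""
    states = list(PROBS)
    score = {
        t: reu([UTILS[t][s] for s in states], [PROBS[s] for s in states], risk)
        for t in UTILS
    }
    return max(score, key=score.get)


def complaint(chosen, ailment):
    """Complaint strength: utility lost relative to the self's own choice."""
    own_choice = preferred(risk_after(chosen, ailment))
    if own_choice == chosen:
        return 0
    return max(0, UTILS[own_choice][ailment] - UTILS[chosen][ailment])


complaints = {(t, a): complaint(t, a) for t in UTILS for a in PROBS}
print(complaints)
```

The `max(0, ...)` clause encodes the rule that you have no complaint when the option chosen for you turns out better, in the state you inhabit, than the one you would have chosen.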

Now, having given this, we need a decision rule: given these different possible complaints, which should we choose? Cibinel doesn't take a line on this, because he doesn't need to for his purposes: he just needs to assume that, if one option creates no possible complaints while another creates some, you should choose the former. But we need more than that, since that principle doesn't decide between Treatment 1 and Treatment 2. And there are many different ways you might go. I won't be able to survey them all, but I hope to say enough to convince you that none is without its problems.

First, you might say that we should pick the option whose worst possible complaint is best. In this case, that would be Treatment 1. Its worst complaint occurs if you have Ailment 2 and has strength 3, while the worst complaint of Treatment 2 occurs if you have Ailment 1 and has strength 8.

I see a couple of problems with that proposal, both of which stem from the fact that this proposal essentially applies the Minimax decision rule to strengths of complaint rather than utilities---that decision rule says pick the option with the best worst-case outcome. But there are two problems with Minimax that make it unsuitable for the current situation.

First, it is a risk-averse decision rule. Indeed, it is plausibly the most extremely risk-averse decision rule there is, for it pays attention only to the worst-case scenario. Why is that inappropriate? Well, because one of the things we're trying to do is adjudicate between a risk-neutral risk function, which you'll have if the doctor chooses Treatment 1 and it goes poorly, and a risk-averse risk function, which you'll have in all other situations. So using the most risk-averse decision rule seems to play favourites, giving undue priority to the more risk-averse of your possible risk functions.

The second problem is that Minimax is insensitive to the probabilities of the states. While Treatment 2 gives rise to the worst worst-case complaint, it does so in a less probable state of the world. Surely we should take that into account? But Minimax doesn't.

How might we take the probability of different complaints into account? The problem is that it is exactly the job of a theory of rational decision-making under uncertainty to do this---such theories are designed precisely to combine judgments of how good or bad an option is at different states of the world, together with judgments of how likely those states of the world are, to give an overall evaluation and from that a ranking of the options. So, if you subscribe to expected utility theory, you'll think we should minimize expected complaint strength; if you subscribe to risk-weighted expected utility theory with a particular risk function, you'll think we should minimize risk-weighted expected complaint strength. But which decision theory to use is precisely the point of disagreement between some of the different possible complainants in our central case. So, for instance, suppose we approach this problem from the point of view of the risk-averse risk function that you will have in three out of the four possible situations. We minimise risk-weighted expected complaint strength by maximising the risk-weighted expectation of the negative of the complaints. So the risk-weighted expected negative complaint strength for Treatment 1 is $-3 + R(1/4)\times (0-(-3)) = -2.8125$, while for Treatment 2 it is $-8+R(3/4)\times (0-(-8)) = -3.5$. So Treatment 1 is preferable. But now suppose we approach the problem from the point of view of the risk-neutral risk function you will have in one of the four possible situations. Then the expected negative complaint strength for Treatment 1 is $(1/4)\times 0 + (3/4)\times (-3) = -2.25$, while for Treatment 2 it is $(1/4)\times (-8) + (3/4)\times 0 = -2$. So Treatment 2 is preferable. That is, the different selves you might be after the treatment disagree on which is to be preferred from the point of view of a complaints-centred approach. And so that approach doesn't help us a great deal.
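The four figures in that paragraph can be checked with a short Python sketch. The names `reu`, `averse`, `neutral` and `scores` are invented for this illustration; the complaint strengths and probabilities are the ones given above.

```python
def reu(utilities, probs, risk):
    """Risk-weighted expected utility over finitely many states."""
    outcomes = sorted(zip(utilities, probs))  # worst outcome first
    value = outcomes[0][0]
    for i in range(1, len(outcomes)):
        prob_at_least = sum(p for _, p in outcomes[i:])
        value += risk(prob_at_least) * (outcomes[i][0] - outcomes[i - 1][0])
    return value


def averse(p):
    return p ** 2


def neutral(p):
    return p  # with R(p) = p, reu is ordinary expected utility


probs = [0.25, 0.75]                       # Ailment 1, Ailment 2
complaints = {"T1": [0, 3], "T2": [8, 0]}  # complaint strength per state

scores = {
    name: {
        option: reu([-c for c in strengths], probs, risk)
        for option, strengths in complaints.items()
    }
    for name, risk in (("risk-averse", averse), ("risk-neutral", neutral))
}

for name, by_option in scores.items():
    best = max(by_option, key=by_option.get)
    print(name, by_option, "prefers", best)
```

The risk-averse evaluation ranks Treatment 1 first and the risk-neutral evaluation ranks Treatment 2 first, which is the disagreement described above.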

Putting risk attitudes into the utilities

Let's try a different approach. According to Buchak's approach, your attitudes to risk are a third component of your mental state that is combined with your utilities and your probabilities using her formula to give an overall evaluation for an option. Your attitudes to risk don't affect your utilities for different outcomes; instead, they tell you how to combine those utilities with your probabilities to give the evaluation. But perhaps it's more natural to say that our attitudes to risk are really components that partly determine the utilities we assign to different outcomes as a result of different options. And once we do that, we can just use expected utility theory to make our decisions, knowing that our attitudes to risk are factored in to our utilities. Indeed, it turns out that we can do that in a way that perfectly recovers the preferences given by risk-weighted expected utility theory with a particular risk function (see here and here).

Suppose $R$ is your risk function. And suppose an option $o$ gives utility $u_1$ in state $s_1$, which has probability $p_1$, and utility $u_2$ in state $s_2$, which has probability $p_2$. And suppose $u_1 < u_2$. Then$$REU(o) = u_1 + R(p_2)(u_2 - u_1)$$But then$$REU(o) = p_1\left ( \frac{1-R(p_2)}{p_1}u_1 \right ) + p_2\left ( \frac{R(p_2)}{p_2} u_2 \right )$$So if we let $\frac{1-R(p_2)}{p_1}u_1$ be the utility of this option at state $s_1$ and we let $\frac{R(p_2)}{p_2} u_2$ be the utility of this option at state $s_2$, then we can represent you as evaluating an option by its expected utility, and maximising risk-weighted expected utility with the old utilities is the same as maximising expected utility with these new ones. The attraction of this approach is that it transforms a change in risk attitudes into a change in utilities, and that's something we know how to deal with.
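For readers who want the step between the two expressions for $REU(o)$ spelled out, it is just a regrouping of terms followed by multiplying and dividing each term by the probability of its state:$$u_1 + R(p_2)(u_2 - u_1) = (1 - R(p_2))u_1 + R(p_2)u_2 = p_1 \cdot \frac{1-R(p_2)}{p_1}u_1 + p_2 \cdot \frac{R(p_2)}{p_2}u_2$$As a sanity check, in the risk-neutral case $R(p) = p$ both factors equal $1$, since $1 - p_2 = p_1$, so the new utilities coincide with the old ones. Note also that which state receives which factor depends on which state the option ranks as worse, so the transformation is relative to the option as well as to the state.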

But I think there's a problem. When we transform the utilities to incorporate the risk attitudes, so that maximising expected utility with respect to the transformed utilities is equivalent to maximising risk-weighted expected utility with respect to the untransformed utilities, they don't seem to incorporate them in the right way. Let's suppose we have the following decision:

$$\begin{array}{r|cc} \textit{Probabilities} & s_1 & s_2 \\ \hline P & 1/2 & 1/2 \end{array}$$

$$\begin{array}{r|cc} \textit{Utilities} & s_1 & s_2 \\ \hline o_1 & 9 & 1\\ o_2 & 4 & 5 \end{array}$$

So it's really a choice between a risky option ($o_1$) with higher expected utility and a less risky one ($o_2$) with lower expected utility. It's easy to see that, if $R(p) = p^2$, then$$REU(o_1) = 1+(1/4)\times (9-1) = 3 < 4.25 = 4 + (1/4)\times (5-4) = REU(o_2)$$even though $EU(o_1) > EU(o_2)$.

Now, you'd expect that we'd incorporate risk into the utilities in this case by simply inflating the utilities of $o_2$ on the grounds that this option is less risky, and maybe deflating the utilities of $o_1$. But that's not what happens. Here are the new utilities once we've incorporated the attitudes to risk in the way described above:

$$\begin{array}{r|cc} \text{Probabilities} & s_1 & s_2 \\ \hline P & 1/2 & 1/2 \end{array}$$

$$\begin{array}{r|cc} \text{Utilities} & s_1 & s_2 \\ \hline o_1 & 9\frac{R(1/2)}{1/2} = 4.5 & 1 \frac{1-R(1/2)}{1/2} = 1.5\\ o_2 & 4\frac{1-R(1/2)}{1/2} = 6 & 5 \frac{R(1/2)}{1/2} = 2.5 \end{array}$$

Notice that the utilities of the worst-case outcomes of the two options are inflated, while the utilities of the best-case outcomes are deflated. These don't seem like the utilities of someone who values less risky options, and therefore values outcomes more when they are the result of a less risky option. And indeed they just seem extremely strange utilities to have. Suppose you pick the less risky option, $o_2$, in line with your preferences. Then, in state $s_2$, even though you've picked the option you most wanted, and you're in the situation that gets you most of the original utility---5 units as opposed to the 4 you'd have received in state $s_1$---you'd prefer to be in state $s_1$---after all, that gives 6 units of the new utility, whereas $s_2$ gives you only 2.5.
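The transformed utilities in the table can be recomputed with a few lines of Python. This is only a sketch for the two-state case, and the names `transformed` and `expected` are made up for it.

```python
def risk(p):
    return p ** 2


probs = {"s1": 0.5, "s2": 0.5}
utils = {"o1": {"s1": 9, "s2": 1}, "o2": {"s1": 4, "s2": 5}}


def transformed(option):
    """Utilities that absorb the risk function, for a two-state option."""
    u = utils[option]
    worse, better = sorted(u, key=u.get)  # states ordered by this option's utility
    r = risk(probs[better])
    return {
        worse: (1 - r) / probs[worse] * u[worse],
        better: r / probs[better] * u[better],
    }


def expected(values):
    """Ordinary expectation of state-indexed values."""
    return sum(probs[s] * values[s] for s in probs)


for option in utils:
    new = transformed(option)
    print(option, new, "expected new utility:", expected(new))
```

The expectation of the new utilities is 3 for $o_1$ and 4.25 for $o_2$, matching the risk-weighted expected utilities computed earlier, while the worse outcome of each option is scaled up by 1.5 and the better one scaled down by 0.5.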

So, in the end, I don't think this approach is going to help us either. Whatever is going on with individuals who set their preferences using probabilities, utilities, and a risk function in the way Buchak's theory requires, they are not maximizing the expectation of some alternative conception of utility that incorporates their attitudes to risk.

Choosing for others and choosing for yourself

It's at this point that I can't see a way forward, so I'll leave it here in the hope that someone else can. But let me conclude by noting that, just as the problem arises for the doctor choosing on behalf of the patient in front of them, so it seems to arise in exactly the same way for the individual choosing on behalf of their future selves. Instead of arriving at the hospital unconscious so that the doctor must choose on your behalf, suppose you arrive fully conscious and have to choose between the two treatments based on the different outcomes the doctor describes to you and the probabilities that you assign to each based on the doctor's expertise. Then you face the same choice on behalf of your future selves in this example that the doctor faced on their behalf in the original case. And it seems like the same moves are available, and they're problematic for the same reason: transformative experiences that might lead to changes in your attitudes to risk pose a problem for decision theory.

The Robustness of the Diversity Prediction Theorem II: problems with asymmetry

2023-03-17T17:00:00.002+00:00

There is a PDF of this post here.

Take a quantity whose value you wish to estimate---the basic reproduction number for a virus; the number of jelly beans in a jar at the school fête; the global temperature rise caused by doubling the concentration of CO2 in the atmosphere; the number of years before humanity goes extinct. Ask a group of people to provide their estimate of that value, and take the mean of their answers. The Diversity Prediction Theorem says that, if you measure distance as squared difference, so that, for example, the distance from $2$ to $5$ is $(2-5)^2$, the distance that the mean of the answers lies from the true value will be equal to the average distance from the answers to the true value less a quantity that measures the diversity of the answers, namely, the average distance from the answers to the mean answer.

In a previous blogpost, I asked: to what extent is this result a quirk of squared difference? Extending a result due to David Pfau, I showed that it is true of exactly the Bregman divergences. But there's a problem. In the Diversity Prediction Theorem, we measure the diversity of estimates as the average distance from the individual answers to the mean answer. But why this, and not the average distance to the individual answers from the mean answer? Of course, if we use squared difference, then these values are the same, because squared difference is a symmetric measure of distance: the squared difference from one value to another is the same as the squared difference from the second value to the first. And so one set of answers will be more diverse than another according to one definition if it is more diverse according to the other definition. But squared difference is in fact the only symmetric Bregman divergence. So, for every other Bregman divergence, the two definitions of diversity come apart.

This has an odd effect on the Diversity Prediction Theorem. One of the standard lessons the theorem is supposed to teach is that the mean of a more diverse group is more accurate than the mean of a less diverse group. In fact, even the original version of the theorem doesn't tell us that. It tells us that the mean of a more diverse group is more accurate than the mean of a less diverse group when the average distance from the truth is the same for both groups. But, if we use a non-symmetric distance measure, i.e., one of the Bregman divergences that isn't squared error, and we use the alternative measure of diversity mentioned in the previous paragraph---that is, the average distance to the individual answers from the mean answer---then we can get a case in which the mean of a less diverse group is more accurate than the mean of a more diverse group, even though the average distance from the answers to the truth is the same for both groups. So it seems that we have three choices: (i) justify using squared difference only; (ii) justify the first of the two putative definitions of diversity in terms of average distance between mean answer and the answers; (iii) give up the apparent lesson of the Diversity Prediction Theorem that diversity leads to more accurate average answers. For my money, I think (iii) is the most plausible.

Let me finish off by providing an example. First, define two measures of distance:

Squared difference: $q(x, y) = (x-y)^2$

Generalized Kullback-Leibler divergence: $l(x, y) = x\log(x/y) - x + y$

Then the Diversity Prediction Theorem says that, for any $a_1, \ldots, a_n, t$, if $a^\star = \frac{1}{n}\sum^n_{i=1} a_i$,$$q(a^\star, t) = \frac{1}{n}\sum^n_{i=1} q(a_i, t) - \frac{1}{n}\sum^n_{i=1} q(a_i, a^\star)$$And the generalization I discussed in the previous blogpost entails that$$l(a^\star, t) = \frac{1}{n}\sum^n_{i=1} l(a_i, t) - \frac{1}{n}\sum^n_{i=1} l(a_i, a^\star)$$But drawing from this the conclusion that groups with the same average distance to the truth are more accurate if they're more diverse relies on defining the diversity of the estimates $a_1, \ldots, a_n$ to be $\frac{1}{n}\sum^n_{i=1} l(a_i, a^\star)$, rather than $\frac{1}{n}\sum^n_{i=1} l(a^\star, a_i)$. Suppose that we define it in the second way instead. And take two groups each containing two individuals. Here are their estimates of a quantity (perhaps the R number of a virus):$$\begin{array}{c|c|c|c|c|c} a_1 & a_2 & a^\star & b_1 & b_2 & b^\star \\ \hline 0.5 & 0.1 & 0.3 & 0.3 & 0.9 & 0.6 \end{array}$$Then $a_1, a_2$ is less diverse than $b_1, b_2$ according to the original definition of diversity, but more diverse according to the second definition. That is,$$\frac{l(0.5, 0.3) + l(0.1, 0.3)}{2} < \frac{l(0.3, 0.6) + l(0.9, 0.6)}{2}$$and$$\frac{l(0.3, 0.5) + l(0.3, 0.1)}{2} > \frac{l(0.6, 0.3) + l(0.6, 0.9)}{2}$$What's more, if $t \approx 0.44994$, then the average distance to the truth is the same for both groups. That is,$$\frac{l(0.5, 0.44994) + l(0.1, 0.44994)}{2} \approx \frac{l(0.3, 0.44994) + l(0.9, 0.44994)}{2}$$But then it follows that the mean of $a_1, a_2$ is less accurate than the mean of $b_1, b_2$, even though, according to one seemingly legitimate definition of diversity, $a_1, a_2$ is more diverse than $b_1, b_2$. That is,$$l(0.3, 0.44994) > l(0.6, 0.44994)$$
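The inequalities in this example are easy to check numerically. Here is a minimal Python sketch; the helper names are invented for the illustration, and $\log$ is taken to be the natural logarithm.

```python
from math import log, isclose


def l(x, y):
    """Generalized Kullback-Leibler divergence from x to y."""
    return x * log(x / y) - x + y


def mean(values):
    return sum(values) / len(values)


def diversity_to_mean(estimates):
    """Original definition: average distance from the answers to their mean."""
    m = mean(estimates)
    return mean([l(x, m) for x in estimates])


def diversity_from_mean(estimates):
    """Alternative definition: average distance from the mean to the answers."""
    m = mean(estimates)
    return mean([l(m, x) for x in estimates])


def average_error(estimates, truth):
    return mean([l(x, truth) for x in estimates])


a = [0.5, 0.1]
b = [0.3, 0.9]
t = 0.44994  # rounded, so the two average errors agree only approximately

print("original diversity:   ", diversity_to_mean(a), diversity_to_mean(b))
print("alternative diversity:", diversity_from_mean(a), diversity_from_mean(b))
print("average error:        ", average_error(a, t), average_error(b, t))
print("error of the means:   ", l(mean(a), t), l(mean(b), t))
```

Because $t$ is given to five decimal places, the two average errors agree to roughly six decimal places rather than exactly.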

The Robustness of the Diversity Prediction Theorem I: generalizing the result

2023-03-14T08:40:00.006+00:00

There is a PDF version of this post here.

Pick a question to which the answer is a number. What is the population of Weston-super-Mare? How many ants are there in the world? Then go out into the street and ask the first ten people you meet for their answers. Look how far those answers lie from the truth. Now take the average of the answers---their mean. And look how far it lies from the truth. If you measure the distance between an answer and the true value in the way many statisticians do---that is, you take the difference and square it---then you'll notice that the distance of the average from the truth is less than the average distance from the truth. I can predict this with confidence because it is no coincidence, nor even just a common but contingent outcome---it's a mathematical fact. And it remains a mathematical fact for many alternative ways of measuring the distance from an answer to the truth---essentially, all that's required is that the measure is strictly convex in its first argument (Larrick, et al. 2012). That's a neat result! In expectation, you're better off going with the average answer than picking an answer at random and going with that.

But, if we keep using squared difference, we get an even more interesting result. The distance the average answer lies from the truth is equal to the average distance the answers lie from the truth less a quantity that measures the diversity of the answers. Scott Page calls this the Diversity Prediction Theorem and uses it as part of his general argument that diversity is valuable in group reasoning and decision-making. The quantity that measures the diversity of the answer is the average distance the answers lie from the mean answer: the answers in a homogeneous set will, on average, lie close to their mean, while those in a heterogeneous set will, on average, lie far from their mean. This has an intriguing corollary: given two collections of answers with the same average distance from the truth, the mean of the more diverse one will be more accurate than the mean of the less diverse one. That's another neat result! It isn't quite as neat as people sometimes make out. Sometimes, they say that it shows that more diverse sets of answers are more accurate; but that's only true when you're comparing sets with the same average distance from the truth; increasing diversity can often increase the average distance from the truth. But it's still a neat result!

A natural question: is it a quirk of mean squared error? In fact, it's not. There are many, many ways of measuring the distance between answers, mean answers, and the truth for which the Diversity Prediction Theorem holds. As I'll show here, they are exactly the so-called Bregman divergences, which I'll introduce below.

That every Bregman divergence gives the Diversity Prediction Theorem was shown by David Pfau (2013) in an unpublished note. I haven't been able to find a proof that only those functions give it, but it follows reasonably straightforwardly from a characterization of Bregman divergences due to Banerjee, et al. (2005).

Having talked informally through the results I want to present, let me now present them precisely.

First, the measure of distance between two numbers that is used in the standard version of the Diversity Prediction Theorem:

Definition 1 (Squared difference)$$q(x, y) := (x-y)^2$$

So our first result is this:

Proposition 1 Suppose $a_1, \ldots, a_n, t$ are real numbers with $a_i \neq a_j$ for some $i, j$. And let $a^\star = \frac{1}{n} \sum^n_{i=1} a_i$. Then$$q(a^\star, t) < \frac{1}{n} \sum^n_{i=1} q(a_i, t)$$

As I mentioned, this generalises to any measure of distance that is strictly convex in its first argument.

Definition 2 (Strictly convex function) Suppose $X$ is a convex subset of the real numbers and $d : X \times X \rightarrow \mathbb{R}$. Then $d$ is strictly convex in its first argument if, for any $x_1, x_2, y$ in $X$ with $x_1 \neq x_2$ and any $0 < \lambda < 1$,$$d(\lambda x_1 + (1-\lambda) x_2, y) < \lambda d(x_1, y) + (1-\lambda) d(x_2, y)$$

Proposition 2 Suppose $X$ is a convex subset of the real numbers and $d : X \times X \rightarrow [0, \infty]$ is strictly convex in its first argument. Suppose $a_1, \ldots, a_n, t$ are real numbers in $X$ with $a_i \neq a_j$ for some $i, j$. And let $a^\star = \frac{1}{n} \sum^n_{i=1} a_i$. Then$$d(a^\star, t) < \frac{1}{n} \sum^n_{i=1} d(a_i, t)$$

Proof. This follows immediately from Jensen's inequality, which entails that, for any strictly convex function $f$, and any $a_1, \ldots, a_n$ with $a_i \neq a_j$ for some $i, j$,$$f\left (\frac{1}{n}\sum^n_{i=1} a_i \right ) < \frac{1}{n} \sum^n_{i=1} f(a_i).$$$\Box$

Next, we meet the Diversity Prediction Theorem:

Proposition 3 Suppose $a_1, \ldots, a_n, t$ are real numbers with $a_i \neq a_j$ for some $i, j$. And let $a^\star = \frac{1}{n} \sum^n_{i=1} a_i$. Then$$q(a^\star, t) = \frac{1}{n} \sum^n_{i=1} q(a_i, t) - \frac{1}{n} \sum^n_{i=1} q(a_i, a^\star)$$
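Propositions 1 and 3 can be seen at work on a toy data set. The answers and the true value below are made up purely for illustration, and the variable names are not from the original post.

```python
def q(x, y):
    """Squared difference."""
    return (x - y) ** 2


def mean(values):
    return sum(values) / len(values)


answers = [12, 20, 25, 7]  # hypothetical estimates
truth = 15                 # hypothetical true value

crowd = mean(answers)
crowd_error = q(crowd, truth)                         # distance of the mean from the truth
average_error = mean([q(a, truth) for a in answers])  # average distance from the truth
diversity = mean([q(a, crowd) for a in answers])      # average distance from the mean

print(crowd_error, average_error, diversity)
```

Here the mean answer is 16, its squared distance from the truth is 1, the average squared distance of the answers from the truth is 49.5, and the diversity term is 48.5, so the identity in Proposition 3 holds as $1 = 49.5 - 48.5$, and the strict inequality in Proposition 1 holds with it.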

I won't offer a proof, since it follows from the more general version below. To state that, we need the notion of a Bregman divergence:

Definition 3 (Bregman divergence) Suppose $X$ is a convex subset of the real numbers and $\varphi : X \rightarrow \mathbb{R}$ is a continuously differentiable, strictly convex function. Then, for $x, y$ in $X$, $$d_\varphi(x, y) = \varphi(x) - \varphi(y) - \varphi'(y)(x-y)$$We say that $d_\varphi$ is the Bregman divergence generated by $\varphi$.

To calculate the Bregman divergence from $x$ to $y$ generated by $\varphi$, you take the tangent to $\varphi$ at $y$ and calculate the difference between $\varphi$ at $x$ and this tangent at $x$. This is illustrated in Figure 1. Note that, for any $x, y$ in $X$, $d_\varphi(x, y) \geq 0$, with equality iff $x = y$.

Figure 1: $\varphi(x) = x\log(x)$ is plotted in blue; the tangent to $\varphi$ at $y$ is plotted in yellow; $d_\varphi(x, y)$ is the length of the dashed line.

Two examples of Bregman divergences, with the strictly convex functions that generate them:

Squared Euclidean distance Suppose $\varphi(x) = x^2$. Then$$d_\varphi(x, y) = (x-y)^2$$

Generalized Kullback-Leibler divergence Suppose $\varphi(x) = x\log(x)$. Then$$d_\varphi(x, y) = x\log \left ( \frac{x}{y} \right ) - x + y$$
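Definition 3 translates almost directly into code. The sketch below assumes the caller supplies both $\varphi$ and its derivative by hand (no automatic differentiation is attempted), and the names `bregman`, `squared` and `gkl` are invented for it.

```python
from math import log, isclose


def bregman(phi, dphi):
    """Return the Bregman divergence generated by phi, whose derivative is dphi."""
    def d(x, y):
        # Gap at x between phi and the tangent to phi taken at y.
        return phi(x) - phi(y) - dphi(y) * (x - y)
    return d


# phi(x) = x^2 generates squared difference.
squared = bregman(lambda x: x ** 2, lambda x: 2 * x)

# phi(x) = x log(x) generates the generalized Kullback-Leibler divergence.
gkl = bregman(lambda x: x * log(x), lambda x: log(x) + 1)

x, y = 0.5, 0.3
print(squared(2, 5))                          # compare with (2 - 5)^2
print(gkl(x, y), x * log(x / y) - x + y)      # generic form vs closed form
print(gkl(x, y), gkl(y, x))                   # not symmetric in general
```

The last line shows the asymmetry that distinguishes the generalized Kullback-Leibler divergence from squared difference.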

Proposition 4 (Pfau 2013) Suppose $X$ is a convex subset of the real numbers and $\varphi : X \rightarrow \mathbb{R}$ is a continuously differentiable, strictly convex function. Suppose $a_1, \ldots, a_n, t$ are real numbers in $X$ with $a_i \neq a_j$ for some $i, j$. And let $a^\star = \frac{1}{n} \sum^n_{i=1} a_i$. Then$$d_\varphi(a^\star, t) = \frac{1}{n} \sum^n_{i=1} d_\varphi(a_i, t) - \frac{1}{n} \sum^n_{i=1} d_\varphi(a_i, a^\star)$$

Proof. The result follows easily from the following three equations:

\begin{eqnarray*} d_\varphi(a^\star, t) & = & \varphi(a^\star) - \varphi(t) - \varphi'(t)(a^\star - t) \\ & & \\ \frac{1}{n}\sum^n_{i=1} d_\varphi(a_i, t) & = & \frac{1}{n} \sum^n_{i=1} \varphi(a_i) - \varphi(t) - \varphi'(t)\left (\frac{1}{n} \sum^n_{i=1} a_i - t \right ) \\ & = & \frac{1}{n} \sum^n_{i=1} \varphi(a_i) - \varphi(t) - \varphi'(t)\left (a^\star - t \right ) \\ & & \\ \frac{1}{n} \sum^n_{i=1} d_\varphi(a_i, a^\star) & = & \frac{1}{n}\sum^n_{i=1}\varphi(a_i) - \varphi(a^\star) - \varphi'(a^\star)\left (\frac{1}{n}\sum^n_{i=1} a_i - a^\star \right ) \\ & = & \frac{1}{n}\sum^n_{i=1}\varphi(a_i) - \varphi(a^\star)\end{eqnarray*}

$\Box$
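To spell out the final step of that proof: the last term in the expansion of $\frac{1}{n} \sum^n_{i=1} d_\varphi(a_i, a^\star)$ vanishes because $\frac{1}{n}\sum^n_{i=1} a_i - a^\star = 0$, and subtracting the result from the expansion of $\frac{1}{n}\sum^n_{i=1} d_\varphi(a_i, t)$ cancels the $\frac{1}{n}\sum^n_{i=1}\varphi(a_i)$ terms:

\begin{eqnarray*} \frac{1}{n}\sum^n_{i=1} d_\varphi(a_i, t) - \frac{1}{n} \sum^n_{i=1} d_\varphi(a_i, a^\star) & = & \left ( \frac{1}{n} \sum^n_{i=1} \varphi(a_i) - \varphi(t) - \varphi'(t)(a^\star - t) \right ) - \left ( \frac{1}{n}\sum^n_{i=1}\varphi(a_i) - \varphi(a^\star) \right ) \\ & = & \varphi(a^\star) - \varphi(t) - \varphi'(t)(a^\star - t) \\ & = & d_\varphi(a^\star, t) \end{eqnarray*}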

Next, we prove that the Bregman divergences are the only functions that give the Diversity Prediction Theorem. The proof relies on the following characterization of Bregman divergences.

Lemma 5 (Banerjee et al. 2005) Suppose $X$ is a convex subset of the real numbers and $d : X \times X \rightarrow [0, \infty]$. And suppose that, for any $a_1, \ldots, a_n$ in $X$, with $a^\star = \frac{1}{n}\sum^n_{i=1} a_i$, the function $t \mapsto \frac{1}{n} \sum^n_{i=1} d(a_i, t)$ is minimized at $t = a^\star$. Then there is a continuously differentiable and strictly convex function $\varphi$ on $X$ such that $d = d_\varphi$.

Proposition 6 Suppose $X$ is a convex subset of the real numbers and $d : X \times X \rightarrow [0, \infty]$. And suppose that, for any $a_1, \ldots, a_n, t$ in $X$ with $a_i \neq a_j$ for some $i, j$, with $a^\star = \frac{1}{n} \sum^n_{i=1} a_i$, $$d(a^\star, t) = \frac{1}{n} \sum^n_{i=1} d(a_i, t) - \frac{1}{n} \sum^n_{i=1} d(a_i, a^\star)$$Then there is a continuously differentiable and strictly convex function $\varphi$ on $X$ such that $d = d_\varphi$.

Proof. Suppose that, for any $a_1, \ldots, a_n, t$ in $X$ with $a_i \neq a_j$ for some $i, j$, and with $a^\star = \frac{1}{n} \sum^n_{i=1} a_i$, $$d(a^\star, t) = \frac{1}{n} \sum^n_{i=1} d(a_i, t) - \frac{1}{n} \sum^n_{i=1} d(a_i, a^\star)$$Then$$\frac{1}{n} \sum^n_{i=1} d(a_i, t) = d(a^\star, t) + \frac{1}{n} \sum^n_{i=1} d(a_i, a^\star)$$And so the function $t \mapsto \frac{1}{n} \sum^n_{i=1} d(a_i, t)$ is minimal when $t \mapsto d(a^\star, t)$ is minimal; and since, for all $x, y$ in $X$, $d(x, y) \geq 0$ with equality iff $x = y$, $t \mapsto d(a^\star, t)$ is minimal at $t = a^\star$. So, by Lemma 5, there is a continuously differentiable and strictly convex function $\varphi$ on $X$ such that $d = d_\varphi$. $\Box$
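The minimisation property that Lemma 5 takes as its hypothesis can be illustrated, though of course not proved, with a crude grid search. The sketch below uses the generalized Kullback-Leibler divergence, which we already know to be a Bregman divergence, so it only shows the property holding in one familiar case; the answers, the grid, and the names are all chosen for illustration.

```python
from math import log


def l(x, y):
    """Generalized Kullback-Leibler divergence from x to y."""
    return x * log(x / y) - x + y


def average_distance(answers, t):
    """Average distance from the answers to a candidate value t."""
    return sum(l(a, t) for a in answers) / len(answers)


answers = [0.5, 0.1]  # illustrative answers
mean_answer = sum(answers) / len(answers)

# Candidate values of t on a grid from 0.001 to 2.000 in steps of 0.001.
grid = [i / 1000 for i in range(1, 2001)]
best_t = min(grid, key=lambda t: average_distance(answers, t))

print(mean_answer, best_t)
```

On this grid the minimiser of the average distance coincides with the mean answer, as the hypothesis of the lemma requires.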

References

Banerjee, A., Guo, X., & Wang, H. (2005). On the Optimality of Conditional Expectation as a Bregman Predictor. IEEE Transactions on Information Theory, 51, 2664–69.

Larrick, R. P., Mannes, A. E., & Soll, J. B. (2012). The Social Psychology of the Wisdom of the Crowds. In J. I. Krueger (Ed.) Social Judgment and Decision Making, chap. 13, (pp. 227–242). New York: Psychology Press.

Pfau, D. (2013). A Generalized Bias-Variance Decomposition for Bregman Divergences. Unpublished manuscript.

Should longtermists recommend hastening extinction rather than delaying it?

2022-08-04T16:34:00.005+01:00

I'm cross-posting this from the Effective Altruism Forum. It's the latest version of my critique of longtermism that uses Lara Buchak's risk-weighted expected utility theory.

Here's the abstract: Longtermism is the view that the most urgent global priorities, and those to which we should devote the largest portion of our current resources, are those that focus on ensuring a long future for humanity, and perhaps sentient or intelligent life more generally, and improving the quality of those lives in that long future. The central argument for this conclusion is that, given a fixed amount of a resource that we are able to devote to global priorities, the longtermist’s favoured interventions have greater expected goodness than each of the other available interventions, including those that focus on the health and well-being of the current population. In this paper, I argue that, even granting the longtermist's axiology and their consequentialist ethics, we are not morally required to choose whatever option maximises expected utility, and may not be permitted to do so. Instead, if their axiology and consequentialism are correct, we should choose using a decision theory that is sensitive to risk, and allows us to give greater weight to worst-case outcomes than expected utility theory. And such decision theories do not recommend longtermist interventions. Indeed, sometimes, they recommend hastening human extinction. Many, though not all, will take this as a reductio of the longtermist's axiology or consequentialist ethics. I remain agnostic on the conclusion we should draw.

A kitten playing the long game

Self-recommending decision theories for imprecise probabilities

2022-07-27T11:03:00.001+01:00

A PDF version of this post is available here.

The question of this blogpost is this: Take the various decision theories that have been proposed for individuals with imprecise probabilities---do they recommend themselves? It is the final post in a trilogy on the topic of self-recommending decision theories (the others are here and here).

One precise kitten and one imprecise kitten

Let's begin by unpacking the question.

First, imprecise probabilities (sometimes known as mushy credences; for an overview, see Seamus Bradley's SEP entry here). For various reasons, some formal epistemologists think we should represent an individual's beliefs not by a precise probability function, which assigns to each proposition about which they have an opinion a single real number between 0 and 1, but rather by a set of such functions. Some think that, whatever rationality requires of them, most individuals simply don't make sufficiently strong and detailed probabilistic judgments to pick out a single probability function. Others think that, at least when the individual's evidence is very complex or very vague or very sparse, rationality in fact requires them not to make judgments that pick out just one function. Whatever the reason, many think we should represent an individual's beliefs by a set $P$ of probability functions, which we call their representor, following van Fraassen. One way to understand a representor is like this: $P$ contains all the probability functions that respect the probabilistic judgments that the individual makes. For instance, if they judge that proposition $A$ is more likely than $B$, then every function in $P$ should assign higher probability to $A$ than to $B$; if they judge $A$ is more likely than not, then every function in $P$ should assign higher probability to $A$ than to its negation. If these are the only two probabilistic judgments they make, then their representor will be $P = \{p : p(A) > p(B)\ \&\ p(A) > p(\overline{A})\}$. And this clearly contains more than one probability function!

Second, decision theories for an individual with imprecise probabilities. Suppose you have opinions only about two possible worlds $w_1$ and $w_2$, which are exclusive and exhaustive. And suppose your representor is$$P = \{(x, 1-x) : 0.3 \leq x \leq 0.4\}$$where $(x, 1-x)$ is the probability function that assigns probability $x$ to $w_1$ and $1-x$ to $w_2$. That is, you think $w_1$ is between 30% and 40% likely, and $w_2$ is between 60% and 70% likely, but you make no stronger judgment than that. And now suppose you face a decision problem that consists of a choice between two options, $a$ and $b$, with the following payoff table:$$\begin{array}{r|cc}& w_1 & w_2 \\ \hline a & 13 & 0 \\ b & 0 & 7 \end{array}$$Then, if you take the probability function $(0.3, 0.7)$, which takes $w_1$ to be 30% likely and $w_2$ to be 70% likely, then it expects $b$ to be better than $a$, but if you take the probability function $(0.4, 0.6)$, which takes $w_1$ to be 40% likely and $w_2$ to be 60% likely, then it expects $a$ to be better than $b$. How, then, should you choose? It turns out that there are many possibilities! I'll list what I take to be the main contenders below before asking whether any of them recommend themselves.
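To make the arithmetic in this example concrete, here is a small Python sketch. The function name `expected_utility` and the dictionary layout are illustrative choices of mine, not part of any library.

```python
# Payoff table from the example: utilities of each option at (w1, w2).
options = {"a": (13, 0), "b": (0, 7)}

def expected_utility(option, x):
    """Expected utility of an option under the credence function (x, 1 - x)."""
    u1, u2 = option
    return x * u1 + (1 - x) * u2

# Compare the two options at the two endpoints of the representor.
for x in (0.3, 0.4):
    exp_a = expected_utility(options["a"], x)
    exp_b = expected_utility(options["b"], x)
    better = "a" if exp_a > exp_b else "b"
    print(f"x = {x}: Exp(a) = {exp_a:.1f}, Exp(b) = {exp_b:.1f}, prefers {better}")
```

At the lower endpoint the expectations of $a$ and $b$ are 3.9 and 4.9, so $b$ comes out ahead; at the upper endpoint they are 5.2 and 4.2, so $a$ does.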

Third, self-recommending decision theories. Suppose you are unsure what decision problem you're about to face. Indeed you think each possible decision problem is equally likely. Then you can use any decision theory to pick between the available decision theories that you might use when faced with whichever decision problem arises. A decision theory is self-recommending if it says that it's permissible to pick itself.

Let's meet the decision theories for imprecise probabilities. This list may well not be comprehensive, but I've tried to identify the main ones (thanks to Jason Konek for an impromptu tutorial on $E$-admissibility and Maximality!). I state them in terms of impermissibility, but nothing hangs on that.

Suppose $P$ is a set of probability functions and $O$ is a set of options. Following Brian Weatherson, given a particular option $o$, we let $$l_o = \min_{p \in P} \mathrm{Exp}_p(o) \hspace{10mm} \text{ and } \hspace{10mm} u_o = \max_{p \in P} \mathrm{Exp}_p(o)$$

Global Dominance $o$ in $O$ is impermissible iff there is $o'$ in $O$ such that $u_o < l_{o'}$.

$\Gamma$-Maximin $o$ in $O$ is impermissible iff there is $o'$ in $O$ such that $l_o < l_{o'}$.

$\Gamma$-Maxi $o$ in $O$ is impermissible iff there is $o'$ in $O$ such that one of the following holds:

$l_o < l_{o'}$ and $u_o < u_{o'}$
$l_o < l_{o'}$ and $u_o = u_{o'}$
$l_o = l_{o'}$ and $u_o < u_{o'}$

$\Gamma$-Hurwicz$_\lambda$ $o$ in $O$ is impermissible iff there is $o'$ in $O$ such that $$\lambda l_o + (1-\lambda) u_o < \lambda l_{o'} + (1-\lambda) u_{o'}$$

$E$-Admissibility $o$ in $O$ is impermissible iff for all $p$ in $P$ there is $o'$ in $O$ such that$$\mathrm{Exp}_p(o) < \mathrm{Exp}_p(o')$$

Maximality $o$ in $O$ is impermissible iff there is $o'$ in $O$ such that, for all $p$ in $P$,$$\mathrm{Exp}_p(o) < \mathrm{Exp}_p(o')$$

Notice that the difference between $E$-Admissibility and Maximality is in the order of the quantifiers.
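Here is a minimal Python sketch of these six rules, under two simplifying assumptions of mine: there are only two worlds, and the representor is an interval $\{(x, 1-x) : lo \leq x \leq hi\}$. Every rule is read as a criterion of impermissibility, as announced above, and each function returns the options that survive. All the function names are hypothetical; exact fractions are used so that ties are detected reliably.

```python
from fractions import Fraction as F

def exp(o, x):
    """Expected utility of option o = (u1, u2) under the credence function (x, 1 - x)."""
    return x * o[0] + (1 - x) * o[1]

def bounds(o, lo, hi):
    """(l_o, u_o). Expected utility is linear in x, so its extremes sit at the endpoints."""
    e = (exp(o, lo), exp(o, hi))
    return min(e), max(e)

def permitted(options, beaten_by):
    """Names of the options that no rival o2 rules out under the pairwise test beaten_by(o, o2)."""
    return {n for n, o in options.items()
            if not any(beaten_by(o, o2) for o2 in options.values())}

def global_dominance(options, lo, hi):
    return permitted(options, lambda o, o2: bounds(o, lo, hi)[1] < bounds(o2, lo, hi)[0])

def gamma_maximin(options, lo, hi):
    return permitted(options, lambda o, o2: bounds(o, lo, hi)[0] < bounds(o2, lo, hi)[0])

def gamma_maxi(options, lo, hi):
    def beaten(o, o2):
        (l, u), (l2, u2) = bounds(o, lo, hi), bounds(o2, lo, hi)
        return l <= l2 and u <= u2 and (l, u) != (l2, u2)   # the three listed cases
    return permitted(options, beaten)

def gamma_hurwicz(options, lo, hi, lam):
    def score(o):
        l, u = bounds(o, lo, hi)
        return lam * l + (1 - lam) * u
    return permitted(options, lambda o, o2: score(o) < score(o2))

def maximality(options, lo, hi):
    # A difference of expectations is linear in x, so "worse at every p in P"
    # is equivalent to "worse at both endpoints".
    return permitted(options, lambda o, o2: exp(o, lo) < exp(o2, lo) and exp(o, hi) < exp(o2, hi))

def e_admissibility(options, lo, hi):
    # o survives iff it maximises expected utility at some x in [lo, hi].
    result = set()
    for n, o in options.items():
        left, right = lo, hi
        for o2 in options.values():
            slope = (o[0] - o2[0]) - (o[1] - o2[1])
            const = o[1] - o2[1]            # exp(o, x) - exp(o2, x) = slope * x + const
            if slope > 0:
                left = max(left, F(-const, slope))
            elif slope < 0:
                right = min(right, F(-const, slope))
            elif const < 0:
                left, right = F(1), F(0)    # o2 is strictly better at every x
        if left <= right:
            result.add(n)
    return result

options = {"a": (13, 0), "b": (0, 7)}
lo, hi = F(3, 10), F(2, 5)
print("Global Dominance:", global_dominance(options, lo, hi))
print("Gamma-Maximin:", gamma_maximin(options, lo, hi))
print("Gamma-Maxi:", gamma_maxi(options, lo, hi))
print("Gamma-Hurwicz (lambda = 3/4):", gamma_hurwicz(options, lo, hi, F(3, 4)))
print("E-Admissibility:", e_admissibility(options, lo, hi))
print("Maximality:", maximality(options, lo, hi))
```

Applied to the two-option example from earlier, $\Gamma$-Maximin and $\Gamma$-Hurwicz with $\lambda = 3/4$ permit only $b$, while the other four rules permit both options.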

Next, let me specify the situation in which we're testing for self-recommendation a little more precisely. We imagine that there's a maximum utility that you might receive in the decision problem, let's say $n$; and the utilities you might receive come from $\{0, 1, 2, \ldots, n\}$. And we imagine that the decision problem will consist of three available options, $a, b, c$, each defined on the two worlds $w_1$ and $w_2$. So each decision problem has a payoff table like this where $a_1, a_2, b_1, b_2, c_1, c_2$ come from $\{0, 1, 2, \ldots, n\}$:$$\begin{array}{r|cc}& w_1 & w_2 \\ \hline a & a_1 & a_2 \\ b & b_1 & b_2 \\ c & c_1 & c_2 \end{array}$$And we assume that each such decision problem is equally probable; and we assume that which decision problem you face is independent of which world you inhabit. So the probability of being at world $w_1$ and facing the decision problem $(a_1, a_2, b_1, b_2, c_1, c_2)$ is the probability of being at world $w_1$ multiplied by the probability of facing $(a_1, a_2, b_1, b_2, c_1, c_2)$, which is $\frac{1}{(n+1)^6}$, since each of the six utilities can take any of $n+1$ values.

All of the decision theories we've mentioned will sometimes permit more than one option: for instance, Global Dominance permits an option $o$ if, for every other option $o'$, $u_o \geq l_{o'}$. In these cases, we assume that the individual picks at random between the permissible options.

Now, let $P = \{(x, 1-x) : 0.3 \leq x \leq 0.4\}$. Then none of the decision theories listed above is self-recommending. Indeed, every one of them prefers using expected utility with probability function $m = (0.35, 0.65)$ to using itself. That is:

the maximum expected utility of using Global Dominance with $P$ is less than the minimum expected utility of using Expected Utility with $m$;
the minimum expected utility of using $\Gamma$-Maximin with $P$ is less than the minimum expected utility of using Expected Utility with $m$;
the minimum expected utility of using $\Gamma$-Maxi with $P$ is less than the minimum expected utility of using Expected Utility with $m$ and the maximum expected utility of using $\Gamma$-Maxi with $P$ is less than the maximum expected utility of using Expected Utility with $m$;
the weighted average of the minimum and maximum expected utilities of using $\Gamma$-Hurwicz$_\lambda$ with $P$ is less than the weighted average of the minimum and maximum expected utilities of using Expected Utility with $m$;
for all $p$ in $P$, the expected utility of using $E$-admissibility with $P$ is less than the expected utility of using Expected Utility with $m$;
for all $p$ in $P$, the expected utility of using Maximality with $P$ is less than the expected utility of using Expected Utility with $m$.

More on self-recommending decision theories

2022-07-19T12:41:00.002+01:00

A PDF of this blogpost can be found here.

Last week, I wrote about how we might judge a decision theory by its own lights. I suggested that we might ask the decision theory whether it would choose to adopt itself as a decision procedure if it were uncertain about which decisions it would face. And I noted that many instances of Lara Buchak's risk-weighted expected utility theory (REU) do not recommend themselves when asked this question. In this post, I want to give a little more detail about that case, and also note a second decision theory that doesn't recommend itself, namely, $\Gamma$-Maximin (MM), a decision theory designed to be used when uncertainty is modeled by imprecise probabilities.

A cat judging you...harshly

The framework is this. For the sake of the calculations I'll present here, we assume that you'll face a decision problem with the following features:

there will be two available options, $a$ and $b$;
each option is defined for two exhaustive and exclusive possibilities, which we'll call worlds, $w_1$ and $w_2$; so, each decision problem is determined by a quadruple $(a_1, a_2, b_1, b_2)$, where $a_i$ is the utility of option $a$ at $w_i$, and $b_i$ is the utility of $b$ at $w_i$;
all of the utilities will be drawn from the set $\{0, 1, 2, \ldots, 20\}$; so, there are $21^4 = 194,481$ possible decision problems.

This is all you know about the decision problem you'll face. You place a uniform distribution over the possible decision problems you'll face. You take each to have probability $\frac{1}{194,481}$.

You also assign probabilities to $w_1$ and $w_2$.

In the case of REU, you have a credence function $(p, 1-p)$ over $w_1$ and $w_2$, so that $p$ is your credence in $w_1$ and $1-p$ is your credence in $w_2$.
In the case of MM, you represent your uncertainty by a set of such credence functions $\{(x, 1-x) : p \leq x \leq q\}$.

In both cases, you take the worlds to be independent from the decision problem you'll face.

In the case of REU, you also have a risk function $r$. In this post, I'll only consider risk functions of the form $r(x) = x^k$. I'll write $r_k$ for that function.

With all of that in place, we can ask our question about REU. Fix your credence function $(p, 1-p)$. Now, given a particular risk function $r$, does REU-with-$r$ judge itself to be the best decision theory available? Or is there an alternative decision theory---possibly REU-with-a-different-risk-function, but not necessarily---such that the risk-weighted expected utility from the point of view of $r$ of REU-with-$r$ is less than the risk-weighted expected utility from the point of view of $r$ of this alternative? And the answer is that, for many natural choices of $r$, there is.

Let's see this in action. Let $p = \frac{1}{2} = 1-p$. Let's consider the risk functions $r_k(x) = x^k$ for $0.5 \leq k \leq 2$. Then the following table gives some results. A particular entry, (row $k$, column $k'$), gives the risk-weighted expected utility from the point of view of $r_{k'}$ of using risk-weighted expected utility from the point of view of $r_k$ to make your decisions. In each column $k'$, the entry in green is what the risk function $r_{k'}$ thinks of itself; the entries in blue indicate the risk functions $r_k$ that $r_{k'}$ judges to be better than itself; and the entry in red is the risk function $r_k$ that $r_{k'}$ judges to be best.

How risk-weighted expected utility theories judge each other

There are a few trends to pick out:

Risk-inclined risk functions ($r_{k'}(x) = x^{k'}$ for $0.5 \leq k' < 1$) judge less risk-inclined ones to be better than themselves;
The more risk-inclined, the further away the risk functions can be and still count as better, so $r_{0.5}$ judges $r_{0.9}$ to be better than itself, but $r_{0.7}$ doesn't judge $r_1$ to be better;
And similarly, mutatis mutandis, for risk-averse risk functions ($r_{k'}(x) = x^{k'}$ for $1 < k' \leq 2$). Each judges less risk-averse risk functions to be better than themselves;
And the more risk-averse, the further away a risk function can be and still be judged better.
It might look like $r_{0.9}$ and $r_{1.1}$ are self-recommending, but that's just because we haven't considered more fine-grained possibilities between them and $r_1$. When we do, we find they follow the pattern above.
The risk-neutral risk function $r_1$ is genuinely self-recommending. REU with this risk function is just expected utility theory.

So much for REU. Let's turn now to MM. First, let me describe this decision rule. Suppose you face a decision problem between $a = (a_1, a_2)$ and $b = (b_1, b_2)$. Then, first, you decide how much you value each option. You take this to be the minimum expected utility it gets from credence functions in the set of credence functions that represent your uncertainty. That is, you take each $(x, 1-x)$ from your set of credence functions, so that $p \leq x \leq q$, you calculate the expected utility of $a$ relative to that credence function, and you value $a$ by the minimum expectation you come across. Then you pick whichever of the two options you value most. That is, you pick the one whose minimum expected utility is greatest.

Let's turn to asking how the theory judges itself. Here, we don't have different versions of the theory specified by different risk-functions. But let me consider different sets of credences that might represent our uncertainty. I'll ask how the theory judges itself, and also how it judges the version of expected utility theory (EU) where you use the precise credence function that sits at the midpoint of the credence functions in the set that represents your uncertainty. So, for instance, if $p = 0.3$ and $q = 0.4$ and the representor is $\{(x, 1-x) : 0.3 \leq x \leq 0.4\}$, I'll be comparing how MM thinks of itself and how it thinks of the version of expected utility theory that uses the precise credence function $(0.35, 0.65)$. In the first column here, we have the values of $p$ and $q$; in the second, we have the minimum expected utility for MM relative to each of these pairs of values; in the final column, we have the minimum expected utility for EU relative to those pairs.

$$\begin{array}{r|cc} & \text{MM} & \text{EU} \\ \hline (0.3, 0.4) & 12.486 & 12.490 \\ (0.3, 0.6) & 12.347 & 12.377 \\ (0.3, 0.9) & 12.690 & 12.817 \\ (0.1, 0.2) & 12.870 & 12.873\end{array}$$

Again, some notable features:

in each case, MM judges EU to be better than itself (I suspect this is connected to the fact that there are no strictly proper scores for imprecise credences, but I'm not sure quite how yet! For treatments of that, see Seidenfeld, Schervish, & Kadane, Schoenfield, Mayo-Wilson & Wheeler, and Konek.)
greater uncertainty (which is represented by a broader range of credence functions) leads to a bigger difference between MM and EU;
having a midpoint that lies further from the centre also seems to lead to a bigger difference.
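For readers who want to run a comparison of this kind themselves, here is a minimal Python sketch of how MM might evaluate itself and midpoint EU. It rests on assumptions that the post does not fix: ties between the two options are broken at random (so a tie contributes the average of the two options), and exact fractions are used so that ties are detected reliably. For that reason the output need not match the table above to the last digit. All names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

def exp(o, x):
    """Expected utility of o = (u1, u2) under the credence function (x, 1 - x)."""
    return x * o[0] + (1 - x) * o[1]

def best(a, b, value):
    """Options with the greatest value; both are returned when they tie."""
    va, vb = value(a), value(b)
    return [a] if va > vb else [b] if vb > va else [a, b]

def mm_rule(p, q):
    # Gamma-Maximin: value an option by its minimum expectation over [p, q],
    # which is attained at an endpoint because expectation is linear in x.
    return lambda a, b: best(a, b, lambda o: min(exp(o, p), exp(o, q)))

def eu_rule(m):
    # Expected utility theory with the precise credence function (m, 1 - m).
    return lambda a, b: best(a, b, lambda o: exp(o, m))

def average_payoffs(rule, n):
    """Average utility at w1 and at w2 from using `rule` on every problem with utilities in 0..n."""
    tot1 = tot2 = F(0)
    for a1, a2, b1, b2 in product(range(n + 1), repeat=4):
        picks = rule((a1, a2), (b1, b2))
        tot1 += F(sum(o[0] for o in picks), len(picks))   # assumption: ties broken at random
        tot2 += F(sum(o[1] for o in picks), len(picks))
    count = (n + 1) ** 4
    return tot1 / count, tot2 / count

def min_expected_utility(payoffs, p, q):
    """How Gamma-Maximin values a decision theory whose average payoffs are `payoffs`."""
    return min(exp(payoffs, p), exp(payoffs, q))

p, q, n = F(3, 10), F(2, 5), 6        # n = 20 matches the post but runs more slowly
mm = average_payoffs(mm_rule(p, q), n)
eu = average_payoffs(eu_rule((p + q) / 2), n)
print("MM on itself:", float(min_expected_utility(mm, p, q)))
print("MM on EU:    ", float(min_expected_utility(eu, p, q)))
```

Because worlds and decision problems are independent, the expected utility of a decision theory under $(x, 1-x)$ is just $x$ times its average payoff at $w_1$ plus $1-x$ times its average payoff at $w_2$, which is why the sketch only needs to track those two averages.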

At some point, I'll try to write up some thoughts about the consequences of these facts. Could a decision theory that does not recommend itself be rationally adopted? But frankly it's far too hot to think about that today.

Self-recommending decision theories

2022-07-11T10:54:00.002+01:00

A PDF of this blogpost is available here.

Once again, I find myself stumbling upon a philosophical thought that seems so natural that I feel reasonably confident it must have been explored before, but I can't find where. So, in this blogpost, I'll set it out in the hope that a kind reader will know where to find a proper version already written up fully.*

I'd like to develop a type of objection that might be raised against a theory of rational decision-making. Here, I'll raise it against Lara Buchak's risk-weighted expected utility theory, in particular, but there will be many other theories to which it applies.

In brief, the objection applies to decision theories that are not self-recommending. That is, it applies to a decision theory if there is a particular instance of that theory that recommends that you use some alternative decision theory to make your decision; if you were to use this decision theory to choose which decision theory to use to make your choices, it would tell you to choose a different one, and not itself. We might naturally say that a decision theory that is not self-recommending in this sense is not a coherent means by which to make decisions, and that seems to be a strong strike against it.

A self-recommending Timothy Dalton

On what basis might we criticize a theory of rational decision making? One popular way is to show that any individual who adopts the theory is exploitable; that is, there are decisions they might face in response to which the theory will lead them to choose certain options when there are alternative options that are guaranteed to be better; that is, in the jargon of decision theory, there are alternatives that dominate the recommendations of the decision theory in question. This is the sort of objection that money pump arguments raise against their targets. For instance, suppose my decision theory permits cyclical preferences, so that I may prefer $b$ to $a$, $c$ to $b$, and $a$ to $c$. Then, if I have those preferences and choose in line with them, the money pump argument notes that I will choose $b$ over $a$, when presented with a choice between them, then pay money to swap $b$ for $c$, when presented with that option, and then pay money again to swap $c$ for $a$, when presented with that possibility. However, I might simply have chosen $a$ in the first place and then refrained from swapping thereafter, and I would have ended up better off for sure. The second sequence of choices dominates the first. So, the exploitability argument concludes, cyclical preferences are irrational and so any decision theory that permits them is flawed.

This sort of objection is also often raised against decision theories that permit sensitivity to risk. For instance, take the most extreme risk-averse decision theory available, namely Abraham Wald's Maximin. This doesn't just permit sensitivity to risk---it demands it. It says that, in any decision problem, you should choose the option whose worst-case outcome is best. So, suppose you are faced with the choice between $a$ and $b$, both of which are defined at two possible worlds, $w_1$ and $w_2$:
$$\begin{array}{r|cc}& w_1 & w_2 \\ \hline a & 3 & 0 \\ b & 1 & 1 \end{array}$$
Then Maximin says that you should choose $b$, since its worst-case outcome gives 1 utile, while the worst-case outcome of $a$ gives 0 utiles.

Now, after facing that first decision, and choosing $b$ in line with Maximin, suppose you're now faced with the choice between $a'$ and $b'$:
$$\begin{array}{r|cc} & w_1 & w_2 \\ \hline a' & 0 & 3 \\ b' & 1 & 1 \end{array}$$
You choose $b'$, just as Maximin requires. But it's easy to see that $a$ and $a'$, taken together, dominate $b$ and $b'$. $a + a'$ gives 3 utiles for sure, while $b + b'$ gives 2 utiles for sure.
$$\begin{array}{r|cc} & w_1 & w_2 \\ \hline a+a' & 3 & 3 \\ b+b' & 2 & 2 \end{array} $$
So Maximin is exploitable.
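The two choices can be checked mechanically. The following Python sketch uses a hypothetical helper, `maximin_choice`, that picks the option whose worst-case utility is greatest, and then adds up the payoffs world by world.

```python
def maximin_choice(options):
    """Name of the option whose worst-case utility is greatest."""
    return max(options, key=lambda name: min(options[name]))

# Utilities at (w1, w2) for the two decision problems above.
first = {"a": (3, 0), "b": (1, 1)}
second = {"a'": (0, 3), "b'": (1, 1)}

c1, c2 = maximin_choice(first), maximin_choice(second)

# World-by-world totals for what Maximin picks and for what it rejects.
chosen_total = tuple(u + v for u, v in zip(first[c1], second[c2]))
rejected_total = tuple(u + v for u, v in zip(first["a"], second["a'"]))

print("Maximin picks:", c1, "and", c2, "with totals", chosen_total)
print("Rejected pair a and a' has totals", rejected_total)
```

The rejected pair does better at both worlds, which is exactly the dominance that the exploitability argument points to.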

I've argued in various places that I don't find exploitability arguments compelling.** They show only that the decision rule will lead to a bad outcome when the decision maker is faced with quite a specific series of decision problems. But that tells me little about the performance of the decision rule over the vast array of possible decisions I might face. Perhaps an exploitable rule compensates for its poor performance in those particular cases by performing extremely well in other cases. For all the exploitability objection tells me, that could well be the case.

Recognising this problem, you might instead ask: how does this decision rule perform on average over all decision problems you might face? And indeed it's easy to show that decision theories that disagree with expected utility theory will perform worse on average than expected utility theory itself. But that's partly because we've stacked the deck in favour of expected utility theory. After all, looking at the average performance over all decision problems is just looking at the expected performance from the point of view of a credence function that assigns equal probability to all possible decision problems. And, as we'll show explicitly below, expected utility theory judges itself to be the best decision theory to use; that is, it does best in expectation; that is, it does best on average.

But while this argument begs the question against non-expected utility theories, it does suggest a different way to test a decision theory: ask not whether it does best on average, and thus by the lights of expected utility theory; ask rather whether it does best by its own lights; ask whether it judges itself to be the best decision theory. Of course, this is a coherence test, and like all coherence tests, passing it is not sufficient for rationality. But it does seem that failing is sufficient for irrationality. It is surely irrational to use a method for selecting the best means to your ends that does not think it is the best method for selecting the best means to your ends.

Let's begin by seeing a couple of theories that pass the test. Expected utility theory is the obvious example, and running through that will allow us to set up the formal framework. We begin with the space of possible states. There are two components to these states:

Let $W$ be the set of possible worlds grained finely enough to determine the utilities of all the options between which you will pick;
Let $D$ be the set of decision problems you might face.

Then a state is a pair $(d, w)$ consisting of a decision problem $d$ from $D$ and a possible world $w$. And now we define your credence function $p : W \times D \rightarrow [0, 1]$. So $p(w\ \&\ d)$ is your credence that you're at world $w$ and will face decision problem $d$.

Then expected utility theory says that, faced with a decision problem $d$ from $D$, you should pick an option $a$ from among those in $d$ that have maximal expected utility from the point of view of $p$. Given option $a$ and world $w$, we write $a(w)$ for the utility of $a$ at $w$. So the expected utility of $a$ is
$$
\sum_{w \in W} p(w | d)a(w)
$$
Now, let's ask how expected utility theory judges itself. Given a decision theory $R$ and a decision problem $d$, let $R(d)$ be the option in $d$ that $R$ requires you to take; so $R(d)(w)$ is the utility at world $w$ of the option from decision problem $d$ that decision theory $R$ requires you to take. Note that, in order for this to be well-defined for theories like $EU$ in which there might be multiple options with maximal expected utility, we must supplement those theories with a mechanism for breaking ties; but that is easily done. So $EU(d)$ is one of the acts in $d$ that maximises expected utility. Then expected utility theory assigns the following value or choiceworthiness to a decision theory $R$:
$$
EU(R) = \sum_{d \in D} \sum_{w \in W} p(d\ \&\ w)R(d)(w)
$$
Now, suppose $R$ is a decision theory and $d$ is a decision problem. Then
$$
\sum_{w \in W} p(w | d)EU(d)(w) \geq \sum_{w \in W} p(w | d)R(d)(w)
$$
with strict inequality if $R$ picks an option that doesn't maximise expected utility and $p$ assigns positive credence to a world $w$ at which this option differs from the one that does. But then
$$
\sum_{d \in D} p(d) \sum_{w \in W} p(w | d)EU(d)(w) \geq \sum_{d \in D} p(d) \sum_{w \in W} p(w | d)R(d)(w)
$$
So
$$
EU(EU) = \sum_{d \in D} \sum_{w \in W} p(d\ \&\ w)EU(d)(w) \geq \sum_{d \in D}\sum_{w \in W} p(d\ \&\ w)R(d)(w) = EU(R)
$$
with strict inequality if $p$ assigns positive credence to some $d\ \&\ w$ where $R$ chooses an option that doesn't maximise expected utility and, at world $w$, the utility of that option is different from the utility of the option that does. So, expected utility theory recommends itself.

Maximin, which we met above, is another self-recommending decision theory. What is the value or choiceworthiness that Maximin assigns to a decision theory $R$? It is the lowest utility you can obtain from using $R$ to make a decision:
$$
M(R) = \min_{\substack{d \in D \\ w \in W}} \{R(d)(w)\}
$$
Now, suppose $M$ judges $R$ to be better than it judges itself to be. That is, $$M(R) > M(M)$$Then pick a decision problem $d^\star$ and a world $w^\star$ at which $M$ obtains its minimum. That is, $M(d^\star)(w^\star) = M(M)$. And suppose $R$ picks option $a$ from $d^\star$. Then, for each world $w$, $a(w) > M(d^\star)(w^\star)$. If this were not the case, then $R$ would achieve as low a minimum as $M$. But then $M$ would have recommended $a$ instead of $M(d^\star)$ when faced with $d^\star$, since the worst-case of $a$ is better than the worst-case of $M(d^\star)$, which occurs at $w^\star$. So, Maximin is a self-recommending decision theory.

Now, Maximin is usually rejected as a reasonable decision rule for other reasons. For one thing, without further supplementary principles, it permits choices that are weakly dominated---that is, in some decision problems, it will declare one option permissible when there is another that is at least as good at all worlds and better at some. And since it pays no attention to the probabilities of the outcomes, it also permits choices that are stochastically dominated---that is, in some decision problems, it will declare one option permissible when there is another with the same possible outcomes, but higher probabilities for the better of those outcomes and lower probabilities for the worse. For another thing, Maximin just seems too extreme. It demands that you take £1 for sure instead of a 1% chance of 99p and 99% chance of £10 trillion.

An alternative theory of rational decision making that attempts to accommodate less extreme attitudes to risk is Lara Buchak's risk-weighted expected utility theory. This theory encodes your attitudes to risk in a function $r : [0, 1] \rightarrow [0, 1]$ that is (i) continuous, (ii) strictly increasing, and (iii) assigns $r(0) = 0$ and $r(1) = 1$. This function is used to skew probabilities. For risk-averse agents, the probabilities are skewed in such a way that worse-case outcomes receive more weight than expected utility theory gives them, and best-case outcomes receive less weight. For risk-inclined agents, it is the other way around. For risk-neutral agents, $r(x) = x$, the probabilities aren't skewed at all, and the theory agrees exactly with expected utility theory.

Now, suppose $W = \{w_1, \ldots, w_n\}$. Given a credence function $p$ and a risk function $r$ and an option $a$, if $a(w_1) \leq a(w_2) \leq \ldots \leq a(w_n)$, the risk-weighted expected utility of $a$ is
\begin{eqnarray*}
& & REU_{p, r}(a) \\
& = & a(w_1) + \sum^{n-1}_{i=1} r(p(w_{i+1}) + \ldots + p(w_n))(a(w_{i+1}) - a(w_i)) \\
& = & \sum^{n-1}_{i=1} [r(p(w_i) + \ldots + p(w_n)) - r(p(w_{i+1}) + \ldots + p(w_n))]a(w_i) + r(p(w_n))a(w_n)
\end{eqnarray*}
So the risk-weighted expected utility of $a$, like the expected utility of $a$, is a weighted sum of the various utilities that $a$ can take at the different worlds. But, whereas expected utility theory weights the utility of $a$ at $w_i$ by the probability of $w_i$, risk-weighted utility theory weights it by the difference between the skewed probability that you will receive at least $a(w_i)$ from choosing $a$ and the skewed probability that you will receive more than $a(w_i)$ from choosing $a$.

Here is an example to illustrate. The decision is between $a$ and $b$:
$$\begin{array}{r|cc} & w_1 & w_2 \\ \hline a & 1 & 4 \\ b & 2 & 2 \end{array}$$
And suppose $p(w_1) = p(w_2) = 0.5$ and $r(x) = x^2$, for all $0 \leq x \leq 1$. Then:
$$
REU_{p, r}(a) = 1 + r(p(w_2))(4-1) = 1 + \left(\frac{1}{2}\right)^2 \cdot 3 = \frac{7}{4}
$$
while
$$
REU_{p, r}(b) = 2 + r(p(w_2))(2-2) = 2 = \frac{8}{4}
$$
So, while the expected utility of $a$ (i.e. 2.5) exceeds the expected utility of $b$ (i.e. 2), and so expected utility theory demands you pick $a$ over $b$, the risk-weighted expected utility of $b$ (i.e. 2) exceeds the risk-weighted expected utility of $a$ (i.e. 1.75), and so risk-weighted expected utility demands you pick $b$ over $a$.
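Here is a short Python sketch of the definition, applied to this example. The function `reu` and its input format (a list of probability-utility pairs) are my own illustrative choices; it implements the first form of the formula above, in which each step up in utility is weighted by the skewed probability of getting at least that much.

```python
def reu(outcomes, r):
    """Risk-weighted expected utility of a gamble.

    outcomes: list of (probability, utility) pairs, one per world.
    r: risk function from [0, 1] to [0, 1].
    """
    ordered = sorted(outcomes, key=lambda pu: pu[1])      # worst outcome first
    value = ordered[0][1]                                 # start from the worst-case utility
    for i in range(1, len(ordered)):
        tail = sum(p for p, _ in ordered[i:])             # probability of doing at least this well
        value += r(tail) * (ordered[i][1] - ordered[i - 1][1])
    return value

def risk_averse(x):
    return x ** 2

def risk_neutral(x):
    return x

a = [(0.5, 1), (0.5, 4)]
b = [(0.5, 2), (0.5, 2)]

print("REU(a) =", reu(a, risk_averse), " REU(b) =", reu(b, risk_averse))
print("EU(a)  =", reu(a, risk_neutral), " EU(b)  =", reu(b, risk_neutral))
```

With the risk-neutral function $r(x) = x$, the same code returns ordinary expected utility, which is one way to see that expected utility theory is the special case of REU in which probabilities are not skewed.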

Now we're ready to ask the central question of this post: does risk-weighted utility theory recommend itself? And we're ready to give our answer, which is that it doesn't.

It's tempting to think it does, and for the same reason that expected utility theory does. After all, if you're certain that you'll face a particular decision problem, risk-weighted expected utility theory recommends using it to make the decision. How could it not? After all, it recommends picking a particular option, and therefore recommends any theory that will pick that option, since using that theory will have the same utility as picking the option at every world. So, you might expect, it will also recommend itself when you're uncertain which decision you'll face. But risk-weighted expected utility theory doesn't work like that.

Let me begin by noting the simplest case in which it recommends something else. This is the case in which there are two decision problems, $d$ and $d'$, and you're certain that you'll face one or the other, but you're unsure which.
$$
\begin{array}{r|cc}
d & w_1 & w_2 \\
\hline
a & 3 & 6 \\
b & 2 & 8
\end{array}\ \ \
\begin{array}{r|cc}
d' & w_1 & w_2 \\
\hline
a' & 4 & 19 \\
b' & 7 & 9
\end{array}
$$
You think each is equally likely, you think each world is equally likely, and you think the worlds and decision problems are independent. So, $$p(d\ \&\ w_1) = p(d\ \&\ w_2)=p(d'\ \&\ w_1) = p(d'\ \&\ w_2) = \frac{1}{4}$$Then:
$$
REU(a) = 3 + \left(\frac{1}{2}\right)^2(6-3) = 3.75 > 3.5 = 2 + \left(\frac{1}{2}\right)^2(8-2) = REU(b)
$$
$$
REU(a') = 4 + \left(\frac{1}{2}\right)^2(19-4) = 7.75 > 7.5 = 7 + \left(\frac{1}{2}\right)^2(9-7) = REU(b')
$$
So REU will tell you to choose $a$ when faced with $d$ and $a'$ when faced with $d'$. Now compare that with a decision rule $R$ that tells you to pick $b$ and $b'$ respectively.
$$
REU(REU) = 3 + \left(\frac{3}{4}\right)^2(4-3) + \left(\frac{1}{2}\right)^2(6-4) + \left(\frac{1}{4}\right)^2(19-6) = 4.875
$$
and
$$
REU(R) = 2 + \left(\frac{3}{4}\right)^2(7-2) + \left(\frac{1}{2}\right)^2(8-7) + \left(\frac{1}{4}\right)^2(9-8) = 5.125
$$
So risk-weighted utility theory does not recommend itself in this situation. Yet it doesn't seem fair to criticize it on this basis. After all, perhaps it redeems itself by its performance in the face of other decision problems. In exploitability arguments, we only consider one series of decisions. Here, we only consider a pair of possible decisions. What happens when we have much much more limited information about the decision problems we'll face?
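The calculation for this two-problem case can be reproduced with the Python sketch below. The helper names (`reu`, `option_reu`, `policy_reu`) are hypothetical. A policy is treated as a single gamble over the four equally likely problem-world pairs, which is how the figures above are obtained.

```python
def reu(outcomes, r):
    """Risk-weighted expected utility of a list of (probability, utility) pairs."""
    ordered = sorted(outcomes, key=lambda pu: pu[1])
    value = ordered[0][1]
    for i in range(1, len(ordered)):
        tail = sum(p for p, _ in ordered[i:])             # probability of doing at least this well
        value += r(tail) * (ordered[i][1] - ordered[i - 1][1])
    return value

def r(x):
    return x ** 2

# Utilities at (w1, w2) for each option in each decision problem.
problems = {"d": {"a": (3, 6), "b": (2, 8)},
            "d'": {"a'": (4, 19), "b'": (7, 9)}}

def option_reu(utils):
    """REU of a single option when the two worlds are equally likely."""
    return reu([(0.5, utils[0]), (0.5, utils[1])], r)

def policy_reu(picks):
    """REU of a policy that takes picks[d] in problem d; each (problem, world) pair has probability 1/4."""
    outcomes = [(0.25, u) for d, name in picks.items() for u in problems[d][name]]
    return reu(outcomes, r)

# What REU itself would choose in each problem, and the rival rule R from the text.
reu_policy = {d: max(opts, key=lambda name: option_reu(opts[name])) for d, opts in problems.items()}
rival_policy = {"d": "b", "d'": "b'"}

print("REU picks:", reu_policy)
print("REU(REU) =", policy_reu(reu_policy))
print("REU(R)   =", policy_reu(rival_policy))
```

The point the sketch brings out is that the policy is evaluated by pooling and re-ranking all four outcomes, so the weights attached to a problem's outcomes differ from the weights they receive when that problem is considered on its own.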

Let's suppose that there is a finite set of utilities, $\{0, \ldots, n\}$.*** And suppose you consider every two-world, two-option decision problem with utilities from that set to be possible and equally likely. That is, the following decision problems are equally likely: decision problems $d$ in which there are exactly two available options, $a$ and $b$, which are defined only on mutually exclusive and exhaustive worlds $w_1$ and $w_2$, and where the utilities $a(w_1), a(w_2), b(w_1), b(w_2)$ lie in $\{0, \ldots, n\}$.

Here are some results: Set $n = 22$, and let $p(w_1) = p(w_2) = 0.5$. And consider risk functions of the form $r_k(x) = x^k$. For $k > 1$, $r_k$ is risk-averse; for $k = 1$, $r_k$ is risk-neutral and risk-weighted expected utility theory agrees with expected utility theory; and for $k < 1$, $r_k$ is risk-seeking. We say that $REU_{p, r_k}$ judges $REU_{p, r_{k'}}$ better than it judges itself if $$REU_{p, r_k}(REU_{p, r_k}) < REU_{p, r_k}(REU_{p, r_{k'}})$$And we write $REU_{p, r_k} \rightarrow REU_{p, r_{k'}}$. Then we have the following results:
$$
REU_{p, r_2} \rightarrow REU_{p, r_{1.5}} \rightarrow REU_{p, r_{1.4}} \rightarrow REU_{p, r_{1.3}} \rightarrow REU_{p, r_{1.2}} \rightarrow REU_{p, r_{1.1}}
$$
and
$$
REU_{p, r_{0.5}} \rightarrow REU_{p, r_{0.6}} \rightarrow REU_{p, r_{0.7}} \rightarrow REU_{p, r_{0.8}} \rightarrow REU_{p, r_{0.9}}
$$
So, for many natural risk-averse and risk-seeking risk functions, risk-weighted utility theory isn't self-recommending. And this, it seems to me, is a problem for these versions of the theory.
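To make the comparison concrete, here is a minimal Python sketch of how one theory's judgment of another might be computed. It rests on assumptions of mine rather than anything fixed above: the compound gamble gives each decision problem equal probability, ties between options are broken in favour of the first option, and the names `reu` and `judge` are purely illustrative. It does not attempt to reproduce the results reported for $n = 22$.

```python
from itertools import product


def reu(outcomes, k):
    """Risk-weighted expected utility of a gamble under r(x) = x**k.

    `outcomes` is an iterable of (probability, utility) pairs.
    """
    ordered = sorted(outcomes, key=lambda pu: pu[1])  # worst outcome first
    total = ordered[0][1]
    tail = 1.0  # probability of doing at least as well as the current outcome
    for (p_prev, u_prev), (_, u) in zip(ordered, ordered[1:]):
        tail -= p_prev
        # Guard against tiny negative values caused by floating-point drift.
        total += max(tail, 0.0) ** k * (u - u_prev)
    return total


def judge(k_judge, k_chooser, n, p1=0.5):
    """How the theory with exponent k_judge rates choosing by exponent k_chooser.

    Assumption: every two-world, two-option problem with utilities in
    {0, ..., n} is equally likely, and ties go to the first option.
    """
    probs = (p1, 1 - p1)
    problems = list(product(range(n + 1), repeat=4))
    weight = 1 / len(problems)
    mass = [0.0] * (n + 1)  # mass[u] = chance the compound gamble yields utility u
    for a1, a2, b1, b2 in problems:
        a, b = (a1, a2), (b1, b2)
        a_score = reu(zip(probs, a), k_chooser)
        b_score = reu(zip(probs, b), k_chooser)
        chosen = a if a_score >= b_score else b
        for p, u in zip(probs, chosen):
            mass[u] += weight * p
    return reu([(m, u) for u, m in enumerate(mass) if m > 0], k_judge)
```

Under these assumptions, `judge(2, 2, n)` and `judge(2, 1.5, n)` give the two sides of the first comparison in the chain above, and `reu([(0.25, 3), (0.25, 4), (0.25, 6), (0.25, 19)], 2)` returns the 4.875 computed earlier. Setting `n = 22` enumerates $23^4$ decision problems, so it takes a little while in pure Python.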

Now, for all my current results say, it's possible that there is a risk function other than $r(x) = x$ for which the theory recommends itself. But my conjecture is that this doesn't happen. The present results suggest that, for each risk-averse function, there is a less risk-averse one that it judges better, and for risk-seeking ones, there is a less risk-seeking one that it judges better. But even if there were a risk function for which the theory is self-recommending, that would surely limit the versions of risk-weighted expected utility theory that are tenable. That in itself would be an interesting result.

* The work of which I'm aware that comes closest to what interests me here is Catrin Campbell-Moore and Bernhard Salow's exploration of proper scoring rules for risk-sensitive agents in Buchak's theory. But it's not quite the same issue. And the idea of judging some part or whole of our decision-making apparatus by looking at its performance over all decision problems we might face I draw from Mark Schervish's and Ben Levinstein's work. But again, they are interested in using decision theories to judge credences, not using decision theories to judge themselves.

** In Section 13.7 of my Choosing for Changing Selves and Chapter 6 of my Dutch Book Arguments.

*** I make this restriction because it's the one for which I have some calculations; there's no deeper motivation. From fiddling with the calculations, it looks to me as if this restriction is inessential.

Aggregating for accuracy: another accuracy argument for linear pooling

2022-06-23T09:04:00.004+01:00

A PDF of this blogpost is available here.

I don't have an estimate for how long it will be before the Greenland ice sheet collapses, and I don't have an estimate for how long it will be before the average temperature at Earth's surface rises more than 3°C above pre-industrial levels. But I know a bunch of people who do have such estimates, and I might hope that learning theirs might help me set mine. Unfortunately, each of these people has a different estimate for each of these two quantities. What should I do? Should I pick one of them at random and adopt their estimates as mine? Or should I pick some compromise between them? If the latter, which compromise?

Cat inaccurately estimates width of step

The following fact gives a hint. An estimate of a quantity, such as the number of years until an ice sheet collapses or the number of years until the temperature rises by a certain amount, is better the closer it lies to the true value of the quantity and worse the further it lies from this. There are various ways to measure the distance between estimate and true value, but we'll stick with a standard one here, namely, squared error, which takes the distance to be the square of the difference between the two values. Then the following is simply a mathematical fact: taking the straight average of the group's estimates of each quantity as your estimate of that quantity is guaranteed to be better, in expectation, than picking a member of the group at random and simply deferring to them. This is sometimes called the Diversity Prediction Theorem, or it's a corollary of what goes by that name.

The result raises a natural question: Is it only by taking the straight average of the group's estimates as your own that you can be guaranteed to do better, in expectation, than by picking at random? Or is there another method for aggregating the estimates that also has this property? As I'll show, only straight averaging has this property. If you combine the group's estimates in any other way to give your own, there is a possible set of true values that will lie further from your estimate than you would lie, in expectation, were you to pick at random. The question is natural, and the answer is not difficult to prove, so I'm pretty confident this has been asked and answered before; but I haven't been able to find it, so I'd be grateful for a reference if anyone has one.

Let's make all of this precise. We have a group of $m$ individuals; each of them has an estimate for each of the quantities $Q_1, \ldots, Q_n$. We represent individual $j$ by the sequence $X_j = (x_{j1}, \ldots, x_{jn})$ of their estimates of these quantities. So $x_{ji}$ is the estimate of quantity $Q_i$ by individual $j$. Suppose $T = (t_1, \ldots, t_n)$ is the sequence of true values of these quantities. So $t_i$ is the true value of $Q_i$. Then the disvalue or badness of individual $j$'s estimates, as measured by squared error, is:$$(x_{j1} - t_1)^2 + \ldots + (x_{jn} - t_n)^2$$The disvalue or badness of individual $j$'s estimate of quantity $Q_i$ is $(x_{ji} - t_i)^2$, and the disvalue or badness of their whole set of estimates is the sum of the disvalue or badness of their individual estimates. We write $\mathrm{SE}(X_j, T)$ for this sum. That is,$$\mathrm{SE}(X_j, T) = \sum_i (x_{ji} - t_i)^2$$Then the Diversity Prediction Theorem says that, for any $X_1, \ldots, X_m$ that are not all equal and any $T$,$$\mathrm{SE}\left (\frac{1}{m}X_1 + \ldots + \frac{1}{m}X_m, T \right ) < \frac{1}{m}\mathrm{SE}(X_1, T) + \ldots + \frac{1}{m}\mathrm{SE}(X_m, T)$$And we wish to prove a sort of converse, namely, if $V \neq \frac{1}{m}X_1 + \ldots + \frac{1}{m}X_m$, then there is a possible set of true values $T = (t_1, \ldots, t_n)$ such that$$\mathrm{SE}(V, T) > \frac{1}{m}\mathrm{SE}(X_1, T) + \ldots + \frac{1}{m}\mathrm{SE}(X_m, T)$$I'll give the proof below.
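A small numerical check may help fix ideas. The Python sketch below uses made-up estimates and true values (the numbers and function names are illustrative assumptions, not data from anywhere) and compares the squared error of the straight average with the average squared error of the individuals. It also computes the average squared distance of the individuals from the group average, which is the gap between the two.

```python
def squared_error(estimates, truth):
    """SE(X, T): sum of squared differences between estimates and true values."""
    return sum((x - t) ** 2 for x, t in zip(estimates, truth))


def straight_average(group):
    """Quantity-by-quantity arithmetic mean of the group's estimates."""
    m = len(group)
    return [sum(column) / m for column in zip(*group)]


# Three individuals, two quantities (hypothetical numbers).
group = [[10, 2], [20, 4], [30, 9]]
truth = [18, 5]

average = straight_average(group)
collective_error = squared_error(average, truth)
mean_individual_error = sum(squared_error(x, truth) for x in group) / len(group)
# Average squared distance of the individuals from the group average.
diversity = sum(squared_error(x, average) for x in group) / len(group)

print(collective_error, mean_individual_error, diversity)
```

Here the average's error plus the diversity term equals the mean individual error, so the average does strictly better than picking at random whenever the individuals' estimates differ.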

Why is this interesting? One question at the core of those parts of philosophy that deal with collectives and their attitudes is this: How should you aggregate the opinions of a group of individuals to give a single set of opinions? When the opinions come in numerical form, such as when they are estimates of quantities or when they are probabilities, there are a number of proposals. Taking the straight arithmetic average as we have done here is just one. How are we to decide which to use? Standard arguments proceed by identifying a set of properties that only one aggregation method boasts, and then arguing that the properties in the set are desirable given your purpose in doing the aggregation in the first place. The result we have just noted might be used to mount just such an argument: when we aggregate estimates, we might well want a method that is guaranteed to produce aggregate estimates that are better, in expectation, than picking at random, and straight averaging is the only method that does that.

Finally, here's a slightly more general version of the result, which considers not just straight averages but also weighted averages; the proof is also given.

Proposition Suppose $\lambda_1, \ldots, \lambda_m$ is a set of weights, so that $0 \leq \lambda_j \leq 1$ and $\sum_j \lambda_j = 1$. If $V \neq \lambda_1 X_1 + \ldots + \lambda_mX_m$, then there is a possible set of true values $T = (t_1, \ldots, t_n)$ such that$$\mathrm{SE}(V, T) > \lambda_1\mathrm{SE}(X_1, T) + \ldots + \lambda_m\mathrm{SE}(X_m, T)$$

Proof. The left-hand side of the inequality is
$$
\mathrm{SE}(V, T) = \sum_i (v_i - t_i)^2 = \sum_i v_i^2 - 2\sum_i v_it_i + \sum_i t^2_i
$$The right-hand side of the inequality is
\begin{eqnarray*}
\sum_j \lambda_j \mathrm{SE}(X_j, T) & = & \sum_j \lambda_j \sum_i (x_{ji} - t_i)^2 \\
& = & \sum_j \lambda_j \sum_i \left ( x^2_{ji} - 2x_{ji}t_i + t_i^2 \right ) \\
& = & \sum_{i,j} \lambda_j x^2_{ji} - 2\sum_{i,j} \lambda_j x_{ji}t_i + \sum_i t_i^2
\end{eqnarray*}
So $\mathrm{SE}(V, T) > \sum_j \lambda_j \mathrm{SE}(X_j, T)$ iff$$
\sum_i v_i^2 - 2\sum_i v_it_i > \sum_{i,j} \lambda_j x^2_{ji} - 2\sum_{i,j} \lambda_j x_{ji}t_i
$$iff$$
2\left ( \sum_i \left ( \sum_j \lambda_j x_{ji}- v_i \right) t_i \right ) > \sum_{i,j} \lambda_j x^2_{ji} - \sum_i v_i^2
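$$And, if $(v_1, \ldots, v_n) \neq (\sum_j \lambda_j x_{j1}, \ldots,\sum_j \lambda_j x_{jn})$, there is $i$ such that $\sum_j \lambda_j x_{ji} - v_i \neq 0$. Set $t_k = 0$ for all $k \neq i$. Then the left-hand side of the last inequality is $2(\sum_j \lambda_j x_{ji} - v_i)t_i$, which can be made as large as we like by choosing $t_i$ with the same sign as $\sum_j \lambda_j x_{ji} - v_i$ and sufficiently large in magnitude, while the right-hand side does not depend on $T$. So it is always possible to choose $T = (t_1, \ldots, t_n)$ so that the inequality holds, as required.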
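The final step of the proof can be turned into an explicit recipe for $T$. The Python sketch below is one way of doing so, under assumptions of mine: it sets every $t_k$ to zero except at one coordinate $i$ where $V$ differs from the weighted average, and there it picks $t_i$ just large enough to clear the threshold. The function names and the example numbers are hypothetical.

```python
def squared_error(estimates, truth):
    """SE(X, T): sum of squared differences between estimates and true values."""
    return sum((x - t) ** 2 for x, t in zip(estimates, truth))


def weighted_average(group, weights):
    """Quantity-by-quantity weighted mean of the group's estimates."""
    return [sum(w * x for w, x in zip(weights, column)) for column in zip(*group)]


def witness_truth(v, group, weights):
    """Return true values T at which V does worse than the weighted average error.

    Raises StopIteration if V is exactly the weighted average, since then
    no such T exists.
    """
    pooled = weighted_average(group, weights)
    gaps = [a - b for a, b in zip(pooled, v)]
    i = next(idx for idx, gap in enumerate(gaps) if gap != 0)
    # Right-hand side of the last inequality in the proof (independent of T).
    threshold = sum(w * x ** 2 for w, xs in zip(weights, group) for x in xs)
    threshold -= sum(x ** 2 for x in v)
    truth = [0.0] * len(v)
    # Makes 2 * gaps[i] * t_i equal to threshold + 2 * |gaps[i]|.
    truth[i] = threshold / (2 * gaps[i]) + (1 if gaps[i] > 0 else -1)
    return truth


group = [[1, 2], [3, 6]]
weights = [0.5, 0.5]
v = [2, 5]  # differs from the weighted average [2, 4] in the second quantity

truth = witness_truth(v, group, weights)
lhs = squared_error(v, truth)
rhs = sum(w * squared_error(x, truth) for w, x in zip(weights, group))
print(truth, lhs, rhs)
```

With these numbers the aggregate $V$ has squared error 20 at the constructed $T$, against a weighted average error of 18 for the individuals.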

Should we agree? III: the rationality of groups

2022-05-18T13:19:00.003+01:00

In the previous two posts in this series (here and here), I described two arguments for the conclusion that the members of a group should agree. One was an epistemic argument and one a pragmatic argument. Suppose you have a group of individuals. Given an individual, we call the set of propositions to which they assign a credence their agenda. The group's agenda is the union of its members' agendas; that is, it includes any proposition to which some member of the group assigns a credence. The precise conclusion of the two arguments we describe is this: the group is irrational if there is no single probability function defined on the group's agenda that gives the credences of each member of the group when restricted to their agenda. Following Matt Kopec, I called this norm Consensus.

Cats showing a frankly concerning degree of consensus

Both arguments use the same piece of mathematics, but they interpret it differently. Both appeal to mathematical functions that measure how well our credences achieve the goals that we have when we set them. There are (at least) two such goals: we aim to have credences that will guide our actions well and we aim to have credences that will represent the world accurately. In the pragmatic argument, the mathematical function measures how well our credences achieve the first goal. In particular, it measures the utility we can expect to gain by having the credences we have and choosing in line with them when faced with whatever decision problems life throws at us. In the epistemic argument, the mathematical function measures how well our credences achieve the second goal. In particular, it measures the accuracy of our credences. As we noted in the second post on this, work by Mark Schervish and Ben Levinstein shows that the functions that measure these goals have the same properties: they are both strictly proper scoring rules. The arguments then appeal to the following fact: given a strictly proper scoring rule, if the members of a group do not agree on the credences they assign in the way required by Consensus, then there are some alternative credences they might assign instead that are guaranteed to be better according to that scoring rule.

I'd like to turn now to assessing these arguments. My first question is this: In the norm of Probabilism, rationality requires something of an individual, but in the norm of Consensus, rationality requires something of a group of individuals. We understand what it means to say that an individual is irrational, but what could it mean to say that a group is irrational?

Here, I follow Kenny Easwaran's suggestion that collective entities---in his case, cities; in my case, groups---can be said quite literally to be rational or irrational. For Easwaran, a city is rational "to the extent that the collective practices of its people enable diverse inhabitants to simultaneously live the kinds of lives they are each trying to live." As I interpret him, the idea is this: a city, no less than its individual inhabitants, has an end or goal or telos. For Easwaran, for instance, the end of a city is enabling its inhabitants to live as they wish to. And a city is irrational if it does not provide---in its physical and technological infrastructure, its byelaws and governing institutions---the best means to that end among those that are available. Now, we might disagree with Easwaran's account of a city's ends. But the template he provides by which we might understand group rationality is nonetheless helpful. Following his lead, we might say that a group, no less than its individual members, has an end. For instance, its end might be maximising the total utility of its members, or it might be maximising the total epistemic value of their credences. And it is then irrational if it does not provide the best means to that end among those available. So, for instance, as long as agreement between members is available, our pragmatic and epistemic arguments for Consensus seem to show that a group whose ends are as I just described does not provide the best means to its ends if it does not deliver such agreement.

Understanding group rationality as Easwaran does helps considerably. As well as making sense of the claim that the group itself can be assessed for rationality, it also helps us circumscribe the scope of the two arguments we've been exploring, and so the scope of the version of Consensus that they justify. After all, it's clear on this conception that these arguments will only justify Consensus for a group if

(a) that group has the end of maximising total expected pragmatic utility or total epistemic utility, i.e., maximising the quantities measured by the mathematical functions described above; and
(b) there are means available to it to achieve Consensus.

So, for instance, a group of sworn enemies hellbent on thwarting each other's plans is unlikely to have as its end maximising total utility, while a group composed of randomly selected individuals from across the globe is unlikely to have as its end maximising total epistemic utility, and indeed a group so disparate might lack any ends at all.

And we can easily imagine situations in which there are no available means by which the group could achieve Consensus, perhaps because it would be impossible to set up reliable lines of communication.

This allows us to make sense of two of the conditions that Donald Gillies places on the groups to which he takes his sure loss argument to apply (this is the first version of the pragmatic argument for Consensus; the one I presented in the first post and then abandoned in favour of the second version in the second post). He says (i) the members of the group must have a shared purpose, and (ii) there must be good lines of communication between them. Let me take these in turn to understand their status more precisely.

It's natural to think that, if a group has a shared purpose, it will have as its end maximising the total utility of the members of the group. And indeed in some cases this is almost certainly true. Suppose, for instance, that every member of a group cares only about the amount of biodiversity in a particular ecosystem that is close to their hearts. Then they will have the same utility function, and it is natural to say that maximising that shared utility is the group's end. But of course maximising that shared utility is equivalent to maximising the group's total utility, since the total utility is simply the shared utility scaled up by the number of members of the group.

However, it is also possible for a group to have a shared purpose without its end being to maximise total utility. After all, a group can have a shared purpose without each member taking that purpose to be the one and only valuable end. Imagine a different group: each member cares primarily about the level of biodiversity in their preferred area, but each also cares deeply about the welfare of their family. In this case, you might take the group's end to be maximising biodiversity in the area in question, particularly if it was this shared interest that brought them together as a group in the first place, but maximising this good might require the group not to maximise total utility, perhaps because some members of the group have family who are farmers and who will be adversely affected by whatever is the best means to the end of greater biodiversity.

What's more, it's possible for a group to have as its end maximising total utility without having any shared purpose at all. For instance, a certain sort of utilitarian might say that the group of all sentient beings has as its end the maximisation of the total utility of its members. But that group does not have any shared purpose.

So I think we can use the pragmatic and epistemic arguments to determine the groups to which the norm of Consensus applies, or at least the groups for which our pragmatic and epistemic arguments can justify its application. It is those groups that have as their end either maximising the total pragmatic utility of the group, or maximising their total epistemic utility, or maximising some weighted average of the two---after all, the weighted average of two strictly proper scoring rules, one measuring epistemic utility and one measuring pragmatic utility, is itself a strictly proper scoring rule. Of course, this requires an account of when a group has a particular end. This, like all questions about when collectives have certain attitudes, is delicate. I won't say anything more about it here.

Let's turn next to Gillies' claim that Consensus applies only to groups between whose members there are reliable lines of communication. In fact, I think our versions of the arguments show that this condition lives a strange double life. On the one hand, if such lines of communication are necessary to achieve agreement across the group, then the norm of Consensus simply does not apply to a group when these lines of communication are impossible, perhaps because of geographical, social, or technological barriers. A group cannot be judged irrational for failing to achieve something it could not possibly achieve, however much closer it would get to its goal if it could achieve that.

On the other hand, if such lines of communication are available, and if they increase the chance of agreement among members of the group, then our two arguments for Consensus are equally arguments for establishing such lines of communication, providing that the cost of doing so is outweighed by the gain in pragmatic or epistemic utility that comes from achieving agreement.

But these arguments do something else as well. They lend nuance to Consensus. In some cases in which some lines of communication are available but others aren't, or are too costly, our arguments still provide norms. Take, for instance, a case in which some central planner is able to communicate a single set of prior credences that each member of the group should have, but after the members start receiving evidence, this central planner can no longer coordinate their credences. And suppose we know that the members will receive different evidence: they'll be situated in different places, and so they'll see different things, have access to different information sources, and so on. So we know that, if they update on the evidence they receive in the standard way, they'll end up having different credences from one another and therefore violating Consensus. You might think, from looking at Consensus, that the group would do better, both pragmatically and epistemically, if each of its members were to ignore whatever evidence were to come in and to stick with their prior regardless in order to be sure that they remain in agreement and satisfy Consensus both in their priors and their posteriors.

In fact, however, this isn't the case. Let's take an extremely simple example. The group has just two members, Ada and Baz. Each has opinions only about the outcomes of two independent tosses of a fair coin. So the possible worlds are HH, HT, TH, TT. Ada will learn the outcome of the first, and Baz will learn the outcome of the second. A central planner can communicate to them a prior they should adopt, but that central planner can't receive information from them, and so can't receive their evidence and pool it and communicate a shared posterior to them. How should Ada and Baz proceed? How should they pick their priors, and what strategies should each adopt for updating when the evidence comes in? The entity we're assessing for rationality is the quadruple that contains Ada's prior together with her plan for updating, and Baz's prior together with his plan for updating. Which of these are available? Well, nothing constrains Ada's priors and nothing constrains Baz's. But there are constraints on their updating rules. Ada's updating rule must give the same recommendation at any two worlds at which her evidence is the same---so, for instance, it must give the same recommendation at HH as at HT, since all she learns at both is that the first coin landed heads. And Baz's updating rule must give the same recommendation at any two worlds at which his evidence is the same---so, for instance, it must give the same recommendation at HH as at TH. Then consider the following norm:

Prior Consensus Ada and Baz should have the same prior and both should plan to update on their private evidence by conditioning on it.

And the argument for this is that, if they don't, there's a quadruple of their priors and plans that (i) satisfy the constraint outlined above and (ii) together have greater total epistemic utility at each possible world; and there's a quadruple of their priors and plans that (i) satisfy the constraint outlined above and (ii) together have greater total expected pragmatic utility at each possible world. This is a corollary of an argument that Ray Briggs and I gave, and that Michael Nielsen corrected and improved on. So, if Ada and Baz are in agreement on their prior, and plan to stick with it rather than update on their evidence because that way they'll retain agreement, then they'll be accuracy dominated and pragmatically dominated.

You might wonder how this is possible. After all, whatever evidence Ada and Baz each receive, Prior Consensus requires them to update on it in a way that leads them to disagree, and we know that they are then accuracy and pragmatically dominated. This is true, and it would tell against the priors + updating plans recommended by Prior Consensus if there were some way for Ada and Baz to communicate after their evidence came in. It's true that, for each possible world, there is some credence function such that if, at each world, Ada and Baz were to have that credence function rather than the ones they obtain by updating their shared prior on their private evidence, then they'd end up with greater total accuracy and pragmatic utility. But, without the lines of communication, they can't have that.
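The Ada and Baz case is small enough to compute by hand, and the following Python sketch does so under assumptions of mine: the shared prior is uniform over the four worlds, and inaccuracy is measured by the Brier score (one strictly proper scoring rule among many, with lower scores being better). At each world it compares three totals: both sticking with the prior, each conditioning on their private evidence, and the unavailable option of both adopting the posterior that pooling their evidence would give.

```python
from fractions import Fraction

WORLDS = ["HH", "HT", "TH", "TT"]
PRIOR = {w: Fraction(1, 4) for w in WORLDS}  # assumed shared prior


def brier(credence, actual):
    """Brier inaccuracy of a credence function over worlds (lower is better)."""
    return sum((credence[w] - (1 if w == actual else 0)) ** 2 for w in WORLDS)


def condition(prior, evidence):
    """Condition the prior on a set of worlds."""
    total = sum(prior[w] for w in evidence)
    return {w: (prior[w] / total if w in evidence else Fraction(0)) for w in WORLDS}


def ada_evidence(actual):
    """Ada learns the outcome of the first toss."""
    return {w for w in WORLDS if w[0] == actual[0]}


def baz_evidence(actual):
    """Baz learns the outcome of the second toss."""
    return {w for w in WORLDS if w[1] == actual[1]}


totals = {}
for actual in WORLDS:
    ada = condition(PRIOR, ada_evidence(actual))
    baz = condition(PRIOR, baz_evidence(actual))
    pooled = condition(PRIOR, ada_evidence(actual) & baz_evidence(actual))
    totals[actual] = (
        2 * brier(PRIOR, actual),                   # both stick with the prior
        brier(ada, actual) + brier(baz, actual),    # each conditions privately
        2 * brier(pooled, actual),                  # both adopt the pooled posterior
    )

for actual, (stick, update, pool) in totals.items():
    print(actual, stick, update, pool)
```

On these assumptions the totals are the same at every world: 3/2 for sticking with the prior, 1 for conditioning, and 0 for the pooled posterior. So conditioning beats sticking at each world, while the pooled posterior beats both but requires communication that Ada and Baz do not have.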

So, by looking in some detail at the arguments for Consensus, we come to understand better the groups to which it applies and the norms that apply to those groups to which it doesn't apply in its full force.

Should we agree? II: a new pragmatic argument for consensus

2022-05-06T10:08:00.002+01:00

There is a PDF version of this blogpost available here.

In the previous post, I introduced the norm of Consensus. This is a claim about the rationality of groups. Suppose you've got a group of individuals. For each individual, call the set of propositions to which they assign a credence their agenda. They might all have quite different agendas, some of them might overlap, others might not. We might say that the credal states of these individual members cohere with one another if there is some probability function that is defined for any proposition that appears in any member's agenda, and the credences each member assigns to the propositions in their agenda match those assigned by this probability function to those propositions. Then Consensus says that a group is irrational if it does not cohere.

A group coming to consensus

In that post, I noted that there are two sorts of argument for this norm: a pragmatic argument and an epistemic argument. The pragmatic argument is a sure loss argument. It is based on the fact that, if the individuals in the group don't agree, there is a series of bets that their credences require them to accept that will, when taken together, lose the group money for sure. In this post, I want to argue that there is a problem with the sure loss argument for Consensus. It isn't peculiar to this argument, and indeed applies equally to any argument that tries to establish a rational requirement by showing that someone who violates it is exploitable. Indeed, I've raised it elsewhere against the sure loss argument for Probabilism (Section 6.2, Pettigrew 2020) and the money pump argument against non-exponential discounting and changing preferences in general (Section 13.7.4, Pettigrew 2019). I'll describe the argument here, and then offer a solution based on work by Mark Schervish (1989) and Ben Levinstein (2017). I've described this sort of solution before (Section 6.3, Pettigrew 2020), and Jason Konek (ta) has recently put it to interesting work addressing an issue with Julia Staffel's (2020) account of degrees of incoherence.

Sure loss and money pump arguments judge the rationality of attitudes, whether credences or preferences, by looking at the quality of the choices they require us to make. As Bishop Butler said, probability is the very guide of life. These arguments evaluate credences by exactly how well they provide that guide. So they are teleological arguments: they attempt to derive facts about the epistemic right---namely, what is rationally permissible---from facts about the epistemic good---namely, leading to pragmatically good choices.

Say that one sequence of choices dominates another if, taken together, the first leads to better outcomes for sure. Say that a collection of attitudes is exploitable if there is a sequence of decision problems you might face such that, if faced with them, these attitudes will require you to make a dominated sequence of choices.

For instance, take the sure loss argument for Probabilism: if you violate Probabilism because you believe $A\ \&\ B$ more strongly than you believe $A$, your credence in the former will require you to pay some amount of money for a bet that pays out a pound if $A\ \&\ B$ is true and nothing if it's false, and your credence in the latter will require you to sell for less money a bet that pays out a pound if $A$ is true and nothing if it's false; yet you'd be better off for sure rejecting both bets. So rejecting both bets dominates accepting both; your credences require you to accept both; so your credences are exploitable. Or take the money pump argument against cyclical preferences: if you prefer $A$ to $B$ and $B$ to $C$ and $C$ to $A$, then you'll choose $B$ when offered a choice between $B$ and $C$, you'll then pay some amount to swap to $A$, and you'll then pay some further amount to swap to $C$; yet you'd be better off for sure simply choosing $C$ in the first place and not swapping either time that possibility was offered. So choosing $C$ and sticking with it dominates the sequence of choices your preferences require; so your preferences are exploitable.
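To see the sure loss in numbers, here is a short Python sketch with hypothetical figures of my choosing: credence 0.6 in $A\ \&\ B$ and 0.4 in $A$, so that you will pay 55p for the first bet and sell the second for 45p. It tabulates the net payoff of accepting both bets at each combination of truth values for $A$ and $B$.

```python
STAKE = 100  # pence: each bet pays out a pound if it wins


def net_payoff(a, b, buy_price=55, sell_price=45):
    """Net payoff in pence of buying the bet on A & B and selling the bet on A.

    The prices are illustrative: 55 is below 0.6 * 100 and 45 is above
    0.4 * 100, so the assumed credences sanction both transactions.
    """
    bought = (STAKE if (a and b) else 0) - buy_price  # you hold the bet on A & B
    sold = sell_price - (STAKE if a else 0)           # you owe the payout on A
    return bought + sold


outcomes = {(a, b): net_payoff(a, b) for a in (True, False) for b in (True, False)}
for (a, b), pence in outcomes.items():
    print(f"A={a}, B={b}: {pence}p")
```

The net payoff is negative in every case, whereas rejecting both bets leaves you with exactly 0p whatever happens.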

But, I contend, the existence of a sequence of decision problems in response to which your attitudes require you to make a dominated series of choices does not on its own render those attitudes irrational. After all, it is just one possible sequence of decision problems you might face. And there are many other sequences you might face instead. The argument does not consider how your attitudes will require you to choose when faced with those alternative sequences, and yet surely that is relevant to assessing those attitudes. For it might be that, however bad the dominated sequence of choices that the attitudes require you to make when faced with the sequence of decision problems described in the argument for exploitability, there is another sequence of decision problems where those same attitudes require you to make a series of choices that are very good; indeed, they might be so good that they somehow outweigh the badness of the dominated sequence. So, instead of judging your attitudes by looking only at the outcome of choosing in line with them when faced with a single sequence of decision problems, we should rather judge them by looking at the outcome of choosing in line with them when faced with any decision problem that might come your way, weighting each by how likely you are to face it, to give a balanced view of the pragmatic benefits of having those credences. That's the approach I'll present now, and I'll show that it leads to a new and better pragmatic argument for Probabilism and Consensus.

As I presented them, the sure loss arguments for Probabilism and Consensus both begin with a principle that I called Ramsey's Thesis. This is a claim about the prices that an individual's credence in a proposition requires her to pay for a bet on that proposition. It says that, if $p$ is your credence in $A$ and $x < pS$, then you are required to pay $£x$ for a bet that pays out $£S$ if $A$ is true and $£0$ if $A$ is false. Now in fact this is a particular consequence of a more general norm about how our credences require us to choose. Let's call the more general norm Extended Ramsey's Thesis. It says how our credence in a proposition requires us to choose when faced with a series of options, all of whose payoffs depend only on the truth or falsity of that proposition. Given a proposition $A$, let's say that an option is an $A$-option if its payoffs at any two worlds at which $A$ is true are the same, and its payoffs at any two worlds at which $A$ is false are the same. Then, given a credence $p$ in $A$ and an $A$-option $a$, we say that the expected payoff of $a$ by the lights of $p$ is
$$
p \times \text{payoff of $a$ when $A$ is true} + (1-p) \times \text{payoff of $a$ when $A$ is false}
$$Now suppose you face a decision problem in which all of the available options are $A$-options. Then Extended Ramsey's Thesis says that you are required to pick an option whose expected payoff by the lights of your credence in $A$ is maximal.*

Next, we make a move that is reminiscent of the central move in I. J. Good's argument for Carnap's Principle of Total Evidence (Good 1967). We say what we take the payoff to be of having a particular credence in a particular proposition given a particular way the world is and when faced with a particular decision problem. Specifically, we define the payoff of having credence $p$ in the proposition $A$ when that proposition is true, and when you're faced with a decision problem $D$ in which all of the options are $A$-options, to be the payoff when $A$ is true of whichever $A$-option available in $D$ maximises expected payoff by the lights of $p$. And we define the payoff of having credence $p$ in the proposition $A$ when that proposition is false, and when you're faced with a decision problem $D$ in which all of the options are $A$-options, to be the payoff when $A$ is false of whichever $A$-option available in $D$ maximises expected payoff by the lights of $p$. So the payoff of having a credence is the payoff of the option you're required to pick using that credence.

Finally, we make the move that is central to Schervish's and Levinstein's work. We now know the payoff of having a particular credence in proposition $A$ when you face a decision problem in which all options are $A$-options. But of course we don't know which such decision problems we'll face. So, when we evaluate the payoff of having a credence in $A$ when $A$ is true, for instance, we look at all the decision problems populated by $A$-options we might face and weight them by how likely we are to face them and then take the payoff of having that credence when $A$ is true to be the expected payoff of the $A$-options it would lead us to choose when faced with those decision problems. And then we note, as Schervish and Levinstein themselves note: if we make certain natural assumptions about how likely we are to face different decisions, then this resulting measure of the pragmatic payoff of having credence $p$ in proposition $A$ is a continuous and strictly proper scoring rule. That is, mathematically, the functions we use to measure the pragmatic value of a credence function are identical to the functions we use to evaluate the epistemic value of a credence that we use in the epistemic utility argument for Probabilism and Consensus.**
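Here is a minimal Python sketch of this construction for one very simple family of decision problems. The assumptions are mine and are only one way of filling in the "natural assumptions" mentioned above: each decision problem offers a choice between buying a £1 bet on $A$ at some price and declining, and the prices are equally likely to be any of a grid of values between 0 and 1. The sketch computes the payoff of having credence $p$ when $A$ is true and when it is false, and then checks strict propriety on the grid: by the lights of credence $q$, the credence with the highest expected payoff is $q$ itself.

```python
from fractions import Fraction

N = 20
# Assumed decision problems: buy a £1 bet on A at price x, or decline,
# with x equally likely to be the midpoint of any of N slices of [0, 1].
PRICES = [Fraction(2 * i + 1, 2 * N) for i in range(N)]
CREDENCES = [Fraction(k, N) for k in range(N + 1)]


def chosen_payoff(p, price, a_is_true):
    """Payoff of the option that credence p requires you to pick."""
    if p - price > 0:  # buying has positive expected payoff by the lights of p
        return (1 if a_is_true else 0) - price
    return Fraction(0)  # otherwise decline


def payoff_of_credence(p, a_is_true):
    """Payoff of having credence p, averaged over the equally likely prices."""
    return sum(chosen_payoff(p, x, a_is_true) for x in PRICES) / N


def expected_payoff(q, p):
    """Expected payoff of having credence p, by the lights of credence q."""
    return q * payoff_of_credence(p, True) + (1 - q) * payoff_of_credence(p, False)


# For each q, find the credence whose payoff q expects to be highest.
best = {q: max(CREDENCES, key=lambda p: expected_payoff(q, p)) for q in CREDENCES}
print(all(best[q] == q for q in CREDENCES))
```

On this grid the payoff of credence $p$ works out to $p - \frac{p^2}{2}$ when $A$ is true and $-\frac{p^2}{2}$ when $A$ is false, which is a positive linear transformation of the negative Brier score.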

With this construction in place, we can piggyback on the theorems stated in the previous post to give new pragmatic arguments for Probabilism and Consensus. First: Suppose your credences do not obey Probabilism. Then there are alternative ones you might have instead that do obey that norm and, at any world, if we look at each decision problem you might face and ask what payoff you'd receive at that world were you to choose from the options in that decision problem as the two different sets of credences require, and then weight those payoffs by how likely you are to face that decision to give their expected payoff, then the alternatives will always have the greater expected payoff. This gives strong reason to obey Probabilism.

Second: Take a group of individuals. Now suppose the group's credences do not obey Consensus. Then there are alternative credences each member might have instead such that, if they were to have them, the group would obey Consensus and, at any world, if we look at each decision problem each member might face and ask what payoff that individual would receive at that world were they to choose from the options in that decision problem as the two different sets of credences require, and then weight those payoffs by how likely they are to face that decision to give their expected payoff, then the alternatives will always have the greater expected total payoff when this is summed across the whole group.

So that is our new and better pragmatic argument for Consensus. The sure loss argument points out a single downside to a group that violates the norm. Such a group is vulnerable to exploitation. But it remains silent on whether there are upsides that might balance out that downside. The present argument addresses that problem. It finds that, if a group violates the norm, there are alternative credences they might have that are guaranteed to serve them better in expectation as a basis for decision making.

* Notice that, if $x < pS$, then the expected payoff, by the lights of $p$, of paying $x$ for a bet that pays $S$ if $A$ is true and $0$ if $A$ is false is
$$
p(-x + S) + (1-p)(-x) = pS- x
$$
which is positive. So, if the two options are accept or reject the bet, accepting maximises expected payoff by the lights of $p$, and so it is required, as Ramsey's Thesis says.
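For a worked instance of this calculation, with numbers that are mine and chosen only for illustration, take $p = 0.7$, $S = 10$, and $x = 6$, so that $x < pS = 7$. Then
$$
p(-x + S) + (1-p)(-x) = 0.7 \times 4 + 0.3 \times (-6) = 2.8 - 1.8 = 1 = pS - x
$$
which is positive, so accepting the bet is what a credence of $0.7$ requires.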

** Konek (ta) gives a clear formal treatment of this solution. For those who want the technical details, I'd recommend the Appendix of that paper. I think he presents it better than I did in (Pettigrew 2020).

References

Good, I. J. (1967). On the Principle of Total Evidence. The British Journal for the Philosophy of Science, 17, 319–322.

Konek, J. (ta). Degrees of incoherence, Dutch bookability & guidance value. Philosophical Studies.

Levinstein, B. A. (2017). A Pragmatist’s Guide to Epistemic Utility. Philosophy of Science, 84(4), 613–638.

Pettigrew, R. (2019). Choosing for Changing Selves. Oxford, UK: Oxford University Press.

Pettigrew, R. (2020). Dutch Book Arguments. Elements in Decision Theory and Philosophy. Cambridge, UK: Cambridge University Press.

Schervish, M. J. (1989). A general method for comparing probability assessors. The Annals of Statistics, 17, 1856–1879.

Staffel, J. (2020). Unsettled Thoughts. Oxford University Press.

Should we agree? I: the arguments for consensus

2022-05-05T07:50:00.001+01:00

You can find a PDF of this blogpost here.

Should everyone agree with everyone else? Whenever two members of a group have an opinion about the same claim, should they both be equally confident in it? If this is sometimes required of groups, of which ones is it required and when? Whole societies at any time in their existence? Smaller collectives when they're engaged in some joint project?

Of course, you might think these are purely academic questions, since there's no way we could achieve such consensus even if we were to conclude that it is desirable, but that seems too strong. Education systems and the media can be deployed to push a population towards consensus, and indeed this is exactly how authoritarian states often proceed. Similarly, social sanctions can create incentives for conformity. So it seems that a reasonable degree of consensus might be possible.

But is it desirable? In this series of blogposts, I want to explore two formal arguments. They purport to establish that groups should be in perfect agreement; and they explain why getting closer to consensus is better, even if perfect agreement isn't achieved---in this case, a miss is not as good as a mile. It's still a long way from their conclusions to practical conclusions about how to structure a society, but they point sufficiently strongly in a surprising direction that it is worth exploring them. In this first post, I set out the arguments as they have been given in the literature and polish them up a bit so that they are as strong as possible.

Since they're formal arguments, they require a bit of mathematics, both in their official statement and in the results on which they rely. But I want to make the discussion as accessible as possible, so, in the main body of the blogpost, I state the arguments almost entirely without formalism. Then, in the technical appendix, I sketch some of the formal detail for those who are interested.

Two sorts of argument for credal norms

There are two sorts of argument we most often use to justify the norms we take to govern our credences: there are pragmatic arguments, of which the betting arguments are the most famous; and there are epistemic arguments, of which the epistemic utility arguments are the most well known.

Take the norm of Probabilism, for instance, which says that your credences should obey the axioms of the probability calculus. The betting argument for Probabilism is sometimes known as the Dutch Book or sure loss argument.* It begins by claiming that the maximum amount you are willing to pay for a bet on a proposition that pays out a certain amount if the proposition is true and nothing if it is false is proportional to your credence in that proposition. Then it shows that, if your credences do not obey the probability axioms, there is a set of bets each of which they require you to accept, but which when taken together lose you money for sure; and if your credences do obey those axioms, there is no such set of bets.

The epistemic utility argument for Probabilism, on the other hand, begins by claiming that any measure of the epistemic value of credences must have certain properties.** It then shows that, by the lights of any epistemic utility function that does have those properties, if your credences do not obey the probability axioms, then there are alternatives that are guaranteed to have greater epistemic utility than yours; and if they do obey those axioms, there are no such alternatives.

Bearing all of this in mind, consider the following two facts.

(I) Suppose we make the same assumptions about which bets an individual's credences require them to accept that we make in the betting argument for Probabilism. Then, if two members of a group assign different credences to the same proposition, there is a bet the first should accept and a bet the second should accept that, taken together, leave the group poorer for sure (Ryder 1981, Gillies 1991).

(II) Suppose we measure the epistemic value of credences using an epistemic utility function that boasts the properties required of it by the epistemic utility argument for Probabilism. Then, if two members of a group assign different credences to the same proposition, there is a single credence such that the group is guaranteed to have greater total epistemic utility if every member adopts that single credence in that proposition (Kopec 2012).

Given the epistemic utility and betting arguments for Probabilism, neither (I) nor (II) is very surprising. After all, one consequence of Probabilism is that an individual must assign the same credence to two propositions that have the same truth value as a matter of logic. But from the point of view of the betting argument or the epistemic utility argument, this is structurally identical to the requirement that two different people assign the same credence to the same proposition, since obviously a single proposition necessarily has the same truth value as itself! However we construct the sure loss bets against the individual who violates the consequence of Probabilism, we can use an analogous strategy to construct the sure loss bets against the pair who disagree in the credences they assign. And however we construct the alternative credences that are guaranteed to be more accurate than the ones that violate the consequence of Probabilism, we can use an analogous strategy to construct the alternative credence that, if adopted by all members of the group that contains two individuals who currently disagree, would increase their total epistemic utility for sure.
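To illustrate fact (I), here is a minimal sketch in Python of a pair of bets against two people who disagree. The credences, prices, and stakes are hypothetical numbers of mine; the only thing taken from the betting argument is the assumption that a credence $p$ requires you to pay $x$ for a bet with stake $S$ whenever $x < pS$, where the stake may be negative.

```python
def required_to_accept(credence, price, stake):
    """A credence requires paying `price` for a bet that pays `stake` if A is
    true and 0 otherwise whenever price < credence * stake."""
    return price < credence * stake


def group_net(bets, a_is_true):
    """Total payout minus total price across all the bets the group holds."""
    return sum((stake if a_is_true else 0.0) - price for price, stake in bets)


# Hypothetical disagreement about a single proposition A.
low_credence, high_credence = 0.3, 0.7

# The more confident member pays 0.6 for a bet that pays 1 if A is true.
bet_for_high = (0.6, 1.0)
# The less confident member is paid 0.4 to take on a bet that costs 1 if A is true
# (a negative price and a negative stake).
bet_for_low = (-0.4, -1.0)

bets = [bet_for_high, bet_for_low]
```

Each member's credence requires them to accept their bet, and yet `group_net(bets, True)` and `group_net(bets, False)` are both negative: the group is 0.2 poorer however the world turns out.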

Just as a betting argument and an epistemic utility argument aim to establish the individual norm of Probabilism, we might ask whether there is a group norm for which we can give a betting argument and an epistemic utility argument by appealing to (I) and (II). That is the question I'd like to explore in these posts. In the remainder of this post, I'll spell out the details of the epistemic utility argument and the betting argument for Probabilism, and then adapt those to give analogous arguments for Consensus.

The Epistemic Utility Argument for Probabilism

Two small bits of terminology first:

Your agenda is the set of propositions about which you have an opinion. We'll assume throughout that all individuals have finite agendas.
Your credence function takes each proposition in your agenda and returns your credence in that proposition.

With those in hand, we can state Probabilism:

Probabilism Rationality requires of an individual that their credence function is a probability function.

What does it mean to say that a credence function is a probability function? There are two cases to consider.

First, suppose that, whenever a proposition is in your agenda, its negation is as well; and whenever two propositions are in your agenda, their conjunction and their disjunction are as well. When this holds, we say that your agenda is a Boolean algebra. And in that case your credence function is a probability function if two conditions hold: first, you assign the minimum possible credence, namely 0, to any contradiction and the maximum possible credence, namely 1, to any tautology; second, your credence in a disjunction is the sum of your credences in the disjuncts less your credence in their conjunction (just like the number of people in two groups is the number in the first plus the number in the second less the number in both).

Second, suppose that your agenda is not a Boolean algebra. In that case, your credence function is a probability function if it is possible to extend it to a probability function on the smallest Boolean algebra that contains your agenda. That is, it's possible to fill out your agenda so that it's closed under negation, conjunction, and disjunction, and then extend your credence function so that it assigns credences to those new propositions in such a way that the result is a probability function on the expanded agenda. Defining probability functions on agendas that are not Boolean algebras allows us to say, for instance, that, if your agenda is just It will be windy tomorrow and It will be windy and rainy tomorrow, and you assign credence 0.6 to It will be windy and 0.8 to It will be windy and rainy, then you violate Probabilism because there's no way to assign credences to It won't be windy, It will be windy or rainy, It won't be rainy, etc. in such a way that the result is a probability function.

The Epistemic Utility Argument for Probabilism begins with three claims about how to measure the epistemic value of a whole credence function. The first is Individual Additivity, which says that the epistemic utility of a whole credence function is simply the sum of the epistemic utilities of the individual credences it assigns. The second is Continuity, which says that, for any proposition, the epistemic utility of a credence in that proposition is a continuous function of that credence. And the third is Strict Propriety, which says that, for any proposition, each credence in that proposition should expect itself to have greater epistemic utility than it expects any alternative credence in that proposition to have. With this account in hand, the argument then appeals to a mathematical theorem, which tells us two consequences of measuring epistemic value using an epistemic utility function that has the three properties just described, namely, Individual Additivity, Continuity, and Strict Propriety.
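Strict Propriety is easiest to see with an example. The sketch below, in Python, uses the negative Brier score as the measure of epistemic utility; that choice is mine, made because it is a familiar measure with all three properties, and nothing in the argument depends on it. The function `best_credence` is a hypothetical helper that searches a grid of candidate credences for the one that a given credence $p$ expects to do best.

```python
def brier_utility(credence, truth_value):
    """Negative squared distance between a credence and the truth value (1 or 0)."""
    return -(truth_value - credence) ** 2


def expected_utility(p, q):
    """The epistemic utility that credence p expects credence q to have."""
    return p * brier_utility(q, 1) + (1 - p) * brier_utility(q, 0)


def best_credence(p, steps=100):
    """The credence on an evenly spaced grid that p expects to do best."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda q: expected_utility(p, q))
```

If the measure is strictly proper, then `best_credence(p)` should return `p` itself whenever `p` lies on the grid: each credence expects itself to do better than any alternative.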

(i) For any credence function that violates Probabilism, there is a credence function defined on the same agenda that satisfies it and that has greater epistemic utility regardless of how the world turns out. In this case, we say that the alternative credence function dominates the original one.

(ii) For any credence function that is a probability function, there is no credence function that dominates it. Indeed, there is no alternative credence function that is even as good as it at every world. For any alternative, there will be some world where that alternative is strictly worse.

The argument concludes by claiming that an option is irrational if there is some alternative that is guaranteed to be better and no option that is guaranteed to be better than that alternative.

The Epistemic Utility Argument for Consensus

As I stated it above, and as it is usually stated in the literature, Consensus says that, whenever two members of a group assign credences to the same proposition, they should assign the same credence. But in fact the epistemic argument in its favour establishes something stronger. Here it is:

Consensus Rationality requires of a group that there is a single probability function defined on the union of the agendas of all of the members of the group such that the credence function of each member assigns the same credence to any proposition in their agenda as this probability function does.

This goes further than simply requiring that all agents agree on the credence they assign to any proposition to which they all assign credences. Indeed, it would place constraints even on a group whose members' agendas do not overlap at all. For instance, if you have credence 0.6 that it will be rainy tomorrow, while I have credence 0.8 that it will be rainy and windy, the pair of us will jointly violate Consensus, even though we don't assign credences to any of the same propositions, since no probability function assigns 0.6 to one proposition and 0.8 to the conjunction of that proposition with another one. In these cases, we say that the group's credences don't cohere.

One notable feature of Consensus is that it purports to govern groups, not individuals, and we might wonder what it could mean to say that a group is irrational. I'll return to that in a later post. It will be useful to have the epistemic utility and betting arguments for Consensus to hand first.

The Epistemic Utility Argument for Consensus begins, as the epistemic argument for Probabilism does, with Individual Additivity, Continuity, and Strict Propriety. And it adds to those Group Additivity, which says that a group's epistemic utility is the sum of the epistemic utilities of the credence functions of its members. With this account of group epistemic value in hand, the argument then appeals again to a mathematical theorem, but a different one, which tells us two consequences of Group and Individual Additivity, Continuity, and Strict Propriety:***

(i) For any group that violates Consensus, there is, for each individual, an alternative credence function defined on their agenda that they might adopt such that, if all were to adopt these, the group would satisfy Consensus and it would be more accurate regardless of how the world turns out. In this case, we say that the alternative credence functions collectively dominate the original ones.

(ii) For any group that satisfies Consensus, there are no credence functions the group might adopt that collectively dominate it.
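Here is a small illustration of collective dominance, sketched in Python. The two credences are hypothetical, and I again use the negative Brier score as the measure of epistemic utility, which is my choice for the example and only one of the measures the theorem covers.

```python
def brier_utility(credence, truth_value):
    """Negative squared distance between a credence and the truth value (1 or 0)."""
    return -(truth_value - credence) ** 2


def group_utility(credences, truth_value):
    """Group Additivity: the group's utility is the sum of its members' utilities."""
    return sum(brier_utility(c, truth_value) for c in credences)


# Two members who disagree about a single proposition A (hypothetical numbers).
original = [0.3, 0.7]
# The alternative on which both adopt the average credence.
consensus = [0.5, 0.5]
```

Whether $A$ is true or false, `group_utility(consensus, ...)` is $-0.5$ while `group_utility(original, ...)` is $-0.58$, so the alternative credences do better for the group at every world.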

The argument concludes by assuming again the norm that an option is irrational if there is some alternative that is guaranteed to be better.

The Sure Loss Argument for Probabilism

The Sure Loss Argument for Probabilism begins with a claim that I call Ramsey's Thesis. It tells you the prices at which your credences require you to buy and sell bets. It says that, if your credence in $A$ is $p$, and $£x < £pS$, then you should be prepared to pay $£x$ for a bet that pays out $£S$ if $A$ is true and $£0$ if $A$ is false. And this is true for any stakes $S$, whether positive, negative, or zero. Then it appeals to a mathematical theorem, which tells us two consequences of Ramsey's Thesis.

(i) For any credence function that violates Probabilism, there is a series of bets, each of which your credences require you to accept, that, taken together, lose you money for sure.

(ii) For any credence function that satisfies Probabilism, there is no such series of bets.

The argument concludes by assuming a norm that says that it is irrational to have credences that require you to make a series of choices when there is an alternative series of choices you might have made that would be better regardless of how the world turns out.
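A simple case shows how such a series of bets is put together. In the Python sketch below, the credences and prices are hypothetical: the individual has credence 0.6 in $A$ and credence 0.6 in its negation, which violates Probabilism because the two should sum to 1.

```python
# Hypothetical credences that violate Probabilism: they sum to 1.2.
credences = {"A": 0.6, "not A": 0.6}

# Each bet is (proposition, price, stake): pay the price, receive the stake
# if the proposition is true and nothing otherwise.
bets = [("A", 0.55, 1.0), ("not A", 0.55, 1.0)]


def required_to_accept(credence, price, stake):
    """Ramsey's Thesis: credence p requires paying x for stake S when x < p * S."""
    return price < credence * stake


def net_payoff(bets, a_is_true):
    """Total winnings minus total price paid, at a world where A is true or false."""
    total = 0.0
    for proposition, price, stake in bets:
        wins = a_is_true if proposition == "A" else not a_is_true
        total += (stake if wins else 0.0) - price
    return total
```

Ramsey's Thesis requires the individual to accept both bets, since $0.55 < 0.6 \times 1$ in each case. But exactly one of the bets pays out at any world, so they pay 1.10 in total and receive 1, losing 0.10 for sure.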

The Sure Loss Argument for Consensus

The Sure Loss Argument for Consensus also begins with Ramsey's Thesis. It appeals to a mathematical theorem that tells us two consequences of Ramsey's Thesis.

(i) For any group that violates Consensus, there is a series of bets, each offered to a member of the group whose credences require that they accept it, that, taken together, lose the group money for sure.

(ii) For any group that satisfies Consensus, there is no such series of bets.

And it concludes by assuming that it is irrational for the members of a group to have credences that require them to make a series of choices when there is an alternative series of choices they might have made that would be better for the group regardless of how the world turns out.

So now we have the Epistemic Utility and Sure Loss Arguments for Consensus. In fact, I think the Sure Loss Argument doesn't work. So in the next post I'll say why and provide a better alternative based on work by Mark Schervish and Ben Levinstein. But in the meantime, here's the technical appendix.

Technical appendix

First, note that Probabilism is the special case of Consensus when the group has only one member. So we focus on establishing Consensus.

Some definitions to begin:

If $c$ is a credence function defined on the agenda $\mathcal{F}_i = \{A^i_1, \ldots, A^i_{k_i}\}$, represent it as a vector as follows:$$c = \langle c(A^i_1), \ldots, c(A^i_{k_i})\rangle$$
Let $\mathcal{C}_i$ be the set of credence functions defined on $\mathcal{F}_i$, represented as vectors in this way.
If $c_1, \ldots, c_n$ are credence functions defined on $\mathcal{F}_1, \ldots, \mathcal{F}_n$ respectively, represent them collectively as a vector as follows:
$$
c_1 \frown \ldots \frown c_n = \langle c_1(A^1_1), \ldots, c_1(A^1_{k_1}), \ldots, c_n(A^n_1), \ldots, c_n(A^n_{k_n}) \rangle
$$
Let $\mathcal{C}$ be the set of sequences of credence functions defined on $\mathcal{F}_1, \ldots, \mathcal{F}_n$ respectively, represented as vectors in this way.
If $w$ is a classically consistent assignment of truth values to the propositions in $\mathcal{F}_i$, represent it as a vector $$w = \langle w(A^i_1), \ldots, w(A^i_{k_i})\rangle$$ where $w(A) = 1$ if $A$ is true according to $w$, and $w(A) = 0$ if $A$ is false according to $w$.
Let $\mathcal{W}_i$ be the set of classically consistent assignments of truth values to the propositions in $\mathcal{F}_i$, represented as vectors in this way.
If $w$ is a classically consistent assignment of truth values to the propositions in $\mathcal{F} = \bigcup^n_{i=1} \mathcal{F}_i$, represent the restriction of $w$ to $\mathcal{F}_i$ by the vector $$w_i = \langle w(A^i_1), \ldots, w(A^i_{k_i})\rangle$$So $w_i$ is in $\mathcal{W}_i$. And represent $w$ as a vector as follows:
$$
w = w_1 \frown \ldots \frown w_n = \langle w(A^1_1), \ldots, w(A^1_{k_1}), \ldots, w(A^n_1), \ldots, w(A^n_{k_n})\rangle
$$
Let $\mathcal{W}$ be the set of classically consistent assignments of truth values to the propositions in $\mathcal{F}$, represented as vectors in this way.

Then we have the following result, which generalizes a result due to de Finetti (1974):

Proposition 1 A group of individuals with credence functions $c_1, \ldots, c_n$ satisfy Consensus iff $c_1 \frown \ldots \frown c_n$ is in the closed convex hull of $\mathcal{W}$.
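To see Proposition 1 at work on the example from the main text, where you assign a credence to It will be rainy and I assign one to It will be rainy and windy, here is a minimal sketch in Python. It is specific to that two-proposition case: there are just three classically consistent truth-value assignments, so the weights that express a credence vector as a mixture of them can be read off directly. The function names are hypothetical.

```python
# Truth values of (Rainy, Rainy and Windy) at the three consistent assignments.
WORLDS = [(1, 1), (1, 0), (0, 0)]


def hull_weights(c_rainy, c_both):
    """The unique weights on WORLDS that sum to 1 and whose weighted average
    is the vector (c_rainy, c_both)."""
    return (c_both, c_rainy - c_both, 1 - c_rainy)


def satisfies_consensus(c_rainy, c_both):
    """The joint vector lies in the convex hull of WORLDS exactly when every
    weight is non-negative."""
    return all(weight >= 0 for weight in hull_weights(c_rainy, c_both))
```

With your credence of 0.6 in rain and mine of 0.8 in rain and wind, the middle weight is negative, so `satisfies_consensus(0.6, 0.8)` is `False`; lowering my credence to 0.4 brings the pair inside the hull.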

We then appeal to two sets of results. The first concerns epistemic utility measures, and generalizes a result due to Predd et al. (2009):

Theorem 1

(i) Suppose $\mathfrak{A}_i : \mathcal{C}_i \times \mathcal{W}_i \rightarrow [-\infty, 0]$ is a measure of epistemic utility that satisfies Individual Additivity, Continuity, and Strict Propriety. Then there is a Bregman divergence $\mathfrak{D}_i : \mathcal{C}_i \times \mathcal{C}_i \rightarrow [0, \infty]$ such that $\mathfrak{A}_i(c, w) = -\mathfrak{D}_i(w, c)$.

(ii) Suppose $\mathfrak{D}_1, \ldots, \mathfrak{D}_n$ are Bregman divergences defined on $\mathcal{C}_1, \ldots, \mathcal{C}_n$, respectively. And suppose $\mathcal{Z}$ is a closed convex subset of $\mathcal{C}$. And suppose $c_1 \frown \ldots \frown c_n$ is not in $\mathcal{Z}$. Then there is $c^\star_1 \frown \ldots \frown c^\star_n$ in $\mathcal{Z}$ such that, for all $z_1 \frown \ldots \frown z_n$ in $\mathcal{Z}$,
$$
\sum^n_{i=1} \mathfrak{D}_i(z_i, c^\star_i) < \sum^n_{i=1} \mathfrak{D}_i(z_i, c_i)
$$

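So, by Proposition 1, if a group $c_1, \ldots, c_n$ does not satisfy Consensus, then $c_1 \frown \ldots \frown c_n$ is not in the closed convex hull of $\mathcal{W}$, and so by Theorem 1(ii) there is $c^\star_1 \frown \ldots \frown c^\star_n$ in the closed convex hull of $\mathcal{W}$ such that, for all $w = w_1 \frown \ldots \frown w_n$ in $\mathcal{W}$, $$\sum^n_{i=1} \mathfrak{D}_i(w_i, c^\star_i) < \sum^n_{i=1} \mathfrak{D}_i(w_i, c_i)$$ since each such $w$ is itself in that hull. And so, by Theorem 1(i), $$\sum^n_{i=1} \mathfrak{A}_i(c_i, w_i) < \sum^n_{i=1} \mathfrak{A}_i(c^\star_i, w_i)$$ as required.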

Second, concerning bets, which is a consequence of the Separating Hyperplane Theorem:

Theorem 2
Suppose $\mathcal{Z}$ is a closed convex subset of $\mathcal{C}$. And suppose $c_1 \frown \ldots \frown c_n$ is not in $\mathcal{Z}$. Then there are vectors
$$
x = \langle x^1_1, \ldots, x^1_{k_1}, \ldots, x^n_1, \ldots, x^n_{k_n}\rangle
$$
and
$$
S = \langle S^1_1, \ldots, S^1_{k_1}, \ldots, S^n_1, \ldots, S^n_{k_n}\rangle
$$
such that, for all $x^i_j$ and $S^i_j$,
$$
x^i_j < c_i(A^i_j)S^i_j
$$
and, for all $z$ in $\mathcal{Z}$,
$$
\sum^n_{i=1} \sum^{k_i}_{j = 1} x^i_j > \sum^n_{i=1} \sum^{k_i}_{j=1} z^i_jS^i_j
$$

So, by Proposition 1, if a group $c_1, \ldots, c_n$ does not satisfy Consensus, then $c_1 \frown \ldots \frown c_n$ is not in the closed convex hull of $\mathcal{W}$, and so, by Theorem 2, there are $x = \langle x^1_1, \ldots, x^1_{k_1}, \ldots, x^n_1, \ldots, x^n_{k_n}\rangle$ and $S = \langle S^1_1, \ldots, S^1_{k_1}, \ldots, S^n_1, \ldots, S^n_{k_n}\rangle$ such that (i) for all $i$ and $j$, $x^i_j < c_i(A^i_j)S^i_j$ and (ii) for all $w$ in $\mathcal{W}$,
$$\sum^n_{i=1} \sum^{k_i}_{j = 1} x^i_j > \sum^n_{i=1} \sum^{k_i}_{j=1} w(A^i_j)S^i_j$$
But then (i) says that the credences of individual $i$ require them to pay $£x^i_j$ for a bet on $A^i_j$ that pays out $£S^i_j$ if $A^i_j$ is true and $£0$ if it is false. And (ii) says that the total price of these bets across all members of the group, namely $£\sum^n_{i=1} \sum^{k_i}_{j = 1} x^i_j$, is greater than the amount the bets will pay out at any world, namely $£\sum^n_{i=1} \sum^{k_i}_{j=1} w(A^i_j)S^i_j$.

* This was introduced independently by Frank P. Ramsey (1931) and Bruno de Finetti (1937). For overviews, see (Hájek 2008, Vineberg 2016, Pettigrew 2020).

** Much of the discussion of these arguments in the literature focusses on versions on which the epistemic value of a credence is taken to be its accuracy. This literature begins with Rosenkrantz (1981) and Joyce (1998). But, following Joyce (2009) and Predd et al. (2009), it has been appreciated that we need not necessarily assume that accuracy is the only source of epistemic value in order to get the argument going.

*** Matthew Kopec (2012) offers a proof of a slightly weaker result. It doesn't quite work because it assumes that all strictly proper measures of epistemic value are convex, when they are not; the spherical scoring rule is not. I offer an alternative proof of the stronger result in the technical appendix.

References

de Finetti, B. (1937 [1980]). Foresight: Its Logical Laws, Its Subjective Sources. In H. E. Kyburg, & H. E. Smokler (Eds.) Studies in Subjective Probability. Huntington, N. Y.: Robert E. Krieger Publishing Co.

de Finetti, B. (1974). Theory of Probability, vol. I. New York: John Wiley & Sons.

Gillies, D. (1991). Intersubjective probability and confirmation theory. The British Journal for the Philosophy of Science, 42(4), 513–533.

Hájek, A. (2008). Dutch Book Arguments. In P. Anand, P. Pattanaik, & C. Puppe (Eds.) The Oxford Handbook of Rational and Social Choice, (pp. 173–195). Oxford: Oxford University Press.

Joyce, J. M. (1998). A Nonpragmatic Vindication of Probabilism. Philosophy of Science, 65(4), 575–603.

Joyce, J. M. (2009). Accuracy and Coherence: Prospects for an Alethic Epistemology of Partial Belief. In F. Huber, & C. Schmidt-Petri (Eds.) Degrees of Belief. Dordrecht and Heidelberg: Springer.

Kopec, M. (2012). We ought to agree: A consequence of repairing Goldman’s group scoring rule. Episteme, 9(2), 101–114.

Pettigrew, R. (2020). Dutch Book Arguments. Cambridge University Press.

Predd, J., Seiringer, R., Lieb, E. H., Osherson, D., Poor, V., & Kulkarni, S. (2009). Probabilistic Coherence and Proper Scoring Rules. IEEE Transactions on Information Theory, 55(10), 4786–4792.

Ramsey, F. P. (1926 [1931]). Truth and Probability. In R. B. Braithwaite (Ed.) The Foundations of Mathematics and Other Logical Essays, chap. VII, (pp. 156–198). London: Kegan, Paul, Trench, Trubner & Co.

Rosenkrantz, R. D. (1981). Foundations and Applications of Inductive Probability. Atascadero, CA: Ridgeview Press.

Ryder, J. (1981). Consequences of a simple extension of the Dutch Book argument. The British Journal for the Philosophy of Science, 32(2), 164–167.

Vineberg, S. (2016). Dutch Book Arguments. In E. N. Zalta (Ed.) Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University.

What we together risk: three vignettes in search of a theory

2021-04-13T11:14:00.003+01:00

For a PDF version of this post, see here.

Many years ago, I was climbing Sgùrr na Banachdich with my friend Alex. It's a mountain in the Black Cuillin, a horseshoe of summits that surround Loch Coruisk at the southern end of the Isle of Skye. It's a Munro---that is, it stands over 3,000 feet above sea level---but only just---it measures 3,166 feet. About halfway through our ascent, the mist rolled in and the rain came down heavily, as it often does near these mountains, which attract their own weather system. At that point, my friend and I faced a choice: to continue our attempt on the summit or begin our descent. Should we continue, there were a number of possible outcomes: we might reach the summit wet and cold but not injured, with the mist and rain gone and in their place sun and views across to Bruach na Frìthe and the distinctive teeth-shaped peaks of Sgùrr nan Gillean; or we might reach the summit without injury, but the mist might remain, obscuring any view at all; or we might get injured on the way and either have to descend early under our own steam or call for help getting off the mountain. On the other hand, should we start our descent now, we would of course have no chance of the summit, but we were sure to make it back unharmed, for the path back is good and less affected by rain.

Alex and I had climbed together a great deal that summer and the summer before. We had talked at length about what we enjoyed in climbing and what we feared. To the extent that such comparisons make sense and can be known, we both knew that we both gained exactly the same pleasure from reaching a summit, the same additional pleasure if the view was clear; we gained the same displeasure from injury, the same horror at the thought of having to call for assistance getting off a mountain. What's more, we both agreed exactly on how likely each possible outcome was: how likely we were to sustain an injury should we persevere; how likely that the mist would clear in the coming few hours; and so on. Nonetheless, I wished to turn back, while Alex wanted to continue.

How could that be? We both agreed how good or bad each of the options was, and both agreed how likely each would be were we to take either of the courses of action available to us. Surely we should therefore have agreed on which course of action would maximise our expected utility, and therefore agreed which would be best to undertake. Yes, we did agree on which course of action would maximise our expected utility. However, no, we did not therefore agree on which was best, for there are theories of rational decision-making that do not demand that you must rank options by their expected utility. These are the risk-sensitive decision theories, and they include John Quiggin's rank-dependent decision theory and Lara Buchak's risk-weighted expected utility theory. According to Quiggin's and Buchak's theories, what you consider best is not determined only by your utilities and your probabilities, but also by your attitudes to risk. The more risk-averse will give greater weight to the worst-case scenarios and less to the best-case ones than expected utility demands; the more risk-inclined will give greater weight to the best outcomes and less to the worst than expected utility does; and the risk-neutral person will give exactly the weights prescribed by expected utility theory. So, perhaps I preferred to begin our descent from Sgùrr na Banachdich while Alex preferred to continue upwards because I was risk-averse and he was risk-neutral or risk-seeking, or I was risk-neutral and he was risk-seeking. In any case, he must have been less risk-averse than I was.
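For readers who want to see how two people with the same utilities and probabilities can rank the options differently, here is a minimal sketch in Python of the rank-dependent calculation that underlies these theories. All of the numbers are hypothetical, invented to fit the story, and the risk function $r(p) = p^2$ is just one convenient example of a risk-averse attitude. The calculation starts from the worst outcome and adds each successive improvement, weighted by a function of the probability of doing at least that well.

```python
def risk_weighted_expected_utility(outcomes, risk):
    """Rank-dependent value of a gamble.

    outcomes: list of (probability, utility) pairs whose probabilities sum to 1.
    risk: a function from [0, 1] to [0, 1] with risk(0) = 0 and risk(1) = 1.
    """
    ordered = sorted(outcomes, key=lambda pair: pair[1])  # worst outcome first
    value = ordered[0][1]  # you are guaranteed at least the worst utility
    for k in range(1, len(ordered)):
        # Probability of getting an outcome at least as good as the k-th.
        prob_at_least = sum(p for p, _ in ordered[k:])
        value += risk(prob_at_least) * (ordered[k][1] - ordered[k - 1][1])
    return value


# Hypothetical probabilities and utilities for the two courses of action.
continue_up = [(0.2, -10.0), (0.4, 4.0), (0.4, 10.0)]  # injury, misty summit, clear summit
descend = [(1.0, 2.0)]  # a safe walk back


def risk_neutral(p):
    return p  # recovers ordinary expected utility


def risk_averse(p):
    return p ** 2  # gives less weight to the chance of better outcomes
```

On these numbers, the risk-neutral climber values continuing at 3.6 and descending at 2, and so prefers to continue, while the risk-averse climber values continuing at about -0.08 and so prefers to descend.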

Of course, as it turned out, we sat on a mossy rock in the rain and discussed what to do. We decided to turn back. Luckily, as it happened, for a thunderstorm hit the mountains an hour later at just the time we'd have been returning from the summit. But suppose we weren't able to discuss the decision. Suppose we'd roped ourselves together to avoid getting separated in the mist, and he'd taken the lead, forcing him to make the choice on behalf of both of us. In that case, what should he have done?

As I will do throughout these reflections, let me simply report my own reaction to the case. I think, in that case, Alex should have chosen to descend (and not only because that was my preference---I'd have thought the same had it been he who wished to descend and me who wanted to continue!). Had he chosen to continue---even if all had turned out well and we'd reached the summit unharmed and looked over the Cuillin ridge in the sun---I would still say that he chose wrongly on our behalf. This suggests the following principle (in joint work, Ittay Nissan Rozen and Jonathan Fiat argue for a version of this principle that applies in situations in which the individuals do not assign the same utilities to the outcomes):

Principle 1 Suppose two people assign the same utilities to the possible outcomes, and assign the same probabilities to the outcomes conditional on choosing a particular course of action. And suppose that you are required to choose between those courses of action on their behalf. Then you must choose whatever the more risk-averse of the two would choose.

However, I think the principle is mistaken. A few years after our unsuccessful attempt on Sgùrr na Banachdich, I was living in Bristol and trying to decide whether to take up a postdoctoral fellowship there or a different one based in Paris (a situation that seems an unimaginable luxury and privilege when I look at today's academic job market). Staying in Bristol was the safe bet; moving to Paris was a gamble. I already knew what it would be like to live in Bristol and what the department was like. I knew I'd enjoy it a great deal. I'd visited Paris, but I didn't know what it would be like to live there, and I knew the philosophical scene even less. I knew I'd enjoy living there, but I didn't know how much. I figured I might enjoy it a great deal more than Bristol, but also I might enjoy it somewhat less. The choice was complicated because my partner at the time would move too, if that's what we decided to do. Fortunately, just as Alex and I agreed on how much we valued the different outcomes that faced us on the mountain, so my partner and I agreed on how much we'd value staying in Bristol, how much we'd value living in Paris under the first, optimistic scenario, and how much we'd value living there under the second, more pessimistic scenario. We also agreed how likely the two Parisian scenarios were---we'd heard the same friends describing their experiences of living there, and we'd drawn the same conclusions about how likely we were to value the experience ourselves to different extents. Nonetheless, just as Alex and I had disagreed on whether or not to start our descent despite our shared utilities and probabilities, so my partner and I disagreed on whether or not to move to Paris. Again the more risk-averse of the two, I wanted to stay in Bristol, while he wanted to move to Paris. Again, of course, we sat down to discuss this. But suppose that hadn't been possible. 
Perhaps my partner had to make the decision for both of us at short notice and I was not available to consult. How should he have chosen?

In this case, I think either choice would have been permissible. My partner might have chosen Paris or he might have chosen Bristol and either of these would have been allowed. But of course this runs contrary to Principle 1.

So what is the crucial difference between the decision on Sgùrr na Banachdich and the decision whether to move cities? In each case, there is an option---beginning our descent or staying in Bristol---that is certain to have a particular level of value; and there is an alternative option---continuing to climb or moving to Paris---that might give less value than the sure thing, but might give more. And, in each case, the more risk-averse person prefers the sure thing to the gamble, while the more risk-inclined prefers the gamble. So why must someone choosing for me and Alex in the first case choose to descend, while someone choosing for me and my partner in the second case may choose either Bristol or Paris?

Here's my attempt at a diagnosis: in the choice of cities, there is no risk of harm, while in the decision on the mountain, there is. On the mountain, the gamble opens up a possible outcome in which we're harmed---we are injured, perhaps quite badly. In the choice of cities, the gamble doesn't do that---we countenance the possibility that moving to Paris might not be as enjoyable as remaining in Bristol, but we are certain it won't harm us! This suggests the following principle:

Principle 2 Suppose two people assign the same utilities to the possible outcomes, and assign the same probabilities to the outcomes conditional on choosing a particular course of action. And suppose that you are required to choose between those courses of action on their behalf. Then there are two cases: if one of the available options opens the possibility of a harm, then you must choose whatever the more risk-averse of the two would choose; if neither of the available options opens the possibility of a harm, then you may choose an option if at least one of the two would choose it.

So risk-averse preferences do not always take precedence, but they do when harms are involved. Why might that be?

A natural answer: to expose someone to the risk of a harm requires their consent. That is, when there is an alternative option that opens no possibility of harm, you are only allowed to choose an option that opens up the possibility of a harm if everyone affected would consent to being subject to that risk. So Alex should only choose to continue our ascent and expose us to the risk of injury if I would consent to that, and of course I wouldn't, since I'd prefer to descend. But my partner is free to choose the move to Paris even though I wouldn't choose that, because it exposes us to no risk of harm.

A couple of things to note: First, in our explanation, reference to risk-aversion, risk-neutrality, and risk-inclination has dropped out. What is important is not who is more averse to risk, but who consents to what. Second, our account will only work if we employ an absolute notion of harm. That is, I must say that there is some threshold and an option harms you if it causes your utility to fall below that threshold. We cannot use a relative notion of harm on which an option harms you if it merely causes your utility to fall. After all, using a relative notion of harm, the move to Paris will harm you should it turn out to be worse than staying in Bristol.

The problem with Principle 2 and the explanation we have just given is that they do not generalise to cases in which more than two people are involved. That is, the following principle seems false:

Principle 3 Suppose each member of a group of people assigns the same utilities to the possible outcomes, and assigns the same probabilities to the outcomes conditional on choosing a particular course of action. And suppose that you are required to choose between those courses of action on their behalf. Then there are two cases: if one of the available options opens the possibility of a harm, then you must choose whatever the most risk-averse of them would choose; if neither of the available options opens the possibility of a harm, then you may choose an option if at least one member of the group would choose it.

A third vignette might help to illustrate this.

I grew up between two power stations. My high school stood in the shadow of the coal-fired plant at Cockenzie, while the school where my mother taught stood in the lee of the nuclear plant at Torness Point. I was born two years after the Three Mile Island accident, and the Chernobyl tragedy happened as I started school. So the risks of nuclear power were somewhat prominent as I was growing up. Now, let's imagine a community of five million people who currently generate their energy from coal-fired plants---a community like Scotland in 1964, just before its first nuclear plant was constructed. This community is deciding whether to build nuclear plants to replace its coal-fired ones. All agree that having a nuclear plant that suffered no accidents would be vastly preferable to having coal plants, and all agree that a nuclear plant that suffered an accident would be vastly worse than the coal plants. And we might imagine that they also all assign the same probability to the prospective nuclear plants suffering an accident---perhaps they all defer to a recent report from the country's atomic energy authority. But, while they agree on the utilities and the probabilities, they don't all have the same attitudes to risk. In the end, 4.5 million people prefer to build the nuclear facilities, while half a million, who are more risk-averse, prefer to retain the coal-fired alternatives. Principle 3 says that, for someone choosing on behalf of this population, the only option they can choose is to retain the coal-fired plants. After all, a nuclear accident is clearly a harm, and there are individuals who would suffer that harm who would not consent to being exposed to the risk. But surely that's wrong. Surely, despite such opposition, it would be acceptable to build the nuclear plant.
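To lay the verdicts of Principle 3 side by side, here is a minimal sketch of the rule as a function. The function name, the option labels, and the decision to treat descending, staying in Bristol, and retaining coal as opening no possibility of harm (in the absolute sense) are all assumptions made for illustration.

```python
def permitted_options(options, preferences, most_risk_averse):
    """Options a proxy may choose under Principle 3 as stated above.

    options:          dict mapping option name -> True if it opens
                      the possibility of a harm
    preferences:      dict mapping person (or bloc) -> option they would choose
    most_risk_averse: the most risk-averse person (or bloc) in the group
    """
    if any(options.values()):
        # Some option risks harm: defer to the most risk-averse member.
        return {preferences[most_risk_averse]}
    # No option risks harm: any option that someone would choose is allowed.
    return set(preferences.values())


mountain = permitted_options(
    {"descend": False, "continue": True},
    {"me": "descend", "Alex": "continue"},
    most_risk_averse="me",
)
cities = permitted_options(
    {"Bristol": False, "Paris": False},
    {"me": "Bristol", "partner": "Paris"},
    most_risk_averse="me",
)
power = permitted_options(
    {"coal": False, "nuclear": True},
    {"risk-averse minority": "coal", "majority": "nuclear"},
    most_risk_averse="risk-averse minority",
)
print(mountain, cities, power)
```

The first two verdicts match the reactions reported for the mountain and the move to Paris. The third permits only coal, however large the majority in favour of the nuclear plant, because the size of each bloc plays no role in the rule.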

So, while Principle 2 might yet be true, Principle 3 is wrong. And I think my attempt to explain the basis of Principle 2 must be wrong as well, for if it were right, it would also support Principle 3. After all, in no other case I can think of in which a lack of consent is sufficient to block an action does that block disappear if there are sufficiently many people in favour of the action.

So what general principles underpin our reactions to these three vignettes? Why do the preferences of the more risk-averse individuals carry more weight when one of the outcomes involves a harm than when none does, but not enough weight to overrule a significantly greater number of more risk-inclined individuals? That's the theory I'm in search of here.

Believing is said of groups in many ways

2021-04-06T08:04:00.003+01:00

For a PDF version of this post, see here.

In defence of pluralism

Recently, after a couple of hours discussing a problem in the philosophy of mathematics, a colleague mentioned that he wanted to propose a sort of pluralism as a solution. We were debating the foundations of mathematics, and he wanted to consider the claim that there might be no single unique foundation, but rather many different foundations, no one of them better than the others. Before he did so, though, he wanted to preface his suggestion with an apology. Pluralism, he admitted, is unpopular wherever it is proposed as a solution to a longstanding philosophical problem.

I agree with his sociological observation. Philosophers tend to react badly to pluralist solutions. But why? And is the reaction reasonable? This is pure speculative generalisation based on my limited experience, but I've found that the most common source of resistance is a conviction that there is a particular special role that the concept in question must play; and moreover, in that role, whether or not something falls under the concept determines some important issue concerning it. So, in the philosophy of mathematics, you might think that a proof of a mathematical proposition is legitimate just in case it can be carried out in the system that provides the foundation for mathematics. And, if you allow a plurality of foundations of differing logical strength, the legitimacy of certain proofs becomes indeterminate---relative to some foundations, they're legit; relative to others, they aren't. Similarly, you might think that a person who accidentally poisons another person is innocent of murder if, and only if, they were justified in their belief that the liquid they administered was not poisonous. And, if you allow a plurality of concepts of justification, then whether or not the person is innocent might become indeterminate.

I tend to respond to such concerns in two ways. First, I note that, while the special role that my interlocutor picks out for the concept we're discussing is certainly among the roles that this concept needs to play, it isn't the only one; and it is usually not clear why we should take it to be the most important one. One role for a foundation of mathematics is to test the legitimacy of proofs; but another is to provide a universal language that mathematicians might use, and that might help them discover new mathematical truths (see this paper by Jean-Pierre Marquis for a pluralist approach that takes both of these roles seriously).

Second, I note that we usually determine the important issues in question independently of the concept and then use our determinations to test an account of the concept, not the other way around. So, for instance, we usually begin by determining whether we think a particular proof is legitimate---perhaps by asking what it assumes and whether we have good reason for believing that those assumptions are true---and then see whether a particular foundation measures up by asking whether the proof can be carried out within it. We don't proceed the other way around. And we usually determine whether or not a person is innocent independently of our concept of justification---perhaps just by looking at the evidence they had and their account of the reasoning they undertook---and then see whether a particular account of justification measures up by asking whether the person is innocent according to it. Again, we don't proceed the other way around.

For these two reasons, I tend not to be very moved by arguments against pluralism. Moreover, while it's true that pluralism is often greeted with a roll of the eyes, there are a number of cases in which it has gained wide acceptance. We no longer talk of the probability of an event but distinguish between its chance of occurring, a particular individual's credence in it occurring, and perhaps even its evidential probability relative to a body of evidence. That is, we are pluralists about probability. Similarly, we no longer talk of a particular belief being justified simpliciter, but distinguish between propositional, doxastic, and personal justification. We are, along some dimensions at least, pluralists about justification. We no longer talk of a person having a reason to choose one thing rather than another, but distinguish between their internal and external reasons.

I want to argue that we should extend pluralism to so-called group beliefs or collective beliefs. Britain believes lockdowns are necessary to slow the virus. Scotland believes it would fare well economically as an independent country. The University believes the pension fund has been undervalued and requires no further increase in contributions in the near future to meet its obligations in the further future. In 1916, Russia believed Rasputin was dishonest. In each of these sentences, we seem to ascribe a belief to a group or collective entity. When is it correct to do this? I want to argue that there is no single answer. Rather, as Aristotle said of being, believing is said of groups in many ways---that is, a pluralist account is appropriate.

I've been thinking about this recently because I've been reading Jennifer Lackey's fascinating new book, The Epistemology of Groups (all page numbers in what follows refer to that). In it, Lackey offers an account of group belief, justified group belief, group knowledge, and group assertion. I'll focus here only on the first.

Lackey's treatment of group belief

Three accounts of group belief

Lackey considers two existing accounts of group belief as well as her own proposal.

The first, due to Margaret Gilbert and with amendments by Raimo Tuomela, is a non-summative account that treats groups as having 'a mind of their own'. Lackey calls it the Joint Acceptance Account (JAA). I'll stick with the simpler Gilbert version, since the points I'll make don't rely on Tuomela's more involved amendment (24):

JAA A group $G$ believes that $p$ iff it is common knowledge in $G$ that the members of $G$ individually have intentionally and openly expressed their willingness jointly to accept that $p$ with the other members of $G$.

The second, due to Philip Pettit, is a summative account that treats group belief as strongly linked to individual belief. Lackey calls it the Premise-Based Aggregation Account (PBAA) (29). Here's a rough paraphrase:

PBAA A group $G$ believes that $p$ iff there is some collection of propositions $q_1, \ldots, q_n$ such that (i) it is common knowledge among the operative members of $G$ that $p$ is true iff each $q_i$ is true, (ii) for each operative member of $G$, they believe $p$ iff they believe each $q_i$, and (iii) for each $q_i$, the majority of operative members of $G$ believe $q_i$.

Lackey's own proposal is the Group Agent Account (GAA) (48-9):

GAA A group $G$ believes that $p$ iff (i) there is a significant percentage of $G$'s operative members who believe that $p$, and (ii) are such that adding together the bases of their beliefs that $p$ yields a belief set that is not substantively incoherent.

Group lies (and bullshit) and judgment fragility: two desiderata for accounts of group belief

To distinguish between these three accounts, Lackey enumerates four desiderata for accounts of group belief that she takes to tell against JAA and PBAA and in favour of GAA. The first three are related to an objection to Gilbert's account of group belief that was developed by K. Brad Wray, A. W. M. Meijers, and Raul Hakli in the 2000s. According to this, JAA makes it too easy for groups to actively, consciously, and intentionally choose what they believe: all they need to do is intentionally and openly express their willingness jointly to accept the proposition in question. Lackey notes two consequences of this: (a) on such an account, it is difficult to give a satisfactory account of group lies (or group bullshit, though I'll focus on group lies); (b) on such an account, whether or not a group believes something at a particular time is sensitive to the group's situation at that time in a way that beliefs should not be sensitive.

So Lackey's first desideratum for an account of group belief is that it must be able to accommodate a plausible account of group lies (and the second that it accommodate group bullshit, but as I said I'll leave that for now). Suppose each member of a group strongly believes $p$ on the basis of excellent evidence that they all share, but they also know that the institution will be culpable of a serious crime if it is taken to believe $p$. Then they might jointly agree to accept $\neg p$. And, if they do, Gilbert must say that they do believe $\neg p$. But were they to assert $\neg p$, we would take the group to have lied, which would require that it believes $p$. The point is that, if a group's belief is so thoroughly within its voluntary control, it can manipulate it whenever it likes in order to avoid ever lying in situations in which dishonesty would be subject to censure.

Lackey's third desideratum for an account of group belief is that such belief should not be rendered sensitive in certain ways to the situation in which the group formed it. Suppose that, on the basis of the same shared evidence, a substantial majority of members of a group judge the horse Cisco most likely to win the race, the horse Jasper next most likely, and the horse Whiskey very unlikely to win. But, again on the basis of this same shared body of evidence, the remaining minority of members judge Whiskey most likely to win, Jasper next most likely, and Cisco very unlikely to win. The group would like a consensus before it reports its opinion, but time is short---the race is about to begin, say, and the group has been asked for its opinion before the starting gates open. So, in order to achieve something close to a consensus, it unanimously agrees to accept that Jasper will win, even though he is everyone's second favourite. Yet we might also assume that, had time not been short, the majority would have been able to persuade the minority of Cisco's virtues; and, in that case, they'd unanimously agree to accept that Cisco will win. So, according to Gilbert's account, under time pressure, the group believes Jasper will win, while with world enough and time, they would have believed that Cisco will win. Lackey holds that no account of group belief should make it sensitive to the situation in which it is formed in this way, and thus rejects JAA.

Lackey argues that any account of group belief must satisfy the two desiderata we've just considered: the one concerning group lies and the one concerning sensitivity to the situation in which the belief is formed. I agree that we need at least one account of group belief that satisfies the first of these, but I'm not convinced that all need do this---but I'll leave that for later, when I try to motivate pluralism. For now, I'd like to explain why I'm not convinced that any account needs to satisfy the second of these, which is Lackey's third desideratum. After all, we know from various empirical studies in social psychology, as well as our experience as thinkers and reasoners and believers, that our ordinary beliefs as individuals are sensitive to the situation in which they're formed in just the sort of way that Lackey wishes to rule out for the beliefs of groups. One of the central theses of Amos Tversky and Daniel Kahneman's work is that we use a different reasoning system when we are forced to make a judgment under time pressure from the one we use when more time is available. So, when my implicit biases are mobilised under time pressure, I might come to believe that a particular job candidate is incompetent, while I might judge them to be competent were I to have more time to assess their track record and override my irrational hasty judgment. And, whenever we are faced with a complex body of evidence that, on the face of it, seems to point in one direction, but which, under closer scrutiny, points in the opposite direction, we will form a different belief if we must do so under time pressure than if we have greater leisure to unpick and balance the different components of the evidence. If individual beliefs can be sensitive to the situation in which they're formed in this way, I see no reason why group beliefs might not also be sensitive in this way.

Before moving on, I'd like to consider whether the PBAA---Pettit's premise-based aggregation account---satisfies Lackey's first desideratum. If it doesn't, it can't be for the same reason that Gilbert's JAA doesn't. After all, according to the PBAA, the group's belief is no more under its voluntary control than the beliefs of its individual members. If, for each $q_i$, a majority believes $q_i$, then the group believes $p$. The only way a group could manipulate its belief is by manipulating the beliefs of its members. But if that sort of manipulation rules out a group belief, Lackey's account is just as vulnerable.

So why does Lackey think that PBAA cannot adequately account for group lies? She considers a case in which the three board members of a tobacco company know that smoking is safe to health iff it doesn't cause lung cancer and it doesn't cause emphysema and it doesn't cause heart disease. The first member believes it doesn't cause lung cancer or heart disease, but believes it does cause emphysema, and so believes it is not safe to health; the second believes it doesn't cause emphysema or heart disease, but it does cause lung cancer, and so believes it is not safe to health; and the third believes it doesn't cause lung cancer or emphysema, but it does cause heart disease, and so believes it is not safe to health. The case is illustrated in Table 1.

Then each board member believes it is not safe to health, but PBAA says that the group believes it is, because a majority (first and third) believe it doesn't cause lung cancer, a majority (second and third) believe it doesn't cause emphysema, and a majority (first and second) believe it doesn't cause heart disease. If the company then asserts that it is safe to health, then Lackey claims that it lies, while PBAA says that it believes the proposition it asserts and so does not lie.
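The structure of the tobacco case can be checked mechanically. The following is a minimal sketch that implements only the majority clause of PBAA and ignores the common-knowledge clause; the dictionary layout and function names are my own illustrative choices, and `True` means the member believes smoking does not cause the condition.

```python
board = {
    "first":  {"lung cancer": True,  "emphysema": False, "heart disease": True},
    "second": {"lung cancer": False, "emphysema": True,  "heart disease": True},
    "third":  {"lung cancer": True,  "emphysema": True,  "heart disease": False},
}


def member_believes_safe(beliefs):
    # A member believes smoking is safe iff they believe every premise.
    return all(beliefs.values())


def majority_believes(premise, members):
    # Count the members who believe the premise and require a strict majority.
    votes = sum(m[premise] for m in members.values())
    return votes > len(members) / 2


def pbaa_believes_safe(members):
    # Premise-by-premise aggregation: the group believes the conclusion
    # iff a majority believes each premise.
    premises = next(iter(members.values())).keys()
    return all(majority_believes(q, members) for q in premises)


print({name: member_believes_safe(b) for name, b in board.items()})
print(pbaa_believes_safe(board))
```

No individual member believes that smoking is safe, yet every premise commands a two-to-one majority, so the premise-based verdict for the group is that it is safe.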

I think this case is a bit tricky. I suspect our reaction to it is influenced by our knowledge of how the real-world version played out and the devastating effect it has had. So let us imagine that this group of three is not the board of a tobacco company, but the scientific committee of a public health organisation. The structure of the case will be exactly the same, and the nature of the organisation should not affect whether or not belief is present. Now suppose that, since the stakes are so high, each member would only come to believe of a specific putative risk that it is not present if their credence that it is not present is above 95%. That is, there is some pragmatic encroachment here to the extent that the threshold for belief is determined in part by the stakes involved. And suppose further that the first member of the scientific committee has credence 99% that smoking doesn't cause lung cancer, 99% that it doesn't cause heart disease, and 93% that it doesn't cause emphysema. And let's suppose that, by a tragic bout of bad luck that has bestowed on them very misleading evidence, the evidence available to them supports these credences. Then their credence that smoking is safe to health must be at most 93%---since the probability of a conjunction must be at most the probability of any of the conjuncts---and thus below 95%. So the first member doesn't believe it is safe to health. And suppose the same for the other two members of the committee, but for the other combinations of risks. So the second is 99% sure it doesn't cause emphysema and 99% sure it doesn't cause heart disease, but only 93% sure it doesn't cause lung cancer. And the third is 99% sure it doesn't cause lung cancer and 99% sure it doesn't cause emphysema, but only 93% sure it doesn't cause heart disease. So none of the three believe that smoking is safe to health. The case is illustrated in Table 2.

However, just averaging the group's credences in each of the three specific risks, we might say that it is 97% sure that smoking doesn't cause lung cancer, 97% sure it doesn't cause emphysema, and 97% sure it doesn't cause heart disease ($\frac{0.99 + 0.99 + 0.93}{3} = 0.97$). And it is then possible that the group assigns a higher than 95% credence to the conjunction of these three. And, if it does, it seems to me, the PBAA may well get things right, and the group does not lie if it says that smoking carries no health risks.

Nonetheless, I think the PBAA cannot be right. In the example I just described, I noted that, just taking a straight average gives, for each specific risk, a credence of 97% that it doesn't exist. And I noted that it's then possible that the group credence that smoking is safe to health is above 95%. But of course, it's also possible that it's below 95%. This would happen, for instance, if the group were to take the three risks to be independent. Then the group credence that smoking is safe to health would be a little over 91%---too low for the group to believe it given the stakes. But PBAA would still say that the group believes that smoking is safe to health. The point is that PBAA is not sufficiently sensitive to the more fine-grained attitudes to the propositions that lie behind the beliefs in those propositions. Simply knowing what each member believes about the three putative risks is not sufficient to determine what the group thinks about them. You also need to look to their credences.
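The arithmetic behind these two possibilities can be sketched as follows. The code uses straight averaging as in the example above, exact fractions to avoid rounding noise, and the standard Fréchet bounds on the probability of a conjunction; the variable names are illustrative assumptions.

```python
from fractions import Fraction
from math import prod

RISKS = ("lung cancer", "emphysema", "heart disease")
HI, LO = Fraction(99, 100), Fraction(93, 100)
THRESHOLD = Fraction(95, 100)

# Each member's credence that smoking does NOT cause the condition.
credences = {
    "first":  {"lung cancer": HI, "emphysema": LO, "heart disease": HI},
    "second": {"lung cancer": LO, "emphysema": HI, "heart disease": HI},
    "third":  {"lung cancer": HI, "emphysema": HI, "heart disease": LO},
}

# Straight average of the members' credences, risk by risk.
pooled = {r: sum(c[r] for c in credences.values()) / len(credences)
          for r in RISKS}

# Group credence in the conjunction if the three risks are independent.
if_independent = prod(pooled.values())

# Bounds on the conjunction that hold whatever the dependence structure.
lower = max(Fraction(0), sum(pooled.values()) - (len(RISKS) - 1))
upper = min(pooled.values())

print({r: float(p) for r, p in pooled.items()})
print(float(if_independent), float(lower), float(upper))
print(lower < THRESHOLD < upper)
```

The pooled credence in each conjunct is 0.97. Under independence the conjunction gets 0.912673, and in general it can lie anywhere between 0.91 and 0.97, so the 95% threshold falls strictly inside the range of possibilities.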

Of course, there are lots of reasons to dislike straight averaging as a means for pooling credences---it can't preserve judgments of independence, for instance---and lots of reasons to dislike the naive application of a threshold or Lockean view of belief that is in the background here---it gives rise to the lottery paradox. But it seems that, for any reasonable method of probabilistic aggregation and any reasonable account of the relationship between belief and credence, there will be cases like this in which the PBAA says the group believes a proposition when it shouldn't. So I agree with Lackey that the PBAA sometimes gets things wrong, but I disagree about exactly when.

Base fragility: a further desideratum

Consider an area of science in which two theories vie for precedence, $T_1$ and $T_2$. Half of the scientists working in this area believe the following:

($A_1$) $T_1$ is simpler than $T_2$,
($B_1$) $T_2$ is more explanatory than $T_1$,
($C_1$) simplicity always trumps explanatory power in theory choice.

These scientists consequently believe $T_1$. The other half of the scientists believe the following:

($A_2$) $T_2$ is simpler than $T_1$,
($B_2$) $T_1$ is more explanatory than $T_2$,
($C_2$) explanatory power always trumps simplicity in theory choice.

These scientists consequently believe $T_1$. So all scientists believe $T_1$. But they do so for diametrically opposed reasons. Indeed, all of their beliefs about the comparisons between $T_1$ and $T_2$ are in conflict, but because their views about theory choice are also in conflict, they end up believing the same theory. Does the scientific community believe $T_1$? Lackey says no. In order for a group to believe a proposition, the bases of the members' beliefs must not be substantively incoherent. In our example, for half of the members, the basis of their belief in $T_1$ is $A_1\ \&\ B_1\ \&\ C_1$, while for the other half, it's $A_2\ \&\ B_2\ \&\ C_2$. And $A_1$ contradicts $A_2$, $B_1$ contradicts $B_2$, and $C_1$ contradicts $C_2$. The bases are about as incoherent as can be.

Is Lackey correct to say that the scientific community does not believe $T_1$ in this case? I'm not so sure. For one thing, attributing belief in $T_1$ would help to explain a lot of the group's behaviour. Why does the scientific community fund and pursue research projects that are of interest only if $T_1$ is true? Why does the scientific community endorse and teach from textbooks that give much greater space to expounding and explaining $T_1$? Why do departments in this area hire those with the mathematical expertise required to understand $T_1$ when that expertise is useless for understanding $T_2$? In each case, we might say: because the community believes $T_1$.

Lackey raises two worries about group beliefs that rest on incoherent bases: (i) they cannot be subject to rational evaluation; (ii) they cannot coherently figure in accounts of collective deliberation. On (ii), it seems to me that the group belief could figure in deliberation. Suppose the community is deliberating about whether to invite a $T_1$-theorist or a $T_2$-theorist to give the keynote address at the major conference in the area. It seems that the group's belief in the superiority of $T_1$ could play a role in the discussions: 'Yes, we want the speaker who will pose the greatest challenge intellectually, but we don't want to hear a string of falsehoods, so let's go with the $T_1$-theorist,' they might reason.

On (i): Lackey asks what we would say if the group were to receive new evidence that $T_1$ has greater simplicity and less explanatory power than we initially thought. For the first half of the group, this would make their belief in $T_1$ more justified; for the second half, it would make their belief less justified. What would it do to the group's belief? Without an account of justification for group belief, it's hard to say. But I don't think the incoherent bases rule out an answer. For instance, we might be reliabilists about group justification. And if we are, then we look at all the times that the members of the group have made judgments about simplicity and explanatory power that have the same pattern as they have this time---that is, half one way, half the other---and we look at the proportion of those times that the group belief---formed by whatever aggregation method we favour---has been true. If it's high, then the belief is justified; if it's not, it's not. And we can do that for the group before and after this new evidence comes in. And by doing that, we can compare the level of justification for the group belief.

Of course, this is not to say that reliabilism is the correct account of justification for group beliefs. But it does suggest that incoherent bases don't create a barrier to such accounts.

Varieties of group belief

One thing that is striking when we consider different proposed accounts of group belief is how large the supervenience base might be; that is, how many different features of a group $G$ might partially determine whether or not it believes a proposition $p$. Here's a list, though I don't pretend that it's exhaustive:

(1) The beliefs of individual members of the group

(1a) Some accounts are concerned only with individual members' beliefs in $p$; others are interested in members' beliefs beyond that. For instance, a simple majoritarian account is interested only in members' beliefs in $p$. But Pettit's PBAA is interested instead in members' beliefs in each proposition from a set $q_1, \ldots, q_n$ whose conjunction is equivalent to $p$. And Lackey's GAA is interested in the members' beliefs in $p$ as well as the members' beliefs that form the bases for their belief in $p$ when they do believe $p$.

(1b) Some accounts are concerned with the individual beliefs of all members of the group, some only with so-called operative members. For instance, some will say that what determines whether a company believes $p$ is only whether or not members of their board believe $p$, while others will say that all employees of the company count.

(2) The credences of individual members of the group

There are distinctions corresponding to (1a) and (1b) here as well.

(3) The outcomes of discussions between the members of the group

(3a) Some will say that only discussions that actually take place make a difference---you might say that, before a discussion takes place, the members of the group each believe $p$, but after they discuss it and retain those beliefs, you can say that the group believes $p$; others will say that hypothetical discussions can also make a difference---if individual members would dramatically change their beliefs were they to discuss the matter, that might mean the group does not believe, even if all members do.

(3b) Some will say that it is not the individual members' beliefs after discussion that is important, but their joint decision to accept $p$ as the group's belief. (Margaret Gilbert's JAA is such an account.)

(4) Belief-forming structures within the group

(4a) Some groups are extremely highly structured, and some of these structures relate to group belief formation. Some accounts of group belief acknowledge this by talking of 'operative members' of groups, and taking their attitudes to have greater weight in determining the group's attitude. For instance, it is common to say that the operative members of a company are its board members; the operative members of a British university might be its senior management team; the operative members of a trade union might be its executive committee. But of course many groups have much more complex structures than these. For instance, many large organisations are concerned with complex problems that break down into smaller problems, each of which requires a different sort of expertise to understand. The World Health Organization (WHO) might be such an example, or the Intergovernmental Panel on Climate Change (IPCC), or Médecins Sans Frontières (MSF). In each case, there might be a rigid reporting structure whereby subcommittees report their findings to the main committee, but each subcommittee might form its own subcommittees that report to it; and there might be strict rules about how the findings of a subcommittee must be taken into account by the committee to which it reports before that committee itself reports upwards. In such a structure, the notion of operative members and their beliefs is too crude to capture what's necessary.

(5) The actions of the group

(5a) Some might say that a group has a belief just in case it acts in a way that is best explained by positing a group belief. Why does the scientific community persist in appointing only $T_1$-theorists and no $T_2$-theorists? Answer: It believes $T_1$. (I think Kenny Easwaran and Reuben Stern take this view in their recent joint work.)

So, in the case of group beliefs, the disagreement between different accounts does not concern only the conditions on an agreed supervenience base; it also concerns the extent of the supervenience base itself. Now, this might soften us up for pluralism, but it is hardly an argument. To give an argument, I'd like to consider a range of possible accounts and, for each, describe a role that group beliefs are typically taken to play and for which this account is best suited.

Group beliefs as summaries

One thing we do when we ascribe beliefs to groups is simply to summarise the views of the group. If I say that, in 1916, Russia believed that Rasputin was dishonest, I simply give a summary of the views of people who belong to the group to which 'Russia' refers in this sentence, namely, Russians alive in 1916. And I say roughly that a substantial majority believed that he was dishonest.

For this role, a simple majoritarian account (SMA) seems best:

SMA A group $G$ believes $p$ iff a substantial majority of members of $G$ believes $p$.

There is an interesting semantic point in the background here. Consider the sentence: 'At the beginning of negotiations at Brest-Litovsk in 1917-8, Russia believed Germany's demands would be less harsh than they turned out to be.' We might suppose that, in fact, this belief was not widespread in Russia, but it was almost universal among the Bolshevik government. Then we might nonetheless say that the sentence is true. At first sight, it doesn't seem that SMA can account for this. But it might do if 'Russia' refers to different groups in the two different sentences: to the whole population in 1916 in the first sentence; to the members of the Bolshevik government in the second.

I'm tempted to think that this happens a lot when we discuss group beliefs. Groups are complex entities, and the name of a group might be used in one sentence to pick out some subset of its structure---just its members, for instance---and in another sentence some other subset of its structure---its members as well as its operative group, for instance---and in another sentence yet some further subset of its structure---its members, its operative group, and the rules by which the operative group abide when they are debating an issue.

Of course, this might look like straightforward synecdoche, but I'm inclined to think it's not, because it isn't clear that there is one default referent of the term 'Russia' such that all other terms are parasitic on that. Rather, there are just many many different group structures that might be picked out by the term, and we have to hope that context determines this with sufficient precision to evaluate the sentence.

Group beliefs as attitudes that play a functional role

An important recent development in our understanding of injustice and oppression has been the recognition of structural forms of racism, sexism, ableism, homophobia, transphobia, and so on. The notion is contested and there are many competing definitions, but to illustrate the point, let me quote from a recent article in the New England Journal of Medicine that considers structural racism in the US healthcare system:

All definitions [of structural racism] make clear that racism is not simply the result of private prejudices held by individuals, but is also produced and reproduced by laws, rules, and practices, sanctioned and even implemented by various levels of government, and embedded in the economic system as well as in cultural and societal norms (Bailey, et al. 2021).

The point is that a group---a university, perhaps, or an entire healthcare system, or a corporation---might act as if it holds racist or sexist beliefs, even though no majority of its members holds those beliefs. A university might pay academics who are women less, promote them less frequently, and so on, even while few individuals within the organisation, and certainly not a majority, believe that women's labour is worth less, and that women are less worthy of promotion. In such a case, we might wish to ascribe those beliefs to the institution as a whole. After all, on certain functionalist accounts of belief, to have a belief simply is to be in a state that has certain causal relationships with other states, including actions. And the state of a group is determined not only by the state of the individuals within it but also by the other structural features of the group, such as its laws, rules and practices. And if the states of the individuals within the group, combined with these laws, rules and practices, give rise to the sort of behaviour that we would explain in an individual by positing a belief, it seems reasonable to do so in the group case as well. What's more, doing so helps to explain group behaviour in just the same way that ascribing beliefs to individuals helps to explain their behaviour. (As mentioned above, I take it that Kenny Easwaran and Reuben Stern take something like this view of group belief.)

Group beliefs as ascriptions that have legal standing

In her book, Lackey pays particular attention to cases of group belief that are relevant to corporate culpability and liability. In the 1970s, did the tobacco company Philip Morris believe that their product is hazardous to health, even while they repeatedly denied it? Between 1998 and 2014, did Volkswagen believe that their diesel emissions reports were accurate? In 2003, did the British government believe that Iraq could deploy biological weapons within forty-five minutes of an order to do so? Playing this role well is an important job for an account of group belief. It can have very significant real world consequences: Do those who trusted the assertions of tobacco companies and became ill as a result receive compensation? Do governments have a case against car manufacturers? Should a government stand down?

In fact, I think the consequences are often so large and, perhaps more importantly, so varied that the decision whether or not to put them in train should not depend on the applicability of a single concept with a single precise definition. Consider cases of corporate culpability. There are many ways in which this might be punished. We might fine the company. We might demand that it change certain internal policies or rules. We might demand that it change its corporate structure. We might do many things. Some will be appropriate and effective if the company believes a crucial proposition in one sense; some appropriate if it believes that proposition in some other sense. For instance, a fine does many things, but among them is this: it affects the wealth of the company's shareholders, who will react by putting pressure on the company's board. Thus, it might be appropriate to impose a fine if we think that the company believed the proposition that it denied in its public assertions in the sense that a substantial majority of its board believed it. On the other hand, demanding that the company change certain internal policies or rules would be appropriate if the company believes the proposition that it publicly denied in the sense that it is the outcome of applying its belief-forming rules and policies (such as, for instance, the nested set of subcommittees that I imagined for the WHO or the IPCC or MSF above).

The point is that our purpose in ascribing culpability and liability to a group is essentially pragmatic. We do it in order to determine what sort of punishment we might mete out. This is perhaps in contrast to cases of individual culpability and liability, where we are interested also in the moral status of the individual's action independent of how we respond to it. But, in many cases, such as when a corporation has lied, which punishment is appropriate depends on the sense, among the many senses in which a group can believe something, in which the company believed the negation of the proposition it asserted in its lie.

So it seems to me that, even if this role were the only role that our concept of group belief had to play, pluralism would be appropriate. Groups are complex entities and there are consequently many ways in which we can seek to change them in order to avoid the sorts of harms that arise when they behave badly. We need different concepts of group belief in order to identify which is appropriate in a given case.

It's perhaps worth noting that, while Lackey opens her book with cases of corporate culpability, and this is a central motivation for her emphasis on group lying, it isn't clear to me that her group agent account (GAA) can accommodate all cases of corporate lies. Consider the following situation. The board of a tobacco company is composed of eleven people. Each of them believes that tobacco is hazardous to health. However, some believe it for very different reasons from the others. They have all read the same scientific literature on the topic, but six of them remember it correctly and the other five remember it incorrectly. The six who remember it correctly remember that tobacco contains chemical A and remember that when chemical A comes into contact with tissue X in the human body, it causes cancer in that tissue; and they also remember that tobacco does not contain chemical B and they remember that, when chemical B comes into contact with tissue Y in the human body, it does not cause cancer in that tissue. The five who remember the scientific literature incorrectly believe that tobacco contains chemical B and believe that when chemical B comes into contact with tissue Y in the human body, it causes cancer in that tissue; and they also believe that tobacco does not contain chemical A and they believe that, when chemical A comes into contact with tissue X in the human body, it does not cause cancer in that tissue. So, all board members believe that smoking causes cancer. However, the bases of their beliefs form an incoherent set. The two propositions on which the six base their belief directly contradict the two propositions on which the five base theirs. The board then issues a statement saying that tobacco does not cause cancer. The board is surely lying, but according to GAA, they are not because the bases of their beliefs conflict and so they do not believe that tobacco does cause cancer.

Permissivism and social choice: a response to Blessenohl

2021-03-14T12:31:00.002+00:00

In a recent paper discussing Lara Buchak's risk-weighted expected utility theory, Simon Blessenohl notes that the objection he raises there to Buchak's theory might also tell against permissivism about rational credence. I offer a response to the objection here.

In his objection, Blessenohl suggests that credal permissivism gives rise to an unacceptable tension between the individual preferences of agents and the collective preferences of the groups to which those agents belong. He argues that, whatever brand of permissivism about credences you tolerate, there will be a pair of agents and a pair of options between which they must choose such that both agents will prefer the first to the second, but collectively they will prefer the second to the first. He argues that this consequence tells against permissivism. I respond that this objection relies on an equivocation between two different understandings of collective preferences: on the first, they are an attempt to summarise the collective view of the group; on the second, they are the preferences of a third-party social chooser tasked with making decisions on behalf of the group. I claim that, on the first understanding, Blessenohl's conclusion does not follow; and, on the second, it follows but is not problematic.

It is well known that, if two people have different credences in a given proposition, there is a sense in which the pair of them, taken together, is vulnerable to a sure loss set of bets.* That is, there is a bet that the first will accept and a bet that the second will accept such that, however the world turns out, they'll end up collectively losing money. Suppose, for instance, that Harb is 90% confident that Ladybug will win the horse race that is about to begin, while Jay is only 60% confident. Then Harb's credences should lead him to buy a bet for £80 that will pay out £100 if Ladybug wins and nothing if she loses, while Jay's credences should lead him to sell that same bet for £70 (assuming, as we will throughout, that the utility of £$n$ is $n$). If Ladybug wins, Harb ends up £20 up and Jay ends up £30 down, so they end up £10 down collectively. And if Ladybug loses, Harb ends up £80 down while Jay ends up £70 up, so they end up £10 down as a pair.
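The arithmetic of this example can be checked with a short sketch. It assumes, as the text does, that utility is linear in pounds and that each agent accepts a bet just in case its expected gain by his own lights is positive; the function and variable names are illustrative only.

```python
from fractions import Fraction

STAKE = 100       # the bet pays £100 if Ladybug wins, nothing otherwise
HARB_PRICE = 80   # Harb buys the bet at this price
JAY_PRICE = 70    # Jay sells the same bet at this price


def buyer_payoff(price, wins):
    """Net gain to someone who buys the bet at `price`."""
    return (STAKE if wins else 0) - price


def seller_payoff(price, wins):
    """Net gain to someone who sells the bet at `price`."""
    return price - (STAKE if wins else 0)


def expected(payoff, price, credence):
    """Expected net gain, given a credence that Ladybug wins."""
    return credence * payoff(price, True) + (1 - credence) * payoff(price, False)


# Exact fractions avoid floating-point rounding in the comparisons.
harb_expectation = expected(buyer_payoff, HARB_PRICE, Fraction(9, 10))
jay_expectation = expected(seller_payoff, JAY_PRICE, Fraction(6, 10))

# The pair's combined gain in each state of the world (True = Ladybug wins).
collective = {
    wins: buyer_payoff(HARB_PRICE, wins) + seller_payoff(JAY_PRICE, wins)
    for wins in (True, False)
}

print(harb_expectation, jay_expectation, collective)
```

Each agent expects to gain from his own bet, yet the combined gain is negative whichever way the race goes.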

So, for individuals with different credences in a proposition, there seems to be a tension between how they would choose as individuals and how they would choose as a group. Suppose they are presented with a choice between two options: on the first, $A$, both of them enter into the bets just described; on the second, $B$, neither of them do. We might represent these two options as follows, where we assume that Harb's utility for receiving £$n$ is $n$, and the same for Jay:$$A = \begin{pmatrix}
20 & -80 \\
-30 & 70
\end{pmatrix}\ \ \
B = \begin{pmatrix}
0 & 0 \\
0 & 0
\end{pmatrix}$$The top left entry is Harb's winnings if Ladybug wins, the top right is Harb's winnings if she loses; the bottom left is Jay's winnings if she wins, and the bottom right is Jay's winnings if she loses. So, given a matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$, each row represents a gamble---that is, an assignment of utilities to each state of the world---and each column represents a utility distribution---that is, an assignment of utilities to each individual. So $\begin{pmatrix} a & b \end{pmatrix}$ represents the gamble that the option bequeaths to Harb---$a$ if Ladybug wins, $b$ if she loses---while $\begin{pmatrix} c & d \end{pmatrix}$ represents the gamble bequeathed to Jay---$c$ if she wins, $d$ if she loses. And $\begin{pmatrix} a \\ c \end{pmatrix}$ represents the utility distribution if Ladybug wins---$a$ to Harb, $c$ to Jay---while $\begin{pmatrix} b \\ d \end{pmatrix}$ represents the utility distribution if she loses---$b$ to Harb, $d$ to Jay. Summing the entries in the first column gives the group's collective utility if Ladybug wins, and summing the entries in the second column gives their collective utility if she loses.

Now, suppose that Harb cares only for the utility that he will gain, and Jay cares only for his own utility; neither cares at all about the other's welfare. Then each prefers $A$ to $B$. Yet, considered collectively, $B$ results in greater total utility for sure: for each column, the sum of the entries in that column in $B$ (that is, $0$) exceeds the sum in that column in $A$ (that is, $-10$). So there is a tension between what the members of the group unanimously prefer and what the group prefers.

Now, to create this tension, I assumed that the group prefers one option to another if the total utility of the first is sure to exceed the total utility of the second. But this is quite a strong claim. And, as Blessenohl notes, we can create a similar tension by assuming something much weaker.

Suppose again that Harb is 90% confident that Ladybug will win while Jay is only 60% confident that she will. Now consider the following two options:$$A' = \begin{pmatrix}
20 & -80 \\
0 & 0
\end{pmatrix}\ \ \
B' = \begin{pmatrix}
5 & 5 \\
25 & -75
\end{pmatrix}$$In $A'$, Harb pays £$80$ for a £$100$ bet on Ladybug, while in $B'$ he receives £$5$ for sure. Given his credences, he should prefer $A'$ to $B'$, since the expected utility of $A'$ is $10$, while for $B'$ it is $5$. And in $A'$, Jay receives £0 for sure, while in $B'$ he pays £$75$ for a £$100$ bet on Ladybug. Given his credences, he should prefer $A'$ to $B'$, since the expected utility of $A'$ is $0$, while for $B'$ it is $-15$. But again we see that $B'$ will nonetheless end up producing greater total utility for the pair---$30$ vs $20$ if Ladybug wins, and $-70$ vs $-80$ if Ladybug loses. However, we can also argue that the group should prefer $B'$ to $A'$ in a different way, one that does not appeal to total utility. This different way of arguing for this conclusion is the heart of Blessenohl's result.
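The figures in this paragraph can be reproduced with a small sketch. It assumes that each individual evaluates an option by the expected utility of his own row, using his own credence that Ladybug wins; the names `A_PRIME`, `B_PRIME`, `expected_utility` and `totals` are illustrative only.

```python
from fractions import Fraction

# Rows are individuals (Harb, Jay); columns are states (Ladybug wins, loses).
A_PRIME = [[20, -80], [0, 0]]
B_PRIME = [[5, 5], [25, -75]]

# Credence that Ladybug wins: Harb first, then Jay.
CREDENCE_WIN = [Fraction(9, 10), Fraction(6, 10)]


def expected_utility(option, person):
    """Expected utility of the person's own row, by that person's credence."""
    p = CREDENCE_WIN[person]
    win, lose = option[person]
    return p * win + (1 - p) * lose


def totals(option):
    """Sum of utilities across individuals in each state (column sums)."""
    return [sum(row[state] for row in option) for state in range(2)]


for name, option in (("A'", A_PRIME), ("B'", B_PRIME)):
    print(name,
          "Harb:", expected_utility(option, 0),
          "Jay:", expected_utility(option, 1),
          "totals:", totals(option))
```

Both individuals rank $A'$ above $B'$ in expectation, while the column sums for $B'$ exceed those for $A'$ in both states.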

In what follows, we write $\preceq_H$ for Harb's preference ordering, $\preceq_J$ for Jay's, and $\preceq$ for the group's. First, we assume that, when one option gives a particular utility $a$ to Harb for sure and a particular utility $c$ to Jay for sure, then the group should be indifferent between that and the option that gives $c$ to Harb for sure and $a$ to Jay for sure. That is, the group should be indifferent between an option that gives the utility distribution $\begin{pmatrix} a \\ c\end{pmatrix}$ for sure and an option that gives $\begin{pmatrix} c \\ a\end{pmatrix}$ for sure. Blessenohl calls this Constant Anonymity:

Constant Anonymity For any $a, c$,$$\begin{pmatrix}
a & a \\
c & c
\end{pmatrix} \sim
\begin{pmatrix}
c & c \\
a & a
\end{pmatrix}$$This allows us to derive the following:$$\begin{pmatrix}
20 & 20 \\
0 & 0
\end{pmatrix} \sim
\begin{pmatrix}
0 & 0 \\
20 & 20
\end{pmatrix}\ \ \ \text{and}\ \ \
\begin{pmatrix}
-80 & -80 \\
0 & 0
\end{pmatrix} \sim
\begin{pmatrix}
0 & 0 \\
-80 & -80
\end{pmatrix}$$And now we can introduce our second principle:

Preference Dominance For any $a, b, c, d, a', b', c', d'$, if$$\begin{pmatrix}
a & a \\
c & c
\end{pmatrix} \preceq
\begin{pmatrix}
a' & a' \\
c' & c'
\end{pmatrix}\ \ \ \text{and}\ \ \
\begin{pmatrix}
b & b \\
d & d
\end{pmatrix} \preceq
\begin{pmatrix}
b' & b' \\
d' & d'
\end{pmatrix}$$then$$\begin{pmatrix}
a & b \\
c & d
\end{pmatrix} \preceq
\begin{pmatrix}
a' & b' \\
c' & d'
\end{pmatrix}$$Preference Dominance says that, if the group weakly prefers obtaining the utility distribution $\begin{pmatrix} a' \\ c'\end{pmatrix}$ for sure to obtaining the utility distribution $\begin{pmatrix} a \\ c\end{pmatrix}$ for sure, and weakly prefers obtaining the utility distribution $\begin{pmatrix} b' \\ d'\end{pmatrix}$ for sure to obtaining the utility distribution $\begin{pmatrix} b \\ d\end{pmatrix}$ for sure, then they weakly prefer obtaining $\begin{pmatrix} a' \\ c'\end{pmatrix}$ if Ladybug wins and $\begin{pmatrix} b' \\ d'\end{pmatrix}$ if she loses to obtaining $\begin{pmatrix} a \\ c\end{pmatrix}$ if Ladybug wins and $\begin{pmatrix} b \\ d\end{pmatrix}$ if she loses.

Preference Dominance, combined with the indifferences that we derived from Constant Anonymity, gives$$\begin{pmatrix}
20 & -80 \\
0 & 0
\end{pmatrix} \sim
\begin{pmatrix}
0 & 0 \\
20 & -80
\end{pmatrix}$$And then finally we introduce a closely related principle:

Utility Dominance For any $a, b, c, d, a', b', c', d'$, if $a < a'$, $b < b'$, $c < c'$, and $d < d'$, then$$\begin{pmatrix}
a & b \\
c & d
\end{pmatrix} \prec
\begin{pmatrix}
a' & b' \\
c' & d'
\end{pmatrix}$$

This simply says that if one option gives more utility than another to each individual at each world, then the group should prefer the first to the second. So$$\begin{pmatrix}
0 & 0 \\
20 & -80
\end{pmatrix} \prec
\begin{pmatrix}
5 & 5 \\
25 & -75
\end{pmatrix}$$Stringing these together, we have$$A' = \begin{pmatrix}
20 & -80 \\
0 & 0
\end{pmatrix} \sim
\begin{pmatrix}
0 & 0 \\
20 & -80
\end{pmatrix} \prec
\begin{pmatrix}
5 & 5 \\
25 & -75
\end{pmatrix} = B'$$And thus, assuming that $\preceq$ is transitive, while Harb and Jay both prefer $A'$ to $B'$, the group prefers $B'$ to $A'$.
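The two steps of this chain can be sketched in code. The sketch relies on the observation that Constant Anonymity, together with Preference Dominance applied in both directions, makes the group indifferent between any option and the option obtained by swapping its rows; the helper names `swap_rows` and `strictly_dominated` are illustrative only.

```python
# Rows are individuals (Harb, Jay); columns are states (Ladybug wins, loses).
A_PRIME = [[20, -80], [0, 0]]
B_PRIME = [[5, 5], [25, -75]]


def swap_rows(option):
    """Exchange Harb's gamble with Jay's.

    By Constant Anonymity plus Preference Dominance, the group is
    indifferent between an option and its row-swapped counterpart.
    """
    harb, jay = option
    return [list(jay), list(harb)]


def strictly_dominated(x, y):
    """Antecedent of Utility Dominance: every entry of x is strictly
    below the corresponding entry of y."""
    return all(x[i][j] < y[i][j] for i in range(2) for j in range(2))


middle = swap_rows(A_PRIME)

# Step 1: A' is indifferent to `middle`. Step 2: `middle` is strictly
# worse than B' by Utility Dominance. Transitivity then puts A' below B'.
print(middle, strictly_dominated(middle, B_PRIME))

# Utility Dominance does not apply to A' and B' directly, which is why
# the detour through the row-swapped option is needed.
print(strictly_dominated(A_PRIME, B_PRIME))
```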

More generally, Blessenohl proves an impossibility result. Add to the principles we have already stated the following:

Ex Ante Pareto If $A \preceq_H B$ and $A \preceq_J B$, then $A \preceq B$.

And also:

Egoism For any $a, b, c, d, a', b', c', d'$,$$\begin{pmatrix}
a & b \\
c & d
\end{pmatrix} \sim_H \begin{pmatrix}
a & b \\
c' & d'
\end{pmatrix}\ \ \ \text{and}\ \ \
\begin{pmatrix}
a & b \\
c & d
\end{pmatrix} \sim_J \begin{pmatrix}
a' & b' \\
c & d
\end{pmatrix}$$That is, Harb cares only about the utilities he obtains from an option, and Jay cares only about the utilities that he obtains. And finally:

Individual Preference Divergence There are $a, b, c, d$ such that$$\begin{pmatrix}
a & b \\
a & b
\end{pmatrix} \prec_H \begin{pmatrix}
c & d \\
c & d
\end{pmatrix}\ \ \ \text{and}\ \ \
\begin{pmatrix}
a & b \\
a & b
\end{pmatrix} \succ_J \begin{pmatrix}
c & d \\
c & d
\end{pmatrix}$$Then Blessenohl shows that there are no preferences $\preceq_H$, $\preceq_J$, and $\preceq$ that satisfy Individual Preference Divergence, Egoism, Ex Ante Pareto, Constant Anonymity, Preference Dominance, and Utility Dominance.** And yet, he claims, each of these is plausible. He suggests that we should give up Individual Preference Divergence, and with it permissivism and risk-weighted expected utility theory.

Now, the problem that Blessenohl identifies arises because Harb and Jay have different credences in the same proposition. But of course impermissivists agree that two rational individuals can have different credences in the same proposition. So why is this a problem specifically for permissivism? The reason is that, for the impermissivist, if two rational individuals have different credences in the same proposition, they must have different evidence. And for individuals with different evidence, we wouldn't necessarily want the group preference to preserve unanimous agreement between the individuals. Instead, we'd want the group to choose using whichever credences are rational in the light of the joint evidence obtained by pooling the evidence held by each individual in the group. And those might render one option preferable to the other even though each of the individuals, with their less well informed credences, prefers the second option to the first. So Ex Ante Pareto is not plausible when the individuals have different evidence, so impermissivism is safe.

To see this, consider the following example: There are two medical conditions, $X$ and $Y$, that affect racehorses. If they have $X$, they're 90% likely to win the race; if they have $Y$, they're 60% likely; if they have both, they're 10% likely to win. Suppose Harb knows that Ladybug has $X$, but has no information about whether she has $Y$; and suppose Jay knows Ladybug has $Y$ and no information about $X$. Then both are rational. And both prefer $A$ to $B$ from above. But we wouldn't expect the group to prefer $A$ to $B$, since the group should choose using the credence it's rational to have if you know both that Ladybug has $X$ and that she has $Y$; that is, the group should choose by pooling the individuals' evidence to give the group evidence, and then choose using the probabilities relative to that. And, relative to that evidence, $B$ is preferable to $A$.

The permissivist, in contrast, cannot make this move. After all, for them it is possible for two rational individuals to disagree even though they have exactly the same evidence, and therefore the same pooled evidence. Blessenohl considers various ways the permissivist or the risk-weighted expected utility theorist might answer his objection, either by denying Ex Ante Pareto or Preference or Utility Dominance. He considers each response unsuccessful, and I tend to agree with his assessments. However, oddly, he explicitly chooses not to consider the suggestion that we might drop Constant Anonymity. I'd like to suggest that we should consider doing exactly that.

I think Blessenohl's objection relies on an ambiguity in what the group preference ordering $\preceq$ represents. On one understanding, it is no more than an attempt to summarise the collective view of the group; on another, it represents the preferences of a third party brought in to make decisions on behalf of the group---the social chooser, if you will. I will argue that Ex Ante Pareto is plausible on the first understanding, but Constant Anonymity isn't; and Constant Anonymity is plausible on the second understanding, but Ex Ante Pareto isn't.

Let's treat the first understanding of $\preceq$. On this, $\preceq$ represents the group's collective opinions about the options on offer. So just as we might try to summarise the scientific community's view on the future trajectory of Earth's average surface temperature or the mechanisms of transmission for SARS-CoV-2 by looking at the views of individual scientists, so might we try to summarise Harb and Jay's collective view of various options by looking at their individual views. Understood in this way, Constant Anonymity does not look plausible. Its motivation is, of course, straightforward. If $a < b$ and$$\begin{pmatrix}
a & a \\
b & b
\end{pmatrix} \prec
\begin{pmatrix}
b & b \\
a & a
\end{pmatrix}$$then the group's collective view unfairly and without justification favours Harb over Jay. And if$$\begin{pmatrix}
a & a \\
b & b
\end{pmatrix} \succ
\begin{pmatrix}
b & b \\
a & a
\end{pmatrix}$$then it unfairly and without justification favours Jay over Harb. So we should rule out both of these. But this doesn't entail that the group preference should be indifferent between these two options. That is, it doesn't entail that we should have$$\begin{pmatrix}
a & a \\
b & b
\end{pmatrix} \sim
\begin{pmatrix}
b & b \\
a & a
\end{pmatrix}$$After all, when you compare two options $A$ and $B$, there are four possibilities:

(1) $A \preceq B$ and $B \preceq A$---that is, $A \sim B$;
(2) $A \preceq B$ and $B \not \preceq A$---that is, $A \prec B$;
(3) $A \not \preceq B$ and $B \preceq A$---that is, $A \succ B$;
(4) $A \not \preceq B$ and $B \not \preceq A$---that is, $A$ and $B$ are incomparable.

The argument for Constant Anonymity rules out (2) and (3), but it does not rule out (4). What's more, if we weaken Constant Anonymity so that it requires (1) or (4) rather than requiring (1), then it's easy to see that the weakened principle is consistent with all of the other principles. So introduce Weak Constant Anonymity:

Weak Constant Anonymity For any $a, c$, either$$\begin{pmatrix}
a & a \\
c & c
\end{pmatrix} \sim
\begin{pmatrix}
c & c \\
a & a
\end{pmatrix}$$or$$\begin{pmatrix}
a & a \\
c & c
\end{pmatrix}\ \ \text{and}\ \
\begin{pmatrix}
c & c \\
a & a
\end{pmatrix}\ \ \text{are incomparable}$$

Then define the preference ordering $\preceq^*$ as follows:$$A \preceq^* B \Leftrightarrow \left ( A \preceq_H B\ \&\ A \preceq_J B \right )$$Then $\preceq^*$ satisfies Ex Ante Pareto, Weak Constant Anonymity, Preference Dominance, and Utility Dominance. And indeed $\preceq^*$ seems a very plausible candidate for the group preference ordering understood in this first way: where Harb and Jay disagree, it simply has no opinion on the matter; it has opinions only where Harb and Jay agree, and then it shares their shared opinion.

On the understanding of $\preceq$ as summarising the group's collective view, if $\begin{pmatrix}
a & a \\
c & c
\end{pmatrix} \sim
\begin{pmatrix}
c & c \\
a & a
\end{pmatrix}$ then the group collectively thinks that this option $\begin{pmatrix}
a & a \\
c & c
\end{pmatrix}$ is exactly as good as this option $\begin{pmatrix}
c & c \\
a & a
\end{pmatrix}$. But the group absolutely does not think that. Indeed, Harb and Jay both explicitly deny it, though for opposing reasons. So Constant Anonymity is false.

Let's turn next to the second understanding. On this, $\preceq$ is the preference ordering of the social chooser. Here, the original, stronger version of Constant Anonymity seems more plausible. After all, unlike the group itself, the social chooser should have the sort of positive commitment to equality and fairness that the group definitively does not have. As we noted above, Harb and Jay unanimously reject the egalitarian assessment represented by $\begin{pmatrix}
a & a \\
c & c
\end{pmatrix} \sim
\begin{pmatrix}
c & c \\
a & a
\end{pmatrix}$. They explicitly both think that these two options are not equally good---if $a < c$, then Harb thinks the second is strictly better, while Jay thinks the first is strictly better. So, as we argued above, we take the group view to be that they are incomparable. But the social chooser should not remain so agnostic. She should overrule the unanimous rejection of the indifference relation between them and accept it. But, having thus overruled one unanimous view and taken a different one, it is little surprise that she will reject other unanimous views, such as Harb and Jay's unanimous view that $A'$ is better than $B'$ above. That is, it is little surprise that she should violate Ex Ante Pareto. After all, her preferences are not only informed by a value that Harb and Jay do not endorse; they are informed by a value that Harb and Jay explicitly reject, given our assumption of Egoism. This is the value of fairness, which is embodied in the social chooser's preferences in Constant Anonymity and rejected in Harb's and Jay's preferences by Egoism. If we require of our social chooser that they adhere to this value, we should not expect Ex Ante Pareto to hold.

* See Philippe Mongin's 1995 paper 'Consistent Bayesian Aggregation' for wide-ranging results in this area.

** Here's the trick: if$$\begin{pmatrix}
a & b \\
a & b
\end{pmatrix} \prec_H \begin{pmatrix}
c & d \\
c & d
\end{pmatrix}\ \ \ \text{and}\ \ \
\begin{pmatrix}
a & b \\
a & b
\end{pmatrix} \succ_J \begin{pmatrix}
c & d \\
c & d
\end{pmatrix}$$
Then let$$A' = \begin{pmatrix}
c & d \\
a & b
\end{pmatrix}\ \ \ \text{and}\ \ \
B' = \begin{pmatrix}
a & b \\
c & d
\end{pmatrix}$$Then $A' \succ_H B'$ and $A' \succ_J B'$, but $A' \sim B'$.

Life on the edge: a response to Schultheis' challenge to epistemic permissivism about credences

2021-01-06T14:24:00.002+00:00

In their 2018 paper, 'Living on the Edge', Ginger Schultheis issues a powerful challenge to epistemic permissivism about credences, the view that there are bodies of evidence in response to which there are a number of different credence functions it would be rational to adopt. The heart of the argument is the claim that a certain sort of situation is impossible. Schultheis thinks that all motivations for permissivism must render situations of this sort possible. Therefore, permissivism must be false, or at least these motivations for it must be wrong.

Here's the situation, where we write $R_E$ for the set of credence functions that it is rational to have when your total evidence is $E$.

Our agent's total evidence is $E$.
There is $c$ in $R_E$ that our agent knows is a rational response to $E$.
There is $c'$ in $R_E$ that our agent does not know is a rational response to $E$.

Schultheis claims that the permissivist must take this to be possible, whereas in fact it is impossible. Here are a couple of specific examples that the permissivist will typically take to be possible.

Example 1: we might have a situation in which the credences it is rational to assign to a proposition $X$ in response to evidence $E$ form the interval $[0.4, 0.7]$. But we might not be sure of quite the extent of the interval. For all we know, it might be $[0.41, 0.7]$ or $[0.39, 0.71]$. Or it might be $[0.4, 0.7]$. So we are sure that $0.5$ is a rational credence in $X$, but we're not sure whether $0.4$ is a rational credence in $X$. In this case, $c(X) = 0.5$ and $c'(X) = 0.4$.

Example 2: you know that Probabilism is a rational requirement on credence functions, and you know that satisfying the Principle of Indifference is rationally permitted, but you don't know whether or not it is also rationally required. In this case, $c$ is the uniform distribution required by the Principle of Indifference, but $c'$ is any other probability function.

Schultheis then appeals to a principle called Weak Rationality Dominance. We say that one credence function $c$ rationally dominates another $c'$ if $c$ is rational in all worlds in which $c'$ is rational, and also rational in some worlds in which $c'$ is not rational. Weak Rationality Dominance says that it is irrational to adopt a rationally dominated credence function. The important consequence of this for Schultheis' argument is that, if you know that $c$ is rational, but you don't know whether $c'$ is, then $c'$ is irrational. As a result, in our example above, $c'$ is not rational, contrary to what the permissivist claims, because it is rationally dominated by $c$. So permissivism must be false.
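To make the definition of rational dominance concrete, here is a minimal Python sketch applied to Example 1. It assumes, purely for illustration, that the only epistemically possible worlds are the three candidate intervals mentioned in that example; the names `WORLDS`, `is_rational` and `rationally_dominates` are hypothetical.

```python
# Each epistemically possible world fixes the interval of rational credences in X.
# Assumption: only the three intervals from Example 1 are live possibilities.
WORLDS = [(0.40, 0.70), (0.41, 0.70), (0.39, 0.71)]

def is_rational(credence, world):
    """A credence in X is rational at a world if it lies in that world's interval."""
    low, high = world
    return low <= credence <= high

def rationally_dominates(c, c_prime, worlds=WORLDS):
    """c is rational wherever c_prime is, and also somewhere c_prime is not."""
    wherever = all(is_rational(c, w) for w in worlds if is_rational(c_prime, w))
    somewhere_extra = any(
        is_rational(c, w) and not is_rational(c_prime, w) for w in worlds
    )
    return wherever and somewhere_extra

print(rationally_dominates(0.5, 0.4))  # True
print(rationally_dominates(0.4, 0.5))  # False
```

On these assumptions, credence $0.5$ rationally dominates credence $0.4$, which is why Weak Rationality Dominance would rule out $c'$ in Example 1.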

If Weak Rationality Dominance is correct, then, it follows that the permissivist must say that, for any body of evidence $E$ and set $R_E$ of rational responses, the agent with evidence $E$ either must know of each credence function in $R_E$ that it is in $R_E$, or they must not know of any credence function in $R_E$ that it is in $R_E$. If they know of some credence functions in $R_E$ that they are in $R_E$ and do not know of others in $R_E$ that they are in $R_E$, then they clash with Weak Rationality Dominance. But, whatever your reason for being a permissivist, it seems very likely that it will entail situations in which there are some credence functions that are rational responses to your evidence and that you know are such responses, while there are others that are, in fact, rational responses, but of which you are unsure whether they are. This is Schultheis' challenge.

I'd like to explore a response to Schultheis' argument that takes issue with Weak Rationality Dominance (WRD). I'll spell out the objection in general to begin with, and then see how it plays out for a specific motivation for permissivism, namely, the Jamesian motivation I sketched in this previous blogpost.

One worry about WRD is that it seems to entail a deference principle of exactly the sort that I objected to in this blogpost. According to such deference principles, for certain agents in certain situations, if they learn of a credence function that it is rational, they should adopt it. For instance, Ben Levinstein claims that, if you are certain that you are irrational, and you learn that $c$ is rational, then you should adopt $c$ -- or at least you should have the conditional credences that would lead you to do this if you were to apply conditionalization. We might slightly strengthen Levinstein's version of the deference principle as follows: if you are unsure whether you are rational or not, and you learn that $c$ is rational, then you should adopt $c$. WRD entails this deference principle. After all, suppose you have credence function $c'$, and you are unsure whether or not it is rational. And suppose you learn that $c$ is rational (and don't thereby learn that $c'$ is as well). Then, according to Schultheis' principle, you are irrational if you stick with $c'$.

In the previous blogpost, I objected to Levinstein's deference principle, and others like it, because it relies on the assumption that all rational credence functions are better than all irrational credence functions. I think that's false. I think there are certain sorts of flaw that render you irrational, and lacking those flaws renders you rational. But lacking those flaws doesn't ensure that you're going to be better than someone who has those flaws. Consider, for instance, the extreme subjective Bayesian who justifies their position using an accuracy dominance argument of the sort pioneered by Jim Joyce. That is, they say that accuracy is the sole epistemic good for credence functions. And they say that non-probabilistic credence functions are irrational because, for any such credence function, there are probabilistic ones that accuracy dominate it; and all probabilistic credence functions are rational because, for any such credence function, there is no probabilistic one that accuracy dominates it. Now, suppose I have credence $0.91$ in $X$ and $0.1$ in $\overline{X}$. And suppose I am either sure that this is irrational, or I'm uncertain whether it is. I then learn that assigning credence $0.1$ to $X$ and $0.9$ to $\overline{X}$ is rational. What should I do? It isn't at all obvious to me that I should move from my credence function to the one I've learned is rational. After all, even from my slightly incoherent standpoint, it's possible to see that the rational one is going to be a lot less accurate than mine if $X$ is true, and I'm very confident that it is.
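To put rough numbers on that last claim, here is a small Python sketch. It assumes the Brier score as the measure of inaccuracy, which the argument above does not itself require; the function name `brier_inaccuracy` is hypothetical.

```python
from fractions import Fraction

def brier_inaccuracy(cred_x, cred_not_x, x_is_true):
    """Squared distance of the two credences from the truth values of X and not-X."""
    truth_x, truth_not_x = (1, 0) if x_is_true else (0, 1)
    return (truth_x - cred_x) ** 2 + (truth_not_x - cred_not_x) ** 2

mine = (Fraction(91, 100), Fraction(1, 10))   # slightly incoherent credences
learned = (Fraction(1, 10), Fraction(9, 10))  # credences I learn are rational

for x_is_true in (True, False):
    print(
        "X true: " if x_is_true else "X false:",
        float(brier_inaccuracy(*mine, x_is_true)),
        float(brier_inaccuracy(*learned, x_is_true)),
    )
```

On this measure, if $X$ is true my incoherent credences have inaccuracy $0.0181$ against $1.62$ for the rational ones; if $X$ is false the comparison reverses, with $1.6381$ against $0.02$.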

So I think that the rational deference principle is wrong, and therefore any version of WRD that entails it is also wrong. But perhaps there is a more restricted version of WRD that is right. And one that is nonetheless capable of sinking permissivism. Consider, for instance, a restricted version of WRD that applies only to agents who have no credence function --- that is, it applies to your initial choice of a credence function; it does not apply when you have a credence function and you are deciding whether to adopt a new one. This makes a difference. The problem with a version that applies when you already have a credence function $c'$ is that, even if it is irrational, it might nonetheless be better than the rational credence function $c$ in some situation, and it might be that $c'$ assigns a lot of credence to that situation. So it's hard to see how to motivate the move from $c'$ to $c$. However, in a situation in which you have no credence function, and you are unsure whether $c'$ is rational (even though it is) and you're certain that $c$ is rational (and indeed it is), WRD's demand that you should not pick $c'$ seems more reasonable. You occupy no point of view such that $c'$ is less of a departure from that point of view than $c$ is. You know only that $c$ lacks the flaws for sure, whereas $c'$ might have them. Better, then, to go for $c$, is it not? And if it is, this is enough to defeat permissivism.

I think it's not quite that simple. I noted above that Levinstein's deference principle relies on the assumption that all rational credence functions are better than all irrational credence functions. Schultheis' WRD seems to rely on something even stronger, namely, the assumption that all rational credence functions are equally good in all situations. For suppose they are not. You might then be unsure whether $c'$ is rational (though it is) and sure that $c$ is rational (and it is), but nonetheless rationally opt for $c'$ because you know that $c'$ has some good feature that you know $c$ lacks and you're willing to take the risk of having an irrational credence function in order to open the possibility of having that good feature.

Here's an example. You are unsure whether it is rational to assign $0.7$ to $X$ and $0.3$ to $\overline{X}$. It turns out that it is, but you don't know that. On the other hand, you do know that it is rational to assign 0.5 to each proposition. But the first assignment and the second are not equally good in all situations. The second has the same accuracy whether $X$ is true or false; the first, in contrast, is better than the second if $X$ is true and worse than the second if $X$ is false. The second does not open up the possibility of high accuracy that the first does; though, to compensate, it also precludes the possibility of low accuracy, which the first doesn't. Surveying the situation, you think that you will take the risk. You'll adopt the first, even though you aren't sure whether or not it is rational. And you'll do this because you want the possibility of being rational and having that higher accuracy. This seems a rational thing to do. So, it seems to me, WRD is false.

Although I think this objection to WRD works, I think it's helpful to see how it might play out for a particular motivation for permissivism. Here's the motivation: Some credence functions offer the promise of great accuracy -- for instance, assigning 0.9 to $X$ and 0.1 to $\overline{X}$ will be very accurate if $X$ is true. However, those that do so also open the possibility of great inaccuracy -- if $X$ is false, the credence function just considered is very inaccurate. Other credence functions neither offer great accuracy nor risk great inaccuracy. For instance, assigning 0.5 to both $X$ and $\overline{X}$ guarantees the same inaccuracy whether or not $X$ is true. You might say that you are more risk-averse the lower is the maximum possible inaccuracy you are willing to risk. Thus, the options that are rational for you are those undominated options with maximum inaccuracy at most whatever the threshold is that you set. Now, suppose you use the Brier score to measure your inaccuracy -- so that the inaccuracy of the credence function $c(X) = p$ and $c(\overline{X}) = 1-p$ is $2(1-p)^2$ if $X$ is true and $2p^2$ if $X$ is false. And suppose you are willing to tolerate a maximum possible inaccuracy of $0.5$, which also gives you a minimum inaccuracy of $0.5$. In that case, only $c(X) = 0.5 = c(\overline{X})$ will be rational from the point of view of your risk attitudes --- since $2(1-0.5)^2 = 0.5 = 2(0.5^2)$. On the other hand, suppose you are willing to tolerate a maximum inaccuracy of $0.98$, which also gives you a minimum inaccuracy of $0.18$. In that case, any credence function $c$ with $0.3 \leq c(X) \leq 0.7$ and $c(\overline{X}) = 1-c(X)$ is rational from the point of view of your risk attitudes.
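The arithmetic in the last paragraph can be checked with a short Python sketch. It uses exact fractions to avoid rounding at the thresholds and searches a grid of credences in steps of $0.01$; the grid and the function names are assumptions made for illustration.

```python
from fractions import Fraction

def brier_inaccuracy(p, x_is_true):
    """Inaccuracy of c(X) = p, c(not-X) = 1 - p: 2(1-p)^2 if X is true, 2p^2 if false."""
    return 2 * (1 - p) ** 2 if x_is_true else 2 * p ** 2

def max_inaccuracy(p):
    return max(brier_inaccuracy(p, True), brier_inaccuracy(p, False))

def min_inaccuracy(p):
    return min(brier_inaccuracy(p, True), brier_inaccuracy(p, False))

def permitted(threshold, grid):
    """Credences in X whose worst-case inaccuracy does not exceed the threshold."""
    return [p for p in grid if max_inaccuracy(p) <= threshold]

grid = [Fraction(k, 100) for k in range(101)]
cautious = permitted(Fraction(1, 2), grid)
bolder = permitted(Fraction(98, 100), grid)

print(cautious)                  # [Fraction(1, 2)]
print(min(bolder), max(bolder))  # 3/10 7/10
```

With threshold $0.5$ only the credence $0.5$ survives; with threshold $0.98$ the permitted credences on the grid run from $0.3$ to $0.7$, and the best case at either end is an inaccuracy of $0.18$.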

Now, suppose that you are in the sort of situation that Schultheis imagines. You are uncertain of the extent of the set $R_E$ of rational responses to your evidence $E$. On the account we're considering, this must be because you are uncertain of your own attitudes to epistemic risk. Let's say that the threshold of maximum inaccuracy that you're willing to tolerate is $0.98$, but you aren't certain of that --- you think it might be anything between $0.72$ and $1.28$. So you're sure that it's rational to assign anything between 0.4 and 0.6 to $X$, but unsure whether it's rational to assign $0.7$ to $X$ --- if your threshold turns out to be less than 0.98, then assigning $0.7$ to $X$ would be irrational, because it risks inaccuracy of $0.98$. In this situation, is it rational to assign $0.7$ to $X$? I think it is. Among the credence functions that you know for sure are rational, the ones that give you the lowest possible inaccuracy are the one that assigns 0.4 to $X$ and the one that assigns 0.6 to $X$. They have maximum inaccuracy of 0.72, and they open up the possibility of an inaccuracy of 0.32, which is lower than the lowest possible inaccuracy opened up by any others that you know to be rational. On the other hand, assigning 0.7 to $X$ opens up the possibility of an inaccuracy of 0.18, which is considerably lower. As a result, it doesn't seem irrational to assign 0.7 to $X$, even though you don't know whether it is rational from the point of view of your attitudes to risk, and you do know that assigning 0.6 is rational.
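Here is a minimal Python sketch of the comparison just made, again using the Brier score with exact fractions and a grid of credences in steps of $0.01$. The variable names are hypothetical, and the lower bound $0.72$ on your threshold is the one stipulated above.

```python
from fractions import Fraction

def risks(p):
    """Return (best-case, worst-case) Brier inaccuracy for c(X) = p."""
    if_true, if_false = 2 * (1 - p) ** 2, 2 * p ** 2
    return min(if_true, if_false), max(if_true, if_false)

lowest_threshold = Fraction(72, 100)  # the smallest threshold you cannot rule out
grid = [Fraction(k, 100) for k in range(101)]

# Credences that are rational whatever your threshold turns out to be.
known_rational = [p for p in grid if risks(p)[1] <= lowest_threshold]
best_case_known = min(risks(p)[0] for p in known_rational)
best_case_07, worst_case_07 = risks(Fraction(7, 10))

print(min(known_rational), max(known_rational))  # 2/5 3/5
print(best_case_known)                           # 8/25
print(best_case_07, worst_case_07)               # 9/50 49/50
```

So the best case available among the credences you know to be rational is an inaccuracy of $0.32$, whereas assigning $0.7$ offers a best case of $0.18$ at the cost of a worst case of $0.98$.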

There is another possible response to Schultheis' challenge for those who like this sort of motivation for permissivism. You might simply say that, if your attitudes to risk are such that you will tolerate a maximum inaccuracy of at most $t$, then regardless of whether you know this fact, indeed regardless of your level of uncertainty about it, the rational credence functions are precisely those that have maximum inaccuracy of at most $t$. This sort of approach is familiar from expected utility theory. Suppose I have credences in $X$ and in $\overline{X}$. And suppose I face two options whose utility is determined by whether or not $X$ is true or false. Then, regardless of what I believe about my credences in $X$ and $\overline{X}$, I should choose whichever option maximises expected utility from the point of view of my actual credences. The point is this: if what it is rational for you to believe or to do is determined by some feature of you, whether it's your credences or your attitudes to risk, being uncertain about those features doesn't change what it is rational for you to do. This introduces a certain sort of externalism to our notion of rationality. There are features of ourselves -- our credences or our attitudes to risk -- that determine what it is rational for us to believe or do, which are nonetheless not luminous to us. But I think this is inevitable. Of course, we might move up a level and create a version of expected utility theory that appeals not to our first-order credences but to our credences concerning those first-order credences -- perhaps you use the higher-order credences to define a higher-order expected value for the first-order expected utilities, and you maximize that. But it simply pushes the problem back a step. For your higher-order credences are no more luminous than your first-order ones.
And to stop the regress, you must fix some level at which the credences at that level simply determine the expectation that rationality requires you to maximize, and any uncertainty concerning those does not affect rationality. And the same goes in this case. So, given this particular motivation for permissivism, which appeals to your attitudes to epistemic risk, it seems that there is another reason why WRD is false. If $c$ is in $R_E$, then it is rational for you, regardless of your epistemic attitude to its rationality.

Using a generalized Hurwicz criterion to pick your priors

2021-01-04T11:53:00.001+00:00

Over the summer, I got interested in the problem of the priors again. Which credence functions is it rational to adopt at the beginning of your epistemic life? Which credence functions is it rational to have before you gather any evidence? Which credence functions provide rationally permissible responses to the empty body of evidence? As is my wont, I sought to answer this in the framework of epistemic utility theory. That is, I took the rational credence functions to be those declared rational when the appropriate norm of decision theory is applied to the decision problem in which the available acts are all the possible credence functions, and where the epistemic utility of a credence function is measured by a strictly proper measure. I considered a number of possible decision rules that might govern us in this evidence-free situation: Maximin, the Principle of Indifference, and the Hurwicz criterion. And I concluded in favour of a generalized version of the Hurwicz criterion, which I axiomatised. I also described which credence functions that decision rule would render rational in the case in which there are just three possible worlds between which we divide our credences. In this post, I'd like to generalize the results from that treatment to the case in which there is any finite number of possible worlds.

Here's the decision rule (where $a(w_i)$ is the utility of $a$ at world $w_i$).

Generalized Hurwicz Criterion Given an option $a$ and a sequence of weights $0 \leq \lambda_1, \ldots, \lambda_n \leq 1$ with $\sum^n_{i=1} \lambda_i = 1$, which we denote $\Lambda$, define the generalized Hurwicz score of $a$ relative to $\Lambda$ as follows: if $$a(w_{i_1}) \geq a(w_{i_2}) \geq \ldots \geq a(w_{i_n})$$ then $$H^\Lambda(a) := \lambda_1a(w_{i_1}) + \ldots + \lambda_na(w_{i_n})$$That is, $H^\Lambda(a)$ is the weighted average of all the possible utilities that $a$ receives, where $\lambda_1$ weights the highest utility, $\lambda_2$ weights the second highest, and so on.

The Generalized Hurwicz Criterion says that you should order options by their generalized Hurwicz score relative to a sequence $\Lambda$ of weightings of your choice. Thus, given $\Lambda$,$$a \preceq^\Lambda_{ghc} a' \Leftrightarrow H^\Lambda(a) \leq H^\Lambda(a')$$And the corresponding decision rule says that you should pick your Hurwicz weights $\Lambda$ and then, having done that, it is irrational to choose $a$ if there is $a'$ such that $a \prec^\Lambda_{ghc} a'$.
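As a minimal sketch of the definition, the generalized Hurwicz score can be computed in Python by sorting the utilities from best to worst and pairing them with the weights in order. The function name and the example utilities are hypothetical; the three special cases in the comments are standard readings of the weights rather than anything specific to this post.

```python
from fractions import Fraction

def generalized_hurwicz_score(utilities, weights):
    """Weighted average of utilities, where weights[0] goes to the best outcome,
    weights[1] to the second best, and so on."""
    ranked = sorted(utilities, reverse=True)
    return sum(w * u for w, u in zip(weights, ranked))

utilities = [1, 5, 3]  # utilities of one option at three worlds
third = Fraction(1, 3)
half = Fraction(1, 2)

print(generalized_hurwicz_score(utilities, [0, 0, 1]))        # 1: all weight on the worst case (Maximin)
print(generalized_hurwicz_score(utilities, [third] * 3))      # 3: equal weights (uniform expectation)
print(generalized_hurwicz_score(utilities, [half, 0, half]))  # 3: original Hurwicz with weight 1/2 on the best case
```

Ordering two options by the criterion then amounts to comparing their scores for the same sequence of weights.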

Now, let $\mathfrak{U}$ be an additive strictly proper epistemic utility measure. That is, it is generated by a strictly proper scoring rule. A strictly proper scoring rule is a function $\mathfrak{s} : \{0, 1\} \times [0, 1] \rightarrow [-\infty, 0]$ such that, for any $0 \leq p \leq 1$, $p\mathfrak{s}(1, x) + (1-p)\mathfrak{s}(0, x)$ is maximized, as a function of $x$, uniquely at $x = p$. And an epistemic utility measure is generated by $\mathfrak{s}$ if, for any credence function $C$ and world $w_i$,$$\mathfrak{U}(C, w_i) = \sum^n_{j=1} \mathfrak{s}(w^j_i, c_j)$$where

$c_j = C(w_j)$, and
$w^j_i = 1$ if $j=i$ and $w^j_i = 0$ if $j \neq i$

In what follows, we write the sequence $(c_1, \ldots, c_n)$ to represent the credence function $C$.
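For a concrete instance, here is a Python sketch that takes the quadratic (Brier) scoring rule as $\mathfrak{s}$, which is one strictly proper scoring rule among many, and builds the additive utility measure from it. The check of strict propriety is only over a grid of values for $x$, so it illustrates the property rather than proving it; worlds are indexed from 0 in the code.

```python
from fractions import Fraction

def quadratic_score(truth_value, x):
    """s(1, x) = -(1 - x)^2 and s(0, x) = -x^2, so scores lie in [-1, 0]."""
    return -(truth_value - x) ** 2

def expected_score(p, x):
    """p * s(1, x) + (1 - p) * s(0, x)."""
    return p * quadratic_score(1, x) + (1 - p) * quadratic_score(0, x)

def epistemic_utility(credences, i):
    """U(C, w_i): sum over worlds w_j of s(1, c_j) if j == i, else s(0, c_j)."""
    return sum(
        quadratic_score(1 if j == i else 0, c) for j, c in enumerate(credences)
    )

grid = [Fraction(k, 100) for k in range(101)]
p = Fraction(3, 10)
best_x = max(grid, key=lambda x: expected_score(p, x))
print(best_x)  # 3/10

C = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
print(epistemic_utility(C, 0))  # -3/8
```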

Also, given a sequence $A = (\alpha_1, \ldots, \alpha_k)$ of numbers, let$$\mathrm{av}(A) := \frac{\alpha_1 + \ldots + \alpha_k}{k}$$That is, $\mathrm{av}(A)$ is the average of the numbers in $A$. And given a sequence $A = (a_1, \ldots, a_n)$ and $1 \leq k \leq n$, let $A|_k = (a_1, \ldots, a_k)$. That is, $A|_k$ is the truncation of the sequence $A$ that omits all terms after $a_k$. Then we say that $A$ does not exceed its average if, for each $1 \leq k \leq n$,$$\mathrm{av}(A) \geq \mathrm{av}(A|_k)$$That is, at no point in the sequence does the average of the numbers up to that point exceed the average of all the numbers in the sequence.
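These two definitions translate directly into Python. The sketch below is only an illustration; the function names are hypothetical, and exact fractions are used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction

def av(seq):
    """Average of the numbers in a non-empty sequence."""
    return sum(seq) / len(seq)

def does_not_exceed_average(seq):
    """True if no initial segment of seq has a higher average than seq itself."""
    return all(av(seq) >= av(seq[:k]) for k in range(1, len(seq) + 1))

rising = [Fraction(1, 10), Fraction(2, 5)]
falling = [Fraction(2, 5), Fraction(1, 10)]

print(does_not_exceed_average(rising))   # True
print(does_not_exceed_average(falling))  # False
```

The second sequence fails because its first term, $0.4$, is already above the overall average of $0.25$.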

Theorem 1 Suppose $\Lambda = (\lambda_1, \ldots, \lambda_n)$ is a sequence of generalized Hurwicz weights. Then there is a sequence of subsequences $\Lambda_1, \ldots, \Lambda_m$ of $\Lambda$ such that

(1) $\Lambda = \Lambda_1 \frown \ldots \frown \Lambda_m$
(2) $\mathrm{av}(\Lambda_1) \geq \ldots \geq \mathrm{av} (\Lambda_m)$
(3) each $\Lambda_i$ does not exceed its average

Then, the credence function$$(\underbrace{\mathrm{av}(\Lambda_1), \ldots, \mathrm{av}(\Lambda_1)}_{\text{length of $\Lambda_1$}}, \underbrace{\mathrm{av}(\Lambda_2), \ldots, \mathrm{av}(\Lambda_2)}_{\text{length of $\Lambda_2$}}, \ldots, \underbrace{\mathrm{av}(\Lambda_m), \ldots, \mathrm{av}(\Lambda_m)}_{\text{length of $\Lambda_m$}})$$maximizes $H^\Lambda(\mathfrak{U}(-))$ among credence functions $C = (c_1, \ldots, c_n)$ for which $c_1 \geq \ldots \geq c_n$.

This is enough to give us all of the credence functions that maximise $H^\Lambda(\mathfrak{U}(-))$: they are the credence function mentioned together with any permutation of it --- that is, any credence function obtained from that one by switching around the credences assigned to the worlds.
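The decomposition in Theorem 1 can be computed by adding the weights one at a time and merging blocks backwards whenever a later block has a higher average than the one before it, which is the construction used in the inductive part of the proof. The Python below is a sketch of that procedure under my own naming (`hurwicz_blocks` and `hurwicz_prior` are hypothetical), not a definitive implementation.

```python
from fractions import Fraction

def hurwicz_blocks(weights):
    """Split weights into consecutive blocks with non-increasing averages."""
    blocks = []
    for w in weights:
        current = [w]
        # Merge with the previous block while av(previous) < av(current);
        # the comparison is cross-multiplied to avoid division.
        while blocks and sum(blocks[-1]) * len(current) < sum(current) * len(blocks[-1]):
            current = blocks.pop() + current
        blocks.append(current)
    return blocks

def hurwicz_prior(weights):
    """Replace each weight by the average of its block."""
    prior = []
    for block in hurwicz_blocks(weights):
        prior.extend([sum(block) / len(block)] * len(block))
    return prior

weights = [Fraction(1, 2), Fraction(1, 10), Fraction(2, 5)]
print(hurwicz_blocks(weights))  # [[Fraction(1, 2)], [Fraction(1, 10), Fraction(2, 5)]]
print(hurwicz_prior(weights))   # [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
```

For the weights $(0.5, 0.1, 0.4)$ this gives the blocks $(0.5)$ and $(0.1, 0.4)$, and so the credence function $(0.5, 0.25, 0.25)$ together with its permutations.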

Proof of Theorem 1. Suppose $\mathfrak{U}$ is a measure of epistemic value that is generated by the strictly proper scoring rule $\mathfrak{s}$. And suppose that $\Lambda$ is the following sequence of generalized Hurwicz weights $0 \leq \lambda_1, \ldots, \lambda_n \leq 1$ with $\sum^n_{i=1} \lambda_i = 1$.

First, due to a theorem that originates in Savage and is stated and proved fully by Predd, et al., if $C$ is not a probability function---that is, if $c_1 + \ldots + c_n \neq 1$---then there is a probability function $P$ such that $\mathfrak{U}(P, w_i) > \mathfrak{U}(C, w_i)$ for all worlds $w_i$. Thus, since GHC satisfies Strong Dominance, whatever maximizes $H^\Lambda(\mathfrak{U}(-))$ will be a probability function.

Now, since $\mathfrak{U}$ is generated by a strictly proper scoring rule, it is also truth-directed. That is, if $c_i > c_j$, then $\mathfrak{U}(C, w_i) > \mathfrak{U}(C, w_j)$. Thus, if $c_1 \geq c_2 \geq \ldots \geq c_n$, then$$H^\Lambda(\mathfrak{U}(C)) = \lambda_1\mathfrak{U}(C, w_1) + \ldots + \lambda_n\mathfrak{U}(C, w_n)$$This is what we seek to maximize. But notice that this is just the expectation of $\mathfrak{U}(C)$ from the point of view of the probability distribution $\Lambda = (\lambda_1, \ldots, \lambda_n)$.

Now, Savage also showed that, if $\mathfrak{s}$ is strictly proper and continuous, then there is a differentiable and strictly convex function $\varphi$ such that, if $P, Q$ are probabilistic credence functions, then
\begin{eqnarray*}
\mathfrak{D}_\mathfrak{s}(P, Q) & = & \sum^n_{i=1} \varphi(p_i) - \sum^n_{i=1} \varphi(q_i) - \sum^n_{i=1} \varphi'(q_i)(p_i - q_i) \\
& = & \sum^n_{i=1} p_i\mathfrak{U}(P, w_i) - \sum^n_{i=1} p_i\mathfrak{U}(Q, w_i)
\end{eqnarray*}
So $C$ maximizes $H^\Lambda(\mathfrak{U}(-))$ among credence functions $C$ with $c_1 \geq \ldots \geq c_n$ iff $C$ minimizes $\mathfrak{D}_\mathfrak{s}(\Lambda, -)$ among credence functions $C$ with $c_1 \geq \ldots \geq c_n$. We now use the KKT conditions to calculate which credence functions minimize $\mathfrak{D}_\mathfrak{s}(\Lambda, -)$ among credence functions $C$ with $c_1 \geq \ldots \geq c_n$.

Thus, if we write $x_n$ for $1 - x_1 - \ldots - x_{n-1}$, then
\begin{multline*}
f(x_1, \ldots, x_{n-1}) = \mathfrak{D}((\lambda_1, \ldots, \lambda_n), (x_1, \ldots, x_n)) = \\
\sum^n_{i=1} \varphi(\lambda_i) - \sum^n_{i=1} \varphi(x_i) - \sum^n_{i=1} \varphi'(x_i)(\lambda_i - x_i)
\end{multline*}
So
\begin{multline*}
\nabla f = \langle \varphi''(x_1) (x_1 - \lambda_1) - \varphi''(x_n)(x_n - \lambda_n), \\
\varphi''(x_2) (x_2 - \lambda_2) - \varphi''(x_n)(x_n - \lambda_n), \ldots, \\
\varphi''(x_{n-1}) (x_{n-1} - \lambda_{n-1}) - \varphi''(x_n)(x_n - \lambda_n) \rangle
\end{multline*}

Let $$\begin{array}{rcccl}
g_1(x_1, \ldots, x_{n-1}) & = & x_2 - x_1& \leq & 0\\
g_2(x_1, \ldots, x_{n-1}) & = & x_3 - x_2& \leq & 0\\
\vdots & \vdots & \vdots & \vdots & \vdots \\
g_{n-2}(x_1, \ldots, x_{n-1}) & = & x_{n-1} - x_{n-2}& \leq & 0 \\
g_{n-1}(x_1, \ldots, x_{n-1}) & = & 1 - x_1 - \ldots - x_{n-2} - 2x_{n-1} & \leq & 0
\end{array}$$So,
\begin{eqnarray*}
\nabla g_1 & = & \langle -1, 1, 0, \ldots, 0 \rangle \\
\nabla g_2 & = & \langle 0, -1, 1, 0, \ldots, 0 \rangle \\
\vdots & \vdots & \vdots \\
\nabla g_{n-2} & = & \langle 0, \ldots, 0, -1, 1 \rangle \\
\nabla g_{n-1} & = & \langle -1, -1, -1, \ldots, -1, -2 \rangle \\
\end{eqnarray*}
So the KKT theorem says that $x_1, \ldots, x_n$ is a minimizer iff there are $0 \leq \mu_1, \ldots, \mu_{n-1}$ such that$$\nabla f(x_1, \ldots, x_{n-1}) + \sum^{n-1}_{i=1} \mu_i \nabla g_i(x_1, \ldots, x_{n-1}) = 0$$That is, iff there are $0 \leq \mu_1, \ldots, \mu_{n-1}$ such that
\begin{eqnarray*}
\varphi''(x_1) (x_1 - \lambda_1) - \varphi''(x_n)(x_n - \lambda_n) - \mu_1 - \mu_{n-1} & = & 0 \\
\varphi''(x_2) (x_2 - \lambda_2) - \varphi''(x_n)(x_n - \lambda_n) + \mu_1 - \mu_2 - \mu_{n-1} & = & 0 \\
\vdots & \vdots & \vdots \\
\varphi''(x_{n-2}) (x_{n-2} - \lambda_{n-2}) - \varphi''(x_n)(x_n - \lambda_n) + \mu_{n-3} - \mu_{n-2} - \mu_{n-1}& = & 0 \\
\varphi''(x_{n-1}) (x_{n-1} - \lambda_{n-1}) - \varphi''(x_n)(x_n - \lambda_n)+\mu_{n-2} - 2\mu_{n-1} & = & 0
\end{eqnarray*}
By summing these identities, we get:
\begin{eqnarray*}
\mu_{n-1} & = & \frac{1}{n} \sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{n-1}{n} \varphi''(x_n)(x_n - \lambda_n) \\
&= & \frac{1}{n} \sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \varphi''(x_n)(x_n - \lambda_n) \\
& = & \sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{n-1}{n}\sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i)
\end{eqnarray*}
So, for $1 \leq k \leq n-2$,
\begin{eqnarray*}
\mu_k & = & \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - k\varphi''(x_n)(x_n - \lambda_n) - \\
&& \hspace{20mm} \frac{k}{n}\sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) + k\frac{n-1}{n} \varphi''(x_n)(x_n - \lambda_n) \\
& = & \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{k}{n}\sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) -\frac{k}{n} \varphi''(x_n)(x_n - \lambda_n) \\
&= & \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{k}{n}\sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i)
\end{eqnarray*}
So, for $1 \leq k \leq n-1$,
$$\mu_k = \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{k}{n}\sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i)$$
Now, suppose that there is a sequence of subsequences $\Lambda_1, \ldots, \Lambda_m$ of $\Lambda$ such that

$\Lambda = \Lambda_1 \frown \ldots \frown \Lambda_m$
$\mathrm{av}(\Lambda_1) \geq \ldots \geq \mathrm{av}(\Lambda_m)$
each $\Lambda_i$ does not exceed its average.

And let $$P = (\underbrace{\mathrm{av}(\Lambda_1), \ldots, \mathrm{av}(\Lambda_1)}_{\text{length of $\Lambda_1$}}, \underbrace{\mathrm{av}(\Lambda_2), \ldots, \mathrm{av}(\Lambda_2)}_{\text{length of $\Lambda_2$}}, \ldots, \underbrace{\mathrm{av}(\Lambda_m), \ldots, \mathrm{av}(\Lambda_m)}_{\text{length of $\Lambda_m$}})$$Then we write $i \in \Lambda_j$ if $\lambda_i$ is in the subsequence $\Lambda_j$. So, for $i \in \Lambda_j$, $p_i = \mathrm{av}(\Lambda_j)$. Then$$\frac{k}{n}\sum^n_{i=1} \varphi''(p_i)(p_i - \lambda_i) = \frac{k}{n} \sum^m_{j = 1} \sum_{i \in \Lambda_j} \varphi''(\mathrm{av}(\Lambda_j))(\mathrm{av}(\Lambda_j) - \lambda_i) = 0 $$
Now, suppose $k$ is in $\Lambda_j$. Then
\begin{multline*}
\mu_k = \sum^k_{i=1} \varphi''(p_i)(p_i - \lambda_i) = \\
\sum_{i \in \Lambda_1} \varphi''(p_i)(p_i - \lambda_i) + \sum_{i \in \Lambda_2} \varphi''(p_i)(p_i - \lambda_i) + \ldots + \\
\sum_{i \in \Lambda_{j-1}} \varphi''(p_i)(p_i - \lambda_i) + \sum_{i \in \Lambda_j|_k} \varphi''(p_i)(p_i - \lambda_i) = \\
\sum_{i \in \Lambda_j|_k} \varphi''(p_i)(p_i - \lambda_i) = \sum_{i \in \Lambda_j|_k} \varphi''(\mathrm{av}(\Lambda_j))(\mathrm{av}(\Lambda_j) - \lambda_i)
\end{multline*}
So, if $|\Lambda|$ is the length of the sequence $\Lambda$,$$\mu_k \geq 0 \Leftrightarrow |\Lambda_j|_k|\mathrm{av}(\Lambda_j) - \sum_{i \in \Lambda_j|_k} \lambda_i \geq 0 \Leftrightarrow \mathrm{av}(\Lambda_j) \geq \mathrm{av}(\Lambda_j|_k)$$But, by assumption, this is true for all $1 \leq k \leq n-1$. So $P$ minimizes $\mathfrak{D}_\mathfrak{s}(\Lambda, -)$, and therefore maximizes $H^\Lambda(\mathfrak{U}(-))$, among credence functions $C$ with $c_1 \geq \ldots \geq c_n$, as required.
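To see the multipliers in a concrete case, here is a Python sketch that evaluates the formula for $\mu_k$ derived above. It assumes the Brier score, for which $\varphi(x) = x^2$ and so $\varphi''(x) = 2$, and it uses the example weights $(0.5, 0.1, 0.4)$ with candidate $P = (0.5, 0.25, 0.25)$. It also checks the standard complementary slackness condition ($\mu_k = 0$ wherever consecutive credences differ), which I am assuming is intended alongside the conditions stated above.

```python
from fractions import Fraction

def kkt_multipliers(weights, candidate, phi_second=lambda x: 2):
    """mu_k = sum_{i<=k} phi''(x_i)(x_i - lambda_i)
              - (k/n) * sum_{i<=n} phi''(x_i)(x_i - lambda_i), for k = 1..n-1."""
    n = len(weights)
    terms = [phi_second(x) * (x - lam) for x, lam in zip(candidate, weights)]
    total = sum(terms)
    return [sum(terms[:k]) - Fraction(k, n) * total for k in range(1, n)]

weights = [Fraction(1, 2), Fraction(1, 10), Fraction(2, 5)]
candidate = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

mus = kkt_multipliers(weights, candidate)
gaps = [candidate[k] - candidate[k + 1] for k in range(len(candidate) - 1)]

print(mus)                                                  # [Fraction(0, 1), Fraction(3, 10)]
print(all(mu >= 0 for mu in mus))                           # True
print(all(mu == 0 or gap == 0 for mu, gap in zip(mus, gaps)))  # True
```

Here $\mu_1 = 0$ at the boundary between the two blocks and $\mu_2 = 0.3$ inside the second block, where the two credences are equal.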

We now show that there is always a series of subsequences that satisfy (1), (2), (3) from above. We proceed by induction.

Base Case $n = 1$. Then it is clearly true with the subsequence $\Lambda_1 = \Lambda$.

Inductive Step Suppose it is true for all sequences $\Lambda = (\lambda_1, \ldots, \lambda_n)$ of length $n$. Now consider a sequence $(\lambda_1, \ldots, \lambda_n, \lambda_{n+1})$. Then, by the inductive hypothesis, there is a sequence of sequences $\Lambda_1, \ldots, \Lambda_m$ such that

$\Lambda \frown (\lambda_{n+1}) = \Lambda_1 \frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})$
$\mathrm{av}(\Lambda_1) \geq \ldots \geq \mathrm{av} (\Lambda_m)$
each $\Lambda_i$ does not exceed its average.

Now, first, suppose $\mathrm{av}(\Lambda_m) \geq \lambda_{n+1}$. Then let $\Lambda_{m+1} = (\lambda_{n+1})$ and we're done.

So, second, suppose $\mathrm{av}(\Lambda_m) < \lambda_{n+1}$. Then we find the greatest $k$ such that$$\mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1}))$$Then we let $\Lambda^*_{k+1} = \Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})$. Then we can show that

(1) $(\lambda_1, \ldots, \lambda_n, \lambda_{n+1}) = \Lambda_1 \frown \Lambda_2 \frown \ldots \frown \Lambda_k \frown \Lambda^*_{k+1}$.
(2) Each $\Lambda_1, \ldots, \Lambda_k, \Lambda^*_{k+1}$ does not exceed average.
(3) $\mathrm{av}(\Lambda_1) \geq \mathrm{av}(\Lambda_2) \geq \ldots \geq \mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda^*_{k+1})$.

(1) and (3) are obvious. So we prove (2). In particular, we show that $\Lambda^*_{k+1}$ does not exceed average. We assume that each subsequence $\Lambda_j$ starts with $\lambda_{i_j+1}$.

Suppose $i \in \Lambda_{k+1}$. Then, since $\Lambda_{k+1}$ does not exceed average, $$\mathrm{av}(\Lambda_{k+1}) \geq \mathrm{av}(\Lambda_{k+1}|_i)$$But, since $k$ is the greatest number such that$$\mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1}))$$We know that$$\mathrm{av}(\Lambda_{k+2}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1}|_i)$$
Suppose $i \in \Lambda_{k+2}$. Then, since $\Lambda_{k+2}$ does not exceed average, $$\mathrm{av}(\Lambda_{k+2}) \geq \mathrm{av}(\Lambda_{k+2}|_i)$$But, since $k$ is the greatest number such that$$\mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1}))$$We know that$$\mathrm{av}(\Lambda_{k+3}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+2})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+2}|_i)$$But also, from above,$$ \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1} \frown \Lambda_{k+2}|_i)$$
And so on.

This completes the proof. $\Box$
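As a sanity check on Theorem 1, and not a substitute for the proof, here is a brute-force Python sketch for three worlds. It assumes the Brier score as the epistemic utility measure, the example weights $(0.5, 0.1, 0.4)$, and a grid of probability functions in steps of $0.05$; all names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def brier_utility(credences, i):
    """Additive quadratic utility of the credence function at world i (0-indexed)."""
    return -sum(((1 if j == i else 0) - c) ** 2 for j, c in enumerate(credences))

def hurwicz_score(credences, weights):
    """Generalized Hurwicz score of the epistemic utilities, best outcome first."""
    utilities = sorted(
        (brier_utility(credences, i) for i in range(len(credences))), reverse=True
    )
    return sum(w * u for w, u in zip(weights, utilities))

weights = [Fraction(1, 2), Fraction(1, 10), Fraction(2, 5)]
steps = 20
grid = [
    (Fraction(a, steps), Fraction(b, steps), Fraction(steps - a - b, steps))
    for a, b in product(range(steps + 1), repeat=2)
    if a + b <= steps
]

best = max(hurwicz_score(c, weights) for c in grid)
maximizers = sorted(c for c in grid if hurwicz_score(c, weights) == best)

print(best)  # -5/8
for c in maximizers:
    print([str(x) for x in c])
```

On this grid the maximizers are exactly the three permutations of $(0.5, 0.25, 0.25)$, which is the credence function the theorem gives for these weights, since the weights decompose into the blocks $(0.5)$ and $(0.1, 0.4)$.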