Friday, 7 August 2020

Accuracy without Additivity

For a PDF of this post, see here.

One of the central arguments in accuracy-first epistemology -- the one that gets the project off the ground, I think -- is the accuracy-dominance argument for Probabilism. This started life in a more pragmatic guise in de Finetti's proof that, if your credences are not probabilistic, there are alternatives that would lose less than yours would if they were penalised using the Brier score, which levies a price of $(1-x)^2$ on every credence $x$ in a truth and $x^2$ on every credence $x$ in a falsehood. This was then adapted to an accuracy-based argument by Roger Rosenkrantz, who interpreted the Brier score as a measure of inaccuracy, not a penalty score. Interpreted thus, de Finetti's result says that any non-probabilistic credences are accuracy-dominated by some probabilistic credences. Jim Joyce then noted that this argument only establishes Probabilism if you have a further argument that inaccuracy should be measured by the Brier score. He thought there was no particular reason to think that's right, so he greatly generalized de Finetti's result to show that, relative to a much wider range of inaccuracy measures, all non-probabilistic credences are accuracy-dominated. One problem with this, which Al Hájek pointed out, was that he didn't give a converse argument -- that is, he didn't show that, for each of his inaccuracy measures, each probabilistic credence function is not accuracy-dominated. Joel Predd and his Princeton collaborators then addressed this concern and proved a very general result, namely, that for any additive, continuous, and strictly proper inaccuracy measure, any non-probabilistic credences are accuracy-dominated, while no probabilistic credences are.

That brings us to this blogpost. Additivity is a controversial claim. It says that the inaccuracy of a credence function is the (possibly weighted) sum of the inaccuracies of the credences it assigns. So the question arises: can we do without additivity? In this post, I'll give a quick proof of the accuracy-dominance argument that doesn't assume anything about the inaccuracy measures other than that they are continuous and strictly proper. Anyone familiar with the Predd, et al. paper will see that the proof strategy draws very heavily on theirs. But it bypasses the construction of the Bregman divergence that corresponds to the strictly proper inaccuracy measure. For that, you'll have to wait for Jason Konek's forthcoming work...

Suppose:
  • $\mathcal{F}$ is a set of propositions;
  • $\mathcal{W} = \{w_1, \ldots, w_n\}$ is the set of possible worlds relative to $\mathcal{F}$;
  • $\mathcal{C}$ is the set of credence functions on $\mathcal{F}$;
  • $\mathcal{P}$ is the set of probability functions on $\mathcal{F}$. For each world $w$, let $v_w$ be the omniscient credence function at $w$, which assigns 1 to the propositions true at $w$ and 0 to those false at $w$. Then, by de Finetti's theorem, $\mathcal{P} = \{v_w : w \in \mathcal{W}\}^+$, the closed convex hull of the omniscient credence functions. If $p$ is in $\mathcal{P}$, we write $p_i$ for $p(w_i)$.
Theorem Suppose $\mathfrak{I}$ is a continuous and strictly proper inaccuracy measure on the credence functions on $\mathcal{F}$. Then if $c$ is in $\mathcal{C}$ but not in $\mathcal{P}$, there is $c^\star$ in $\mathcal{P}$ such that, for all $w_i$ in $\mathcal{W}$,
$$
\mathfrak{I}(c^\star, w_i) < \mathfrak{I}(c, w_i)
$$

Proof. We begin by defining a divergence $\mathfrak{D} : \mathcal{P} \times \mathcal{C} \rightarrow [0, \infty]$ that takes a probability function $p$ and a credence function $c$ and measures the divergence from the former to the latter:
$$
\mathfrak{D}(p, c) = \sum_i p_i  \mathfrak{I}(c, w_i) - \sum_i p_i \mathfrak{I}(p, w_i)
$$
Three quick points about $\mathfrak{D}$.

(1) $\mathfrak{D}$ is a divergence. Since $\mathfrak{I}$ is strictly proper, $\mathfrak{D}(p, c) \geq 0$ with equality iff $c = p$.

(2) $\mathfrak{D}(v_{w_i}, c) = \mathfrak{I}(c, w_i) - \mathfrak{I}(v_{w_i}, w_i)$, for all $w_i$ in $\mathcal{W}$. (If, as is usual, $\mathfrak{I}$ gives the omniscient credence function inaccuracy $0$ at its own world, the second term vanishes; either way, it is a constant that cancels in the final step of the proof below.)

(3) $\mathfrak{D}$ is strictly convex in its first argument.  Suppose $p$ and $q$ are distinct elements of $\mathcal{P}$, and suppose $0 < \lambda < 1$. Then let $r = \lambda p + (1-\lambda) q$. Then, since $\sum_i p_i\mathfrak{I}(c, w_i)$ is uniquely minimized, as a function of $c$, at $c = p$, and $\sum_i q_i\mathfrak{I}(c, w_i)$ is uniquely minimized, as a function of $c$, at $c = q$, we have$$\begin{eqnarray*}
\sum_i p_i \mathfrak{I}(p, w_i) & < & \sum_i p_i \mathfrak{I}(r, w_i) \\
\sum_i q_i \mathfrak{I}(q, w_i) & < & \sum_i q_i \mathfrak{I}(r, w_i)
\end{eqnarray*}$$Thus

$$\begin{eqnarray*}
& & \lambda [-\sum_i p_i \mathfrak{I}(p, w_i)] + (1-\lambda) [-\sum_i q_i \mathfrak{I}(q, w_i)] \\
& > & \lambda [-\sum_i p_i \mathfrak{I}(r, w_i)] + (1-\lambda) [-\sum_i q_i \mathfrak{I}(r, w_i)] \\
& = & -\sum_i r_i \mathfrak{I}(r, w_i)
\end{eqnarray*}$$

Now, adding$$\lambda \sum_i p_i \mathfrak{I}(c, w_i) + (1-\lambda)\sum_i q_i\mathfrak{I}(c, w_i) = \sum_i (\lambda p_i + (1-\lambda)q_i) \mathfrak{I}(c, w_i) = \sum_i r_i \mathfrak{I}(c, w_i)$$to both sides gives

$$\begin{eqnarray*}
& & \lambda [\sum_i p_i \mathfrak{I}(c, w_i)-\sum_i p_i \mathfrak{I}(p, w_i)] + (1-\lambda) [\sum_i q_i\mathfrak{I}(c, w_i)-\sum_i q_i \mathfrak{I}(q, w_i)] \\
& > & \sum_i r_i \mathfrak{I}(c, w_i)-\sum_i r_i \mathfrak{I}(r, w_i)
\end{eqnarray*}$$

That is,$$\lambda \mathfrak{D}(p, c) + (1-\lambda) \mathfrak{D}(q, c) > \mathfrak{D}(\lambda p + (1-\lambda)q, c)$$as required.

Now, suppose $c$ is not in $\mathcal{P}$. Then, since $\mathcal{P}$ is a closed convex set and $\mathfrak{D}(x, c)$ is continuous and, by (3), strictly convex in $x$, there is a unique $c^\star$ in $\mathcal{P}$ that minimizes $\mathfrak{D}(x, c)$ as a function of $x$. Now, suppose $p$ is in $\mathcal{P}$. We wish to show that$$\mathfrak{D}(p, c) \geq \mathfrak{D}(p, c^\star) + \mathfrak{D}(c^\star, c)$$We can see that this holds iff$$\sum_i (p_i - c^\star_i) (\mathfrak{I}(c, w_i) - \mathfrak{I}(c^\star, w_i)) \geq 0$$After all,
$$\begin{eqnarray*}
& & \mathfrak{D}(p, c) - \mathfrak{D}(p, c^\star) - \mathfrak{D}(c^\star, c) \\
& = & [\sum_i p_i \mathfrak{I}(c, w_i) - \sum_i p_i \mathfrak{I}(p, w_i)] - \\
&& [\sum_i p_i \mathfrak{I}(c^\star, w_i) - \sum_i p_i \mathfrak{I}(p, w_i)] - \\
&& [\sum_i c^\star_i \mathfrak{I}(c, w_i) - \sum_i c^\star_i \mathfrak{I}(c^\star, w_i)] \\
& = & \sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(c^\star, w_i))
\end{eqnarray*}$$
Now we prove this inequality. If $p = c^\star$, it holds trivially, so suppose $p \neq c^\star$. We begin by observing that, since $p$, $c^\star$ are in $\mathcal{P}$, since $\mathcal{P}$ is convex, and since $\mathfrak{D}(x, c)$ is minimized uniquely at $x = c^\star$, if $0 < \varepsilon < 1$, then$$\frac{1}{\varepsilon}[\mathfrak{D}(\varepsilon p + (1-\varepsilon) c^\star, c) - \mathfrak{D}(c^\star, c)] > 0$$Expanding that, we get

$$\frac{1}{\varepsilon}[\sum_i (\varepsilon p_i + (1- \varepsilon) c^\star_i)\mathfrak{I}(c, w_i) - \sum_i (\varepsilon p_i + (1-\varepsilon)c^\star_i)\mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i) - \sum_i c^\star_i\mathfrak{I}(c, w_i) + \sum_i c^\star_i \mathfrak{I}(c^\star, w_i)] > 0$$

So$$\frac{1}{\varepsilon}[\sum_i (c^\star_i + \varepsilon(p_i - c^\star_i))\mathfrak{I}(c, w_i) - \sum_i (c^\star_i + \varepsilon(p_i-c^\star_i))\mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i) - \sum_i c^\star_i\mathfrak{I}(c, w_i) + \sum_i c^\star_i \mathfrak{I}(c^\star, w_i)] > 0$$

So$$\sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i)) + \frac{1}{\varepsilon}[\sum_i c^\star_i \mathfrak{I}(c^\star, w_i) - \sum_i c^\star_i \mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i)] > 0$$

Now, since $\mathfrak{I}$ is strictly proper,
$$\frac{1}{\varepsilon}[\sum_i c^\star_i \mathfrak{I}(c^\star, w_i) - \sum_ic^\star_i \mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i)] < 0$$
So, for all $0 < \varepsilon < 1$,$$\sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(\varepsilon p+ (1-\varepsilon) c^\star, w_i)) > 0$$
So, letting $\varepsilon \rightarrow 0$, since $\mathfrak{I}$ is continuous,$$\sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(c^\star, w_i)) \geq 0$$which is what we wanted to show. So, by the above,$$\mathfrak{D}(p,c) \geq \mathfrak{D}(p, c^\star) + \mathfrak{D}(c^\star, c)$$In particular, since each $v_{w_i}$ is in $\mathcal{P}$,$$\mathfrak{D}(v_{w_i}, c) \geq \mathfrak{D}(v_{w_i}, c^\star) + \mathfrak{D}(c^\star, c)$$But, since $c^\star$ is in $\mathcal{P}$ and $c$ is not, and since $\mathfrak{D}$ is a divergence, $\mathfrak{D}(c^\star, c) > 0$. So, by (2),$$\mathfrak{I}(c, w_i) - \mathfrak{I}(v_{w_i}, w_i) = \mathfrak{D}(v_{w_i}, c) > \mathfrak{D}(v_{w_i}, c^\star) = \mathfrak{I}(c^\star, w_i) - \mathfrak{I}(v_{w_i}, w_i)$$and so $\mathfrak{I}(c, w_i) > \mathfrak{I}(c^\star, w_i)$, as required. $\Box$
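To see the theorem in action, here is a small numerical sketch (my own illustration, not part of the proof). It uses the Brier score, which is continuous and strictly proper, takes a non-probabilistic credence function $c$ over three worlds, approximates the $c^\star$ minimizing $\mathfrak{D}(x, c)$ by a grid search over the simplex, and checks that $c^\star$ accuracy-dominates $c$. The helper names and the particular $c$ are just assumptions for the example.

```python
import numpy as np

def brier(c, i):
    """Brier inaccuracy of credences c at world w_i: (1 - c_i)^2 plus sum of c_j^2, j != i."""
    v = np.zeros(len(c)); v[i] = 1.0
    return float(np.sum((v - c) ** 2))

def D(p, c):
    """D(p, c) = sum_i p_i I(c, w_i) - sum_i p_i I(p, w_i)."""
    return sum(p[i] * (brier(c, i) - brier(p, i)) for i in range(len(p)))

c = np.array([0.2, 0.3, 0.1])   # non-probabilistic: credences sum to 0.6

# approximate c*, the probability function minimizing D(x, c), by grid search
grid = np.linspace(0, 1, 201)
simplex = [np.array([x, y, 1 - x - y]) for x in grid for y in grid if x + y <= 1 + 1e-9]
c_star = min(simplex, key=lambda p: D(p, c))

print("c* ~", np.round(c_star, 3))   # close to (1/3, 13/30, 7/30)
for i in range(3):
    print(f"I(c*, w_{i}) = {brier(c_star, i):.3f} < I(c, w_{i}) = {brier(c, i):.3f}")
```

For the Brier score, $\mathfrak{D}$ turns out to be squared Euclidean distance, so $c^\star$ is just the Euclidean projection of $c$ into the simplex; but the recipe above relies only on the divergence as defined in the proof.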




Sunday, 26 July 2020

Decomposing Bregman divergences

For a PDF of this post, see here.

Here are a couple of neat little results about Bregman divergences that I just happened upon. They might help to prove some more decomposition theorems along the lines of this classic result by Morris DeGroot and Stephen Fienberg and, more recently, this paper by my colleagues in the computer science department at Bristol. I should say that a lot is known about Bregman divergences because of their role in information geometry, so these results are almost certainly known already, but I don't know where.

Refresher on Bregman divergences


First up, what's a divergence? It's essentially a generalization of the notion of a measure of distance from one point to another. The points live in some closed convex subset $\mathcal{X} \subseteq \mathbb{R}^n$. A divergence is a function $D : \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty]$ such that
  • $D(x, y) \geq 0$, for all $x$, $y$ in $\mathcal{X}$, and
  • $D(x, y) = 0$ iff $x = y$.
Note: We do not assume that a divergence is symmetric. So the distance from $x$ to $y$ need not be the same as the distance from $y$ to $x$. That is, we do not assume $D(x, y) = D(y, x)$ for all $x$, $y$ in $\mathcal{X}$. Indeed, among the family of divergences that we will consider -- the Bregman divergences -- only one is symmetric -- the squared Euclidean distance. And we do not assume the triangle inequality. That is, we don't assume that the divergence from $x$ to $z$ is at most the sum of the divergence from $x$ to $y$ and the divergence from $y$ to $z$. That is, we do not assume $D(x, z) \leq D(x, y) + D(y, z)$. Indeed, the conditions under which $D(x, z) = D(x, y) + D(y, z)$ for a Bregman divergence $D$ will be our concern here.

So, what's a Bregman divergence? $D : \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty]$ is a Bregman divergence if there is a strictly convex function $\Phi : \mathcal{X} \rightarrow \mathbb{R}$ that is differentiable on the interior of $\mathcal{X}$ such that$$D(x, y) = \Phi(x) - \Phi(y) - \nabla \Phi(y) (x-y)$$In other words, to find the divergence from $x$ to $y$, you go to $y$, find the tangent to $\Phi$ at $y$. Then hop over to $x$ and subtract the value at $x$ of the tangent you just drew at $y$ from the value at $x$ of $\Phi$. That is, you subtract $\nabla \Phi(y) (x-y) + \Phi(y)$ from $\Phi(x)$. Because $\Phi$ is convex, it is always curving away from the tangent, and so $\nabla \Phi(y) (x-y) + \Phi(y)$, the value at $x$ of the tangent you drew at $y$, is always less than $\Phi(x)$, the value at $x$ of $\Phi$.

The two most famous Bregman divergences are:
  • Squared Euclidean distance. Let $\Phi(x) = ||x||^2 = \sum_i x_i^2$, in which case$$D(x, y) = ||x-y||^2 = \sum_i (x_i - y_i)^2$$
  • Generalized Kullback-Leibler divergence. Let $\Phi(x) = \sum_i x_i \log x_i$, in which case$$D(x, y) = \sum_i \left(x_i\log\frac{x_i}{y_i} - x_i + y_i\right)$$
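As a quick sanity check on these two examples, here's a minimal Python sketch (my own; `bregman`, `phi`, and `grad_phi` are just illustrative names) that computes a Bregman divergence straight from the definition and confirms both closed forms:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_Phi(x, y) = Phi(x) - Phi(y) - <grad Phi(y), x - y>."""
    return phi(x) - phi(y) - float(np.dot(grad_phi(y), x - y))

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.4, 0.2])

# squared Euclidean distance, from Phi(x) = ||x||^2
sq = bregman(lambda v: float(np.dot(v, v)), lambda v: 2 * v, x, y)
print(np.isclose(sq, np.sum((x - y) ** 2)))                  # True

# generalized Kullback-Leibler, from Phi(x) = sum_i x_i log x_i
gkl = bregman(lambda v: float(np.sum(v * np.log(v))),
              lambda v: np.log(v) + 1, x, y)
print(np.isclose(gkl, np.sum(x * np.log(x / y) - x + y)))    # True
```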
Bregman divergences are strictly convex in the first argument, since $\Phi$ is strictly convex and the remaining terms are linear in $x$. Thus, for $z$ in $\mathcal{X}$ and a closed convex subset $C \subseteq \mathcal{X}$, we can define the $D$-projection of $z$ into $C$ to be the unique point $\pi_{z, C}$ in $C$ at which $D(y, z)$ is minimized as a function of $y$. Now, we have the following theorem about Bregman divergences, due to Imre Csiszár:

Theorem (Generalized Pythagorean Theorem) If $C \subseteq \mathcal{X}$ is closed and convex, $z$ is in $\mathcal{X}$, and $x$ is in $C$, then$$D(x, \pi_{z, C}) + D(\pi_{z, C}, z) \leq D(x, z)$$

Decomposing Bregman divergences


This invites the question: when does equality hold? The following result gives a particular class of cases, and in doing so provides us with a recipe for creating decompositions of Bregman divergences into their component parts. Essentially, it says that the above inequality is an equality if $C$ is a hyperplane in $\mathbb{R}^n$.

Theorem 1  Suppose $r$ is in $\mathbb{R}$ and $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\sum_i \alpha_i = 1$. Let $C := \{(x_1, \ldots, x_n) : \sum_i \alpha_ix_i = r\}$. Then if $z$ is in $\mathcal{X}$ and $x$ is in $C$,$$D_\Phi(x, z) = D_\Phi(x, \pi_{z, C}) + D_\Phi(\pi_{z, C}, z)$$

Proof of Theorem 1.  We begin by showing:

Lemma 1 For any $x$, $y$, $z$ in $\mathcal{X}$,$$D_\Phi(x, z) = D_\Phi(x, y) + D_\Phi(y, z) \Leftrightarrow (\nabla \Phi(y) - \nabla \Phi(z))(x-y) = 0$$

Proof of Lemma 1.  $$D_\Phi(x, z) = D_\Phi(x, y) + D_\Phi(y, z)$$iff$$\Phi(x) - \Phi(z) - \nabla\Phi(z)(x-z) = \Phi(x) - \Phi(y) - \nabla\Phi(y)(x-y) + \Phi(y) - \Phi(z) - \nabla\Phi(z)(y-z)$$iff$$(\nabla \Phi(y) - \nabla \Phi(z))(x-y) = 0$$since $(x-z) - (y-z) = x - y$, as required.

Return to Proof of Theorem 1. Now we show that if $x$ is in $C$, then$$(\nabla \Phi(\pi_{z, C}) - \nabla \Phi(z))(x-\pi_{z, C}) = 0$$We know that $D(y, z)$ is minimized on $C$, as a function of $y$, at $y = \pi_{z, C}$. Thus, let $y = \pi_{z, C}$. And let $h(x) := \sum_i \alpha_ix_i - r$. Then $\frac{\partial}{\partial x_i} h(x) = \alpha_i$. So, by the KKT conditions, there is $\lambda$ such that$$\nabla \Phi(y) - \nabla \Phi(z) + (\lambda \alpha_1, \ldots, \lambda \alpha_n) = (0, \ldots, 0)$$Thus,$$\frac{\partial}{\partial y_i} \Phi(y) - \frac{\partial}{\partial z_i} \Phi(z) = -\lambda \alpha_i$$for all $i = 1, \ldots, n$.

Thus, finally, 
\begin{eqnarray*}
& &(\nabla \Phi(y) - \nabla \Phi(z))(x-y) \\
& = & \sum_i \left (\frac{\partial}{\partial y_i} \Phi(y) - \frac{\partial}{\partial z_i} \Phi(z)\right )(x_i-y_i) \\
& = &  \sum_i (-\lambda \alpha_i) (x_i - y_i) \\
& = & -\lambda \left (\sum_i \alpha_i x_i - \sum_i \alpha_i y_i\right ) \\
& = & -\lambda (r-r) \\
& = & 0
\end{eqnarray*}
as required. $\Box$
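Here's a small numerical check of Theorem 1 (my own sketch, with arbitrary toy values): we project $z$ into a plane $C$ under the generalized Kullback-Leibler divergence by grid search and confirm that the Pythagorean inequality holds with equality for a point $x$ in $C$.

```python
import numpy as np

def gkl(x, y):
    """Generalized Kullback-Leibler divergence."""
    return float(np.sum(x * np.log(x / y) - x + y))

z = np.array([0.3, 0.5])
# C = {(x1, x2) : 0.5*x1 + 0.5*x2 = 0.6}, parametrized as (t, 1.2 - t)
ts = np.linspace(0.01, 1.19, 11801)
plane = [np.array([t, 1.2 - t]) for t in ts]

# the D-projection of z into C: the point of C minimizing D(y, z)
proj = min(plane, key=lambda y: gkl(y, z))
print(np.round(proj, 3))                       # (0.45, 0.75), i.e. 1.5 * z

x = np.array([0.9, 0.3])                       # another point of C
print(gkl(x, z), gkl(x, proj) + gkl(proj, z))  # equal, up to grid error
```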

Theorem 2  Suppose $1 \leq k \leq n$. Let $C := \{(x_1, \ldots, x_n) : x_1 = x_2 = \ldots = x_k\}$. Then if $z$ is in $\mathcal{X}$ and $x$ is in $C$,$$D_\Phi(x, z) = D_\Phi(x, \pi_{z, C}) + D_\Phi(\pi_{z, C}, z)$$

Proof of Theorem 2. We know that $D(y, z)$ is minimized on $C$, as a function of $y$, at $y = \pi_{z, C}$. Thus, let $y = \pi_{z, C}$. And let $h_i(x) := x_{i+1} - x_i$, for $i = 1, \ldots, k-1$. Then$$\frac{\partial}{\partial x_j} h_i(x) = \left \{ \begin{array}{ll} 1 & \mbox{if } i+1 = j \\ -1 & \mbox{if } i = j \\ 0 & \mbox{otherwise}\end{array} \right.$$ So, by the KKT conditions, there are $\lambda_1, \ldots, \lambda_{k-1}$ such that

$$\nabla \Phi(y) - \nabla \Phi(z) + (-\lambda_1, \lambda_1, 0, \ldots, 0) + (0, -\lambda_2, \lambda_2, 0, \ldots, 0) + \ldots + (0, \ldots, 0, -\lambda_{k-1}, \lambda_{k-1}, 0, \ldots, 0) = (0, \ldots, 0)$$

Thus,$$\begin{eqnarray*}\frac{\partial}{\partial y_1} \Phi(y) - \frac{\partial}{\partial z_1} \Phi(z) & = & - \lambda_1 \\ \frac{\partial}{\partial y_2} \Phi(y) - \frac{\partial}{\partial z_2} \Phi(z) & = & \lambda_1 - \lambda_2 \\ \vdots & \vdots & \vdots \\ \frac{\partial}{\partial y_{k-1}} \Phi(y) - \frac{\partial}{\partial z_{k-1}} \Phi(z) & = & \lambda_{k-2}- \lambda_{k-1} \\ \frac{\partial}{\partial y_k} \Phi(y) - \frac{\partial}{\partial z_k} \Phi(z) & = & \lambda_{k-1} \\ \frac{\partial}{\partial y_{k+1}} \Phi(y) - \frac{\partial}{\partial z_{k+1}} \Phi(z) & = & 0 \\ \vdots & \vdots & \vdots \\ \frac{\partial}{\partial y_n} \Phi(y) - \frac{\partial}{\partial z_n} \Phi(z) & = & 0 \end{eqnarray*}$$

Thus, finally, 
\begin{eqnarray*}
& &(\nabla \Phi(y) - \nabla \Phi(z))(x-y) \\
& = & \sum_i \left (\frac{\partial}{\partial y_i} \Phi(y) - \frac{\partial}{\partial z_i} \Phi(z)\right )(x_i-y_i) \\
& = & -\lambda_1(x_1-y_1) + (\lambda_1 - \lambda_2)(x_2-y_2) + \ldots \\
&& + (\lambda_{k-2} - \lambda_{k-1})(x_{k-1}-y_{k-1}) + \lambda_{k-1}(x_k-y_k) \\
&& + 0(x_{k+1} - y_{k+1}) + \ldots + 0 (x_n - y_n) \\
& = & \sum^{k-1}_{i=1} \lambda_i (x_{i+1} - x_i) + \sum^{k-1}_{i=1} \lambda_i (y_i - y_{i+1})\\
& = & 0
\end{eqnarray*}
as required, since $x$ and $y$ are both in $C$, which gives $x_{i+1} - x_i = 0$ and $y_i - y_{i+1} = 0$ for $i = 1, \ldots, k-1$. $\Box$

DeGroot and Fienberg's calibration and refinement decomposition


To obtain these two decomposition results, we needed to assume nothing more than that $D$ is a Bregman divergence. The classic result by DeGroot and Fienberg requires a little more. We can see this by considering a very special case of it. Suppose $(X_1, \ldots, X_n)$ is a sequence of propositions that forms a partition. And suppose $w$ is a possible world. Then we can represent $w$ as the vector $w = (0, \ldots, 0, 1, 0, \ldots, 0)$, which takes value 1 at the proposition that is true in $w$ and 0 everywhere else. Now suppose $c = (c, \ldots, c)$ is an assignment of the same credence to each proposition. Then one very particular case of DeGroot and Fienberg's result says that, if $(0, \ldots, 0, 1, 0, \ldots, 0)$ is the world at which $X_i$ is true, then

$$D((0, \ldots, 0, 1, 0, \ldots, 0), (c, \ldots, c)) = D((0, \ldots, 0, 1, 0, \ldots, 0), (\tfrac{1}{n}, \ldots, \tfrac{1}{n})) + D((\tfrac{1}{n}, \ldots, \tfrac{1}{n}), (c, \ldots, c))$$

Now, we know from Lemma 1 that this is true iff$$(\nabla \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \nabla \Phi(c, \ldots, c))((0, \ldots, 0, 1, 0, \ldots, 0) - (\frac{1}{n}, \ldots, \frac{1}{n})) = 0$$which is true iff

$$\frac{\partial}{\partial x_i} \Phi(\tfrac{1}{n}, \ldots, \tfrac{1}{n}) - \frac{\partial}{\partial x_i} \Phi(c, \ldots, c) = \frac{1}{n} \sum^n_{j=1} \left ( \frac{\partial}{\partial x_j} \Phi(\tfrac{1}{n}, \ldots, \tfrac{1}{n}) - \frac{\partial}{\partial x_j} \Phi(c, \ldots, c) \right )$$

and that is true iff

$$\frac{\partial}{\partial x_i} \Phi(\tfrac{1}{n}, \ldots, \tfrac{1}{n}) - \frac{\partial}{\partial x_i} \Phi(c, \ldots, c) = \frac{\partial}{\partial x_j} \Phi(\tfrac{1}{n}, \ldots, \tfrac{1}{n}) - \frac{\partial}{\partial x_j} \Phi(c, \ldots, c)$$

for all $1 \leq i, j \leq n$, which is true iff, for any $x$ and any $1 \leq i, j \leq n$,$$\frac{\partial}{\partial x_i} \Phi(x, \ldots, x) = \frac{\partial}{\partial x_j} \Phi(x, \ldots, x)$$Now, this is true if $\Phi(x_1, \ldots, x_n) = \sum^n_{i=1} \varphi(x_i)$ for some $\varphi$. That is, it is true if $D$ is an additive Bregman divergence. But it is also true for certain non-additive Bregman divergences, such as the one generated from the log-sum-exp function:

Definition (log-sum-exp) Suppose $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\sum^n_{i=1} \alpha_i = 1$. Then let $$\Phi^A(x_1, \ldots, x_n) = \log(1 + \alpha_1e^{x_1} + \ldots + \alpha_ne^{x_n})$$Then
$$D(x, y) = \log (1 + \sum_i \alpha_ie^{x_i}) - \log(1 + \sum_i \alpha_ie^{y_i}) - \sum_k \frac{\alpha_k(x_k - y_k)e^{y_k}}{1 + \sum_i \alpha_ie^{y_i}}$$

Now$$\frac{\partial}{\partial x_i} \Phi^A(x_1, \ldots, x_n) = \frac{\alpha_i e^{x_i}}{1 + \alpha_1 e^{x_1} + \ldots + \alpha_ne^{x_n}}$$So, if $\alpha_i = \alpha_j$ for all $1 \leq i, j \leq n$, then$$\frac{\partial}{\partial x_i} \Phi^A(x, \ldots, x) = \frac{\alpha e^x}{1 + e^x} = \frac{\partial}{\partial x_j} \Phi^A(x, \ldots, x)$$But if $\alpha_i \neq \alpha_j$ for some $1 \leq i, j \leq n$, then$$\frac{\partial}{\partial x_i} \Phi^A(x, \ldots, x) = \frac{\alpha_ie^x}{1 + e^x} \neq \frac{\alpha_je^x}{1 + e^x} = \frac{\partial}{\partial x_j} \Phi^A(x, \ldots, x)$$

And indeed, the result can fail even for a semi-additive Bregman divergence -- that is, one whose generator has the form $\Phi(x) = \sum^n_{i=1} \phi_i(x_i)$ for convex functions $\phi_1, \ldots, \phi_n$ that are not all identical. For instance, suppose $\phi_1(x) = x^2$ and $\phi_2(x) = x\log x$ and $\Phi(x, y) = \phi_1(x) +  \phi_2(y) = x^2 + y\log y$. Then$$\frac{\partial}{\partial x_1} \Phi(x, x) = 2x \neq 1 + \log x = \frac{\partial}{\partial x_2} \Phi(x, x)$$
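Here's a numerical check of these last two claims (my own sketch, with $n = 2$ and the constant credence $c = 0.3$ as assumed toy values): the decomposition gap $D(w, c) - [D(w, u) + D(u, c)]$, for $w$ the world vector, $u$ the uniform vector, and $c$ the constant vector, vanishes for log-sum-exp with equal weights but not with unequal weights, and not for the semi-additive generator either.

```python
import numpy as np
from scipy.special import xlogy   # xlogy(0, 0) = 0, which handles the world vector

def bregman(phi, grad_phi, x, y):
    return phi(x) - phi(y) - float(np.dot(grad_phi(y), x - y))

def gap(phi, grad_phi, n=2, c=0.3):
    """D(w, c) - [D(w, u) + D(u, c)] for w = (1, 0, ...), u uniform, c constant."""
    w = np.zeros(n); w[0] = 1.0
    u = np.full(n, 1 / n)
    cc = np.full(n, c)
    return (bregman(phi, grad_phi, w, cc)
            - bregman(phi, grad_phi, w, u) - bregman(phi, grad_phi, u, cc))

def lse(alpha):
    """Log-sum-exp generator and its gradient, for weights alpha."""
    phi = lambda v: float(np.log(1 + np.sum(alpha * np.exp(v))))
    grad = lambda v: alpha * np.exp(v) / (1 + np.sum(alpha * np.exp(v)))
    return phi, grad

print(gap(*lse(np.array([0.5, 0.5]))))   # 0.0: the decomposition holds
print(gap(*lse(np.array([0.8, 0.2]))))   # about 0.014: it fails

# semi-additive generator Phi(x, y) = x^2 + y log y
phi = lambda v: v[0] ** 2 + float(xlogy(v[1], v[1]))
grad = lambda v: np.array([2 * v[0], np.log(v[1]) + 1.0])
print(gap(phi, grad))                    # about -0.055: it fails
```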

Proving the Generalized Pythagorean Theorem


In this section, I really just spell out in more detail the proof that Predd, et al. give of the Generalized Pythagorean Theorem, which is their Proposition 3. But that proof contains some important general facts that might be helpful for people working with Bregman divergences. I collect these together here into one lemma.

Lemma 2  Suppose $D$ is a Bregman divergence generated from $\Phi$. And suppose $x, y, z \in \mathcal{X}$. Then$$\begin{eqnarray*} & & D(x, z) - [D(x, y) + D(y, z)] \\ & = & (\nabla \Phi(y) - \nabla \Phi(z))(x - y) \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [D(y + \varepsilon (x - y), z) - D(y, z)] \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z)] \end{eqnarray*}$$

We can then prove the Generalized Pythagorean Theorem easily. Suppose $x$ is in a closed convex set $C$ and $y$ is the point in $C$ that minimizes $D(y, z)$ as a function of $y$. Then, for all $0 \leq \varepsilon \leq 1$, $\varepsilon x + (1-\varepsilon)y$ is in $C$. And since $y$ minimizes,$$D(\varepsilon x + (1-\varepsilon)y, z) \geq D(y, z)$$So $D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z) \geq 0$. So$$\lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon}[D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z)] \geq 0$$So, by Lemma 2,$$D(x, z) \geq D(x, y) + D(y, z)$$

Proof of Lemma 2.  $$\begin{eqnarray*} && \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z)] \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [(\Phi(\varepsilon x + (1-\varepsilon)y) - \Phi(z) - \nabla \Phi(z)(\varepsilon x + (1-\varepsilon)y - z)) - (\Phi(y) - \Phi(z) - \nabla\Phi(z)(y-z))]\\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [\Phi(\varepsilon x + (1-\varepsilon)y) - \Phi(y) - \varepsilon\nabla \Phi(z)(x -y)] \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [\Phi(\varepsilon x + (1-\varepsilon)y) - \Phi(y)]  - \nabla \Phi(z)(x -y) \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [\Phi(y + \varepsilon (x -y)) - \Phi(y)]  - \nabla \Phi(z)(x -y) \\ & = & \nabla \Phi(y)(x -y) - \nabla \Phi(z)(x -y) \\ & = & (\nabla \Phi(y) - \nabla \Phi(z))(x -y)\end{eqnarray*}$$

Thursday, 23 July 2020

Epistemic risk and permissive rationality (part I): an overview

I got interested in epistemic risk again, after a hiatus of four or five years, by thinking about the debate in epistemology between permissivists and impermissivists about epistemic rationality. Roughly speaking, according to the impermissivist, every body of evidence you might obtain mandates a unique rational set of attitudes in response --- this is sometimes called the uniqueness thesis. According to the permissivist, there is evidence you might obtain that doesn't mandate a unique rational set of attitudes in response --- there are, instead, multiple rational responses.

I want to argue for permissivism. And I want to do it by appealing to the sorts of claims about how to set your priors and posteriors that I've been developing over this series of blogposts (here and here). In the first of those blogposts, I argued that we should pick priors using a decision rule called the generalized Hurwicz criterion (GHC). That is, we should see our choice of priors as a decision we must make; and we should make that decision using a particular decision rule -- namely, GHC -- where we take the available acts to be the different possible credence functions and the utility of an act at a world to be a measure of its accuracy at that world.

Now, GHC is, in fact, not a single decision rule, but a family of rules, each specified by some parameters that I call the Hurwicz weights. These encode different attitudes to risk -- they specify the weight you assign to the best-case scenario, the weight you assign to the second-best, and so on down to the weight you assign to the second-worst scenario, and the weight you assign to the worst. And, what's more, many different attitudes to risk are permissible; and therefore many different Hurwicz weights are permissible; and so many versions of GHC are legitimate decision rules to adopt when picking priors. So different permissible attitudes to risk determine different Hurwicz weights; and different Hurwicz weights mandate different rational priors; and different rational priors mandate different rational posteriors given the same evidence. Epistemic rationality, therefore, is permissive. That's the argument in brief.

With this post, I'd like to start a series of posts in which I explore how this view plays out in the permissivism debate. If there are many different rationally permissible responses to the same piece of evidence because there are many different rationally permissible attitudes to risk, how does that allow us to answer the various objections to permissivism that have been raised?

In this post, I want to do four things: first, run through a taxonomy of varieties of permissivism that slightly expands on one due to Elizabeth Jackson; second, explain my motivation for offering this argument for permissivism; third, discuss an earlier risk-based argument for the position due to Thomas Kelly; finally, situate within Jackson's taxonomy the version of permissivism that follows from my own risk-based approach to setting priors and posteriors.

Varieties of permissivism


Let's start with the taxonomy of permissivisms. I suspect it's not complete; there are likely other dimensions along which permissivists will differ. But it's quite useful for our purposes.

First, there are different versions of permissivism for different sorts of doxastic attitudes we might have in response to evidence. So there are versions for credences, beliefs, imprecise probabilities, comparative confidences, ranking functions, and so on. For instance, on the credal version of permissivism, there is evidence that doesn't determine a unique set of credences that rationality requires us to have in response to that evidence. For many different sorts of doxastic attitude, you can be permissive with respect to one but not the other: permissive about beliefs but not about credences, for instance, or vice versa.

Second, permissivism comes in interpersonal and intrapersonal versions. According to interpersonal permissivism, it is possible for different individuals to have the same evidence, but different attitudes in response, and yet both be rational. According to the intrapersonal version, there is evidence a single individual might have, and different sets of attitudes such that whichever they have, they'll still be rational. Most people who hold to intrapersonal permissivism for a certain sort of doxastic attitude also hold to the interpersonal version, but there are many who think intrapersonal permissivism is mistaken but interpersonal permissivism is correct.

Third, it comes in wide and narrow versions. This is determined by how many different attitudes are permitted in response to a piece of evidence, and how much variation there is between them. On narrow versions, there are not so many different rational responses and they do not vary too widely; on wide versions, there are many and they vary greatly.

Fourth, it comes in common and rare versions. On the first, most evidence is permissive; on the latter, permissive evidence is rare.

I'll end up defending two versions of permissivism: (i) a wide common version of interpersonal permissivism about credences; and (ii) a narrow common version of intrapersonal permissivism about credences.

Why argue for permissivism?


Well, because it's true, mainly. But there's another motivation for adding to the already crowded marketplace of arguments for the position. Many philosophers defend permissivism for negative reasons. They look at two very different sorts of evidence and give reasons to be pessimistic about the prospects of identifying a unique rational credal response to them. They are: very sparse evidence and very complex evidence. In the first, they say, our evidence constrains us too little. There are too many credal states that respect it. If there is a single credal response that rationality mandates to this sparse evidence, there must be some way to whittle down the vast set of states that respect it and leave us with only one. For instance, some philosophers claim that, among this vast set of states, we should pick the one that has lowest informational content, since any other will go beyond what is warranted by the evidence. But it has proven extremely difficult to identify that credal state in many cases, such as von Mises' water-wine example, Bertrand's paradox, and van Fraassen's cube factory. Despairing of finding a way to pick a single credal state from this vast range, many philosophers have become permissivist. In the second sort of case, at the other extreme, where our evidence is very complex rather than very sparse, it points in too many directions at once. In such cases, you might hope to identify a unique way in which to weigh the different sources of evidence and the direction in which they point to give the unique credal state that rationality mandates. And yet again, it has proven difficult to find a principled way of assigning these weights. Despairing, philosophers have become permissivist in these cases too.

I'd like to give a positive motivation for permissivism---one that doesn't motivate it by pointing to the difficulty of establishing its negation. My account will be based within accuracy-first epistemology, and it will depend crucially on the notion of epistemic risk. Rationality permits a variety of attitudes to risk in the practical sphere. Faced with the same risky choice, you might be willing to gamble because you are risk-seeking, and I might be unwilling because I am risk-averse, but we are both rational and neither more rational than the other. On my account, rationality also permits different attitudes to risk in the epistemic sphere. And different attitudes to epistemic risk warrant different credal attitudes in response to a body of evidence. Therefore, permissivism.

Epistemic risk encoded in epistemic utility


It is worth noting that this is not the first time that the notion of epistemic risk has entered the permissivism debate. In an early paper on the topic, Thomas Kelly appeals to William James' distinction between the two goals that we have when we have beliefs---believing truths and avoiding errors. When we have a belief, it gives us a chance of being right, but it also runs the risk of being wrong. In contrast, when we withhold judgment on a proposition, we run no risk of being wrong, but we give ourselves no chance of being right. Kelly then notes that whether you should believe on the basis of some evidence depends on how strongly you want to believe truths and how strongly you don't want to believe falsehoods. Using an epistemic utility framework introduced independently by Kenny Easwaran and Kevin Dorst, we can make this precise. Suppose:
  1. I assign a positive epistemic utility of $R > 0$ to believing a truth or disbelieving a falsehood;
  2. I assign a negative epistemic utility (or positive epistemic disutility) of $-W < 0$ to believing a falsehood or disbelieving a truth; and
  3. I assign a neutral epistemic utility of 0 to withholding judgment.
And suppose $W > R$. And suppose further that there is some way to measure, for each proposition, how likely or probable my evidence makes that proposition---that is, we assume there is a unique evidential probability function of the sort that J. M. Keynes, E. T. Jaynes, and Timothy Williamson envisaged. Then, if $r$ is how likely my evidence makes the proposition $X$, then:
  1. the expected value of believing $X$ is $rR + (1-r)(-W)$,
  2. the expected value of disbelieving $X$ is $r(-W) + (1-r)R$, and
  3. the expected value of withholding judgment is $0$.
A quick calculation shows that believing uniquely maximises expected utility when $r > \frac{W}{R+W}$, disbelieving uniquely maximises when $r < \frac{R}{R+W}$, and withholding uniquely maximises if $\frac{R}{R +W} < r < \frac{W}{R+W}$. What follows is that the more you disvalue being wrong, the stronger the evidence will have to be in order to make it rational to believe. Now, Kelly assumes that various values of $R$ and $W$ are rationally permissible---it is permissible to disvalue believing falsehoods a lot more than you value believing truths, and it is permissible to disvalue that just a little more. And, if that is the case, different individuals might have the same evidence while rationality requires of them different doxastic attitudes---a belief for one of them, who disvalues being wrong only a little more than they value being right, and no belief for the other, where the difference between their disvalue for false belief and value for true belief is much greater. Kelly identifies the values you pick for $R$ and $W$ with your attitudes to epistemic risk. So different doxastic attitudes are permissible in the face of the same evidence because different attitudes to epistemic risk are permissible.
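Here is that quick calculation as a snippet (my own illustration; the values $R = 1$, $W = 4$ are arbitrary):

```python
def attitudes(R, W, r):
    """Attitudes maximizing expected epistemic utility, where a true belief or
    disbelief earns R, a false one earns -W, and withholding earns 0."""
    ev = {"believe": r * R - (1 - r) * W,
          "disbelieve": -r * W + (1 - r) * R,
          "withhold": 0.0}
    best = max(ev.values())
    return [a for a, u in ev.items() if u >= best - 1e-12]   # tolerate float ties

R, W = 1, 4
print(R / (R + W), W / (R + W))      # the two thresholds: 0.2 and 0.8
print(attitudes(R, W, 0.9))          # ['believe']
print(attitudes(R, W, 0.5))          # ['withhold']
print(attitudes(R, W, 0.1))          # ['disbelieve']
print(attitudes(R, W, 0.8))          # ['believe', 'withhold']: r = W/(R+W)
```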

Now, there are a number of things worth noting here before I pass to my own alternative approach to epistemic risk.

First, note that Kelly manages to show that epistemic rationality might be permissive even if there is a unique evidential probability measure. So even those who think you can solve the problem of what probability is demanded by the very sparse evidence and the very complex evidence we described above should still countenance a form of epistemic permissivism if they agree that there are different permissible values for $R$ and $W$.

Second, it might seem at first that Kelly's argument gives interpersonal permissivism at most. After all, for fixed $R$ and $W$, and a unique evidential probability $r$ for $X$ given your evidence, it might seem that there is always a single attitude---belief in $X$, disbelief in $X$, or judgment withheld about $X$---that maximises expected epistemic value. But this isn't always true. After all, if $r = \frac{R}{R + W}$, then it turns out that disbelieving and withholding have the same expected epistemic value, and if $r = \frac{W}{R+W}$, then believing and withholding have the same expected epistemic value. And in those cases, it would be rationally permissible for an individual to pick either.

Third, and relatedly, it might seem that Kelly's argument gives only narrow permissivism, since it allows for cases in which believing and withholding are both rational, and it allows for cases in which disbelieving and withholding are both rational, but it doesn't allow for cases in which all three are rational. But that again is a mistake. If you value believing truths exactly as much as you disvalue believing falsehoods, so that $R = W$, and if the objective evidential probability of $X$ given your evidence is $r = \frac{1}{2}$, then believing, disbelieving, and withholding judgment are all permissible. Having said that, there is some reason to say that it is not rationally permissible to set $R = W$. After all, if you do, and if $r = \frac{1}{2}$, then it is permissible to both believe $X$ and believe $\overline{X}$ at the same time, and that seems wrong.

Fourth, and most importantly for my purposes, Kelly's argument works for beliefs, but not for credences. The problem, briefly stated, is this: suppose $r$ is how likely my evidence makes proposition $X$. And suppose $\mathfrak{s}(1, x)$ is the accuracy of credence $x$ in a truth, while $\mathfrak{s}(0, x)$ is the accuracy of credence $x$ in a falsehood. Then the expected accuracy of credence $x$ in $X$ is
\begin{equation}\label{eeu}
r\mathfrak{s}(1, x) + (1-r)\mathfrak{s}(0, x)\tag{*}
\end{equation}
But nearly all advocates of epistemic utility theory for credences agree that rationality requires that $\mathfrak{s}$ is a strictly proper scoring rule. And that means that (*) is maximized, as a function of $x$, at $x = r$. So differences in how you value epistemic utility don't give rise to differences in what credences you should have. Your credences should always match the objective evidential probability of $X$ given your evidence. Epistemic permissivism about credences would therefore be false.
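To see strict propriety doing this work, here's a tiny numerical sketch (my own), with the Brier score $\mathfrak{s}(1, x) = -(1-x)^2$ and $\mathfrak{s}(0, x) = -x^2$ standing in for the strictly proper rule: for each evidential probability $r$, (*) is maximized at $x = r$.

```python
import numpy as np

xs = np.linspace(0, 1, 10001)

def expected_accuracy(r, x):
    """(*) with the strictly proper Brier score: r*s(1, x) + (1 - r)*s(0, x)."""
    return -r * (1 - x) ** 2 - (1 - r) * x ** 2

for r in [0.1, 0.35, 0.8]:
    print(r, xs[np.argmax(expected_accuracy(r, xs))])   # the maximizer is x = r
```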

I think Kelly's observation, supplemented with Easwaran's precise formulation of epistemic value, furnishes a strong argument for permissivism about beliefs. But I think we can appeal to epistemic risk to give something more, namely, two versions of permissivism about credences: first, a wide common interpersonal version; and second, a narrow common intrapersonal version.

Epistemic risk encoded in decision rules


To take the first step towards these versions of permissivism for credences, let's begin with the observation that there are two ways in which risk enters into the rational evaluation of a set of options. First, risk might be encoded in the utility function, which measures the value of each option at each possible world; or, second, it might be encoded in the choice rule, which takes in various features of the options, including their utilities at different worlds, and spits out the set of options that are rationally permissible.

Before we move to the epistemic case, let's look at how this plays out in the practical case. I am about to flip a fair coin. I make you an offer: pay me £30, and I will pay you £100 if the coin lands heads and nothing if it lands tails. You reject my offer. There are two ways to rationalise your decision. On the first, you choose using expected utility theory, which is a risk-neutral decision rule. However, because the utility you assign to an outcome is a sufficiently concave function of the money you get in that outcome, and your current wealth is sufficiently small, the expected utility of accepting my offer is less than the expected utility of rejecting it. For instance, perhaps your utility for an outcome in which your total wealth is £$n$ is $\log n$. And perhaps your current wealth is £$40$. Then your expected utility for accepting my offer is $\frac{1}{2}\log 110 + \frac{1}{2} \log 10 \approx 3.502$ while your expected utility for rejecting it is $\log 40 \approx 3.689$. So you are rationally required to reject. On this way of understanding your choice, your risk-aversion is encoded in your utility function, while your decision rule is risk-neutral. On the second way of understanding your choice, it is the other way around. Instead of expected utility theory, you choose using a risk-sensitive decision rule, such as Wald's Maximin, the Hurwicz criterion, the generalized Hurwicz criterion, Quiggin's rank-dependent utility theory, or Buchak's risk-weighted expected utility theory. According to Maximin, for instance, you are required to choose an option whose worst-case outcome is best. The worst case if you accept the offer is the one in which the coin lands tails and I pay you back nothing, in which case you end up £$30$ down, whereas the worst case if you refuse my offer is the status quo in which you end up with exactly as much as you had before. So, providing you prefer more money to less, the worst-case outcome of accepting the offer is worse than the worst-case outcome of refusing it, so Maximin will lead you to refuse the offer. And it will lead you to do that even if, for instance, you value money linearly. Thus, there is no need to reflect your attitude to risk in your utility function at all, because it is encoded in your decision rule.
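Here's the arithmetic of that example in code (my own sketch), showing the two places the risk-aversion can live:

```python
import math

wealth = 40
accept = [wealth - 30 + 100, wealth - 30]   # heads, tails: 110, 10
reject = [wealth, wealth]                   # status quo either way

# risk encoded in the utility function: expected log-utility
print(round(sum(0.5 * math.log(o) for o in accept), 3))   # 3.502
print(round(sum(0.5 * math.log(o) for o in reject), 3))   # 3.689 -> reject

# risk encoded in the decision rule: Maximin with linear utility
print(min(accept), min(reject))                           # 10 vs 40 -> reject
```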

I take the lesson of the Allais paradox to be that there is rational risk-sensitive behaviour that we cannot capture entirely using the first method here. That is, there are rational preferences that we cannot recover within expected utility theory by making the utility function concave in money, or applying some other tweak. We must instead permit risk-sensitive choice rules. Now, there are two sorts of such rules: those that require credences among their inputs and those that don't. In the first camp, perhaps the most sophisticated is Lara Buchak's risk-weighted expected utility theory. In the second, we've already met the most famous example, namely, Maximin, which is maximally risk-averse. But there is also Maximax, which is maximally risk-seeking. And there is the Hurwicz criterion, which strikes a balance between the two. And there's my generalization of the Hurwicz criterion, which I'll abbreviate GHC. As I've discussed over the last few blogposts, I favour the latter in the case of picking priors. (For an alternative approach to epistemic risk, see Boris Babic's recent paper here.)

To see what happens when you use GHC to pick priors, let's give a quick example in a situation in which there are just three possible states of the world to which you assign credences, $w_1$, $w_2$, $w_3$, and we write $(p_1, p_2, p_3)$ for a credence function $p$ that assigns $p_i$ to world $w_i$. Suppose your Hurwicz weights are these: $\alpha_1$ for the best case, $\alpha_2$ for the second-best (and second-worst) case, and $\alpha_3$ for the worst case. And your inaccuracy measure is $\mathfrak{I}$. Then we're looking for the credence functions that minimize your Hurwicz score, which is$$H^A_{\mathfrak{I}}(p) = \alpha_1\mathfrak{I}(p, w_{i_1}) + \alpha_2\mathfrak{I}(p, w_{i_2}) + \alpha_3\mathfrak{I}(p, w_{i_3})$$when$$\mathfrak{I}(p, w_{i_1}) \leq \mathfrak{I}(p, w_{i_2}) \leq \mathfrak{I}(p, w_{i_3})$$so that, since $\mathfrak{I}$ measures inaccuracy, $\alpha_1$ attaches to the best case and $\alpha_3$ to the worst. Now suppose for our example that $\alpha_1 \geq \alpha_2 \geq \alpha_3$. Then the credence functions that minimize $H^A_{\mathfrak{I}}$ are$$\begin{array}{ccc} (\alpha_1, \alpha_2, \alpha_3) & (\alpha_1, \alpha_3, \alpha_2) & (\alpha_2, \alpha_1, \alpha_3) \\ (\alpha_2, \alpha_3, \alpha_1) & (\alpha_3, \alpha_1, \alpha_2) & (\alpha_3, \alpha_2, \alpha_1) \end{array}$$
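Here's that example as a sketch (my own, with the Brier inaccuracy measure standing in for $\mathfrak{I}$): a grid search over the simplex confirms that, with weights $(0.5, 0.3, 0.2)$, the GHC-minimizers are the six permutations of the weights.

```python
import numpy as np

def brier(p, i):
    """Brier inaccuracy of credences p at world w_i."""
    v = np.zeros(len(p)); v[i] = 1.0
    return float(np.sum((v - p) ** 2))

def ghc(p, weights):
    """Generalized Hurwicz score: order the world-by-world inaccuracies from
    best (lowest) to worst and weight them by alpha_1, ..., alpha_n."""
    ordered = sorted(brier(p, i) for i in range(len(p)))
    return sum(a * s for a, s in zip(weights, ordered))

weights = (0.5, 0.3, 0.2)
grid = np.linspace(0, 1, 101)
simplex = [np.array([x, y, 1 - x - y]) for x in grid for y in grid if x + y <= 1 + 1e-9]

best = min(ghc(p, weights) for p in simplex)
print([tuple(np.round(p, 2)) for p in simplex if ghc(p, weights) < best + 1e-9])
# the six permutations of (0.5, 0.3, 0.2)
```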

With that example in hand, and a little insight into how GHC works when you use it to select priors, let's work through Elizabeth Jackson's taxonomy of permissivism from above.

First, since the attitudes we are considering are credences, it's a credal version of permissivism that follows from this risk-based approach in accuracy-first epistemology.

Second, we obtain both an interpersonal and an intrapersonal permissivism. A particular person will have risk attitudes represented by specific Hurwicz weights. And yet, even once those are fixed, there will usually be a number of different permissible priors. That is, rationality will permit a number of different credal states in the absence of evidence. For instance, if my Hurwicz weights are $\alpha_1 = 0.5$, $\alpha_2 = 0.3$, $\alpha_3 = 0.2$, then rationality allows me to assign 0.5 to world $w_1$, 0.3 to $w_2$ and 0.2 to $w_3$, but it also permits me to assign $0.3$ to $w_1$, $0.2$ to $w_2$, and $0.5$ to $w_3$.

So there is intrapersonal credal permissivism, but it is reasonably narrow---there are only six rationally permissible credence functions for someone with the Hurwicz weights just specified, for instance. On the other hand, the interpersonal permissivism we obtain is very wide. Indeed, it is as wide as the range of permissible attitudes to risk. As we noted in a previous post, for any probabilistic credence function over a space of possible worlds, there are Hurwicz weights that will render those credences permissible. So providing those weights are rationally permissible, so are the credences.

Finally, is the permissivism we get from this risk-based approach common or rare? So far, we've just considered it in the case of priors. That is, we've only established permissivism in the case in which you have no evidence. But of course, once it's established there, it's also established for many other bodies of evidence, since we obtain the rational credences given a body of evidence by looking to what we obtain by updating rational priors by conditioning on that evidence. And, providing a body of evidence isn't fully informative, if there are multiple rational priors, they will give rise to multiple rational posteriors when we condition them on that evidence. So the wide interpersonal credal permissivism we obtain is common, and so is the narrow intrapersonal credal permissivism.

Wednesday, 22 July 2020

Taking risks and picking posteriors

For a PDF of this blog, see here.

When are my credences rational? In Bayesian epistemology, there's a standard approach to this question. We begin by asking what credences would be rational were you to have no evidence at all; then we ask what ways of updating your credences are rational when you receive new evidence; and finally we say that your current credences are rational if they are the result of updating rational priors in a rational way on the basis of your current total evidence. This account can be read in one of two ways: on the doxastic reading, you're rational if, in fact, when you had no evidence you had priors that were rational and if, in fact, when you received evidence you updated in a rational way; on the propositional reading, you're rational if there exists some rational prior and there exists some rational way of updating such that applying the updating rule to the prior based on your current evidence issues in your current credences.

In this previous post, I asked how we might use accuracy-first epistemology and decision-making rules for situations of massive uncertainty to identify the rational priors. I suggested that we should turn to the early days of decision theory, when there was still significant interest in how we might make a decision in situations in which it is not possible to assign probabilities to the different possible states of the world. In particular, I noted Hurwicz's generalization of Wald's Maximin rule, which is now called the Hurwicz Criterion, and I offered a further generalization of my own, which I then applied to the problem of picking priors. Here's my generalization:

Generalized Hurwicz Criterion (GHC) Suppose the set of possible states of the world is $W = \{w_1, \ldots, w_n\}$. Pick $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\alpha_1 + \ldots + \alpha_n = 1$, and denote this sequence of weights $A$. Suppose $a$ is an option defined on $W$, and write $a(w_i)$ for the utility of $a$ at world $w_i$. Then if$$a(w_{i_1}) \geq a(w_{i_2}) \geq \ldots \geq a(w_{i_n})$$then let$$H^A(a) = \alpha_1a(w_{i_1}) + \ldots + \alpha_na(w_{i_n})$$Pick an option that maximises $H^A$.

Thus, whereas Wald's Maximin puts all of the weight onto the worst case, and Hurwicz's Criterion distributes all the weight between the best and worst cases, the generalized Hurwicz Criterion allows you to distribute the weight between best, second-best, and so on down to second-worst, and worst. I said that you should pick your priors by applying GHC with a measure of accuracy for credences. Then I described the norms for priors that it imposes.

In this post, I'm interested in the second component of the Bayesian approach I described above, namely, the rational ways of updating. How does rationality demand we update our prior when new evidence arrives? Again, I'll be asking this within the accuracy-first framework.

As in the previous post, we'll consider the simplest possible case. We'll assume there are just three possible states of the world that you entertain and to which you will assign credences. They are $w_1, w_2, w_3$. If $p$ is a credence function, then we'll write $p_i$ for $p(w_i)$, and we'll denote the whole credence function $(p_1, p_2, p_3)$. At the beginning of your epistemic life you have no evidence, and you must pick your prior.  We'll assume that you do as I proposed in the previous post and set your prior using GHC with some weights that you have picked, $\alpha_1$ for the best-case scenario, $\alpha_2$ for the second-best, and $\alpha_3$ for the worst. Later on, let's suppose, you learn evidence $E = \{w_1, w_2\}$. How should you update? As we will see, the problem is that there are many seemingly plausible approaches to this question, most of which disagree and most of which give implausible answers.

A natural first proposal is to use the same decision rule to select our posterior as we used to select our prior. To illustrate, let's suppose that our Hurwicz weight for the best-case scenario is $\alpha_1 = 0.5$, for the second-best $\alpha_2 = 0.3$, and for the worst $\alpha_3 = 0.2$. Applying GHC with these weights and any additive and continuous strictly proper (acsp) accuracy measure gives the following as the permissible priors:$$\begin{array}{ccc}(0.5, 0.3, 0.2) & (0.5, 0.2, 0.3) & (0.3, 0.5, 0.2) \\ (0.3, 0.2, 0.5) & (0.2, 0.5, 0.3) &(0.2, 0.3, 0.5)\end{array}$$
Let's suppose we pick $(0.5, 0.3, 0.2)$. And now suppose we learn $E = \{w_1, w_2\}$. If we simply apply GHC again, we get the same set of credence functions as permissible posteriors. But none of these even respects the evidence we've obtained -- that is, all of them assign positive credence to world $w_3$, which our evidence has ruled out. So that can't be quite right.

Perhaps, then, we should first limit the permissible posteriors to those that respect our evidence -- by assigning credence $0$ to world $w_3$ -- and then find the credence function that maximizes GHC among them. It turns out that the success of this move depends on the measure of accuracy that you use. Suppose, for instance, you use the Brier score $\mathfrak{B}$, whose accuracy for a credence function $p = (p_1, p_2, p_3)$ at world $w_i$ is $$\mathfrak{B}(p, w_i) = 2p_i - (p_1^2 + p_2^2 + p_3^2)$$That is, you find the credence function of the form $q = (q_1, 1-q_1, 0)$ that maximizes $H^A_\mathfrak{B}$. But it turns out that this is $q = (0.6, 0.4, 0)$, which is not the result of conditioning $(0.5, 0.3, 0.2)$ on $E = \{w_1, w_2\}$ -- that would be $q = (0.625, 0.375, 0)$.

However, as I explained in another previous post, there is a unique additive and continuous strictly proper accuracy measure that will give conditionalization in this way. I called it the enhanced log score $\mathfrak{L}^\star$, and it is also found in Juergen Landes' paper here (Proposition 9.1) and Schervish, Seidenfeld, and Kadane's paper here (Example 6). Its accuracy for a credence function $p = (p_1, p_2, p_3)$ at world $w_i$ is $$\mathfrak{L}^\star(p, w_i) = \log p_i - (p_1 + p_2 + p_3)$$If we apply GHC with that accuracy measure and with the restriction to credence functions that satisfy the evidence, we get $(0.625, 0.375, 0)$ or $(0.375, 0.625, 0)$, as required. So while GHC doesn't mandate conditioning on your evidence, it does at least permit it. However, while this goes smoothly if we pick $(0.5, 0.3, 0.2)$ as our prior, it does not work so well if we pick $(0.2, 0.3, 0.5)$, which, if you recall, is also permitted by the Hurwicz weights we are using. After all, the two permissible posteriors remain the same, but neither is the result of conditioning that prior on $E$. This proposal, then, is a non-starter.
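Here's a numerical check of both of these results (my own sketch, with the Hurwicz weights $(0.5, 0.3, 0.2)$ from above). One modelling note: for the enhanced log score, every candidate $(x, 1-x, 0)$ gets the same accuracy at $w_3$, namely $\log 0 - 1 = -\infty$, so that term cannot discriminate between candidates and the code drops it; for the Brier score, all three worlds are kept.

```python
import numpy as np

weights = (0.5, 0.3, 0.2)
xs = np.linspace(0.001, 0.999, 1997)

def ghc_score(accuracies):
    """Weight accuracies from the best case down to the worst."""
    return sum(a * s for a, s in zip(weights, sorted(accuracies, reverse=True)))

def maximizers(acc_fn):
    scores = np.array([ghc_score(acc_fn(x)) for x in xs])
    return xs[scores > scores.max() - 1e-12]

# Brier accuracies of q = (x, 1-x, 0) at the three worlds
def brier_accs(x):
    s = x ** 2 + (1 - x) ** 2
    return [2 * x - s, 2 * (1 - x) - s, -s]

# enhanced log accuracies at w1 and w2 only; the w3 term is the same
# -infinity for every candidate, so it is dropped
def log_accs(x):
    return [np.log(x) - 1, np.log(1 - x) - 1]

print(maximizers(brier_accs))   # [0.4 0.6]
print(maximizers(log_accs))     # [0.375 0.625]
```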

There is, in any case, something strange about the approach just mooted. After all, GHC assigns a weight to the accuracy of a candidate posterior in each of the three worlds, even though in world $w_3$ you wouldn't receive evidence $E$ and would thus not adopt this posterior. Let's suppose that you'd receive evidence $\overline{E} = \{w_3\}$ instead at world $w_3$; and let's suppose you'd adopt the only credence function that respects this evidence, namely, $(0, 0, 1)$. If that's the case, we might try applying GHC not to potential posteriors but to potential rules for picking posteriors. I'll call these posterior rules. In the past, I've called them updating rules, but this is a bit misleading. An updating rule would take as inputs both prior and evidence and give the result of updating the former on the latter. But these rules really just take evidence as an input and say which posterior you'll adopt if you receive it. Thus, for our situation, in which you might learn either $E$ or $\overline{E}$, the posterior rule would have the following form:$$p' = \left \{ \begin{array}{rcl}E & \mapsto & p'_E \\ \overline{E} & \mapsto & p'_{\overline{E}}\end{array}\right.$$for some suitable specification of $p'_E$ and $p'_\overline{E}$. Then the accuracy of a rule $p'$ at a world is just the accuracy of the output of that rule at that world. Thus, in this case:$$\begin{array}{rcl}\mathfrak{I}(p', w_1) & = & \mathfrak{I}(p'_E, w_1) \\ \mathfrak{I}(p', w_2) & = & \mathfrak{I}(p'_E, w_2) \\\mathfrak{I}(p', w_3) & = & \mathfrak{I}(p'_\overline{E}, w_3)\end{array}$$The problem is that this move doesn't help. Part of the reason is that whatever was the best-case scenario for the prior, the best case for the posterior is sure to be world $w_3$, since $p'_\overline{E} = (0, 0, 1)$ is perfectly accurate at that world. Thus, suppose you pick $(0.5, 0.3, 0.2)$ as your prior. It turns out that the rules that maximize $H^A_{\mathfrak{L}^\star}$ will give $p'_E = (0.4, 0.6, 0)$ or $p'_E = (0.6, 0.4, 0)$, whereas conditioning your prior on $E$ gives $p'_E = (0.625, 0.375, 0)$.

Throughout our discussion so far, we have dismissed various possible approaches because they are not consistent with conditionalization. But why should that be a restriction? Perhaps the approach we are taking will tell us that the Bayesian fixation with conditionalization is misguided. Perhaps. But there are strong arguments for conditionalization within accuracy-first epistemology, so we'd have to see why they go wrong before we start rewriting Bayesian textbooks. I'll consider three such arguments here. The first isn't as strong as it seems; the second isn't obviously available to someone who used GHC to pick priors; the third is promising but it leads us initially down a tempting road into an inhospitable morass.

The first is closely related to a proposal I explored in a previous blogpost. So I'll briefly outline the approach here and refer to the issues raised in that post. The idea is this: Your prior is $(p_1, p_2, p_3)$. You learn $E$. You must now adopt a posterior that respects your new evidence, namely, $(q_1, 1-q_1, 0)$. You should choose the posterior of that form that maximises expected accuracy from the point of view of your prior, that is, you're looking for $(x, 1-x, 0)$ that maximizes$$p_1 \mathfrak{I}((x, 1-x, 0), w_1) + p_2 \mathfrak{I}((x, 1-x, 0), w_2) + p_3 \mathfrak{I}((x, 1-x, 0), w_3)$$This approach is taken in a number of places: at the very least, here, here, and here. Now, it turns out that there is only one additive and continuous strictly proper accuracy measure that is guaranteed always to give conditionalization on this approach. That is, there is only one measure such that, for any prior, the posterior it expects to be best among those that respect the evidence is the one that results from conditioning the prior on the evidence. Indeed, that accuracy measure is one we've already met above, namely, the enhanced log score $\mathfrak{L}^\star$ (see here). However, it turns out that it only works if we assume our credences are defined only over the set of possible states of the world, and not over more coarse-grained propositions (see here). So I think this approach is a non-starter.

More promising at first sight is the argument by Hilary Greaves and David Wallace from 2006. Here, just as we considered earlier, we look not just at the posterior we will adopt having learned $E$, but also the posterior we would adopt were we to learn $\overline{E}$. Thus, if your prior is $(p_1, p_2, p_3)$, then you are looking for $(x, 1-x, 0)$ that maximizes$$p_1 \mathfrak{I}((x, 1-x, 0), w_1) + p_2 \mathfrak{I}((x, 1-x, 0), w_2) + p_3 \mathfrak{I}((0, 0, 1), w_3)$$And it turns out that this will always be$$x = \frac{p_1}{p_1 + p_2}\ \ 1-x = \frac{p_2}{p_1+p_2}$$providing $\mathfrak{I}$ is strictly proper.
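A quick numerical illustration of the Greaves and Wallace result (my own sketch, with the Brier accuracy measure and the prior $(0.5, 0.3, 0.2)$): maximizing prior-expected accuracy over posterior rules gives conditioning.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # the prior
xs = np.linspace(0, 1, 10001)

def brier_acc(q, i):
    """Brier accuracy of credences q at world w_i: 2*q_i - sum_j q_j^2."""
    return 2 * q[i] - float(np.sum(q ** 2))

def rule_value(x):
    """Prior-expected accuracy of the rule: (x, 1-x, 0) on E, (0, 0, 1) on not-E."""
    qE = np.array([x, 1 - x, 0.0])
    qnotE = np.array([0.0, 0.0, 1.0])
    return p[0] * brier_acc(qE, 0) + p[1] * brier_acc(qE, 1) + p[2] * brier_acc(qnotE, 2)

best = xs[np.argmax([rule_value(x) for x in xs])]
print(best, p[0] / (p[0] + p[1]))   # both 0.625: conditioning wins
```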

Does this help us? Does it show that, if we set our priors using GHC, we should then set our posteriors using conditionalization? One worry might be this: What justifies you in choosing your posteriors using one decision rule -- namely, maximise subjective expected utility -- when you picked your priors using a different one -- namely, GHC? But there seems to be a natural answer. As I emphasised above, GHC is specifically designed for situations in which probabilities, either subjective or objective, are not available. It allows us to make decisions in their absence. But of course when it comes to choosing the posterior, we are no longer in such a situation. At that point, we can simply resort to what became more orthodox decision theory, namely, Savage's subjective expected utility theory.

But there's a problem with this. GHC is not a neutral norm for picking priors. When you pick your Hurwicz weights for the best case, the second-best case, and so on down to the second-worst case and the worst case, you reflect an attitude to risk. Give more weight to the worst cases and you're risk averse, choosing options that make those worst cases better; give more weight to the best cases and you're risk seeking; spread the weights equally across all cases and you are risk neutral. But the problem is that subjective expected utility theory is a risk neutral theory. (One way to see this is to note that it is the special case of Lara Buchak's risk-weighted expected utility theory that results from using the neutral risk function $r(x) = x$.) Thus, for those who have picked their prior using a risk-sensitive instance of GHC when they lacked probabilities, the natural decision rule when they have access to probabilities is not going to be straightforward expected utility theory. It's going to be a risk-sensitive rule that can accommodate subjective probabilities. The natural place to look would be Lara Buchak's theory, for instance. And it's straightforward to show that Greaves and Wallace's result does not hold when you use such a rule. (In forthcoming work, Catrin Campbell-Moore and Bernhard Salow have been working on what does follow and how we might change our accuracy measures to fit with such a theory and what follows from an argument like Greaves and Wallace's when you do that.) In sum, I think arguments for conditionalization based on maximizing expected accuracy won't help us here.

Fortunately, however, there is another argument, and it doesn't run into this problem. As we will see, though, it does face other challenges. In Greaves and Wallace's argument, we took the view from the prior that we picked using GHC, and we used it to evaluate our way of picking posteriors. In this argument, due to me and Ray Briggs, we take the view from nowhere, and we use it to evaluate the prior and the posterior rule together. Thus, suppose $p$ is your prior and $p'$ is your posterior rule. Then we evaluate them together by taking their joint inaccuracy at a world to be the sum of their individual inaccuracies there:$$\mathfrak{I}((p, p'), w) = \mathfrak{I}(p, w) + \mathfrak{I}(p'_{E_w}, w)$$where $E_w$ is the element of the partition $\mathcal{E}$ that is true at $w$, so that $p'_{E_w}$ is the posterior the rule recommends at $w$. Then we have the following fact, where $p'$ is a conditioning rule for $p$ over some partition $\mathcal{E}$ iff, for all $E$ in $\mathcal{E}$, if $p(E)  > 0$, then $p'_E(-) = p(-|E)$:

Theorem Suppose $\mathfrak{I}$ is an additive and continuous strictly proper inaccuracy measure. Then, if $p'$ is not a conditioning rule for $p$ over $\mathcal{E}$, there are $q$ and $q'$ such that$$\mathfrak{I}((q, q'), w) < \mathfrak{I}((p, p'), w)$$for all worlds $w$.

That is, if $p'$ is not a conditioning rule for $p$, then, taken together, they are accuracy-dominated. There is an alternative pair, $q$ and $q'$, that, taken together, are guaranteed to be more accurate than $p$ and $p'$ are, taken together.
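We can see the theorem at work numerically. In the following sketch, the Brier score stands in for $\mathfrak{I}$; the brute-force grid is mine, purely for illustration, and is not the construction in the paper. The prior $(0.5, 0.3, 0.2)$ is paired with the non-conditioning posterior $(0.5, 0.5, 0)$ on $E = \{w_1, w_2\}$, and even a coarse search turns up pairs that beat it at every world -- for instance, the prior $(0.45, 0.35, 0.2)$ with posterior $(0.55, 0.45, 0)$:

```python
import numpy as np
from itertools import product

def brier(c, i):
    truth = np.zeros(len(c)); truth[i] = 1.0
    return float(np.sum((np.asarray(c) - truth) ** 2))

def joint_inacc(prior, x):
    # World-by-world joint Brier inaccuracy of a prior together with the plan:
    # posterior (x, 1-x, 0) on E = {w1, w2}, posterior (0, 0, 1) on not-E = {w3}.
    return np.array([brier(prior, 0) + brier((x, 1 - x, 0), 0),
                     brier(prior, 1) + brier((x, 1 - x, 0), 1),
                     brier(prior, 2) + brier((0, 0, 1), 2)])

p = (0.5, 0.3, 0.2)
target = joint_inacc(p, 0.5)   # 0.5 != p(w1|E) = 0.625, so not a conditioning rule

grid = np.arange(0, 1.0001, 0.05)
dominators = [((q1, q2, round(1 - q1 - q2, 2)), y)
              for q1, q2, y in product(grid, grid, grid)
              if q1 + q2 <= 1
              and np.all(joint_inacc((q1, q2, 1 - q1 - q2), y) < target)]
print(len(dominators), dominators[:3])
```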

Notice that this argument establishes a slightly different norm from the one that the expected accuracy argument secures. The expected accuracy argument gives a narrow scope norm: if $p$ is your prior, then your posterior rule should be to condition $p$ on whatever evidence you learn. The present argument gives a wide scope norm: you should not have prior $p$ and a posterior rule that does not condition on the evidence you learn. This suggests that, if you're sitting at the beginning of your epistemic life and you're picking priors and posterior rules together, as a package, you should pick them so that the posterior rule involves conditioning the prior on the evidence received. Does it also tell you anything about what to do if you're sitting with your prior already fixed and new evidence comes in? I'm not sure. Here's a reason to think it might. You might think that it's only rational to do at a later time what it was rational to plan to do at an earlier time. If that's right, then we can obtain the narrow scope norm from the wide scope one.

Let's park those questions for the moment. For the approach taken in this argument suggests something else. In the previous post, we asked how to pick your priors, and we hit upon GHC. Now that we have a way of evaluating priors and posterior rules together, perhaps we should just apply GHC to those? Let's see what happens if we do that. As before, assume the best case receives weight $\alpha_1 = 0.5$, the second-best $\alpha_2 = 0.3$, and the third-best $\alpha_3 = 0.2$. Then we know that the priors that GHC permits when we consider them on their own, without the posterior plans appended to them, are just$$\begin{array}{ccc}(0.5, 0.3, 0.2) & (0.5, 0.2, 0.3) & (0.3, 0.5, 0.2) \\ (0.3, 0.2, 0.5) & (0.2, 0.5, 0.3) &(0.2, 0.3, 0.5)\end{array}$$Now let's consider what happens when we add in the posterior rules for learning $E = \{w_1, w_2\}$ or $\overline{E} = \{w_3\}$. Then it turns out that the minimizers are the priors$$(0.3, 0.2, 0.5)\ \ (0.2, 0.3, 0.5)$$combined with the corresponding conditionalizing posterior rules. Now, since those two priors are among the ones that GHC permits when applied to the priors alone, this might seem consistent with the original approach. The problem is that these priors are specific to the case in which you'll learn either $E$ or $\overline{E}$. If, on the other hand, you'll learn $F = \{w_1\}$ or $\overline{F} = \{w_2, w_3\}$, the permissible priors are$$(0.5, 0.2, 0.3)\ \ (0.5, 0.3, 0.2)$$And, at the beginning of your epistemic life, you don't know which, if either, is correct.

In fact, there's what seems to me a deeper problem. In the previous paragraph we considered a situation in which you might learn either $E$ or $\overline{E}$ or you might learn either $F$ or $\overline{F}$, and you don't know which. But the two options determine different permissible priors. The same thing happens if there are four possible states of the world $\{w_1, w_2, w_3, w_4\}$ and you might learn either $E_1 = \{w_1, w_2\}$ or $E_2 = \{w_3, w_4\}$, or you might learn either $F_1 = \{w_1, w_2\}$ or $F_2 = \{w_3\}$ or $F_3 = \{w_4\}$. Now, suppose you assign the following Hurwicz weights: to the best case, you assign $\alpha_1 = 0.4$, to the second best $\alpha_2 = 0.3$, to the second worst $\alpha_3 = 0.2$, and to the worst $\alpha_4 = 0.1$. Then if you'll learn $E_1 = \{w_1, w_2\}$ or $E_2 = \{w_3, w_4\}$, then the permissible priors are
$$\begin{array}{cccc}(0.1, 0.4, 0.2, 0.3) & (0.4, 0.1, 0.2, 0.3) & (0.1, 0.4, 0.3, 0.2) & (0.4, 0.1, 0.3, 0.2) \\ (0.2, 0.3, 0.1, 0.4) & (0.2, 0.3, 0.4, 0.1) & (0.3, 0.2, 0.1, 0.4) & (0.3, 0.2, 0.4, 0.1) \end{array}$$But if you'll learn $F_1 = \{w_1, w_2\}$ or $F_2 = \{w_3\}$ or $F_3 = \{w_4\}$, then your permissible priors are
 $$\begin{array}{cccc}(0.1, 0.2, 0.3, 0.4) & (0.1, 0.2, 0.4, 0.3) & (0.2, 0.1, 0.3, 0.4) & (0.2, 0.1, 0.4, 0.3) \end{array}$$That is, there is no overlap between the two. It seems to me that the reason this is such a problem is that it's always been a bit of an oddity that the two accuracy-first arguments for conditionalization seem to depend on this assumption that there is some partition from which your evidence will come. It seems strange that when you learn $E$, in order to determine how to update, you need to know what alternative propositions you might have learned instead. The reason this assumption hasn't proved so problematic so far is that the update rule is in fact not sensitive to the partition. For instance, if I will learn $E_1 = F_1 = \{w_1, w_2\}$, both the Greaves and Wallace argument and the Briggs and Pettigrew argument for conditionalization say that you should update on that in the same way whether or not you might have learned $E_2 = \{w_3, w_4\}$ instead or whether you might have learned $F_2 = \{w_3\}$ or $F_3 = \{w_4\}$ instead. But here the assumption does seem problematic, because the permissible priors are sensitive to what the partition is from which you'll receive your future evidence.

What to conclude from all this? It seems to me that the correct approach is this: choose priors using GHC; choose posterior rules to go with them using the dominance argument that Ray and I gave--that is, update by conditioning.

Thursday, 16 July 2020

Taking risks and picking priors

For a PDF of this post, see here.

In my last couple of posts (here and here), I've been discussing decision rules for situations in which probabilities over the possible states of the world are not available for some reason. Perhaps your evidence is too sparse, and points in no particular direction, or too complex, and points in too many. So subjective probabilities are not available. And perhaps you simply don't know the objective probabilities. In those situations, you can't appeal to standard subjective expected utility theory of the sort described by Ramsey, Savage, Jeffrey, etc., nor to objective expected utility theory of the sort described by von Neumann & Morgenstern. What, then, are you to do? As I've discussed in the previous posts, this was in fact a hot topic in the early days of decision theory. Abraham Wald discussed the Maximin approach, Leonid Hurwicz expanded that to give his Criterion, Franco Modigliani mentioned Maximax approaches, and Leonard Savage discussed Minimax Regret. Perhaps the culmination of this research programme was John Milnor's elegant 'Games against Nature' in 1951 (revised in 1954), where he provided simple axiomatic characterisations of each of these decision rules. In the first post on this, I pointed out a problem with Hurwicz's approach (independently noted in this draft paper by Johan Gustafsson); in the second, I expanded that approach to avoid the problem.

Perhaps unsurprisingly, my interest in these decision rules stems from the possibility of applying them in accuracy-first epistemology. On that approach in formal epistemology, we determine the rationality of credences by considering the extent to which they promote what we take to be the sole epistemic good for credences, namely, accuracy. And we make this precise by applying decision theory to the question of which credence functions are rational. Thus: first, in place of the array of acts or options that are the usual focus of standard applications of decision theory, we substitute the different possible credence functions you might adopt; second, in place of the utility function that measures the value of acts or options at different possible states of the world, we substitute mathematical functions that measure the accuracy of a possible credence function at a possible state of the world. Applying a decision rule then identifies the rationally permissible credence functions. For instance, in his classic 1998 paper, Jim Joyce applies a dominance rule to show that only probabilistic credence functions are rationally permissible.

Now, one of the central questions in the epistemology of credences is the problem of the priors. Before I receive any evidence at all, which credences should I have? What should my ur-priors be, if you like? What should a superbaby's credences be, as David Lewis would put it? Now, people have reasonable concerns about the very idea of a superbaby -- this cognitive agent who has no evidence whatsoever but is nonetheless equipped with the conceptual resources to formulate a rich algebra of propositions. Evidence and conceptual resources grow together, after all. However, the problem of the priors arises even when you do have lots of evidence about other topics, but take yourself to have none that bears on a new topic in which you have yet to set your credences. And indeed it arises even if you do have credences that bear on the new topic, but you don't think they are justified, or they're inadmissible for the purpose to which you wish to use the credences -- for instance, if you are a scientist producing the priors for the Bayesian model you will use in your paper. So I think we needn't see the superbaby as a problematic idealization, but as a way of representing a situation in which we in fact quite often find ourselves.

So, what credences should a superbaby choose? The key point is that they have no recourse to any subjective or objective probabilities. So they can't appeal to expected accuracy to set their credences. Thus, we might naturally look to Wald, Hurwicz, Milnor, etc. to see what they might use instead. In this paper from 2016, I explored what Maximin requires, and in this follow-up later that year, I explored what the Hurwicz Criterion mandates. In this post, I'd like to explore what the Generalized Hurwicz Criterion stated in the previous blogpost requires of your credences. Let's remind ourselves of this decision rule.

Generalized Hurwicz Criterion (GHC) Suppose the set of possible states of the world is $W = \{w_1, \ldots, w_n\}$. Pick $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\alpha_1 + \ldots + \alpha_n = 1$, and denote this sequence of weights $A$. If $a$ is an option defined on $W$ and$$a(w_{i_1}) \geq a(w_{i_2}) \geq \ldots \geq a(w_{i_n})$$then let$$H^A(a) = \alpha_1a(w_{i_1}) + \ldots + \alpha_na(w_{i_n})$$Pick an option that maximises $H^A$.

So $\alpha_1$ weights the best-case utility, $\alpha_2$ the second best, and so on down to $\alpha_n$, which weights the worst-case utility. We then sum these weighted utilities to give the generalised Hurwicz score and we choose in order to maximise this.
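In code, the rule is a sort plus a dot product. Here's a minimal sketch (the function names are mine, purely for illustration):

```python
import numpy as np

def ghc_score(utilities, weights):
    # alpha_1 times the best-case utility, alpha_2 times the second best,
    # ..., alpha_n times the worst: sort descending, then take the dot product.
    return float(np.dot(weights, np.sort(np.asarray(utilities, dtype=float))[::-1]))

def ghc_choose(options, weights):
    # Return the options with maximal generalized Hurwicz score.
    scores = [ghc_score(u, weights) for u in options]
    return [o for o, s in zip(options, scores) if np.isclose(s, max(scores))]

# With weights (0.5, 0.3, 0.2), the utility profile (1, 0, 4) scores
# 0.5*4 + 0.3*1 + 0.2*0 = 2.3.
print(ghc_score((1, 0, 4), (0.5, 0.3, 0.2)))                     # 2.3
print(ghc_choose([(1, 0, 4), (2, 2, 2), (0, 0, 5)], (0.5, 0.3, 0.2)))  # [(0, 0, 5)]
```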

Now, suppose that our options are credence functions, and the utility of a credence function is given by an accuracy measure $\mathfrak{I}$. In fact, for our purposes here, we'll consider only credence functions defined over three worlds $w_1, w_2, w_3$. Things get complicated pretty fast here, and there will be plenty of interest in this simple case. As usual in accuracy-first epistemology, we'll say that the accuracy of your credence function is determined as follows:

(1) $\mathfrak{s}(1, x)$ is your measure of accuracy for credence $x$ in a truth,

(2) $\mathfrak{s}(0, x)$ is your measure of accuracy for credence $x$ in a falsehood,

(3) $\mathfrak{s}$ is strictly proper, so that, for all $0 \leq p \leq 1$, $p\mathfrak{s}(1, x) + (1-p)\mathfrak{s}(0, x)$ is maximised, as a function of $x$, at $x = p$,

(4) if $c$ is a credence function defined on $w_1, w_2, w_3$, and we write $c_i$ for $c(w_i)$, then

  • $\mathfrak{I}(c, w_1) = \mathfrak{s}(1, c_1) + \mathfrak{s}(0, c_2) + \mathfrak{s}(0, c_3)$
  • $\mathfrak{I}(c, w_2) = \mathfrak{s}(0, c_1) + \mathfrak{s}(1, c_2) + \mathfrak{s}(0, c_3)$
  • $\mathfrak{I}(c, w_3) = \mathfrak{s}(0, c_1) + \mathfrak{s}(0, c_2) + \mathfrak{s}(1, c_3)$
So the accuracy of your credence function at a world is the sum of the accuracy at that world of the credences that it assigns.

Now, given a credence function on $w_1, w_2, w_3$, we represent it by the triple $(c_1, c_2, c_3)$. And let's set our generalized Hurwicz weights to be $\alpha_1, \alpha_2, \alpha_3$. The first thing to note is that, if $(c_1, c_2, c_3)$ maximizes $H^A_\mathfrak{I}$, then so does any permutation of it---that is,$$(c_1, c_2, c_3),\ \ (c_1, c_3, c_2),\ \ (c_2, c_1, c_3),\ \ (c_2, c_3, c_1),\ \ (c_3, c_1, c_2),\ \ (c_3, c_2, c_1)$$all maximize $H^A_\mathfrak{I}$ if any one of them does. The reason is that the generalized Hurwicz score for the three-world case depends only on the best, middle, and worst accuracies of a credence function, and those are exactly the same for those six credence functions, even though the best, middle, and worst accuracies occur at different worlds for each. This means that, in order to find the maximizers, we only need to seek those for which $c_1 \geq c_2 \geq c_3$, since all others will be permutations of those. Let $\mathfrak{X} = \{(c_1, c_2, c_3)\, |\, c_1 \geq c_2 \geq c_3\}$. Since the accuracy measure $\mathfrak{I}$ is strictly proper and therefore truth-directed, for each $c$ in $\mathfrak{X}$,$$\mathfrak{I}(c, w_1) \geq \mathfrak{I}(c, w_2) \geq \mathfrak{I}(c, w_3)$$And so$$H^A_\mathfrak{I}(c) = \alpha_1 \mathfrak{I}(c, w_1) + \alpha_2\mathfrak{I}(c, w_2) + \alpha_3 \mathfrak{I}(c, w_3)$$
That means that $H^A_\mathfrak{I}(c)$ is the expected accuracy of $c$ by the lights of the credence function $(\alpha_1, \alpha_2, \alpha_3)$ generated by the Hurwicz weights. This allows us to calculate each case. As Catrin Campbell-Moore helped me to see, it turns out that the maximizer does not depend on which strictly proper scoring rule you use---each gives the same. In the first column of the table below, I list the different possible orderings of the three Hurwicz weights. In two cases, specifying that order is not sufficient to determine the maximizer. To do that, you also have to know the absolute values of some of the weights. Where necessary, I include those in the second column. In the third column, I specify the member of $\mathfrak{X}$ that maximizes $H^A_\mathfrak{I}$ relative to those weights.$$\begin{array}{c|c|ccc}
\mbox{Ordering of} & \mbox{Further properties} & c_1 & c_2 & c_3\\
\mbox{the weights} & \mbox{of the weights} & && \\
&&&\\
\hline
&&&\\
\alpha_1 \leq \alpha_2 \leq \alpha_3 & - & \frac{1}{3} & \frac{1}{3} & \frac{1}{3}  \\
&&&\\
\alpha_1 \leq \alpha_3 \leq \alpha_2 & \alpha_1 + \alpha_2 \geq \frac{2}{3} &  \frac{\alpha_1 + \alpha_2}{2} & \frac{\alpha_1 + \alpha_2}{2} & \alpha_3  \\
&&&\\
\alpha_1 \leq \alpha_3 \leq \alpha_2 & \alpha_1 + \alpha_2 \leq \frac{2}{3} &  \frac{1}{3} & \frac{1}{3} & \frac{1}{3}  \\
&&&\\
\alpha_2 \leq \alpha_1 \leq \alpha_3 & \alpha_1 \leq \frac{1}{3}& \frac{1}{3} & \frac{1}{3} & \frac{1}{3}  \\
&&&\\
\alpha_2 \leq \alpha_1 \leq \alpha_3 & \alpha_1 > \frac{1}{3} & \alpha_1 & \frac{\alpha_2 + \alpha_3}{2} & \frac{\alpha_2 + \alpha_3}{2} \\
&&&\\
\alpha_2 \leq \alpha_3 \leq \alpha_1 &- & \alpha_1 & \frac{\alpha_2 + \alpha_3}{2} & \frac{\alpha_2 + \alpha_3}{2} \\
&&&\\
\alpha_3 \leq \alpha_1 \leq \alpha_2 &- & \frac{\alpha_1 + \alpha_2}{2} & \frac{\alpha_1 + \alpha_2}{2} & \alpha_3  \\
&&&\\
\alpha_3 \leq \alpha_2 \leq \alpha_1 &- & \alpha_1 & \alpha_2 & \alpha_3 
\end{array}$$
In the following diagram, we plot the different possible Hurwicz weights in a barycentric plot, so that the bottom-left corner of the triangle is $(1, 0, 0)$, the bottom-right is $(0, 1, 0)$, and the top is $(0, 0, 1)$. We then divide this into four regions. If your weights $A = (\alpha_1, \alpha_2, \alpha_3)$ lie in a given region, then the triple I've placed in that region gives the credence function that maximises the Generalized Hurwicz Score $H^A$ for those weights. Note that the bottom-left triangle is $\mathfrak{X}$. Essentially, to find which member of $\mathfrak{X}$ a given weighting demands, you plot that weighting in this diagram and then find the closest member of $\mathfrak{X}$. You can use Euclidean closeness for this purpose, since all strictly proper accuracy measures will give that same result.
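If you'd like to check the table and the diagram for yourself, here is a brute-force sketch (mine; it fixes the Brier score and searches a grid over $\mathfrak{X}$, so it's a numerical check rather than a proof):

```python
import numpy as np

def brier_acc(c, i):
    truth = np.zeros(3); truth[i] = 1.0
    return -float(np.sum((np.asarray(c) - truth) ** 2))

def ghc(c, weights):
    # Generalized Hurwicz score of a credence function: alpha_1 weights its
    # best-case accuracy, alpha_2 its middle, alpha_3 its worst.
    accs = sorted((brier_acc(c, i) for i in range(3)), reverse=True)
    return sum(w * a for w, a in zip(weights, accs))

def best_in_X(weights, step=0.005):
    # Search the region X = {c1 >= c2 >= c3} of the probability simplex.
    best, arg = -np.inf, None
    for c1 in np.arange(0, 1 + step / 2, step):
        for c2 in np.arange(0, min(c1, 1 - c1) + step / 2, step):
            c3 = 1 - c1 - c2
            if c3 > c2 + 1e-9:
                continue
            s = ghc((c1, c2, c3), weights)
            if s > best:
                best, arg = s, (round(c1, 3), round(c2, 3), round(c3, 3))
    return arg

print(best_in_X((0.5, 0.3, 0.2)))  # (0.5, 0.3, 0.2): last row of the table
print(best_in_X((0.2, 0.5, 0.3)))  # (0.35, 0.35, 0.3): row two, as 0.2+0.5 >= 2/3
```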


To finish, some interesting points about this:

First, for any probabilistic credence function, there are Hurwicz weights for which that credence function is a maximizer. If $c = (c_1, c_2, c_3)$ and $c_{i_1} \geq c_{i_2} \geq c_{i_3}$, then the weights are $\alpha_1 = c_{i_1}$, $\alpha_2 = c_{i_2}$, $\alpha_3 = c_{i_3}$.

Second, in my earlier paper, I noted that Maximin demands the uniform distribution. So you might see the uniform distribution as the maximally risk-averse option. And indeed you can read something like that into William James' remark that the person who always suspends judgment has an irrational aversion to being a dupe. But actually it turns out that quite a wide range of attitudes to risk--in this case represented by generalised Hurwicz weights in the upper trapezoid region--demand the uniform distribution.

Third, in my first blogpost about all this, I criticized Hurwicz's original version of his criterion for being incompatible with a natural dominance norm, which says that if one option is always at least as good as another and sometimes better, then it should be strictly preferred. Suppose we apply this strengthening to Weak Dominance in our characterization of the Generalized Hurwicz Criterion in the previous blogpost. Then we obtain a slightly more demanding version of the Generalized Hurwicz Criterion that requires that each of your Hurwicz weights is strictly positive. And if we do that, then it's no longer true that any probabilistic credence function can be rationalised in this way. Only regular ones can be. So that plausible strengthening of Weak Dominance leads us to an argument for Regularity, the principle that says that you should not assign extremal credences to propositions that are not tautologies or contradictions.

Wednesday, 15 July 2020

A Generalised Hurwicz Criterion

Here's a PDF of this blogpost.

In yesterday's post, I discussed Leonid Hurwicz's Criterion of Realism. This is a decision rule for situations in which your evidence is so sparse and your uncertainty so great that you cannot assign probabilities to the possible states of the world. It requires you to pick a real number $0 \leq \alpha \leq 1$ and then assign to each option or act the weighted average of its best-case utility and its worst-case utility, where $\alpha$ is the weight for the former and $1-\alpha$ is the weight for the latter. That is, given an option defined on a set of possible worlds $W$, let$$H^\alpha(a) := \alpha \max_{w \in W} a(w) + (1-\alpha) \min_{w \in W} a(w)$$The Hurwicz Criterion says: Pick an option that maximises $H^\alpha$.

In that post, I also mentioned Hurwicz's own characterization of a class of preference orderings to which the orderings determined by $H^\alpha$ belong, namely, the class of all preference orderings that are indifferent between two options with the same minimum and maximum utilities. And I pointed out that there is something troubling about it: if you strengthen Hurwicz's Dominance axiom in a natural way, his axioms are no longer consistent. (As I mention in an update at the beginning of that blog, after posting it, Johan Gustafsson sent me a draft paper of his that makes this same point.) To see this, let $W = \{w_1, w_2, w_3\}$, and suppose $a$ and $a'$ are options defined on $W$ with
$$
\begin{array}{r|ccc}
& w_1 & w_2 & w_3\\
\hline
a & m & n & M \\
a' & m & n' & M
\end{array}
$$where $m < n < n' < M$. Then $H^\alpha(a) = H^\alpha(a')$, for any $0 \leq \alpha \leq 1$, but $a'$ weakly dominates $a$---that is, $a(w_i) \leq a'(w_i)$ for all $i = 1, 2, 3$ and $a(w_i) < a'(w_i)$ for some $i = 1, 2, 3$. That is, the Hurwicz Criterion is incompatible with a natural Dominance principle that says that if one option is always at least as good as another and sometimes better than it, it should be strictly preferred. And indeed all of the other rules in the class that Hurwicz characterised suffer from the same problem. Gustafsson uses this to mount an argument against one of Hurwicz's axioms, namely, the one that I called Coarse-Graining Invariance in the previous blogpost.
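The counterexample is easy to check by machine too. A tiny sketch (mine, for illustration):

```python
def hurwicz(utilities, alpha):
    # alpha weights the best case, 1 - alpha the worst; everything in
    # between is ignored, which is exactly the problem.
    return alpha * max(utilities) + (1 - alpha) * min(utilities)

m, n, n_prime, M = 0, 1, 2, 3          # any m < n < n' < M will do
a, a_prime = (m, n, M), (m, n_prime, M)
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    assert hurwicz(a, alpha) == hurwicz(a_prime, alpha)
print("equal Hurwicz scores at every alpha, yet a' weakly dominates a")
```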

How to address this? A natural response is to note that the Hurwicz Criterion was too narrow in its view. While it was broader than Wald's Maximin, which was also in the air at the time, and its dual, Maximax, it still only considered best- and worst-case utilities. What of the rest? Of course, Hurwicz thought that his narrow focus was justified by his axioms. But if you are moved by the argument against those axioms that Gustafsson and I favour, that justification evaporates. Thus, I introduce what I'll call the Generalised Hurwicz Criterion as follows, where we are considering only options defined on a finite set of possible states of the world $W = \{w_1, \ldots, w_n\}$:

Generalized Hurwicz Criterion (GHC) Pick $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\alpha_1 + \ldots + \alpha_n = 1$, and denote this sequence of weights $A$. If $a$ is defined on $W$ and$$a(w_{i_1}) \leq a(w_{i_2}) \leq \ldots \leq a(w_{i_n})$$then let$$H^A(a) = \alpha_1a(w_{i_1}) + \ldots + \alpha_na(w_{i_n})$$Pick an option that maximises $H^A$.

Note: it's important to reassure yourself that $H^A$ is well-defined in this statement. (It is: if two worlds receive the same utility, the admissible orderings differ only by swapping equal summands, so they all determine the same value.) Now, the next thing is to offer a characterization of GHC. Recall, Hurwicz himself didn't actually characterise his Criterion, but rather a larger class of which it was a member. However, John Milnor, at that point just shortly out of his undergraduate degree and embarking on his doctoral work in knot theory, did provide a characterization in his paper 'Games against Nature'. We'll build on that result here to characterise GHC. First, two pieces of notation:
  • If $a$, $a'$ are options, then define $a + a'$ to be the option with $(a + a')(w) = a(w) + a'(w)$, for all $w$ in $W$.
  • If $a$ is an option and $m$ is a real number, then define $ma$ to be the option with $ma(w) = m \times a(w)$, for all $w$ in $W$.
And here are the axioms, which govern a preference ordering $\preceq$ over the options defined on $W$:

(A1) Structure $\preceq$ is reflexive and transitive.

(A2) Weak Dominance
  1. If $a(w) \leq a'(w)$ for all $w$ in $W$, then $a \preceq a'$.
  2. If $a(w) < a'(w)$ for all $w$ in $W$, then $a \prec a'$.
(A3) Permutation Invariance For any set of worlds $W$ and any options $a, a'$ defined on $W$, if $\pi : W \cong W$ is a permutation of the worlds in $W$ and if $a'(w) = a(\pi(w))$ for all $w$ in $W$, then $a \sim a'$.

(A4) Continuity Suppose $a_1, a_2, \ldots$ is a sequence of options that converges on $a$ in the limit. Then, if $a_i \prec a'$ for all $i = 1, 2, \ldots$, then $a \preceq a'$.

(A5) Linearity If $m > 0$ and $a \sim a'$, then $ma \sim ma'$.

(A6) Summation If $a \sim b$ and $a' \sim b'$, then $a + a' \sim b + b'$.

Theorem Suppose (A1-6). Then there is some sequence $A$ of weights $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\alpha_1 + \ldots + \alpha_n = 1$ such that $a \preceq a'$ iff $H^A(a) \leq H^A(a')$.

Before we continue to the proof, one point to note: as I noted in the previous post and Johan Gustafsson noted in his draft paper, if we strengthen Weak Dominance, Hurwicz's axioms are inconsistent. However, if we strengthen it here, the axioms remain consistent, and they characterise a slightly narrower version of the Generalized Hurwicz Criterion in which all weights must be non-zero.

Proof of Theorem. First, we determine the weights. We do this in $n$ steps. Here, we denote an option by the $n$-tuple of its utility values at the $n$ different worlds. Thus, $a = (a(w_1), \ldots, a(w_n))$.
  • If $\beta_n$ is the supremum of the set$$\{\alpha : (\alpha, \ldots, \alpha) \preceq (0, \ldots, 0, 1)\}$$then let $\alpha_n = \beta_n$;
  • If $\beta_{n-1}$ is the supremum of the set$$\{\alpha : (\alpha, \ldots, \alpha) \preceq (0, \ldots, 0, 1, 1)\}$$then let $\alpha_{n-1} = \beta_{n-1}- \alpha_n$;
  • If $\beta_{n-2}$ is the supremum of the set$$\{\alpha : (\alpha, \ldots, \alpha) \preceq (0, \ldots, 0, 1, 1, 1)\}$$then let $\alpha_{n-2} = \beta_{n-2} - \alpha_n - \alpha_{n-1}$;
  • and so on until...
  • If $\beta_1$ is the supremum of the set$$\{\alpha : (\alpha, \ldots, \alpha) \preceq (1, \ldots, 1, 1)\}$$then let $\alpha_1 = \beta_1 - \alpha_n - \alpha_{n-1} - \ldots - \alpha_2$.
Now, by Weak Dominance, for each $i = 1, \ldots, n$, $0 \leq \beta_i \leq 1$, since $(0, \ldots, 0) \preceq (0, \ldots, 1, \ldots, 1) \preceq (1, \ldots, 1)$. What's more, and again by Weak Dominance, $\beta_i \leq \beta_{i-1}$. Thus, $\alpha_i \geq 0$. What's more, $\beta_1 = 1$, so $\alpha_1 + \ldots + \alpha_n = 1$.
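To see the construction in action, here's a sketch (mine) in which the preference ordering is secretly generated by known weights. Under that ordering, each supremum $\beta_k$ is just the Hurwicz score of the corresponding option $(0, \ldots, 0, 1, \ldots, 1)$, since $(\alpha, \ldots, \alpha)$ scores $\alpha$; the construction then returns exactly the hidden weights:

```python
import numpy as np

def ghc(utilities, alphas):
    # Ascending convention, as in the statement of GHC above: alpha_1
    # weights the worst case, alpha_n the best.
    return float(np.dot(alphas, np.sort(np.asarray(utilities, dtype=float))))

hidden = [0.1, 0.2, 0.3, 0.4]   # the weights we pretend not to know
n = len(hidden)

# beta_k is the constant alpha with (alpha, ..., alpha) ~ (0, ..., 0, 1, ..., 1),
# where the option on the right has n - k + 1 ones; here we read it straight
# off the Hurwicz score of that option.
betas = [ghc([0] * (k - 1) + [1] * (n - k + 1), hidden) for k in range(1, n + 1)]
alphas = [betas[k] - (betas[k + 1] if k + 1 < n else 0) for k in range(n)]
print([round(a, 10) for a in alphas])   # [0.1, 0.2, 0.3, 0.4]
```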

What's more, by Continuity:
\begin{eqnarray*}
(\alpha_n, \ldots, \alpha_n) & \sim & (0, 0, \ldots, 0, 1) \\
(\alpha_n + \alpha_{n-1}, \ldots, \alpha_n + \alpha_{n-1}) & \sim & (0, 0, \ldots, 1, 1) \\
\vdots & \vdots & \vdots \\
(\alpha_n + \ldots + \alpha_2, \ldots, \alpha_n+ \ldots + \alpha_2) & \sim & (0, 1, \ldots, 1, 1) \\
(\alpha_n + \ldots + \alpha_1, \ldots, \alpha_n+ \ldots + \alpha_1) & \sim & (1, 1, \ldots, 1, 1) \\
\end{eqnarray*}

Now, suppose $u_1 \leq \ldots \leq u_n$, and consider the option $(u_1, \ldots, u_n)$. Then, by Linearity:

$(\alpha_n (u_n-u_{n-1}), \ldots, \alpha_n(u_n-u_{n-1})) \sim (0, 0, \ldots, 0, u_n - u_{n-1})$

$((\alpha_n + \alpha_{n-1})(u_{n-1}-u_{n-2}), \ldots, (\alpha_n + \alpha_{n-1})(u_{n-1}-u_{n-2})) \sim$
$(0, 0, \ldots, u_{n-1}-u_{n-2}, u_{n-1}-u_{n-2})$

and so on, until...

$((\alpha_n + \ldots + \alpha_2)(u_2-u_1), \ldots, (\alpha_n+ \ldots + \alpha_2)(u_2-u_1)) \sim$
$(0, u_2-u_1, \ldots, u_2-u_1, u_2-u_1)$

$((\alpha_n + \ldots + \alpha_1)u_1, \ldots, (\alpha_n+ \ldots + \alpha_1)u_1) \sim (u_1, u_1, \ldots, u_1, u_1)$

And so, by Summation:$$(u_1, \ldots, u_n) \sim (\sum_i \alpha_i u_i, \ldots, \sum_i \alpha_i u_i)$$So, if $v_1 \leq \ldots \leq v_n$ and$$\sum_i \alpha_iv_i = \sum_i \alpha_iu_i$$then$$(v_1, \ldots, v_n) \sim (\sum_i \alpha_i v_i, \ldots, \sum_i \alpha_i v_i) \sim (\sum_i \alpha_i u_i, \ldots, \sum_i \alpha_i u_i) \sim (u_1, \ldots, u_n)$$And if $\sum_i \alpha_i v_i < \sum_i \alpha_i u_i$, then, by Weak Dominance,$$(\sum_i \alpha_i v_i, \ldots, \sum_i \alpha_i v_i) \prec (\sum_i \alpha_i u_i, \ldots, \sum_i \alpha_i u_i)$$and so $(v_1, \ldots, v_n) \prec (u_1, \ldots, u_n)$. Finally, since, by Permutation Invariance, every option is indifferent to the option that lists the same utilities in ascending order, this settles the full ordering: $a \preceq a'$ iff $H^A(a) \leq H^A(a')$, as required. $\Box$

Sunday, 12 July 2020

Hurwicz's Criterion of Realism and decision-making under massive uncertainty

For a PDF version of this post, click here.

[UPDATE: After posting this, Johan Gustafsson got in touch and it seems he and I have happened upon similar points via slightly different routes. His paper is here. He takes his axioms from Binmore's Rational Decisions, who took them from Milnor's 'Games against Nature'. Hurwicz and Arrow also cite Milnor, but Hurwicz's original characterisation appeared before Milnor's paper, and he cites Chernoff's Cowles Commission Discussion Paper: Statistics No. 326A as the source of his axioms.]

In 1951, Leonid Hurwicz, a Polish-American economist who would go on to share the Nobel prize for his work on mechanism design, published a series of short notes as part of the Cowles Commission Discussion Paper series, where he introduced a new decision rule for choice in the face of massive uncertainty. The situations that interested him were those in which your evidence is so sparse that it does not allow you to assign probabilities to the different possible states of the world. These situations, he thought, fall outside the remit of Savage's expected utility theory.

The rule he proposed is called Hurwicz's Criterion of Realism or just the Hurwicz Criterion. He introduced it in the form in which it is usually stated in February 1951 in the Cowles Commission Discussion Paper: Statistics No. 356 -- the title was 'A Class of Criteria for Decision-Making under Ignorance'. The Hurwicz Criterion says that you should choose an option that maximises what I'll call its Hurwicz score, which is a particular weighted average of its best-case utility and its worst-case utility. A little more formally: We follow Hurwicz and let an option be a function $a$ from a set $W$ of possible states of the world to the real numbers $\mathbb{R}$. Now, you begin by setting the weight $0 \leq \alpha \leq 1$ you wish to assign to the best-case utility of an option, and then you assign the remaining weight $1-\alpha$ to its worst-case. Then the Hurwicz score of option $a$ is just $$H^\alpha(a) := \alpha \max_{w \in W} a(w) + (1-\alpha) \min_{w \in W} a(w)$$

However, reading his other notes in the Cowles series that surround this brief three-page note, it's clear that Hurwicz's chief interest was not so much in this particular form of decision rule, but rather in any such rule that determines the optimal choices solely by looking at their best- and worst-case scenarios. The Hurwicz Criterion is one such rule, but there are others. You might, for instance, weight the best and worst cases not by fixed constant coefficients, but by coefficients that change with the minimum and maximum values, or with the difference between them, or with their ratio. One of the most interesting contributions of the papers that surround the one in which Hurwicz gives us his Criterion is a characterization of rules that depend only on best- and worst-case utilities. Hurwicz gave rather an inelegant initial version of that characterization in Cowles Commission Discussion Paper: Statistics No. 370, published at the end of 1951 -- the title was 'Optimality Criteria for Decision-Making under Ignorance'. Kenneth Arrow then seems to have helped clean it up, and they published the new version together in the Appendix of Studies in Resource Allocation, the volume they edited and to which they contributed most of the chapters, often with co-authors. The version with Arrow is still reasonably involved, but the idea is quite straightforward, and it is remarkable how strong a restriction Hurwicz obtains from seemingly weak and plausible axioms. This really seems to me a case where axioms that look quite innocuous on their own combine in interesting ways to make trouble. So I thought it might be interesting to give a simplified version that has all the central ideas.

Here's the framework:

Possibilities and possible worlds. Let $\Omega$ be the set of possibilities. A possible world is a set of possibilities--that is, a subset of $\Omega$. And a set $W$ of possible worlds is a partition of $\Omega$. That is, $W$ presents the possibilities in $\Omega$ at a certain level of grain. So if $\Omega = \{\omega_1, \omega_2, \omega_3\}$, then $\{\{\omega_1\}, \{\omega_2\}, \{\omega_3\}\}$ is the most fine-grained set of possible worlds, but there are coarser-grained sets as well, such as $\{\{\omega_1, \omega_2\}, \{\omega_3\}\}$ or $\{\{\omega_1\}, \{\omega_2,  \omega_3\}\}$. (This is not quite how Hurwicz understands the relationship between different sets of possible states of the world -- he talks of deleting worlds rather than clumping them together, but I think this formalization better captures his idea.)

Options. For any set $W$ of possible worlds, an option defined on $W$ is simply a function from $W$ into the real numbers $\mathbb{R}$. So an option $a : W \rightarrow \mathbb{R}$ takes each world $w$ in $W$ and assigns a utility $a(w)$ to it. (Hurwicz refers to von Neumann and Morgenstern to motivate the assumption that utilities can be measured by real numbers.)

Preferences. For any set $W$ of possible worlds, there is a preference relation $\preceq_W$ over the options defined on $W$. (Hurwicz states his result in terms of optimal choices rather than preferences. But I think it's a bit easier to see what's going on if we state it in terms of preferences. There's then a further question as to which options are optimal given a particular preference ordering, but we needn't address that here.)

Hurwicz's goal was to lay down conditions on these preference relations such that the following would hold:

Hurwicz's Rule Suppose $a$ and $a'$ are options defined on $W$. Then

(H1) If
  • $\min_w a(w) = \min_w a'(w)$
  • $\max_w a(w) = \max_w a'(w)$
then $a \sim_W a'$. That is, you should be indifferent between any two options with the same maximum and minimum.

(H2) If
  • $\min_w a(w) < \min_w a'(w)$
  • $\max_w a(w) < \max_w a'(w)$
then $a \prec_W a'$. That is, you should prefer one option to another if the worst case of the first is better than the worst case of the second and the best case of the first is better than the best case of the second.

Here are the four conditions or axioms:

(A1) Structure $\preceq_W$ is reflexive and transitive.

(A2) Weak Dominance
  1. If $a(w) \leq a'(w)$ for all $w$ in $W$, then $a \preceq_W a'$.
  2. If $a(w) < a'(w)$ for all $w$ in $W$, then $a \prec_W a'$.
This is a reasonably weak version of a standard norm on preferences.

(A3) Permutation Invariance For any set of worlds $W$ and any options $a, a'$ defined on $W$, if $\pi : W \cong W$ is a permutation of the worlds in $W$ and if $a'(w) = a(\pi(w))$ for all $w$ in $W$, then $a \sim_W a'$.

This just says that it doesn't matter to you which worlds receive which utilities -- all that matters are the utilities received.

(A4) Coarse-Graining Invariance Suppose $W = \{\ldots, w_1, w_2, \ldots\}$ is a set of possible worlds and suppose $a, a'$ are options on $W$ with $a(w_1) = a(w_2)$ and $a'(w_1) = a'(w_2)$. Then let $W' = \{\ldots, w_1 \cup w_2, \ldots\}$, so that $W'$ has the same worlds as $W$ except that, instead of $w_1$ and $w_2$, it has their union. And define options $b$ and $b'$ on $W'$ as follows: $b(w_1 \cup w_2) = a(w_1) = a(w_2)$ and $b'(w_1 \cup w_2) = a'(w_1) = a'(w_2)$, and $b(w) = a(w)$ and $b'(w) = a'(w)$ for all other worlds. Then $a \sim_W a'$ iff $b \sim_{W'} b'$.

This says that if two options don't distinguish between two worlds, it shouldn't matter to you whether they are defined on a fine- or coarse-grained space of possible worlds.

Then we have the following theorem:

Theorem (Hurwicz) (A1) + (A2) + (A3) + (A4) $\Rightarrow$ (H1) + (H2).

Here's the proof. Assume (A1) + (A2) + (A3) + (A4). First, we'll show that (H1) follows. We'll sketch the proof only for the case in which $W = \{w_1, w_2, w_3\}$, since that gives all the crucial moves. So denote an act on $W$ by a triple $(a(w_1), a(w_2), a(w_3))$. Now, suppose that $a$ and $a'$ are options defined on $W$ with the same minimum, $m$, and maximum, $M$. Let $n$ be the middle value of $a$ and $n'$ the middle value of $a'$.

Now, first note that
$$(m, m, M) \sim_W (m, M, M)$$ After all, $(m, m, M) \sim_W (M, m, m)$ by Permutation Invariance. And, by Coarse-Graining Invariance, $(m, M, M) \sim_W (M, m, m)$ iff $(m, M) \sim_{W'} (M, m)$, where $W' = \{w_1, w_2 \cup w_3\}$. And, by Permutation Invariance and the reflexivity of $\sim_{W'}$, $(m, M) \sim_{W'} (M, m)$. So $(m, M, M) \sim_W (M, m, m) \sim_W (m, m, M)$, as required. And now we have, by previous results, Permutation Invariance, and Weak Dominance:
$$a \sim_W (m, n, M) \preceq_W (m, M, M) \sim_W (m, m, M) \preceq_W (m, n', M) \sim_W a'$$
and
$$a' \sim_W (m, n', M) \preceq_W (m, M, M) \sim_W (m, m, M) \preceq_W (m, n, M) \sim_W a$$
And so, by transitivity, $a \sim_W a'$. That gives (H1).

For (H2), suppose $a$ has worst case $m$, middle case $n$, and best case $M$, while $a'$ has worst case $m'$, middle case $n'$, and best case $M'$. And suppose $m < m'$ and $M < M'$. Then$$a \sim_W (m, n, M) \preceq_W (m, M, M) \sim_W (m, m, M) \prec_W (m', n', M') \sim_W a'$$as required. $\Box$

In a follow-up blog post, I'd like to explore Hurwicz's conditions (A1-4) in more detail. I'm a fan of his approach, not least because I want to use something like his decision rule within the framework of accuracy-first epistemology to understand how we select our first credences -- our ur-priors or superbaby credences (see here). But I now think Hurwicz's focus on only the worst-case and best-case scenarios is too restrictive. So I have to grapple with the theorem I've just presented. That's what I hope to do in the next post. But here's a quick observation. (A1-4), while plausible at first sight, sail very close to inconsistency. For instance, (A1), (A3), and (A4) are inconsistent when combined with a slight strengthening of (A2). Suppose we add the following to (A2) to give (A2$^\star$):

3. If $a(w) \leq a'(w)$ for all $w$ in $W$ and $a(w) < a'(w)$ for some $w$ in $W$, then $a \prec_W a'$.

Then we know from above that $(m, m, M) \sim_W (m, M, M)$, but (A2$^\star$) entails that $(m, m, M) \prec_W (m, M, M)$, which gives a contradiction.