## Monday, 4 January 2021

### Using a generalized Hurwicz criterion to pick your priors

Over the summer, I got interested in the problem of the priors again. Which credence functions is it rational to adopt at the beginning of your epistemic life? Which credence functions is it rational to have before you gather any evidence? Which credence functions provide rationally permissible responses to the empty body of evidence? As is my wont, I sought to answer this in the framework of epistemic utility theory. That is, I took the rational credence functions to be those declared rational when the appropriate norm of decision theory is applied to the decision problem in which the available acts are all the possible credence functions, and where the epistemic utility of a credence function is measured by a strictly proper measure. I considered a number of possible decision rules that might govern us in this evidence-free situation: Maximin, the Principle of Indifference, and the Hurwicz criterion. And I concluded in favour of a generalized version of the Hurwicz criterion, which I axiomatised. I also described which credence functions that decision rule would render rational in the case in which there are just three possible worlds between which we divide our credences. In this post, I'd like to generalize the results from that treatment to the case in which there any finite number of possible worlds.

Here's the decision rule (where $a(w_i)$ is the utility of $a$ at world $w_i$).

Generalized Hurwicz Criterion  Given an option $a$ and a sequence of weights $0 \leq \lambda_1, \ldots, \lambda_n \leq 1$ with $\sum^n_{i=1} \lambda_i = 1$, which we denote $\Lambda$, define the generalized Hurwicz score of $a$ relative to $\Lambda$ as follows: if $$a(w_{i_1}) \geq a(w_{i_2}) \geq \ldots \geq a(w_{i_n})$$ then $$H^\Lambda(a) := \lambda_1a(w_{i_1}) + \ldots + \lambda_na(w_{i_n})$$That is, $H^\Lambda(a)$ is the weighted average of all the possible utilities that $a$ receives, where $\lambda_1$ weights the highest utility, $\lambda_2$ weights the second highest, and so on.

The Generalized Hurwicz Criterion says that you should order options by their generalized Hurwicz score relative to a sequence $\Lambda$ of weightings of your choice. Thus, given $\Lambda$,$$a \preceq^\Lambda_{ghc} a' \Leftrightarrow H^\Lambda(a) \leq H^\Lambda(a')$$And the corresponding decision rule says that you should pick your Hurwicz weights $\Lambda$ and then, having done that, it is irrational to choose $a$ if there is $a'$ such that $a \prec^\Lambda_{ghc} a'$.

Now, let $\mathfrak{U}$ be an additive strictly proper epistemic utility measure. That is, it is generated by a strictly proper scoring rule. A strictly proper scoring rule is a function $\mathfrak{s} : \{0, 1\} \times [0, 1] \rightarrow [-\infty, 0]$ such that, for any $0 \leq p \leq 1$, $p\mathfrak{s}(1, x) + (1-p)\mathfrak{s}(0, x)$ is maximized, as a function of $x$, uniquely at $x = p$. And an epistemic utility measure is generated by $\mathfrak{s}$ if, for any credence function $C$ and world $w_i$,$$\mathfrak{U}(C, w_i) = \sum^n_{j=1} \mathfrak{s}(w^j_i, c_j)$$where

• $c_j = C(w_j)$, and
• $w^j_i = 1$ if $j=i$ and $w^j_i = 0$ if $j \neq i$

In what follows, we write the sequence $(c_1, \ldots, c_n)$ to represent the credence function $C$.

Also, given a sequence $(\alpha_1, \ldots, \alpha_k)$ of numbers, let$$\mathrm{Av}((\alpha_1, \ldots, \alpha_k)) := \frac{\alpha_1 + \ldots + \alpha_k}{k}$$That is, $\mathrm{av}(A)$ is the average of the numbers in $A$. And given $1 \leq k \leq n$, let $A|_k = (a_1, \ldots, a_k)$. That is, $A|_k$ is the truncation of the sequence $A$ that omits all terms after $a_k$. Then we say that $A$ does not exceed its average if, for each $1 \leq k \leq n$,$$\mathrm{av}(A) \geq \mathrm{av}(A|_k)$$That is, at no point in the sequence does the average of the numbers up to that point exceed the average of all the numbers in the sequence.

Theorem 1 Suppose $\Lambda = (\lambda_1, \ldots, \lambda_n)$ is a sequence of generalized Hurwicz weights. Then there is a sequence of subsequences $\Lambda_1, \ldots, \Lambda_m$ of $\Lambda$ such that

1. $\Lambda = \Lambda_1 \frown \ldots \frown \Lambda_m$
2. $\mathrm{av}(\Lambda_1) \geq \ldots \geq \mathrm{av} (\Lambda_m)$
3. each $\Lambda_i$ does not exceed its average

Then, the credence function$$(\underbrace{\mathrm{av}(\Lambda_1), \ldots, \mathrm{av}(\Lambda_1)}_{\text{length of \Lambda_1}}, \underbrace{\mathrm{av}(\Lambda_2), \ldots, \mathrm{av}(\Lambda_2)}_{\text{length of \Lambda_2}}, \ldots, \underbrace{\mathrm{av}(\Lambda_m), \ldots, \mathrm{av}(\Lambda_m)}_{\text{length of \Lambda_m}})$$maximizes $H^\Lambda(\mathfrak{U}(-))$ among credence functions $C = (c_1, \ldots, c_n)$ for which $c_1 \geq \ldots \geq c_n$.

This is enough to give us all of the credence functions that maximise $H^\Lambda(\mathfrak{U}(-))$: they are the credence function mentioned together with any permutation of it --- that is, any credence function obtained from that one by switching around the credences assigned to the worlds.

Proof of Theorem 1. Suppose $\mathfrak{U}$ is a measure of epistemic value that is generated by the strictly proper scoring rule $\mathfrak{s}$. And suppose that $\Lambda$ is the following sequence of generalized Hurwicz weights $0 \leq \lambda_1, \ldots, \lambda_n \leq 1$ with $\sum^n_{i=1} \lambda_i = 1$.

First, due to a theorem that originates in Savage and is stated and proved fully by Predd, et al., if $C$ is not a probability function---that is, if $c_1 + \ldots + c_n \neq 1$---then there is a probability function $P$ such that $\mathfrak{U}(P, w_i) > \mathfrak{U}(C, w_i)$ for all worlds $w_i$. Thus, since GHC satisfies Strong Dominance, whatever maximizes $H^\Lambda(\mathfrak{U}(-))$ will be a probability function.

Now, since $\mathfrak{U}$ is generated by a strictly proper scoring rule, it is also truth-directed. That is, if $c_i > c_j$, then $\mathfrak{U}(C, w_i) > \mathfrak{U}(C, w_j)$. Thus, if $c_1 \geq c_2 \geq \ldots \geq c_n$, then$$H^\Lambda(\mathfrak{U}(C)) = \lambda_1\mathfrak{U}(C, w_1) + \ldots + \lambda_n\mathfrak{U}(C, w_n)$$This is what we seek to maximize. But notice that this is just the expectation of $\mathfrak{U}(C)$ from the point of view of the probability distribution $\Lambda = (\lambda_1, \ldots, \lambda_n)$.

Now, Savage also showed that, if $\mathfrak{s}$ is strictly proper and continuous, then there is a differentiable and strictly convex function $\varphi$ such that, if $P, Q$ are probabilistic credence functions, then
\begin{eqnarray*}
\mathfrak{D}_\mathfrak{s}(P, Q) & = & \sum^n_{i=1} \varphi(p_i) - \sum^n_{i=1} \varphi(q_i) - \sum^n_{i=1} \varphi'(q_i)(p_i - q_i) \\
& = & \sum^n_{i=1} p_i\mathfrak{U}(P, w_i) - \sum^n_{i=1} p_i\mathfrak{U}(Q, w_i)
\end{eqnarray*}
So $C$ maximizes $H^\Lambda(\mathfrak{U}(-))$ among credence functions $C$ with $c_1 \geq \ldots \geq c_n$ iff $C$ minimizes $\mathfrak{D}_\mathfrak{s}(\Lambda, -)$ among credence functions $C$ with $c_1 \geq \ldots \geq c_n$. We now use the KKT conditions to calculate which credence functions minimize $\mathfrak{D}_\mathfrak{s}(\Lambda, -)$ among credence functions $C$ with $c_1 \geq \ldots \geq c_n$.

Thus, if we write $x_n$ for $1 - x_1 - \ldots - x_{n-1}$, then
\begin{multline*}
f(x_1, \ldots, x_{n-1}) = \mathfrak{D}((\lambda_1, \ldots, \lambda_n), (x_1, \ldots, x_n)) = \\
\sum^n_{i=1} \varphi(\lambda_i) - \sum^n_{i=1} \varphi(x_i) - \sum^n_{i=1} \varphi'(x_i)(\lambda_i - x_i)
\end{multline*}
So
\begin{multline*}
\nabla f = \langle \varphi''(x_1) (x_1 - \lambda_1) - \varphi''(x_n)(x_n - \lambda_n), \\
\varphi''(x_2) (x_2 - \lambda_2) - \varphi''(x_n)(x_n - \lambda_n), \ldots \\
\varphi''(x_{n-1}) (x_{n-1} - \lambda_{n-1}) - \varphi''(x_n)(x_n - \lambda_n) )\rangle
\end{multline*}

Let $$\begin{array}{rcccl} g_1(x_1, \ldots, x_{n-1}) & = & x_2 - x_1& \leq & 0\\ g_2(x_1, \ldots, x_{n-1}) & = & x_3 - x_2& \leq & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots \\ g_{n-2}(x_1, \ldots, x_{n-1}) & = & x_{n-1} - x_{n-2}& \leq & 0 \\ g_{n-1}(x_1, \ldots, x_{n-1}) & = & 1 - x_1 - \ldots - x_{n-2} - 2x_{n-1} & \leq & 0 \end{array}$$So,
\begin{eqnarray*}
\nabla g_1 & = & \langle -1, 1, 0, \ldots, 0 \rangle \\
\nabla g_2 & = & \langle 0, -1, 1, 0, \ldots, 0 \rangle \\
\vdots & \vdots & \vdots \\
\nabla g_{n-2} & = & \langle 0, \ldots, 0, -1, 1 \rangle \\
\nabla g_{n-1} & = & \langle -1, -1, -1, \ldots, -1,  -2 \rangle \\
\end{eqnarray*}
So the KKT theorem says that $x_1, \ldots, x_n$ is a minimizer iff there are $0 \leq \mu_1, \ldots, \mu_{n-1}$ such that$$\nabla f(x_1, \ldots, x_{n-1}) + \sum^{n-1}_{i=1} \mu_i \nabla g_i(x_1, \ldots, x_{n-1}) = 0$$That is, iff there are $0 \leq \mu_1, \ldots, \mu_{n-1}$ such that
\begin{eqnarray*}
\varphi''(x_1) (x_1 - \lambda_1) - \varphi''(x_n)(x_n - \lambda_n) - \mu_1 - \mu_{n-1} & = & 0 \\
\varphi''(x_2) (x_2 - \lambda_2) - \varphi''(x_n)(x_n - \lambda_n) + \mu_1 - \mu_2 - \mu_{n-1} & = & 0 \\
\vdots & \vdots & \vdots \\
\varphi''(x_{n-2}) (x_{n-2} - \lambda_{n-2}) - \varphi''(x_n)(x_n - \lambda_n) + \mu_{n-3} - \mu_{n-2} - \mu_{n-1}& = & 0 \\
\varphi''(x_{n-1}) (x_{n-1} - \lambda_{n-1}) - \varphi''(x_n)(x_n - \lambda_n)+\mu_{n-2} - 2\mu_{n-1} & = & 0
\end{eqnarray*}
By summing these identities, we get:
\begin{eqnarray*}
\mu_{n-1} &  = & \frac{1}{n} \sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{n-1}{n} \varphi''(x_n)(x_n - \lambda_n) \\
&= & \frac{1}{n} \sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \varphi''(x_n)(x_n - \lambda_n) \\
& = & \sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{n-1}{n}\sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i)
\end{eqnarray*}
So, for $1 \leq k \leq n-2$,
\begin{eqnarray*}
\mu_k & = & \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - k\varphi''(x_n)(x_n - \lambda_n) - \\
&& \hspace{20mm} \frac{k}{n}\sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) + k\frac{n-1}{n} \varphi''(x_n)(x_n - \lambda_n) \\
& = & \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{k}{n}\sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) -\frac{k}{n} \varphi''(x_n)(x_n - \lambda_n) \\
&= & \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{k}{n}\sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i)
\end{eqnarray*}
So, for $1 \leq k \leq n-1$,
$$\mu_k = \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{k}{n}\sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i)$$
Now, suppose that there is a sequence of subsequences $\Lambda_1, \ldots, \Lambda_m$ of $\Lambda$ such that

1. $\Lambda = \Lambda_1 \frown \ldots \frown \Lambda_m$
2. $\mathrm{av}(\Lambda_1) \geq \ldots \geq \mathrm{av}(\Lambda_m)$
3. each $\Lambda_i$ does not exceed its average.

And let $$P = (\underbrace{\mathrm{av}(\Lambda_1), \ldots, \mathrm{av}(\Lambda_1)}_{\text{length of \Lambda_1}}, \underbrace{\mathrm{av}(\Lambda_2), \ldots, \mathrm{av}(\Lambda_2)}_{\text{length of \Lambda_2}}, \ldots, \underbrace{\mathrm{av}(\Lambda_m), \ldots, \mathrm{av}(\Lambda_m)}_{\text{length of \Lambda_m}})$$Then we write $i \in \Lambda_j$ if $\lambda_i$ is in the subsequence $\Lambda_j$. So, for $i \in \Lambda_j$, $p_i = \mathrm{av}(\Lambda_j)$. Then$$\frac{k}{n}\sum^n_{i=1} \varphi''(p_i)(p_i - \lambda_i) = \frac{k}{n} \sum^m_{j = 1} \sum_{i \in \Lambda_j} \varphi''(\mathrm{av}(\Lambda_j))(\mathrm{av}(\Lambda_j) - \lambda_i) = 0$$
Now, suppose $k$ is in $\Lambda_j$. Then
\begin{multline*}
\mu_k = \sum^k_{i=1} \varphi''(p_i)(p_i - \lambda_i) = \\
\sum_{i \in \Lambda_1} \varphi''(p_i)(p_i - \lambda_i) + \sum_{i \in \Lambda_2} \varphi''(p_i)(p_i - \lambda_i) + \ldots + \\
\sum_{i \in \Lambda_{j-1}} \varphi''(p_i)(p_i - \lambda_i) + \sum_{i \in \Lambda_j|_k} \varphi''(p_i)(p_i - \lambda_i) = \\
\sum_{i \in \Lambda_j|_k} \varphi''(p_i)(p_i - \lambda_i) = \sum_{i \in \Lambda_j|_k} \varphi''(\mathrm{av}(\Lambda_j)(\mathrm{av}(\Lambda_j) - \lambda_i)
\end{multline*}
So, if $|\Lambda|$ is the length of the sequence $\Lambda$,$$\mu_k \geq 0 \Leftrightarrow |\Lambda_j|_k|\mathrm{av}(\Lambda_j) - \sum_{i \in \Lambda_j|_k} \lambda_i \geq 0 \Leftrightarrow \mathrm{av}(\Lambda_j) \geq \mathrm{av}(\Lambda_j|_k)$$But, by assumption, this is true for all $1 \leq k \leq n-1$. So $P$ minimizes $H^\Lambda(\mathfrak{U}(-))$, as required.

We now show that there is always a series of subsequences that satisfy (1), (2), (3) from above.  We proceed by induction.

Base Case  $n = 1$. Then it is clearly true with the subsequence $\Lambda_1 = \Lambda$.

Inductive Step  Suppose it is true for all sequences $\Lambda = (\lambda_1, \ldots, \lambda_n)$ of length $n$. Now consider a sequence $(\lambda_1, \ldots, \lambda_n, \lambda_{n+1})$. Then, by the inductive hypothesis, there is a sequence of sequences $\Lambda_1, \ldots, \Lambda_m$ such that

1. $\Lambda \frown (\lambda_{n+1}) = \Lambda_1 \frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})$
2. $\mathrm{av}(\Lambda_1) \geq \ldots \geq \mathrm{av} (\Lambda_m)$
3. each $\Lambda_i$ does not exceed its average.

Now, first, suppose $\mathrm{av}(\Lambda_m) \geq \lambda_{n+1}$. Then let $\Lambda_{m+1} = (\lambda_{n+1})$ and we're done.

So, second, suppose $\mathrm{av}(\Lambda_m) < \lambda_{n+1}$. Then we find the greatest $k$ such that$$\mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1}))$$Then we let $\Lambda^*_{k+1} = \Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})$. Then we can show that

1. $(\lambda_1, \ldots, \lambda_n, \lambda_{n+1}) = \Lambda_1 \frown \Lambda_2 \frown \ldots \frown \Lambda_k \frown \Lambda^*_{k+1}$.
2. Each $\Lambda_1, \ldots, \Lambda_k, \Lambda^*_{k+1}$ does not exceed average.
3. $\mathrm{av}(\Lambda_1) \geq \mathrm{av}(\Lambda_2) \geq \ldots \geq \mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda^*_{k+1})$.

(1) and (3) are obvious. So we prove (2). In particular, we show that $\Lambda^*_{k+1}$ does not exceed average. We assume that each subsequence $\Lambda_j$ starts with $\Lambda_{i_j+1}$

• Suppose $i \in \Lambda_{k+1}$. Then, since $\Lambda_{k+1}$ does not exceed average, $$\mathrm{av}(\Lambda_{k+1}) \geq \mathrm{av}(\Lambda_{k+1}|_i)$$But, since $k$ is the greatest number such that$$\mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1}))$$We know that$$\mathrm{av}(\Lambda_{k+2}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1}|_i)$$
• Suppose $i \in \Lambda_{k+2}$. Then, since $\Lambda_{k+2}$ does not exceed average, $$\mathrm{av}(\Lambda_{k+2}) \geq \mathrm{av}(\Lambda_{k+2}|_i)$$But, since $k$ is the greatest number such that$$\mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1}))$$We know that$$\mathrm{av}(\Lambda_{k+3}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+2})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+2}|_i)$$But also, from above,$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1} \frown \Lambda_{k+2}|_i)$$
• And so on.

This completes the proof. $\Box$

1. 2. 3. 