Joyce's argument for Probabilism: the mathematics
Last time, I posted about Jim Joyce's argument for Probabilism. At its heart lies a mathematical theorem. In this post, I state this mathematical theorem and prove it; then I generalize it and prove the generalization.
The framework
Recall from last time:
- $\mathcal{F}$ is a finite set of propositions. (It is the set of propositions about which our agent has an opinion.)
- A credence function is a function $c : \mathcal{F} \rightarrow [0, 1]$. If $\mathcal{F} = \{X_1, \ldots, X_n\}$, we represent $c$ by the vector of credences it assigns to those propositions:
\[
\langle c(X_1), \ldots, c(X_n) \rangle
\]
This is essentially the representation we used last time when we plotted credence functions as points on the Euclidean plane. To avoid prolixity, I'll use $c$ to refer to the credence function and to the vector that represents it.
Under this representation, we have the following:
- Let $\mathcal{B}$ be the set of credence functions. That is, $\mathcal{B} := [0, 1]^n$.
- Let $\mathcal{P}$ be the set of probability functions on $\mathcal{F}$. So $\mathcal{P} \subseteq \mathcal{B} = [0, 1]^n$.
- The function\[ Q(c, c') := \sum_{X \in \mathcal{F}} (c(X) - c'(X))^2\] genuinely measures the squared Euclidean distance between the vectors that represent $c$ and $c'$.
Recall:
- $\mathcal{W}$ is the set of classically consistent assignments of truth-values to the propositions in $\mathcal{F}$. (Thus, we can think of $\mathcal{W}$ as the set of possible worlds grained as finely as the propositions in $\mathcal{F}$ will allow.)
- Given $w$ in $\mathcal{W}$, define $v_w : \mathcal{F} \rightarrow [0, 1]$ as follows:\[ v_w(X) = \left \{ \begin{array}{ll} 0 & \mbox{if $X$ is false at $w$} \\ 1 & \mbox{if $X$ is true at $w$} \end{array} \right. \]
- Let $\mathcal{V} := \{v_w : w \mbox{ in } \mathcal{W}\}$. Thus, under the vector representation, $\mathcal{V} \subseteq [0, 1]^n$.
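To make the vector representation concrete, here's a minimal computational sketch of the toy case in which the agenda contains just a proposition and its negation. Nothing here goes beyond the definitions above; the Python is purely illustrative:

```python
# Toy illustration of the vector representation.
# Agenda: F = {X, not-X}, so credence functions are points in [0, 1]^2.

# The omniscient credence functions: one for the world at which X is true,
# one for the world at which X is false.
V = [(1.0, 0.0),   # v_w(X) = 1, v_w(not-X) = 0
     (0.0, 1.0)]   # v_w(X) = 0, v_w(not-X) = 1

def Q(c, c_prime):
    """Squared Euclidean distance between two credence vectors."""
    return sum((x - y) ** 2 for x, y in zip(c, c_prime))

# A credence function that violates Probabilism: its credences sum to 1.3.
c = (0.6, 0.7)

for v in V:
    print(v, Q(v, c))
```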
The core theorem
With that in hand, we can state Theorem 1, the mathematical core of Joyce's argument for Probabilism. (In fact, the version we state here is slightly stronger than Joyce's; clause (2) is stronger.)
Theorem 1
- If $c \not \in \mathcal{P}$, then there is $c^* \in \mathcal{P}$ such that $Q(v, c^*) < Q(v, c)$ for all $v \in \mathcal{V}$.
- If $c \in \mathcal{P}$, then there is no $c^* \in \mathcal{B}$ with $c^* \neq c$ such that $Q(v, c^*) \leq Q(v, c)$ for all $v \in \mathcal{V}$.
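Before turning to the proof, here is a toy illustration of clause (1). Suppose $\mathcal{F} = \{X, \neg X\}$, so that $\mathcal{V} = \{(1, 0), (0, 1)\}$ and $\mathcal{P}$ is the line segment joining these two points. The credence function $c = (0.6, 0.7)$ is not in $\mathcal{P}$; its orthogonal projection onto that segment is $c^* = (0.45, 0.55)$, and
\[
\begin{array}{ll}
Q((1,0), c) = 0.4^2 + 0.7^2 = 0.65, & Q((1,0), c^*) = 0.55^2 + 0.55^2 = 0.605, \\
Q((0,1), c) = 0.6^2 + 0.3^2 = 0.45, & Q((0,1), c^*) = 0.45^2 + 0.45^2 = 0.405.
\end{array}
\]
So $c^*$ is strictly closer than $c$ to both omniscient credence functions, just as clause (1) requires.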
The first step in the proof of Theorem 1 is the following Lemma, which is due to de Finetti (1974). It gives an extremely useful characterization of the credence functions that satisfy Probabilism: they are precisely those that are convex combinations of the omniscient credence functions in $\mathcal{V}$.
Definition 1 Let $\mathcal{V}^+$ be the convex hull of $\mathcal{V}$. That is, $\mathcal{V}^+$ is the smallest convex set that contains $\mathcal{V}$.
Lemma 1 $\mathcal{P} = \mathcal{V}^+$.
Proof: Suppose, to begin with, that $\mathcal{F}$ is an algebra.
- First, we prove $\mathcal{V}^+ \subseteq \mathcal{P}$. To do this, we note two things: first, each $v \in \mathcal{V}$ is a probability function, so $\mathcal{V} \subseteq \mathcal{P}$; second, $\mathcal{P}$ is convex. Since $\mathcal{V}^+$ is the smallest convex set containing $\mathcal{V}$, it follows that $\mathcal{V}^+ \subseteq \mathcal{P}$.
- Second, we prove $\mathcal{P} \subseteq \mathcal{V}^+$. Since $\mathcal{F}$ is a finite algebra, it has atoms. Let $\mathcal{A} \subseteq \mathcal{F}$ be the set of atoms of $\mathcal{F}$. Then $\mathcal{A}$ and $\mathcal{V}$ stand in one-one correspondence: if $v \in \mathcal{V}$, there is exactly one $A_v \in \mathcal{A}$ such that $v(A_v) = 1$, and $v(A_{v'}) = 0$ for $v' \neq v$. Now suppose $c \in \mathcal{P}$. And suppose $X \in \mathcal{F}$. Then $X$ is equivalent to the disjunction of the atoms $A_v$ such that $v(X) = 1$. Thus\[c(X) = c\left (\bigvee_{v : v(X) = 1} A_v \right ) = \sum_{v : v(X) = 1} c(A_v) = \sum_{v} c(A_v)v(X)\] So\[c = \sum_{v \in \mathcal{V}} c(A_v) v\] Since the coefficients $c(A_v)$ are non-negative and sum to $c(\top) = 1$, this is a convex combination of elements of $\mathcal{V}$. That is, $c \in \mathcal{V}^+$.
$\Box$
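To illustrate the decomposition in the second half of the proof: suppose $\mathcal{F}$ is the algebra generated by three atoms $A_1, A_2, A_3$, and suppose $c(A_1) = 0.2$, $c(A_2) = 0.3$, $c(A_3) = 0.5$. Writing $v_i$ for the omniscient credence function with $v_i(A_i) = 1$, we have
\[
c = 0.2\,v_1 + 0.3\,v_2 + 0.5\,v_3
\]
which is a convex combination, since the coefficients are non-negative and sum to 1. For instance, if $X = A_1 \vee A_2$, then $c(X) = 0.5 = 0.2\,v_1(X) + 0.3\,v_2(X) + 0.5\,v_3(X)$.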
This characterization of $\mathcal{P}$, the set of probability functions on $\mathcal{F}$, makes our lives a lot easier, since there is a vast array of results about the behaviour of convex sets. Here are the two that are important for our purposes:
Lemma 2 Suppose $\mathcal{X} \subseteq \mathbb{R}^n$ is closed and convex. Then if $x \not \in \mathcal{X}$, there is $x^* \in \mathcal{X}$ such that $Q(y, x^*) < Q(y, x)$ for all $y \in \mathcal{X}$.
Lemma 3 Suppose $\mathcal{X} \subseteq \mathbb{R}^n$. Then, if $x, y \in \mathcal{X}^+$ and $x \neq y$, there is $z \in \mathcal{X}$ such that $Q(z, x) < Q(z, y)$.
We can now see how Lemmas 1, 2, and 3 combine to give Theorem 1. By Lemma 1, $\mathcal{P} = \mathcal{V}^+$. So, by Lemma 2, if $c$ lies outside $\mathcal{V}^+$, there is a point $c^*$ inside $\mathcal{V}^+$ that is closer to every point of $\mathcal{V}^+$ than $c$ is; in particular, $c^*$ is closer to every $v$ in $\mathcal{V}$ than $c$ is. This gives Theorem 1(1). Next, suppose $c$ is in $\mathcal{V}^+$ and $c^* \neq c$. If $c^*$ is also in $\mathcal{V}^+$, then Lemma 3 gives a $v$ in $\mathcal{V}$ such that $c$ is closer to $v$ than $c^*$ is, so $c^*$ does not even weakly dominate $c$. If $c^*$ lies outside $\mathcal{V}^+$, then, by Lemma 2, there is $c^{**}$ in $\mathcal{V}^+$ that is closer than $c^*$ to every $v$ in $\mathcal{V}$; so, if $c^*$ weakly dominated $c$, then $c^{**}$ would be strictly closer than $c$ to every $v$ in $\mathcal{V}$ (and so $c^{**} \neq c$), contradicting Lemma 3 applied to $c$ and $c^{**}$. Either way, then, $c$ is not even weakly dominated. This gives Theorem 1(2). And we're done!
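Here is a rough computational sketch of clause (1) in action. It assumes numpy and scipy are available, and the helper project_onto_hull is just an illustrative name of my own: we recover a dominating credence function by minimizing squared Euclidean distance over convex combinations of the omniscient credence functions, and then check that the result is closer than $c$ to every $v$ in $\mathcal{V}$.

```python
import numpy as np
from scipy.optimize import minimize

# Three possible worlds; the agenda is the set of the three atoms themselves,
# so the omniscient credence functions are the standard basis vectors.
V = np.eye(3)

# A non-probabilistic credence function over the three atoms (sums to 1.4).
c = np.array([0.5, 0.5, 0.4])

def project_onto_hull(c, V):
    """Return the point of the convex hull of the rows of V that is closest
    to c in squared Euclidean distance, found by optimizing the weights."""
    k = V.shape[0]
    objective = lambda lam: np.sum((lam @ V - c) ** 2)
    constraints = [{"type": "eq", "fun": lambda lam: np.sum(lam) - 1.0}]
    bounds = [(0.0, 1.0)] * k
    result = minimize(objective, x0=np.full(k, 1.0 / k),
                      bounds=bounds, constraints=constraints)
    return result.x @ V

c_star = project_onto_hull(c, V)

# Theorem 1(1): the projection is strictly closer than c to every v in V.
for v in V:
    assert np.sum((v - c_star) ** 2) < np.sum((v - c) ** 2)
print("c* =", np.round(c_star, 3), "accuracy-dominates c at every world")
```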
The core theorem generalized
So far, we've been assuming that distance between credence functions is measured by Squared Euclidean Distance. Does Theorem 1 depend on that? That is, is there a broader class of alternative measures of the distance between credence functions such that Theorem 1 holds for every measure in that class? The first thing to say is that Squared Euclidean Distance is not itself a distance measure, strictly speaking: it isn't a metric, since it doesn't satisfy the triangle inequality. It is rather what statisticians call a divergence. In this section, we show that Theorem 1 holds for any Bregman divergence. Squared Euclidean Distance is a Bregman divergence, but so are the Kullback-Leibler divergence, the squared Mahalanobis distance, and the Itakura-Saito distance. And, as we will see, if an inaccuracy measure is generated by a continuous proper scoring rule, there is a Bregman divergence that differs from that inaccuracy measure only by a term that does not depend on the credence function.
Definition 2 Suppose $\mathcal{X} \subseteq \mathbb{R}^n$ is convex. Suppose $F : \mathcal{X} \rightarrow \mathbb{R}$ is strictly convex, differentiable on $\mathrm{int}(\mathcal{X})$, and such that $\nabla F$ extends to a bounded, continuous function on $\mathcal{X}$. Then the Bregman divergence generated by $F$ is
\[
d_F(y, x) := F(y) - F(x) - \langle \nabla F(x), y - x\rangle
\]
Essentially, $d_F(y, x)$ is the difference between $F(y)$ and the value at $y$ of the first-order Taylor expansion of $F$ around $x$.
Given the strict convexity of $F$, it follows that $d_F(y, x) \geq 0$, with equality iff $x = y$.
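For instance, Squared Euclidean Distance is itself the Bregman divergence generated by $F(x) = \sum^n_{i=1} x^2_i$: in that case, $\nabla F(x) = 2x$, so
\[
d_F(y, x) = \sum_i y^2_i - \sum_i x^2_i - \langle 2x, y - x \rangle = \sum_i \left( y^2_i - 2x_iy_i + x^2_i \right) = \sum_i (y_i - x_i)^2 = Q(y, x)
\]
The Kullback-Leibler divergence arises in the same way from the negative entropy $F(x) = \sum^n_{i=1} x_i \log x_i$.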
What follows are some of the crucial theorems concerning Bregman divergences. Each shows that Bregman divergences share many important geometrical features with Squared Euclidean Distance. We will be appealing to these various properties over the coming weeks.
First: Lemmas 2 and 3 hold for any Bregman divergence $d_F$:
Lemma 4 Suppose $\mathcal{X} \subseteq \mathbb{R}^n$ is closed and convex. Then if $x \not \in \mathcal{X}$, there is $x^* \in \mathcal{X}$ such that $d_F(y, x^*) < d_F(y, x)$ for all $y \in \mathcal{X}$.
Proof: This is Proposition 3 in (Predd, et al., 2009).
Lemma 5 Suppose $\mathcal{X} \subseteq \mathbb{R}^n$. Then, if $x, y \in \mathcal{X}^+$ and $x \neq y$, there is $z \in \mathcal{X}$ such that $d_F(z, x) < d_F(z, y)$.
Proof: This is proved as part (a) of Theorem 1 in (Predd, et al., 2009).
Together, these are enough to show that Theorem 1 holds for any Bregman divergence. Thus, if we can establish that distance between credence functions ought to be measured by a Bregman divergence, we can run Joyce's argument.
Second: Suppose the inaccuracy of a credence function $c$ at world $w$ is given by $d_F(v_w, c)$ for some Bregman divergence $d_F$. Then, if credence function $c_1$ is further from $c$ than $c_2$ is, $c$ will expect $c_1$ to have greater inaccuracy than it expects $c_2$ to have.
Lemma 6 Recall that, if $v$ is in $\mathcal{V}$, then $A_v$ is the unique atom of $\mathcal{F}$ such that $v(A_v) = 1$. Suppose $c \in \mathcal{P}$. Then, if $d_F(c, c_1) < d_F(c, c_2)$,
\[
\sum_{v \in \mathcal{V}} c(A_v)d_F(v, c_1) < \sum_{v \in \mathcal{V}} c(A_v)d_F(v, c_2)
\]
Proof: This is proved on page 32 of (Pettigrew, 2013).
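Here is a quick numerical sanity check of this claim, using the squared Euclidean divergence as $d_F$ on a two-world agenda; again, this is just an illustrative sketch of my own:

```python
# Toy check of Lemma 6 with d_F = squared Euclidean distance.
# Two worlds; the agenda is the pair of atoms, so V = {(1, 0), (0, 1)}.
V = [(1.0, 0.0), (0.0, 1.0)]

def d_F(y, x):
    """The Bregman divergence generated by F(x) = sum_i x_i^2."""
    return sum((yi - xi) ** 2 for yi, xi in zip(y, x))

def expected_inaccuracy(c, c_prime):
    """Expected inaccuracy of c_prime by the lights of the probabilistic c:
    the sum over v in V of c(A_v) * d_F(v, c_prime)."""
    # c[i] is the credence that c assigns to the atom A_v for v = V[i].
    return sum(c[i] * d_F(v, c_prime) for i, v in enumerate(V))

c = (0.7, 0.3)    # a probabilistic credence function
c1 = (0.6, 0.4)   # closer to c ...
c2 = (0.2, 0.8)   # ... than this one is

assert d_F(c, c1) < d_F(c, c2)
assert expected_inaccuracy(c, c1) < expected_inaccuracy(c, c2)
print(expected_inaccuracy(c, c1), expected_inaccuracy(c, c2))
```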
One consequence of this is that each probabilistic credence function expects itself to be most accurate. Theorem 2 below gives a partial converse to this.
Notice that the Brier score $B(c, w) = Q(v_w, c) = \sum_{X \in \mathcal{F}} (v_w(X) - c(X))^2$ is obtained by taking the squared difference between each credence $c(X)$ and the corresponding ideal credence $v_w(X)$ and summing the results. In this sort of situation, we say that $B$ is generated by a scoring rule.
Definition 3 A scoring rule is a function $s : \{0, 1\} \times [0, 1] \rightarrow [0, \infty]$.
The idea is that $s(0, x)$ measures the inaccuracy of having credence $x$ in a false proposition, whereas $s(1, x)$ measures the inaccuracy of having credence $x$ in a true proposition.
Definition 4 Given a scoring rule $s$, we say that $I_s$ is the inaccuracy measure generated by $s$, where
\[
I_s(c, w) = \sum_{X \in \mathcal{F}} s(v_w(X), c(X))
\]
We say that a scoring rule is proper if a given credence expects itself to be at least as accurate as it expects any other credence to be. More precisely:
Definition 5 We say that a scoring rule $s$ is proper if, for all $p \in [0, 1]$, the expected inaccuracy
\[
ps(1, x) + (1-p)s(0, x)
\]
is minimized at $x = p$.
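For instance, the quadratic scoring rule $s(1, x) = (1-x)^2$, $s(0, x) = x^2$ is proper, since
\[
p(1-x)^2 + (1-p)x^2 = x^2 - 2px + p = (x - p)^2 + p(1-p)
\]
which is minimized precisely at $x = p$.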
Then we have the following theorem, which connects the inaccuracy measures generated by proper scoring rules and Bregman divergences.
Theorem 2 Suppose $s$ is a continuous proper scoring rule. Then there is a Bregman divergence $d_F$ such that
\[
I_s(c, w) = d_F(v_w, c) + \sum_{X \in \mathcal{F}} s(v_w(X), v_w(X))
\]
Proof: This is proved as Equation (8) in (Predd, et al., 2009).
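To see Theorem 2 in the most familiar case: for the quadratic scoring rule $s(1, x) = (1-x)^2$, $s(0, x) = x^2$, which generates the Brier score $B$, the generating function is $F(x) = \sum^n_{i=1} x^2_i$ and $s(1,1) = s(0,0) = 0$, so the final sum vanishes and
\[
I_s(c, w) = \sum_{X \in \mathcal{F}} (v_w(X) - c(X))^2 = d_F(v_w, c) = Q(v_w, c)
\]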
References
- de Finetti, B. (1974) A Theory of Probability (vol 1) (New York: Wiley)
- Pettigrew, R. (2013) 'A New Epistemic Utility Argument for the Principal Principle' Episteme 10(1): 19-35
- Predd, J., Seiringer, R., Lieb, E. H., Osherson, D., Poor, H. V., and Kulkarni, S. (2009) 'Probabilistic Coherence and Proper Scoring Rules' IEEE Transactions on Information Theory 55(10): 4786-4792