Thursday, 3 April 2014

How should we measure accuracy in epistemology? A new result

In recent formal epistemology, a lot of attention has been paid to a programme that one might call accuracy-first epistemology.  It is based on a particular account of the goodness of doxastic states: on this account, a doxastic state -- be it a full belief, a partial belief, or a comparative probability ordering -- is better the greater its accuracy; Alvin Goldman calls this account veritism.  This informal idea is often then made mathematically precise and the resulting formal account of doxastic goodness is used to draw various epistemological conclusions.

In this post, the doxastic states with which I will be concerned are credences or partial beliefs.  Such a doxastic state is represented by a single credence function $c$, which assigns a real number $0 \leq c(X) \leq 1$ to each proposition $X$ about which the agent has an opinion.  Thus, a measure of accuracy is a function $A$ that takes a credence function $c$ and a possible world $w$ and returns a number $A(c, w)$ that measures the accuracy of $c$ at $w$:  $A(c, w)$ takes values in $[-\infty, 0]$.

Beginning with Joyce 1998, a number of philosophers have given different characterisations of the legitimate measures of accuracy: Leitgeb and Pettigrew 2010; Joyce 2009; and D'Agostino and Sinigaglia 2010.  Leitgeb and Pettigrew give a very narrow characterisation, as do D'Agostino and Sinigaglia:  they agree that the so-called Brier score (or some strictly increasing transformation of it) is the only legitimate measure of accuracy.  Joyce, on the other hand, gives a much broader characterisation.  I find none of these characterisations adequate, though I won't enumerate my concerns here.  Rather, in this post, I'd like to offer a new characterisation.

Characterizing accuracy

My characterisation begins by describing two ways in which we might hope to define accuracy.  On one, the alethic account, the accuracy of a credence function $c$ at a world $w$ is its proximity to what I will call the omniscient credence function at $w$.  On the other, the calibrationist account, the accuracy of $c$ at $w$ is a combination of two factors: the first is the proximity of $c$ to what I will call the perfectly calibrated counterpart of $c$ at $w$; the second is the refinement of $c$ at $w$, which we define to be the proximity of one's perfectly calibrated counterpart to the omniscient credence function at $w$.  Thus, both accounts of accuracy are spelt out in terms of proximity.  So, in order to make each precise, we need a measure of the distance (or divergence) between credence functions.  (Of course, we also need a definition of the omniscient credence function at $w$, and the perfectly calibrated counterpart to $c$ at $w$.)

My characterisation of the legitimate accuracy measures is then given by the following claim:  If we use the same measure of distance (or divergence) to spell out the two different accounts of accuracy just described, they ought to agree; that is, they ought to assign the same accuracy whenever applied to the same credence function and world.  We'll state this more precisely below, once we've stated the two accounts more precisely.

The alethic account of accuracy

On the alethic account, the accuracy of $c$ at $w$ is the proximity of $c$ to the omniscient credence function at $w$.  What is this latter function?  It is the credence function that an omniscient agent would have at $w$.  Thus, it assigns maximal credence (i.e. 1) to all propositions that are true at $w$ and minimal credence (i.e. 0) to all propositions that are false at $w$.  Let $v_w$ be the omniscient credence function at $w$.

Next, we need a measure of proximity between credence functions.  This will be the negative of a measure of distance between credence functions.  It seems natural to think that the distance between two credence functions is determined by summing the distance between the credences they assign to each proposition on which they are defined.  Thus, suppose $d : [0, 1] \times [0, 1] \rightarrow [0, \infty]$ is a one-dimensional divergence:  that is, $d(x, y) \geq 0$ with equality iff $x = y$.  Then let $D : [0, 1]^n \times [0, 1]^n \rightarrow [0, \infty]$ be the $n$-dimensional divergence $D$ generated by $d$ in the natural way:$$D(c, c') = \sum_{X \in F} d(c(X), c'(X))$$ We say that $D$ is additive and generated by $d$.
With this in hand, we define the corresponding measure of accuracy as follows:
$$A_D(c, w) := -D(v_w, c)$$That is, on the alethic account, the accuracy of $c$ at $w$ is the negative of the distance from the omniscient credence function $v_w$ at $w$ to $c$.  Thus, the omniscient credence function at $w$ is maximally accurate at $w$, where it receives an accuracy score of $0$.

For instance, the Brier score is an accuracy measure defined in this way.  Let $d(x, y) = (x-y)^2$.  So, $D(c, c') = \sum_{X \in F} (c(X) - c'(X))^2$.  So$$A_D(c, w) = -\sum_{X \in F} (v_w(X) - c(X))^2$$But of course there are many other accuracy measures besides this that are also generated in this way.
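To make the construction concrete, here is a minimal Python sketch of an additive divergence and the alethic accuracy measure it generates. The function names and the four-proposition example are illustrative assumptions rather than part of the account; a credence function is represented simply as a list of numbers, one per proposition in $F$.

```python
from typing import Callable, Sequence

Divergence = Callable[[float, float], float]

def squared(x: float, y: float) -> float:
    """One-dimensional divergence d(x, y) = (x - y)^2, which generates the Brier score."""
    return (x - y) ** 2

def additive_divergence(d: Divergence, a: Sequence[float], b: Sequence[float]) -> float:
    """D(a, b): the sum of d over the propositions (same order in both lists)."""
    return sum(d(x, y) for x, y in zip(a, b))

def alethic_accuracy(d: Divergence, c: Sequence[float], v_w: Sequence[float]) -> float:
    """A_D(c, w) = -D(v_w, c)."""
    return -additive_divergence(d, v_w, c)

# Illustrative values (an assumption for this sketch): four propositions,
# the first two true at w.
v_w = [1, 1, 0, 0]
c = [0.5, 0.5, 0.5, 0.5]

brier_accuracy = alethic_accuracy(squared, c, v_w)          # -(4 * 0.25)
omniscient_accuracy = alethic_accuracy(squared, v_w, v_w)   # the maximum, 0
```

Swapping `squared` for another one-dimensional divergence gives a different additive accuracy measure with the same overall shape.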

This is the account of accuracy that is behind much recent work in what has come to be known as accuracy-based epistemology.

The calibrationist account of accuracy

We now turn to the alternative account of accuracy.  The idea might be put as follows:  This account agrees with the alethic account of accuracy that the best credence function to have at world $w$ is the omniscient credence function at $w$.  But it disagrees with the alethic account on the reason for this.  On the alethic account, $v_w$ is maximally accurate at $w$ because it is maximally close to the omniscient credence function, namely, itself.  According to the calibrationist account, that isn't the source of its accuracy.  On this alternative account, $v_w$ is maximally accurate because it has two accuracy-giving virtues to the maximal degree possible:  it is maximally well calibrated; and it is maximally refined.  It is the extent to which a credence function shares these virtues with $v_w$ that determines its accuracy.  Thus, the accuracy of a credence function is the sum of the extent to which it is calibrated and the extent to which it is refined.

Calibration

Roughly speaking, we want to say that a credence in a proposition is perfectly calibrated when it matches the relative frequency with which propositions of that type are true.  Thus, if a weather forecaster predicts rain with 80% chance then her prediction is perfectly calibrated if it rains on 80% of similar days.  The problem is to say when two propositions are of the same type.  We get round this problem as follows:  two propositions are of the same type for a particular agent if that agent assigns the same credence to each.  Thus, a credence function is perfectly calibrated at a world if, for each credence it assigns to a proposition, that credence is equal to the proportion of true propositions amongst the propositions to which it assigns that credence.  That is,

Definition (Perfectly calibrated) Suppose $c : F \rightarrow [0, 1]$ is a credence function on a set $F$ of propositions.  Then $c$ is perfectly calibrated at $w$ if, for each $x \in ran(c)$, $$x = \frac{|\{X \in F : c(X) = x \mbox{ and } X \mbox{ is true at } w\}|}{|\{X \in F : c(X) = x\}|}$$
This tells us when a credence function is perfectly calibrated at a world, but it does not suggest a measure of proximity to calibration that we can use to rank those that fall below perfection.  In order to do that, we define the notion of a perfectly calibrated counterpart (cf. Dawid 1982, DeGroot and Fienberg 1983).

Definition (Perfectly calibrated counterpart) Suppose $c : F \rightarrow [0, 1]$ is a credence function on $F$.  And suppose $w$ is a possible world.  Then $c^w$ is the perfectly calibrated counterpart to $c$ at $w$ if $$c^w(Z) = \frac{|\{X \in F : c(X) = c(Z) \mbox{ and } X \mbox{ is true at } w\}|}{|\{X \in F : c(X) = c(Z)\}|}$$
Given $x$ in the range of $c$, we let $c^w(x)$ be the proportion of truths amongst the propositions to which $c$ assigns $x$.

Proposition $c^w$ is perfectly calibrated at $w$.
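The two definitions and the proposition can be checked mechanically on small cases. The following Python sketch is only an illustration: the helper names and the three-proposition example are assumptions, and exact fractions are used so that credences can be compared without rounding error.

```python
from fractions import Fraction as Fr

def calibrated_counterpart(c, truth):
    """Return c^w as a list: for each proposition, the proportion of truths
    among the propositions to which c assigns the same credence."""
    cells = {}  # credence -> (cell size, number of truths in the cell)
    for x, is_true in zip(c, truth):
        size, trues = cells.get(x, (0, 0))
        cells[x] = (size + 1, trues + int(is_true))
    return [Fr(cells[x][1], cells[x][0]) for x in c]

def is_perfectly_calibrated(c, truth):
    """c is perfectly calibrated at w iff it coincides with its own counterpart."""
    return list(c) == calibrated_counterpart(c, truth)

# Hypothetical example: three propositions, only the first true at w.
c = [Fr(3, 4), Fr(3, 4), Fr(1, 4)]
truth = [True, False, False]

c_w = calibrated_counterpart(c, truth)  # half of the 3/4-cell is true, none of the 1/4-cell
```

Here `c` itself is not perfectly calibrated at the world described by `truth`, while `c_w` is, as the proposition says it should be.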

Now, if $D$ is a measure of distance between credence functions, we can define a measure of calibration: $$C_D(c, w) := -D(c^w, c)$$ Now, some philosophers have claimed that a credence is better the closer it is to perfect calibration.  That is, they agree that $v_w$ is maximally accurate at a world, but only because it is maximally calibrated.  However, as has been shown in a number of different ways, this can't be quite right (Seidenfeld 1985, Joyce 1998).  For instance, to borrow Joyce's example, consider the following set of four propositions: $X = \{X_1, X_2, X_3, X_4\}$.  And consider the following two credence functions:
• $c_1 = (0.5, 0.5, 0.5, 0.5)$
• $c_2 = (1, 1, 0.1, 0)$
And evaluate them at the possible world $w$ in which $X_1$ and $X_2$ are true and $X_3$ and $X_4$ are false.  Thus
• $v_w = (1, 1, 0, 0)$.
Then $C_D(c_1, w) > C_D(c_2, w)$.  That is, $c_1$ is better calibrated at $w$ than $c_2$.  Indeed, $c_1$ is perfectly calibrated at $w$ since 50% of the propositions to which $c_1$ assigns credence 0.5 are true at $w$.  On the other hand, while 100% of the propositions to which $c_2$ assigns credence 1 are true at $w$, and 0% of the propositions to which it assigns 0 are true at $w$, it is also the case that 0% of the propositions to which it assigns 0.1 are true at $w$.  Because of this latter fact, $c_2$ is not perfectly calibrated at $w$.  However, there is a clear sense in which $c_2$ is more accurate than $c_1$.  As Joyce points out, $c_2$ shares three of its credences with the omniscient credence function, and the other one is very close to that assigned by the omniscient credence function.
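Joyce's example can be worked through numerically. The sketch below assumes the squared-difference divergence $d(x, y) = (x-y)^2$ purely for illustration (the example itself does not depend on that choice), and the helper names are hypothetical.

```python
from fractions import Fraction as Fr

def counterpart(c, truth):
    """Perfectly calibrated counterpart c^w, as a list aligned with c."""
    cells = {}  # credence -> (cell size, number of truths)
    for x, is_true in zip(c, truth):
        size, trues = cells.get(x, (0, 0))
        cells[x] = (size + 1, trues + int(is_true))
    return [Fr(cells[x][1], cells[x][0]) for x in c]

def D(a, b):
    """Additive divergence generated by d(x, y) = (x - y)^2 (an assumed choice)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

truth = [True, True, False, False]      # X_1 and X_2 are true at w
v_w = [Fr(int(t)) for t in truth]       # omniscient credences (1, 1, 0, 0)
c1 = [Fr(1, 2)] * 4
c2 = [Fr(1), Fr(1), Fr(1, 10), Fr(0)]

calibration_c1 = -D(counterpart(c1, truth), c1)   # c1 is perfectly calibrated
calibration_c2 = -D(counterpart(c2, truth), c2)   # c2 misses only on X_3
proximity_c1 = -D(v_w, c1)                        # proximity to v_w
proximity_c2 = -D(v_w, c2)
```

On this choice of divergence, `c1` wins on calibration (0 against -1/100) but `c2` is far closer to the omniscient credence function (-1/100 against -1), which is the tension the example is meant to bring out.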

Refinement

Does this show that we ought to abandon proximity to calibration in our account of accuracy?  I think not.  There is a clear sense in which calibration captures something important about the virtue of accuracy.  Thus, it seems more sensible to conclude from the example above only that calibration is not the whole story.  What other virtue does $c_2$ have that $c_1$ lacks?  The traditional answer is this: $c_1$ is less refined than $c_2$.

What does this mean?  Intuitively, the idea is this:  Let's begin with $c_1$.  Partition the set $X = \{X_1, X_2, X_3, X_4\}$ according to the credence assigned to each proposition at $c_1$.   Since $c_1$ assigns each proposition the same credence, this gives the trivial partition $X$.  Then the refinement of $c_1$ is a function of the size of each cell of the partition and the homogeneity of the truth values of the propositions in each cell.  Thus, $c_1$ is not very refined, since its single cell is large and the truth values of the propositions in it are not at all homogeneous:  there are exactly as many truths as falsehoods.  Next, consider $c_2$.  Again, partition $X$ according to the credences assigned by $c_2$.  Thus, we obtain the following partition: $\{X_1, X_2\}$, $\{X_3\}$, $\{X_4\}$.  All three cells are perfectly homogeneous in the truth values of the propositions they contain: $X_1$ and $X_2$ are both true; and a singleton cell is always perfectly homogeneous.  Thus, $c_2$ is maximally refined.  Thus, while $c_1$ is better calibrated than $c_2$, it is less refined.  This accounts for our intuitive sense that $c_2$ is the more accurate credence function at $w$.

How are we to make this intuitive idea precise?  The truth value homogeneity of a cell is partly determined by how far the proportion of truths in that cell lies from 1 and how far it lies from 0.  Now, recall that we assumed above that our measure of distance $D$ is additive:  so $D(c, c') = \sum_{X \in F} d(c(X), c'(X))$.  Thus, we have a measure of the distance from 1 or from 0 to the proportion of truths in a cell.  And we know that the proportion of truths in the cell containing the propositions to which $c$ assigns $x$ is $c^w(x)$, as defined above.  Thus, the truth value homogeneity of a cell is partly determined by $d(1, c^w(x))$ and $d(0, c^w(x))$.

How else is it determined?  As we said above, it is also determined by the size of the cell.  Now, suppose we let $\nu_c(x)$ be the number of propositions to which $c$ assigns $x$.  Thus, $\nu_c(x)c^w(x)$ is the number of true propositions to which $c$ assigns $x$; and $\nu_c(x)(1-c^w(x))$ is the number of false propositions to which $c$ assigns $x$.  Then the following definition of the refinement of $c$ is natural:$$R_D(c, w) := -\sum_{x \in ran(c)} \left[ \nu_c(x)c^w(x)d(1, c^w(x)) + \nu_c(x)(1-c^w(x))d(0, c^w(x)) \right]$$It is then straightforward to show that this entails:$$R_D(c, w) = -D(v_w, c^w)$$Thus, the refinement of $c$ at $w$ is the proximity of the omniscient credence function at $w$ to the perfectly calibrated counterpart of $c$ at $w$.
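The identity between the cell-by-cell definition of refinement and $-D(v_w, c^w)$ can be checked on small cases. In this Python sketch refinement is taken with a negative sign, so that it is a proximity and agrees with $-D(v_w, c^w)$; the squared-difference $d$ and the function names are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction as Fr

def d(x, y):
    """Assumed one-dimensional divergence: squared difference."""
    return (x - y) ** 2

def refinement_by_cells(c, truth):
    """Minus the sum over cells of nu*p*d(1, p) + nu*(1 - p)*d(0, p), with p = c^w(x)."""
    size = Counter(c)                                       # nu_c(x)
    trues = Counter(x for x, t in zip(c, truth) if t)       # truths per cell
    total = Fr(0)
    for x, nu in size.items():
        p = Fr(trues[x], nu)                                # c^w(x)
        total += nu * p * d(1, p) + nu * (1 - p) * d(0, p)
    return -total

def refinement_by_distance(c, truth):
    """-D(v_w, c^w), summing proposition by proposition."""
    size = Counter(c)
    trues = Counter(x for x, t in zip(c, truth) if t)
    return -sum(d(int(t), Fr(trues[x], size[x])) for x, t in zip(c, truth))

truth = [True, True, False, False]
c1 = [Fr(1, 2)] * 4                       # one large, mixed cell
c2 = [Fr(1), Fr(1), Fr(1, 10), Fr(0)]     # three homogeneous cells
mixed = [Fr(3, 4), Fr(3, 4), Fr(3, 4), Fr(1, 4)]

r1 = refinement_by_cells(c1, truth)
r2 = refinement_by_cells(c2, truth)
r_mixed = refinement_by_cells(mixed, truth)
```

With these inputs `c2` receives the maximal refinement of 0, `c1` receives -1, and both ways of computing refinement agree on all three credence functions.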

Thus, according to the calibrationist account of accuracy, we measure the accuracy of a credence function $c$ at $w$ by the following function:
$$C_D(c, w) + R_D(c, w)$$That is, it is a combination of the calibration of $c$ at $w$ and the refinement of $c$ at $w$.  It is these two virtues -- maximal calibration and maximal refinement -- that make the omniscient credence function the most accurate.  The accuracy of any other credence function is measured by adding together how much of each of these virtues it has.

The combined account

Thus, we now have two accounts of accuracy:  accuracy is proximity to the omniscient credence function; accuracy is a combination of calibration and refinement, where calibration is proximity to one's perfectly calibrated counterpart and refinement ends up being the proximity of one's perfectly calibrated counterpart to the omniscient credence function.

What happens if we demand that these two accounts should coincide?  That is, what follows from this claim:

Alethic-calibration agreement Informally:

Alethic accuracy = Calibration + Refinement

Formally: for all sets of propositions $F$, for all credence functions $c : F \rightarrow [0, 1]$, and for all possible worlds $w$,$$A_D(c, w) = C_D(c, w) + R_D(c, w)$$
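Whether a given divergence satisfies this agreement condition can be spot-checked on particular credence functions. The Python sketch below is an illustration under stated assumptions: it compares the squared-difference divergence with an absolute-difference divergence $d(x, y) = |x - y|$ on one hypothetical credence function, and all names are invented for the example. A single case can refute agreement for a divergence but cannot establish it.

```python
from fractions import Fraction as Fr

def counterpart(c, truth):
    """Perfectly calibrated counterpart c^w, as a list aligned with c."""
    cells = {}  # credence -> (cell size, number of truths)
    for x, is_true in zip(c, truth):
        size, trues = cells.get(x, (0, 0))
        cells[x] = (size + 1, trues + int(is_true))
    return [Fr(cells[x][1], cells[x][0]) for x in c]

def decomposition(d, c, truth):
    """Return (A_D, C_D, R_D) for credences c at the world described by truth."""
    v_w = [Fr(int(t)) for t in truth]
    c_w = counterpart(c, truth)

    def D(a, b):
        return sum(d(x, y) for x, y in zip(a, b))

    return -D(v_w, c), -D(c_w, c), -D(v_w, c_w)

def squared(x, y):
    return (x - y) ** 2

def absolute(x, y):
    return abs(x - y)

c = [Fr(3, 4), Fr(3, 4), Fr(3, 4), Fr(1, 4)]   # hypothetical credences
truth = [True, True, False, False]

A_sq, C_sq, R_sq = decomposition(squared, c, truth)
A_abs, C_abs, R_abs = decomposition(absolute, c, truth)
```

For the squared difference the three quantities come out as -3/4, -1/12 and -2/3, so the two accounts agree here; for the absolute difference they come out as -3/2, -1/2 and -4/3, so the two accounts come apart.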
To state the answer, we need to define the notion of an additive Bregman divergence.  These are divergences introduced by the Russian mathematician, Lev M. Bregman in 1967.  The definition isn't very informative; what is important for the purposes of accuracy-based epistemology are the properties that we detail in the theorems below.  I've omitted the proofs here, since they are long and involved.  But you can find all the results here.  (For more on Bregman divergences, see this earlier post.)

Definition (Additive Bregman divergence)  Suppose $D$ is an additive $n$-dimensional divergence: that is, there is $d : [0, 1] \times [0, 1] \rightarrow [0, \infty]$ such that $d(x, y) \geq 0$ with equality iff $x = y$ and, for all $x, y \in [0, 1]^n$, $D(x, y) = \sum^n_{i=1} d(x_i, y_i)$.  Then we say that $D$ is an additive Bregman divergence if there is $\varphi : [0, 1] \rightarrow \mathbb{R}$ such that:
1. $\varphi$ is continuous, bounded, and strictly convex on $[0, 1]$;
2. $\varphi$ is continuously differentiable on $(0, 1)$;
3. For all $x, y \in [0, 1]$,$$d(y, x) = \varphi(y) - \varphi(x) - \varphi'(x)(y-x)$$where we define $\varphi'(i) = \lim_{x \rightarrow i} \varphi'(x)$ for $i = 0, 1$.
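Clause 3 can be read as a recipe: pick a suitable $\varphi$ and read off $d$. The Python sketch below applies it to two choices of $\varphi$ that are assumptions of this example, not taken from the definition: $\varphi(x) = x^2$, which yields the squared difference, and the negative binary entropy $\varphi(x) = x \ln x + (1-x) \ln(1-x)$, which yields the logarithmic score at the endpoints. The derivative of the second is only evaluated at interior points.

```python
import math

def bregman(phi, dphi):
    """Build d(y, x) = phi(y) - phi(x) - phi'(x) * (y - x) from phi and its derivative."""
    def d(y, x):
        return phi(y) - phi(x) - dphi(x) * (y - x)
    return d

# phi(x) = x^2 gives the squared difference (y - x)^2.
d_squared = bregman(lambda x: x * x, lambda x: 2 * x)

def neg_entropy(x):
    """x ln x + (1 - x) ln(1 - x), with the convention 0 ln 0 = 0."""
    return sum(t * math.log(t) for t in (x, 1 - x) if t > 0)

def neg_entropy_prime(x):
    """Derivative ln x - ln(1 - x); only defined for 0 < x < 1."""
    return math.log(x) - math.log(1 - x)

d_log = bregman(neg_entropy, neg_entropy_prime)

sq_example = d_squared(0.9, 0.4)   # should match (0.9 - 0.4)^2
log_true = d_log(1, 0.8)           # should match -ln(0.8)
log_false = d_log(0, 0.8)          # should match -ln(1 - 0.8)
```

In both cases `d(x, x)` is zero and `d` is positive elsewhere, as the definition of a divergence requires.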
Our first result is this: Alethic-calibration agreement entails that distance is measured by an additive Bregman divergence. This is a sort of converse to Theorem 4, DeGroot and Fienberg 1983.

Theorem Suppose that, for all sets of propositions $F$, for all credence functions $c : F \rightarrow [0, 1]$, and for all possible worlds $w$,$$A_D(c, w) = C_D(c, w) + R_D(c, w)$$ Then $D$ is an additive Bregman divergence.

Our next result is that, if $D$ is an additive Bregman divergence, then $A_D$ is a strictly proper scoring rule. (For more on proper scoring rules, see this earlier post.)

Theorem Suppose $D$ is an additive Bregman divergence generated by $d$.  Then define the following scoring rule: $s(1, x) = d(1, x)$ and $s(0, x) = d(0, x)$.  Then $s$ is strictly proper.  (That is, for all $p \in [0, 1]$, $$ps(1, x) + (1-p)s(0, x)$$ is minimized as a function of $x$ uniquely at $x = p$.)
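Strict propriety can be probed numerically on a grid of candidate credences. This Python sketch is a spot check rather than a proof: the grid resolution, the value $p = 0.7$, and the contrast with an absolute-difference rule (generated by $d(x, y) = |x - y|$) are all assumptions of the example.

```python
from fractions import Fraction as Fr

def brier(i, x):
    """s(1, x) = (1 - x)^2 and s(0, x) = x^2."""
    return (i - x) ** 2

def absolute(i, x):
    """s(1, x) = 1 - x and s(0, x) = x."""
    return abs(i - x)

def expected_score(s, p, x):
    """p * s(1, x) + (1 - p) * s(0, x): expected penalty of credence x by the lights of p."""
    return p * s(1, x) + (1 - p) * s(0, x)

grid = [Fr(k, 100) for k in range(101)]   # candidate credences 0, 0.01, ..., 1
p = Fr(7, 10)

best_brier = min(grid, key=lambda x: expected_score(brier, p, x))
best_absolute = min(grid, key=lambda x: expected_score(absolute, p, x))
```

On this grid the Brier rule is minimised at `x = p`, whereas the absolute-difference rule is minimised at the extreme credence 1, so it rewards overstating one's confidence.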

Our final result, due originally to Predd, et al. 2009, shows that any accuracy measure $A_D$ for which $D$ is an additive Bregman divergence can be used in Jim Joyce's nonpragmatic vindication of Probabilism:

Theorem Suppose $D$ is an additive Bregman divergence generated by $d$.  Then
1. If $c$ violates Probabilism, then there is $p$ that satisfies Probabilism such that, for all worlds $w$,$$A_D(c, w) < A_D(p, w)$$
2. If $p$ satisfies Probabilism, then there is no $c \neq p$ such that, for all worlds $w$, $$A_D(p, w) \leq A_D(c, w)$$
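The first clause can be illustrated with the smallest possible case. In this Python sketch the propositions are $X$ and its negation, the divergence is the squared difference, and the particular credences are hypothetical choices made for the example.

```python
from fractions import Fraction as Fr

def brier_accuracy(c, v_w):
    """A_D(c, w) = -D(v_w, c) for the squared-difference divergence."""
    return -sum((v - x) ** 2 for v, x in zip(v_w, c))

# Credences are listed as (credence in X, credence in not-X).
worlds = {"X true": [1, 0], "X false": [0, 1]}   # omniscient credences at each world
c = [Fr(3, 5), Fr(3, 5)]   # sums to 6/5, so it violates Probabilism
p = [Fr(1, 2), Fr(1, 2)]   # a probability function

scores = {name: (brier_accuracy(c, v), brier_accuracy(p, v))
          for name, v in worlds.items()}
p_dominates_c = all(a_c < a_p for a_c, a_p in scores.values())
```

At both worlds `c` scores -13/25 and `p` scores -1/2, so `p` is strictly more accurate than `c` however the world turns out.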

Conclusion

In sum:  Veritism is the view that the sole virtue of credences is accuracy.  In order to draw precise conclusions from this claim, we need a precise account of how to measure accuracy.  That's what I've tried to give in this post.  The claim is that there are two natural ways to define the accuracy of a credence at a world:  (i) its proximity to the omniscient credence; (ii) the sum of its calibration and its refinement.  If we assume that these two different accounts of accuracy should always agree, then we can conclude that the distance measure or divergence that underpins our accuracy measure is an additive Bregman divergence.  And once we have that, we can mobilize mathematical results about those divergences in order to ground arguments in formal epistemology such as Joyce's nonpragmatic vindication of Probabilism.

References

• D'Agostino, M. and Sinigaglia, C. (2010) Epistemic Accuracy and Subjective Probability. In: Suárez, M. and Dorato M. and Rédei, M. (eds.) EPSA Epistemology and Methodology of Science. 95-105 (Springer)
• DeGroot, M. H. and Fienberg, S. E. (1983). The Comparison and Evaluation of Forecasters. The Statistician. 32(1/2):12-22.
• Leitgeb, H. and Pettigrew, R. (2010) An Objective Justification of Bayesianism I: Measuring Inaccuracy. Philosophy of Science. 77(2):201-235
• Joyce, J. M. (1998) A Nonpragmatic Vindication of Probabilism. Philosophy of Science. 65:575-603.
• Joyce, J. M. (2009) Accuracy and Coherence: Prospects for an alethic epistemology of partial beliefs. In: Huber, F. and Schmidt-Petri C. (eds.) Degrees of Belief. (Springer)
• Predd, J., Seiringer, R., Lieb, E. H., Osherson, D., Poor, V., and Kulkarni, S. (2009). Probabilistic Coherence and Proper Scoring Rules. IEEE Transactions on Information Theory. 55(10):4786-4792

1. This is really cool stuff! I wonder, have you investigated what happens when you take accuracy not as Cd(c,w)+Rd(c,w) but X*Cd(c,w)+Y*Rd(c,w) for constants X, Y? It seems to me like in certain situations one might value one of those measures more than the other. Would all of the generalized alethic-calibration agreement give you an additive Bregman divergence and would the generated scoring rule still be strictly proper?

1. Sorry -- a second reply! Thinking about this a bit more, I see what happens. It turns out that there can be no divergence $D$ such that$$A_D(c, w) = XC_D(c, w) + YR_D(c, w)$$unless $X = Y = 1$. The reason is this: Suppose $D(x, y) = \sum_i d(x_i, y_i)$ and $A_D(c, w) = XC_D(c, w) + YR_D(c, w)$. Then it follows that $d$ has a particular form (the details of this are in the PDF I linked to that includes the proof; they're a bit buried, though): if we let $s$ be the scoring rule corresponding to $d$ (that is, $s(i, x) := d(i, x)$), then$$d(x, y) = \frac{1}{X} Exp_s(y | x) - \frac{Y}{X} Exp_s(x | x)$$(where $Exp_s(y | x) = xs(1, y) + (1-x) s(0, y)$). But then we have$$d(1, x) = \frac{1}{X}s(1, x) - \frac{Y}{X}s(1, 1) = \frac{1}{X}d(1,x)$$(since $s(1, 1) = d(1, 1) = 0$), so $X = 1$. And we also have$$0 = d(x, x) = Exp_s(x | x) - YExp_s(x | x)$$So $Y = 1$.

2. Thanks very much for this excellent question! I'm pretty sure that generalised alethic-calibration agreement wouldn't always give an additive Bregman divergence. The reason is the DeGroot and Fienberg 1983 result of which this is a sort of converse. They show that any strictly proper scoring rule can be decomposed into the sort of calibration and refinement measure that I give; and the decomposition involves the weightings (i.e. your X and Y) both being 1. So I have to argue that this weighting of the two is the only reasonable thing to do if I'm going to use my result to argue for strictly proper scoring rules. It's really useful to get that clear. Thanks!

3. It's very interesting, but are there cases in which the methodologies based on these accounts of accuracy would give different results? Could you, for example, outline a hypothetical case in which the same evidence would lead to differing measures of accuracy depending on which account of accuracy is adopted?

1. Very interesting question! Were you thinking that there might be situations in which $A_D$ and $C_D + R_D$ ought to come apart? Or were you thinking that there might be situations in which the underlying divergence $D$ is determined by features of the situation? I was certainly hoping that there wouldn't be any situations of the former sort -- the idea is that these are two different ways to get at the single notion of accuracy that's appropriate in credal epistemology. As I understand it, the latter sort of situation is considered quite a lot in the statistics literature. Different strictly proper scoring rules are suited to different situations. A really good place to start exploring that question is Imre Csiszar's 1991 paper 'Why Least Squares and Maximum Entropy?' in Annals of Statistics. Least Squares and Maximum Entropy are two different statistical inference methods that are based on maximizing or minimizing a Bregman divergence. Csiszar begins with an axiomatization of all Bregman divergences; he then extends the axiomatization in two ways -- the first gives a characterization of the Bregman divergence that gives rise to Least Squares; the second gives a characterization of the Bregman divergence that gives rise to Maximum Entropy.

4. Thanks for a fascinating post! After reading it I started wondering about the 'opposite' direction: for which scoring rules it is guaranteed that Alethic accuracy = Calibration + Refinement? Richard already addressed this in the 2nd comment. Now, alright, the Brier Score can be decomposed into calibration and refinement (this is I guess Murphy (1973)). Richard says "DeGroot and Fienberg (1983) [...] show that [...] strictly proper scoring rule can be decomposed into the sort of calibration and refinement measure that I give; and the decomposition involves the weightings (i.e. your X and Y) both being 1." Of course, I have to read DeGroot and Fienberg's paper carefully to see how the proof goes, but please allow me at this point to express my concerns as to whether this is the whole story. (And please correct me if what I write is completely misguided).

The "Alethic accuracy = Calibration + Refinement" seems to boil down to this: for your chosen distance between n-tuples of real numbers from [0,1], which we label D:

-D(v_w,c)=-D(c^w,c)-D(v_w,c^w)

Unfortunately I don't see yet how this recipe would be applicable for some proper scoring rules. Squared Euclidean distance can be employed for two n-tuples one of which consists entirely of 0s and 1s (is a representation of a possible world): in that case it can serve to form a scoring rule. But it also can serve to form a distance measure between any two arbitrary n-tuples of real numbers from [0,1]. Therefore all 3 expressions in the above equality make sense if D is the squared Euclidean distance and we're thinking of the Brier Score as our scoring rule.

But there are other proper scoring rules for which I don't see how we could do this. Consider the logarithmic scoring rule; take a proposition A for which your credence is C(A); if in the given world A is true, take ln(C(A)); if it is false, take ln(1-C(A)). If we want to conceive of this as a distance measure between n-tuples of real numbers from [0,1], one of these n-tuples has to consist entirely of 0s and 1s. But the 'middle' expression of the above equality contains D(c^w,c), and it is by no means guaranteed that c^w will turn out to be like that. And so it would seem we cannot answer the question whether the equality holds if our scoring rule is the logarithmic one.

Since this seems to contradict what Richard wrote in the 2nd comment, and in particular the DG&F result, I must be not getting something. But the worry I express above seems to be pretty basic, so I'd hope people more familiar with the topic will point me in the right direction. Thanks in advance, and thanks again for a stimulating post!

1. Good point, Leszek! I should have been clearer in my gloss of DeGroot and Fienberg's theorem above. The point that underlies so much of this area is that, if $s$ is a strictly proper scoring rule, we can define a divergence as follows:$$d_s(x, y) = Exp_s(y | x) - Exp_s(x | x)$$DeG&F use that to define the measure of calibration corresponding to a given scoring rule as well as the measure of refinement. And then they prove that, if $s$ is strictly proper then the total score of a credence function $c$ at a world $w$ is the sum of the calibration of $c$ at $w$ and the refinement of $c$ at $w$.
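As an illustration of how $d_s$ extends a scoring rule to arbitrary pairs of credences, here is a minimal Python sketch for the logarithmic rule. It takes the rule as a penalty, $s(1, x) = -\ln x$ and $s(0, x) = -\ln(1-x)$, which is an assumed sign convention; the function names are hypothetical, and terms with zero weight are skipped so that $0 \cdot \infty$ never arises.

```python
import math

def log_score(i, x):
    """Logarithmic penalty: -ln(x) if the proposition is true (i = 1), else -ln(1 - x)."""
    return -math.log(x if i == 1 else 1 - x)

def expected(s, y, x):
    """Exp_s(y | x) = x * s(1, y) + (1 - x) * s(0, y), skipping zero-weight terms."""
    return sum(weight * s(i, y) for i, weight in ((1, x), (0, 1 - x)) if weight > 0)

def d_s(s, x, y):
    """Divergence generated by s: Exp_s(y | x) - Exp_s(x | x)."""
    return expected(s, y, x) - expected(s, x, x)

at_truth = d_s(log_score, 1, 0.8)      # first argument omniscient: recovers s(1, 0.8)
between = d_s(log_score, 0.5, 0.8)     # first argument an intermediate credence
```

When the first argument is 0 or 1 this returns the original score, and when it is an intermediate value, as $c^w$ typically is, it returns the binary Kullback-Leibler divergence, so $D(c^w, c)$ is well defined for the logarithmic rule too.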