Updating by minimizing expected inaccuracy -- or: my new favourite scoring rule

One of the central questions of Bayesian epistemology concerns how you should update your credences in response to new evidence you obtain. The proposal I want to discuss here belongs to an approach that consists of two steps. First, we specify the constraints that your evidence places on your posterior credences. Second, we specify a means by which to survey the credence functions that satisfy those constraints and pick one to adopt as your posterior.

For instance, in the first step, we might say that when we learn a proposition $E$, we must become certain of it, and so it imposes the following constraint on our posterior credence function $Q$: $Q(E) = 1$. Or we might consider the sort of situation Richard Jeffrey discussed, where there is a partition $E_1, \ldots, E_m$ and credences $q_1, \ldots, q_m$ with $q_1 + \cdots + q_m = 1$ such that your evidence imposes the constraint: $Q(E_i) = q_i$, for $i = 1, \ldots, m$. Or the situation van Fraassen discussed, where your evidence constrains your posterior conditional credences, so that there is a credence $q$ and propositions $A$ and $B$ such that your evidence imposes the constraint: $Q(A \mid B) = q$.

In the second step of the approach, on the other hand, we might follow objective Bayesians like Jon Williamson, Alena Vencovská, and Jeff Paris and say that, from among those credence functions that respect your evidence, you should pick the one that, on a natural measure of informational content, contains minimal information, and which thus goes beyond your evidence as little as possible (Paris & Vencovská 1990, Williamson 2010). Or we might follow what I call the method of minimal mutilation proposed by Persi Diaconis and Sandy Zabell and pick the credence function among those that respect the evidence that is closest to your prior according to some measure of divergence between probability functions (Diaconis & Zabell 1982). Or you might proceed as Hannes Leitgeb and I suggested and pick the credence function that minimizes expected inaccuracy from the point of view of your prior, while satisfying the constraints the evidence imposes (Leitgeb & Pettigrew 2010). In this post, I'd like to fix a problem with the latter proposal.

We'll focus on the simplest case: you learn $E$ and this requires you to adopt a posterior $Q$ such that $Q(E) = 1$. This is also the case in which the norm governing the update is least controversial. The largely undisputed norm in this case says that you should conditionalize your prior on your evidence, so that, if $P$ is your prior and $P(E) > 0$, then your posterior should be $Q(-) = P(-\mid E)$. That is, providing you assigned a positive credence to $E$ before you learned it, your credence in a proposition $X$ after learning $E$ should be your prior credence in $X$ conditional on $E$.
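For concreteness, here's a minimal numerical sketch of what Conditionalization does in a toy case; the prior values and the choice of $E$ are purely illustrative.

```python
import numpy as np

# A hypothetical prior over four worlds w1..w4 (the numbers are purely illustrative).
prior = np.array([0.1, 0.2, 0.3, 0.4])

# Suppose the evidence E = {w1, w2}, encoded as a boolean indicator over the worlds.
E = np.array([True, True, False, False])

# Conditionalization: zero out the non-E worlds and renormalize by P(E).
posterior = np.where(E, prior, 0.0) / prior[E].sum()
print(posterior)   # [0.333... 0.666... 0. 0.]
```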

In order to make the maths as simple as possible, let's assume you assign credences to a finite set of worlds $\{w_1, \ldots, w_n\}$, which forms a partition of logical space. Given a credence function $P$, we write $p_i$ for $P(w_i)$, and we'll sometimes represent $P$ by the vector $(p_1, \ldots, p_n)$. Let's suppose further that your measure of the inaccuracy of a credence function is $I$, which is generated additively from a scoring rule $s$. That is,
  • $s_1(x)$ measures the inaccuracy of credence $x$ in a truth;
  • $s_0(x)$ measures the inaccuracy of credence $x$ in a falsehood;
  • $I(P, w_i) = s_0(p_1) + \cdots + s_0(p_{i-1}) + s_1(p_i) + s_0(p_{i+1}) + \cdots + s_0(p_n)$.
Hannes and I then proposed that, if $P$ is your prior, you should adopt as your posterior the credence function $Q$ such that
  1. $Q(E) = 1$;
  2. for any other credence function $Q'$ for which $Q'(E) = 1$, the expected inaccuracy of $Q$ by the lights of $P$ is less than the expected inaccuracy of $Q'$ by the lights of $P$.
Throughout, we'll denote the expected inaccuracy of $Q$ by the lights of $P$, when inaccuracy is measured by $I$, as $\mathrm{Exp}_I(Q \mid P)$. Thus,
$$\mathrm{Exp}_I(Q \mid P) = \sum_{i=1}^n p_i I(Q, w_i).$$
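In code, the additive recipe and the expectation it induces might be sketched as follows; the function names are mine, and `s0`, `s1` stand in for whichever scoring rule you adopt.

```python
def inaccuracy(s0, s1, Q, i):
    """I(Q, w_i): score the credence in the true world w_i with s1, every other credence with s0."""
    return s1(Q[i]) + sum(s0(Q[j]) for j in range(len(Q)) if j != i)

def expected_inaccuracy(s0, s1, Q, P):
    """Exp_I(Q | P) = sum_i p_i * I(Q, w_i)."""
    return sum(P[i] * inaccuracy(s0, s1, Q, i) for i in range(len(P)))

# Illustration with the quadratic (Brier) scoring rule discussed just below:
P = [0.5, 0.2, 0.3]
Q = [0.6, 0.4, 0.0]
print(expected_inaccuracy(lambda x: x**2, lambda x: (1 - x)**2, Q, P))   # 0.76
```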
At this point, however, a problem arises. There are two inaccuracy measures that tend to be used in statistics and accuracy-first epistemology. The first is the Brier inaccuracy measure $B$, which is generated by the quadratic scoring rule $q$:
$$q_0(x) = x^2 \qquad \text{and} \qquad q_1(x) = (1-x)^2.$$
So
$$B(P, w_i) = 1 - 2p_i + \sum_{j=1}^n p_j^2.$$
The second is the local log inaccuracy measure $L$, which is generated by what I'll call here the basic log score $l$:
$$l_0(x) = 0 \qquad \text{and} \qquad l_1(x) = -\log x.$$
So
$$L(P, w_i) = -\log p_i.$$
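As a quick sanity check on these two measures, here's a small sketch with an arbitrary illustrative credence function:

```python
import numpy as np

def brier_inaccuracy(P, i):
    # B(P, w_i) = (1 - p_i)^2 + sum over j != i of p_j^2
    return (1 - P[i])**2 + sum(P[j]**2 for j in range(len(P)) if j != i)

def local_log_inaccuracy(P, i):
    # L(P, w_i) = -log p_i  (the falsehood score l_0 is identically 0)
    return -np.log(P[i])

P = np.array([0.2, 0.3, 0.5])
print(brier_inaccuracy(P, 0), 1 - 2*P[0] + np.sum(P**2))   # both 0.98
print(local_log_inaccuracy(P, 0), -np.log(0.2))            # both ~1.609
```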
The problem is that both have undesirable features for this purpose: the Brier inaccuracy measure does not deliver Conditionalization when you take the approach Hannes and I described; the local log inaccuracy measure does give Conditionalization, but while it is strictly proper in a weak sense, the basic log score that generates it is not; and relatedly, but more importantly, the local log inaccuracy measure does not furnish an accuracy dominance argument for Probabilism. Let's work through this in more detail.

According to the standard Bayesian norm of Conditionalization, if $P$ is your prior and $P(E) > 0$, then your posterior after learning at most $E$ should be $Q(-) = P(-\mid E)$. That is, when I remove all credence from the worlds at which my evidence is false, in order to respect my new evidence, I should redistribute it to the worlds at which my evidence is true in proportion to my prior credence in those worlds.

Now suppose that I update instead by picking the posterior Q for which Q(E)=1 and that minimizes expected inaccuracy as measured by the Brier inaccuracy measure. Then, at least in most cases, when I remove all credence from the worlds at which my evidence is false, in order to respect my new evidence, I redistribute it equally to the worlds at which my evidence is true---not in proportion to my prior credence in those worlds, but equally to each, regardless of my prior attitude.

Here's a quick illustration in the case in which you distribute your credences over three worlds, $w_1$, $w_2$, $w_3$, and the proposition you learn is $E = \{w_1, w_2\}$. We want to find a posterior $Q = (x, 1-x, 0)$ with minimal expected Brier inaccuracy from the point of view of the prior $P = (p_1, p_2, p_3)$. Now,
$$\mathrm{Exp}_B((x, 1-x, 0) \mid (p_1, p_2, p_3)) = p_1\left[(1-x)^2 + (1-x)^2 + 0^2\right] + p_2\left[x^2 + x^2 + 0^2\right] + p_3\left[x^2 + (1-x)^2 + 1\right].$$
Differentiating this with respect to $x$ gives $-4p_1 + 4x - 2p_3$, which equals $0$ iff $x = p_1 + \frac{p_3}{2}$. Thus, providing $p_1 + \frac{p_3}{2}, p_2 + \frac{p_3}{2} \leq 1$, the posterior that minimizes expected Brier inaccuracy while respecting the evidence is
$$Q = \left(p_1 + \tfrac{p_3}{2},\; p_2 + \tfrac{p_3}{2},\; 0\right),$$
and this is typically not what Conditionalization demands.
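Here's a small numerical check of that calculation, with an arbitrary illustrative prior and a simple grid search standing in for the calculus:

```python
import numpy as np

p1, p2, p3 = 0.5, 0.2, 0.3               # an arbitrary illustrative prior over w1, w2, w3

def exp_brier(x):
    # Expected Brier inaccuracy of Q = (x, 1-x, 0) by the lights of (p1, p2, p3)
    return p1 * 2*(1 - x)**2 + p2 * 2*x**2 + p3 * (x**2 + (1 - x)**2 + 1)

xs = np.linspace(0.0, 1.0, 100001)
print(xs[np.argmin(exp_brier(xs))])      # 0.65 = p1 + p3/2: equal redistribution
print(p1 / (p1 + p2))                    # ~0.714: what Conditionalization demands
```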

Now turn to the local log measure, $L$. Here, things are actually a little complicated by the fact that $\log 0 = -\infty$. After all,
$$\mathrm{Exp}_L((x, 1-x, 0) \mid (p_1, p_2, p_3)) = -p_1\log x - p_2\log(1-x) - p_3\log 0,$$
and this is $\infty$ regardless of the value of $x$. So every value of $x$ minimizes, and indeed maximizes, this expectation. As a result, we have to look at the situation in which the evidence imposes the constraint $Q(E) = 1 - \varepsilon$ for $\varepsilon > 0$, and ask what happens as we let $\varepsilon$ approach $0$. Then
$$\mathrm{Exp}_L((x, 1-\varepsilon-x, \varepsilon) \mid (p_1, p_2, p_3)) = -p_1\log x - p_2\log(1-\varepsilon-x) - p_3\log\varepsilon.$$
Differentiating this with respect to $x$ gives
$$-\frac{p_1}{x} + \frac{p_2}{1-\varepsilon-x},$$
which equals $0$ iff
$$x = (1-\varepsilon)\frac{p_1}{p_1 + p_2}.$$
And this approaches Conditionalization as ε approaches 0. So, in this sense, as Ben Levinstein pointed out, the local log inaccuracy measure gives Conditionalization, and indeed Jeffrey Conditionalization or Probability Kinematics as well (Levinstein 2012). So far, so good.
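And here's a numerical sketch of the limiting behaviour, again with an arbitrary illustrative prior:

```python
import numpy as np

p1, p2, p3 = 0.5, 0.2, 0.3               # an arbitrary illustrative prior over w1, w2, w3

def exp_local_log(x, eps):
    # Expected local log inaccuracy of Q = (x, 1-eps-x, eps) by the lights of the prior
    return -p1*np.log(x) - p2*np.log(1 - eps - x) - p3*np.log(eps)

for eps in [0.1, 0.01, 0.001]:
    xs = np.linspace(1e-6, 1 - eps - 1e-6, 200001)
    print(eps, xs[np.argmin(exp_local_log(xs, eps))])
# The minimizer approaches p1/(p1 + p2) = 0.714... as eps shrinks,
# matching the closed form x = (1 - eps) * p1 / (p1 + p2) derived above.
```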

However, throughout this post, and in the two derivations above---the first concerning the Brier inaccuracy measure and the second concerning the local log inaccuracy measure---we assumed that all credence functions must be probability functions. That is, we assumed Probabilism, the other central tenet of Bayesianism alongside Conditionalization. Now, if we measure inaccuracy using the Brier measure, we can justify that, for then we have the accuracy dominance argument, which originated mathematically with Bruno de Finetti, and was given its accuracy-theoretic philosophical spin by Jim Joyce (de Finetti 1974, Joyce 1998). That is, if your prior or your posterior isn't a probability function, then there is an alternative that is and that is guaranteed to be more Brier-accurate. However, the local log inaccuracy measure doesn't furnish us with any such argument. One very easy way to see this is to note that the non-probabilistic credence function $(1, 1, \ldots, 1)$ over $\{w_1, \ldots, w_n\}$ dominates all other credence functions according to the local log measure. After all, $L((1, 1, \ldots, 1), w_i) = -\log 1 = 0$, for $i = 1, \ldots, n$, while $L(P, w_i) = -\log p_i > 0$ whenever $p_i < 1$.
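A tiny numerical illustration of that observation:

```python
import numpy as np

# L(Q, w_i) = -log q_i = log(1/q_i).  The non-probabilistic function assigning
# credence 1 to every world incurs inaccuracy 0 at every world, so nothing beats it.
ones = np.array([1.0, 1.0, 1.0])
P = np.array([0.2, 0.3, 0.5])            # any probabilistic credence function

print(np.log(1 / ones))                  # [0. 0. 0.]
print(np.log(1 / P))                     # strictly positive at every world
```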

Another related issue is that the scoring rule $l$ that generates $L$ is not strictly proper. A scoring rule $s$ is said to be strictly proper if every credence expects itself to be the best. That is, for any $0 \leq p \leq 1$, $p\,s_1(x) + (1-p)\,s_0(x)$ is minimized, as a function of $x$, uniquely at $x = p$. But $p(-\log x) + (1-p)\cdot 0 = -p\log x$ is always minimized, as a function of $x$, at $x = 1$, where $-p\log x = 0$. Similarly, an inaccuracy measure $I$ is strictly proper if, for any probabilistic credence function $P$, $\mathrm{Exp}_I(Q \mid P) = \sum_{i=1}^n p_i I(Q, w_i)$ is minimized, as a function of $Q$, at $Q = P$. Now, in this sense, $L$ is not strictly proper, since $\mathrm{Exp}_L(Q \mid P) = \sum_{i=1}^n p_i L(Q, w_i)$ is minimized, as a function of $Q$, at $Q = (1, 1, \ldots, 1)$, as noted above. Nonetheless, if we restrict our attention to probabilistic $Q$, $\mathrm{Exp}_L(Q \mid P) = \sum_{i=1}^n p_i L(Q, w_i)$ is minimized at $Q = P$. In sum: $L$ is only a reasonable inaccuracy measure to use if you already have an independent motivation for Probabilism. But accuracy-first epistemology does not have that luxury. One of the central roles of an inaccuracy measure in that framework is to furnish an accuracy dominance argument for Probabilism.
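Here's the failure of strict propriety for the basic log score in a few lines of code; the credence $p = 0.3$ is an arbitrary choice, and the same happens for any $p < 1$:

```python
import numpy as np

p = 0.3                                  # an arbitrary credence in a single proposition
xs = np.linspace(1e-6, 1.0, 100001)

# Expected basic log score of credence x by the lights of p: p*l1(x) + (1-p)*l0(x) = -p*log(x)
expected = -p * np.log(xs)

print(xs[np.argmin(expected)])           # 1.0 -- minimized at 1, not at p
```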

So, we ask: is there a scoring rule $s$ and resulting inaccuracy measure $I$ such that:
  1. $s$ is a strictly proper scoring rule;
  2. $I$ is a strictly proper inaccuracy measure;
  3. $I$ furnishes an accuracy dominance argument for Probabilism;
  4. if $P(E) > 0$, then $\mathrm{Exp}_I(Q \mid P)$ is minimized, as a function of $Q$ among credence functions for which $Q(E) = 1$, at $Q(-) = P(-\mid E)$?
Straightforwardly, (1) entails (2). And, by a result due to Predd, et al., (1) also entails (3) (Predd, et al. 2009). So we seek $s$ with (1) and (4). Theorem 1 below shows that essentially only one such $s$ and $I$ exist, and they are what I will call the enhanced log score $l^*$ and the enhanced log inaccuracy measure $L^*$:
$$l^*_0(x) = x \qquad \text{and} \qquad l^*_1(x) = -\log x + x.$$

[Figure: the enhanced log score $l^*$, with $l^*_0$ in yellow and $l^*_1$ in blue.]


Before we state and prove the theorem, there are some features of this scoring rule and its resulting inaccuracy measure that are worth noting. Juergen Landes has identified this scoring rule for a different purpose (Proposition 9.1, Landes 2015).


Proposition 1 $l^*$ is strictly proper.

Proof. Suppose $0 \leq p \leq 1$. Then
$$\frac{d}{dx}\Big[p\,l^*_1(x) + (1-p)\,l^*_0(x)\Big] = \frac{d}{dx}\Big[p(-\log x + x) + (1-p)x\Big] = -\frac{p}{x} + 1,$$
which equals $0$ iff $p = x$.

Proposition 2 If $P = (p_1, \ldots, p_n)$ is non-probabilistic, then $P^* = \left(\frac{p_1}{\sum_k p_k}, \ldots, \frac{p_n}{\sum_k p_k}\right)$ accuracy dominates $P$.

Proof. Since the coordinates of $P^*$ sum to $1$,
$$L^*(P^*, w_i) = -\log\left(\frac{p_i}{\sum_k p_k}\right) + 1 = -\log p_i + \log \sum_k p_k + 1,$$
while
$$L^*(P, w_i) = -\log p_i + \sum_k p_k.$$
But $\log x + 1 \leq x$, for all $x > 0$, with equality iff $x = 1$. So, if $P$ is non-probabilistic, then $\sum_k p_k \neq 1$ and so $L^*(P^*, w_i) < L^*(P, w_i)$ for $i = 1, \ldots, n$.
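And a numerical illustration of Proposition 2, using an arbitrary non-probabilistic credence function:

```python
import numpy as np

def enhanced_log_inaccuracy(P, i):
    # L*(P, w_i) = -log p_i + sum_k p_k   (from l*_1(x) = -log x + x and l*_0(x) = x)
    return -np.log(P[i]) + P.sum()

P = np.array([0.2, 0.4, 0.6])            # a non-probabilistic credence function (sums to 1.2)
P_star = P / P.sum()                     # its normalization

for i in range(len(P)):
    print(enhanced_log_inaccuracy(P_star, i) < enhanced_log_inaccuracy(P, i))   # True at every world
```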

Proposition 3 If $P$ is probabilistic, $L^*(P, w_i) = 1 + L(P, w_i)$.

Proof.
$$L^*(P, w_i) = p_1 + \cdots + p_{i-1} + (-\log p_i + p_i) + p_{i+1} + \cdots + p_n = -\log p_i + 1 = 1 + L(P, w_i).$$
 

Corollary 1 If $P$, $Q$ are probabilistic, then
$$\mathrm{Exp}_{L^*}(Q \mid P) = 1 + \mathrm{Exp}_L(Q \mid P).$$

Proof.  By Proposition 3.

Corollary 2 Suppose $E_1, \ldots, E_m$ is a partition and $0 \leq q_1, \ldots, q_m \leq 1$ with $\sum_{i=1}^m q_i = 1$. Then, among $Q$ for which $Q(E_i) = q_i$ for $i = 1, \ldots, m$, $\mathrm{Exp}_{L^*}(Q \mid P)$ is minimized at the Jeffrey Conditionalization posterior $Q(-) = \sum_{i=1}^m q_i P(-\mid E_i)$.

Proof.  This follows from Corollary 1 and Theorem 5.1 from (Diaconis & Zabell 1982).
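Here's a numerical sketch of Corollary 2 in a three-world case; the prior and the constraints $q_1$, $q_2$ are arbitrary illustrative choices, and a grid search stands in for the exact minimization:

```python
import numpy as np

# Three worlds, partition E1 = {w1, w2}, E2 = {w3}; prior and constraints are illustrative.
P = np.array([0.5, 0.2, 0.3])
q1, q2 = 0.8, 0.2                        # constraints: Q(E1) = q1, Q(E2) = q2

# Any Q meeting the constraints has the form (x, q1 - x, q2), and its expected
# enhanced log inaccuracy by the lights of P is
#   sum_i p_i (-log Q_i + sum_j Q_j) = (q1 + q2) - p1*log(x) - p2*log(q1 - x) - p3*log(q2).
xs = np.linspace(1e-6, q1 - 1e-6, 200001)
values = (q1 + q2) - P[0]*np.log(xs) - P[1]*np.log(q1 - xs) - P[2]*np.log(q2)

print(xs[np.argmin(values)])             # ~0.571
print(q1 * P[0] / (P[0] + P[1]))         # Jeffrey Conditionalization: q1 * P(w1 | E1) ≈ 0.571
```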

Having seen $l^*$ and $L^*$ in action, let's see that they are unique in having this combination of features.

Theorem 1 Suppose $s$ is a strictly proper scoring rule and $I$ is the inaccuracy measure it generates. And suppose that, for any set of worlds $\{w_1, \ldots, w_n\}$, any $E \subseteq \{w_1, \ldots, w_n\}$, and any probabilistic credence function $P$ with $P(E) > 0$, the probabilistic credence function $Q$ that minimizes the expected inaccuracy by the lights of $P$, subject to the constraint $Q(E) = 1$ and with inaccuracy measured by $I$, is $Q(-) = P(-\mid E)$. Then the scoring rule is
$$s_1(x) = -\log x + x \qquad \text{and} \qquad s_0(x) = x,$$
or any affine transformation of this.

Proof. First, we appeal to the following lemma (Proposition 2, Predd, et al. 2009):

Lemma 1

(i) Suppose $s$ is a continuous strictly proper scoring rule. Then define
$$\varphi_s(x) = -x\,s_1(x) - (1-x)\,s_0(x).$$
Then $\varphi_s$ is differentiable on $(0, 1)$ and convex on $[0, 1]$, and
$$\mathrm{Exp}_I(Q \mid P) - \mathrm{Exp}_I(P \mid P) = \sum_{i=1}^n \varphi_s(p_i) - \varphi_s(q_i) - \varphi_s'(q_i)(p_i - q_i).$$
(ii) Suppose $\varphi$ is differentiable on $(0, 1)$ and convex on $[0, 1]$. Then let
  • $s^\varphi_1(x) = -\varphi(x) - \varphi'(x)(1 - x)$
  • $s^\varphi_0(x) = -\varphi(x) - \varphi'(x)(0 - x)$
Then $s^\varphi$ is a strictly proper scoring rule.

Moreover, $s^{\varphi_s} = s$.

Now, let's focus on $\{w_1, w_2, w_3, w_4\}$ and let $E = \{w_1, w_2, w_3\}$. Let $p_1 = a$, $p_2 = b$, $p_3 = c$. Then we wish to minimize
$$\mathrm{Exp}_I((x, y, 1-x-y, 0) \mid (a, b, c, 1-a-b-c)).$$
Now, by Lemma 1,
$$\begin{aligned}\mathrm{Exp}_I((x, y, 1-x-y, 0) \mid (a, b, c, 1-a-b-c)) = {}& \varphi(a) - \varphi(x) - \varphi'(x)(a - x)\\ &+ \varphi(b) - \varphi(y) - \varphi'(y)(b - y)\\ &+ \varphi(c) - \varphi(1-x-y) - \varphi'(1-x-y)\big(c - (1-x-y)\big)\\ &+ \mathrm{Exp}_I((a, b, c, 1-a-b-c) \mid (a, b, c, 1-a-b-c)),\end{aligned}$$
up to a term corresponding to the fourth coordinate, which does not depend on $x$ or $y$ and so plays no role in the minimization.
Thus:
$$\frac{\partial}{\partial x}\mathrm{Exp}_I((x, y, 1-x-y, 0) \mid (a, b, c, 1-a-b-c)) = \varphi''(x)(x - a) - \varphi''(1-x-y)\big((1-x-y) - c\big)$$
and
$$\frac{\partial}{\partial y}\mathrm{Exp}_I((x, y, 1-x-y, 0) \mid (a, b, c, 1-a-b-c)) = \varphi''(y)(y - b) - \varphi''(1-x-y)\big((1-x-y) - c\big),$$
which are both $0$ iff
$$\varphi''(x)(x - a) = \varphi''(y)(y - b) = \varphi''(1-x-y)\big((1-x-y) - c\big).$$
Now, suppose this is true for $x = \frac{a}{a+b+c}$ and $y = \frac{b}{a+b+c}$. Then, for all $0 \leq a, b, c \leq 1$ with $a + b + c \leq 1$,
$$a\,\varphi''\!\left(\frac{a}{a+b+c}\right) = b\,\varphi''\!\left(\frac{b}{a+b+c}\right).$$
We now wish to show that $\varphi''(x) = \frac{k}{x}$, for some constant $k$ and all $0 < x \leq 1$. If we manage that, then it follows that $\varphi'(x) = k\log x + m$ and $\varphi(x) = kx\log x + (m-k)x$. And we know from Lemma 1:
$$s_0(x) = -\varphi(x) - \varphi'(x)(0 - x) = -[kx\log x + (m-k)x] + x[k\log x + m] = kx$$
and
$$s_1(x) = -\varphi(x) - \varphi'(x)(1 - x) = -[kx\log x + (m-k)x] - [k\log x + m](1 - x) = -k\log x + kx - m.$$
Now, first, let $f(x) = \varphi''\!\left(\frac{1}{x}\right)$. Thus, it will suffice to prove that $f(x) = kx$ for some constant $k$. For then $\varphi''(x) = \varphi''\!\left(\frac{1}{1/x}\right) = f\!\left(\frac{1}{x}\right) = \frac{k}{x}$, as required. And to prove that $f(x) = kx$, we need only show that $f'(x)$ is a constant function. We know that, for all $0 \leq a, b, c \leq 1$ with $a + b + c \leq 1$, we have
$$a\,f\!\left(\frac{a+b+c}{a}\right) = b\,f\!\left(\frac{a+b+c}{b}\right).$$
So
$$\frac{d}{dc}\,a\,f\!\left(\frac{a+b+c}{a}\right) = \frac{d}{dc}\,b\,f\!\left(\frac{a+b+c}{b}\right).$$
So, for all $0 \leq a, b, c \leq 1$ with $a + b + c \leq 1$,
$$f'\!\left(\frac{a+b+c}{a}\right) = f'\!\left(\frac{a+b+c}{b}\right).$$
We now show that, for all $x \geq 1$, $f'(x) = f'(2)$, which will suffice to show that $f'$ is constant. First, we consider $2 \leq x$. Then let
$$a = \frac{1}{x}, \qquad b = \frac{1}{2}, \qquad c = \frac{1}{2} - \frac{1}{x}.$$
Then
$$f'(x) = f'\!\left(\frac{a+b+c}{a}\right) = f'\!\left(\frac{a+b+c}{b}\right) = f'(2).$$
Second, consider $1 < x \leq 2$. Then pick $y \geq 2$ such that $\frac{1}{x} + \frac{1}{y} \leq 1$. Then let
$$a = \frac{1}{x}, \qquad b = \frac{1}{y}, \qquad c = 1 - \frac{1}{x} - \frac{1}{y}.$$
Then
$$f'(x) = f'\!\left(\frac{a+b+c}{a}\right) = f'\!\left(\frac{a+b+c}{b}\right) = f'(y) = f'(2).$$
So $f'$ is constant, and hence $f(x) = kx + m'$ for some constants $k$ and $m'$. Substituting this back into $a\,f\!\left(\frac{a+b+c}{a}\right) = b\,f\!\left(\frac{a+b+c}{b}\right)$ gives $k(a+b+c) + am' = k(a+b+c) + bm'$ whenever $a \neq b$, and so $m' = 0$ and $f(x) = kx$, as required.
