Thursday, 12 March 2015

A Strange Thing about the Brier Score

This post was co-written by Brian Knab and Miriam Schoenfield.

In the literature on epistemic utility theory, the Brier Score is offered as a paradigmatically reasonable measure of epistemic utility, or epistemic accuracy. We offer a case meant to put pressure on the claim that the Brier score in fact reasonably captures epistemic utility or epistemic accuracy.

1. A Simple Case

Consider two people contemplating the origin of the universe.

The simple deist is confident that a being exists that designed the universe. She is aware that cosmologists have developed non-design theories about the origins of the universe. However, she's confident that the non-design thesis is false.

So, according to the simple deist: deism is true, and the non-design thesis (which we’ll call “adeism”) is false. Deism and adeism form a partition of her possibility space.

The simple adeist, on the other hand, is confident that deism is false. She's confident that the universe came about without any help from a designer at all, and that the non-design thesis is true.

So, according to the simple adeist: deism is false, and the non-design hypothesis is true. Deism and adeism form a partition of her possibility space.

It turns out: there is a designer! (So deism is true, and adeism is false.) Who is more accurate? The deist, obviously.

The Brier Score straightforwardly confirms this -- the simple deist is more accurate, according to the Brier Score, than the simple adeist.
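For concreteness, here is the simple case in code, assuming the standard definition of the Brier score as the sum of squared differences between an agent's credences and the truth values across the partition (lower is more accurate):

```python
def brier(credences, truths):
    """Brier penalty: summed squared distance from the truth-value vector (lower = more accurate)."""
    return sum((c - t) ** 2 for c, t in zip(credences, truths))

# Partition: (deism, adeism); deism turns out to be true.
truth = (1, 0)
simple_deist = (1, 0)   # certain of deism
simple_adeist = (0, 1)  # certain of adeism

print(brier(simple_deist, truth))   # 0 -- perfectly accurate
print(brier(simple_adeist, truth))  # 2 -- maximally inaccurate on this partition
```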

2. A Problem Case

Now, consider again two people contemplating the origin of the universe. Both of them are admittedly somewhat uncertain about the existence of a designer. Both of them are aware of a large number of non-design theories of the origin of the universe.

The sophisticated deist is more confident in deism than in adeism. She has, moreover, carefully considered all of the available non-design hypotheses, and has concluded that only one of them could possibly be true. The rest, she thinks, are non-starters.

The sophisticated adeist, on the other hand, is more confident in adeism than in deism. She has also carefully considered all of the available non-design hypotheses, and although she thinks it’s likely that one of them is true, she has no opinion concerning which is the true one. In her estimation, the non-design hypotheses are all equally likely.

Now, suppose it turns out: Deism is true (and so every non-design hypothesis is false). Who is more accurate?

We think: the sophisticated deist! After all, the sophisticated deist has the following two advantages over the sophisticated adeist: she has a higher credence in the truth than the sophisticated adeist does, and she has less credence invested in falsehoods than the sophisticated adeist does. So what advantage does the sophisticated adeist have over the sophisticated deist? The only remaining difference between them is the way in which they distribute their confidence among the false hypotheses. But why should the way in which the adeist distributes her confidence among the various false hypotheses make her more accurate in a world in which deism is true?

From the first-person side of things: if I want to have an accurate attitude about the origin of the universe and my choices are between being a sophisticated deist and a sophisticated adeist, I’d prefer to be the sophisticated deist in the world in which deism is true.

But, in certain situations, and given enough non-design theories, the Brier score delivers the opposite verdict. For example, let D be the design hypothesis, and suppose there are 58 non-design theories, T1, T2, ..., T58. Thus our partition is {D, T1, T2, ..., T58}.

By the description of the case, the ideal credences across this partition are (1, 0, 0, ..., 0).

Suppose the sophisticated deist’s credences are

Then her Brier Score is

Suppose the sophisticated adeist’s credences are

Then her Brier Score is
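The specific credence assignments and scores have not survived in the text above, so the following sketch uses illustrative numbers of our own choosing (not necessarily the original ones): the deist puts 0.5 on D and 0.5 on the single non-design theory she takes seriously, while the adeist puts only 0.32 on D and spreads the remaining 0.68 evenly over all 58 false theories. The structural verdict is the one described below:

```python
def brier(credences, truths):
    # Summed squared distance from the truth values; lower = more accurate.
    return sum((c - t) ** 2 for c, t in zip(credences, truths))

N = 58  # number of non-design theories T1..T58
truth = (1,) + (0,) * N  # deism true, every Ti false

# Hypothetical sophisticated deist: 0.5 on D, 0.5 on the one theory she takes seriously.
deist = (0.5, 0.5) + (0.0,) * (N - 1)

# Hypothetical sophisticated adeist: 0.32 on D, 0.68 spread evenly over all 58 theories.
adeist = (0.32,) + (0.68 / N,) * N

print(round(brier(deist, truth), 4))   # 0.5
print(round(brier(adeist, truth), 4))  # 0.4704 -- lower, so "more accurate" per Brier
```

Despite having less credence in the truth (0.32 vs. 0.5) and more total credence in falsehoods (0.68 vs. 0.5), the adeist gets the better Brier score, because her falsehood credence is spread thinly.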

That’s a small victory for the adeist, admittedly, but the point is a structural one. The adeist -- in spite of the fact that she is a good deal less confident in the truth and, overall, a good deal more confident in the false -- is more accurate than the deist, according to the Brier. (For a related point -- one which trades on this same structural feature of the Brier Score -- see Knab, “In Defense of Absolute Value.”)

3. Discussion

That, we think, is enough of a puzzle to put some pressure on the Brier understanding of epistemic accuracy. More generally, the Brier Score fails to satisfy what looks like a plausible desideratum:

Falsity Distributions Don’t Matter: For any partition of theories: T1...Tn, a probabilistic agent’s accuracy with respect to this partition at world w should be determined solely by the amount of credence she invests in the true theory at w, and the amount of credence she invests in false theories at w.  The way she distributes her credences amongst the false theories at w shouldn’t affect her accuracy.
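A minimal check that the Brier score violates this desideratum: the two probabilistic distributions below invest the same credence in the truth (0.5) and the same total credence in falsehoods (0.5), differing only in how the falsehood credence is distributed, yet they receive different Brier scores:

```python
def brier(credences, truths):
    # Summed squared distance from the truth values; lower = more accurate.
    return sum((c - t) ** 2 for c, t in zip(credences, truths))

truth = (1, 0, 0)
concentrated = (0.5, 0.5, 0.0)  # all falsehood credence on one false theory
spread = (0.5, 0.25, 0.25)      # same totals, falsehood credence split evenly

print(brier(concentrated, truth))  # 0.5
print(brier(spread, truth))        # 0.375 -- Brier rewards spreading over falsehoods
```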


  1. One response, which Joyce will make and which has its origins in de Finetti (1974 vol. I, ch 3; see specifically 3.1.2 and his admonishments in 3.7.3), is to remark that the distinction between prevision and prediction is violated in the example. To make a prediction, one tries to venture a guess among the possible outcomes that will occur. A prevision, on the other hand, distributes among the possible outcomes one's own expectations. De Finetti's quadratic scoring rule argument is designed to elicit coherent previsions, not "coherent predictions"; more still, the idea of a "coherent prediction" is viewed by de Finetti, and presumably by Joyce as well, as a category mistake.

    I agree with Jim that EDT, if it is to repurpose de Finetti's machinery, needs to abide by this distinction. So, not only is this response available to EDTers; absent alternative foundations, they are compelled to take it.

    A response then to this EDT response: Why not put pressure on EDT by asking what the purely epistemic corollary of subjective expectation is?

  2. Greg - is your thought that our desideratum (Falsity Distributions Don't Matter) is only motivated as a desideratum for predictions and not as a desideratum for previsions? We intended it to be a desideratum for the accuracy of previsions - but perhaps your thought is that if we impose such a constraint, that shows that in some sense we're only interested in predictions and not in previsions? But question - is a prediction literally just a guess? We definitely think that your accuracy should be determined not just by whether you're more confident in the truth than in the falsehood, but by how much credence you have invested in the truth and how much credence you have invested in falsehood.

  3. I notice that Falsity Distributions Don't Matter applies only among probabilistic distributions. This strikes me as significant for two reasons: (1) It can't be invoked when providing an accuracy-based argument for probabilism. (2) It keeps the principle from applying to examples like the following:

    Consider two distributions over a 3-proposition partition whose first element is true:

    (1, 0, 0) and (1, 2, -2)

    I have the intuition that the first distribution is doing better with respect to accuracy, despite the fact that they differ only in the values they assign to falsehoods. Strictly speaking this isn't a counterexample to FDDM, because the latter distribution isn't probabilistic. Nevertheless, if I have the intuition that the values assigned to falsehoods matter in this case, why should I believe that they don't matter when selecting among probabilistic distributions?

  4. Thanks Mike - On second thought, we don't think we should restrict the principle to probability functions. Our basic thought is this:

    Accuracy should be determined by how much credence is invested in truth and how much credence is invested in falsehood.

    If we stick to probability functions this has the consequence that falsity distributions don't matter, but more generally, our main idea might be right even though falsity distributions do matter.
For example: if (1, 0, 0) is true, then (1, 1, 1) is more inaccurate than (1, .5, .5). In this case falsity distributions do matter, but what explains why the former is more inaccurate is that, in the former case, more credence is invested in falsehood than in the latter case.
    Your example, however, puts pressure on even the modified principle because, I take it, the thought is that in both cases 0 credence is invested in falsehood. That may be a problem, but to be honest, I'm not sure exactly what it means to have a -2 (or 2) credence in something. Is a -2 credence in p being REALLY REALLY not confident in p? That's hard for me to understand since I think of 0 as representing MAXIMAL lack of confidence in p.

  5. Cool post! I wonder how much work the "amount of credence" talk is doing to gloss over a questionable presupposition of FDDM. Talking of "amounts" this way seems to prime a kind of reductive, divide-and-add way of thinking that Brier-fans might find objectionable.

    Picturing these distributions as points in credal space suggests a more holistic assessment. It also suggests a mistake the sophisticated adeist may be avoiding, yet her deistic counterpart makes. The deist plumps hard for a specification of the false in a way the adeist does not. In a sense, the deist is thereby more alienated from the truth: she gives herself over to a falsehood with greater zest than the adeist ever does.

    One might object that the adeist is still more invested in the false overall. But that goes back to begging the reductive/additive question. The adeist is still not as invested in any specific falsehood as her counterpart.

    Should the adeist get credit for this kind of mitigated success? Maybe so. Should she get so much credit as to overtake her counterpart who commits more heavily to the true? Maybe not. But then the argument doesn't turn on FDDM.

    1. Jonathan -- Thanks! I wonder if the "divide and add way of thinking" isn't written into the Brier understanding itself. It looks as though, on the Brier, we're just computing your penalty with respect to each proposition, and then adding all of those penalties up.

      Your suggestion could just amount, then, to the suggestion that we should abandon Brier accuracy, and adopt a more holistic measure. To claim otherwise is to claim that the additive form of the Brier is misleading, which is odd (though, I suppose, defensible).

      I'm also unsure that "picturing these distributions as points in credal space suggests a more holistic assessment." It's true that picturing them in Euclidean space will move us toward a Euclidean-distance-derivative measure like the Brier. But will picturing them in taxi-cab space also move us toward a more holistic assessment?

      If no, then it looks question-begging -- picturing is just presupposing a Euclidean metric, and thus only something like the Brier could count as holistic. If yes, then it looks like picturing will not tell us why we should adopt a holistic measure -- given that they're so easy to come by -- that has these puzzling features that the Brier has.

      Finally, about giving oneself over to falsehood with greater zest. Notice that the deist is, in all but one case, more confident that the false theories are false than the adeist. So not only does she give herself over to a falsehood with greater zest -- she also gives herself over to the truth with greater zest.

      This is just another way to get at the puzzle. Why should the deist's whole lot of small improvements over the adeist be totally outweighed by her one big mistake?

  6. This comment has been removed by the author.

7. I guess this depends on how you think of credences, and on how you think about the probability axioms' restriction of credences to [0,1]. (Is that restriction stipulative or normative?) If, for instance, you equate an agent's credence in proposition p with her betting price for p, it is certainly possible (though admittedly really stupid) to be willing to pay someone 2 dollars to accept a betting ticket that pays them an additional 1 dollar if p is true. That would correspond to a credence of -2 in p.

    1. Mike -- thanks for the comments!

      I think we can avoid the worry you're pushing so long as we understand, in FDDM, "the amount of credence she invests in the false" in terms of magnitudes. That is, in the world (1,0,0), the amount of credence that someone who has the credence (1, 2, -2) invests in the false is 4.

      Now, you might object that this commits us to the claim that, in the world (1,0,0), someone who has the credence (1/3, -1/3, -1/3) is no more accurate than someone who has the credence (1/3, 1/3, 1/3), and that looks weird.

      But anyone who adopts the Brier score will be forced to accept this as well. The Brier will evaluate two agents -- one who has the credence (1/3, 1/3, 1/3), and the other who has (1/3, -1/3, -1/3) -- as equally accurate in the world (1, 0, 0).

      I don't think the claim that FDDM should be understood in terms of magnitudes begs any questions, because it's theoretically motivated. What we're worried about, when we're worried about accuracy, is something like distance from the truth, which will always be a positive quantity.

  8. I take it that FDDM is intended to apply only when no false hypothesis is closer to the truth than any other. I believe the logarithmic rule is unique among power and pseudospherical rules in conforming to this principle.
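The point in this last comment can be checked directly. The logarithmic rule penalizes an agent only by the credence she assigns to the true cell, so the falsity distribution cannot affect the score. A small sketch, assuming the standard log score -ln(credence in the truth):

```python
import math

def log_score(credences, true_index):
    # Logarithmic penalty: depends only on the credence assigned to the true cell.
    return -math.log(credences[true_index])

concentrated = (0.5, 0.5, 0.0)
spread = (0.5, 0.25, 0.25)

# Same credence in the truth => identical log scores, whatever the falsity distribution.
print(log_score(concentrated, 0) == log_score(spread, 0))  # True
```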