M-Phi. A blog dedicated to mathematical philosophy. Jeffrey Ketland.

What we together risk: three vignettes in search of a theory (13 April 2021)

<p>For a PDF version of this post, see <a href="https://drive.google.com/file/d/10fQudSUQKI2Ut59tEJFjg03aLpusi5MY/view?usp=sharing" target="_blank">here</a>.<br /><br />Many years ago, I was climbing Sgùrr na Banachdich with my friend Alex. It's a mountain in the Black Cuillin, a horseshoe of summits that surround Loch Coruisk at the southern end of the Isle of Skye. It's a Munro---that is, it stands over 3,000 feet above sea level---but only just---it measures 3,166 feet. About halfway through our ascent, the mist rolled in and the rain came down heavily, as it often does near these mountains, which attract their own weather system. At that point, my friend and I faced a choice: to continue our attempt on the summit or begin our descent. Should we continue, there were a number of possible outcomes: we might reach the summit wet and cold but not injured, with the mist and rain gone and in their place sun and views across to Bruach na Frìthe and the distinctive teeth-shaped peaks of Sgùrr nan Gillean; or we might reach the summit without injury, but the mist might remain, obscuring any view at all; or we might get injured on the way and either have to descend early under our own steam or call for help getting off the mountain. On the other hand, should we start our descent now, we would of course have no chance of the summit, but we were sure to make it back unharmed, for the path back is good and less affected by rain.<br /><br />Alex and I had climbed together a great deal that summer and the summer before. 
We had talked at length about what we enjoyed in climbing and what we feared. To the extent that such comparisons make sense and can be known, we both knew that we both gained exactly the same pleasure from reaching a summit, the same additional pleasure if the view was clear; we gained the same displeasure from injury, the same horror at the thought of having to call for assistance getting off a mountain. What's more, we both agreed exactly on how likely each possible outcome was: how likely we were to sustain an injury should we persevere; how likely that the mist would clear in the coming few hours; and so on. Nonetheless, I wished to turn back, while Alex wanted to continue. <br /><br />How could that be? We both agreed how good or bad each of the options was, and both agreed how likely each would be were we to take either of the courses of action available to us. Surely we should therefore have agreed on which course of action would maximise our expected utility, and therefore agreed which would be best to undertake. Yes, we did agree on which course of action would maximise our expected utility. However, no, we did not therefore agree on which was best, for there are theories of rational decision-making that do not demand that you must rank options by their expected utility. These are the risk-sensitive decision theories, and they include John Quiggin's rank-dependent decision theory and Lara Buchak's risk-weighted expected utility theory. According to Quiggin's and Buchak's theories, what you consider best is not determined only by your utilities and your probabilities, but also by your attitudes to risk. The more risk-averse will give greater weight to the worst-case scenarios and less to the best-case ones than expected utility demands; the more risk-inclined will give greater weight to the best outcomes and less to the worst than expected utility does; and the risk-neutral person will give exactly the weights prescribed by expected utility theory. 
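To see how shared utilities and probabilities can still yield divergent verdicts, here is a minimal sketch of a rank-dependent calculation in the style of Buchak's risk-weighted expected utility. The utilities, probabilities, and the risk function $r(p) = p^2$ are illustrative assumptions of mine, not figures from the vignette:

```python
# Risk-weighted expected utility (REU), in Buchak's style:
# order the outcomes from worst to best, then
#   REU = u_1 + sum over i >= 2 of r(P(utility >= u_i)) * (u_i - u_{i-1}),
# where r is the agent's risk function. r(p) = p recovers ordinary
# expected utility; a convex r (e.g. r(p) = p**2) is risk-averse,
# discounting the improvements that only arrive in the better outcomes.

def reu(outcomes, r):
    """outcomes: list of (probability, utility) pairs; r: risk function."""
    outcomes = sorted(outcomes, key=lambda pu: pu[1])  # worst to best
    total = outcomes[0][1]                             # worst-case utility
    for i in range(1, len(outcomes)):
        p_at_least = sum(p for p, _ in outcomes[i:])   # P(utility >= u_i)
        total += r(p_at_least) * (outcomes[i][1] - outcomes[i - 1][1])
    return total

# Hypothetical numbers for continuing the climb: injury (-10, prob 0.2),
# summit in mist (2, prob 0.4), summit with a view (10, prob 0.4).
# Suppose a sure descent is worth 1.
climb = [(0.2, -10), (0.4, 2), (0.4, 10)]

risk_neutral = reu(climb, lambda p: p)       # ordinary expected utility: 2.8
risk_averse = reu(climb, lambda p: p ** 2)   # overweights the worst case

print(risk_neutral, risk_averse)
```

With these numbers the gamble is worth more than the sure descent (value 1) to the risk-neutral climber but less to the risk-averse one, even though the two agree on every utility and every probability, which is exactly the pattern in the story.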
So, perhaps I preferred to begin our descent from Sgùrr na Banachdich while Alex preferred to continue upwards because I was risk-averse and he was risk-neutral or risk-seeking, or I was risk-neutral and he was risk-seeking. In any case, he must have been less risk-averse than I was. <br /><br />Of course, as it turned out, we sat on a mossy rock in the rain and discussed what to do. We decided to turn back. Luckily, as it happened, for a thunderstorm hit the mountains an hour later at just the time we'd have been returning from the summit. But suppose we weren't able to discuss the decision. Suppose we'd roped ourselves together to avoid getting separated in the mist, and he'd taken the lead, so that he had to make the choice on behalf of both of us. In that case, what should he have done?<br /><br />As I will do throughout these reflections, let me simply report my own reaction to the case. I think, in that case, Alex should have chosen to descend (and not only because that was my preference---I'd have thought the same had it been he who wished to descend and me who wanted to continue!). Had he chosen to continue---even if all had turned out well and we'd reached the summit unharmed and looked over the Cuillin ridge in the sun---I would still say that he chose wrongly on our behalf. This suggests the following principle (in joint work, Ittay Nissan Rozen and Jonathan Fiat argue for a version of this principle that applies in situations in which the individuals do not assign the same utilities to the outcomes):</p><p><b>Principle 1</b> Suppose two people assign the same utilities to the possible outcomes, and assign the same probabilities to the outcomes conditional on choosing a particular course of action. And suppose that you are required to choose between those courses of action on their behalf. Then you must choose whatever the more risk-averse of the two would choose.<br /><br />However, I think the principle is mistaken. 
A few years after our unsuccessful attempt on Sgùrr na Banachdich, I was living in Bristol and trying to decide whether to take up a postdoctoral fellowship there or a different one based in Paris (a situation that seems an unimaginable luxury and privilege when I look at today's academic job market). Staying in Bristol was the safe bet; moving to Paris was a gamble. I already knew what it would be like to live in Bristol and what the department was like. I knew I'd enjoy it a great deal. I'd visited Paris, but I didn't know what it would be like to live there, and I knew the philosophical scene even less. I knew I'd enjoy living there, but I didn't know how much. I figured I might enjoy it a great deal more than Bristol, but also I might enjoy it somewhat less. The choice was complicated because my partner at the time would move too, if that's what we decided to do. Fortunately, just as Alex and I agreed on how much we valued the different outcomes that faced us on the mountain, so my partner and I agreed on how much we'd value staying in Bristol, how much we'd value living in Paris under the first, optimistic scenario, and how much we'd value living there under the second, more pessimistic scenario. We also agreed how likely the two Parisian scenarios were---we'd heard the same friends describing their experiences of living there, and we'd drawn the same conclusions about how likely we were to value the experience ourselves to different extents. Nonetheless, just as Alex and I had disagreed on whether or not to start our descent despite our shared utilities and probabilities, so my partner and I disagreed on whether or not to move to Paris. Again the more risk-averse of the two, I wanted to stay in Bristol, while he wanted to move to Paris. Again, of course, we sat down to discuss this. But suppose that hadn't been possible. Perhaps my partner had to make the decision for both of us at short notice and I was not available to consult. 
How should he have chosen?<br /><br />In this case, I think either choice would have been permissible. My partner might have chosen Paris or he might have chosen Bristol, and either of these would have been allowed. But of course this runs contrary to Principle 1. <br /><br />So what is the crucial difference between the decision on Sgùrr na Banachdich and the decision whether to move cities? In each case, there is an option---beginning our descent or staying in Bristol---that is certain to have a particular level of value; and there is an alternative option---continuing to climb or moving to Paris---that might give less value than the sure thing, but might give more. And, in each case, the more risk-averse person prefers the sure thing to the gamble, while the more risk-inclined prefers the gamble. So why must someone choosing for me and Alex in the first case choose to descend, while someone choosing for me and my partner in the second case may choose either Bristol or Paris? <br /><br />Here's my attempt at a diagnosis: in the choice of cities, there is no risk of harm, while in the decision on the mountain, there is. In the first case, the gamble opens up a possible outcome in which we're harmed---we are injured, perhaps quite badly. In the second case, the gamble doesn't do that---we countenance the possibility that moving to Paris might not be as enjoyable as remaining in Bristol, but we are certain it won't harm us! This suggests the following principle:</p><p><b>Principle 2</b> Suppose two people assign the same utilities to the possible outcomes, and assign the same probabilities to the outcomes conditional on choosing a particular course of action. And suppose that you are required to choose between those courses of action on their behalf. 
Then there are two cases: if one of the available options opens the possibility of a harm, then you must choose whatever the more risk-averse of the two would choose; if neither of the available options opens the possibility of a harm, then you may choose an option if at least one of the two would choose it. </p><p>So risk-averse preferences do not always take precedence, but they do when harms are involved. Why might that be? <br /><br />A natural answer: to expose someone to the risk of a harm requires their consent. That is, when there is an alternative option that opens no possibility of harm, you are only allowed to choose an option that opens up the possibility of a harm if everyone affected would consent to being subject to that risk. So Alex should only choose to continue our ascent and expose us to the risk of injury if I would consent to that, and of course I wouldn't, since I'd prefer to descend. But my partner is free to choose the move to Paris even though I wouldn't choose that, because it exposes us to no risk of harm. <br /><br />A couple of things to note: First, in our explanation, references to risk-aversion, risk-neutrality, and risk-inclination have dropped out. What is important is not who is more averse to risk, but who consents to what. Second, our account will only work if we employ an absolute notion of harm. That is, I must say that there is some threshold and an option harms you if it causes your utility to fall below that threshold. We cannot use a relative notion of harm on which an option harms you if it merely causes your utility to fall. After all, using a relative notion of harm, the move to Paris will harm you should it turn out to be worse than staying in Bristol. <br /><br />The problem with Principle 2 and the explanation we have just given is that they do not generalise to cases in which more than two people are involved. 
That is, the following principle seems false:</p><p><b>Principle 3</b> Suppose each member of a group of people assigns the same utilities to the possible outcomes, and assigns the same probabilities to the outcomes conditional on choosing a particular course of action. And suppose that you are required to choose between those courses of action on their behalf. Then there are two cases: if one of the available options opens the possibility of a harm, then you must choose whatever the most risk-averse of them would choose; if neither of the available options opens the possibility of a harm, then you may choose an option if at least one member of the group would choose it.</p><p>A third vignette might help to illustrate this.<br /><br />I grew up between two power stations. My high school stood in the shadow of the coal-fired plant at Cockenzie, while the school where my mother taught stood in the lee of the nuclear plant at Torness Point. And I was born two years after the Three Mile Island accident, and the Chernobyl tragedy happened as I started school. So the risks of nuclear power were somewhat prominent growing up. Now, let's imagine a community of five million people who currently generate their energy from coal-fired plants---a community like Scotland in 1964, just before its first nuclear plant was constructed. This community is deciding whether to build nuclear plants to replace its coal-fired ones. All agree that having a nuclear plant that suffered no accidents would be vastly preferable to having coal plants, and all agree that a nuclear plant that suffered an accident would be vastly worse than the coal plants. And we might imagine that they also all assign the same probability to the prospective nuclear plants suffering an accident---perhaps they all defer to a recent report from the country's atomic energy authority. But, while they agree on the utilities and the probabilities, they don't all have the same attitudes to risk. 
In the end, 4.5 million people prefer to build the nuclear facilities, while half a million, who are more risk-averse, prefer to retain the coal-fired alternatives. Principle 3 says that, for someone choosing on behalf of this population, the only option they can choose is to retain the coal-fired plants. After all, a nuclear accident is clearly a harm, and there are individuals who would suffer that harm who would not consent to being exposed to the risk. But surely that's wrong. Surely, despite such opposition, it would be acceptable to build the nuclear plant.<br /><br />So, while Principle 2 might yet be true, Principle 3 is wrong. And I think my attempt to explain the basis of Principle 2 must be wrong as well, for if it were right, it would also support Principle 3. After all, in no other case I can think of in which a lack of consent is sufficient to block an action does that block disappear if there are sufficiently many people in favour of the action. <br /><br />So what general principles underpin our reactions to these three vignettes? Why do the preferences of the more risk-averse individuals carry more weight when one of the outcomes involves a harm than when they don't, but not enough weight to overrule a significantly greater number of more risk-inclined individuals? That's the theory I'm in search of here.<br /></p>Richard Pettigrew

Believing is said of groups in many ways (6 April 2021)

<p style="text-align: left;">For a PDF version of this post, see <a href="https://drive.google.com/file/d/1x1LgnTfosyWO6BUpPHU1W9xCJ5Hl0Lqe/view?usp=sharing" target="_blank">here</a>. 
<br /></p><h2 style="text-align: left;">In defence of pluralism <br /></h2><p>Recently, after a couple of hours discussing a problem in the philosophy of mathematics, a colleague mentioned that he wanted to propose a sort of pluralism as a solution. We were debating the foundations of mathematics, and he wanted to consider the claim that there might be no single unique foundation, but rather many different foundations, no one of them better than the others. Before he did so, though, he wanted to preface his suggestion with an apology. Pluralism, he admitted, is unpopular wherever it is proposed as a solution to a longstanding philosophical problem. </p><p>I agree with his sociological observation. Philosophers tend to react badly to pluralist solutions. But why? And is the reaction reasonable? This is pure speculative generalisation based on my limited experience, but I've found that the most common source of resistance is a conviction that there is a particular special role that the concept in question must play; and moreover, in that role, whether or not something falls under the concept determines some important issue concerning it. So, in the philosophy of mathematics, you might think that a proof of a mathematical proposition is legitimate just in case it can be carried out in the system that provides the foundation for mathematics. And, if you allow a plurality of foundations of differing logical strength, the legitimacy of certain proofs becomes indeterminate---relative to some foundations, they're legit; relative to others, they aren't. Similarly, you might think that a person who accidentally poisons another person is innocent of murder if, and only if, they were justified in their belief that the liquid they administered was not poisonous. And, if you allow a plurality of concepts of justification, then whether or not the person is innocent might become indeterminate. </p><p>I tend to respond to such concerns in two ways. 
First, I note that, while the special role that my interlocutor picks out for the concept we're discussing is certainly among the roles that this concept needs to play, it isn't the only one; and it is usually not clear why we should take it to be the most important one. One role for a foundation of mathematics is to test the legitimacy of proofs; but another is to provide a universal language that mathematicians might use, and that might help them discover new mathematical truths (see <a href="https://philpapers.org/rec/MARCTA" target="_blank">this paper</a> by Jean-Pierre Marquis for a pluralist approach that takes both of these roles seriously).</p><p>Second, I note that we usually determine the important issues in question independently of the concept and then use our determinations to test an account of the concept, not the other way around. So, for instance, we usually begin by determining whether we think a particular proof is legitimate---perhaps by asking what it assumes and whether we have good reason for believing that those assumptions are true---and then see whether a particular foundation measures up by asking whether the proof can be carried out within it. We don't proceed the other way around. And we usually determine whether or not a person is innocent independently of our concept of justification---perhaps just by looking at the evidence they had and their account of the reasoning they undertook---and then see whether a particular account of justification measures up by asking whether the person is innocent according to it. Again, we don't proceed the other way around.</p><p>For these two reasons, I tend not to be very moved by arguments against pluralism. Moreover, while it's true that pluralism is often greeted with a roll of the eyes, there are a number of cases in which it has gained wide acceptance. 
We no longer talk of the <i>probability</i> of an event but distinguish between its <i>chance</i> of occurring, a particular individual's <i>credence</i> in it occurring, and perhaps even its <i>evidential probability</i> relative to a body of evidence. That is, we are pluralists about probability. Similarly, we no longer talk of a particular belief being <i>justified simpliciter</i>, but distinguish between <i>propositional</i>, <i>doxastic</i>, and <i>personal justification</i>. We are, along some dimensions at least, pluralists about justification. We no longer talk of a person having a <i>reason</i> to choose one thing rather than another, but distinguish between their <i>internal</i> and <i>external reasons</i>. </p><p>I want to argue that we should extend pluralism to so-called <i>group beliefs</i> or <i>collective beliefs</i>. Britain believes lockdowns are necessary to slow the virus. Scotland believes it would fare well economically as an independent country. The University believes the pension fund has been undervalued and requires no further increase in contributions in the near future to meet its obligations in the further future. In 1916, Russia believed Rasputin was dishonest. In each of these sentences, we seem to ascribe a belief to a group or collective entity. When is it correct to do this? I want to argue that there is no single answer. Rather, as Aristotle said of being, believing is said of groups in many ways---that is, a pluralist account is appropriate.</p><p>I've been thinking about this recently because I've been reading Jennifer Lackey's fascinating new book, <i><a href="https://global.oup.com/academic/product/the-epistemology-of-groups-9780199656608" target="_blank">The Epistemology of Groups</a></i> (all page numbers in what follows refer to that). In it, Lackey offers an account of group belief, justified group belief, group knowledge, and group assertion. 
I'll focus here only on the first.</p><h2 style="text-align: left;">Lackey's treatment of group belief</h2><h3 style="text-align: left;">Three accounts of group belief <br /></h3><p style="text-align: left;">Lackey considers two existing accounts of group belief as well as her own proposal. </p><p style="text-align: left;">The first, due to <a href="https://philpapers.org/rec/GILOSF-2" target="_blank">Margaret Gilbert</a> and with amendments by <a href="https://philpapers.org/rec/TUOGB" target="_blank">Raimo Tuomela</a>, is a non-summative account that treats groups as having 'a mind of their own'. Lackey calls it the <i>Joint Acceptance Account</i> (<i>JAA</i>). I'll stick with the simpler Gilbert version, since the points I'll make don't rely on Tuomela's more involved amendment (24):</p><p style="text-align: left;"><b>JAA</b> A group $G$ believes that $p$ iff it is common knowledge in $G$ that the members of $G$ individually have intentionally and openly expressed their willingness jointly to accept that $p$ with the other members of $G$.</p><p style="text-align: left;">The second, due to Philip Pettit, is a summative account that treats group belief as strongly linked to individual belief. Lackey calls it the <i>Premise-Based Aggregation Account</i> (PBAA) (29). 
Here's a rough paraphrase:</p><p style="text-align: left;"><b>PBAA</b> A group $G$ believes that $p$ iff there is some collection of propositions $q_1, \ldots, q_n$ such that (i) it is common knowledge among the operative members of $G$ that $p$ is true iff each $q_i$ is true, (ii) for each operative member of $G$, they believe $p$ iff they believe each $q_i$, and (iii) for each $q_i$, the majority of operative members of $G$ believe $q_i$.</p><p style="text-align: left;">Lackey's own proposal is the <i>Group Agent Account</i> (GAA) (48-9):</p><p style="text-align: left;"><b>GAA</b> A group $G$ believes that $p$ iff (i) there is a significant percentage of $G$'s operative members who believe that $p$, and (ii) are such that adding together the bases of their beliefs that $p$ yields a belief set that is not substantively incoherent.</p><h3 style="text-align: left;">Group lies (and bullshit) and judgment fragility: two desiderata for accounts of group belief<br /></h3><p style="text-align: left;">To distinguish between these three accounts, Lackey enumerates four desiderata for accounts of group belief that she takes to tell against JAA and PBAA and in favour of GAA. The first three are related to an objection to Gilbert's account of group belief that was developed by <a href="https://philpapers.org/rec/WRACBA-2" target="_blank">K. Brad Wray</a>, <a href="https://philpapers.org/rec/MEICAA-2" target="_blank">A. W. M. Meijers</a>, and <a href="https://philpapers.org/rec/HAKOTP" target="_blank">Raul Hakli</a> in the 2000s. According to this, JAA makes it too easy for groups to actively, consciously, and intentionally choose what they believe: all they need to do is intentionally and openly express their willingness jointly to accept the proposition in question. 
Lackey notes two consequences of this: (a) on such an account, it is difficult to give a satisfactory account of group lies (or group bullshit, though I'll focus on group lies); (b) on such an account, whether or not a group believes something at a particular time is sensitive to the group's situation at that time in a way that beliefs should not be.</p><p style="text-align: left;">So Lackey's first desideratum for an account of group belief is that it must be able to accommodate a plausible account of group lies (and the second that it accommodate group bullshit, but as I said I'll leave that for now). Suppose each member of a group strongly believes $p$ on the basis of excellent evidence that they all share, but they also know that the institution will be culpable of a serious crime if it is taken to believe $p$. Then they might jointly agree to accept $\neg p$. And, if they do, Gilbert must say that they do believe $\neg p$. But were they to assert $\neg p$, we would take the group to have lied, which would require that it believes $p$. The point is that, if a group's belief is so thoroughly within its voluntary control, it can manipulate it whenever it likes in order to avoid ever lying in situations in which dishonesty would be subject to censure. </p><p style="text-align: left;">Lackey's third desideratum for an account of group belief is that such belief should not be rendered sensitive in certain ways to the situation in which the group formed it. Suppose that, on the basis of the same shared evidence, a substantial majority of members of a group judge the horse Cisco most likely to win the race, the horse Jasper next most likely, and the horse Whiskey very unlikely to win. But, again on the basis of this same shared body of evidence, the remaining minority of members judge Whiskey most likely to win, Jasper next most likely, and Cisco very unlikely to win. 
The group would like a consensus before it reports its opinion, but time is short---the race is about to begin, say, and the group has been asked for its opinion before the starting gates open. So, in order to achieve something close to a consensus, it unanimously agrees to accept that Jasper will win, even though he is everyone's second favourite. Yet we might also assume that, had time not been short, the majority would have been able to persuade the minority of Cisco's virtues; and, in that case, they'd unanimously agree to accept that Cisco will win. So, according to Gilbert's account, under time pressure, the group believes Jasper will win, while with world enough and time, they would have believed that Cisco will win. Lackey holds that no account of group belief should make it sensitive to the situation in which it is formed in this way, and thus rejects JAA.</p><p style="text-align: left;">Lackey argues that any account of group belief must satisfy the two desiderata we've just considered. I agree that we need at least one account of group belief that satisfies the first desideratum, but I'm not convinced that all need do this---but I'll leave that for later, when I try to motivate pluralism. For now, I'd like to explain why I'm not convinced that any account needs to satisfy the second desideratum. After all, we know from various empirical studies in social psychology, as well as our experience as thinkers and reasoners and believers, that our ordinary beliefs as individuals are sensitive to the situation in which they're formed in just the sort of way that Lackey wishes to rule out for the beliefs of groups. One of the central theses of Amos Tversky and Daniel Kahneman's work is that we use a different reasoning system when we are forced to make a judgment under time pressure from the one we use when more time is available. 
So, when my implicit biases are mobilised under time pressure, I might come to believe that a particular job candidate is incompetent, while I might judge them to be competent were I to have more time to assess their track record and override my irrational hasty judgment. And, whenever we are faced with a complex body of evidence that, on the face of it, seems to point in one direction, but which, under closer scrutiny, points in the opposite direction, we will form a different belief if we must do so under time pressure than if we have greater leisure to unpick and balance the different components of the evidence. If individual beliefs can be sensitive to the situation in which they're formed in this way, I see no reason why group beliefs might not also be sensitive in this way.</p><p style="text-align: left;">Before moving on, I'd like to consider whether the PBAA---Pettit's premise-based aggregation account---satisfies Lackey's first desideratum. If it doesn't, it can't be for the same reason that Gilbert's JAA doesn't. After all, according to the PBAA, the group's belief is no more under its voluntary control than the beliefs of its individual members. If, for each $q_i$, a majority believes $q_i$, then the group believes $p$. The only way a group could manipulate its belief is by manipulating the beliefs of its members. But if that sort of manipulation rules out a group belief, Lackey's account is just as vulnerable.</p><p style="text-align: left;">So why does Lackey think that PBAA cannot adequately account for group lies? She considers a case in which the three board members of a tobacco company know that smoking is safe to health iff it doesn't cause lung cancer and it doesn't cause emphysema and it doesn't cause heart disease. 
The first member believes it doesn't cause lung cancer or heart disease, but believes it does cause emphysema, and so believes it is not safe to health; the second believes it doesn't cause emphysema or heart disease, but it does cause lung cancer, and so believes it is not safe to health; and the third believes it doesn't cause lung cancer or emphysema, but it does cause heart disease, and so believes it is not safe to health. The case is illustrated in Table 1. </p><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-YHIjnYssXPM/YGs0NjHQ1NI/AAAAAAAAFIQ/KuXsSeEASn4o8QovUJIdQqZMsbfAfNr1QCLcBGAsYHQ/s2048/IMG_1244.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1389" data-original-width="2048" height="271" src="https://1.bp.blogspot.com/-YHIjnYssXPM/YGs0NjHQ1NI/AAAAAAAAFIQ/KuXsSeEASn4o8QovUJIdQqZMsbfAfNr1QCLcBGAsYHQ/w400-h271/IMG_1244.jpg" width="400" /></a></div><br />Then each board member believes it is not safe to health, but PBAA says that it is, because a majority (first and third) believe it doesn't cause lung cancer, a majority (second and third) believe it doesn't cause emphysema, and a majority (first and second) believe it doesn't cause heart disease. If the company then asserts that it is safe to health, then Lackey claims that it lies, while PBAA says that it believes the proposition it asserts and so does not lie.<p></p><p>I think this case is a bit tricky. I suspect our reaction to it is influenced by our knowledge of how the real-world version played out and the devastating effect it has had. So let us imagine that this group of three is not the board of a tobacco company, but the scientific committee of a public health organisation. The structure of the case will be exactly the same, and the nature of the organisation should not affect whether or not belief is present. 
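The structure of the Table 1 case can be checked mechanically. Here is a minimal sketch of the premise-based aggregation at work, using the member assignments from the vignette (the premise names are mine):

```python
# Table 1: smoking is safe to health iff it causes none of the three
# conditions. Each dict records one member's beliefs about the premises
# "doesn't cause lung cancer / emphysema / heart disease".
members = [
    {"no_lung_cancer": True, "no_emphysema": False, "no_heart_disease": True},
    {"no_lung_cancer": False, "no_emphysema": True, "no_heart_disease": True},
    {"no_lung_cancer": True, "no_emphysema": True, "no_heart_disease": False},
]

def believes_safe(member):
    # an individual believes "safe" iff they believe every premise
    return all(member.values())

def pbaa_group_believes_safe(members):
    # PBAA: the group believes "safe" iff each premise commands a majority
    premises = members[0].keys()
    return all(
        sum(m[q] for m in members) > len(members) / 2 for q in premises
    )

# No individual believes smoking is safe to health...
print(any(believes_safe(m) for m in members))      # False
# ...yet premise-by-premise majorities make the group believe it is.
print(pbaa_group_believes_safe(members))           # True
```

This is just the discursive-dilemma structure: unanimous individual disbelief in the conclusion is compatible with majority support for each premise.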
Now suppose that, since the stakes are so high, each member would only come to believe of a specific putative risk that it is not present if their credence that it is not present is above 95%. That is, there is some pragmatic encroachment here to the extent that the threshold for belief is determined in part by the stakes involved. And suppose further that the first member of the scientific committee has credence 99% that smoking doesn't cause lung cancer, 99% that it doesn't cause heart disease, and 93% that it doesn't cause emphysema. And let's suppose that, by a tragic bout of bad luck that has bestowed on them very misleading evidence, the evidence available to them supports these credences. Then their credence that smoking is safe to health must be at most 93%---since the probability of a conjunction must be at most the probability of any of the conjuncts---and thus below 95%. So the first member doesn't believe it is safe to health. And suppose the same for the other two members of the committee, but for the other combinations of risks. So the second is 99% sure it doesn't cause emphysema and 99% sure it doesn't cause heart disease, but only 93% sure it doesn't cause lung cancer. And the third is 99% sure it doesn't cause lung cancer and 99% sure it doesn't cause emphysema, but only 93% sure it doesn't cause heart disease. So none of the three believe that smoking is safe to health. The case is illustrated in Table 2. 
</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-FZhr93B6w1s/YGs0jcs0hII/AAAAAAAAFIY/FRWSozNrqfUiYhEQBqPOuTuBIVBON82rgCLcBGAsYHQ/s2048/IMG_1243.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1384" data-original-width="2048" height="270" src="https://1.bp.blogspot.com/-FZhr93B6w1s/YGs0jcs0hII/AAAAAAAAFIY/FRWSozNrqfUiYhEQBqPOuTuBIVBON82rgCLcBGAsYHQ/w400-h270/IMG_1243.jpg" width="400" /></a></div><br /> However, just averaging the group's credences in each of the three specific risks, we might say that it is 97% sure that smoking doesn't cause lung cancer, 97% sure it doesn't cause emphysema, and 97% sure it doesn't cause heart disease ($\frac{0.99 + 0.99 + 0.93}{3} = 0.97$). And it is then possible that the group assigns a higher than 95% credence to the conjunction of these three. And, if it does, it seems to me, the PBAA may well get things right, and the group does not lie if it says that smoking carries no health risks. <p></p><p style="text-align: left;">Nonetheless, I think the PBAA cannot be right. In the example I just described, I noted that, just taking a straight average gives, for each specific risk, a credence of 97% that it doesn't exist. And I noted that it's then possible that the group credence that smoking is safe to health is above 95%. But of course, it's also possible that it's below 95%. This would happen, for instance, if the group were to take the three risks to be independent. Then the group credence that smoking is safe to health would be a little over 91%---too low for the group to believe it given the stakes. But PBAA would still say that the group believes that smoking is safe to health. The point is that PBAA is not sufficiently sensitive to the more fine-grained attitudes to the propositions that lie behind the beliefs in those propositions. 
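The arithmetic behind this example is easy to check. Here is a minimal sketch in Python; the credences, the 95% Lockean threshold, straight averaging as the pooling rule, and the independence assumption are all taken straight from the example above.

```python
# Each row: one committee member's credences that smoking does NOT cause
# (lung cancer, emphysema, heart disease) -- the Table 2 scenario.
members = [
    (0.99, 0.93, 0.99),  # first member: only doubts safety w.r.t. emphysema
    (0.93, 0.99, 0.99),  # second member: only doubts safety w.r.t. lung cancer
    (0.99, 0.99, 0.93),  # third member: only doubts safety w.r.t. heart disease
]
threshold = 0.95  # Lockean threshold for belief, given the high stakes

# No member believes "smoking is safe to health": a conjunction's probability
# is at most that of its least probable conjunct, and each row dips below 0.95.
assert all(min(row) < threshold for row in members)

# Yet a majority (two of three) clears the threshold for each single conjunct...
majorities = [sum(c > threshold for c in col) for col in zip(*members)]
print(majorities)  # [2, 2, 2]

# ...and straight averaging gives the group credence 0.97 in each conjunct.
avg = [sum(col) / len(col) for col in zip(*members)]
print([round(a, 2) for a in avg])  # [0.97, 0.97, 0.97]

# But if the group treats the three risks as independent, its credence in the
# conjunction is roughly 0.97^3: a little over 91%, below the belief threshold.
print(round(avg[0] * avg[1] * avg[2], 4))  # 0.9127
```

With only the members' on-off beliefs in view, PBAA attributes the belief in safety; once the credences are in view, the group's credence in the conjunction can fall below the threshold, which is the point at issue.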
Simply knowing what each member believes about the three putative risks is not sufficient to determine what the group thinks about them. You also need to look to their credences.<br /></p><p style="text-align: left;">Of course, there are lots of reasons to dislike straight averaging as a means for pooling credences---it can't preserve judgments of independence, for instance---and lots of reasons to dislike the naive application of a threshold or Lockean view of belief that is in the background here---it gives rise to the lottery paradox. But it seems that, for any reasonable method of probabilistic aggregation and any reasonable account of the relationship between belief and credence, there will be cases like this in which the PBAA says the group believes a proposition when it shouldn't. So I agree with Lackey that the PBAA sometimes gets things wrong, but I disagree about exactly when.<br /></p><h3 style="text-align: left;">Base fragility: a further desideratum</h3><p style="text-align: left;">Consider an area of science in which two theories vie for precedence, $T_1$ and $T_2$. Half of the scientists working in this area believe the following:</p><ul style="text-align: left;"><li>($A_1$) $T_1$ is simpler than $T_2$,</li><li>($B_1$) $T_2$ is more explanatory than $T_1$,</li><li>($C_1$) simplicity always trumps explanatory power in theory choice.</li></ul><p style="text-align: left;">These scientists consequently believe $T_1$. The other half of the scientists believe the following: </p><ul style="text-align: left;"><li>($A_2$) $T_2$ is simpler than $T_1$,</li><li>($B_2$) $T_1$ is more explanatory than $T_2$,</li><li>($C_2$) explanatory power always trumps simplicity in theory choice.</li></ul><p style="text-align: left;">These scientists consequently believe $T_1$. So all scientists believe $T_1$. But they do so for diametrically opposed reasons. 
Indeed, all of their beliefs about the comparisons between $T_1$ and $T_2$ are in conflict, but because their views about theory choice are also in conflict, they end up believing the same theory. Does the scientific community believe $T_1$? Lackey says no. In order for a group to believe a proposition, the bases of the members' beliefs must not be substantively incoherent. In our example, for half of the members, the basis of their belief in $T_1$ is $A_1\ \&\ B_1\ \&\ C_1$, while for the other half, it's $A_2\ \&\ B_2\ \&\ C_2$. And $A_1$ contradicts $A_2$, $B_1$ contradicts $B_2$, and $C_1$ contradicts $C_2$. The bases are about as incoherent as can be. </p><p style="text-align: left;">Is Lackey correct to say that the scientific community does not believe in this case? I'm not so sure. For one thing, attributing belief in $T_1$ would help to explain a lot of the group's behaviour. Why does the scientific community fund and pursue research projects that are of interest only if $T_1$ is true? Why does the scientific community endorse and teach from textbooks that give much greater space to expounding and explaining $T_1$? Why do departments in this area hire those with the mathematical expertise required to understand $T_1$ when that expertise is useless for understanding $T_2$? In each case, we might say: because the community believes $T_1$.</p><p style="text-align: left;">Lackey raises two worries about group beliefs based in incoherent bases: (i) they cannot be subject to rational evaluation; (ii) they cannot coherently figure in accounts of collective deliberation. On (ii), it seems to me that the group belief could figure in deliberation. Suppose the community is deliberating about whether to invite a $T_1$-theorist or a $T_2$-theorist to give the keynote address at the major conference in the area. 
It seems that the group's belief in the superiority of $T_1$ could play a role in the discussions: 'Yes, we want the speaker who will pose the greatest challenge intellectually, but we don't want to hear a string of falsehoods, so let's go with the $T_1$-theorist,' they might reason.</p><p style="text-align: left;">On (i): Lackey asks what we would say if the group were to receive new evidence that $T_1$ has greater simplicity and less explanatory power than we initially thought. For the first half of the group, this would make their belief in $T_1$ more justified; for the second half, it would make their belief less justified. What would it do to the group's belief? Without an account of justification for group belief, it's hard to say. But I don't think the incoherent bases rule out an answer. For instance, we might be reliabilists about group justification. And if we are, then we look at all the times that the members of the group have made judgments about simplicity and explanatory power that have the same pattern as they have this time---that is, half one way, half the other---and we look at the proportion of those times that the group belief---formed by whatever aggregation method we favour---has been true. If it's high, then the belief is justified; if it's not, it's not. And we can do that for the group before and after this new evidence comes in. And by doing that, we can compare the level of justification for the group belief.</p><p style="text-align: left;">Of course, this is not to say that reliabilism is the correct account of justification for group beliefs. 
But it does suggest that incoherent bases don't create a barrier to such accounts.<br /></p><h2 style="text-align: left;">Varieties of group belief</h2><p style="text-align: left;">One thing that is striking when we consider different proposed accounts of group belief is how large the supervenience base might be; that is, how many different features of a group $G$ might partially determine whether or not it believes a proposition $p$. Here's a list, though I don't pretend that it's exhaustive:</p><p style="text-align: left;">(1) <i>The beliefs of individual members of the group</i></p><p style="text-align: left;">(1a) Some accounts are concerned only with individual members' beliefs in $p$; others are interested in members' beliefs beyond that. For instance, a simple majoritarian account is interested only in members' beliefs in $p$. But Pettit's PBAA is interested instead in members' beliefs in each proposition from a set $q_1, \ldots, q_n$ whose conjunction is equivalent to $p$. And Lackey's GAA is interested in the members' beliefs in $p$ as well as the members' beliefs that form the bases for their belief in $p$ when they do believe $p$.</p><p style="text-align: left;">(1b) Some accounts are concerned with the individual beliefs of <i>all</i> members of the group, some only with so-called <i>operative members</i>. 
For instance, some will say that what determines whether a company believes $p$ is only whether or not members of their board believe $p$, while others will say that all employees of the company count.</p><p style="text-align: left;">(2) <i>The credences of individual members of the group</i></p><p style="text-align: left;">There are distinctions corresponding to (1a) and (1b) here as well.<br /></p><p style="text-align: left;">(3) <i>The outcomes of discussions between the members of the group</i></p><p style="text-align: left;">(3a) Some will say that only discussions that actually take place make a difference---you might say that, before a discussion takes place, the members of the group each believe $p$, but after they discuss it and retain those beliefs, you can say that the group believes $p$; others will say that hypothetical discussions can also make a difference---if individual members would dramatically change their beliefs were they to discuss the matter, that might mean the group does not believe, even if all members do.</p><p style="text-align: left;">(3b) Some will say that it is not the individual members' beliefs after discussion that is important, but their joint decision to accept $p$ as the group's belief. (Margaret Gilbert's JAA is such an account.)<br /></p><p style="text-align: left;">(4) <i>Belief-forming structures within the group</i></p><p style="text-align: left;">(4a) Some groups are extremely highly structured, and some of these structures relate to group belief formation. Some accounts of group belief acknowledge this by talking of 'operative members' of groups, and taking their attitudes to have greater weight in determining the group's attitude. For instance, it is common to say that the operative members of a company are its board members; the operative members of a British university might be its senior management team; the operative members of a trade union might be its executive committee. 
But of course many groups have much more complex structures than these. For instance, many large organisations are concerned with complex problems that break down into smaller problems, each of which requires a different sort of expertise to understand. The World Health Organization (WHO) might be such an example, or the Intergovernmental Panel on Climate Change (IPCC), or Médecins sans Frontières (MSF). In each case, there might be a rigid reporting structure whereby subcommittees report their findings to the main committee, but each subcommittee might form its own subcommittees that report to them; and there might be strict rules about how the findings of a subcommittee must be taken into account by the committee to which it reports before that committee itself reports upwards. In such a structure, the notion of operative members and their beliefs is too crude to capture what's necessary. <br /></p><p style="text-align: left;">(5) <i>The actions of the group </i></p><p style="text-align: left;">(5a) Some might say that a group has a belief just in case it acts in a way that is best explained by positing a group belief. Why does the scientific community persist in appointing only $T_1$-theorists and no $T_2$-theorists? Answer: It believes $T_1$. (I think Kenny Easwaran and Reuben Stern take this view in their recent joint work.)</p><p style="text-align: left;">So, in the case of group beliefs, the disagreement between different accounts does not concern only the conditions on an agreed supervenience base; it also concerns the extent of the supervenience base itself. Now, this might soften us up for pluralism, but it is hardly an argument. 
To give an argument, I'd like to consider a range of possible accounts and, for each, describe a role that group beliefs are typically taken to play and for which this account is best suited.</p><h3 style="text-align: left;">Group beliefs as summaries</h3><p style="text-align: left;">One thing we do when we ascribe beliefs to groups is simply to summarise the views of the group. If I say that, in 1916, Russia believed that Rasputin was dishonest, I simply give a summary of the views of people who belong to the group to which 'Russia' refers in this sentence, namely, Russians alive in 1916. And I say roughly that a substantial majority believed that he was dishonest. </p><p style="text-align: left;">For this role, a simple majoritarian account (SMA) seems best:</p><p style="text-align: left;"><b>SMA</b> A group $G$ believes $p$ iff a substantial majority of members of $G$ believes $p$.</p><p style="text-align: left;">There is an interesting semantic point in the background here. Consider the sentence: 'At the beginning of negotiations at Brest-Litovsk in 1917-8, Russia believed Germany's demands would be less harsh than they turned out to be.' We might suppose that, in fact, this belief was not widespread in Russia, but it was almost universal among the Bolshevik government. Then we might nonetheless say that the sentence is true. At first sight, it doesn't seem that SMA can account for this. But it might do if 'Russia' refers to different groups in the two different sentences: to the whole population in 1916 in the first sentence; to the members of the Bolshevik government in the second. </p><p style="text-align: left;">I'm tempted to think that this happens a lot when we discuss group beliefs. 
Groups are complex entities, and the name of a group might be used in one sentence to pick out some subset of its structure---just its members, for instance---and in another sentence some other subset of its structure---its members as well as its operative group, for instance---and in another sentence yet some further subset of its structure---its members, its operative group, and the rules by which the operative group abide when they are debating an issue.</p><p style="text-align: left;">Of course, this might look like straightforward synecdoche, but I'm inclined to think it's not, because it isn't clear that there is one default referent of the term 'Russia' such that all other terms are parasitic on that. Rather, there are just many many different group structures that might be picked out by the term, and we have to hope that context determines this with sufficient precision to evaluate the sentence.<br /></p><h3 style="text-align: left;">Group beliefs as attitudes that play a functional role</h3><p style="text-align: left;">An important recent development in our understanding of injustice and oppression has been the recognition of structural forms of racism, sexism, ableism, homophobia, transphobia, and so on. The notion is contested and there are many competing definitions, but to illustrate the point, let me quote from <a href="https://www.nejm.org/doi/full/10.1056/NEJMms2025396" target="_blank">a recent article</a> in the New England Journal of Medicine that considers structural racism in the US healthcare system:</p><blockquote><p style="text-align: left;">All definitions [of structural racism] make clear that racism is not simply the result of private prejudices held by individuals, but is also produced and reproduced by laws, rules, and practices, sanctioned and even implemented by various levels of government, and embedded in the economic system as well as in cultural and societal norms (Bailey, et al. 
2021).</p></blockquote><p>The point is that a group---a university, perhaps, or an entire healthcare system, or a corporation---might act as if it holds racist or sexist beliefs, even though no majority of its members holds those beliefs. A university might pay academics who are women less, promote them less frequently, and so on, even while few individuals within the organisation, and certainly not a majority, believe that women's labour is worth less, and that women are less worthy of promotion. In such a case, we might wish to ascribe those beliefs to the institution as a whole. After all, on certain functionalist accounts of belief, to have a belief simply is to be in a state that has certain causal relationships with other states, including actions. And the state of a group is determined not only by the state of the individuals within it but also by the other structural features of the group, such as its laws, rules and practices. And if the states of the individuals within the group, combined with these laws, rules and practices, give rise to the sort of behaviour that we would explain in an individual by positing a belief, it seems reasonable to do so in the group case as well. What's more, doing so helps to explain group behaviour in just the same way that ascribing beliefs to individuals helps to explain their behaviour. (As mentioned above, I take it that Kenny Easwaran and Reuben Stern take something like this view of group belief.)<br /></p><h3 style="text-align: left;">Group beliefs as ascriptions that have legal standing</h3><p style="text-align: left;">In her book, Lackey pays particular attention to cases of group belief that are relevant to corporate culpability and liability. In the 1970s, did the tobacco company Philip Morris believe that their product is hazardous to health, even while they repeatedly denied it? Between 1998 and 2014, did Volkswagen believe that their diesel emissions reports were accurate? 
In 2003, did the British government believe that Iraq could deploy biological weapons within forty-five minutes of an order to do so? Playing this role well is an important job for an account of group belief. It can have very significant real world consequences: Do those who trusted the assertions of tobacco companies and became ill as a result receive compensation? Do governments have a case against car manufacturers? Should a government stand down?</p><p style="text-align: left;">In fact, I think the consequences are often so large and, perhaps more importantly, so varied that the decision whether or not to put them in train should not depend on the applicability of a single concept with a single precise definition. Consider cases of corporate culpability. There are many ways in which this might be punished. We might fine the company. We might demand that it change certain internal policies or rules. We might demand that it change its corporate structure. We might do many things. Some will be appropriate and effective if the company believes a crucial proposition in one sense; some appropriate if it believes that proposition in some other sense. For instance, a fine does many things, but among them is this: it affects the wealth of the company's shareholders, who will react by putting pressure on the company's board. Thus, it might be appropriate to impose a fine if we think that the company believed the proposition that it denied in its public assertions in the sense that a substantial majority of its board believed it. 
On the other hand, demanding that the company change certain internal policies or rules would be appropriate if the company believes the proposition that it publicly denied in the sense that it is the outcome of applying its belief-forming rules and policies (such as, for instance, the nested set of subcommittees that I imagined for the WHO or the IPCC or MSF above).</p><p style="text-align: left;">The point is that our purpose in ascribing culpability and liability to a group is essentially pragmatic. We do it in order to determine what sort of punishment we might mete out. This is perhaps in contrast to cases of individual culpability and liability, where we are interested also in the moral status of the individual's action independent of how we respond to it. But, in many cases, such as when a corporation has lied, which punishment is appropriate depends on the sense---among the many ways in which a group can believe---in which the company believed the negation of the proposition it asserted in its lie.</p><p style="text-align: left;">So it seems to me that, even if this role were the only role that our concept of group belief had to play, pluralism would be appropriate. Groups are complex entities and there are consequently many ways in which we can seek to change them in order to avoid the sorts of harms that arise when they behave badly. We need different concepts of group belief in order to identify which is appropriate in a given case.</p><p style="text-align: left;">It's perhaps worth noting that, while Lackey opens her book with cases of corporate culpability, and this is a central motivation for her emphasis on group lying, it isn't clear to me that her group agent account (GAA) can accommodate all cases of corporate lies. Consider the following situation. The board of a tobacco company is composed of eleven people. Each of them believes that tobacco is hazardous to health. However, some believe it for very different reasons from the others. 
They have all read the same scientific literature on the topic, but six of them remember it correctly and the other five remember it incorrectly. The six who remember it correctly remember that tobacco contains chemical A and remember that when chemical A comes into contact with tissue X in the human body, it causes cancer in that tissue; and they also remember that tobacco does not contain chemical B and they remember that, when chemical B comes into contact with tissue Y in the human body, it does not cause cancer in that tissue. The five who remember the scientific literature incorrectly believe that tobacco contains chemical B and believe that when chemical B comes into contact with tissue Y in the human body, it causes cancer in that tissue; and they also believe that tobacco does not contain chemical A and they believe that, when chemical A comes into contact with tissue X in the human body, it does not cause cancer in that tissue. So, all board members believe that smoking causes cancer. However, the bases of their beliefs form an incoherent set. The two propositions on which the six base their belief directly contradict the two propositions on which the five base theirs. The board then issues a statement saying that tobacco does not cause cancer. 
The board is surely lying, but according to GAA, they are not, because the bases of their beliefs conflict and so they do not believe that tobacco does cause cancer.<br /></p><p><i>Richard Pettigrew</i></p><h2 style="text-align: left;">Permissivism and social choice: a response to Blessenohl</h2><p><i>14 March 2021</i></p><p>In <a href="https://www.journals.uchicago.edu/doi/abs/10.1086/708011" target="_blank">a recent paper</a> discussing Lara Buchak's risk-weighted expected utility theory, Simon Blessenohl notes that the objection he raises there to Buchak's theory might also tell against permissivism about rational credence. I offer a response to the objection here.</p><p>In his objection, Blessenohl suggests that credal permissivism gives rise to an unacceptable tension between the individual preferences of agents and the collective preferences of the groups to which those agents belong. He argues that, whatever brand of permissivism about credences you tolerate, there will be a pair of agents and a pair of options between which they must choose such that both agents will prefer the first to the second, but collectively they will prefer the second to the first. He argues that this consequence tells against permissivism. I respond that this objection relies on an equivocation between two different understandings of collective preferences: on the first, they are an attempt to summarise the collective view of the group; on the second, they are the preferences of a third-party social chooser tasked with making decisions on behalf of the group. 
I claim that, on the first understanding, Blessenohl's conclusion does not follow; and, on the second, it follows but is not problematic.<br /></p><p>It is well known that, if two people have different credences in a given proposition, there is a sense in which the pair of them, taken together, is vulnerable to a sure loss set of bets.* That is, there is a bet that the first will accept and a bet that the second will accept such that, however the world turns out, they'll end up collectively losing money. Suppose, for instance, that Harb is 90% confident that Ladybug will win the horse race that is about to begin, while Jay is only 60% confident. Then Harb's credences should lead him to buy a bet for £80 that will pay out £100 if Ladybug wins and nothing if she loses, while Jay's credences should lead him to sell that same bet for £70 (assuming, as we will throughout, that the utility of £$n$ is $n$). If Ladybug wins, Harb ends up £20 up and Jay ends up £30 down, so they end up £10 down collectively. And if Ladybug loses, Harb ends up £80 down while Jay ends up £70 up, so they end up £10 down as a pair.<br /><br />So, for individuals with different credences in a proposition, there seems to be a tension between how they would choose as individuals and how they would choose as a group. Suppose they are presented with a choice between two options: on the first, $A$, both of them enter into the bets just described; on the second, $B$, neither of them do. We might represent these two options as follows, where we assume that Harb's utility for receiving £$n$ is $n$, and the same for Jay:$$A = \begin{pmatrix}<br />20 & -80 \\<br />-30 & 70<br />\end{pmatrix}\ \ \ <br />B = \begin{pmatrix}<br />0 & 0 \\<br />0 & 0<br />\end{pmatrix}$$The top left entry is Harb's winnings if Ladybug wins, the top right is Harb's winnings if she loses; the bottom left is Jay's winnings if she wins, and the bottom right is Jay's winnings if she loses. 
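The sure-loss arithmetic above can be verified in a few lines of Python; this is just a sketch, and the prices, payout, and credences are exactly those of the Harb and Jay example.

```python
# Harb (90% confident Ladybug wins) buys a £100 bet on her for £80;
# Jay (60% confident) sells that same bet for £70. Utilities in pounds.
stake, harb_price, jay_price = 100, 80, 70

# Each bet is attractive by its bettor's own lights:
print(round(0.9 * stake - harb_price, 2))  # 10.0: Harb's expected gain from buying
print(round(jay_price - 0.6 * stake, 2))   # 10.0: Jay's expected gain from selling

# Yet the pair loses £10 for sure, whichever way the race goes.
outcomes = {}
for ladybug_wins in (True, False):
    payout = stake if ladybug_wins else 0
    harb = payout - harb_price   # Harb paid £80 and holds the bet
    jay = jay_price - payout     # Jay pocketed £70 but must pay out the bet
    outcomes[ladybug_wins] = (harb, jay, harb + jay)

print(outcomes[True])   # (20, -30, -10): Ladybug wins, the pair is £10 down
print(outcomes[False])  # (-80, 70, -10): Ladybug loses, the pair is £10 down again
```

The two triples are exactly the rows-and-totals reading of the matrix $A$ above: each individual does well in expectation, but the column sums are $-10$ in both states.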
So, given a matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$, each row represents a <i>gamble</i>---that is, an assignment of utilities to each state of the world---and each column represents a <i>utility distribution</i>---that is, an assignment of utilities to each individual. So $\begin{pmatrix} a & b \end{pmatrix}$ represents the gamble that the option bequeaths to Harb---$a$ if Ladybug wins, $b$ if she loses---while $\begin{pmatrix} c & d \end{pmatrix}$ represents the gamble bequeathed to Jay---$c$ if she wins, $d$ if she loses. And $\begin{pmatrix} a \\ c \end{pmatrix}$ represents the utility distribution if Ladybug wins---$a$ to Harb, $c$ to Jay---while $\begin{pmatrix} b \\ d \end{pmatrix}$ represents the utility distribution if she loses---$b$ to Harb, $d$ to Jay. Summing the entries in the first column gives the group's collective utility if Ladybug wins, and summing the entries in the second column gives their collective utility if she loses.<br /><br />Now, suppose that Harb cares only for the utility that he will gain, and Jay cares only about his own utility; neither cares at all about the other's welfare. Then each prefers $A$ to $B$. Yet, considered collectively, $B$ results in greater total utility for sure: for each column, the sum of the entries in that column in $B$ (that is, $0$) exceeds the sum in that column in $A$ (that is, $-10$). So there is a tension between what the members of the group unanimously prefer and what the group prefers.<br /><br />Now, to create this tension, I assumed that the group prefers one option to another if the total utility of the first is sure to exceed the total utility of the second. But this is quite a strong claim. And, as Blessenohl notes, we can create a similar tension by assuming something much weaker.<br /><br />Suppose again that Harb is 90% confident that Ladybug will win while Jay is only 60% confident that she will. 
Now consider the following two options:$$A' = \begin{pmatrix}<br />20 & -80 \\<br />0 & 0<br />\end{pmatrix}\ \ \ <br />B' = \begin{pmatrix}<br />5 & 5 \\<br />25 & -75<br />\end{pmatrix}$$In $A'$, Harb pays £$80$ for a £$100$ bet on Ladybug, while in $B'$ he receives £$5$ for sure. Given his credences, he should prefer $A'$ to $B'$, since the expected utility of $A'$ is $10$, while for $B'$ it is $5$. And in $A'$, Jay receives £0 for sure, while in $B'$ he pays £$75$ for a £$100$ bet on Ladybug. Given his credences, he should prefer $A'$ to $B'$, since the expected utility of $A'$ is $0$, while for $B'$ it is $-15$. But again we see that $B'$ will nonetheless end up producing greater total utility for the pair---$30$ vs $20$ if Ladybug wins, and $-70$ vs $-80$ if Ladybug loses. But we can argue in a different way that the group should prefer $B'$ to $A'$. This different way of arguing for this conclusion is the heart of Blessenohl's result.<br /><br />In what follows, we write $\preceq_H$ for Harb's preference ordering, $\preceq_J$ for Jay's, and $\preceq$ for the group's. First, we assume that, when one option gives a particular utility $a$ to Harb for sure and a particular utility $c$ to Jay for sure, then the group should be indifferent between that and the option that gives $c$ to Harb for sure and $a$ to Jay for sure. That is, the group should be indifferent between an option that gives the utility distribution $\begin{pmatrix} a \\ c\end{pmatrix}$ for sure and an option that gives $\begin{pmatrix} c \\ a\end{pmatrix}$ for sure. 
Blessenohl calls this <i>Constant Anonymity</i>:</p><p><b>Constant Anonymity</b> For any $a, c$,$$\begin{pmatrix}<br />a & a \\<br />c & c<br />\end{pmatrix} \sim<br />\begin{pmatrix}<br />c & c \\<br />a & a <br />\end{pmatrix}$$This allows us to derive the following:$$\begin{pmatrix}<br />20 & 20 \\<br />0 & 0<br />\end{pmatrix} \sim<br />\begin{pmatrix}<br />0 & 0 \\<br />20 & 20 <br />\end{pmatrix}\ \ \ \text{and}\ \ \ <br />\begin{pmatrix}<br />-80 & -80 \\<br />0 & 0<br />\end{pmatrix} \sim<br />\begin{pmatrix}<br />0 & 0 \\<br />-80 & -80 <br />\end{pmatrix}$$And now we can introduce our second principle:<br /><br /><b>Preference Dominance</b> For any $a, b, c, d, a', b', c', d'$, if$$\begin{pmatrix}<br />a & a \\<br />c & c<br />\end{pmatrix} \preceq<br />\begin{pmatrix}<br />a' & a' \\<br />c' & c'<br />\end{pmatrix}\ \ \ \text{and}\ \ \ <br />\begin{pmatrix}<br />b & b \\<br />d & d<br />\end{pmatrix} \preceq<br />\begin{pmatrix}<br />b' & b' \\<br />d' & d'<br />\end{pmatrix}$$then$$\begin{pmatrix}<br />a & b \\<br />c & d<br />\end{pmatrix} \preceq<br />\begin{pmatrix}<br />a' & b' \\<br />c' & d' <br />\end{pmatrix}$$Preference Dominance says that, if the group prefers obtaining the utility distribution $\begin{pmatrix} a \\ c\end{pmatrix}$ for sure to obtaining the utility distribution $\begin{pmatrix} a' \\ c'\end{pmatrix}$ for sure, and prefers obtaining the utility distribution $\begin{pmatrix} b \\ d\end{pmatrix}$ for sure to obtaining the utility distribution $\begin{pmatrix} b' \\ d'\end{pmatrix}$ for sure, then they prefer obtaining $\begin{pmatrix} a \\ c\end{pmatrix}$ if Ladybug wins and $\begin{pmatrix} b \\ d\end{pmatrix}$ if she loses to obtaining $\begin{pmatrix} a' \\ c'\end{pmatrix}$ if Ladybug wins and $\begin{pmatrix} b' \\ d'\end{pmatrix}$ if she loses.<br /><br />Preference Dominance, combined with the indifferences that we derived from Constant Anonymity, gives$$\begin{pmatrix}<br />20 & -80 \\<br />0 & 0<br />\end{pmatrix} \sim<br 
/>\begin{pmatrix}<br />0 & 0 \\<br />20 & -80 <br />\end{pmatrix}$$And then finally we introduce a closely related principle: </p><p><b>Utility Dominance</b> For any $a, b, c, d, a', b', c', d'$, if $a < a'$, $b < b'$, $c < c'$, and $d < d'$, then$$\begin{pmatrix}<br />a & b \\<br />c & d<br />\end{pmatrix} \prec<br />\begin{pmatrix}<br />a' & b' \\<br />c' & d' <br />\end{pmatrix}$$</p><p>This simply says that if one option gives more utility than another to each individual at each world, then the group should prefer the first to the second. So$$\begin{pmatrix}<br />0 & 0 \\<br />20 & -80 <br />\end{pmatrix} \prec<br />\begin{pmatrix}<br />5 & 5 \\<br />25 & -75<br />\end{pmatrix}$$Stringing these together, we have$$A' = \begin{pmatrix}<br />20 & -80 \\<br />0 & 0<br />\end{pmatrix} \sim<br />\begin{pmatrix}<br />0 & 0 \\<br />20 & -80 <br />\end{pmatrix} \prec<br />\begin{pmatrix}<br />5 & 5 \\<br />25 & -75<br />\end{pmatrix} = B'$$And thus, assuming that $\preceq$ is transitive, while Harb and Jay both prefer $A'$ to $B'$, the group prefers $B'$ to $A'$.<br /><br />More generally, Blessenohl proves an impossibility result. Add to the principles we have already stated the following:</p><p><b>Ex Ante Pareto</b> If $A \preceq_H B$ and $A \preceq_J B$, then $A \preceq B$.</p><p>And also:</p><p><b>Egoism</b> For any $a, b, c, d, a', b', c', d'$,$$\begin{pmatrix}<br />a & b \\<br />c & d<br />\end{pmatrix} \sim_H \begin{pmatrix}<br />a & b \\<br />c' & d'<br />\end{pmatrix}\ \ \ \text{and}\ \ \ <br />\begin{pmatrix}<br />a & b \\<br />c & d<br />\end{pmatrix} \sim_J \begin{pmatrix}<br />a' & b' \\<br />c & d<br />\end{pmatrix}$$That is, Harb cares only about the utilities he obtains from an option, and Jay cares only about the utilities that he obtains. 
And finally:</p><p><b>Individual Preference Divergence</b> There are $a, b, c, d$ such that$$\begin{pmatrix}<br />a & b \\<br />a & b<br />\end{pmatrix} \prec_H \begin{pmatrix}<br />c & d \\<br />c & d<br />\end{pmatrix}\ \ \ \text{and}\ \ \ <br />\begin{pmatrix}<br />a & b \\<br />a & b<br />\end{pmatrix} \succ_J \begin{pmatrix}<br />c & d \\<br />c & d<br />\end{pmatrix}$$Then Blessenohl shows that there are no preferences $\preceq_H$, $\preceq_J$, and $\preceq$ that satisfy Individual Preference Divergence, Egoism, Ex Ante Pareto, Constant Anonymity, Preference Dominance, and Utility Dominance.** And yet, he claims, each of these is plausible. He suggests that we should give up Individual Preference Divergence, and with it permissivism and risk-weighted expected utility theory.<br /><br />Now, the problem that Blessenohl identifies arises because Harb and Jay have different credences in the same proposition. But of course impermissivists agree that two rational individuals can have different credences in the same proposition. So why is this a problem specifically for permissivism? The reason is that, for the impermissivist, if two rational individuals have different credences in the same proposition, they must have different evidence. And for individuals with different evidence, we wouldn't necessarily want the group preference to preserve unanimous agreement between the individuals. Instead, we'd want the group to choose using whichever credences are rational in the light of the joint evidence obtained by pooling the evidence held by each individual in the group. And those might render one option preferable to the other even though each of the individuals, with their less well informed credences, prefers the second option to the first.
So Ex Ante Pareto is not plausible when the individuals have different evidence, and so impermissivism is safe.<br /><br />To see this, consider the following example: There are two medical conditions, $X$ and $Y$, that affect racehorses. If they have $X$, they're 90% likely to win the race; if they have $Y$, they're 60% likely; if they have both, they're 10% likely to win. Suppose Harb knows that Ladybug has $X$, but has no information about whether she has $Y$; and suppose Jay knows Ladybug has $Y$ and has no information about $X$. Then both are rational. And both prefer $A$ to $B$ from above. But we wouldn't expect the group to prefer $A$ to $B$, since the group should choose using the credence it's rational to have if you know both that Ladybug has $X$ and that she has $Y$; that is, the group should choose by pooling the individuals' evidence to give the group evidence, and then choose using the probabilities relative to that. And, relative to that evidence, $B$ is preferable to $A$.<br /><br />The permissivist, in contrast, cannot make this move. After all, for them it is possible for two rational individuals to disagree even though they have exactly the same evidence, and therefore the same pooled evidence. Blessenohl considers various ways the permissivist or the risk-weighted expected utility theorist might answer his objection, by denying Ex Ante Pareto, Preference Dominance, or Utility Dominance. He considers each response unsuccessful, and I tend to agree with his assessments. However, oddly, he explicitly chooses not to consider the suggestion that we might drop Constant Anonymity. I'd like to suggest that we should consider doing exactly that.<br /><br />I think Blessenohl's objection relies on an ambiguity in what the group preference ordering $\preceq$ represents.
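To make the arithmetic concrete, here is a minimal sketch in Python (my illustration, not Blessenohl's). It reuses the payoff matrices $A'$ and $B'$ from the derivation above, gives Harb credence 0.9 in a win (he knows $X$), Jay credence 0.6 (he knows $Y$), and the group the pooled credence 0.1; the assumption that the group evaluates options by summing the two individuals' utilities is mine, made purely for illustration.

```python
# Sketch of the racehorse example. In each option, rows are the payoffs
# for (Harb, Jay) and columns are the worlds (Ladybug wins, Ladybug loses).
A_prime = [[20, -80],   # Harb: 20 if she wins, -80 if she loses
           [0,   0]]    # Jay: 0 either way
B_prime = [[5,   5],
           [25, -75]]

def expected_utility(payoffs, credence_in_win):
    """Expected utility of a (win, lose) payoff pair given a credence in a win."""
    win, lose = payoffs
    return credence_in_win * win + (1 - credence_in_win) * lose

harb = 0.9    # rational credence if you know Ladybug has X
jay = 0.6     # rational credence if you know she has Y
pooled = 0.1  # rational credence if you know she has both X and Y

# Each individual evaluates only his own row (Egoism).
harb_prefers_A = expected_utility(A_prime[0], harb) > expected_utility(B_prime[0], harb)
jay_prefers_A = expected_utility(A_prime[1], jay) > expected_utility(B_prime[1], jay)

# The group uses the pooled credence; summing utilities across the two
# individuals is an aggregation assumed here for illustration only.
def group_eu(option, credence_in_win):
    return sum(expected_utility(row, credence_in_win) for row in option)

group_prefers_B = group_eu(B_prime, pooled) > group_eu(A_prime, pooled)
print(harb_prefers_A, jay_prefers_A, group_prefers_B)  # True True True
```

With these numbers, Harb and Jay each strictly prefer $A'$ on their own evidence, while the group, choosing on the pooled evidence, prefers $B'$: exactly the pattern that makes Ex Ante Pareto implausible across differing bodies of evidence.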
On one understanding, it is no more than an attempt to summarise the collective view of the group; on another, it represents the preferences of a third party brought in to make decisions on behalf of the group---the social chooser, if you will. I will argue that Ex Ante Pareto is plausible on the first understanding, but Constant Anonymity isn't; and Constant Anonymity is plausible on the second understanding, but Ex Ante Pareto isn't. <br /><br />Let's treat the first understanding of $\preceq$. On this, $\preceq$ represents the group's collective opinions about the options on offer. So just as we might try to summarise the scientific community's view on the future trajectory of Earth's average surface temperature or the mechanisms of transmission for SARS-CoV-2 by looking at the views of individual scientists, so might we try to summarise Harb and Jay's collective view of various options by looking at their individual views. Understood in this way, Constant Anonymity does not look plausible. Its motivation is, of course, straightforward. If $a < b$ and$$\begin{pmatrix}<br />a & a \\<br />b & b<br />\end{pmatrix} \prec <br />\begin{pmatrix}<br />b & b \\<br />a & a <br />\end{pmatrix}$$then the group's collective view unfairly and without justification favours Harb over Jay. And if$$\begin{pmatrix}<br />a & a \\<br />b & b<br />\end{pmatrix} \succ <br />\begin{pmatrix}<br />b & b \\<br />a & a <br />\end{pmatrix}$$then it unfairly and without justification favours Jay over Harb. So we should rule out both of these. But this doesn't entail that the group preference should be indifferent between these two options.
That is, it doesn't entail that we should have$$\begin{pmatrix}<br />a & a \\<br />b & b<br />\end{pmatrix} \sim <br />\begin{pmatrix}<br />b & b \\<br />a & a <br />\end{pmatrix}$$After all, when you compare two options $A$ and $B$, there are four possibilities:</p><ol style="text-align: left;"><li>$A \preceq B$ and $B \preceq A$---that is, $A \sim B$;</li><li>$A \preceq B$ and $B \not \preceq A$---that is, $A \prec B$;</li><li>$A \not \preceq B$ and $B \preceq A$---that is, $A \succ B$;</li><li>$A \not \preceq B$ and $B \not \preceq A$---that is, $A$ and $B$ are incomparable.</li></ol><p>The argument for Constant Anonymity rules out (2) and (3), but it does not rule out (4). What's more, it's easy to see that, if we weaken Constant Anonymity so that it requires (1) or (4) rather than requiring (1), then all of the principles are consistent with it. So introduce <i>Weak Constant Anonymity</i>:</p><p><b>Weak Constant Anonymity</b> For any $a, c$, either$$\begin{pmatrix}<br />a & a \\<br />c & c<br />\end{pmatrix} \sim<br />\begin{pmatrix}<br />c & c \\<br />a & a <br />\end{pmatrix}$$or$$\begin{pmatrix}<br />a & a \\<br />c & c<br />\end{pmatrix}\ \ \text{and}\ \ <br />\begin{pmatrix}<br />c & c \\<br />a & a <br />\end{pmatrix}\ \ \text{are incomparable}$$<br /><br />Then define the preference ordering $\preceq^*$ as follows:$$A \preceq^* B \Leftrightarrow \left ( A \preceq_H B\ \&\ A \preceq_J B \right )$$Then $\preceq^*$ satisfies Ex Ante Pareto, Weak Constant Anonymity, Preference Dominance, and Utility Dominance. And indeed $\preceq^*$ seems a very plausible candidate for the group preference ordering understood in this first way: where Harb and Jay disagree, it simply has no opinion on the matter; it has opinions only where Harb and Jay agree, and then it shares their shared opinion.
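The unanimity ordering $\preceq^*$ is easy to implement once the individual orderings are given. The sketch below is mine: it models each individual's preferences as expected utility over his own row (Egoism), with the credences 0.9 and 0.1 as hypothetical stand-ins for the two agents' divergent credences.

```python
# A sketch of the unanimity ordering ⪯*: the group weakly prefers one
# option to another iff both Harb and Jay do. Options are 2x2 payoff
# matrices: rows are (Harb, Jay), columns are the two worlds.
CRED_HARB, CRED_JAY = 0.9, 0.1  # hypothetical credences in the same proposition

def eu(row, cred):
    win, lose = row
    return cred * win + (1 - cred) * lose

def weakly_prefers_second(a, b, row, cred):
    return eu(a[row], cred) <= eu(b[row], cred)

def group_weakly_prefers_second(a, b):
    return (weakly_prefers_second(a, b, 0, CRED_HARB)
            and weakly_prefers_second(a, b, 1, CRED_JAY))

def compare(a, b):
    ab = group_weakly_prefers_second(a, b)   # a ⪯* b
    ba = group_weakly_prefers_second(b, a)   # b ⪯* a
    if ab and ba: return "indifferent"
    if ab: return "group prefers second"
    if ba: return "group prefers first"
    return "incomparable"

# Constant options related by a swap: Harb and Jay disagree about them,
# so ⪯* stays silent, as Weak Constant Anonymity permits.
print(compare([[0, 0], [20, 20]], [[20, 20], [0, 0]]))  # incomparable
# Where the two agree, ⪯* shares their shared opinion (Ex Ante Pareto).
print(compare([[0, 0], [0, 0]], [[5, 5], [5, 5]]))      # group prefers second
```

Note that for the constant swap pair the particular credences never matter: each agent's utility is the same in both worlds, so the disagreement, and hence the incomparability, is guaranteed whenever the two constant payoffs differ.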
<br /><br />On the understanding of $\preceq$ as summarising the group's collective view, if $\begin{pmatrix}<br />a & a \\<br />c & c<br />\end{pmatrix} \sim<br />\begin{pmatrix}<br />c & c \\<br />a & a <br />\end{pmatrix}$ then the group collectively thinks that this option $\begin{pmatrix}<br />a & a \\<br />c & c<br />\end{pmatrix}$ is exactly as good as this option $\begin{pmatrix}<br />c & c \\<br />a & a <br />\end{pmatrix}$. But the group absolutely does not think that. Indeed, Harb and Jay both explicitly deny it, though for opposing reasons. So Constant Anonymity is false.<br /><br />Let's turn next to the second understanding. On this, $\preceq$ is the preference ordering of the social chooser. Here, the original, stronger version of Constant Anonymity seems more plausible. After all, unlike the group itself, the social chooser should have the sort of positive commitment to equality and fairness that the group definitively does not have. As we noted above, Harb and Jay unanimously reject the egalitarian assessment represented by $\begin{pmatrix}<br />a & a \\<br />c & c<br />\end{pmatrix} \sim<br />\begin{pmatrix}<br />c & c \\<br />a & a <br />\end{pmatrix}$. They explicitly both think that these two options are not equally good---if $a < c$, then Harb thinks the second is strictly better, while Jay thinks the first is strictly better. So, as we argued above, we take the group view to be that they are incomparable. But the social chooser should not remain so agnostic. She should overrule the unanimous rejection of the indifference relation between them and accept it. But, having thus overruled one unanimous view and taken a different one, it is little surprise that she will reject other unanimous views, such as Harb and Jay's unanimous view that $A'$ is better than $B'$ above. That is, it is little surprise that she should violate Ex Ante Pareto. 
After all, her preferences are not only informed by a value that Harb and Jay do not endorse; they are informed by a value that Harb and Jay explicitly reject, given our assumption of Egoism. This is the value of fairness, which is embodied in the social chooser's preferences in Constant Anonymity and rejected in Harb's and Jay's preferences by Egoism. If we require of our social chooser that they adhere to this value, we should not expect Ex Ante Pareto to hold.</p><p>* See Philippe Mongin's 1995 paper <a href="https://www.sciencedirect.com/science/article/abs/pii/S0022053185710447" target="_blank">'Consistent Bayesian Aggregation'</a> for wide-ranging results in this area.</p><p>** Here's the trick: if$$\begin{pmatrix}<br />a & b \\<br />a & b<br />\end{pmatrix} \prec_H \begin{pmatrix}<br />c & d \\<br />c & d<br />\end{pmatrix}\ \ \ \text{and}\ \ \ <br />\begin{pmatrix}<br />a & b \\<br />a & b<br />\end{pmatrix} \succ_J \begin{pmatrix}<br />c & d \\<br />c & d<br />\end{pmatrix}$$<br />then let$$A' = \begin{pmatrix}<br />c & d \\<br />a & b<br />\end{pmatrix}\ \ \ \text{and}\ \ \ <br />B' = \begin{pmatrix}<br />a & b \\<br />c & d<br />\end{pmatrix}$$Then $A' \succ_H B'$ and $A' \succ_J B'$, but $A' \sim B'$. </p><p>Richard Pettigrew</p><p><b>Life on the edge: a response to Schultheis' challenge to epistemic permissivism about credences</b> (2021-01-06)</p><p>In their 2018 paper, <a href="https://philpapers.org/rec/SCHLOT-25" target="_blank">'Living on the Edge'</a>, Ginger Schultheis issues a powerful challenge to epistemic permissivism about credences, the view that there are bodies of evidence in response to which there are a number of different credence functions it would be rational to adopt. The heart of the argument is the claim that a certain sort of situation is impossible.
Schultheis thinks that all motivations for permissivism must render situations of this sort possible. Therefore, permissivism must be false, or at least these motivations for it must be wrong.</p><p>Here's the situation, where we write $R_E$ for the set of credence functions that it is rational to have when your total evidence is $E$. </p><ul style="text-align: left;"><li>Our agent's total evidence is $E$.</li><li>There is $c$ in $R_E$ that our agent knows is a rational response to $E$.</li><li>There is $c'$ in $R_E$ that our agent does not know is a rational response to $E$.</li></ul><p>Schultheis claims that the permissivist must take this to be possible, whereas in fact it is impossible. Here are a couple of specific examples that the permissivist will typically take to be possible.</p><p>Example 1: we might have a situation in which the credences it is rational to assign to a proposition $X$ in response to evidence $E$ form the interval $[0.4, 0.7]$. But we might not be sure of quite the extent of the interval. For all we know, it might be $[0.41, 0.7]$ or $[0.39, 0.71]$. Or it might be $[0.4, 0.7]$. So we are sure that $0.5$ is a rational credence in $X$, but we're not sure whether $0.4$ is a rational credence in $X$. In this case, $c(X) = 0.5$ and $c'(X) = 0.4$.</p><p>Example 2: you know that Probablism is a rational requirement on credence functions, and you know that satisfying the Principle of Indifference is rationally permitted, but you don't know whether or not it is also rationally required. In this case, $c$ is the uniform distribution required by the Principle of Indifference, but $c'$ is any other probability function.<br /></p><p>Schultheis then appeals to a principle called <i>Weak Rationality Dominance</i>. We say that one credence function $c$ <i>rationally dominates</i> another $c'$ if $c$ is rational in all worlds in which $c'$ is rational, and also rational in some worlds in which $c'$ is not rational. 
Weak Rationality Dominance says that it is irrational to adopt a rationally dominated credence function. The important consequence of this for Schultheis' argument is that, if you know that $c$ is rational, but you don't know whether $c'$ is, then $c'$ is irrational. As a result, in our example above, $c'$ is not rational, contrary to what the permissivist claims, because it is rationally dominated by $c$. So permissivism must be false.<br /></p><p>If Weak Rationality Dominance is correct, it follows that the permissivist must say that, for any body of evidence $E$ and set $R_E$ of rational responses, the agent with evidence $E$ either <i>must know of each</i> credence function in $R_E$ that it is in $R_E$, or they <i>must not know of any</i> credence function in $R_E$ that it is in $R_E$. If they <i>know of some</i> credence functions in $R_E$ that they are in $R_E$ but <i>do not know of others</i> in $R_E$ that they are in $R_E$, then they clash with Weak Rationality Dominance. But, whatever your reason for being a permissivist, it seems very likely that it will entail situations in which there are some credence functions that are rational responses to your evidence and that you know are such responses, while there are other credence functions that are, in fact, rational responses, but about which you are unsure whether they are. This is Schultheis' challenge.</p><p>I'd like to explore a response to Schultheis' argument that takes issue with Weak Rationality Dominance (WRD). I'll spell out the objection in general to begin with, and then see how it plays out for a specific motivation for permissivism, namely, the Jamesian motivation I sketched in <a href="https://philpapers.org/rec/SCHLOT-25" target="_blank">this previous blogpost</a>.
</p><p>One worry about WRD is that it seems to entail a deference principle of exactly the sort that I objected to in <a href="https://m-phi.blogspot.com/2020/12/deferring-to-rationality-does-it.html" target="_blank">this blogpost</a>. According to such deference principles, for certain agents in certain situations, if they learn of a credence function that it is rational, they should adopt it. For instance, Ben Levinstein claims that, if you are certain that you are irrational, and you learn that $c$ is rational, then you should adopt $c$ -- or at least you should have the conditional credences that would lead you to do this if you were to apply conditionalization. We might slightly strengthen Levinstein's version of the deference principle as follows: if you are unsure whether you are rational or not, and you learn that $c$ is rational, then you should adopt $c$. WRD entails this deference principle. After all, suppose you have credence function $c'$, and you are unsure whether or not it is rational. And suppose you learn that $c$ is rational (and don't thereby learn that $c'$ is as well). Then, according to Schultheis' principle, you are irrational if you stick with $c'$.</p><p>In the previous blogpost, I objected to Levinstein's deference principle, and others like it, because it relies on the assumption that all rational credence functions are better than all irrational credence functions. I think that's false. I think there are certain sorts of flaw that render you irrational, and lacking those flaws renders you rational. But lacking those flaws doesn't ensure that you're going to be better than someone who has those flaws. Consider, for instance, the extreme subjective Bayesian who justifies their position using an accuracy dominance argument of the sort pioneered by Jim Joyce. That is, they say that accuracy is the sole epistemic good for credence functions. 
And they say that non-probabilistic credence functions are irrational because, for any such credence function, there are probabilistic ones that accuracy dominate them; and all probabilistic credence functions are rational because, for any such credence function, there is no probabilistic one that accuracy dominates it. Now, suppose I have credence $0.91$ in $X$ and $0.1$ in $\overline{X}$. And suppose I am either sure that this is irrational, or I'm uncertain whether it is. I then learn that assigning credence $0.1$ to $X$ and $0.9$ to $\overline{X}$ is rational. What should I do? It isn't at all obvious to me that I should move from my credence function to the one I've learned is rational. After all, even from my slightly incoherent standpoint, it's possible to see that the rational one is going to be a lot less accurate than mine if $X$ is true, and I'm very confident that it is.
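The point is easy to check numerically. Here is a quick computation, mine rather than Joyce's, using the additive Brier score as the accuracy measure, comparing my slightly incoherent credences $(0.91, 0.1)$ in $(X, \overline{X})$ with the rational $(0.1, 0.9)$ at each world.

```python
# Additive Brier inaccuracy of the credence pair (c_X, c_notX) at a world
# where X has truth value x (1 = true, 0 = false): squared distance of
# each credence from the truth value of its proposition.
def brier_inaccuracy(c_x, c_notx, x):
    return (x - c_x) ** 2 + ((1 - x) - c_notx) ** 2

mine = (0.91, 0.1)     # slightly incoherent: the credences sum to 1.01
rational = (0.1, 0.9)  # probabilistic, and (I learn) rational

print(brier_inaccuracy(*mine, 1), brier_inaccuracy(*rational, 1))  # ≈ 0.0181 vs 1.62
print(brier_inaccuracy(*mine, 0), brier_inaccuracy(*rational, 0))  # ≈ 1.6381 vs 0.02
```

At the $X$-world, which I am very confident obtains, my incoherent credences are almost a hundred times more accurate than the rational ones, which is why deferring to the rational function looks, from my standpoint, like a large expected accuracy loss.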
However, in a situation in which you have no credence function, and you are unsure whether $c'$ is rational (even though it is) and you're certain that $c$ is rational (and indeed it is), WRD's demand that you should not pick $c'$ seems more reasonable. You occupy no point of view such that $c'$ is less of a departure from that point of view than $c$ is. You know only that $c$ lacks the flaws for sure, whereas $c'$ might have them. Better, then, to go for $c$, is it not? And if it is, this is enough to defeat permissivism.</p><p>I think it's not quite that simple. I noted above that Levinstein's deference principle relies on the assumption that all rational credence functions are better than all irrational credence functions. Schultheis' WRD seems to rely on something even stronger, namely, the assumption that all rational credence functions are equally good in all situations. For suppose they are not. You might then be unsure whether $c'$ is rational (though it is) and sure that $c$ is rational (and it is), but nonetheless rationally opt for $c'$ because you know that $c'$ has some good feature that you know $c$ lacks and you're willing to take the risk of having an irrational credence function in order to open the possibility of having that good feature.</p><p>Here's an example. You are unsure whether it is rational to assign $0.7$ to $X$ and $0.3$ to $\overline{X}$. It turns out that it is, but you don't know that. On the other hand, you do know that it is rational to assign 0.5 to each proposition. But the first assignment and the second are not equally good in all situations. The second has the same accuracy whether $X$ is true or false; the first, in contrast, is better than the second if $X$ is true and worse than the second if $X$ is false. The second does not open up the possibility of high accuracy that the first does; though, to compensate, it also precludes the possibility of low accuracy, which the first doesn't.
Surveying the situation, you think that you will take the risk. You'll adopt the first, even though you aren't sure whether or not it is rational. And you'll do this because you want the possibility of being rational and having that higher accuracy. This seems a rational thing to do. So, it seems to me, WRD is false.<br /></p><p>Although I think this objection to WRD works, I think it's helpful to see how it might play out for a particular motivation for permissivism. Here's the motivation: Some credence functions offer the promise of great accuracy -- for instance, assigning 0.9 to $X$ and 0.1 to $\overline{X}$ will be very accurate if $X$ is true. However, those that do so also open the possibility of great inaccuracy -- if $X$ is false, the credence function just considered is very inaccurate. Other credence functions neither offer great accuracy nor risk great inaccuracy. For instance, assigning 0.5 to both $X$ and $\overline{X}$ guarantees the same inaccuracy whether or not $X$ is true. You might say that the lower the maximum possible inaccuracy you are willing to risk, the more risk-averse you are. Thus, the options that are rational for you are those undominated options with maximum inaccuracy at most whatever the threshold is that you set. Now, suppose you use the Brier score to measure your inaccuracy -- so that the inaccuracy of the credence function $c(X) = p$ and $c(\overline{X}) = 1-p$ is $2(1-p)^2$ if $X$ is true and $2p^2$ if $X$ is false. And suppose you are willing to tolerate a maximum possible inaccuracy of $0.5$, which also gives you a minimum inaccuracy of $0.5$. In that case, only $c(X) = 0.5 = c(\overline{X})$ will be rational from the point of view of your risk attitudes --- since $2(1-0.5)^2 = 0.5 = 2(0.5^2)$. On the other hand, suppose you are willing to tolerate a maximum inaccuracy of $0.98$, which also gives you a minimum inaccuracy of $0.18$.
In that case, any credence function $c$ with $0.3 \leq c(X) \leq 0.7$ and $c(\overline{X}) = 1-c(X)$ is rational from the point of view of your risk attitudes.</p><p>Now, suppose that you are in the sort of situation that Schultheis imagines. You are uncertain of the extent of the set $R_E$ of rational responses to your evidence $E$. On the account we're considering, this must be because you are uncertain of your own attitudes to epistemic risk. Let's say that the threshold of maximum inaccuracy that you're willing to tolerate is $0.98$, but you aren't certain of that --- you think it might be anything between $0.72$ and $1.28$. So you're sure that it's rational to assign anything between 0.4 and 0.6 to $X$, but unsure whether it's rational to assign $0.7$ to $X$ --- if your threshold turns out to be less than 0.98, then assigning $0.7$ to $X$ would be irrational, because it risks inaccuracy of $0.98$. In this situation, is it rational to assign $0.7$ to $X$? I think it is. Among the credence functions that you know for sure are rational, the ones that give you the lowest possible inaccuracy are the one that assigns 0.4 to $X$ and the one that assigns 0.6 to $X$. They have maximum inaccuracy of 0.72, and they open up the possibility of an inaccuracy of 0.32, which is lower than the lowest possible inaccuracy opened up by any others that you know to be rational. On the other hand, assigning 0.7 to $X$ opens up the possibility of an inaccuracy of 0.18, which is considerably lower. As a result, it doesn't seem irrational to assign 0.7 to $X$, even though you don't know whether it is rational from the point of view of your attitudes to risk, and you do know that assigning 0.6 is rational. </p><p>There is another possible response to Schultheis' challenge for those who like this sort of motivation for permissivism. 
You might simply say that, if your attitudes to risk are such that you will tolerate a maximum inaccuracy of at most $t$, then regardless of whether you know this fact, indeed regardless of your level of uncertainty about it, the rational credence functions are precisely those that have maximum inaccuracy of at most $t$. This sort of approach is familiar from expected utility theory. Suppose I have credences in $X$ and in $\overline{X}$. And suppose I face two options whose utility is determined by whether $X$ is true or false. Then, regardless of what I believe about my credences in $X$ and $\overline{X}$, I should choose whichever option maximises expected utility from the point of view of my actual credences. The point is this: if what it is rational for you to believe or to do is determined by some feature of you, whether it's your credences or your attitudes to risk, being uncertain about those features doesn't change what it is rational for you to do. This introduces a certain sort of externalism to our notion of rationality. There are features of ourselves -- our credences or our attitudes to risk -- that determine what it is rational for us to believe or do, which are nonetheless not luminous to us. But I think this is inevitable. Of course, we might move up a level and create a version of expected utility theory that appeals not to our first-order credences but to our credences concerning those first-order credences -- perhaps you use the higher-order credences to define a higher-order expected value for the first-order expected utilities, and you maximize that. But it simply pushes the problem back a step. For your higher-order credences are no more luminous than your first-order ones. And to stop the regress, you must fix some level at which the credences at that level simply determine the expectation that rationality requires you to maximize, and any uncertainty concerning those does not affect rationality. And the same goes in this case.
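The risk-threshold picture from the example above is easy to compute with. This sketch (mine, for illustration) uses the post's Brier convention, on which the credence pair $(p, 1-p)$ has inaccuracy $2(1-p)^2$ if $X$ is true and $2p^2$ if $X$ is false, and recovers the two intervals mentioned earlier: tolerance $0.5$ permits only the credence $0.5$, while tolerance $0.98$ permits exactly $[0.3, 0.7]$.

```python
# Worst-case Brier inaccuracy of the credence pair (p, 1-p), following the
# post's convention: 2(1-p)^2 at X-worlds, 2p^2 at not-X-worlds.
def max_inaccuracy(p):
    return max(2 * (1 - p) ** 2, 2 * p ** 2)

def rational_for_threshold(t, grid_steps=1000):
    """Grid of credences whose worst-case inaccuracy is within tolerance t."""
    grid = [i / grid_steps for i in range(grid_steps + 1)]
    return [p for p in grid if max_inaccuracy(p) <= t + 1e-12]

cautious = rational_for_threshold(0.5)   # tolerate at most 0.5
bolder = rational_for_threshold(0.98)    # tolerate at most 0.98
print(min(cautious), max(cautious))  # 0.5 0.5
print(min(bolder), max(bolder))      # 0.3 0.7
```

The small `1e-12` slack guards against floating-point rounding at the boundary credences 0.3 and 0.7, whose worst-case inaccuracy sits exactly at the 0.98 threshold.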
So, given this particular motivation for permissivism, which appeals to your attitudes to epistemic risk, it seems that there is another reason why WRD is false. If $c$ is in $R_E$, then it is rational for you, regardless of your epistemic attitude to its rationality.<br /></p><p>Richard Pettigrew</p><p><b>Using a generalized Hurwicz criterion to pick your priors</b> (2021-01-04)</p><p>Over the summer, I got interested in the problem of the priors again. Which credence functions is it rational to adopt at the beginning of your epistemic life? Which credence functions is it rational to have before you gather any evidence? Which credence functions provide rationally permissible responses to the empty body of evidence? As is my wont, I sought to answer this in the framework of epistemic utility theory. That is, I took the rational credence functions to be those declared rational when the appropriate norm of decision theory is applied to the decision problem in which the available acts are all the possible credence functions, and where the epistemic utility of a credence function is measured by a strictly proper measure. I considered a number of possible decision rules that might govern us in this evidence-free situation: <a href="https://m-phi.blogspot.com/2020/07/hurwiczs-criterion-of-realism-and.html" target="_blank">Maximin, the Principle of Indifference, and the Hurwicz criterion</a>. And I concluded in favour of <a href="https://m-phi.blogspot.com/2020/07/a-generalised-hurwicz-criterion.html" target="_blank">a generalized version of the Hurwicz criterion</a>, which I axiomatised.
I also <a href="https://m-phi.blogspot.com/2020/07/taking-risks-and-picking-priors.html" target="_blank">described</a> which credence functions that decision rule would render rational in the case in which there are just three possible worlds between which we divide our credences. In this post, I'd like to generalize the results from that treatment to the case in which there is any finite number of possible worlds.</p><p>Here's the decision rule (where $a(w_i)$ is the utility of $a$ at world $w_i$).<br /></p><p><b>Generalized Hurwicz Criterion </b>Given an option $a$ and a sequence of weights $0 \leq \lambda_1, \ldots, \lambda_n \leq 1$ with $\sum^n_{i=1} \lambda_i = 1$, which we denote $\Lambda$, define the generalized Hurwicz score of $a$ relative to $\Lambda$ as follows: if $$a(w_{i_1}) \geq a(w_{i_2}) \geq \ldots \geq a(w_{i_n})$$ then $$H^\Lambda(a) := \lambda_1a(w_{i_1}) + \ldots + \lambda_na(w_{i_n})$$That is, $H^\Lambda(a)$ is the weighted average of all the possible utilities that $a$ receives, where $\lambda_1$ weights the highest utility, $\lambda_2$ weights the second highest, and so on.</p><p>The Generalized Hurwicz Criterion says that you should order options by their generalized Hurwicz score relative to a sequence $\Lambda$ of weightings of your choice. Thus, given $\Lambda$,$$a \preceq^\Lambda_{ghc} a' \Leftrightarrow H^\Lambda(a) \leq H^\Lambda(a')$$And the corresponding decision rule says that you should pick your Hurwicz weights $\Lambda$ and then, having done that, it is irrational to choose $a$ if there is $a'$ such that $a \prec^\Lambda_{ghc} a'$.</p><p>Now, let $\mathfrak{U}$ be an additive strictly proper epistemic utility measure. That is, it is generated by a strictly proper scoring rule.
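The generalized Hurwicz score just defined is simple to compute: sort the utilities in descending order and take the weighted sum. Here is a minimal sketch (the utilities and weights are made-up examples of my own); note that putting all the weight on $\lambda_1$ recovers the most risk-seeking, best-case rule, while putting it all on $\lambda_n$ recovers Maximin.

```python
def generalized_hurwicz_score(utilities, weights):
    """H^Λ(a): weight the highest utility by λ1, the next highest by λ2,
    and so on. `utilities` lists a's utility at each world; `weights`
    is the sequence Λ, which must sum to 1."""
    assert len(weights) == len(utilities)
    assert abs(sum(weights) - 1) < 1e-9
    ranked = sorted(utilities, reverse=True)  # best outcome first
    return sum(lam * u for lam, u in zip(weights, ranked))

a = [1.0, 5.0, 3.0]  # a's utility at each of three worlds
print(generalized_hurwicz_score(a, [1, 0, 0]))        # 5.0 (best case only)
print(generalized_hurwicz_score(a, [0, 0, 1]))        # 1.0 (Maximin)
print(generalized_hurwicz_score(a, [0.5, 0.3, 0.2]))  # 0.5·5 + 0.3·3 + 0.2·1 = 3.6
```

The third call illustrates the definition directly: the worlds' own order is irrelevant; only the ranking of the utilities matters.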
A <i>strictly proper scoring rule</i> is a function $\mathfrak{s} : \{0, 1\} \times [0, 1] \rightarrow [-\infty, 0]$ such that, for any $0 \leq p \leq 1$, $p\mathfrak{s}(1, x) + (1-p)\mathfrak{s}(0, x)$ is maximized, as a function of $x$, uniquely at $x = p$. And an epistemic utility measure is generated by $\mathfrak{s}$ if, for any credence function $C$ and world $w_i$,$$\mathfrak{U}(C, w_i) = \sum^n_{j=1} \mathfrak{s}(w^j_i, c_j)$$where</p><ul style="text-align: left;"><li>$c_j = C(w_j)$, and <br /></li><li>$w^j_i = 1$ if $j=i$ and $w^j_i = 0$ if $j \neq i$ </li></ul><p>In what follows, we write the sequence $(c_1, \ldots, c_n)$ to represent the credence function $C$.</p><p>Also, given a sequence $A = (\alpha_1, \ldots, \alpha_n)$ of numbers, let$$\mathrm{av}(A) := \frac{\alpha_1 + \ldots + \alpha_n}{n}$$That is, $\mathrm{av}(A)$ is the average of the numbers in $A$. And given $1 \leq k \leq n$, let $A|_k = (\alpha_1, \ldots, \alpha_k)$. That is, $A|_k$ is the truncation of the sequence $A$ that omits all terms after $\alpha_k$. Then we say that $A$ does not exceed its average if, for each $1 \leq k \leq n$,$$\mathrm{av}(A) \geq \mathrm{av}(A|_k)$$That is, at no point in the sequence does the average of the numbers up to that point exceed the average of all the numbers in the sequence. <br /></p><p><b>Theorem 1</b> Suppose $\Lambda = (\lambda_1, \ldots, \lambda_n)$ is a sequence of generalized Hurwicz weights.
Then there is a sequence of subsequences $\Lambda_1, \ldots, \Lambda_m$ of $\Lambda$ such that</p><ol style="text-align: left;"><li>$\Lambda = \Lambda_1 \frown \ldots \frown \Lambda_m$</li><li>$\mathrm{av}(\Lambda_1) \geq \ldots \geq \mathrm{av} (\Lambda_m)$</li><li>each $\Lambda_i$ does not exceed its average</li></ol><p>Then, the credence function$$(\underbrace{\mathrm{av}(\Lambda_1), \ldots, \mathrm{av}(\Lambda_1)}_{\text{length of $\Lambda_1$}}, \underbrace{\mathrm{av}(\Lambda_2), \ldots, \mathrm{av}(\Lambda_2)}_{\text{length of $\Lambda_2$}}, \ldots, \underbrace{\mathrm{av}(\Lambda_m), \ldots, \mathrm{av}(\Lambda_m)}_{\text{length of $\Lambda_m$}})$$maximizes $H^\Lambda(\mathfrak{U}(-))$ among credence functions $C = (c_1, \ldots, c_n)$ for which $c_1 \geq \ldots \geq c_n$.</p><p>This is enough to give us all of the credence functions that maximise $H^\Lambda(\mathfrak{U}(-))$: they are the credence function mentioned together with any permutation of it --- that is, any credence function obtained from that one by switching around the credences assigned to the worlds.<br /></p><p><i>Proof of Theorem 1.</i> Suppose $\mathfrak{U}$ is a measure of epistemic value that is generated by the strictly proper scoring rule $\mathfrak{s}$. And suppose that $\Lambda$ is the following sequence of generalized Hurwicz weights $0 \leq \lambda_1, \ldots, \lambda_n \leq 1$ with $\sum^n_{i=1} \lambda_i = 1$. <br /><br />First, due to a theorem that originates in Savage and is stated and proved fully by Predd, et al., if $C$ is not a probability function---that is, if $c_1 + \ldots + c_n \neq 1$---then there is a probability function $P$ such that $\mathfrak{U}(P, w_i) > \mathfrak{U}(C, w_i)$ for all worlds $w_i$. Thus, since GHC satisfies Strong Dominance, whatever maximizes $H^\Lambda(\mathfrak{U}(-))$ will be a probability function. <br /><br />Now, since $\mathfrak{U}$ is generated by a strictly proper scoring rule, it is also truth-directed. 
That is, if $c_i > c_j$, then $\mathfrak{U}(C, w_i) > \mathfrak{U}(C, w_j)$. Thus, if $c_1 \geq c_2 \geq \ldots \geq c_n$, then$$H^\Lambda(\mathfrak{U}(C)) = \lambda_1\mathfrak{U}(C, w_1) + \ldots + \lambda_n\mathfrak{U}(C, w_n)$$This is what we seek to maximize. But notice that this is just the expectation of $\mathfrak{U}(C)$ from the point of view of the probability distribution $\Lambda = (\lambda_1, \ldots, \lambda_n)$.<br /><br />Now, Savage also showed that, if $\mathfrak{s}$ is strictly proper and continuous, then there is a differentiable and strictly convex function $\varphi$ such that, if $P, Q$ are probabilistic credence functions, then<br />\begin{eqnarray*}<br />\mathfrak{D}_\mathfrak{s}(P, Q) & = & \sum^n_{i=1} \varphi(p_i) - \sum^n_{i=1} \varphi(q_i) - \sum^n_{i=1} \varphi'(q_i)(p_i - q_i) \\<br />& = & \sum^n_{i=1} p_i\mathfrak{U}(P, w_i) - \sum^n_{i=1} p_i\mathfrak{U}(Q, w_i)<br />\end{eqnarray*}<br />So $C$ maximizes $H^\Lambda(\mathfrak{U}(-))$ among credence functions $C$ with $c_1 \geq \ldots \geq c_n$ iff $C$ minimizes $\mathfrak{D}_\mathfrak{s}(\Lambda, -)$ among credence functions $C$ with $c_1 \geq \ldots \geq c_n$. We now use the KKT conditions to calculate which credence functions minimize $\mathfrak{D}_\mathfrak{s}(\Lambda, -)$ among credence functions $C$ with $c_1 \geq \ldots \geq c_n$. 
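As a quick numerical sanity check of the equivalence just stated, here is a sketch for the Brier score, for which one may take $\varphi(x) = x^2 - x$, so that $\mathfrak{D}_\mathfrak{s}(\Lambda, C)$ is the squared Euclidean distance $\sum_i (\lambda_i - c_i)^2$ (the weights and credence functions below are made up for illustration):

```python
def brier_utility(c, i):
    # U(C, w_i) under the Brier score: negative squared distance from C
    # to the omniscient credence function of world w_i
    return -sum(((1.0 if j == i else 0.0) - cj) ** 2 for j, cj in enumerate(c))

def ghc_score(lam, c):
    # H^Lambda(U(C)) for C with c_1 >= ... >= c_n: truth-directedness
    # lines the i-th Hurwicz weight up with world w_i
    return sum(l * brier_utility(c, i) for i, l in enumerate(lam))

def brier_divergence(lam, c):
    # D_s(Lambda, C) for the Brier score: squared Euclidean distance
    return sum((l - x) ** 2 for l, x in zip(lam, c))

lam = [0.5, 0.3, 0.2]
for c in [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.4, 0.35, 0.25]]:
    # the sum is constant in c, so maximizing H is minimizing D
    print(round(ghc_score(lam, c) + brier_divergence(lam, c), 10))  # -0.62 each time
```

The constant is just $\Lambda$'s expectation of its own epistemic utility, which does not depend on $C$; that is why maximizing the one quantity is minimizing the other.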
<br /><br />Thus, if we write $x_n$ for $1 - x_1 - \ldots - x_{n-1}$, then <br />\begin{multline*}<br />f(x_1, \ldots, x_{n-1}) = \mathfrak{D}((\lambda_1, \ldots, \lambda_n), (x_1, \ldots, x_n)) = \\<br />\sum^n_{i=1} \varphi(\lambda_i) - \sum^n_{i=1} \varphi(x_i) - \sum^n_{i=1} \varphi'(x_i)(\lambda_i - x_i) <br />\end{multline*}<br />So<br />\begin{multline*}<br />\nabla f = \langle \varphi''(x_1) (x_1 - \lambda_1) - \varphi''(x_n)(x_n - \lambda_n), \\<br />\varphi''(x_2) (x_2 - \lambda_2) - \varphi''(x_n)(x_n - \lambda_n), \ldots \\<br />\varphi''(x_{n-1}) (x_{n-1} - \lambda_{n-1}) - \varphi''(x_n)(x_n - \lambda_n) \rangle<br />\end{multline*}<br /><br />Let $$\begin{array}{rcccl}<br />g_1(x_1, \ldots, x_{n-1}) & = & x_2 - x_1& \leq & 0\\<br />g_2(x_1, \ldots, x_{n-1}) & = & x_3 - x_2& \leq & 0\\<br />\vdots & \vdots & \vdots & \vdots & \vdots \\<br />g_{n-2}(x_1, \ldots, x_{n-1}) & = & x_{n-1} - x_{n-2}& \leq & 0 \\<br />g_{n-1}(x_1, \ldots, x_{n-1}) & = & 1 - x_1 - \ldots - x_{n-2} - 2x_{n-1} & \leq & 0<br />\end{array}$$So,<br />\begin{eqnarray*}<br />\nabla g_1 & = & \langle -1, 1, 0, \ldots, 0 \rangle \\<br />\nabla g_2 & = & \langle 0, -1, 1, 0, \ldots, 0 \rangle \\<br />\vdots & \vdots & \vdots \\<br />\nabla g_{n-2} & = & \langle 0, \ldots, 0, -1, 1 \rangle \\<br />\nabla g_{n-1} & = & \langle -1, -1, -1, \ldots, -1, -2 \rangle \\ <br />\end{eqnarray*}<br />So the KKT theorem says that $(x_1, \ldots, x_{n-1})$ is a minimizer iff there are $0 \leq \mu_1, \ldots, \mu_{n-1}$ such that$$\nabla f(x_1, \ldots, x_{n-1}) + \sum^{n-1}_{i=1} \mu_i \nabla g_i(x_1, \ldots, x_{n-1}) = 0$$That is, iff there are $0 \leq \mu_1, \ldots, \mu_{n-1}$ such that<br />\begin{eqnarray*}<br />\varphi''(x_1) (x_1 - \lambda_1) - \varphi''(x_n)(x_n - \lambda_n) - \mu_1 - \mu_{n-1} & = & 0 \\<br />\varphi''(x_2) (x_2 - \lambda_2) - \varphi''(x_n)(x_n - \lambda_n) + \mu_1 - \mu_2 - \mu_{n-1} & = & 0 \\<br />\vdots & \vdots & \vdots \\<br />\varphi''(x_{n-2}) (x_{n-2} - \lambda_{n-2}) -
\varphi''(x_n)(x_n - \lambda_n) + \mu_{n-3} - \mu_{n-2} - \mu_{n-1}& = & 0 \\<br />\varphi''(x_{n-1}) (x_{n-1} - \lambda_{n-1}) - \varphi''(x_n)(x_n - \lambda_n)+\mu_{n-2} - 2\mu_{n-1} & = & 0<br />\end{eqnarray*}<br />By summing these identities, we get:<br />\begin{eqnarray*}<br />\mu_{n-1} & = & \frac{1}{n} \sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{n-1}{n} \varphi''(x_n)(x_n - \lambda_n) \\<br />&= & \frac{1}{n} \sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \varphi''(x_n)(x_n - \lambda_n) \\<br />& = & \sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{n-1}{n}\sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i)<br />\end{eqnarray*}<br />So, for $1 \leq k \leq n-2$,<br />\begin{eqnarray*}<br />\mu_k & = & \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - k\varphi''(x_n)(x_n - \lambda_n) - \\<br />&& \hspace{20mm} \frac{k}{n}\sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) + k\frac{n-1}{n} \varphi''(x_n)(x_n - \lambda_n) \\<br />& = & \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{k}{n}\sum^{n-1}_{i=1} \varphi''(x_i)(x_i - \lambda_i) -\frac{k}{n} \varphi''(x_n)(x_n - \lambda_n) \\<br />&= & \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{k}{n}\sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i)<br />\end{eqnarray*}<br />So, for $1 \leq k \leq n-1$,<br />$$\mu_k = \sum^k_{i=1} \varphi''(x_i)(x_i - \lambda_i) - \frac{k}{n}\sum^n_{i=1} \varphi''(x_i)(x_i - \lambda_i)$$<br />Now, suppose that there is a sequence of subsequences $\Lambda_1, \ldots, \Lambda_m$ of $\Lambda$ such that<br /></p><ol style="text-align: left;"><li>$\Lambda = \Lambda_1 \frown \ldots \frown \Lambda_m$</li><li>$\mathrm{av}(\Lambda_1) \geq \ldots \geq \mathrm{av}(\Lambda_m)$</li><li>each $\Lambda_i$ does not exceed its average.<br /></li></ol><p>And let $$P = (\underbrace{\mathrm{av}(\Lambda_1), \ldots, \mathrm{av}(\Lambda_1)}_{\text{length of $\Lambda_1$}}, \underbrace{\mathrm{av}(\Lambda_2), \ldots, \mathrm{av}(\Lambda_2)}_{\text{length of $\Lambda_2$}}, \ldots, 
\underbrace{\mathrm{av}(\Lambda_m), \ldots, \mathrm{av}(\Lambda_m)}_{\text{length of $\Lambda_m$}})$$Then we write $i \in \Lambda_j$ if $\lambda_i$ is in the subsequence $\Lambda_j$. So, for $i \in \Lambda_j$, $p_i = \mathrm{av}(\Lambda_j)$. Then$$\frac{k}{n}\sum^n_{i=1} \varphi''(p_i)(p_i - \lambda_i) = \frac{k}{n} \sum^m_{j = 1} \sum_{i \in \Lambda_j} \varphi''(\mathrm{av}(\Lambda_j))(\mathrm{av}(\Lambda_j) - \lambda_i) = 0 $$<br />Now, suppose $k$ is in $\Lambda_j$. Then<br />\begin{multline*}<br />\mu_k = \sum^k_{i=1} \varphi''(p_i)(p_i - \lambda_i) = \\<br />\sum_{i \in \Lambda_1} \varphi''(p_i)(p_i - \lambda_i) + \sum_{i \in \Lambda_2} \varphi''(p_i)(p_i - \lambda_i) + \ldots + \\<br />\sum_{i \in \Lambda_{j-1}} \varphi''(p_i)(p_i - \lambda_i) + \sum_{i \in \Lambda_j|_k} \varphi''(p_i)(p_i - \lambda_i) = \\<br />\sum_{i \in \Lambda_j|_k} \varphi''(p_i)(p_i - \lambda_i) = \sum_{i \in \Lambda_j|_k} \varphi''(\mathrm{av}(\Lambda_j))(\mathrm{av}(\Lambda_j) - \lambda_i) <br />\end{multline*}<br />So, if $|\Lambda|$ is the length of the sequence $\Lambda$,$$\mu_k \geq 0 \Leftrightarrow |\Lambda_j|_k|\mathrm{av}(\Lambda_j) - \sum_{i \in \Lambda_j|_k} \lambda_i \geq 0 \Leftrightarrow \mathrm{av}(\Lambda_j) \geq \mathrm{av}(\Lambda_j|_k)$$But, by assumption, this is true for all $1 \leq k \leq n-1$. So $P$ minimizes $\mathfrak{D}_\mathfrak{s}(\Lambda, -)$, and therefore maximizes $H^\Lambda(\mathfrak{U}(-))$, as required.<br /><br />We now show that there is always a sequence of subsequences that satisfies (1), (2), (3) from above. We proceed by induction. </p><p><i>Base Case </i> $n = 1$. Then it is clearly true with the subsequence $\Lambda_1 = \Lambda$.</p><p><i>Inductive Step</i> Suppose it is true for all sequences $\Lambda = (\lambda_1, \ldots, \lambda_n)$ of length $n$. Now consider a sequence $(\lambda_1, \ldots, \lambda_n, \lambda_{n+1})$.
Then, by the inductive hypothesis, there is a sequence of sequences $\Lambda_1, \ldots, \Lambda_m$ such that <br /></p><ol style="text-align: left;"><li>$\Lambda \frown (\lambda_{n+1}) = \Lambda_1 \frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})$</li><li>$\mathrm{av}(\Lambda_1) \geq \ldots \geq \mathrm{av} (\Lambda_m)$</li><li>each $\Lambda_i$ does not exceed its average.<br /></li></ol><p>Now, first, suppose $\mathrm{av}(\Lambda_m) \geq \lambda_{n+1}$. Then let $\Lambda_{m+1} = (\lambda_{n+1})$ and we're done.<br /><br />So, second, suppose $\mathrm{av}(\Lambda_m) < \lambda_{n+1}$. Then we find the greatest $k$ such that$$\mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1}))$$Then we let $\Lambda^*_{k+1} = \Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})$. Then we can show that</p><ol style="text-align: left;"><li>$(\lambda_1, \ldots, \lambda_n, \lambda_{n+1}) = \Lambda_1 \frown \Lambda_2 \frown \ldots \frown \Lambda_k \frown \Lambda^*_{k+1}$.</li><li>Each of $\Lambda_1, \ldots, \Lambda_k, \Lambda^*_{k+1}$ does not exceed its average.</li><li>$\mathrm{av}(\Lambda_1) \geq \mathrm{av}(\Lambda_2) \geq \ldots \geq \mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda^*_{k+1})$.<br /></li></ol><p>(1) and (3) are obvious. So we prove (2). In particular, we show that $\Lambda^*_{k+1}$ does not exceed its average. We assume that each subsequence $\Lambda_j$ starts with $\lambda_{i_j+1}$.</p><ul style="text-align: left;"><li>Suppose $i \in \Lambda_{k+1}$.
Then, since $\Lambda_{k+1}$ does not exceed its average, $$\mathrm{av}(\Lambda_{k+1}) \geq \mathrm{av}(\Lambda_{k+1}|_i)$$But, since $k$ is the greatest number such that$$\mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1}))$$we know that$$\mathrm{av}(\Lambda_{k+2}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1}|_i)$$</li><li>Suppose $i \in \Lambda_{k+2}$. Then, since $\Lambda_{k+2}$ does not exceed its average, $$\mathrm{av}(\Lambda_{k+2}) \geq \mathrm{av}(\Lambda_{k+2}|_i)$$But, since $k$ is the greatest number such that$$\mathrm{av}(\Lambda_k) \geq \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1}))$$we know that$$\mathrm{av}(\Lambda_{k+3}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+2})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+2}|_i)$$But also, from above,$$ \mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1})$$So$$\mathrm{av}(\Lambda_{k+1}\frown \ldots \frown \Lambda_m \frown (\lambda_{n+1})) > \mathrm{av}(\Lambda_{k+1} \frown \Lambda_{k+2}|_i)$$</li><li>And so on.</li></ul><p>This completes the proof. $\Box$ <br /></p><p><br /></p><p><br /></p>Richard Pettigrewhttp://www.blogger.com/profile/07828399117450825734noreply@blogger.com1tag:blogger.com,1999:blog-4987609114415205593.post-530923806820880722021-01-01T16:12:00.002+00:002021-01-04T07:33:35.794+00:00How permissive is rationality? Horowitz's value question for moderate permissivism<p>Rationality is good; irrationality is bad.
Most epistemologists would agree with this rather unnuanced take, regardless of their view of what exactly constitutes rationality and its complement. Granted this, a good test of a thesis in epistemology is whether it can explain why these two claims are true. Can it answer <i>the value question</i>: Why is rationality valuable and irrationality not? And indeed <a href="https://philpapers.org/rec/HORIR" target="_blank">Sophie Horowitz gives an extremely illuminating appraisal</a> of different degrees of epistemic permissivism and impermissivism by asking of each what answer it might give. Her conclusion is that the extreme permissivist -- played in her paper by the extreme subjective Bayesian, who thinks that satisfying Probabilism and being certain of your evidence is necessary and sufficient for rationality -- can give a satisfying answer to this question, or, at least, an answer that is satisfying from their own point of view. And the extreme impermissivist -- played here by the objective Bayesian, who thinks that rationality requires something like the maximum entropy distribution relative to your evidence -- can do so too. But, Horowitz argues, the moderate permissivist -- played by the moderate Bayesian, who thinks rationality imposes requirements more stringent than merely Probabilism, but who does not think they're stringent enough to pick out a unique credence function -- cannot. In this post, I'd like to raise some problems for Horowitz's assessment, and try to offer my own answer to the value question on behalf of the moderate Bayesian. 
(Full disclosure: If I'm honest, I think I lean towards extreme permissivism, but I'd like to show that moderate permissivism can defend itself against Horowitz's objection.)</p><p>Let's begin with the accounts that Horowitz gives on behalf of the extreme permissivist and the impermissivist.</p><p>The extreme permissivist -- the extreme subjective Bayesian, recall -- can say that only by being rational can you have a credence function that is <i>immodest</i> -- where a credence function is immodest if it uniquely maximizes expected epistemic utility from its own point of view. This is because Horowitz, like others in the epistemic utility theory literature, assumes that epistemic utility is measured by <i>strictly proper measures</i>, so that every probabilistic credence function expects itself to be better than any alternative credence function. From this, we can conclude that, on the extreme permissivist view, rationality is sufficient for immodesty. It's trickier to show that it is also necessary, since it isn't clear what we mean by the expected epistemic utility of a credence function from the point of view of a non-probabilistic credence function -- the usual definitions of expectation make sense only for probabilistic credence functions. Fortunately, however, we don't have to clarify this much. We need only say that, at the very least, if one credence function is epistemically better than another at all possible worlds -- that is, in decision theory parlance, the first dominates the second -- then any credence function, probabilistic or not, will expect the first to be better than the second.
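To make strict propriety itself concrete, here is a minimal numerical sketch using the Brier score on a single proposition, where $\mathfrak{s}(1, x) = -(1-x)^2$ and $\mathfrak{s}(0, x) = -x^2$ (the choice of the Brier score is mine for illustration; the argument only assumes some strictly proper measure). A grid search confirms that the expected score is uniquely maximized at the probability itself:

```python
def expected_brier(p, x):
    # p * s(1, x) + (1 - p) * s(0, x) for the Brier score: the expected
    # epistemic utility of credence x in X when the probability of X is p
    return p * -((1 - x) ** 2) + (1 - p) * -(x ** 2)

p = 0.3
grid = [i / 1000 for i in range(1001)]
best = max(grid, key=lambda x: expected_brier(p, x))
print(best)  # 0.3 -- the expectation peaks exactly at x = p, so p is immodest
```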
We then combine this with the result that, if epistemic utility is measured by a strictly proper measure, then, for each non-probabilistic credence function, there is a probabilistic credence function that dominates it, while for each probabilistic credence function, there is no such dominator (this result traces back to <a href="https://philpapers.org/rec/SAVEOP" target="_blank">Savage's 1971 paper</a>; <a href="https://philpapers.org/rec/PREPCA" target="_blank">Predd, et al.</a> give the proof in detail when the measure is additive; <a href="https://philpapers.org/rec/PETAEW" target="_blank">I then generalised it</a> to remove the additivity assumption). This then shows that being rational is necessary for being immodest. So, according to Horowitz's answer on behalf of the extreme permissivist, being rational is good and being irrational is bad because being rational is necessary and sufficient for being immodest; and it's good to be immodest and bad to be modest. </p><p>On the other hand, the impermissivist can say that, by being rational, you are maximizing expected accuracy from the point of view of the one true rational credence function. That's their answer to the value question, according to Horowitz.<br /></p><p>We'll return to the question of whether these answers are satisfying below. But first I want to turn to Horowitz's claim that the moderate Bayesian cannot give a satisfactory answer. I'll argue that, <i>if</i> the two answers just given on behalf of the extreme permissivist and extreme impermissivist are satisfactory, <i>then</i> there is a satisfactory answer that the moderate permissivist can give. Then I'll argue that, in fact, these answers aren't very satisfying. And I'll finish by sketching my preferred answer on behalf of the moderate permissivist.
This is inspired by William James' account of epistemic risks in <i>The Will to Believe</i>, which leads me to discuss <a href="https://philpapers.org/rec/HOREVA" target="_blank">another Horowitz paper</a>. <br /></p><p>Horowitz's strategy is to show that the moderate permissivist cannot find a good epistemic feature of credence functions that belongs to all that they count as rational, but does not belong to any they count as irrational. The extreme permissivist can point to immodesty; the extreme impermissivist can point to maximising expected epistemic utility from the point of view of the sole rational credence function. But, for the moderate, there's nothing. Or so Horowitz argues.<br /></p><p>For instance, Horowitz initially considers the suggestion that rational credence functions guarantee you a minimum amount of epistemic utility. As she notes, the problem with this is that either it leads to impermissivism, or it fails to include all and only the credence functions the moderate considers rational. Let's focus on the case in which we have opinions about a proposition and its negation -- the point generalizes to richer sets of propositions. We'll represent the credence functions as pairs $(c(X), c(\overline{X}))$. And let's measure epistemic utility using the Brier score. So, when $X$ is true, the epistemic utility of $(x, y)$ is $-(1-x)^2 - y^2$, and when $X$ is false, it is $-x^2 - (1-y)^2$. Then, for $r > -0.5$, there is no credence function that guarantees you epistemic utility of at least $r$ -- if you have at least that much epistemic utility at one world, you have less at a different world. For $r = -0.5$, there is exactly one credence function that guarantees you epistemic utility of at least $r$ -- it is the uniform credence function $(0.5, 0.5)$. And for $r < -0.5$, there are both probabilistic and non-probabilistic credence functions that guarantee you at least epistemic utility $r$.
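These worst-case calculations are easy to check directly. A small sketch with the two-term Brier score just defined (note that, with both terms included, the uniform credence function's guarantee comes out at $-0.5$):

```python
def brier(c, x_true):
    # two-term Brier epistemic utility of c = (c(X), c(not-X)) at a world
    x, y = c
    return -((1 - x) ** 2 + y ** 2) if x_true else -(x ** 2 + (1 - y) ** 2)

def guarantee(c):
    # the epistemic utility c is guaranteed: its worst case over the two worlds
    return min(brier(c, True), brier(c, False))

print(guarantee((0.5, 0.5)))              # -0.5: the best achievable guarantee
print(guarantee((0.6, 0.4)))              # -0.72: a probabilistic function below it
print(round(guarantee((0.55, 0.55)), 6))  # -0.505: non-probabilistic, yet above -0.72
```

The third line illustrates the problem for the moderate: once the threshold drops below $-0.5$, incoherent credence functions clear it too.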
So, Horowitz concludes, a certain level of guaranteed epistemic utility can't be what separates the rational from the irrational for the moderate permissivist, since for any level, either no credence function guarantees it, exactly one does, or there are both credence functions the moderate considers rational and credence functions they consider irrational that guarantee it.<br /></p><p>She identifies a similar problem if we think not about guaranteed accuracy but about expected accuracy. Suppose, as the moderate permissivist urges, that some but not all probability functions are rationally permissible. Then for many rational credence functions, there will be irrational ones that they expect to be better than they expect some rational credence functions to be. Horowitz gives the example of a case in which the rational credence in $X$ is between 0.6 and 0.8 inclusive. Then someone with credence 0.8 will expect the irrational credence 0.81 to be better than it expects the rational credence 0.7 to be -- at least according to many strictly proper measures of epistemic utility. So, Horowitz concludes, whatever separates the rational from the irrational, it cannot be considerations of expected epistemic utility.</p><p>I'd like to argue that, in fact, Horowitz should be happy with appeals to guaranteed or expected epistemic utility. Let's take guaranteed utility first. All that the moderate permissivist needs to say to answer the value question is that there are two valuable things that you obtain by being rational: immodesty <i>and</i> a guaranteed level of epistemic utility. Immodesty rules out all non-probabilistic credence functions, while the guaranteed level of epistemic utility narrows further -- how narrow depends on how much epistemic utility you wish to guarantee. So, for instance, suppose we say that the rational credence functions are exactly those $(x, 1-x)$ with $0.4 \leq x \leq 0.6$. Then each is immodest.
And each has a guaranteed epistemic utility of at least $-(1-0.4)^2 - 0.6^2 = -0.72$. If Horowitz is satisfied with the immodesty answer to the value question when the extreme permissivist gives it, I think she should also be satisfied with it when the moderate permissivist combines it with a requirement not to risk certain low epistemic utilities (in this case, utilities below $-0.72$). And this combination of principles rules in all of the credence functions that the moderate counts as rational and rules out all they count as irrational.<br /></p><p>Next, let's think about expected epistemic utility. Suppose that the set of credence functions that the moderate permissivist counts as rational is a closed convex set. For instance, perhaps the set of rational credence functions is $$R = \{c : \{X, \overline{X}\} \rightarrow [0, 1] : 0.6 \leq c(X) \leq 0.8\ \&\ c(\overline{X}) = 1- c(X)\}$$ Then we can prove the following: if a credence function $c$ is not in $R$, then there is $c^*$ in $R$ such that each $p$ in $R$ expects $c^*$ to be better than it expects $c$ to be (for the proof strategy, see Section 3.2 <a href="https://philpapers.org/rec/PETAEW" target="_blank">here</a>, but replace the possible chance functions with the rational credence functions). Thus, just as the extreme impermissivist answers the value question by saying that, if you're irrational, there's a credence function <i>the unique</i> <i>rational credence function</i> prefers to yours, while if you're rational, there isn't, the moderate permissivist can say that, if you're irrational, there is a credence function that <i>all the rational credence functions </i>prefer to yours, while if you're rational, there isn't. </p><p>Of course, you might think that it is still a problem for moderate permissivists that there are rational credence functions that expect some irrational credence functions to be better than some alternative rational ones. But I don't think Horowitz will have this worry.
After all, the same problem affects extreme permissivism, and she doesn't take issue with this -- at least, not in the paper we're considering. For any two probabilistic credence functions $p_1$ and $p_2$, there will be some non-probabilistic credence function $p'_1$ that $p_1$ will expect to be better than it expects $p_2$ to be -- $p'_1$ is just a very slight perturbation of $p_1$ that makes it incoherent; a perturbation small enough to ensure it lies closer to $p_1$ than $p_2$ does.<br /></p><p>A different worry about the account of the value of rationality that I have just offered on behalf of the moderate permissivist is that it seems to do no more than push the problem back a step. It says that all irrational credence functions have a flaw that all rational credence functions lack. The flaw is this: there is an alternative preferred by all rational credence functions. But to assume that this is indeed a flaw seems to presuppose that we should care how rational credence functions evaluate themselves and other credence functions. But isn't the reason for caring what they say exactly what we have been asking for? Isn't the person who posed the value question in the first place simply going to respond: OK, but what's so great about all the rational credence functions expecting something else to be better, when the question on the table is exactly why rational credence functions are so good?</p><p>This is a powerful objection, but note that it applies equally well to Horowitz's response to the value question on behalf of the impermissivist. There, she claims that what is good about being rational is that you thereby maximise expected accuracy from the point of view of the unique rational credence function. 
But without an account of what's so good about being rational, I think we equally lack an account of what's so good about maximizing expected accuracy from the point of view of the rational credence functions.</p><p>So, in the end, I think Horowitz's answer to the value question on behalf of the impermissivist and my proposed expected epistemic utility answer on behalf of the moderate permissivist are ultimately unsatisfying.</p><p>What's more, Horowitz's answer on behalf of the extreme permissivist is also a little unsatisfying. The answer turns on the claim that immodesty is a virtue, together with the fact that precisely those credence functions identified as rational by subjective Bayesianism have that virtue. But is it a virtue? Just as arrogance in a person might seem excusable if they genuinely are very competent, but not if they are incompetent, so immodesty in a credence function only seems virtuous if the credence function itself is good. If the credence function is bad, then evaluating itself as uniquely the best seems just another vice to add to its collection. </p><p>So I think Horowitz's answer to the value question on behalf of the extreme permissivist is a little unsatisfactory. But it lies very close to an answer I find compelling. That answer appeals not to immodesty, but to non-dominance. Having a credence function that is dominated is bad. It leaves free epistemic utility on the table in just the same way that a dominated action in practical decision theory leaves free pragmatic utility on the table. For the extreme permissivist, what is valuable about rationality is that it ensures that you don't suffer from this flaw. </p><p>One noteworthy feature of this answer is the conception of rationality to which it appeals. On this conception, the value of rationality does not derive fundamentally from the possession of a positive feature, but from the lack of a negative feature. Ultimately, the primary notion here is irrationality. 
A credence function is irrational if it exhibits certain flaws, which are spelled out in terms of its success in the pursuit of epistemic utility. You are rational if you are free of these flaws. Thus, for the extreme permissivist, there is just one such flaw -- being dominated. So the rational credences are simply those that lack that flaw -- and the maths tells us that those are precisely the probabilistic credence functions.</p><p>We can retain this conception of rationality, motivate moderate permissivism, and answer the value question for it. In fact, there are at least two ways to do this. We have met something very close to one of these ways when we tried to rehabilitate the moderate permissivist's appeal to guaranteed epistemic utility above. There, we said that what makes rationality good is that it ensures that you are immodest and also ensures a certain guaranteed level of accuracy. But, a few paragraphs back, we argued that immodesty is no virtue. So that answer can't be quite right. But we can replace the appeal to immodesty with an appeal to non-dominance, and then the answer will be more satisfying. Thus, the moderate permissivist who says that the rational credence functions are exactly those $(x, 1-x)$ with $0.4 \leq x \leq 0.6$ can say that being rational is valuable for the following reasons: (i) if you're rational, you aren't dominated; (ii) if you're rational, you are guaranteed to have epistemic utility of at least $-0.72$; and (iii) only if you are rational will (i) and (ii) both hold. This answers the value question by appealing to how well credence functions promote epistemic utility, and it separates out the rational from the irrational precisely. <br /></p><p>To explain the second way we might do this, we invoke William James. Famously, in <i>The Will to Believe</i>, James said that we have two goals when we believe: to believe truth, and to avoid error. But these pull in different directions.
If we pursue the first by believing something, we open ourselves up to the possibility of error. If we pursue the second by suspending judgment on something, we foreclose the possibility of believing the truth about it. Thus, to govern our epistemic life, we must balance these two goals. James held that how we do this is a subjective matter of personal judgment, and a number of different ways of weighing them are permissible. <a href="https://philpapers.org/rec/KELECB" target="_blank">Thomas Kelly has argued</a> that this can motivate permissivism in the case of full beliefs. Suppose the epistemic utility you assign to getting things right -- that is, believing truths and disbelieving falsehoods -- is $R > 0$. And suppose you assign epistemic utility $-W < 0$ to getting things wrong -- that is, disbelieving truths and believing falsehoods. And suppose you assign $0$ to suspending judgment. And suppose $W > R$. Then, as <a href="https://philpapers.org/rec/EASDTO" target="_blank">Kenny Easwaran</a> and <a href="https://philpapers.org/rec/DORLME-2" target="_blank">Kevin Dorst</a> have independently pointed out, if $r$ is the evidential probability of $X$, believing $X$ maximises expected epistemic utility from its point of view iff $\frac{W}{R + W} \leq r$, while suspending on $X$ maximises expected epistemic utility iff $\frac{R}{W+R} \leq r \leq \frac{W}{R+W}$. If William James is right, different values for $R$ and $W$ are permissible. The more you value believing truths, the greater will be $R$. The more you value avoiding falsehoods, the greater will be $W$ (and the lower will be $-W$). 
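The Easwaran–Dorst thresholds are easy to see in a toy calculation. A sketch (the payoff values $R = 1$ and $W = 3$ are made up for illustration; they give thresholds $R/(R+W) = 0.25$ and $W/(R+W) = 0.75$):

```python
def best_attitude(r, R, W):
    # expected epistemic utilities of the three attitudes to X, when the
    # evidential probability of X is r: right pays R, wrong pays -W,
    # suspending pays 0 (assumes W > R > 0)
    believe = r * R - (1 - r) * W
    disbelieve = (1 - r) * R - r * W
    suspend = 0.0
    return max([("believe", believe), ("disbelieve", disbelieve),
                ("suspend", suspend)], key=lambda t: t[1])[0]

R, W = 1.0, 3.0
print(best_attitude(0.9, R, W))  # believe:    0.9 > W/(R+W) = 0.75
print(best_attitude(0.5, R, W))  # suspend:    0.25 <= 0.5 <= 0.75
print(best_attitude(0.1, R, W))  # disbelieve: 0.1 < R/(R+W) = 0.25
```

Raising $W$ relative to $R$ widens the suspension zone, which is the formal shadow of valuing error-avoidance over truth-seeking.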
Thus, there will be a possible evidential probability $r$ for $X$, as well as permissible values $R$, $R'$ for getting things right and permissible values $W$, $W'$ for getting things wrong such that $$\frac{W}{R+W} < r < \frac{W'}{R'+W'}$$So, for someone with epistemic utilities characterised by $R$, $W$, it is rational to suspend judgment on $X$, while for someone with $W'$, $R'$, it is rational to believe $X$. Hence, permissivism about full beliefs.</p><p>As <a href="https://philpapers.org/rec/HOREVA" target="_blank">Horowitz points out</a>, however, the same trick won't work for credences. After all, as we've seen, all legitimate measures of epistemic utility for credences are strictly proper measures. And thus, if $r$ is the evidential probability of $X$, then credence $r$ in $X$ uniquely maximises expected epistemic utility relative to any one of those measures. So, a Jamesian permissivism about measures of epistemic value gives permissivism about doxastic states in the case of full belief, but not in the case of credence.</p><p>Nonetheless, I think we can derive permissivism about credences from James' insight. The key is to encode our attitudes towards James' two great goals for belief not in our epistemic utilities but in the rule we adopt when we use those epistemic utilities to pick our credences. Here's one suggestion, which I pursued at greater length in <a href="https://philpapers.org/rec/PETJEF-2" target="_blank">this paper</a> a few years ago, and that I generalised in <a href="https://m-phi.blogspot.com/2020/07/taking-risks-and-picking-priors.html" target="_blank">some blog posts</a> over the summer -- I won't actually present the generalization here, since it's not required to make the basic point. James recognised that, by giving yourself the opportunity to be right about something, you thereby run the risk of being wrong. 
In the credal case, by giving yourself the opportunity to be very accurate about something, you thereby run the risk of being very inaccurate. In the full belief case, to avoid that risk completely, you must never commit on anything. It was precisely this terror of being wrong that he lamented in Clifford. By ensuring he could never be wrong, there were true beliefs to which Clifford closed himself off. James believed that the extent to which you are prepared to take these epistemic risks is a passional matter -- that is, a matter of subjective preference. We might formalize it using a decision rule called <i>the Hurwicz criterion</i>. This rule was developed by Leonid Hurwicz for situations in which no probabilities are available to guide our decisions, so it is ideally suited for the situation in which we must pick our prior credences. </p><p>Maximin is the rule that says you should pay attention only to the worst-case scenario and choose a credence function that does best there -- you should maximise your minimum possible utility. Maximax is the rule that says you should pay attention only to the best-case scenario and choose a credence function that does best there -- you should maximise your maximum possible utility. The former is maximally risk averse, the latter maximally risk seeking. As I showed <a href="https://philpapers.org/rec/PETARA-4" target="_blank">here</a>, if you measure epistemic utility in a standard way, maximin demands that you adopt the uniform credence function -- its worst case is best. And almost however you measure epistemic utility, maximax demands that you pick a possible world and assign maximal credence to all propositions that are true there and minimal credence to all propositions that are false there -- its best case, which obviously occurs at the world you picked, is best, because it is perfect there. </p><p>The Hurwicz criterion is a continuum of decision rules with maximin at one end and maximax at the other.
You pick a weighting $0 \leq \lambda \leq 1$ that measures how risk-seeking you are and you define the <i>Hurwicz score</i> of an option $a$, with utility $a(w)$ at world $w$, to be$$H^\lambda(a) = \lambda \max \{a(w) : w \in W\} + (1-\lambda) \min \{a(w) : w \in W\}$$And you pick an option with the highest Hurwicz score.</p><p>Let's see how this works out in the simplest case, namely, that in which you have credences only in $X$ and $\overline{X}$. As before, we write credence functions defined on these two propositions as $(c(X), c(\overline{X}))$. Then, if $\lambda \leq \frac{1}{2}$ --- that is, if you give at least as much weight to the worst case as to the best case --- then the uniform distribution $(\frac{1}{2}, \frac{1}{2})$ maximises the Hurwicz score relative to any strictly proper measure. And if $\lambda > \frac{1}{2}$ --- that is, if you are risk seeking and give more weight to the best case than the worst --- then $(\lambda, 1 - \lambda)$ and $(1-\lambda, \lambda)$ both maximise the Hurwicz score.</p><p>Now, if any $0 \leq \lambda \leq 1$ is permissible, then so is any credence function $(x, 1-x)$, and we get extreme permissivism. But I think we're inclined to say that there are extreme attitudes to risk that are not rationally permissible, just as there are preferences relating the scratching of one's finger and the destruction of the world that are not rationally permissible. I think we're inclined to think there is some range from $a$ to $b$ with $0 \leq a < b \leq 1$ such that the only rational attitudes to risk are precisely those encoded by the Hurwicz weights that lie between $a$ and $b$. If that's the case, we obtain moderate permissivism.</p><p>To be a bit more precise, this gives us both moderate interpersonal and intrapersonal permissivism. It gives us moderate interpersonal permissivism if $\frac{1}{2} < b < 1$ -- that is, if we are permitted to give more than half our weight to the best case epistemic utility.
For then, since $a < b$, there is $b'$ such that $\max(a, \frac{1}{2}) < b' < b$, and then $(b, 1-b)$ and $(b', 1-b')$ are both rationally permissible. And there is $b''$ with $b < b'' < 1$; for any such $b''$, $(b'', 1-b'')$ is not rationally permissible. It also gives us moderate intrapersonal permissivism under the same condition. For if $\frac{1}{2} < b$ and $b$ is your Hurwicz weight, then $(b, 1-b)$ and $(1-b, b)$ are different credence functions, but both are rationally permissible for you.<br /></p><p>How does this motivation for moderate permissivism fare with respect to the value question? I think it fares as well as the non-dominance-based answer I sketched above for the extreme permissivist. There, I appealed to a single flaw that a credence function might have: it might be dominated by another. Here, I introduced another flaw. It might be rationalised only by Jamesian attitudes to epistemic risk that are too extreme or otherwise beyond the pale. Like being dominated, this is a flaw that relates to the pursuit of epistemic utility. If you exhibit it, you are irrational. And to be rational is to be free of such flaws. The moderate permissivist can thereby answer the value question that Horowitz poses.<br /></p>Richard Pettigrew

Deferring to rationality -- does it preclude permissivism? (15 December 2020)

<p>Permissivism about epistemic rationality is the view that there are bodies of evidence in response to which rationality permits a number of different doxastic attitudes. I'll be thinking here about the case of credences.
Credal permissivism says: there are bodies of evidence in response to which rationality permits a number of different credence functions.</p><p>Over the past year, I've watched friends on social media adopt remarkably different credence functions based on the same information about aspects of the COVID-19 pandemic, the outcome of the US election, and the withdrawal of the UK from the European Union. And while I watch them scream at each other, cajole each other, and sometimes simply ignore each other, I can't shake the feeling that they are all taking rational stances. While they disagree dramatically, and while some will end up closer to the truth than others when it is finally revealed, it seems to me that all are responding rationally to their shared evidence, their opponents' protestations to the contrary notwithstanding. So permissivism is a very timely epistemic puzzle for 2020. What's more, <a href="https://cambridgereview.cargo.site/Dr-Rachel-Fraser" target="_blank">this wonderful piece</a> by Rachel Fraser made me see how my own William James-inspired approach to epistemology connects with a central motivation for believing in conspiracy theories, another major theme of this unloveable year. <br /></p><p>One type of argument against credal permissivism turns on the claim that rationality is worthy of deference. The argument begins with a precise version of this claim, stated as a norm that governs credences. It proceeds by showing that, if epistemic rationality is permissive, then it is sometimes impossible to meet the demands of this norm. Taking this to be a reductio, the argument concludes that rationality cannot be permissive. I know of two versions of the argument, one due to <a href="https://www.pdcnet.org/jphil/content/jphil_2016_0113_0008_0365_0395" target="_blank">Daniel Greco and Brian Hedden</a>, and one due to <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/phpr.12225" target="_blank">Ben Levinstein</a>.
I'll mainly consider Levinstein's, since it fixes some problems with Greco and Hedden's. I'll consider <a href="https://academic.oup.com/mind/article-abstract/128/511/907/5133308" target="_blank">David Thorstad's response</a> to Greco and Hedden's argument, which would also work against Levinstein's argument were it to work at all. But I'll conclude that, while it provides a crucial insight, it doesn't quite work, and I'll offer my own alternative response.<br /></p><p>Roughly speaking, you defer to someone on an issue if, upon learning their attitude to that issue, you adopt it as your own. So, for instance, if you ask me what I'd like to eat for dinner tonight, and I say that I defer to you on that issue, I'm saying that I will want to eat whatever I learn you would like to eat. That's a case of deferring to someone else's preferences---it's a case where we defer conatively to them. Here, we are interested in cases in which we defer to someone else's beliefs---that is, where we defer doxastically to them. Thus, I defer doxastically to my radiographer on the issue of whether I've got a broken finger if I commit to adopting whatever credence they announce in that diagnosis. By analogy, we sometimes say that we defer doxastically to a feature of the world if we commit to setting our credence in some way that is determined by that feature of the world. Thus, I might defer doxastically to a particular computer simulation model of sea level change on the issue of sea level rise by 2030 if I commit to setting my credence in a rise of 10cm to whatever probability that model reports when I run it repeatedly while perturbing its parameters and initial conditions slightly around my best estimate of their true values.<br /><br />In philosophy, there are a handful of well-known theses that turn on the claim that we are required to defer doxastically to this individual or that feature of the world---and we're required to do it on all matters. 
For instance, van Fraassen's Reflection Principle says that you should defer doxastically to your future self on all matters. That is, for any proposition $X$, conditional on your future self having credence $r$ in $X$, you should have credence $r$ in $X$. In symbols:$$c(X\, |\, \text{my credence in $X$ at future time $t$ is $r$}) = r$$And the Principal Principle says that you should defer to the objective chances on all doxastic matters by setting your credences to match the probabilities that they report. That is, for any proposition $X$, conditional on the objective chance of $X$ being $r$, you should have credence $r$ in $X$. In symbols:$$c(X\, |\, \text{the objective chance of $X$ now is $r$}) = r$$Notice that, in both cases, there is a single expert value to which you defer on the matter in question. At time $t$, you have exactly one credence in $X$, and the Reflection Principle says that, upon learning that single value, you should set your credence in $X$ to it. And there is exactly one objective chance of $X$ now, and the Principal Principle says that, upon learning it, you should set your credence in $X$ equal to it. You might be uncertain about what that single value is, but it is fixed and unique. So this account of deference does not cover cases in which there is more than one expert. For instance, it doesn't obviously apply if I defer not to a specific climate model, but to a group of them. In those cases, there is usually no fixed, unique value that is the credence they all assign to a proposition. So principles of the same form as the Reflection or Principal Principle do not say what to do if you learn one of those values, or some of them, or all of them. This problem lies at the heart of the deference argument against permissivism. Those who make the argument think that deference to groups should work in one way; those who defend permissivism against it think it should work in some different way. 
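To see the single-expert structure of these principles concretely, here is a toy sketch (the two chance hypotheses and the half-half prior over them are my own illustrative choices, not anything in the principles themselves). We build a joint credence so that the Principal Principle holds by construction, and check that conditioning on a chance hypothesis recovers its chance:

```python
from fractions import Fraction as F

# Two hypotheses about the objective chance of X, with a half-half prior.
prior = {F(3, 10): F(1, 2), F(7, 10): F(1, 2)}

# Joint credence over (chance hypothesis, truth value of X), built so that
# the Principal Principle holds: c(X | the chance of X is r) = r.
joint = {}
for r, p in prior.items():
    joint[(r, True)] = p * r
    joint[(r, False)] = p * (1 - r)

def credence_in_X_given_chance(r):
    # Condition the joint credence on the chance hypothesis r.
    return joint[(r, True)] / (joint[(r, True)] + joint[(r, False)])

for r in prior:
    assert credence_in_X_given_chance(r) == r  # deference to the single expert value

# Unconditionally, your credence in X is your expectation of the chance.
print(sum(joint[(r, True)] for r in prior))  # 1/2
```

When there are several 'experts' assigning different values, as with a group of climate models, no single construction like this is available, and that is exactly the gap the deference argument tries to exploit.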
<br /><br />As I mentioned above, the deference argument begins with a specific, precise norm that is said to govern the deference we should show to rationality. The argument continues by claiming that, if rationality is permissive, then it is not possible to satisfy this norm. Here is the norm as Levinstein states it, where $c \in R_E$ means that $c$ is in the set $R_E$ of rational responses to evidence $E$: </p><p><b>Deference to Rationality (DtR)</b> Suppose:</p><ol style="text-align: left;"><li>$c$ is your credence function;</li><li>$E$ is your total evidence;</li><li>$c(c \in R_E) = 0$;</li><li>$c'$ is a probabilistic credence function;</li><li>$c(c' \in R_E) > 0$;</li></ol><p>then rationality requires$$c(-|c' \in R_E) = c'(-|c' \in R_E)$$That is, if you are certain that your credence function is not a rational response to your total evidence, then, conditional on some alternative probabilistic credence function being a rational response to that evidence, you should set your credences in line with that alternative once you've brought it up to speed with your new evidence that it is a rational response to your original total evidence.<br /><br />Notice, first, that Levinstein's principle is quite weak. It does not say of just anyone that they should defer to rationality. It says only that, if you are in the dire situation of being certain that you are yourself irrational, then you should defer to rationality. If you are sure you're irrational, then your conditional credences should be such that, were you to learn of a credence function that it's a rational response to your evidence, you should fall in line with the credences that it assigns conditional on that same assumption that it is rational. Restricting its scope in this way makes it more palatable to permissivists who will typically not think that someone who is already pretty sure that they are rational must switch credences when they learn that there are alternative rational responses out there.
<br /><br />Notice also that you need only show such deference to rational credence functions that satisfy the probability axioms. This restriction is essential, for otherwise (DtR) will force you to violate the probability axioms yourself. After all, if $c(-)$ is probabilistic, then so is $c(-|X)$ for any $X$ with $c(X) > 0$. Thus, if $c'(-|c' \in R_E)$ is not probabilistic, and $c$ defers to $c'$ in the way Levinstein describes, then $c(-|c' \in R_E)$ is not probabilistic, and thus neither is $c$.<br /><br />Now, suppose:</p><ul style="text-align: left;"><li>$c$ is your credence function;</li><li>$E$ is your total evidence;</li><li>$c'$ and $c''$ are probabilistic credence functions with$$c'(-|c' \in R_E\ \&\ c'' \in R_E) \neq c''(-|c' \in R_E\ \&\ c'' \in R_E)$$That is, $c'$ and $c''$ are distinct and remain distinct even once they become aware that both are rational responses to $E$;</li><li>$c(c' \in R_E\ \&\ c'' \in R_E) > 0$. That is, you give some credence to both of them being rational responses to $E$;</li><li>$c(c \in R_E) = 0$. That is, you are certain that your own credence function is not a rational response to $E$.</li></ul><p>Then, by (DtR),</p><ul style="text-align: left;"><li>$c(-|c' \in R_E) = c'(-|c' \in R_E)$</li><li>$c(-|c'' \in R_E) = c''(-|c'' \in R_E)$ </li></ul><p>Thus, conditioning both sides of the first identity on $c'' \in R_E$ and both sides of the second identity on $c' \in R_E$, we obtain</p><ul style="text-align: left;"><li>$c(-|c' \in R_E\ \&\ c'' \in R_E) = c'(-|c' \in R_E\ \&\ c'' \in R_E)$ </li><li>$c(-|c'' \in R_E\ \&\ c' \in R_E) = c''(-|c' \in R_E\ \&\ c'' \in R_E)$</li></ul><p>But, by assumption, $c'(-| c' \in R_E\ \&\ c'' \in R_E) \neq c''(-|c' \in R_E\ \&\ c'' \in R_E)$. 
So (DtR) cannot be satisfied.<br /><br />One thing to note about this argument: if it works, it establishes not only that there can be no two different rational responses to the same evidence, but that it is irrational to be anything less than certain of this. After all, what is required to derive the contradiction from DtR is not that there are two probabilistic credence functions $c'$ and $c''$ such that $c'(-|c' \in R_E\ \&\ c'' \in R_E) \neq c''(-|c' \in R_E\ \&\ c'' \in R_E)$<i> that are both rational responses to $E$</i>. Rather, what is required is only that there are two probabilistic credence functions $c'$ and $c''$ with $c'(-|c' \in R_E\ \&\ c'' \in R_E) \neq c''(-|c' \in R_E\ \&\ c'' \in R_E)$ <i>that you think might both be rational responses to $E$</i>---that is, $c(c' \in R_E\ \&\ c'' \in R_E) > 0$. The conclusion that it is irrational to even entertain permissivism strikes me as too strong, but perhaps those who reject permissivism will be happy to accept it.<br /><br />Let's turn, then, to a more substantial worry, given compelling voice by David Thorstad: (DtR) is too strong because the deontic modality that features in it is too strong. As I hinted above, the point is that the form of the deference principles that Greco & Hedden and Levinstein use is borrowed from cases---such as the Reflection Principle and the Principal Principle---in which there is just one expert value, though it might be unknown to you. In those cases, it is appropriate to say that, upon learning the single value and nothing more, you are <i>required</i> to set your credence in line with it. But, unless we simply beg the question against permissivism and assume there is a single rational response to every body of evidence, this isn't our situation. Rather, it's more like the case where you defer to a group of experts, such as a group of climate models. 
And in this case, Thorstad says, it is inappropriate to <i>demand</i> that you set your credence in line with an expert's credence when you learn what it is. Rather, it is at most appropriate to <i>permit</i> you to do that. That is, Levinstein's principle should not say that rationality <i>requires</i> your credence function to assign the conditional credences stated in its consequent; it should say instead that rationality <i>allows</i> it. <br /><br />Thorstad motivates his claim by drawing an analogy with a moral case that he describes. Suppose you see two people drowning. They're called John and James, and you know that you will be able to save at most one. So the actions available to you are: save John, save James, save neither. And the moral actions are: save John, save James. But now consider a deference principle governing this situation that is analogous to (DtR): it demands that, upon learning that it is moral to save James, you must do that; and upon learning that it is moral to save John, you must do that. From this, we can derive a contradiction in a manner somewhat analogous to that in which we derived the contradiction from (DtR) above: if you learn both that it is moral to save John and moral to save James, you should do both; but that isn't an available action; so moral permissivism must be false. But I take it no moral theory will tolerate that in this case. So, Thorstad argues, there must be something wrong with the moral deference principle; and, by analogy, there must be something wrong with the analogous doxastic principle (DtR).<br /><br />Thorstad's diagnosis is this: the correct deference principle in the moral case should say: upon learning that it is moral to save James, you may do that; upon learning that it is moral to save John, you may do that. You thereby avoid the contradiction, and moral permissivism is safe. 
Similarly, the correct doxastic deference principle is this: upon learning that a credence function is rational, it is permissible to defer to it. In Levinstein's framework, the following is rationally permissible, not rationally mandated:$$c(-|c' \in R_E) = c'(-|c' \in R_E)$$</p><p>I think Thorstad's example is extremely illuminating, but for reasons rather different from his. Recall that a crucial feature of Levinstein's version of the deference argument against permissivism is that it applies only to people who are certain that their current credences are irrational. If we add the analogous assumption to Thorstad's case, his verdict is less compelling. Suppose, for instance, you are currently committed to saving neither John nor James from drowning; that's what you plan to do; it's the action you have formed an intention to perform. What's more, you're certain that this action is not moral. But you're uncertain whether either of the other two available actions is moral. And let's add a further twist to drive home the point. Suppose, furthermore, that you are certain that you are just about to learn, of exactly one of them, that it is permissible. And add to that the fact that, immediately after you learn, of exactly one of them, that it is moral, you must act---failing to do so will leave both John and James to drown. In this case, I think, it's quite reasonable to say that, upon learning that saving James is permissible, you are not only morally permitted to drop your intention to save neither and replace it with the intention to save James, but you are also morally required to do so; and the same goes should you learn that it is permissible to save John.
It would, I think, be impermissible to save neither, since you're certain that's immoral and you know of an alternative that is moral; and it would be impermissible to save John, since you are still uncertain about the moral status of that action, while you are certain that saving James is moral; and it would be morally required to save James, since you are certain of that action alone that it is moral. Now, Levinstein's principle might seem to hold for individuals in an analogous situation. Suppose you're certain that your current credences are irrational. And suppose you will learn of only one credence function that it is rationally permissible. At least in this situation, it might seem that it is rationally required that you adopt the credence function you learn is rationally permissible, just as you are morally required to perform the single act you learn is moral. So, is Levinstein's argument rehabilitated?<br /><br />I think not. Thorstad's example is useful, but not because the cases of rationality and morality are analogous; rather, precisely because it draws attention to the fact that they are disanalogous. After all, all moral actions are better than all immoral ones. So, if you are committed to an action you know is immoral, and you learn of another that it is moral, and you know you'll learn nothing more about morality, you must commit to perform the action you've learned is moral. Doing so is the only way you know how to improve the action you'll perform for sure. But this is not the case for rational attitudes. It is not the case that all rational attitudes are better than all irrational attitudes.
Let's see a few examples.<br /><br />Suppose my preferences over a set of acts $a_1, \ldots, a_N$ are as follows, where $N$ is some very large number:$$a_1 \prec a_2 \prec a_3 \prec \ldots \prec a_{N-3} \prec a_{N-2} \prec a_{N-1} \prec a_N \prec a_{N-2}$$This is irrational, because, if the ordering is irreflexive, then it is not transitive: $a_{N-2} \prec a_{N-1} \prec a_N \prec a_{N-2}$, but $a_{N-2} \not \prec a_{N-2}$. And suppose I learn that the following preferences are rational:$$a_1 \succ a_2 \succ a_3 \succ \ldots \succ a_{N-3} \succ a_{N-2} \succ a_{N-1} \succ a_N$$Then surely it is not rationally required of me to adopt these alternative preferences. (Indeed, it seems to me that rationality might even prohibit me from transitioning from the first irrational set to the second rational set, but I don't need that stronger claim.) In the end, my original preferences are irrational because of a small, localised flaw. But they nonetheless express coherent opinions about a lot of comparisons. And, concerning all of those comparisons, the alternative preferences take exactly the opposite view. Moving to the latter in order to avoid having preferences that are flawed in the way that the original set are flawed does not seem rationally required, and indeed might seem irrational. <br /><br />Something similar happens in the credal case, at least according to the accuracy-first epistemologist. Suppose I have credence $0.1$ in $X$ and $1$ in $\overline{X}$. And suppose the single legitimate measure of inaccuracy is the Brier score. I don't know this, but I do know a few things: first, I know that accuracy is the only fundamental epistemic value, and I know that a credence function's accuracy scores at different possible worlds determine its rationality at this world; furthermore, I know that my credences are accuracy dominated and therefore irrational, but I don't know what dominates them. 
Now suppose I learn that the following credences are rational: $0.95$ in $X$ and $0.05$ in $\overline{X}$. It seems that I am not required to adopt these credences (and, again, it seems that I am not even rationally permitted to do so, though again this latter claim is stronger than I need). While my old credences are irrational, they do nonetheless encode something like a point of view. And, from that point of view, the alternative credences look much much worse than staying put. While I know that mine are irrational and accuracy dominated, though I don't know what by, I also know that, from my current, slightly incoherent point of view, the rational ones look a lot less accurate than mine. And indeed they will be much less accurate than mine if $X$ turns out to be false.<br /><br />So, even in the situation in which Levinstein's principle is most compelling, namely, when you are certain you're irrational and you will learn of only one credence function that it is rational, still it doesn't hold. It is possible to be sure that your credence function is an irrational response to your evidence, sure that an alternative is a rational response, and yet not be required to adopt the alternative because learning that the alternative is rational does not teach you that it's better than your current irrational credence function for sure---it might be much worse. This is different from the moral case. So, as stated, Levinstein's principle is false.<br /><br />However, to make the deference argument work, Levinstein's principle need only hold in a single case. Levinstein describes a family of cases---those in which you're certain you're irrational---and claims that it holds in all of those. Thorstad's objection shows that it doesn't. 
Responding on Levinstein's behalf, I narrowed the family of cases to avoid Thorstad's objection---perhaps Levinstein's principle holds when you're certain you're irrational<i> and know you'll only learn of one credence function that it's rational</i>. After all, the analogous moral principle holds in those cases. But we've just seen that the doxastic version doesn't always hold there, because learning that an alternative credence function is rational does not teach you that it is better than your irrational credence function in the way that learning an act is moral teaches you that it's better than the immoral act you intend to perform. But perhaps we can narrow the range of cases yet further to find one in which the principle does hold.<br /><br />Suppose, for instance, you are certain you're irrational, you know you'll learn of just one credence function that it's rational, and moreover you know you'll learn that it is better than yours. Thus, in the accuracy-first framework, suppose you'll learn that it accuracy dominates you. Then surely Levinstein's principle holds here? And this would be sufficient for Levinstein's argument, since each non-probabilistic credence function is accuracy dominated by many different probabilistic credence functions; so we could find the distinct $c'$ and $c''$ we need for the reductio.<br /><br />Not so fast, I think. How you should respond when you learn that $c'$ is rational depends on what else you think about what determines the rationality of a credence function. Suppose, for instance, you think that a credence function is rational just in case it is not accuracy dominated, but you don't know which are the legitimate measures of accuracy. 
Perhaps you think there is only one legitimate measure of accuracy, and you know it's either the Brier score---$\mathfrak{B}(c, i) = \sum_{X \in \mathcal{F}} |w_i(X) - c(X)|^2$---or the absolute value score---$\mathfrak{A}(c, i) = \sum_{X \in \mathcal{F}} |w_i(X) - c(X)|$---but you don't know which. And suppose your credence function is $c(X) = 0.1$ and $c(\overline{X}) = 1$, as above. Now you learn that $c'(X) = 0.05$ and $c'(\overline{X}) = 0.95$ is rational and an accuracy dominator. So you learn that $c'$ is more accurate than $c$ at all worlds, and, since $c'$ is rational, there is nothing that is more accurate than $c'$ at all worlds. Then you thereby learn that the Brier score is the only legitimate measure of accuracy. After all, according to the absolute value score, $c'$ does not accuracy dominate $c$; in fact, $c$ and $c'$ have exactly the same absolute value score at both worlds. You thereby learn that the credence functions that accuracy dominate you without themselves being accuracy dominated are those for which $c(X)$ lies strictly between the solution of $(1-x)^2 + (1-x)^2 = (1-0.1)^2 + (0-1)^2$ that lies in $[0, 1]$ and the solution of $(0-x)^2 + (1-(1-x))^2 = (0-0.1)^2 + (1-1)^2$ that lies in $[0, 1]$, and $c(\overline{X}) = 1 - c(X)$. You are then permitted to pick any one of them---they are all guaranteed to be better than yours. You are not obliged to pick $c'$ itself. <br /><br />The crucial point is this: learning that $c'$ is rational teaches you something about the features of a credence function that determine whether it is rational---it teaches you that they render $c'$ rational! And that teaches you a bit about the set of rational credence functions---you learn it contains $c'$, of course, but you also learn other normative facts, such as the correct measure of inaccuracy, perhaps, or the correct decision principle to apply with the correct measure of inaccuracy to identify the rational credence functions.
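The numerical claims in this toy case can be checked directly. Here's a sketch using the Brier and absolute value scores, with $c = (0.1, 1)$ and $c' = (0.05, 0.95)$; the two interval endpoints it computes are the values of $x$ at which $(x, 1-x)$ exactly matches $c$'s Brier score at one of the two worlds:

```python
def brier(cred, world):
    # cred = (c(X), c(not-X)); world = (v(X), v(not-X)) with v in {0, 1}
    return sum((v - x) ** 2 for x, v in zip(cred, world))

def absolute(cred, world):
    return sum(abs(v - x) for x, v in zip(cred, world))

c = (0.1, 1.0)     # the incoherent credence function from the text
cp = (0.05, 0.95)  # the probabilistic alternative c'
worlds = [(1, 0), (0, 1)]  # X true; X false

# c' Brier-dominates c ...
assert all(brier(cp, w) < brier(c, w) for w in worlds)
# ... but has exactly the same absolute value score at both worlds.
assert all(abs(absolute(cp, w) - absolute(c, w)) < 1e-9 for w in worlds)

# Endpoints of the interval of probabilistic Brier-dominators (x, 1-x):
lo = 1 - (brier(c, (1, 0)) / 2) ** 0.5  # solves 2(1-x)^2 = c's Brier score, X true
hi = (brier(c, (0, 1)) / 2) ** 0.5      # solves 2x^2 = c's Brier score, X false
print(round(lo, 4), round(hi, 4))  # approximately 0.0487 and 0.0707
```

Note that $c'(X) = 0.05$ lies strictly inside this interval, as it must, since $c'$ is itself an undominated dominator.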
And learning those things may well shift your current credences, but you are not compelled to adopt $c'$.<br /><br />Indeed, you might be compelled to adopt something other than $c'$. An example: suppose that, instead of learning that $c'$ is rational and accuracy dominates $c$, you learn that $c''$ is rational and accuracy dominates $c$, where $c''$ is a probability function that Brier dominates $c$, and $c'' \neq c'$. Then, as before, you learn that the Brier score and not the absolute value score is the correct measure of inaccuracy, and thereby learn the set of credence functions that accuracy dominates yours. Perhaps rationality then requires you to fix up your credence function so that it is rational, but in a way that minimizes the amount by which you change your current credences. How to measure this? Well, perhaps you're required to pick an undominated dominator $c^*$ such that the expected inaccuracy of $c$ from the point of view of $c^*$ is minimal. That is, you pick the credence function that dominates you and isn't itself dominated <i>and which thinks most highly of your original credence function</i>. Measuring accuracy using the Brier score, this turns out to be the credence function $c'$ described above. Thus, given this reasonable account of how to respond when you learn what the rational credence functions are, upon learning that $c''$ is rational, rationality then requires you to adopt $c'$. <br /><br />In sum: For someone certain their credence function $c$ is irrational, learning only that $c'$ is rational is not enough to compel them to move to $c'$, nor indeed to change their credences at all, since they've no guarantee that doing so will improve their situation. To compel them to change their credences, you must teach them how to improve their epistemic situation. 
But when you teach them that doing a particular thing will improve their epistemic situation, that usually teaches them normative facts of which they were uncertain before---how to measure epistemic value, or the principles for choosing credences once you've fixed how to measure epistemic value---and doing that will typically teach them other ways to improve their epistemic situation besides the one you've explicitly taught them. Sometimes there will be nothing to tell between all the ways they've learned to improve their epistemic situation, and so all will be permissible, as Thorstad imagines; and sometimes there will be reason to pick just one of those ways, and so that will be mandated, even if epistemic rationality is permissive. In either case, Levinstein's argument does not go through. The deference principle on which it is based is not true.<br /></p>Richard Pettigrew

Accuracy and Explanation in a Social Setting: thoughts on Douven and Wenmackers (3 September 2020)

<p>For a PDF version of this post, see <a href="https://drive.google.com/file/d/1LDmS7qRgkkkTOSL-rOY4LEANa6GGTE2Y/view?usp=sharing" target="_blank">here</a>. <br /></p><p>In this post, I want to <a href="https://m-phi.blogspot.com/2020/08/accuracy-and-explanation-thoughts-on.html" target="_blank">continue my discussion</a> of the part of van Fraassen's argument against inference to the best explanation (IBE) that turns on its alleged clash with Bayesian Conditionalization (BC). In the previous post, I looked at <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-9213.12032" target="_blank">Igor Douven's argument</a> that there are at least some ways of valuing accuracy on which updating by IBE comes out better than BC.
I concluded that Douven's arguments don't save IBE; BC is still the only rational way to update. <br /><br />The setting for Douven's arguments was individualist epistemology. That is, he considered only the single agent collecting evidence directly from the world and updating in the light of it. But of course we often receive evidence not directly from the world, but indirectly through the opinions of others. I learn how many positive SARS-CoV-2 tests there have been in my area in the past week not by inspecting the test results myself but by listening to the local health authority. In their 2017 paper, <a href="https://doi.org/10.1093/bjps/axv025" target="_blank">'Inference to the Best Explanation versus Bayes’s Rule in a Social Setting'</a>, Douven joined with Sylvia Wenmackers to ask how IBE and BC fare in a context in which some of my evidence comes from the world and some from learning the opinions of others, where those others are also receiving some of their evidence from the world and some from others, and where one of those others from whom they're learning might be me. As in Douven's study of IBE vs BC in the individual setting, Douven and Wenmackers conclude in favour of IBE. Indeed, their conclusion in this case is considerably stronger than in the individual case:</p><blockquote>The upshot will be that if agents not only update their degrees of belief on the basis of evidence, but also take into account the degrees of belief of their epistemic neighbours, then the noted advantage of Bayesian updating [from Douven's earlier paper] evaporates and IBE does better than Bayes’s rule on every reasonable understanding of inaccuracy minimization. (536-7)</blockquote><p>As in the previous post, I want to stick up for BC.
As in the individualist setting, I think this is the update rule we should use in the social setting.<br /><br />Following van Fraassen's original discussion and the strategy pursued in Douven's solo piece, Douven and Wenmackers take the general and ill-specified question whether IBE is better than BC and make it precise by asking it in a very specific case. We imagine a group of individuals. Each has a coin. All coins have the same bias. No individual knows what this shared bias is, but they do know that it is the same bias for each coin, and they know that the options are given by the following bias hypotheses:</p><p>$B_0$: coin has 0% chance of landing heads</p><p>$B_1$: coin has 10% chance of landing heads</p><p>$\ldots$ <br /></p><p>$B_9$: coin has 90% chance of landing heads</p><p>$B_{10}$: coin has 100% chance of landing heads</p><p>Though they don't say so, I think Douven and Wenmackers assume that all individuals have the same prior over $B_0, \ldots, B_{10}$, namely, the uniform prior; and each satisfies the Principal Principle, and so their credences in everything else follow from their credences in $B_0, \ldots, B_{10}$. As we'll see, we needn't assume that they all have the uniform prior over the bias hypotheses. In any case, they assume that things proceed as follows:<br /></p><p><i>Step (i)</i> Each member tosses their coin some fixed number of times. This produces their worldly evidence for this round.</p><p><i>Step (ii)</i> Each then updates their credence function on this worldly evidence they've obtained. To do this, each member uses the same updating rule, either BC or a version of IBE. We'll specify these in more detail below. </p><p><i>Step (iii) </i>Each then learns the updated credence functions of the others in the group.
This produces their social evidence for this round.<br /><br /><i>Step (iv) </i>They then update their own credence function by taking the average of their credence function and the other credence functions in the group that lie within a certain distance of theirs. The set of credence functions that lie within a certain distance of one's own, Douven and Wenmackers call one's bounded confidence interval.</p><p>They then repeat this cycle a number of times, each time an individual begins with the credence function they reached at the end of the previous cycle.<br /><br />Douven and Wenmackers use simulation techniques to see how this group of individuals perform for different updating rules used in step (ii) and different specifications of how close a credence function must lie to yours in order to be included in the average in step (iv). Here's the class of updating rules that they consider: if $P$ is your prior and $E$ is your evidence then your updated credence function should be$$P^c_E(B_i) = \frac{P(B_i)P(E|B_i) + f_c(B_i, E)}{\sum^{10}_{k=0} \left (P(B_k)P(E|B_k) + f_c(B_k, E) \right )}$$where$$f_c(B_i, E) = \left \{ \begin{array}{ll} c & \mbox{if } P(E | B_i) > P(E | B_j) \mbox{ for all } j \neq i \\ \frac{1}{2}c & \mbox{if } P(E | B_i) = P(E|B_j) > P(E | B_k) \mbox{ for all } k \neq j, i \\ 0 & \mbox{otherwise} \end{array} \right. $$That is, for $c = 0$, this update rule is just BC, while for $c > 0$, it gives a little boost to whichever hypothesis best explains the evidence $E$, where providing the best explanation for a series of coin tosses amounts to making it most likely, and if two bias hypotheses make the evidence most likely, they split the boost between them. Douven and Wenmackers consider $c = 0, 0.1, \ldots, 0.9, 1$. For each rule, specified by $c$, they also consider different sizes of bounded confidence intervals. These are specified by the parameter $\varepsilon$. 
Your bounded confidence interval for $\varepsilon$ includes each credence function for which the average difference between the credences it assigns and the credences you assign is at most $\varepsilon$. Thus, $\varepsilon = 0$ is the most exclusive, and includes only your own credence function, while $\varepsilon = 1$ is the most inclusive, and includes all credence functions in the group. Again, Douven and Wenmackers consider $\varepsilon = 0, 0.1, \ldots, 0.9, 1$. Here are two of their main results:<br /></p><ol style="text-align: left;"><li>For each bias other than $p = 0.1$ or $0.9$, there is an explanationist rule (i.e. $c > 0$ and some specific $\varepsilon$) that gives rise to a lower average inaccuracy at the end of the process than all BC rules (i.e. $c = 0$ and any $\varepsilon$).</li><li>There is an averaging explanationist rule (i.e. $c > 0$ and $\varepsilon > 0$) such that, for each bias other than $p = 0, 0.1, 0.9, 1$, it gives rise to lower average inaccuracy than all BC rules (i.e. $c = 0$ and any $\varepsilon$).</li></ol><p>Inaccuracy is measured by the Brier score throughout. <br /><br />Now, you can ask whether these results are enough to tell so strongly in favour of IBE. But that isn't my concern here. Rather, I want to focus on a more fundamental problem: Douven and Wenmackers' argument doesn't really compare BC with IBE. They're comparing BC-for-worldly-data-plus-Averaging-for-social-data with IBE-for-worldly-data-plus-Averaging-for-social-data. So their simulation results don't really impugn BC, because the average inaccuracies that they attribute to BC don't really arise from it. They arise from using BC in step (ii), but something quite different in step (iv). Douven and Wenmackers ask the Bayesian to respond to the social evidence they receive using a non-Bayesian rule, namely, Averaging. 
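Steps (ii) and (iv) can be sketched in a few lines of Python (a toy sketch, assuming the uniform prior; the function names, and the generalization of the tie-splitting rule beyond the two-way case displayed above, are mine):

```python
from math import isclose

# Chances of heads according to the bias hypotheses B_0, ..., B_10.
CHANCES = [i / 10 for i in range(11)]

def likelihood(p, heads, tails):
    """Chance of a particular toss sequence with the given counts."""
    return p ** heads * (1 - p) ** tails

def update(prior, heads, tails, c):
    """Step (ii): Bayesian conditionalization when c == 0; for c > 0, the
    hypothesis making the evidence most likely gets a boost of c,
    split equally if several hypotheses tie."""
    liks = [likelihood(p, heads, tails) for p in CHANCES]
    best = max(liks)
    winners = [i for i, l in enumerate(liks) if isclose(l, best)]
    raw = [prior[i] * liks[i] + (c / len(winners) if i in winners else 0.0)
           for i in range(11)]
    total = sum(raw)
    return [r / total for r in raw]

def social_update(mine, others, eps):
    """Step (iv): average my credences with every credence function whose
    average pointwise distance from mine is at most eps."""
    close = [mine] + [q for q in others
                      if sum(abs(a - b) for a, b in zip(mine, q)) / 11 <= eps]
    return [sum(q[i] for q in close) / len(close) for i in range(11)]
```

With $c = 0$ this is just BC; with $c > 0$ the best-explaining hypothesis receives its boost before normalization, and $\varepsilon$ fixes how inclusive the bounded confidence interval is.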
And we can see just how far Averaging lies from BC by considering the following version of the example we have been using throughout.<br /><br />Consider the biased coin case, and suppose there are just three members of the group. And suppose they all start with the uniform prior over the bias hypotheses. At step (i), they each toss their coin twice. The first individual's coin lands $HT$, the second's $HH$, and the third's $TH$. So, at step (ii), if they all use BC (i.e. $c = 0$), they update on this worldly evidence as follows, where $P$ is the shared prior:<br />$$\begin{array}{r|ccccccccccc}<br />& B_0 & B_1& B_2& B_3& B_4& B_5& B_6& B_7& B_8& B_9& B_{10} \\<br />\hline<br />&&&&&&&&&& \\<br />P & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} & \frac{1}{11} \\<br />&&&&&&&&&& \\<br />P(-|HT) & 0 & \frac{9}{165} & \frac{16}{165}& \frac{21}{165}& \frac{24}{165} & \frac{25}{165}& \frac{24}{165}& \frac{21}{165}& \frac{16}{165}& \frac{9}{165}& 0\\<br />&&&&&&&&&& \\<br />P(-|HH) & 0 & \frac{1}{385} & \frac{4}{385}& \frac{9}{385}& \frac{16}{385}& \frac{25}{385}& \frac{36}{385}& \frac{49}{385}& \frac{64}{385}& \frac{81}{385}& \frac{100}{385}\\<br />&&&&&&&&&& \\<br />P(-|TH) & 0 & \frac{9}{165} & \frac{16}{165}& \frac{21}{165}& \frac{24}{165} & \frac{25}{165}& \frac{24}{165}& \frac{21}{165}& \frac{16}{165}& \frac{9}{165}& 0\\<br />\end{array}$$<br />Now, at step (iii), they each learn the other's distribution. And they average on that. Let's suppose I'm the first individual. Then I have two choices for my BCI. It either includes my own credence function $P(-|HT)$ and the third individual's $P(-|TH)$, which are identical, or it includes all three, $P(-|HT), P(-|HH), P(-|TH)$. Let's suppose it includes all three. 
Here is the outcome of averaging:$$\begin{array}{r|ccccccccccc}<br />& B_0 & B_1& B_2& B_3& B_4& B_5& B_6& B_7& B_8& B_9& B_{10} \\<br />\hline<br />&&&&&&&&&& \\<br />\mbox{Av} & 0 & \frac{129}{3465} & \frac{236}{3465}& \frac{321}{3465}& \frac{384}{3465}& \frac{425}{3465}& \frac{444}{3465}& \frac{441}{3465}& \frac{416}{3465}& \frac{369}{3465}& \frac{243}{3465}<br />\end{array}$$<br />And now compare that with what they would do if they updated at step (iv) using BC rather than Averaging. I learn the distributions of the second and third individuals. Now, since I know how many times they tossed their coin, and I know that they updated by BC at step (ii), I thereby learn something about how their coin landed. I know that it landed in such a way that would lead them to update to $P(-|HH)$ and $P(-|TH)$, respectively. Now what exactly does this tell me? In the case of the second individual, it tells me that their coin landed $HH$, since that's the only evidence that would lead them to update to $P(-|HH)$. In the case of the third individual, my evidence is not quite so specific. I learn that their coin either landed $HT$ or $TH$, since either of those, and only those, would lead them to update to $P(-|TH)$. In general, learning an individual's posteriors when you know their prior and the number of times they've tossed the coin will teach you how many heads they saw and how many tails, though it won't tell you the order in which they saw them. But that's fine. We can still update on that information using BC, and indeed BC will tell us to adopt the same credence as we would if we were to learn the more specific evidence of the order in which the coin tosses landed.
If we do so in this case, we get:<br />$$\begin{array}{r|ccccccccccc}<br />& B_0 & B_1& B_2& B_3& B_4& B_5& B_6& B_7& B_8& B_9& B_{10} \\<br />\hline&&&&&&&&&& \\<br />\mbox{Bayes} & 0 & \frac{81}{95205} & \frac{1024}{95205} & \frac{3969}{95205} & \frac{9216}{95205} & \frac{15625}{95205} & \frac{20736}{95205} & \frac{21609}{95205} & \frac{16384}{95205} & \frac{6561}{95205} &0 \\<br />\end{array}<br />$$And this is pretty far from what I got by Averaging at step (iv).<br /><br />So updating using BC is very different from averaging. Why, then, do Douven and Wenmackers use Averaging rather than BC for step (iv)? Here is their motivation:</p><p></p><blockquote>[T]aking a convex combination of the probability functions of the individual agents in a group is the best studied method of forming social probability functions. Authors concerned with social probability functions have mostly considered assigning different weights to the probability functions of the various agents, typically in order to reflect agents’ opinions about other agents’ expertise or past performance. The averaging part of our update rule is in some regards simpler and in others less simple than those procedures. It is simpler in that we form probability functions from individual probability functions by taking only straight averages of individual probability functions, and it is less simple in that we do not take a straight average of the probability functions of all given agents, but only of those whose probability function is close enough to that of the agent whose probability is being updated. (552)</blockquote><p></p><p>In some sense, they're right. Averaging or linear pooling or taking a convex combination of individual credence functions is indeed the best studied method of forming social credence functions. 
And there are good justifications for it: <a href="https://www.math.utk.edu/~wagner/papers/arithmetic.pdf" target="_blank">János Aczél and Carl Wagner</a> and, independently, <a href="https://www.jstor.org/stable/2287843" target="_blank">Kevin J. McConway</a>, give a neat axiomatic characterization; and <a href="https://philpapers.org/rec/PETOTA-3" target="_blank">I've argued</a> that there are accuracy-based reasons to use it in particular cases. The problem is that our situation in step (iv) is not the sort of situation in which you should use Averaging. Arguments for Averaging concern those situations in which you have a group of individuals, possibly experts, and each has a credence function over the same set of propositions, and you want to produce a single credence function that could be called the group's collective credence function. Thus, for instance, if I wish to give the SAGE group's collective credence that there will be a safe and effective SARS-CoV-2 vaccine by March 2021, I might take the average of their individual credences. But this is quite a different task from the one that faces me as the first individual when I reach step (iv) of Douven and Wenmackers' process. There, I already have credences in the propositions in question. What's more, I know how the other individuals update and the sort of evidence they will have received, even if I don't know which particular evidence of that sort they have. And that allows me to infer from their credences after the update at step (ii) a lot about the evidence they receive. And I have opinions about the propositions in question conditional on the different evidence my fellow group members received. And so, in this situation, I'm not trying to summarise our individual opinions as a single opinion. Rather, I'm trying to use their opinions as evidence to inform my own. And, in that case, BC is better than Averaging. 
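Returning to the three-agent example above, its tables can be checked exactly with Python's fractions module (a sketch; the variable names are mine):

```python
from fractions import Fraction

CHANCES = [Fraction(i, 10) for i in range(11)]

def posterior(heads, tails):
    """Condition the uniform prior over B_0, ..., B_10 on the toss counts
    (the uniform prior cancels in the normalization)."""
    raw = [p ** heads * (1 - p) ** tails for p in CHANCES]
    total = sum(raw)
    return [r / total for r in raw]

p_ht = posterior(1, 1)   # my update on HT; the third agent's on TH is identical
p_hh = posterior(2, 0)   # the second agent's update on HH

# Step (iv) by Averaging, with all three credence functions in my interval:
averaged = [(2 * p_ht[i] + p_hh[i]) / 3 for i in range(11)]

# Step (iv) by BC instead: the pooled evidence is 4 heads and 2 tails in all.
bayes = posterior(4, 2)

print(averaged[1], bayes[1])   # prints: 43/1155 27/31735
```

The printed values are the reduced forms of $\frac{129}{3465}$ and $\frac{81}{95205}$ from the tables above.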
So, in order to show that IBE is superior to BC in some respect, it doesn't help to compare BC at step (ii) + Averaging at step (iv) with IBE at (ii) + Averaging at (iv). It would be better to compare BC at (ii) and (iv) with IBE at (ii) and (iv). <br /><br />So how do things look if we do that? Well, it turns out that we don't need simulations to answer the question. We can simply appeal to the mathematical results we mentioned in the previous post: first, <a href="https://philpapers.org/rec/GREJCC" target="_blank">Hilary Greaves and David Wallace's expected accuracy argument</a>; and second, the accuracy dominance argument that <a href="https://onlinelibrary.wiley.com/doi/full/10.1111/nous.12258" target="_blank">Ray Briggs and I</a> gave. Or, more precisely, we use the slight extensions of those results to multiple learning experiences that I sketched in the previous post. For both of those results, the background framework is the same. We begin with a prior, which we hold at $t_0$, before we begin gathering evidence. And we then look forward to a series of times $t_1, \ldots, t_n$ at each of which we will learn some evidence. And, for each time, we know the possible pieces of evidence we might receive, and we plan, for each time, which credence function we would adopt in response to each of the pieces of evidence we might learn at that time. Thus, formally, for each $t_i$ there is a partition from which our evidence at $t_i$ will come. For each $t_{i+1}$, the partition is a fine-graining of the partition at $t_i$. That is, our evidence gets more specific as we proceed. 
In the case we've been considering, at $t_1$, we'll learn the outcome of our own coin tosses; at $t_2$, we'll add to that our fellow group members' credence functions at $t_1$, from which we can derive a lot about the outcome of their first run of coin tosses; at $t_3$, we'll add to that the outcome of our next run of our own coin tosses; at $t_4$, we'll add the outcomes of the other group members' coin tosses by learning their credences at $t_3$; and so on. The results are then as follows: </p><p><b>Theorem (Extended Greaves and Wallace)</b> <i>For any strictly proper inaccuracy measure, the updating rule that minimizes expected inaccuracy from the point of view of the prior is BC</i>.</p><p><b>Theorem (Extended Briggs and Pettigrew)</b> <i>For any continuous and strictly proper inaccuracy measure, if your updating rule is not BC, then there is an alternative prior and alternative updating rule that accuracy dominates your prior and your updating rule</i>.</p><p>Now, these results immediately settle one question: if you are an individual in the group, and you know which update rules the others have chosen to use, then you should certainly choose BC for yourself. After all, if you have picked your prior, then it expects picking BC to minimize your inaccuracy, and thus expects picking BC to minimize the total inaccuracy of the group that includes you; and if you have not picked your prior, then if you consider a prior together with something other than BC as your updating rule, there's some other combination you could choose instead that is guaranteed to do better, and thus some other combination you could choose that is guaranteed to improve the total accuracy of the group. But Douven and Wenmackers don't set up the problem like this. Rather, they assume that all members of the group use the same updating rule. So the question is whether everyone picking BC is better than everyone picking something else.
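As an aside, the expected-accuracy claim at the heart of these results is easy to check numerically in a toy case (a sketch under the Brier score; the three-world prior and the names are mine, not from Greaves and Wallace):

```python
import random

WORLDS = range(3)
PRIOR = [0.5, 0.3, 0.2]
PARTITION = [[0, 1], [2]]   # I will learn either E_1 = {w0, w1} or E_2 = {w2}

def brier(cred, w):
    """Brier inaccuracy of a credence function over the worlds, at world w."""
    return sum((cred[v] - (1 if v == w else 0)) ** 2 for v in WORLDS)

def expected_inaccuracy(plan):
    """A plan assigns one posterior per cell; it is scored at each world
    with the posterior it mandates for that world's cell."""
    return sum(PRIOR[w] * brier(post, w)
               for cell, post in zip(PARTITION, plan) for w in cell)

def conditionalize(cell):
    mass = sum(PRIOR[w] for w in cell)
    return [PRIOR[w] / mass if w in cell else 0.0 for w in WORLDS]

bc_plan = [conditionalize(cell) for cell in PARTITION]

# No rival probabilistic plan does better in expectation:
random.seed(0)
for _ in range(1000):
    rival = []
    for cell in PARTITION:
        raw = [random.random() for _ in WORLDS]
        rival.append([r / sum(raw) for r in raw])
    assert expected_inaccuracy(bc_plan) <= expected_inaccuracy(rival) + 1e-12
```

The loop never fails because the Brier score is strictly proper: within each cell, the conditional probabilities minimize expected inaccuracy.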
Fortunately, at least in the case of the coin tosses, this does follow. As we'll see, things could get more complicated with other sorts of evidence.<br /><br />If you know the updating rules that others will use, then you pick your updating rule simply on the basis of its ability to get you the best accuracy possible; the others have made their choices and you can't affect that. But if you are picking an updating rule for everyone to use, you must consider not only its properties as an updating rule for the individual, but also its properties as a means of signalling to the other members what evidence you have. Thus, prior to considering the details of this, you might think that there could be an updating rule that is very good at producing accurate responses to evidence, but poor at producing a signal to others of the evidence you've received---there might be a wide range of different pieces of evidence you could receive that would lead you to update to the same posterior using this rule, and in that case, learning your posterior would give little information about your evidence. If that were so, we might prefer an updating rule that does not produce such accurate updates, but does signal very clearly what evidence is received. For, in that situation, each individual would produce a less accurate update at step (ii), but would then receive a lot more evidence at step (iv), because the update at step (ii) would signal the evidence that the other members of the group received much more clearly. However, in the coin toss set up that Douven and Wenmackers consider, this isn't an issue. In the coin toss case, learning someone's posterior when you know their prior and how many coin tosses they have observed allows you to learn exactly how many heads and how many tails they observed. 
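This inversion is easy to carry out explicitly (a sketch, assuming the uniform prior; the names are mine):

```python
from fractions import Fraction

CHANCES = [Fraction(i, 10) for i in range(11)]

def bc_posterior(heads, tails):
    """Uniform prior over the bias hypotheses, conditioned on the toss counts."""
    raw = [p ** heads * (1 - p) ** tails for p in CHANCES]
    total = sum(raw)
    return [r / total for r in raw]

def recover_counts(reported, n_tosses):
    """Invert a reported posterior: every split of n_tosses into heads and
    tails that would produce it under BC from the uniform prior."""
    return [(h, n_tosses - h) for h in range(n_tosses + 1)
            if bc_posterior(h, n_tosses - h) == reported]

# A peer reporting the posterior that arises from HH must have seen 2 heads:
print(recover_counts(bc_posterior(2, 0), 2))   # prints: [(2, 0)]
```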
It doesn't tell you the order in which they occurred, but that further information wouldn't affect how you would update anyway, either on the BC rule or on the IBE rule---learning $HT \vee TH$ leads to the same update as learning $HT$ for both Bayesian and IBEist. So when we are comparing them, we can consider the information learned at step (ii) and step (iv) both to be worldly information. Both give us information about the tosses of the coin that our peers witnessed. And so we needn't take into account how good they are at signalling the evidence you have. They are both equally good and both very good. So comparing them when choosing a single rule that each member of the group must use, we need only compare the accuracy of using them as update rules. And the theorems above indicate that BC wins out on that measure.</p>Richard Pettigrewhttp://www.blogger.com/profile/07828399117450825734noreply@blogger.com27tag:blogger.com,1999:blog-4987609114415205593.post-9501253408257158292020-08-24T07:55:00.002+01:002020-08-24T07:56:12.521+01:00Accuracy and explanation: thoughts on Douven<p> For a PDF of this post, see <a href="https://drive.google.com/file/d/1DUmhOSySmEIn2TCKiaUtUcs2D2zDlVzi/view?usp=sharing" target="_blank">here</a>. <br /></p><p>Igor has eleven coins in his pocket. The first has 0% chance of landing heads, the second 10% chance, the third 20%, and so on up to the tenth, which has 90% chance, and the eleventh, which has 100% chance. He picks one out without letting me know which, and he starts to toss it. After the first 10 tosses, it has landed tails 5 times. How confident should I be that the coin is fair? That is, how confident should I be that it is the sixth coin from Igor's pocket; the one with 50% chance of landing heads?
According to the Bayesian, the answer is calculated as follows:$$P_E(H_5) = P(H_5 | E) = \frac{P(H_5)P(E | H_5)}{\sum^{10}_{i=0} P(H_i) P(E|H_i)}$$where</p><ul style="text-align: left;"><li>$E$ is my evidence, which says that 5 out of 10 of the tosses landed heads,</li><li>$P_E$ is my new posterior credence function upon learning the evidence $E$,</li><li>$P$ is my prior,</li><li>$H_i$ is the hypothesis that the coin has $\frac{i}{10}$ chance of landing heads,</li><li>$P(H_0) = \ldots = P(H_{10}) = \frac{1}{11}$, since I know nothing about which coin Igor pulled from his pocket, and</li><li>$P(E | H_i) = \left ( \frac{i}{10} \right )^5 \left (\frac{10-i}{10} \right )^5$, by the Principal Principle, and since each coin toss is independent of each other one (treating $E$ as reporting a particular sequence of five heads and five tails; the binomial factor would cancel in the calculation anyway).</li></ul><p>So, upon learning that the coin landed heads five times out of ten, my posterior should be:$$P_E(H_5) = P(H_5 | E) = \frac{P(H_5)P(E | H_5)}{\sum^{10}_{i=0} P(H_i) P(E|H_i)} = \frac{\frac{1}{11} \left ( \frac{5}{10} \right )^5\left ( \frac{5}{10} \right )^5}{\sum^{10}_{i=0}\frac{1}{11} \left ( \frac{i}{10} \right )^5 \left (\frac{10-i}{10} \right )^5 } \approx 0.2707$$But some philosophers have suggested that this is too low. The Bayesian calculation takes into account how likely the hypothesis in question makes the evidence, as well as how likely I thought the hypothesis in the first place, but it doesn't take into account that the hypothesis explains the evidence. We'll call these philosophers explanationists. Upon learning that the coin landed heads five times out of ten, the explanationist says, we should be most confident in $H_5$, the hypothesis that the coin is fair, and the Bayesian calculation does indeed give this.
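The calculation can be reproduced directly (a sketch; the likelihoods are those given above, with $E$ read as a particular sequence):

```python
# Bayesian posterior that the coin is fair, given 5 heads in 10 tosses.
chances = [i / 10 for i in range(11)]
prior = [1 / 11] * 11                      # uniform over H_0, ..., H_10

# Likelihood of a particular sequence with 5 heads and 5 tails under H_i:
liks = [p ** 5 * (1 - p) ** 5 for p in chances]

posterior_fair = (prior[5] * liks[5]
                  / sum(pr * l for pr, l in zip(prior, liks)))
print(round(posterior_fair, 4))            # prints: 0.2707
```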
But we should be most confident in part because $H_5$ best explains the evidence, and the Bayesian calculation takes no account of this.<br /><br />To accommodate the explanationist's demand, <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-9213.12032" target="_blank">Igor Douven</a> proposes the following alternative updating rule:$$P_E(H_k) = \frac{P(H_k)P(E | H_k) + f(H_k, E)}{\sum^{10}_{i=0} (P(H_i) P(E|H_i) + f(H_i, E))}$$where $f$ gives a little boost to $H_k$ if it is the best explanation of $E$ and not if it isn't. Perhaps, for instance,</p><ul style="text-align: left;"><li>$f(H_k, E) = 0.1$, if the frequency of heads among the coin tosses that $E$ reports is uniquely closest to the chance of heads according to $H_k$, namely, $\frac{k}{10}$,</li><li>$f(H_k, E) = 0.05$, if the frequency of heads among the coin tosses that $E$ reports is equally closest to the chance of heads according to $H_k$ and another hypothesis,</li><li>$f(H_k, E) = 0$, otherwise.</li></ul><p>Thus, according to this:$$P_E(H_5) = \frac{P(H_5)P(E | H_5) + 0.1}{\left (\sum^{10}_{i=0} P(H_i) P(E|H_i) \right ) + 0.1} = \frac{\frac{1}{11} \left ( \frac{5}{10} \right )^5\left ( \frac{5}{10} \right )^5 + 0.1}{\sum^{10}_{i=0}\frac{1}{11} \left ( \frac{i}{10} \right )^5 \left (\frac{10-i}{10} \right )^5 + 0.1 } \approx 0.9976$$So, as required, $H_5$ certainly gets a boost in posterior probability because it best explains the run of heads and tails we observe.<br /><br />Before we move on, it's worth noting a distinctive feature of this case. In many cases where we wish to apply something like abduction or inference to the best explanation, we might think that we can record our enthusiasm for good explanations in the priors. For instance, suppose I have two scientific theories, $T_1$ and $T_2$, both of which predict the evidence I've collected. So, they both make the evidence equally likely.
But I want to assign higher probability to $T_1$ upon receipt of that evidence because it provides a better explanation for the evidence. Then I should simply encode this in my prior. That is, I should assign $P(T_1) > P(T_2)$. But that sort of move isn't open to us in Douven's example. The reason is that none of the chance hypotheses are better explanations in themselves: none is simpler or more general or what have you. But rather, for each, there is evidence we might obtain such that it is a better explanation of that evidence. But before we obtain the evidence, we don't know which will prove the better explanation of it, and so can't accommodate our explanationist instincts by giving that hypothesis a boost in our prior.<br /><br />Now let's return to the example. There are well known objections to updating in the explanationist way Douven suggests. Most famously, van Fraassen pointed out that we have good reasons to comply with the Bayesian method of updating, and the explanationist method deviates quite dramatically from that (<i>Laws and Symmetry</i>, chapter 6). When he was writing, the most compelling argument was <a href="https://philpapers.org/rec/LEWWC" target="_blank">David Lewis' diachronic Dutch Book argument</a>. If you plan to update as Douven suggests, by giving an extra-Bayesian boost to the hypothesis that best explains the evidence, then there is a series of bets you'll accept before you receive the evidence and another set you'll accept afterwards that, taken together, will lose you money for sure. Douven is unfazed. He first suggests that vulnerability to a Dutch Book does not impugn your epistemic rationality, but only your practical rationality.
He notes <a href="http://fitelson.org/probability/skyrms_coherence.pdf" target="_blank">Skyrms's claim</a> that, in the case of synchronic Dutch Books, such vulnerability reveals an inconsistency in your assessment of the same bet presented in different ways, and therefore perhaps some epistemic failure, but observes that this cannot be extended to the diachronic case. In any case, he says, avoiding the machinations of malevolent bookies is only one practical concern that we have, and, let's be honest, not a very pressing one. What's more, he points out that, while updating in the Bayesian fashion serves one practical end, namely, making us immune to these sorts of diachronic sure losses, there are other practical ends it might not serve as well. For instance, he uses computer simulations to show that, if we update in his explanationist way, we'll tend to assign credence greater than 0.99 in the true hypothesis much more quickly than if we update in the Bayesian way. He admits that we'll also tend to assign credence greater than 0.99 in a false hypothesis much more quickly than if we use Bayesian updating. But he responds, again with the results of a computer simulation: suppose we keep tossing the coin until one of the rules assigns more than 0.99 to a hypothesis; then award points to that rule if the hypothesis it becomes very confident in is true, and deduct them if it is false; then the explanationist updating rule will perform better on average than the Bayesian rule. So, if there is some practical decision that you will make only when your credence in a hypothesis exceeds 0.99 -- perhaps the choice is to administer a particular medical treatment, and you need to be very certain in your diagnosis before doing so -- then you will be better off on average updating as Douven suggests, rather than as the Bayesian requires.<br /><br />So much for the practical implications of updating in one way or another.
I am more interested in the epistemic implications, and so is Douven. He notes that, since van Fraassen gave his argument, there is a new way of justifying the Bayesian demand to update by conditioning on your evidence. These are the accuracy arguments. While Douven largely works with the argument for conditioning that <a href="https://philpapers.org/rec/LEIAOJ" target="_blank">Hannes Leitgeb and I</a> gave, I think the better version of that argument is due to <a href="https://philpapers.org/rec/GREJCC" target="_blank">Hilary Greaves and David Wallace</a>. The idea is that, as usual, we measure the inaccuracy of a credence function using a strictly proper inaccuracy measure $\mathfrak{I}$. That is, if $P$ is a probabilistic credence function and $w$ is a possible world, then $\mathfrak{I}(P, w)$ gives the inaccuracy of $P$ at $w$. And, if $P$ is a probabilistic credence function, $P$ expects itself to be least inaccurate. That is, $\sum_w P(w) \mathfrak{I}(P, w) < \sum_w P(w) \mathfrak{I}(Q, w)$, for any credence function $Q \neq P$. Then Greaves and Wallace ask us to consider how you might plan to update your credence function in response to different pieces of evidence you might receive. Thus, suppose you know that the evidence you'll receive will be one of the following propositions, $E_1, \ldots, E_m$, which form a partition. This is the situation you're in if you know that you're about to witness 10 tosses of a coin, for instance, as in Douven's example: $E_1$ might be $HHHHHHHHHH$, $E_2$ might be $HHHHHHHHHT$, and so on. Then suppose you plan how you'll respond to each. If you learn $E_i$, you'll adopt $P_i$. Then we'll call this updating plan $\mathcal{R}$ and write it $(P_1, \ldots, P_m)$. Then we can calculate the expected inaccuracy of a given updating plan. Its inaccuracy at a world is the inaccuracy of the credence function it recommends in response to learning the element of the partition that is true at that world. 
That is, for world $w$ at which $E_i$ is true,$$\mathfrak{I}(\mathcal{R}, w) = \mathfrak{I}(P_i, w)$$And Greaves and Wallace show that the updating rule your prior expects to be best is the Bayesian one. That is, if there is $E_i$ with $P(E_i) > 0$ and $P_i(-) \neq P(-|E_i)$, then there is an alternative updating rule $\mathcal{R}^\star = (P^\star_1, \ldots, P^\star_m)$ such that$$\sum_w P(w) \mathfrak{I}(\mathcal{R}^\star, w) < \sum_w P(w) \mathfrak{I}(\mathcal{R}, w)$$So, in particular, your prior expects the Bayesian rule to be more accurate than Douven's rule. <br /><br />In response to this, Douven points out that there are many ways in which we might value the accuracy of our updating plans. For instance, the Greaves and Wallace argument considers only your accuracy at a single later point in time, after you've received a single piece of evidence and updated only on it. But, Douven argues, we might be interested not in the one-off inaccuracy of a single application of an updating rule, but rather in its inaccuracy in the long run. And we might be interested in different features of the long-run total inaccuracy of using that rule: we might be interested in just adding up all of the inaccuracies of the various credence functions you obtain from multiple applications of the rule; or we might be less interested in the inaccuracies of the interim credence functions and more interested in the inaccuracy of the final credence function you obtain after multiple updates. And, Douven claims, the accuracy arguments do not tell us anything about which performs better out of the Bayesian and explanationist approaches when viewed in these different ways.<br /><br />However, that's not quite right. It turns out that we can, in fact, adapt the Greaves and Wallace argument to cover these cases. To see how, it's probably best to illustrate it with the simplest possible case, but it should be obvious how to scale up the idea.
So suppose: </p><ul style="text-align: left;"><li>my credences are defined over four worlds, $XY$, $X\overline{Y}$, $\overline{X}Y$, and $\overline{X}\overline{Y}$;</li><li>my prior at $t_0$ is $P$;</li><li>at $t_1$, I'll learn either $X$ or its negation $\overline{X}$, and I'll respond with $P_X$ or $P_{\overline{X}}$, respectively;</li><li>at $t_2$, I'll learn $XY$, $X\overline{Y}$, $\overline{X}Y$, or $\overline{X} \overline{Y}$, and I'll respond with $P_{XY}$, $P_{X\overline{Y}}$, $P_{\overline{X}Y}$, or $P_{\overline{X}\overline{Y}}$, respectively.</li></ul><p>For instance, I might know that a coin is going to be tossed twice, once just before $t_1$ and once just before $t_2$. So $X$ is the proposition that it lands heads on the first toss, i.e., $X = \{HH, HT\}$, while $\overline{X}$ is the proposition it lands tails on the first toss, i.e., $\overline{X} = \{TH, TT\}$. And then $Y$ is the proposition it lands heads on the second toss. So $XY = \{HH\}$, $X\overline{Y} = \{HT\}$, and so on. <br /><br />Now, taken together, $P_X$, $P_{\overline{X}}$, $P_{XY}$, $P_{X\overline{Y}}$, $P_{\overline{X}Y}$, and $P_{\overline{X}\overline{Y}}$ constitute my updating plan---let's denote that $\mathcal{R}$. Now, how might we measure the inaccuracy of this plan $\mathcal{R}$? Well, we want to assign a weight to the inaccuracy of the credence function it demands after the first update -- let's call that $\alpha_1$; and we want a weight for the result of the second update -- let's call that $\alpha_2$. So, for instance, if I'm interested in the total inaccuracy obtained by following this rule, and each time is just as important as each other time, I just set $\alpha_1 = \alpha_2$; but if I care much more about my final inaccuracy, then I let $\alpha_1 \ll \alpha_2$.
Then the inaccuracy of my updating rule is$$\begin{eqnarray*}<br />\mathfrak{I}(\mathcal{R}, XY) & = & \alpha_1 \mathfrak{I}(P_X, XY) + \alpha_2\mathfrak{I}(P_{XY}, XY) \\<br />\mathfrak{I}(\mathcal{R}, X\overline{Y}) & = & \alpha_1 \mathfrak{I}(P_X, X\overline{Y}) + \alpha_2\mathfrak{I}(P_{X\overline{Y}}, X\overline{Y}) \\<br />\mathfrak{I}(\mathcal{R}, \overline{X}Y) & = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + \alpha_2\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y) \\<br />\mathfrak{I}(\mathcal{R}, \overline{X}\overline{Y}) & = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y}) + \alpha_2\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})<br />\end{eqnarray*}$$Thus, the expected inaccuracy of $\mathcal{R}$ from the point of view of my prior $P$ is:<br /><br />$P(XY)\mathfrak{I}(\mathcal{R}, XY) + P(X\overline{Y})\mathfrak{I}(\mathcal{R}, X\overline{Y}) + P(\overline{X}Y)\mathfrak{I}(\mathcal{R}, \overline{X}Y) + P(\overline{X} \overline{Y})\mathfrak{I}(\mathcal{R}, \overline{X}\overline{Y}) = $<br /><br />$P(XY)[\alpha_1 \mathfrak{I}(P_X, XY) + \alpha_2\mathfrak{I}(P_{XY}, XY)] + $<br /><br />$P(X\overline{Y})[\alpha_1 \mathfrak{I}(P_X, X\overline{Y}) + \alpha_2\mathfrak{I}(P_{X\overline{Y}}, X\overline{Y})] + $<br /><br />$P(\overline{X}Y)[\alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + \alpha_2\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y)] + $<br /><br />$P(\overline{X}\overline{Y})[\alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y}) + \alpha_2\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})]$<br /><br />But it's easy to see that this is equal to:<br /><br />$\alpha_1[P(XY)\mathfrak{I}(P_X, XY) + P(X\overline{Y})\mathfrak{I}(P_X, X\overline{Y}) + $<br /><br />$P(\overline{X}Y)\mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + P(\overline{X}\overline{Y})\mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y})] + $<br /><br />$\alpha_2[P(XY)\mathfrak{I}(P_{XY}, XY) + 
P(X\overline{Y})\mathfrak{I}(P_{X\overline{Y}}, X\overline{Y}) + $<br /><br />$P(\overline{X}Y)\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y) + P(\overline{X}\overline{Y})\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})]$<br /><br />Now, this is the weighted sum of the expected inaccuracies of the two parts of my updating plan taken separately; the part that kicks in at $t_1$, and the part that kicks in at $t_2$. And, thanks to Greaves and Wallace's result, we know that each of those expected inaccuracies is minimized by the rule that demands you condition on your evidence. Now, we also know that conditioning $P$ on $XY$ is the same as conditioning $P(-|X)$ on $XY$, and so on. So a rule that tells you, at $t_2$, to update your $t_0$ credence function on your total evidence at $t_2$ is also one that tells you, at $t_2$, to update your $t_1$ credence function on your total evidence at $t_2$. So, of the updating rules that cover the two times $t_1$ and $t_2$, the one that minimizes expected inaccuracy is the one that results from conditioning at each time. That is, if the part of $\mathcal{R}$ that kicks in at $t_1$ doesn't demand I condition my prior on my evidence at $t_1$, or if the part of $\mathcal{R}$ that kicks in at $t_2$ doesn't demand I condition my credence function at $t_1$ on my evidence at $t_2$, then there is an alternative rule $\mathcal{R}^\star$, that $P$ expects to be more accurate: that is,$$\sum_w P(w)\mathfrak{I}(\mathcal{R}^\star, w) < \sum_w P(w)\mathfrak{I}(\mathcal{R}, w)$$And, as I mentioned above, it's clear how to generalize this to cover not just updating plans that cover two different times at which you receive evidence, but any finite number.<br /><br />However, I think Douven would not be entirely moved by this. After all, while he is certainly interested in the long-run effects on inaccuracy of using one updating rule or another, he thinks that looking only to expected inaccuracy is a mistake. 
He thinks that we care about other features of updating rules. Indeed, he provides us with one, and uses computer simulations to show that, in the toy coin tossing case that we've been using, the explanationist account has that desirable feature to a greater degree than the Bayesian account.</p><blockquote>For each possible bias value, we ran 1000 simulations of a sequence of 1000 tosses. As previously, the explanationist and the Bayesian updated their degrees of belief after each toss. We registered in how many of those 1000 simulations the explanationist incurred a lower penalty than the Bayesian at various reference points [100 tosses, 250, 500, 750, 1000], at which we calculated both Brier penalties and log score penalties. The outcomes [...] show that, on either measure of inaccuracy, IBE is most often the winner—it incurs the lowest penalty -- at each reference point. Hence, at least in the present kind of context, IBE seems a better choice than Bayes' rule. (page 439)</blockquote>How can we square this with the Greaves and Wallace result? Well, as Douven goes on to explain: "[the explanationist rule] in general achieves greater accuracy than [the Bayesian], even if typically not much greater accuracy"; but "[the Bayesian rule] is less likely than [explanationist rule] to ever make one vastly inaccurate, even though the former typically makes one somewhat more inaccurate than the latter." So the explanationist is most often more accurate, but when it is more accurate, it's only a little more, while when it is less accurate, it's a lot less. So, in expectation, the Bayesian rule wins. Douven then argues that you might be more interested in being more likely to be more accurate, rather than being expectedly more accurate. <br /><br />Perhaps. But in any case there's another accuracy argument for the Bayesian way of updating that doesn't assume that expected inaccuracy is the thing you want to minimize. 
This is an argument that <a href="https://philpapers.org/rec/BRIAAA-11" target="_blank">Ray Briggs and I</a> gave a couple of years ago. I'll illustrate it in the same setting we used above, where we have prior $P$, at $t_1$ we'll learn $X$ or $\overline{X}$, and at $t_2$ we'll learn $XY$, $X\overline{Y}$, $\overline{X}Y$, or $\overline{X} \overline{Y}$. And we measure the inaccuracy of an updating rule $\mathcal{R} = (P_X, P_{\overline{X}}, P_{XY}, P_{X\overline{Y}}, P_{\overline{X}Y}, P_{\overline{X}\overline{Y}})$ for this setting as follows: <br />$$\begin{eqnarray*}<br />\mathfrak{I}(\mathcal{R}, XY) & = & \alpha_1 \mathfrak{I}(P_X, XY) + \alpha_2\mathfrak{I}(P_{XY}, XY) \\<br />\mathfrak{I}(\mathcal{R}, X\overline{Y}) & = & \alpha_1 \mathfrak{I}(P_X, X\overline{Y}) + \alpha_2\mathfrak{I}(P_{X\overline{Y}}, X\overline{Y}) \\<br />\mathfrak{I}(\mathcal{R}, \overline{X}Y) & = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}Y) + \alpha_2\mathfrak{I}(P_{\overline{X}Y}, \overline{X}Y) \\<br />\mathfrak{I}(\mathcal{R}, \overline{X}\overline{Y}) & = & \alpha_1 \mathfrak{I}(P_{\overline{X}}, \overline{X}\overline{Y}) + \alpha_2\mathfrak{I}(P_{\overline{X}\overline{Y}}, \overline{X}\overline{Y})<br />\end{eqnarray*}$$Then the following is true: if the part of my plan that kicks in at $t_1$ doesn't demand I condition my prior on my evidence at $t_1$, or if the part of my plan that kicks in at $t_2$ doesn't demand I condition my $t_1$ credence function on my evidence at $t_2$, then, for any $0 < \beta < 1$, there is an alternative prior $P^\star$ and its associated Bayesian updating rule $\mathcal{R}^\star$, such that, for all worlds $w$,$$\beta\mathfrak{I}(P^\star, w) + (1-\beta)\mathfrak{I}(\mathcal{R}^\star, w) < \beta \mathfrak{I}(P, w) + (1-\beta)\mathfrak{I}(\mathcal{R}, w)$$And, again, this result generalizes to cases that include any number of times at which we receive new evidence, and in which, at each time, the set of propositions we might receive as evidence 
forms a partition. So it certainly covers the case of the coin of unknown bias that we've been using throughout. So, if you plan to update in some way other than by Bayesian conditionalization starting with your prior, there is an alternative prior and plan that, taken together, are guaranteed to have greater accuracy than yours; that is, they will have greater total accuracy than yours however the world turns out.<br /><br />How do we square this with Douven's simulation results? The key is that this dominance result includes the prior in it. It does not say that, if $\mathcal{R}$ requires you not to condition $P$ on your evidence at any point, then a rule that does require that is guaranteed to be better. It says that if $\mathcal{R}$ requires you not to condition $P$ on your evidence at any point, then there is an alternative prior $P^\star$ such that it, together with a rule that requires you to condition it on your evidence, are better than $P$ and $\mathcal{R}$ for sure. Douven's results compare the performance of conditioning on $P$ and performing the explanationist update on it. The dominance result shows that, while conditioning on $P$ might not always give a better result than applying the explanationist rule to $P$, there is an alternative prior such that conditioning on it is guaranteed to be better than retaining the original prior and applying the explanationist rule. And that, I think, is the reason we should prefer conditioning on our evidence to giving the little explanationist boosts that Douven suggests. If we update by conditioning, our prior and update rule, taken together, are never accuracy dominated; if we update using Douven's explanationist rule, our prior and update rule, taken together, are accuracy dominated.<br /><br />Before wrapping up, it's worth mentioning that there's a little wrinkle to iron out. It might be that, while the original prior and the posteriors it generates at the various times all satisfy the Principal Principle, the dominating prior and updating rule don't. 
While being dominated is clearly bad, you might think that being dominated by something that is itself irrational -- because it violates the Principal Principle, or for other reasons -- isn't so bad. But in fact we can tweak things to avoid this situation. The following is true: if the part of my plan that kicks in at $t_1$ doesn't demand I condition my prior on my evidence at $t_1$, or if the part of my plan that kicks in at $t_2$ doesn't demand I condition my $t_1$ credence function on my evidence at $t_2$, then, for any $0 < \beta < 1$, there is an alternative prior $P^\star$ and its associated Bayesian updating rule $\mathcal{R}^\star$, such that $P^\star$ obeys the Principal Principle and, for all possible objective chance functions $ch$,<br /><br />$\beta\sum_{w} ch(w) \mathfrak{I}(P^\star, w) + (1-\beta)\sum_{w} ch(w) \mathfrak{I}(\mathcal{R}^\star, w) < $<br /><br />$\beta \sum_{w} ch(w) \mathfrak{I}(P, w) + (1-\beta)\sum_{w} ch(w) \mathfrak{I}(\mathcal{R}, w)$ <br /><br />So I'm inclined to think that Douven's critique of the Dutch Book argument against the explanationist updating rule hits the mark; and I can see why he thinks the expected accuracy argument against it is also less than watertight; but I think the accuracy dominance argument against it is stronger. We shouldn't use that updating rule, with its extra boost for explanatory hypotheses, because if we do so, there will be an alternative prior such that applying the Bayesian updating rule to that prior is guaranteed to be more accurate than applying the explanationist rule to our actual prior.<br /><p></p>Richard Pettigrew<br /><br /><b>The only symmetric inaccuracy measure is the Brier score</b> (2020-08-11)<p>If you'd like a PDF of this post, see here. </p><p>[UPDATE 1: I should have made this clear in the original post. 
The Normality condition makes the proof go through more easily, but it isn't really necessary. Suppose we simply assume instead that $$\mathfrak{I}(w^i, j)= \left \{ \begin{array}{ll} b & \mbox{if } i \neq j \\ a & \mbox{if } i = j \end{array} \right.$$Then we can show that, if $\mathfrak{I}$ is symmetric, then for any probabilistic credence function $p$ and any world $w_i$,$$\mathfrak{I}(p, i) = (b-a)\frac{1}{2} \left (1 - 2p_i + \sum_j p^2_j \right ) + a$$End Update 1.]</p><p>[UPDATE 2: There's something puzzling about the result below. Suppose $\mathcal{W} = \{w_1, \ldots, w_n\}$ is the set of possible worlds. And suppose $\mathcal{F}$ is the full algebra of propositions built out of those worlds. That is, $\mathcal{F}$ is the set of subsets of $\mathcal{W}$. Then there are two versions of the Brier score for a probabilistic credence function $p$ defined on $\mathcal{F}$. The first considers only the credences that $p$ assigns to the possible worlds. Thus,$$\mathfrak{B}(p, i) = \sum^n_{j=1} (w^i_j - p_j)^2 = 1 - 2p_i + \sum_j p^2_j$$But there is another that considers also the credences that $p$ assigns to the other propositions in $\mathcal{F}$. Thus,$$\mathfrak{B}^\star(p, i) = \sum_{X \in \mathcal{F}} (w_i(X) - p(X))^2$$Now, at first sight, these look related, but not very closely. However, notice that both are symmetric. Thus, by the extension of Selten's theorem below (plus update 1 above), if $\mathfrak{I}(w^i, j) = b$ for $i \neq j$ and 0 for $i = j$, then $\mathfrak{I}(p, i) = \frac{1}{2}b\mathfrak{B}(p, i)$. Now, $\mathfrak{B}(w^i, j) = 2$ for $i \neq j$, and $\mathfrak{B}(w^i, j) = 0$ for $i = j$, and so this checks out. But what about $\mathfrak{B}^\star$? Well, according to our extension of Selten's theorem, since $\mathfrak{B}^\star$ is symmetric, we can see that it is just a multiple of $\mathfrak{B}$, the factor determined by $\mathfrak{B}^\star(w^i, j)$. So what is this number? 
Well, it turns out that, if $i \neq j$, then$$\mathfrak{B}^\star(w^i, j) = 2\sum^{n-2}_{k=0} {n-2 \choose k} = 2^{n-1}$$Thus, it follows that$$\mathfrak{B}^\star(p, i) = \sum^{n-2}_{k=0} {n-2 \choose k}\mathfrak{B}(p, i) = 2^{n-2}\,\mathfrak{B}(p, i)$$And you can verify this by other means as well. This is quite a nice result independently of all this stuff about symmetry. After all, there doesn't seem any particular reason to favour $\mathfrak{B}$ over $\mathfrak{B}^\star$ or vice versa. This result shows that using one for the sorts of purposes we have in accuracy-first epistemology won't give different results from using the other. End update 2.]</p><p>So, as is probably obvious, I've been trying recently to find out what things look like in accuracy-first epistemology if you drop the assumption that the inaccuracy of a whole credal state is the sum of the inaccuracies of the individual credences that it comprises --- this assumption is sometimes called Additivity or Separability. In this post, I want to think about a result concerning additive inaccuracy measures that intrigued me in the past and on the basis of which I tried to mount an argument in favour of the Brier score. The result dates back to <a href="https://link.springer.com/article/10.1023/A:1009957816843" target="_blank">Reinhard Selten</a>, the German economist who shared the 1994 Nobel prize with John Harsanyi and John Nash for his contributions to game theory. In this post, I'll show that the result goes through even if we don't assume additivity.<br /></p><p>Suppose $\mathfrak{I}$ is an inaccuracy measure. Thus, if $c$ is a credence function defined on the full algebra built over the possible worlds $w_1, \ldots, w_n$, then $\mathfrak{I}(c, i)$ measures the inaccuracy of $c$ at world $w_i$. 
Then define the following function on pairs of probabilistic credence functions:$$\mathfrak{D}_\mathfrak{I}(p, q) = \sum_i p_i \mathfrak{I}(q, i) - \sum_i p_i\mathfrak{I}(p, i)$$$\mathfrak{D}_\mathfrak{I}$ measures how much more inaccurate $p$ expects $q$ to be than it expects itself to be; equivalently, how much more accurate $p$ expects itself to be than it expects $q$ to be. Now, if $\mathfrak{I}$ is strictly proper, $\mathfrak{D}_\mathfrak{I}$ is positive whenever $p$ and $q$ are different, and zero when they are the same, so in that case $\mathfrak{D}_\mathfrak{I}$ is a divergence. But we won't be assuming that here -- rather remarkably, we don't need to.</p><p>Now, it's not hard to see that $\mathfrak{D}_\mathfrak{I}$ is not necessarily symmetric. For instance, consider the log score$$\mathfrak{L}(p, i) = -\log p_i$$Then$$\mathfrak{D}_\mathfrak{L}(p, q) = \sum_i p_i \log \frac{p_i}{q_i}$$This is the so-called Kullback-Leibler divergence and it is not symmetric. Nonetheless, it's equally easy to see that it is at least possible for $\mathfrak{D}_\mathfrak{I}$ to be symmetric. For instance, consider the Brier score$$\mathfrak{B}(p, i) = 1-2p_i + \sum_j p^2_j$$Then$$\mathfrak{D}_\mathfrak{B}(p, q) = \sum_i (p_i - q_i)^2$$So the natural question arises: how many inaccuracy measures are symmetric in this way? That is, how many generate symmetric divergences in the way that the Brier score does? It turns out: none, except the Brier score.</p><p>First, a quick bit of notation: Given a possible world $w_i$, we write $w^i$ for the probabilistic credence function that assigns credence 1 to world $w_i$ and 0 to any world $w_j$ with $j \neq i$. 
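Both of these computations are easy to confirm numerically. A quick sketch in Python (`divergence` implements $\mathfrak{D}_\mathfrak{I}$ as just defined; the particular $p$ and $q$ are arbitrary choices of mine):

```python
import numpy as np

def divergence(inacc, p, q):
    """D_I(p, q): how much more inaccurate p expects q to be than itself."""
    n = len(p)
    return sum(p[i] * inacc(q, i) for i in range(n)) - \
           sum(p[i] * inacc(p, i) for i in range(n))

def brier(c, i):
    return 1 - 2 * c[i] + float(np.sum(np.square(c)))

def log_score(c, i):
    return float(-np.log(c[i]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.2, 0.6])

# The Brier divergence is symmetric: both directions equal the squared
# Euclidean distance between p and q.
assert abs(divergence(brier, p, q) - divergence(brier, q, p)) < 1e-12
assert abs(divergence(brier, p, q) - float(np.sum((p - q) ** 2))) < 1e-12

# The log score generates the Kullback-Leibler divergence, which is
# not symmetric.
assert abs(divergence(log_score, p, q) - float(np.sum(p * np.log(p / q)))) < 1e-12
assert abs(divergence(log_score, p, q) - divergence(log_score, q, p)) > 1e-3
```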
</p><p>And two definitions:</p><p><b>Definition </b><b>(Normal inaccuracy measure) </b><i>An inaccuracy measure $\mathfrak{I}$ is </i>normal<i> if $$\mathfrak{I}(w^i, j) = \left \{ \begin{array}{ll} 1 & \mbox{if } i \neq j \\ 0 & \mbox{if } i = j \end{array} \right.$$</i></p><p><b>Definition (Symmetric inaccuracy measure)</b><i> An inaccuracy measure is </i>symmetric<i> if </i><i>$$\mathfrak{D}_\mathfrak{I}(p, q) = \mathfrak{D}_\mathfrak{I}(q, p)$$for all probabilistic credence functions $p$ and $q$.</i></p><p>Thus, $\mathfrak{I}$ is symmetric if, for any probability functions $p$ and $q$, the loss of accuracy that $p$ expects to suffer by moving to $q$ is the same as the loss of accuracy that $q$ expects to suffer by moving to $p$. </p><p><b>Theorem</b> <i>Any normal and symmetric inaccuracy measure agrees with the Brier score on probabilistic credence functions.<br /></i></p><p><i>Proof</i>. (This just adapts Selten's proof in exactly the way you'd expect.) Suppose $\mathfrak{D}_\mathfrak{I}(p, q) = \mathfrak{D}_\mathfrak{I}(q, p)$ for all probabilistic $p$, $q$. Then, in particular, for any world $w_i$ and any probabilistic $p$,$$\sum_j w^i_j \mathfrak{I}(p, j) - \sum_j w^i_j \mathfrak{I}(w^i, j) = \sum_j p_j \mathfrak{I}(w^i, j) -\sum_j p_j \mathfrak{I}(p, j)$$So, by Normality,$$\mathfrak{I}(p, i) = (1-p_i) - \sum_j p_j \mathfrak{I}(p, j)$$So, multiplying both sides by $p_i$ and summing over $i$,$$\sum_j p_j \mathfrak{I}(p, j) = 1 - \sum_j p^2_j- \sum_j p_j \mathfrak{I}(p, j)$$So,$$\sum_j p_j \mathfrak{I}(p, j) = \frac{1}{2}[1 - \sum_j p^2_j]$$So,$$\mathfrak{I}(p, i) = 1-p_i -\frac{1}{2}[1 - \sum_j p^2_j] = \frac{1}{2} \left (1 - 2p_i + \sum_j p^2_j \right )$$as required. 
$\Box$</p><p>There are two notable features of this result:</p><p>First, the theorem does not assume that the inaccuracy measure is strictly proper, but since the Brier score is strictly proper, it follows that symmetry entails strict propriety.</p><p>Second, the theorem does not assume additivity, but since the Brier score is additive, it follows that symmetry entails additivity. </p>Richard Pettigrew<br /><br /><b>The Accuracy Dominance Argument for Conditionalization without the Additivity assumption</b> (2020-08-10)<p> For a PDF of this post, see <a href="https://drive.google.com/file/d/1WpVipvzgrX6zaS1qeEbHxTWP03bZma9z/view?usp=sharing" target="_blank">here</a>. <br /></p><p><a href="https://m-phi.blogspot.com/2020/08/accuracy-without-additivity.html" target="_blank">Last week</a>, I explained how you can give an accuracy dominance argument for Probabilism without assuming that your inaccuracy measures are additive -- that is, without assuming that the inaccuracy of a whole credence function is obtained by adding up the inaccuracy of all the individual credences that it assigns. The mathematical result behind that also allows us to give <a href="https://drive.google.com/file/d/1z7Z4rE0BlEtCeD_YPktCEaj1Vm24RnSG/view?usp=sharing" target="_blank">my chance dominance argument for the Principal Principle</a> without assuming additivity, and ditto for <a href="https://drive.google.com/file/d/1_NeXKXQ84xL2plPDWz3ioA12Q3IDcIYf/view?usp=sharing" target="_blank">my accuracy-based argument for linear pooling</a>. In this post, I turn to another Bayesian norm, namely, Conditionalization. 
The first accuracy argument for this was given by <a href="https://philpapers.org/rec/GREJCC" target="_blank">Hilary Greaves and David Wallace</a>, building on ideas developed by <a href="https://philpapers.org/rec/ODDCCA" target="_blank">Graham Oddie</a>. It was an expected accuracy argument, and it didn't assume additivity. More recently, <a href="https://drive.google.com/open?id=1kjY_wQ0nlXIGfnla_MhWF1uQmcJtcTUB" target="_blank">Ray Briggs and I</a> offered an accuracy dominance argument for the norm, and we did assume additivity. It's this latter argument I'd like to consider here: I'll show how it goes through even without assuming additivity, and indeed generalise it at the same time. The generalisation is inspired by a recent paper by <a href="https://philpapers.org/rec/RESAID" target="_blank">Michael Rescorla</a>. In it, Rescorla notes that all the existing arguments for Conditionalization assume that, when your evidence comes in the form of a proposition learned with certainty, that proposition must be true. He then offers a Dutch Book argument for Conditionalization that doesn't make this assumption, and he issues a challenge for other sorts of arguments to do the same. Here, I take up that challenge. To do so, I will offer an argument for what I call the Weak Reflection Principle.</p><p><b>Weak Reflection Principle (WRP)</b> <i>Your current credence function should be a convex combination of the possible future credence functions that you endorse.</i></p><p>A lot might happen between now and tomorrow. I might see new sights, think new thoughts; I might forget things I know today, take mind-altering drugs that enhance or impair my thinking; and so on. So perhaps there is a set of credence functions I think I might have tomorrow. Some of those I'll endorse -- perhaps those that I'd get if I saw certain new things, or enhanced my cognition in various ways. 
And some of them I'll disavow -- perhaps those that I'd get if I forgot certain things, or impaired my cognition. WRP asks you to separate out the wheat from the chaff, and once you've identified the ones you endorse, it tells you that your current credence function should lie within the span of those future ones; it should be in their convex hull; it should be a weighted sum or convex combination of them.</p><p>One nice thing about WRP is that it gives back Conditionalization in certain cases. Suppose $c^0$ is my current credence function. Suppose I know that between now and tomorrow I'll learn exactly one member of the partition $E_1, \ldots, E_m$ with certainty --- this is the situation that Greaves and Wallace envisage. And suppose I endorse credence function $c^1$ as a response to learning $E_1$, $c^2$ as a response to learning $E_2$, and so on. Then, if I satisfy WRP, and if $c^k(E_k) = 1$ for each $k$ -- since I learn $E_k$ with certainty -- it follows that, whenever $c^0(E_k) > 0$, $c^k(X) = c^0(X | E_k)$, which is exactly what Conditionalization asks of you. And notice that, at no point did we assume that if I learn $E_k$, then $E_k$ is true. So we've answered Rescorla's challenge if we can establish WRP.</p><p>To do that, we need Theorem 1 below. And to get there, we need to go via Lemmas 1 and 2. Just to remind ourselves of the framework:</p><ul style="text-align: left;"><li>$w_1, \ldots, w_n$ are the possible worlds;</li><li>credence functions are defined on the full algebra built on top of these possible worlds;</li><li>given a credence function $c$, we write $c_i$ for the credence that $c$ assigns to $w_i$. 
<b> <br /></b></li></ul><p><b>Lemma 1</b> If $c^0$ is not in the convex hull of $c^1, \ldots, c^m$, then $(c^0, c^1, \ldots, c^m)$ is not in the convex hull of $\mathcal{X}$, where$$\mathcal{X} := \{(w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m) : 1 \leq i \leq n\ \&\ 1 \leq k \leq m\}$$</p><p><b>Definition 1</b> Suppose $\mathfrak{I}$ is a continuous strictly proper inaccuracy measure. Then let$$\mathfrak{D}_\mathfrak{I}((p^0, p^1, \ldots, p^m), (c^0, c^1, \ldots, c^m)) = \sum^m_{k=0} \left ( \sum^n_{i=1} p^k_i \mathfrak{I}(c^k, i) - \sum^n_{i=1} p^k_i \mathfrak{I}(p^k, i) \right )$$<br /></p><p><b>Lemma 2</b> Suppose $\mathfrak{I}$ is a continuous strictly proper inaccuracy measure. Suppose $\mathcal{X}$ is a closed convex set of $(m+1)$-tuples of probabilistic credence functions. And suppose $(c^0, c^1, \ldots, c^m)$ is not in $\mathcal{X}$. Then there is $(q^0, q^1, \ldots, q^m)$ in $\mathcal{X}$ such that </p><p>(i) for all $(p^0, p^1, \ldots, p^m) \neq (q^0, q^1, \ldots, q^m)$ in $\mathcal{X}$,</p><p>$\mathfrak{D}_\mathfrak{I}((q^0, q^1, \ldots, q^m), (c^0, c^1, \ldots, c^m)) <$</p><p>$\mathfrak{D}_\mathfrak{I}((p^0, p^1, \ldots, p^m), (c^0, c^1, \ldots, c^m))$;</p><p>(ii) for all $(p^0, p^1, \ldots, p^m)$ in $\mathcal{X}$,</p><p>$\mathfrak{D}_\mathfrak{I}((p^0, p^1, \ldots, p^m), (c^0, c^1, \ldots, c^m)) \geq$</p><p>$\mathfrak{D}_\mathfrak{I}((p^0, p^1, \ldots, p^m), (q^0, q^1, \ldots, q^m)) +$</p><p>$\mathfrak{D}_\mathfrak{I}((q^0, q^1, \ldots, q^m), (c^0, c^1, \ldots, c^m))$<b>.<br /></b></p><p><b>Theorem 1</b> Suppose each of $c^0, c^1, \ldots, c^m$ is a probabilistic credence function. If $c^0$ is not in the convex hull of $c^1, \ldots, c^m$, then there are probabilistic credence functions $q^0, q^1, \ldots, q^m$ such that for all worlds $w_i$ and $1 \leq k \leq m$,$$\mathfrak{I}(q^0, i) + \mathfrak{I}(q^k, i) < \mathfrak{I}(c^0, i) + \mathfrak{I}(c^k, i)$$ </p><p>Let's keep the proofs on ice for a moment. What does this show exactly? 
It says that, if you don't do as WRP demands, there is an alternative current credence function and, for each future credence function you endorse, an alternative to it, such that: having your current credence function now and one of your endorsed future credence functions later is guaranteed to make you less accurate overall than having the alternative current credence function now and the corresponding alternative future credence function later. This, I claim, establishes WRP.</p><p>Now for the proofs.<br /></p><p><i>Proof of Lemma 1</i>. We prove the contrapositive. Suppose $(c^0, c^1, \ldots, c^m)$ is in the convex hull of $\mathcal{X}$. Then there are $0 \leq \lambda_{i, k} \leq 1$ such that $\sum^n_{i=1}\sum^m_{k=1} \lambda_{i, k} = 1$ and$$(c^0, c^1, \ldots, c^m) = \sum^n_{i=1} \sum^m_{k=1} \lambda_{i, k} (w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m)$$Thus,$$c^0 = \sum^n_{i=1}\sum^m_{k=1} \lambda_{i,k} w^i$$<br />and$$c^k = \sum^n_{i=1} \lambda_{i, k} w^i + \sum^n_{i=1} \sum_{l \neq k} \lambda_{i, l} c^k$$So$$(\sum^n_{i=1} \lambda_{i, k}) c^k = \sum^n_{i=1} \lambda_{i, k} w^i$$So let $\lambda_k = \sum^n_{i=1} \lambda_{i, k}$. Then, for $1 \leq k \leq m$,$$\lambda_k c^k = \sum^n_{i=1} \lambda_{i, k} w^i$$And thus$$\sum^m_{k=1} \lambda_k c^k = \sum^m_{k=1} \sum^n_{i=1} \lambda_{i, k} w^i = c^0$$as required. $\Box$ <br /></p><p><i>Proof of Lemma 2</i>. This proceeds exactly like the corresponding theorem from the previous blogpost. $\Box$</p><p><i>Proof of Theorem 1</i>. 
By Lemmas 1 and 2, if $c^0$ is not in the convex hull of $c^1, \ldots, c^m$, there is $(q^0, q^1, \ldots, q^m)$ in $\mathcal{X}$ such that, for all $(p^0, p^1, \ldots, p^m)$ in $\mathcal{X}$,$$\mathfrak{D}((p^0, p^1, \ldots, p^m), (q^0, q^1, \ldots, q^m)) < \mathfrak{D}((p^0, p^1, \ldots, p^m), (c^0, c^1, \ldots, c^m))$$(this follows from Lemma 2(ii), since $\mathfrak{D}((q^0, q^1, \ldots, q^m), (c^0, c^1, \ldots, c^m)) > 0$). In particular, for any world $w_i$ and $1 \leq k \leq m$,</p><p>$\mathfrak{D}((w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m), (q^0, q^1, \ldots, q^m)) <$</p><p>$\mathfrak{D}((w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m), (c^0, c^1, \ldots, c^m))$</p><p>But$$\begin{eqnarray*}<br />& & \mathfrak{I}(q^0, i) + \mathfrak{I}(q^k, i) \\<br />& = & \mathfrak{D}(w^i, q^0) + \mathfrak{D}(w^i, q^k) \\<br />& \leq & \mathfrak{D}((w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m), (q^0, q^1, \ldots, q^m)) \\<br />& < & \mathfrak{D}((w^i, c^1, \ldots, c^{k-1}, w^i, c^{k+1}, \ldots, c^m), (c^0, c^1, \ldots, c^m)) \\<br />& = & \mathfrak{D}(w^i, c^0) + \mathfrak{D}(w^i, c^k) \\<br />& = & \mathfrak{I}(c^0, i) + \mathfrak{I}(c^k, i) <br />\end{eqnarray*}$$as required. $\Box$<br /></p>Richard Pettigrew<br /><br /><b>The Accuracy Dominance Argument for Probabilism without the Additivity assumption</b> (2020-08-07)<div>For a PDF of this post, see <a href="https://drive.google.com/file/d/1EidttBl-pYE8rjqbE3iL3tsXBvxCF639/view?usp=sharing" target="_blank">here</a>.</div><div><br /></div><div>One of the central arguments in accuracy-first epistemology -- the one that gets the project off the ground, I think -- is the accuracy-dominance argument for Probabilism. 
This started life in a more pragmatic guise in <a href="https://onlinelibrary.wiley.com/doi/book/10.1002/9781119286387" target="_blank">de Finetti's proof</a> that, if your credences are not probabilistic, there are alternatives that would lose less than yours would if they were penalised using the Brier score, which levies a price of $(1-x)^2$ on every credence $x$ in a truth and $x^2$ on every credence $x$ in a falsehood. This was then adapted to an accuracy-based argument by <a href="https://philpapers.org/rec/ROSFAA-3">Roger Rosencrantz</a>, where he interpreted the Brier score as a measure of inaccuracy, not a penalty score. Interpreted thus, de Finetti's result says that any non-probabilistic credences are accuracy-dominated by some probabilistic credences. <a href="https://philpapers.org/rec/JOYANV">Jim Joyce</a> then noted that this argument only establishes Probabilism if you have a further argument that inaccuracy should be measured by the Brier score. He thought there was no particular reason to think that's right, so he greatly generalized de Finetti's result to show that, relative to a much wider range of inaccuracy measures, all non-probabilistic credences are accuracy dominated. One problem with this, which <a href="https://philpapers.org/rec/HJEAFA-2" target="_blank">Al Hájek</a> pointed out, was that he didn't give a converse argument -- that is, he didn't show that, for each of his inaccuracy measures, each probabilistic credence function is not accuracy dominated. <a href="https://philpapers.org/rec/PREPCA" target="_blank">Joel Predd and his Princeton collaborators</a> then addressed this concern and proved a very general result, namely, that for any additive, continuous, and strictly proper inaccuracy measure, any non-probabilistic credences are accuracy-dominated, while no probabilistic credences are.</div><div><br /></div><div>That brings us to this blogpost. Additivity is a controversial claim. 
It says that the inaccuracy of a credence function is the (possibly weighted) sum of the inaccuracies of the credences it assigns. So the question arises: can we do without additivity? In this post, I'll give a quick proof of the accuracy-dominance argument that doesn't assume anything about the inaccuracy measures other than that they are continuous and strictly proper. Anyone familiar with the Predd et al. paper will see that the proof strategy draws very heavily on theirs. But it bypasses the construction of the Bregman divergence that corresponds to the strictly proper inaccuracy measure. For that, you'll have to wait for Jason Konek's forthcoming work...</div><div><br />Suppose:<br /><ul style="text-align: left;"><li>$\mathcal{F}$ is a set of propositions;</li><li>$\mathcal{W} = \{w_1, \ldots, w_n\}$ is the set of possible worlds relative to $\mathcal{F}$;</li><li>$\mathcal{C}$ is the set of credence functions on $\mathcal{F}$;</li><li>$\mathcal{P}$ is the set of probability functions on $\mathcal{F}$. So, by de Finetti's theorem, $\mathcal{P} = \{v_w : w \in \mathcal{W}\}^+$. If $p$ is in $\mathcal{P}$, we write $p_i$ for $p(w_i)$.</li></ul><b>Theorem</b> Suppose $\mathfrak{I}$ is a continuous strictly proper inaccuracy measure on the credence functions on $\mathcal{F}$. Then if $c$ is not in $\mathcal{P}$, there is $c^\star$ in $\mathcal{P}$ such that, for all $w_i$ in $\mathcal{W}$, <br />$$<br />\mathfrak{I}(c^\star, w_i) < \mathfrak{I}(c, w_i)<br />$$<br /><br /><i>Proof</i>. We begin by defining a divergence $\mathfrak{D} : \mathcal{P} \times \mathcal{C} \rightarrow [0, \infty]$ that takes a probability function $p$ and a credence function $c$ and measures the divergence from the former to the latter:<br />$$<br />\mathfrak{D}(p, c) = \sum_i p_i \mathfrak{I}(c, w_i) - \sum_i p_i \mathfrak{I}(p, w_i)<br />$$<br />Three quick points about $\mathfrak{D}$.<br /><br />(1) $\mathfrak{D}$ is a divergence. 
Since $\mathfrak{I}$ is strictly proper, $\mathfrak{D}(p, c) \geq 0$ with equality iff $c = p$.<br /><br />(2) $\mathfrak{D}(v_{w_i}, c) = \mathfrak{I}(c, w_i)$, for all $w_i$ in $\mathcal{W}$. (Strictly speaking, $\mathfrak{D}(v_{w_i}, c) = \mathfrak{I}(c, w_i) - \mathfrak{I}(v_{w_i}, w_i)$; we assume throughout that $\mathfrak{I}$ is normalized so that perfectly accurate credences have inaccuracy $0$, i.e. $\mathfrak{I}(v_{w_i}, w_i) = 0$.)<br /><br />(3) $\mathfrak{D}$ is strictly convex in its first argument. Suppose $p$ and $q$ are distinct elements of $\mathcal{P}$, and suppose $0 < \lambda < 1$. Then let $r = \lambda p + (1-\lambda) q$, so that $r \neq p$ and $r \neq q$. Then, since $\sum_i p_i\mathfrak{I}(c, w_i)$ is uniquely minimized, as a function of $c$, at $c = p$, and $\sum_i q_i\mathfrak{I}(c, w_i)$ is uniquely minimized, as a function of $c$, at $c = q$, we have$$\begin{eqnarray*}<br />\sum_i p_i \mathfrak{I}(p, w_i) & < & \sum_i p_i \mathfrak{I}(r, w_i) \\<br />\sum_i q_i \mathfrak{I}(q, w_i) & < & \sum_i q_i \mathfrak{I}(r, w_i)<br />\end{eqnarray*}$$Thus<br /><br /> $\lambda [-\sum_i p_i \mathfrak{I}(p, w_i)] + (1-\lambda) [-\sum_i q_i \mathfrak{I}(q, w_i)] >$<br /><br />$ \lambda [-\sum_i p_i \mathfrak{I}(r, w_i)] + (1-\lambda) [-\sum_i q_i \mathfrak{I}(r, w_i)] = $<br /><br />$-\sum_i r_i \mathfrak{I}(r, w_i)$<br /><br />Now, adding</div><div><br /></div><div>$\lambda \sum_i p_i \mathfrak{I}(c, w_i) + (1-\lambda)\sum_i q_i\mathfrak{I}(c, w_i) =$</div><div><br /></div><div>$\sum_i (\lambda p_i + (1-\lambda)q_i) \mathfrak{I}(c, w_i) = \sum_i r_i \mathfrak{I}(c, w_i)$<br /></div><div><br /></div><div>to both sides gives<br /><br />$\lambda [\sum_i p_i \mathfrak{I}(c, w_i)-\sum_i p_i \mathfrak{I}(p, w_i)]+ $<br /><br />$(1-\lambda) [\sum_i q_i\mathfrak{I}(c, w_i)-\sum_i q_i \mathfrak{I}(q, w_i)] > $<br /><br /> $\sum_i r_i \mathfrak{I}(c, w_i)-\sum_i r_i \mathfrak{I}(r, w_i)$<br /><br />That is,$$\lambda \mathfrak{D}(p, c) + (1-\lambda) \mathfrak{D}(q, c) > \mathfrak{D}(\lambda p + (1-\lambda)q, c)$$as required.<br /><br />Now, suppose $c$ is not in $\mathcal{P}$. Then, since $\mathcal{P}$ is a closed convex set and $\mathfrak{D}(x, c)$ is continuous and strictly convex in $x$, there is a unique $c^\star$ in $\mathcal{P}$ that minimizes $\mathfrak{D}(x, c)$ as a function of $x$. Now, suppose $p$ is in $\mathcal{P}$. 
We wish to show that$$\mathfrak{D}(p, c) \geq \mathfrak{D}(p, c^\star) + \mathfrak{D}(c^\star, c)$$We can see that this holds iff$$\sum_i (p_i - c^\star_i) (\mathfrak{I}(c, w_i) - \mathfrak{I}(c^\star, w_i)) \geq 0$$After all,<br />$$\begin{eqnarray*}<br />& & \mathfrak{D}(p, c) - \mathfrak{D}(p, c^\star) - \mathfrak{D}(c^\star, c) \\<br />& = & [\sum_i p_i \mathfrak{I}(c, w_i) - \sum_i p_i \mathfrak{I}(p, w_i)] - \\<br />&& [\sum_i p_i \mathfrak{I}(c^\star, w_i) - \sum_i p_i \mathfrak{I}(p, w_i)] - \\<br />&& [\sum_i c^\star_i \mathfrak{I}(c, w_i) - \sum_i c^\star_i \mathfrak{I}(c^\star, w_i)] \\<br />& = & \sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(c^\star, w_i))<br />\end{eqnarray*}$$<br />Now we prove this inequality. If $p = c^\star$, it holds trivially, so suppose $p \neq c^\star$. We begin by observing that, since $p$, $c^\star$ are in $\mathcal{P}$, since $\mathcal{P}$ is convex, and since $\mathfrak{D}(x, c)$ is minimized uniquely at $x = c^\star$, if $0 < \varepsilon < 1$, then$$\frac{1}{\varepsilon}[\mathfrak{D}(\varepsilon p + (1-\varepsilon) c^\star, c) - \mathfrak{D}(c^\star, c)] > 0$$Expanding that, we get<br /><br />$\frac{1}{\varepsilon}[\sum_i (\varepsilon p_i + (1- \varepsilon) c^\star_i)\mathfrak{I}(c, w_i) -$<br /><br />$\sum_i (\varepsilon p_i + (1-\varepsilon)c^\star_i)\mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i) - $<br /><br />$\sum_i c^\star_i\mathfrak{I}(c, w_i) + \sum_i c^\star_i \mathfrak{I}(c^\star, w_i)] > 0$<br /><br /> So<br /><br />$\frac{1}{\varepsilon}[\sum_i ( c^\star_i + \varepsilon(p_i - c^\star_i))\mathfrak{I}(c, w_i) -$<br /><br />$\sum_i ( c^\star_i + \varepsilon(p_i-c^\star_i))\mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i) - $<br /><br />$\sum_i c^\star_i\mathfrak{I}(c, w_i) + \sum_i c^\star_i \mathfrak{I}(c^\star, w_i)] > 0 $<br /><br /> So<br /><br />$\sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(\varepsilon p+ (1-\varepsilon) c^\star, w_i)) +$<br /><br />$ \frac{1}{\varepsilon}[\sum_i c^\star_i 
\mathfrak{I}(c^\star, w_i) - \sum_ic^\star_i \mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i)] > 0$<br /><br />Now, since $\mathfrak{I}$ is strictly proper,<br />$$\frac{1}{\varepsilon}[\sum_i c^\star_i \mathfrak{I}(c^\star, w_i) - \sum_ic^\star_i \mathfrak{I}(\varepsilon p + (1-\varepsilon) c^\star, w_i)] < 0$$<br />So, for all $0 < \varepsilon < 1$,$$\sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(\varepsilon p+ (1-\varepsilon) c^\star, w_i)) > 0$$<br />So, since $\mathfrak{I}$ is continuous,$$\sum_i (p_i - c^\star_i)(\mathfrak{I}(c, w_i) - \mathfrak{I}(c^\star, w_i)) \geq 0$$which is what we wanted to show. So, by the above,$$\mathfrak{D}(p,c) \geq \mathfrak{D}(p, c^\star) + \mathfrak{D}(c^\star, c) $$In particular, since each $v_{w_i}$ is in $\mathcal{P}$,$$\mathfrak{D}(v_{w_i}, c) \geq \mathfrak{D}(v_{w_i}, c^\star) + \mathfrak{D}(c^\star, c)$$But, since $c^\star$ is in $\mathcal{P}$ and $c$ is not, and since $\mathfrak{D}$ is a divergence, $\mathfrak{D}(c^\star, c) > 0$. So$$\mathfrak{I}(c, w_i) = \mathfrak{D}(v_{w_i}, c) > \mathfrak{D}(v_{w_i}, c^\star) = \mathfrak{I}(c^\star, w_i)$$as required. $\Box$<br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div>Richard Pettigrewhttp://www.blogger.com/profile/07828399117450825734noreply@blogger.com9tag:blogger.com,1999:blog-4987609114415205593.post-25380882643116155392020-07-26T16:16:00.005+01:002020-07-30T09:24:13.662+01:00Decomposing Bregman divergencesFor a PDF of this post, see <a href="https://drive.google.com/file/d/1Ek85k8ri-B0EB7_g-CDcC25QgaPO00Sx/view?usp=sharing" target="_blank">here</a>. <br /><br /><div>Here are a couple of neat little results about Bregman divergences that I just happened upon. 
They might help to prove some more decomposition theorems along the lines of <a href="https://www.jstor.org/stable/2987588" target="_blank">this classic result</a> by Morris DeGroot and Stephen Fienberg and, more recently, <a href="https://link.springer.com/chapter/10.1007/978-3-319-23528-8_5" target="_blank">this paper</a> by my colleagues in the computer science department at Bristol. I should say that a lot is known about Bregman divergences because of their role in information geometry, so these results are almost certainly known already, but I don't know where.</div><div><br /></div><div><h2 style="text-align: left;">Refresher on Bregman divergences</h2></div><br />First up, what's a divergence? It's essentially a generalization of the notion of a measure of distance from one point to another. The points live in some closed convex subset $\mathcal{X} \subseteq \mathbf{R}^n$. A divergence is a function $D : \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty]$ such that<br /><ul><li>$D(x, y) \geq 0$, for all $x$, $y$ in $\mathcal{X}$, and</li><li>$D(x, y) = 0$ iff $x = y$.</li></ul>Note: We do not assume that a divergence is <i>symmetric</i>. So the distance from $x$ to $y$ need not be the same as the distance from $y$ to $x$. That is, we do not assume $D(x, y) = D(y, x)$ for all $x$, $y$ in $\mathcal{X}$. Indeed, among the family of divergences that we will consider -- the Bregman divergences -- only one is symmetric -- the squared Euclidean distance. And we do not assume <i>the triangle inequality</i>. That is, we don't assume that the divergence from $x$ to $z$ is at most the sum of the divergence from $x$ to $y$ and the divergence from $y$ to $z$. That is, we do not assume $D(x, z) \leq D(x, y) + D(y, z)$. Indeed, the conditions under which $D(x, z) = D(x, y) + D(y, z)$ for a Bregman divergence $D$ will be our concern here. <br /><br />So, what's a Bregman divergence? 
$D : \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty]$ is a Bregman divergence if there is a strictly convex function $\Phi : \mathcal{X} \rightarrow \mathbb{R}$ that is differentiable on the interior of $\mathcal{X}$ such that$$D(x, y) = \Phi(x) - \Phi(y) - \nabla \Phi(y) (x-y)$$In other words, to find the divergence from $x$ to $y$, you go to $y$, find the tangent to $\Phi$ at $y$. Then hop over to $x$ and subtract the value at $x$ of the tangent you just drew at $y$ from the value at $x$ of $\Phi$. That is, you subtract $\nabla \Phi(y) (x-y) + \Phi(y)$ from $\Phi(x)$. Because $\Phi$ is convex, it is always curving away from the tangent, and so $\nabla \Phi(y) (x-y) + \Phi(y)$, the value at $x$ of the tangent you drew at $y$, is always less than $\Phi(x)$, the value at $x$ of $\Phi$.<br /><br />The two most famous Bregman divergences are:<br /><ul><li><i>Squared Euclidean distance</i>. Let $\Phi(x) = ||x||^2 = \sum_i x_i^2$, in which case$$D(x, y) = ||x-y||^2 = \sum_i (x_i - y_i)^2$$</li><li><i>Generalized Kullback-Leibler divergence</i>. Let $\Phi(x) = \sum_i x_i \log x_i$, in which case$$D(x, y) = \sum_i x_i\log\frac{x_i}{y_i} - x_i + y_i$$</li></ul>Bregman divergences are convex in the first argument. Thus, we can define, for $z$ in $\mathcal{X}$ and for a closed convex subset $C \subseteq \mathcal{X}$, the $D$-projection of $z$ into $C$ to be the point $\pi_{z, C}$ in $C$ such that $D(y, z)$ is minimized, as a function of $y$, at $y = \pi_{z, C}$. Now, we have the following theorem about Bregman divergences, due to Imre Csiszár:<br /><br /><div><b>Theorem (Generalized Pythagorean Theorem)</b> If $C \subseteq \mathcal{X}$ is closed and convex, $z$ is in $\mathcal{X}$, and $x$ is in $C$, then$$D(x, \pi_{z, C}) + D(\pi_{z, C}, z) \leq D(x, z)$$</div><div><br /></div><div><h2 style="text-align: left;">Decomposing Bregman divergences</h2></div><br />This invites the question: when does equality hold? 
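Before the general results, here is a quick numerical sanity check (my own toy sketch, not from the original post; the particular numbers are arbitrary). For the squared Euclidean divergence, projecting $z$ onto a hyperplane $C = \{x : \sum_i \alpha_i x_i = r\}$ and taking any $x$ in $C$, the inequality above holds with equality:

```python
# Toy check: squared Euclidean divergence D(x, y) = sum_i (x_i - y_i)^2,
# with C the hyperplane {x : sum_i a_i * x_i = r}. The D-projection of z
# onto C has the closed form pi = z + ((r - a.z) / ||a||^2) * a.

def D(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def project(z, a, r):
    scale = (r - sum(ai * zi for ai, zi in zip(a, z))) / sum(ai * ai for ai in a)
    return [zi + scale * ai for zi, ai in zip(z, a)]

a, r = [0.5, 0.3, 0.2], 1.0   # the hyperplane sum_i a_i * x_i = r
z = [0.9, 0.1, 0.4]           # the point we project onto the hyperplane
x = [2.6, -1.0, 0.0]          # an arbitrary point lying on the hyperplane
pi = project(z, a, r)

assert abs(sum(ai * xi for ai, xi in zip(a, x)) - r) < 1e-9  # x is in C
# Equality, not just <=, since C is a hyperplane:
assert abs(D(x, z) - (D(x, pi) + D(pi, z))) < 1e-9
```

For a closed convex set that is not a hyperplane (a ball, say), the cross term $(x - \pi)\cdot(\pi - z)$ need not vanish, and the inequality can be strict.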
The following result gives a particular class of cases, and in doing so provides us with a recipe for creating decompositions of Bregman divergences into their component parts. Essentially, it says that the above inequality is an equality if $C$ is a hyperplane in $\mathbb{R}^n$.<br /><br /><b>Theorem 1 </b> Suppose $r$ is in $\mathbb{R}$ and $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\sum_i \alpha_i = 1$. Then let $C := \{(x_1, \ldots, x_n) : \sum_i \alpha_ix_i = r\}$. Then if $z$ is in $\mathcal{X}$ and $x$ is in $C$,$$D_\Phi(x, z) = D_\Phi(x, \pi_{z, C}) + D_\Phi(\pi_{z, C}, z)$$<br /><br /><i>Proof of Theorem 1. </i> We begin by showing:<br /><br /><b>Lemma 1</b> For any $x$, $y$, $z$ in $\mathcal{X}$,$$D_\Phi(x, z) = D_\Phi(x, y) + D_\Phi(y, z) \Leftrightarrow (\nabla \Phi(y) - \nabla \Phi(z))(x-y) = 0$$<br /><br /><i>Proof of Lemma 1</i>. $$D_\Phi(x, z) = D_\Phi(x, y) + D_\Phi(y, z)$$iff<br /><br />$\Phi(x) - \Phi(z) - \nabla \Phi(z)(x-z)$<br /><br />$= \Phi(x) - \Phi(y) - \nabla \Phi(y)(x-y) + \Phi(y) - \Phi(z) - \nabla \Phi(z)(y-z)$<br /><br />iff$$(\nabla \Phi(y) - \nabla \Phi(z))(x-y) = 0$$as required.<br /><br /><i>Return to Proof of Theorem 1</i>. Now we show that if $x$ is in $C$, then$$(\nabla \Phi(\pi_{z, C}) - \nabla \Phi(z))(x-\pi_{z, C}) = 0$$We know that $D(y, z)$ is minimized on $C$, as a function of $y$, at $y = \pi_{z, C}$. Thus, let $y = \pi_{z, C}$. And let $h(x) := \sum_i \alpha_ix_i - r$. Then $\frac{\partial}{\partial x_i} h(x) = \alpha_i$. So, by the KKT conditions, there is $\lambda$ such that,$$\nabla \Phi(y) - \nabla \Phi(z) + (\lambda \alpha_1, \ldots, \lambda \alpha_n) = (0, \ldots, 0)$$Thus,$$\frac{\partial}{\partial y_i} \Phi(y) - \frac{\partial}{\partial z_i} \Phi(z) = -\lambda \alpha_i$$for all $i = 1, \ldots, n$. 
<br /><br />Thus, finally, <br />\begin{eqnarray*}<br />& &(\nabla \Phi(y) - \nabla \Phi(z))(x-y) \\<br />& = & \sum_i \left (\frac{\partial}{\partial y_i} \Phi(y) - \frac{\partial}{\partial z_i} \Phi(z)\right )(x_i-y_i) \\<br />& = & \sum_i (-\lambda \alpha_i) (x_i - y_i) \\<br />& = & -\lambda \left (\sum_i \alpha_i x_i - \sum_i \alpha_i y_i\right ) \\<br />& = & -\lambda (r-r) \\<br />& = & 0<br />\end{eqnarray*}<br />as required. $\Box$<br /><br /><b>Theorem 2 </b> Suppose $1 \leq k \leq n$. Let $C := \{(x_1, \ldots, x_n) : x_1 = x_2 = \ldots = x_k\}$. Then if $z$ is in $\mathcal{X}$ and $x$ is in $C$,$$D_\Phi(x, z) = D_\Phi(x, \pi_{z, C}) + D_\Phi(\pi_{z, C}, z)$$<br /><br /><i>Proof of Theorem 2</i>. We know that $D(y, z)$ is minimized on $C$, as a function of $y$, at $y = \pi_{z, C}$. Thus, let $y = \pi_{z, C}$. And let $h_i(x) := x_{i+1} - x_i$, for $i = 1, \ldots, k-1$. Then$$\frac{\partial}{\partial x_j} h_i(x) = \left \{ \begin{array}{ll} 1 & \mbox{if } i+1 = j \\ -1 & \mbox{if } i = j \\ 0 & \mbox{otherwise}\end{array} \right.$$ So, by the KKT conditions, there are $\lambda_1, \ldots, \lambda_{k-1}$ such that,<br /><br />$\nabla \Phi(y) - \nabla \Phi(z)$<br /><br />$+ (\lambda_1, -\lambda_1, 0, \ldots, 0) + (0, \lambda_2, -\lambda_2, 0, \ldots, 0) + \ldots$<br /><br />$+ (0, \ldots, 0, \lambda_{k-1}, -\lambda_{k-1}, 0, \ldots, 0) = (0, \ldots, 0)$<br /><br />Thus,$$\begin{eqnarray*}\frac{\partial}{\partial y_1} \Phi(y) - \frac{\partial}{\partial z_1} \Phi(z) & = & - \lambda_1 \\ \frac{\partial}{\partial y_2} \Phi(y) - \frac{\partial}{\partial z_2} \Phi(z) & = & \lambda_1 - \lambda_2 \\ \vdots & \vdots & \vdots \\ \frac{\partial}{\partial y_{k-1}} \Phi(y) - \frac{\partial}{\partial z_{k-1}} \Phi(z) & = & \lambda_{k-2}- \lambda_{k-1} \\ \frac{\partial}{\partial y_k} \Phi(y) - \frac{\partial}{\partial z_k} \Phi(z) & = & \lambda_{k-1} \\ \frac{\partial}{\partial y_{k+1}} \Phi(y) - \frac{\partial}{\partial z_{k+1}} \Phi(z) & = & 0 \\ \vdots & \vdots & \vdots \\ 
\frac{\partial}{\partial y_n} \Phi(y) - \frac{\partial}{\partial z_n} \Phi(z) & = & 0 \end{eqnarray*}$$<br /><br />Thus, finally, <br />\begin{eqnarray*}<br />& &(\nabla \Phi(y) - \nabla \Phi(z))(x-y) \\<br />& = & \sum_i \left (\frac{\partial}{\partial y_i} \Phi(y) - \frac{\partial}{\partial z_i} \Phi(z)\right )(x_i-y_i) \\<br />& = & -\lambda_1(x_1-y_1) + (\lambda_1 - \lambda_2)(x_2-y_2) + \ldots \\<br />&& + (\lambda_{k-2} - \lambda_{k-1})(x_{k-1}-y_{k-1}) + \lambda_{k-1}(x_k-y_k) \\<br />&& + 0(x_{k+1} - y_{k+1}) + \ldots + 0 (x_n - y_n) \\<br />& = & \sum^{k-1}_{i=1} \lambda_i (x_{i+1} - x_i) + \sum^{k-1}_{i=1} \lambda_i (y_i - y_{i+1})\\<br />& = & 0<br />\end{eqnarray*}<br /><div>as required, since $x$ and $y$ are both in $C$, and so $x_{i+1} = x_i$ and $y_{i+1} = y_i$ for $i = 1, \ldots, k-1$. $\Box$</div><div><br /></div><div><h2 style="text-align: left;">DeGroot and Fienberg's calibration and refinement decomposition</h2></div><br />To obtain these two decomposition results, we needed to assume nothing more than that $D$ is a Bregman divergence. The classic result by DeGroot and Fienberg requires a little more. We can see this by considering a very special case of it. Suppose $(X_1, \ldots, X_n)$ is a sequence of propositions that forms a partition. And suppose $w$ is a possible world. Then we can represent $w$ as the vector $w = (0, \ldots, 0, 1, 0, \ldots, 0)$, which takes value 1 at the proposition that is true in $w$ and 0 everywhere else. Now suppose $c = (c, \ldots, c)$ is an assignment of the same credence to each proposition. 
Then one very particular case of DeGroot and Fienberg's result says that, if $(0, \ldots, 0, 1, 0, \ldots, 0)$ is the world at which $X_i$ is true, then<br /><br />$D((0, \ldots, 0, 1, 0, \ldots, 0), (c, \ldots, c))$<br /><br />$= D((0, \ldots, 0, 1, 0, \ldots, 0), (\frac{1}{n}, \ldots, \frac{1}{n})) + D((\frac{1}{n}, \ldots, \frac{1}{n}), (c, \ldots, c))$<br /><br />Now, we know from Lemma 1 that this is true iff$$(\nabla \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \nabla \Phi(c, \ldots, c))((0, \ldots, 0, 1, 0, \ldots, 0) - (\frac{1}{n}, \ldots, \frac{1}{n})) = 0$$which is true iff<br /><br />$\left ( \frac{\partial}{\partial x_i} \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \frac{\partial}{\partial x_i} \Phi(c, \ldots, c) \right )$<br /><br />$= \frac{1}{n} \sum^n_{j=1} \left ( \frac{\partial}{\partial x_j} \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \frac{\partial}{\partial x_j} \Phi(c, \ldots, c) \right )$<br /><br />and that is true iff<br /><br />$\frac{\partial}{\partial x_i} \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \frac{\partial}{\partial x_i} \Phi(c, \ldots, c)$<br /><br />$= \frac{\partial}{\partial x_j} \Phi(\frac{1}{n}, \ldots, \frac{1}{n}) - \frac{\partial}{\partial x_j} \Phi(c, \ldots, c)$<br /><br />for all $1 \leq i, j \leq n$, which is true iff, for any $x$ and any $1 \leq i, j \leq n$,$$\frac{\partial}{\partial x_i} \Phi(x, \ldots, x) = \frac{\partial}{\partial x_j} \Phi(x, \ldots, x)$$Now, this is true if $\Phi(x_1, \ldots, x_n) = \sum^n_{i=1} \varphi(x_i)$ for some $\varphi$. That is, it is true if $D$ is an additive Bregman divergence. But it is also true for certain non-additive Bregman divergences, such as the one generated from the log-sum-exp function:<br /><br /><b>Definition (log-sum-exp)</b> Suppose $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\sum^n_{i=1} \alpha_i = 1$. 
Then let $$\Phi^A(x_1, \ldots, x_n) = \log(1 + \alpha_1e^{x_1} + \ldots + \alpha_ne^{x_n})$$Then <br />$$D(x, y) = \log (1 + \sum_i \alpha_ie^{x_i}) - \log(1 + \sum_i \alpha_ie^{y_i}) - \sum_k \frac{\alpha_k(x_k - y_k)e^{y_k}}{1 + \sum_i \alpha_ie^{y_i}}$$<br /><br />Now$$\frac{\partial}{\partial x_i} \Phi^A(x_1, \ldots, x_n) = \frac{\alpha_i e^{x_i}}{1 + \alpha_1 e^{x_1} + \ldots + \alpha_ne^{x_n}}$$So, if $\alpha_i = \alpha_j = \alpha$ for all $1 \leq i, j \leq n$, then$$\frac{\partial}{\partial x_i} \Phi^A(x, \ldots, x) = \frac{\alpha e^x}{1 + e^x} = \frac{\partial}{\partial x_j} \Phi^A(x, \ldots, x)$$But if $\alpha_i \neq \alpha_j$ for some $1 \leq i, j \leq n$, then$$\frac{\partial}{\partial x_i} \Phi^A(x, \ldots, x) = \frac{\alpha_ie^x}{1 + e^x} \neq \frac{\alpha_je^x}{1 + e^x} = \frac{\partial}{\partial x_j} \Phi^A(x, \ldots, x)$$<br /><br /><div>And indeed, the result even fails if we have a semi-additive Bregman divergence. That is, one whose generator has the form $\Phi(x) = \sum^n_{i=1} \phi_i(x_i)$, where the $\phi_i$ may differ. For instance, suppose $\phi_1(x) = x^2$ and $\phi_2(x) = x\log x$ and $\Phi(x, y) = \phi_1(x) + \phi_2(y) = x^2 + y\log y$. Then$$\frac{\partial}{\partial x_1} \Phi(x, x) = 2x \neq 1 + \log x = \frac{\partial}{\partial x_2} \Phi(x, x)$$</div><div><br /></div><div><h2 style="text-align: left;">Proving the Generalized Pythagorean Theorem</h2></div><div><br /></div><div>In this section, I really just spell out in more detail the proof that <a href="https://ieeexplore.ieee.org/document/5238758" target="_blank">Predd, et al.</a> give of the Generalized Pythagorean Theorem, which is their Proposition 3. But that proof contains some important general facts that might be helpful for people working with Bregman divergences. I collect these together here into one lemma.</div><div><br /></div><div><b>Lemma 2</b> Suppose $D$ is a Bregman divergence generated from $\Phi$. And suppose $x, y, z \in \mathcal{X}$. 
Then$$\begin{eqnarray*} & & D(x, z) - [D(x, y) + D(y, z)] \\ & = & (\nabla \Phi(y) - \nabla \Phi(z))(x - y) \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [D(y + \varepsilon (x - y), z) - D(y, z)] \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z)] \end{eqnarray*}$$</div><div><br /></div><div>We can then prove the Generalized Pythagorean Theorem easily. After all, suppose $x$ is in a closed convex set $C$, $z$ is in $\mathcal{X}$, and $y$ is the point in $C$ that minimizes $D(y, z)$ as a function of $y$. Then, for all $0 \leq \varepsilon \leq 1$, $\varepsilon x + (1-\varepsilon)y$ is in $C$. And since $y$ minimizes,$$D(\varepsilon x + (1-\varepsilon)y, z) \geq D(y, z)$$So $D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z) \geq 0$. So$$\lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon}[D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z)] \geq 0$$So, by Lemma 2,$$D(x, z) \geq D(x, y) + D(y, z)$$</div><div><br /></div><div><i>Proof of Lemma 2. 
</i>$$\begin{eqnarray*} && \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [D(\varepsilon x + (1-\varepsilon)y, z) - D(y, z)] \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [(\Phi(\varepsilon x + (1-\varepsilon)y) - \Phi(z) - \nabla \Phi(z)(\varepsilon x + (1-\varepsilon)y - z)) - \\ & & (\Phi(y) - \Phi(z) - \nabla\Phi(z)(y-z))]\\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [\Phi(\varepsilon x + (1-\varepsilon)y) - \Phi(y) - \varepsilon\nabla \Phi(z)(x -y)] \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [\Phi(\varepsilon x + (1-\varepsilon)y) - \Phi(y)] - \nabla \Phi(z)(x -y) \\ & = & \lim_{\varepsilon \rightarrow 0} \frac{1}{\varepsilon} [\Phi(y + \varepsilon (x -y)) - \Phi(y)] - \nabla \Phi(z)(x -y) \\ & = & \nabla \Phi(y)(x -y) - \nabla \Phi(z)(x -y) \\ & = & (\nabla \Phi(y) - \nabla \Phi(z))(x -y)\end{eqnarray*}$$<br /></div>Richard Pettigrewhttp://www.blogger.com/profile/07828399117450825734noreply@blogger.com5tag:blogger.com,1999:blog-4987609114415205593.post-46220199378213929972020-07-23T10:28:00.003+01:002020-07-23T10:29:28.555+01:00Epistemic risk and permissive rationality (part I): an overviewI got interested in epistemic risk again, after a hiatus of four or five years, by thinking about the debate in epistemology between permissivists and impermissivists about epistemic rationality. Roughly speaking, according to the impermissivist, every body of evidence you might obtain mandates a unique rational set of attitudes in response --- this is sometimes called the uniqueness thesis. According to the permissivist, there is evidence you might obtain that doesn't mandate a unique rational set of attitudes in response --- there are, instead, multiple rational responses.<br /><br />I want to argue for permissivism. 
And I want to do it by appealing to the sorts of claims about how to set your priors and posteriors that I've been developing over this series of blogposts (<a href="https://m-phi.blogspot.com/2020/07/taking-risks-and-picking-priors.html" target="_blank">here</a> and <a href="https://m-phi.blogspot.com/2020/07/taking-risks-and-picking-posteriors.html" target="_blank">here</a>). In the first of those blogposts, I argued that we should pick priors using a decision rule called the generalized Hurwicz criterion (GHC). That is, we should see our choice of priors as a decision we must make; and we should make that decision using a particular decision rule -- namely, GHC -- where we take the available acts to be the different possible credence functions and the utility of an act at a world to be a measure of its accuracy at that world.<br /><br />Now, GHC is, in fact, not a single decision rule, but a family of rules, each specified by some parameters that I call the Hurwicz weights. These encode different attitudes to risk -- they specify the weight you assign to the best-case scenario, the weight you assign to the second-best, and so on down to the weight you assign to the second-worst scenario, and the weight you assign to the worst. And, what's more, many different attitudes to risk are permissible; and therefore many different Hurwicz weights are permissible; and so many versions of GHC are legitimate decision rules to adopt when picking priors. So different permissible attitudes to risk determine different Hurwicz weights; and different Hurwicz weights mandate different rational priors; and different rational priors mandate different rational posteriors given the same evidence. Epistemic rationality, therefore, is permissive. That's the argument in brief.<br /><br />With this post, I'd like to start a series of posts in which I explore how this view plays out in the permissivism debate. 
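To make the role of the Hurwicz weights vivid, here is a toy sketch (my own illustration, not from the post, and the particular numbers are assumptions): one proposition, two worlds, accuracy measured by the negative Brier score, and the two-world case of GHC, where a single weight assigned to the best-case accuracy determines the weight assigned to the worst-case.

```python
# Toy two-world Hurwicz criterion for picking a prior credence c in a
# single proposition X. Accuracy (negative Brier score) is -(1 - c)^2 at
# the world where X is true and -c^2 at the world where X is false. With
# two worlds, GHC weights best-case accuracy by lam, worst-case by 1 - lam.

def hurwicz_score(c, lam):
    accuracies = [-(1 - c) ** 2, -c ** 2]
    return lam * max(accuracies) + (1 - lam) * min(accuracies)

def best_prior(lam, step=0.001):
    grid = [i * step for i in range(int(1 / step) + 1)]
    return max(grid, key=lambda c: hurwicz_score(c, lam))

# A risk-inclined agent (lam = 0.9) does best by an opinionated prior
# (0.9 or, by symmetry, 0.1); a risk-averse agent (lam = 0.3) does best
# by the uniform prior 0.5 -- different weights, different rational priors.
assert min(abs(best_prior(0.9) - 0.9), abs(best_prior(0.9) - 0.1)) < 0.01
assert abs(best_prior(0.3) - 0.5) < 0.01
```

The grid search is crude but makes the qualitative point: as the weight on the best case grows, the prior that the criterion mandates moves away from uniformity.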
If there are many different rationally permissible responses to the same piece of evidence because there are many different rationally permissible attitudes to risk, then how does that allow us to answer the various objections that have been raised against permissivism?<br /><br />In this post, I want to do four things: first, run through a taxonomy of varieties of permissivism that slightly expands on one due to <a href="https://doi.org/10.1017/epi.2019.19" target="_blank">Elizabeth Jackson</a>; second, explain my motivation for offering this argument for permissivism; third, discuss an earlier risk-based argument for the position due to <a href="https://philpapers.org/rec/KELECB" target="_blank">Thomas Kelly</a>; finally, situate within Jackson's taxonomy the version of permissivism that follows from my own risk-based approach to setting priors and posteriors.<br /><br /><h2>Varieties of permissivism</h2><br />Let's start with the taxonomy of permissivisms. I suspect it's not complete; there are likely other dimensions along which permissivists will differ. But it's quite useful for our purposes.<br /><br />First, there are different versions of permissivism for different sorts of doxastic attitudes we might have in response to evidence. So there are versions for credences, beliefs, imprecise probabilities, comparative confidences, ranking functions, and so on. For instance, on the credal version of permissivism, there is evidence that doesn't determine a unique set of credences that rationality requires us to have in response to that evidence. For any two different sorts of doxastic attitude, you can be permissive with respect to one but not the other: permissive about beliefs but not about credences, for instance, or vice versa. <br /><br />Second, permissivism comes in interpersonal and intrapersonal versions. 
According to interpersonal permissivism, it is possible for different individuals to have the same evidence, but different attitudes in response, and yet both be rational. According to the intrapersonal version, there is evidence a single individual might have, and different sets of attitudes such that whichever they have, they'll still be rational. Most people who hold to intrapersonal permissivism for a certain sort of doxastic attitude also hold to the interpersonal version, while there are many who think intrapersonal permissivism is mistaken but interpersonal permissivism is correct.<br /><br />Third, it comes in wide and narrow versions. This is determined by how many different attitudes are permitted in response to a piece of evidence, and how much variation there is between them. On narrow versions, there are not so many different rational responses and they do not vary too widely; on wide versions, there are many and they vary greatly.<br /><br />Fourth, it comes in common and rare versions. On the former, most evidence is permissive; on the latter, permissive evidence is rare. <br /><br />I'll end up defending two versions of permissivism: (i) a wide common version of interpersonal permissivism about credences; and (ii) a narrow common version of intrapersonal permissivism about credences.<br /><br /><h2>Why argue for permissivism?</h2><br />Well, because it's true, mainly. But there's another motivation for adding to the already crowded marketplace of arguments for the position. Many philosophers defend permissivism for negative reasons. They look at two very different sorts of evidence and give reasons to be pessimistic about the prospects of identifying a unique rational credal response to them. They are: very sparse evidence and very complex evidence. In the first, they say, our evidence constrains us too little. There are too many credal states that respect it. 
If there is a single credal response that rationality mandates to this sparse evidence, there must be some way to whittle down the vast set of states that respect it to leave us with only one. For instance, some philosophers claim that, among this vast set of states, we should pick the one that has lowest informational content, since any other will go beyond what is warranted by the evidence. But it has proven extremely difficult to identify that credal state in many cases, such as von Mises' water-wine example, Bertrand's paradox, and van Fraassen's cube factory. Despairing of finding a way to pick a single credal state from this vast range, many philosophers have become permissivist. In the second sort of case, at the other extreme, where our evidence is very complex rather than very sparse, our evidence points in too many directions at once. In such cases, you might hope to identify a unique way in which to weigh the different sources of evidence and the direction in which they point to give the unique credal state that rationality mandates. And yet again, it has proven difficult to find a principled way of assigning these weights. Despairing, philosophers have become permissivist in these cases too.<br /><br />I'd like to give a positive motivation for permissivism---one that doesn't motivate it by pointing to the difficulty of establishing its negation. My account will be based within accuracy-first epistemology, and it will depend crucially on the notion of epistemic risk. Rationality permits a variety of attitudes to risk in the practical sphere. Faced with the same risky choice, you might be willing to gamble because you are risk-seeking, and I might be unwilling because I am risk-averse, but we are both rational and neither more rational than the other. On my account, rationality also permits different attitudes to risk in the epistemic sphere. And different attitudes to epistemic risk warrant different credal attitudes in response to a body of evidence. 
Therefore, permissivism.<br /><br /><h2>Epistemic risk encoded in epistemic utility</h2><br />It is worth noting that this is not the first time that the notion of epistemic risk has entered the permissivism debate. In an early paper on the topic, <a href="https://philpapers.org/rec/KELECB" target="_blank">Thomas Kelly</a> appeals to <a href="https://www.gutenberg.org/files/26659/26659-h/26659-h.htm" target="_blank">William James'</a> distinction between the two goals that we have when we have beliefs---believing truths and avoiding errors. When we have a belief, it gives us a chance of being right, but it also runs the risk of being wrong. In contrast, when we withhold judgment on a proposition, we run no risk of being wrong, but we give ourselves no chance of being right. Kelly then notes that whether you should believe on the basis of some evidence depends on how strongly you want to believe truths and how strongly you don't want to believe falsehoods. Using an epistemic utility framework introduced independently by <a href="https://doi.org/10.1111/nous.12099" target="_blank">Kenny Easwaran</a> and <a href="https://doi.org/10.1093/mind/fzx028" target="_blank">Kevin Dorst</a>, we can make this precise. Suppose:<br /><ol><li>I assign a positive epistemic utility of $R > 0$ to believing a truth or disbelieving a falsehood;</li><li>I assign a negative epistemic utility (or positive epistemic disutility) of $-W < 0$ to believing a falsehood or disbelieving a truth; and</li><li>I assign a neutral epistemic utility of 0 to withholding judgment.</li></ol>And suppose $W > R$. And suppose further that there is some way to measure, for each proposition, how likely or probable my evidence makes that proposition---that is, we assume there is a unique evidential probability function of the sort that J. M. Keynes, E. T. Jaynes, and Timothy Williamson envisaged. 
Then, if $r$ is how likely my evidence makes the proposition $X$, then:<br /><ol><li>the expected value of believing $X$ is $rR + (1-r)(-W)$,</li><li>the expected value of disbelieving $X$ is $r(-W) + (1-r)R$, and</li><li>the expected value of withholding judgment is $0$.</li></ol>A quick calculation shows that believing uniquely maximises expected utility when $r > \frac{W}{R+W}$, disbelieving uniquely maximises when $r < \frac{R}{R+W}$, and withholding uniquely maximises if $\frac{R}{R+W} < r < \frac{W}{R+W}$. What follows is that the more you disvalue being wrong, the stronger the evidence will have to be in order to make it rational to believe. Now, Kelly assumes that various values of $R$ and $W$ are rationally permissible---it is permissible to disvalue believing falsehoods a lot more than you value believing truths, and it is permissible to disvalue it just a little more. And, if that is the case, different individuals might have the same evidence while rationality requires of them different doxastic attitudes---a belief for one of them, who disvalues being wrong only a little more than they value being right, and no belief for the other, where the difference between their disvalue for false belief and value for true belief is much greater. Kelly identifies the values you pick for $R$ and $W$ with your attitudes to epistemic risk. So different doxastic attitudes are permissible in the face of the same evidence because different attitudes to epistemic risk are permissible.<br /><br />Now, there are a number of things worth noting here before I pass to my own alternative approach to epistemic risk.<br /><br />First, note that Kelly manages to show that epistemic rationality might be permissive even if there is a unique evidential probability measure. 
So even those who think you can solve the problem of what probability is demanded by the very sparse evidence and the very complex evidence we described above should still countenance a form of epistemic permissivism if they agree that there are different permissible values for $R$ and $W$.<br /><br />Second, it might seem at first that Kelly's argument gives interpersonal permissivism at most. After all, for fixed $R$ and $W$, and a unique evidential probability $r$ for $X$ given your evidence, it might seem that there is always a single attitude---belief in $X$, disbelief in $X$, or judgment withheld about $X$---that maximises expected epistemic value. But this isn't always true. After all, if $r = \frac{R}{R + W}$, then it turns out that disbelieving and withholding have the same expected epistemic value, and if $r = \frac{W}{R+W}$, then believing and withholding have the same expected epistemic value. And in those cases, it would be rationally permissible for an individual to pick either.<br /><br />Third, and relatedly, it might seem that Kelly's argument gives only narrow permissivism, since it allows for cases in which believing and withholding are both rational, and it allows for cases in which disbelieving and withholding are both rational, but it doesn't allow for cases in which all three are rational. But that again is a mistake. If you value believing truths exactly as much as you disvalue believing falsehoods, so that $R = W$, and if the objective evidential probability of $X$ given your evidence is $r = \frac{1}{2}$, then believing, disbelieving, and withholding judgment are all permissible. Having said that, there is some reason to say that it is not rationally permissible to set $R = W$. 
After all, if you do, and if $r = \frac{1}{2}$, then it is permissible to both believe $X$ and believe $\overline{X}$ at the same time, and that seems wrong.<br /><br />Fourth, and most importantly for my purposes, Kelly's argument works for beliefs, but not for credences. The problem, briefly stated, is this: suppose $r$ is how likely my evidence makes proposition $X$. And suppose $\mathfrak{s}(1, x)$ is the accuracy of credence $x$ in a truth, while $\mathfrak{s}(0, x)$ is the accuracy of credence $x$ in a falsehood. Then the expected accuracy of credence $x$ in $X$ is<br />\begin{equation}\label{eeu}<br />r\mathfrak{s}(1, x) + (1-r)\mathfrak{s}(0, x)\tag{*}<br />\end{equation}<br />But nearly all advocates of epistemic utility theory for credences agree that rationality requires that $\mathfrak{s}$ is a strictly proper scoring rule. And that means that (*) is maximized, as a function of $x$, at $x = r$. So differences in how you value epistemic utility don't give rise to differences in what credences you should have. Your credences should always match the objective evidential probability of $X$ given your evidence. Epistemic permissivism about credences would therefore be false.<br /><br />I think Kelly's observation, supplemented with Easwaran's precise formulation of epistemic value, furnishes a strong argument for permissivism about beliefs. But I think we can appeal to epistemic risk to give something more, namely, two versions of permissivism about credences: first, a wide common interpersonal version, and second, a narrow common intrapersonal version.<br /><br /><h2>Epistemic risk encoded in decision rules</h2><br />To take the first step towards these versions of permissivism for credences, let's begin with the observation that there are two ways in which risk enters into the rational evaluation of a set of options. 
First, risk might be encoded in the utility function, which measures the value of each option at each possible world; or, second, it might be encoded in the choice rule, which takes in various features of the options, including their utilities at different worlds, and spits out the set of options that are rationally permissible.<br /><br />Before we move to the epistemic case, let's look at how this plays out in the practical case. I am about to flip a fair coin. I make you an offer: pay me £30, and I will pay you £100 if the coin lands heads and nothing if it lands tails. You reject my offer. There are two ways to rationalise your decision. On the first, you choose using expected utility theory, which is a risk-neutral decision rule. However, because the utility you assign to an outcome is a sufficiently concave function of the money you get in that outcome, and your current wealth is sufficiently small, the expected utility of accepting my offer is less than the expected utility of rejecting it. For instance, perhaps your utility for an outcome in which your total wealth is £$n$ is $\log n$. And perhaps your current wealth is £$40$. Then your expected utility for accepting my offer is $\frac{1}{2}\log 110 + \frac{1}{2} \log 10 \approx 3.502$ while your expected utility for rejecting it is $\log 40 \approx 3.689$. So you are rationally required to reject. On this way of understanding your choice, your risk-aversion is encoded in your utility function, while your decision rule is risk-neutral. On the second way of understanding your choice, it is the other way around. 
Instead of expected utility theory, you choose using a risk-sensitive decision rule, such as Wald's <a href="https://en.wikipedia.org/wiki/Wald's_maximin_model" target="_blank">Maximin</a>, the <a href="https://m-phi.blogspot.com/2020/07/hurwiczs-criterion-of-realism-and.html" target="_blank">Hurwicz criterion</a>, the <a href="https://m-phi.blogspot.com/2020/07/a-generalised-hurwicz-criterion.html" target="_blank">generalized Hurwicz criterion</a>, Quiggin's <a href="https://en.wikipedia.org/wiki/Rank-dependent_expected_utility" target="_blank">rank-dependent utility theory</a>, or Buchak's <a href="https://www.oxfordscholarship.com/view/10.1093/acprof:oso/9780199672165.001.0001/acprof-9780199672165" target="_blank">risk-weighted expected utility theory</a>. According to Maximin, for instance, you are required to choose an option whose worst-case outcome is best. The worst case if you accept the offer is the one in which the coin lands tails and I pay you back nothing, in which case you end up £$30$ down, whereas the worst case if you refuse my offer is the status quo in which you end up with exactly as much as you had before. So, providing you prefer more money to less, the worst-case outcome of accepting the offer is worse than the worst-case outcome of refusing it, so Maximin will lead you to refuse the offer. And it will lead you to do that even if, for instance, you value money linearly. Thus, there is no need to reflect your attitude to risk in your utility function at all, because it is encoded in your decision rule.<br /><br />I take the lesson of the <a href="https://www.oxfordreference.com/view/10.1093/oi/authority.20110803095403122" target="_blank">Allais paradox</a> to be that there is rational risk-sensitive behaviour that we cannot capture entirely using the first method here. That is, there are rational preferences that we cannot recover within expected utility theory by making the utility function concave in money, or applying some other tweak. 
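Both rationalisations of the refusal can be reproduced in a few lines. A quick sketch using the figures from the example (£40 wealth, £30 stake, £100 prize):

```python
import math

wealth, stake, prize = 40, 30, 100

# First rationalisation: a risk-neutral rule (expected utility), with the
# risk-aversion encoded in a concave (log) utility function for money.
eu_accept = 0.5 * math.log(wealth - stake + prize) + 0.5 * math.log(wealth - stake)
eu_reject = math.log(wealth)
# eu_accept ≈ 3.502 < eu_reject ≈ 3.689, so expected utility says reject.

# Second rationalisation: linear utility for money, with the risk-aversion
# encoded in the decision rule instead (Maximin: best worst case wins).
worst_accept = min(wealth - stake + prize, wealth - stake)  # tails leaves £10
worst_reject = wealth                                       # status quo: £40
# worst_accept < worst_reject, so Maximin also says reject.
```

Note that Maximin refuses the bet even with a linear utility for money, which is the point of the second rationalisation.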
We must instead permit risk-sensitive choice rules. Now, there are two sorts of such rules: those that require credences among their inputs and those that don't. In the first camp, perhaps the most sophisticated is Lara Buchak's <a href="https://www.oxfordscholarship.com/view/10.1093/acprof:oso/9780199672165.001.0001/acprof-9780199672165" target="_blank">risk-weighted expected utility theory</a>. In the second, we've already met the most famous example, namely, Maximin, which is maximally risk-averse. But there is also Maximax, which is maximally risk-seeking. And there is the Hurwicz criterion, which strikes a balance between the two. And there's my generalization of the Hurwicz criterion, which I'll abbreviate GHC. As I've discussed over the last few blogposts, I favour the latter in the case of picking priors. (For an alternative approach to epistemic risk, see Boris Babic's recent paper <a href="https://www.journals.uchicago.edu/doi/pdfplus/10.1086/703552" target="_blank">here</a>.)<br /><br />To see what happens when you use GHC to pick priors, let's give a quick example in a situation in which there are just three possible states of the world to which you assign credences, $w_1$, $w_2$, $w_3$, and we write $(p_1, p_2, p_3)$ for a credence function $p$ that assigns $p_i$ to world $w_i$. Suppose your Hurwicz weights are these: $\alpha_1$ for the best case, $\alpha_2$ for the second-best (and second-worst) case, and $\alpha_3$ for the worst case. And your accuracy measure is $\mathfrak{I}$. Then we're looking for the credence functions that maximize your Hurwicz score, which is$$H^A_{\mathfrak{I}}(p) = \alpha_1\mathfrak{I}(p, w_{i_1}) + \alpha_2\mathfrak{I}(p, w_{i_2}) + \alpha_3\mathfrak{I}(p, w_{i_3})$$where$$\mathfrak{I}(p, w_{i_1}) \geq \mathfrak{I}(p, w_{i_2}) \geq\mathfrak{I}(p, w_{i_3})$$Now suppose for our example that $\alpha_1 \geq \alpha_2 \geq \alpha_3$. 
Then the credence functions that maximize $H^A_{\mathfrak{I}}$ are$$\begin{array}{ccc} (\alpha_1, \alpha_2, \alpha_3) & (\alpha_1, \alpha_3, \alpha_2) & (\alpha_2, \alpha_1, \alpha_3) \\ (\alpha_2, \alpha_3, \alpha_1) & (\alpha_3, \alpha_1, \alpha_2) & (\alpha_3, \alpha_2, \alpha_1) \end{array}$$<br /><br />With that example in hand, and a little insight into how GHC works when you use it to select priors, let's work through Elizabeth Jackson's taxonomy of permissivism from above.<br /><br />First, since the attitudes we are considering are credences, it's a credal version of permissivism that follows from this risk-based approach in accuracy-first epistemology.<br /><br />Second, we obtain both an interpersonal and an intrapersonal permissivism. A particular person will have risk attitudes represented by specific Hurwicz weights. And yet, even once those are fixed, there will usually be a number of different permissible priors. That is, rationality will permit a number of different credal states in the absence of evidence. For instance, if my Hurwicz weights are $\alpha_1 = 0.5$, $\alpha_2 = 0.3$, $\alpha_3 = 0.2$, then rationality allows me to assign 0.5 to world $w_1$, 0.3 to $w_2$ and 0.2 to $w_3$, but it also permits me to assign $0.3$ to $w_1$, $0.2$ to $w_2$, and $0.5$ to $w_3$.<br /><br />So there is intrapersonal credal permissivism, but it is reasonably narrow---there are only six rationally permissible credence functions for someone with the Hurwicz weights just specified, for instance. On the other hand, the interpersonal permissivism we obtain is very wide. Indeed, it is as wide as the range of permissible attitudes to risk. As we noted in a previous post, for any probabilistic credence function over a space of possible worlds, there are Hurwicz weights that will render those credences permissible. So providing those weights are rationally permissible, so are the credences. 
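That claim can be checked by brute force. The sketch below is my own illustration, using the Brier score, $\mathfrak{B}(p, w_i) = 2p_i - \sum_j p_j^2$, as one convenient additive, continuous, strictly proper accuracy measure, and a coarse grid over the probability functions on three worlds; for the weights $(0.5, 0.3, 0.2)$ it recovers exactly the six permutations:

```python
from itertools import permutations

alphas = (0.5, 0.3, 0.2)  # Hurwicz weights, best case first

def brier(p, i):
    # Brier accuracy of credence function p at world w_i (higher is better).
    return 2 * p[i] - sum(x * x for x in p)

def ghc(p):
    # Generalized Hurwicz score: weight accuracies best-to-worst by alphas.
    accs = sorted((brier(p, i) for i in range(3)), reverse=True)
    return sum(a * acc for a, acc in zip(alphas, accs))

# All probability functions on three worlds with credences in steps of 0.1.
grid = [(i / 10, j / 10, (10 - i - j) / 10)
        for i in range(11) for j in range(11 - i)]

best = max(ghc(p) for p in grid)
winners = {p for p in grid if abs(ghc(p) - best) < 1e-9}
# winners is exactly the set of permutations of (0.5, 0.3, 0.2)
```

The Brier score is just one instance here; as noted in the text, any additive and continuous strictly proper accuracy measure yields the same six permissible priors.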
<br /><br />Finally, is the permissivism we get from this risk-based approach common or rare? So far, we've just considered it in the case of priors. That is, we've only established permissivism in the case in which you have no evidence. But of course, once it's established there, it's also established for many other bodies of evidence, since the rational credences given a body of evidence are those obtained by updating rational priors by conditioning on that evidence. And, providing a body of evidence isn't fully informative, if there are multiple rational priors, they will give rise to multiple rational posteriors when we condition them on that evidence. So the wide interpersonal credal permissivism we obtain is common, and so is the narrow intrapersonal credal permissivism.<br /><br />Richard Pettigrew<br /><br /><h2>Taking risks and picking posteriors (22 July 2020)</h2>For a PDF of this blog, see <a href="https://drive.google.com/file/d/17FgjpCNg-qwj9zPK1D3PoPG2sh4ecGj2/view?usp=sharing" target="_blank">here</a>.<br /><br />When are my credences rational? In Bayesian epistemology, there's a standard approach to this question. We begin by asking what credences would be rational were you to have no evidence at all; then we ask what ways of updating your credences are rational when you receive new evidence; and finally we say that your current credences are rational if they are the result of updating rational priors in a rational way on the basis of your current total evidence. 
This account can be read in one of two ways: on the doxastic reading, you're rational if, <i>in fact</i>, when you had no evidence you had priors that were rational and if, <i>in fact</i>, when you received evidence you updated in a rational way; on the propositional reading, you're rational if <i>there exists</i> some rational prior and <i>there exists</i> some rational way of updating such that applying the updating rule to the prior based on your current evidence issues in your current credences.<br /><br />In <a href="https://m-phi.blogspot.com/2020/07/taking-risks-and-picking-priors.html" target="_blank">this previous post</a>, I asked how we might use accuracy-first epistemology and decision-making rules for situations of massive uncertainty to identify the rational priors. I suggested that we should turn to the early days of decision theory, when there was still significant interest in how we might make a decision in situations in which it is not possible to assign probabilities to the different possible states of the world. In particular, I noted Hurwicz's generalization of Wald's Maximin rule, which is now called the Hurwicz Criterion, and I offered a further generalization of my own, which I then applied to the problem of picking priors. Here's my generalization:<br /><br /><b>Generalized Hurwicz Criterion (GHC)</b> Suppose the set of possible states of the world is $W = \{w_1, \ldots, w_n\}$. Pick $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\alpha_1 + \ldots + \alpha_n = 1$, and denote this sequence of weights $A$. Suppose $a$ is an option defined on $W$, and write $a(w_i)$ for the utility of $a$ at world $w_i$. 
Then if$$a(w_{i_1}) \geq a(w_{i_2}) \geq \ldots \geq a(w_{i_n})$$then let$$H^A(a) = \alpha_1a(w_{i_1}) + \ldots + \alpha_na(w_{i_n})$$Pick an option that maximises $H^A$.<br /><br />Thus, whereas Wald's Maximin puts all of the weight onto the worst case, and Hurwicz's Criterion distributes all the weight between the best and worst cases, the generalized Hurwicz Criterion allows you to distribute the weight between best, second-best, and so on down to second-worst, and worst. I said that you should pick your priors by applying GHC with a measure of accuracy for credences. Then I described the norms for priors that it imposes.<br /><br />In this post, I'm interested in the second component of the Bayesian approach I described above, namely, the rational ways of updating. How does rationality demand we update our prior when new evidence arrives? Again, I'll be asking this within the accuracy-first framework.<br /><br />As in the previous post, we'll consider the simplest possible case. We'll assume there are just three possible states of the world that you entertain and to which you will assign credences. They are $w_1, w_2, w_3$. If $p$ is a credence function, then we'll write $p_i$ for $p(w_i)$, and we'll denote the whole credence function $(p_1, p_2, p_3)$. At the beginning of your epistemic life you have no evidence, and you must pick your prior. We'll assume that you do as I proposed in the previous post and set your prior using GHC with some weights that you have picked, $\alpha_1$ for the best-case scenario, $\alpha_2$ for the second-best, and $\alpha_3$ for the worst. Later on, let's suppose, you learn evidence $E = \{w_1, w_2\}$. How should you update? As we will see, the problem is that there are many seemingly plausible approaches to this question, most of which disagree and most of which give implausible answers.<br /><br />A natural first proposal is to use the same decision rule to select our posterior as we used to select our prior. 
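Before pursuing that proposal, it may help to see GHC itself in code. A sketch of mine in Python, where `utilities` lists an option's utility at each world:

```python
def ghc_score(utilities, alphas):
    """Generalized Hurwicz score: weight the utilities from best case
    down to worst case by the weights alphas (which sum to 1)."""
    ranked = sorted(utilities, reverse=True)  # best outcome first
    return sum(a * u for a, u in zip(alphas, ranked))

def ghc_choose(options, alphas):
    """Return the names of the options that maximise the GHC score."""
    scores = {name: ghc_score(u, alphas) for name, u in options.items()}
    top = max(scores.values())
    return sorted(name for name, s in scores.items() if s >= top - 1e-12)

# Wald's Maximin is the special case alphas = (0, ..., 0, 1); Maximax is
# (1, 0, ..., 0); Hurwicz's original criterion is (a, 0, ..., 0, 1 - a).
```

For instance, with options `{"safe": (1, 1, 1), "risky": (3, 0, 0)}`, Maximin weights $(0, 0, 1)$ select `safe`, while Maximax weights $(1, 0, 0)$ select `risky`.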
To illustrate, let's suppose that our Hurwicz weight for the best-case scenario is $\alpha_1 = 0.5$, for the second-best $\alpha_2 = 0.3$, and for the worst $\alpha_3 = 0.2$. Applying GHC with these weights and any additive and continuous strictly proper (acsp) accuracy measure gives the following as the permissible priors:$$\begin{array}{ccc}(0.5, 0.3, 0.2) & (0.5, 0.2, 0.3) & (0.3, 0.5, 0.2) \\ (0.3, 0.2, 0.5) & (0.2, 0.5, 0.3) &(0.2, 0.3, 0.5)\end{array}$$<br />Let's suppose we pick $(0.5, 0.3, 0.2)$. And now suppose we learn $E = \{w_1, w_2\}$. If we simply apply GHC again, we get the same set of credence functions as permissible posteriors. But none of these even respects the evidence we've obtained -- that is, all of them assign positive credence to world $w_3$, which our evidence has ruled out. So that can't be quite right.<br /><br />Perhaps, then, we should first limit the permissible posteriors to those that respect our evidence -- by assigning credence $0$ to world $w_3$ -- and then find the credence function that maximizes GHC <i>among them</i>. It turns out that the success of this move depends on the measure of accuracy that you use. Suppose, for instance, you use the Brier score $\mathfrak{B}$, whose accuracy for a credence function $p = (p_1, p_2, p_3)$ at world $w_i$ is $$\mathfrak{B}(p, w_i) = 2p_i - (p_1^2 + p_2^2 + p_3^2)$$That is, you find the credence function of the form $q = (q_1, 1-q_1, 0)$ that maximizes $H^A_\mathfrak{B}$. But it turns out that this is $q = (0.6, 0.4, 0)$, which is not the result of conditioning $(0.5, 0.3, 0.2)$ on $E = \{w_1, w_2\}$ -- that would be $q = (0.625, 0.375, 0)$.<br /><br />However, as I explained in <a href="https://m-phi.blogspot.com/2020/07/updating-by-minimizing-expected.html" target="_blank">another previous post</a>, there is a unique additive and continuous strictly proper accuracy measure that will give conditionalization in this way. 
I called it the enhanced log score $\mathfrak{L}^\star$, and it is also found in Juergen Landes' paper <a href="https://doi.org/10.1016/j.ijar.2015.05.007" target="_blank">here</a> (Proposition 9.1) and Schervish, Seidenfeld, and Kadane's paper <a href="https://doi.org/10.1287/deca.1090.0153" target="_blank">here</a> (Example 6). Its accuracy for a credence function $p = (p_1, p_2, p_3)$ at world $w_i$ is $$\mathfrak{L}^\star(p, w_i) = \log p_i - (p_1 + p_2 + p_3)$$If we apply GHC with that accuracy measure and with the restriction to credence functions that satisfy the evidence, we get $(0.625, 0.375, 0)$ or $(0.375, 0.625, 0)$, as required. So while GHC doesn't mandate conditioning on your evidence, it does at least permit it. However, while this goes smoothly if we pick $(0.5, 0.3, 0.2)$ as our prior, it does not work so well if we pick $(0.2, 0.3, 0.5)$, which, if you recall, is also permitted by the Hurwicz weights we are using. After all, the two permissible posteriors remain the same, but neither is the result of conditioning that prior on $E$. This proposal, then, is a non-starter.<br /><br />There is, in any case, something strange about the approach just mooted. After all, GHC assigns a weight to the accuracy of a candidate posterior in each of the three worlds, even though in world $w_3$ you wouldn't receive evidence $E$ and would thus not adopt this posterior. Let's suppose that you'd receive evidence $\overline{E} = \{w_3\}$ instead at world $w_3$; and let's suppose you'd adopt the only credence function that respects this evidence, namely, $(0, 0, 1)$. If that's the case, we might try applying GHC not to potential posteriors but to potential rules for picking posteriors. I'll call these <i>posterior rules</i>. In the past, I've called them updating rules, but this is a bit misleading. An updating rule would take as inputs both prior and evidence and give the result of updating the former on the latter. 
But these rules really just take evidence as an input and say which posterior you'll adopt if you receive it. Thus, for our situation, in which you might learn either $E$ or $\overline{E}$, the posterior rule would have the following form:$$p' = \left \{ \begin{array}{rcl}E & \mapsto & p'_E \\ \overline{E} & \mapsto & p'_{\overline{E}}\end{array}\right.$$for some suitable specification of $p'_E$ and $p'_{\overline{E}}$. Then the accuracy of a rule $p'$ at a world is just the accuracy of the output of that rule at that world. Thus, in this case:$$\begin{array}{rcl}\mathfrak{I}(p', w_1) & = & \mathfrak{I}(p'_E, w_1) \\ \mathfrak{I}(p', w_2) & = & \mathfrak{I}(p'_E, w_2) \\\mathfrak{I}(p', w_3) & = & \mathfrak{I}(p'_{\overline{E}}, w_3)\end{array}$$The problem is that this move doesn't help. Part of the reason is that whatever was the best-case scenario for the prior, the best case for the posterior is sure to be world $w_3$, since $p'_{\overline{E}} = (0, 0, 1)$ is perfectly accurate at that world. Thus, suppose you pick $(0.5, 0.3, 0.2)$ as your prior. It turns out that the rules that maximize $H^A_{\mathfrak{L}^\star}$ will give $p'_E = (0.4, 0.6, 0)$ or $p'_E = (0.6, 0.4, 0)$, whereas conditioning your prior on $E$ gives $p'_E = (0.625, 0.375, 0)$ or $p'_E = (0.375, 0.625, 0)$.<br /><br />Throughout our discussion so far, we have dismissed various possible approaches because they are not consistent with conditionalization. But why should that be a restriction? Perhaps the approach we are taking will tell us that the Bayesian fixation with conditionalization is misguided. Perhaps. But there are strong arguments for conditionalization within accuracy-first epistemology, so we'd have to see why they go wrong before we start rewriting Bayesian textbooks. I'll consider three such arguments here. 
The first isn't as strong as it seems; the second isn't obviously available to someone who used GHC to pick priors; the third is promising but it leads us initially down a tempting road into an inhospitable morass.<br /><br />The first is closely related to a proposal I explored <a href="https://m-phi.blogspot.com/2020/07/updating-by-minimizing-expected.html" target="_blank">in a previous blogpost</a>. So I'll briefly outline the approach here and refer to the issues raised in that post. The idea is this: Your prior is $(p_1, p_2, p_3)$. You learn $E$. You must now adopt a posterior that respects your new evidence, namely, $(q_1, 1-q_1, 0)$. You should choose the posterior of that form that maximises expected accuracy from the point of view of your prior, that is, you're looking for $(x, 1-x, 0)$ that maximizes$$p_1 \mathfrak{I}((x, 1-x, 0), w_1) + p_2 \mathfrak{I}((x, 1-x, 0), w_2) + p_3 \mathfrak{I}((x, 1-x, 0), w_3)$$This approach is taken in a number of places: at the very least, <a href="https://statweb.stanford.edu/~cgates/PERSI/papers/zabell82.pdf" target="_blank">here</a>, <a href="https://drive.google.com/file/d/1YwFOV5abKhsjftoyIEtWI-owO4HjJSEm/view?usp=sharing" target="_blank">here</a>, and <a href="https://philpapers.org/rec/WROBUM" target="_blank">here</a>. Now, it turns out that there is only one additive and continuous strictly proper accuracy measure that is guaranteed always to give conditionalization on this approach. That is, there is only one measure such that, for any prior, the posterior it expects to be best among those that respect the evidence is the one that results from conditioning the prior on the evidence. Indeed, that accuracy measure is one we've already met above, namely, the enhanced log score $\mathfrak{L}^\star$ (see <a href="https://m-phi.blogspot.com/2020/07/updating-by-minimizing-expected.html" target="_blank">here</a>). 
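The uniqueness claim can be seen at work in the running example. The following numerical sketch (mine, not the general proof) compares the Brier score with the enhanced log score for the prior $(0.5, 0.3, 0.2)$ and evidence $E = \{w_1, w_2\}$:

```python
import math

p = (0.5, 0.3, 0.2)  # the prior; the evidence is E = {w1, w2}

def exp_brier(x):
    # Prior-expected Brier accuracy of the posterior (x, 1-x, 0).
    q = (x, 1 - x, 0.0)
    s = sum(c * c for c in q)
    return sum(pi * (2 * q[i] - s) for i, pi in enumerate(p))

def exp_enhanced_log(x):
    # Prior-expected enhanced-log accuracy of (x, 1-x, 0). The w3 term,
    # log 0 - 1, is the same for every such posterior, so it is dropped
    # from the comparison.
    return p[0] * (math.log(x) - 1) + p[1] * (math.log(1 - x) - 1)

xs = [i / 1000 for i in range(1, 1000)]
x_brier = max(xs, key=exp_brier)       # 0.6 -- not conditionalization
x_log = max(xs, key=exp_enhanced_log)  # 0.625 = 0.5 / (0.5 + 0.3)
```

Only the enhanced log score returns the conditional probability $0.625$; the Brier score recommends $0.6$ instead.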
However, it turns out that it only works if we assume our credences are defined only over the set of possible states of the world, and not over more coarse-grained propositions (see <a href="https://m-phi.blogspot.com/2020/07/update-on-updating-or-fall-from-favour.html" target="_blank">here</a>). So I think this approach is a non-starter.<br /><br />More promising at first sight is the argument by <a href="https://doi.org/10.1093/mind/fzl607" target="_blank">Hilary Greaves and David Wallace</a> from 2006. Here, just as we considered earlier, we look not just at the posterior we will adopt having learned $E$, but also the posterior we would adopt were we to learn $\overline{E}$. Thus, if your prior is $(p_1, p_2, p_3)$, then you are looking for $(x, 1-x, 0)$ that maximizes$$p_1 \mathfrak{I}((x, 1-x, 0), w_1) + p_2 \mathfrak{I}((x, 1-x, 0), w_2) + p_3 \mathfrak{I}((0, 0, 1), w_3)$$And it turns out that this will always be$$x = \frac{p_1}{p_1 + p_2}\ \ 1-x = \frac{p_2}{p_1+p_2}$$providing $\mathfrak{I}$ is strictly proper.<br /><br />Does this help us? Does it show that, if we set our priors using GHC, we should then set our posteriors using conditionalization? One worry might be this: What justifies you in choosing your posteriors using one decision rule -- namely, maximise subjective expected utility -- when you picked your priors using a different one -- namely, GHC? But there seems to be a natural answer. As I emphasised above, GHC is specifically designed for situations in which probabilities, either subjective or objective, are not available. It allows us to make decisions in their absence. But of course when it comes to choosing the posterior, we are no longer in such a situation. At that point, we can simply resort to what became more orthodox decision theory, namely, Savage's subjective expected utility theory.<br /><br />But there's a problem with this. GHC is not a neutral norm for picking priors. 
When you pick your Hurwicz weights for the best case, the second-best case, and so on down to the second-worst case and the worst case, you reflect an attitude to risk. Give more weight to the worst cases and you're risk averse, choosing options that make those worst cases better; give more weight to the best cases and you're risk seeking; spread the weights equally across all cases and you are risk neutral. But the problem is that subjective expected utility theory is a risk neutral theory. (One way to see this is to note that it is the special case of <a href="https://global.oup.com/academic/product/risk-and-rationality-9780199672165" target="_blank">Lara Buchak's risk-weighted expected utility theory</a> that results from using the neutral risk function $r(x) = x$.) Thus, for those who have picked their prior using a risk-sensitive instance of GHC when they lacked probabilities, the natural decision rule when they have access to probabilities is not going to be straightforward expected utility theory. It's going to be a risk-sensitive rule that can accommodate subjective probabilities. The natural place to look would be Lara Buchak's theory, for instance. And it's straightforward to show that Greaves and Wallace's result does not hold when you use such a rule. (In <a href="https://philpapers.org/rec/CAMAUF" target="_blank">forthcoming work</a>, Catrin Campbell-Moore and Bernhard Salow investigate how we might change our accuracy measures to fit with such a theory, and what follows from an argument like Greaves and Wallace's when we do.) In sum, I think arguments for conditionalization based on maximizing expected accuracy won't help us here.<br /><br />Fortunately, however, there is another argument, and it doesn't run into this problem. As we will see, though, it does face other challenges. 
In Greaves and Wallace's argument, we took the view from the prior that we picked using GHC, and we used it to evaluate our way of picking posteriors. In this argument, due to <a href="https://drive.google.com/open?id=1kjY_wQ0nlXIGfnla_MhWF1uQmcJtcTUB" target="_blank">me and Ray Briggs</a>, we take the view from nowhere, and we use it to evaluate the prior and the posterior rule together. Thus, suppose $p$ is your prior and $p'$ is your posterior rule. Then we evaluate them together by taking their joint accuracy to be the sum of their individual accuracies. Thus,$$\mathfrak{I}((p, p'), w) = \mathfrak{I}(p, w) + \mathfrak{I}(p', w)$$Then we have the following fact, where $p'$ is a conditioning rule for $p$ over some partition $\mathcal{E}$ iff, for all $E$ in $\mathcal{E}$, if $p(E) > 0$, then $p'_E(-) = p(-|E)$:<br /><br /><b>Theorem</b> Suppose $\mathfrak{I}$ is an additive and continuous strictly proper scoring rule. Then, if $p'$ is not a conditioning rule for $p$ over $\mathcal{E}$, there are $q$ and $q'$ such that$$\mathfrak{I}((p, p'), w) < \mathfrak{I}((q, q'), w)$$for all worlds $w$.<br /><br />That is, if $p'$ is not a conditioning rule for $p$, then, taken together, they are accuracy-dominated. There is an alternative pair, $q$ and $q'$, that, taken together, are guaranteed to be more accurate than $p$ and $p'$ are, taken together.<br /><br />Notice that this argument establishes a slightly different norm from the one that the expected accuracy argument secures. The latter is a narrow scope norm: if $p$ is your prior, then your posterior rule should be to condition on $p$ with whatever evidence you learn. The former is a wide scope norm: you should not have prior $p$ and a posterior rule that does not condition on the evidence you learn. 
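Here is a concrete instance of the theorem (the particular numbers are mine, found by trial and error, with the Brier score as the accuracy measure): take the prior $p = (0.5, 0.3, 0.2)$ with the non-conditioning rule $p'_E = (0.5, 0.5, 0)$ and $p'_{\overline{E}} = (0, 0, 1)$, where $E = \{w_1, w_2\}$. The pair $q = (0.45, 0.35, 0.2)$ with $q'_E = (0.55, 0.45, 0)$ is jointly more accurate at every world:

```python
def brier(c, i):
    # Brier accuracy of credence function c at world w_i (higher is better).
    return 2 * c[i] - sum(x * x for x in c)

def joint(prior, post_E, world):
    # Joint accuracy of a prior and a posterior rule. On E = {w1, w2} the
    # rule outputs post_E; on its complement it outputs (0, 0, 1).
    post = post_E if world in (0, 1) else (0.0, 0.0, 1.0)
    return brier(prior, world) + brier(post, world)

p, pE = (0.5, 0.3, 0.2), (0.5, 0.5, 0.0)    # non-conditioning: 0.5 != 0.5/0.8
q, qE = (0.45, 0.35, 0.2), (0.55, 0.45, 0.0)

dominated = all(joint(q, qE, w) > joint(p, pE, w) for w in range(3))
# dominated is True: (q, q') beats (p, p') at w1, w2, and w3 alike.
```

(The dominating pair needn't itself be a conditioning pair; the theorem only says that every non-conditioning pair is beaten by some pair or other.)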
This suggests that, if you're sitting at the beginning of your epistemic life and you're picking priors and posterior rules together, as a package, you should pick them so that the posterior rule involves conditioning on the prior with the evidence received. Does it also tell you anything about what to do if you're sitting with your prior already fixed and new evidence comes in? I'm not sure. Here's a reason to think it might. You might think that it's only rational to do at a later time what it was rational to plan to do at an earlier time. If that's right, then we can obtain the narrow scope norm from the wide scope one.<br /><br />Let's park those questions for the moment. For the approach taken in this argument suggests something else. In the previous post, we asked how to pick your priors, and we hit upon GHC. Now that we have a way of evaluating priors and posterior rules together, perhaps we should just apply GHC to those? Let's see what happens if we do that. As before, assume the best case receives weight $\alpha_1 = 0.5$, the second-best $\alpha_2 = 0.3$, and the worst $\alpha_3 = 0.2$. Then we know that the priors that GHC permits when we consider them on their own, without the posterior rules appended to them, are just$$\begin{array}{ccc}(0.5, 0.3, 0.2) & (0.5, 0.2, 0.3) & (0.3, 0.5, 0.2) \\ (0.3, 0.2, 0.5) & (0.2, 0.5, 0.3) &(0.2, 0.3, 0.5)\end{array}$$Now let's consider what happens when we add in the posterior rules for learning $E = \{w_1, w_2\}$ or $\overline{E} = \{w_3\}$. Then it turns out that the maximizers are the priors$$(0.3, 0.2, 0.5)\ \ (0.2, 0.3, 0.5)$$combined with the corresponding conditionalizing posterior rules. Now, since those two priors are among the ones that GHC permits when applied to the priors alone, this might seem consistent with the original approach. The problem is that these priors are specific to the case in which you'll learn either $E$ or $\overline{E}$. 
If, on the other hand, you'll learn $F = \{w_1\}$ or $\overline{F} = \{w_2, w_3\}$, the permissible priors are$$(0.5, 0.2, 0.3)\ \ (0.5, 0.3, 0.2)$$And, at the beginning of your epistemic life, you don't know which, if either, is correct.<br /><br />In fact, there's what seems to me a deeper problem. In the previous paragraph we considered a situation in which you might learn either $E$ or $\overline{E}$ or you might learn either $F$ or $\overline{F}$, and you don't know which. But the two options determine different permissible priors. The same thing happens if there are four possible states of the world $\{w_1, w_2, w_3, w_4\}$ and you might learn either $E_1 = \{w_1, w_2\}$ or $E_2 = \{w_3, w_4\}$ or you might learn either $F_1 = \{w_1, w_2\}$ or $F_2 = \{w_3\}$ or $F_3 = \{w_4\}$. Now, suppose you assign the following Hurwicz weights: to the best case, you assign $\alpha_1 = 0.4$, to the second best $\alpha_2 = 0.3$, to the second worst $\alpha_3 = 0.2$ and to the worst $\alpha_4 = 0.1$. Then if you'll learn $E_1 = \{w_1, w_2\}$ or $E_2 = \{w_3, w_4\}$, then the permissible priors are<br /> $$\begin{array}{cccc}(0.1, 0.4, 0.2, 0.3) & (0.4, 0.1, 0.2, 0.3) & (0.1, 0.4, 0.3, 0.2) & (0.4, 0.1, 0.3, 0.2) \\ (0.2, 0.3, 0.1, 0.4) & (0.2, 0.3, 0.4, 0.1) & (0.3, 0.2, 0.1, 0.4) & (0.3, 0.2, 0.4, 0.1) \end{array}$$But if you'll learn $F_1 = \{w_1, w_2\}$ or $F_2 = \{w_3\}$ or $F_3 = \{w_4\}$, then your permissible priors are<br /> $$\begin{array}{cccc}(0.1, 0.2, 0.3, 0.4) & (0.1, 0.2, 0.4, 0.3) & (0.2, 0.1, 0.3, 0.4) & (0.2, 0.1, 0.4, 0.3) \end{array}$$That is, there is no overlap between the two. It seems to me that the reason this is such a problem is that it's always been a bit of an oddity that the two accuracy-first arguments for conditionalization seem to depend on this assumption that there is some partition from which your evidence will come. 
It seems strange that when you learn $E$, in order to determine how to update, you need to know what alternative propositions you might have learned instead. The reason this assumption hasn't proved so problematic so far is that the update rule is in fact not sensitive to the partition. For instance, if I will learn $E_1 = F_1 = \{w_1, w_2\}$, both the Greaves and Wallace argument and the Briggs and Pettigrew argument for conditionalization say that you should update on that in the same way whether you might have learned $E_2 = \{w_3, w_4\}$ instead, or $F_2 = \{w_3\}$ or $F_3 = \{w_4\}$ instead. But here the assumption does seem problematic, because the permissible priors are sensitive to what the partition is from which you'll receive your future evidence.<br /><br />What to conclude from all this? It seems to me that the correct approach is this: choose priors using GHC; choose posterior rules to go with them using the dominance argument that Ray and I gave--that is, update by conditioning.<br /><br />Richard Pettigrew<br /><br /><b>Taking risks and picking priors</b> (2020-07-16)<br /><br />For a PDF of this post, see <a href="https://drive.google.com/file/d/17uBUrv_RO6C6U2Fy-WOS8JRo6sUm0r-1/view?usp=sharing" target="_blank">here</a>. <br /><br />In my last couple of posts (<a href="https://m-phi.blogspot.com/2020/07/hurwiczs-criterion-of-realism-and.html" target="_blank">here</a> and <a href="https://m-phi.blogspot.com/2020/07/a-generalised-hurwicz-criterion.html" target="_blank">here</a>), I've been discussing decision rules for situations in which probabilities over the possible states of the world are not available for some reason. 
Perhaps your evidence is too sparse, and points in no particular direction, or too complex, and points in too many. So subjective probabilities are not available. And perhaps you simply don't know the objective probabilities. In those situations, you can't appeal to standard subjective expected utility theory of the sort described by Ramsey, Savage, Jeffrey, etc., nor to objective expected utility theory of the sort described by von Neumann & Morgenstern. What, then, are you to do? As I've discussed in the previous posts, this was in fact a hot topic in the early days of decision theory. Abraham Wald discussed the Maximin approach, Leonid Hurwicz expanded that to give his Criterion, Franco Modigliani mentioned Maximax approaches, and Leonard Savage discussed Minimax Regret. Perhaps the culmination of this research programme was John Milnor's elegant 'Games against Nature' in 1951 (revised in 1954), where he provided simple axiomatic characterisations of each of these decision rules. In the first post on this, I pointed out a problem with Hurwicz's approach (independently noted in this <a href="http://johanegustafsson.net/papers/decisions-under-ignorance-and-the-individuation-of-states-of-nature.pdf" target="_blank">draft paper</a> by Johan Gustafsson); in the second, I expanded that approach to avoid the problem.<br /><br />Perhaps unsurprisingly, my interest in these decision rules stems from the possibility of applying them in accuracy-first epistemology. On that approach in formal epistemology, we determine the rationality of credences by considering the extent to which they promote what we take to be the sole epistemic good for credences, namely, accuracy. And we make this precise by applying decision theory to the question of which credence functions are rational. 
Thus: first, in place of the array of acts or options that are the usual focus of standard applications of decision theory, we substitute the different possible credence functions you might adopt; second, in place of the utility function that measures the value of acts or options at different possible states of the world, we substitute mathematical functions that measure the accuracy of a possible credence function at a possible state of the world. Applying a decision rule then identifies the rationally permissible credence functions. For instance, in <a href="https://www.journals.uchicago.edu/doi/10.1086/392661" target="_blank">the classic paper</a> from 1998, Jim Joyce applies a dominance rule to show that only probabilistic credence functions are rationally permissible.<br /><br />Now, one of the central questions in the epistemology of credences is the problem of the priors. Before I receive any evidence at all, which credences should I have? What should my ur-priors be, if you like? What should a superbaby's credences be, as David Lewis would put it? Now, people have reasonable concerns about the very idea of a superbaby -- this cognitive agent who has no evidence whatsoever but is nonetheless equipped with the conceptual resources to formulate a rich algebra of propositions. Evidence and conceptual resources grow together, after all. However, the problem of the priors arises even when you do have lots of evidence about other topics, but take yourself to have none that bears on a new topic in which you have yet to set your credences. And indeed it arises even if you do have credences that bear on the new topic, but you don't think they are justified, or they're inadmissible for the purpose to which you wish to use the credences -- for instance, if you are a scientist producing the priors for the Bayesian model you will use in your paper. 
So I think we needn't see the superbaby as a problematic idealization, but as a way of representing a situation in which we in fact quite often find ourselves.<br /><br />So, what credences should a superbaby choose? The key point is that they have no recourse to any subjective or objective probabilities. So they can't appeal to expected accuracy to set their credences. Thus, we might naturally look to Wald, Hurwicz, Milnor, etc. to see what they might use instead. In <a href="https://drive.google.com/file/d/1DwSFhYHptQSzpaDr08hwDAouIYCxCxfU/view?usp=sharing" target="_blank">this paper</a> from 2016, I explored what Maximin requires, and in <a href="https://drive.google.com/file/d/1f98ztmxh_x7zPQz39TO3sqQ7sKIotN0S/view?usp=sharing" target="_blank">this follow-up</a> later that year, I explored what the Hurwicz Criterion mandates. In this post, I'd like to explore what the Generalized Hurwicz Criterion stated in the previous blogpost requires of your credences. Let's remind ourselves of this decision rule.<br /><br /><b>Generalized Hurwicz Criterion (GHC)</b> Suppose the set of possible states of the world is $W = \{w_1, \ldots, w_n\}$. Pick $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\alpha_1 + \ldots + \alpha_n = 1$, and denote this sequence of weights $A$. If $a$ is an option defined on $W$ and$$a(w_{i_1}) \geq a(w_{i_2}) \geq \ldots \geq a(w_{i_n})$$then let$$H^A(a) = \alpha_1a(w_{i_1}) + \ldots + \alpha_na(w_{i_n})$$Pick an option that maximises $H^A$.<br /><br />So $\alpha_1$ weights the best-case utility, $\alpha_2$ the second best, and so on down to $\alpha_n$, which weights the worst-case utility. We then sum these weighted utilities to give the generalised Hurwicz score and we choose in order to maximise this.<br /><br />Now, suppose that our options are credence functions, and the utility of a credence function is given by an accuracy measure $\mathfrak{I}$. 
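Before specialising to credences, it may help to see the bare rule in code. Here is a minimal sketch of GHC with hypothetical options and utilities of my own devising; none of these numbers comes from the formal development that follows.

```python
def ghc_score(utilities, alphas):
    # Generalized Hurwicz score: alphas[0] weights the best-case utility,
    # alphas[1] the second best, ..., alphas[-1] the worst case.
    return sum(a * u for a, u in zip(alphas, sorted(utilities, reverse=True)))

# Hypothetical options over three states of the world, echoing the mountain
# vignette: utilities for (summit with view, summit in mist, injury).
options = {
    "continue": (10, 4, -5),
    "turn back": (2, 2, 2),   # guaranteed safe descent
}
alphas = (0.2, 0.3, 0.5)      # risk-averse: half the weight on the worst case

best = max(options, key=lambda o: ghc_score(options[o], alphas))
# ghc_score of "continue" is 0.2*10 + 0.3*4 + 0.5*(-5) = 0.7, of "turn back" 2,
# so these risk-averse weights recommend turning back.
```

With the risk-inclined weights $(0.5, 0.3, 0.2)$ instead, "continue" scores $0.5 \cdot 10 + 0.3 \cdot 4 + 0.2 \cdot (-5) = 5.2$ and wins, which is just the disagreement between me and Alex from the opening vignette.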
In fact, for our purposes here, we'll consider only credence functions defined over three worlds $w_1, w_2, w_3$. Things get complicated pretty fast here, and there will be plenty of interest in this simple case. As usual in accuracy-first epistemology, we'll say that the accuracy of your credence function is determined as follows:<br /><br />(1) $\mathfrak{s}(1, x)$ is your measure of accuracy for credence $x$ in a truth,<br /><br />(2) $\mathfrak{s}(0, x)$ is your measure of accuracy for credence $x$ in a falsehood,<br /><br />(3) $\mathfrak{s}$ is strictly proper, so that, for all $0 \leq p \leq 1$, $p\mathfrak{s}(1, x) + (1-p)\mathfrak{s}(0, x)$ is maximised, as a function of $x$, at $x = p$,<br /><br />(4) if $c$ is a credence function defined on $w_1, w_2, w_3$, and we write $c_i$ for $c(w_i)$, then <br /><br /><ul><li>$\mathfrak{I}(c, w_1) = \mathfrak{s}(1, c_1) + \mathfrak{s}(0, c_2) + \mathfrak{s}(0, c_3)$</li><li>$\mathfrak{I}(c, w_2) = \mathfrak{s}(0, c_1) + \mathfrak{s}(1, c_2) + \mathfrak{s}(0, c_3)$</li><li>$\mathfrak{I}(c, w_3) = \mathfrak{s}(0, c_1) + \mathfrak{s}(0, c_2) + \mathfrak{s}(1, c_3)$ </li></ul>So the accuracy of your credence function at a world is the sum of the accuracy at that world of the credences that it assigns. <br /><br />Now, given a credence function on $w_1, w_2, w_3$, we represent it by the triple $(c_1, c_2, c_3)$. And let's set our generalized Hurwicz weights to be $\alpha_1, \alpha_2, \alpha_3$. The first thing to note is that, if $(c_1, c_2, c_3)$ maximizes $H^A_\mathfrak{I}$ (the generalized Hurwicz score computed with $\mathfrak{I}$ as the utility function), then so does any permutation of it---that is,$$(c_1, c_2, c_3),\ \ (c_1, c_3, c_2),\ \ (c_2, c_1, c_3),\ \ (c_2, c_3, c_1),\ \ (c_3, c_1, c_2),\ \ (c_3, c_2, c_1)$$all maximize $H^A_\mathfrak{I}$ if any one of them does. 
The reason is that the generalized Hurwicz score for the three-world case depends on the best, middle, and worst accuracies for a credence function, and those are exactly the same for those six credence functions, even though the best, middle, and worst accuracies occur at different worlds for each. This means that, in order to find the maximizers, we only need to seek those for which $c_1 \geq c_2 \geq c_3$, since all others will be permutations of those. Let $\mathfrak{X} = \{(c_1, c_2, c_3)\, |\, c_1 \geq c_2 \geq c_3\}$. Since the accuracy measure $\mathfrak{I}$ is strictly proper and therefore truth-directed, for each $c$ in $\mathfrak{X}$,$$\mathfrak{I}(c, w_1) \geq \mathfrak{I}(c, w_2) \geq \mathfrak{I}(c, w_3)$$And so$$H^A_\mathfrak{I}(c) = \alpha_1 \mathfrak{I}(c, w_1) + \alpha_2\mathfrak{I}(c, w_2) + \alpha_3 \mathfrak{I}(c, w_3)$$<br />That means that $H^A_\mathfrak{I}(c)$ is the expected accuracy of $c$ by the lights of the credence function $(\alpha_1, \alpha_2, \alpha_3)$ generated by the Hurwicz weights. This allows us to calculate each case. As Catrin Campbell-Moore helped me to see, it turns out that the maximizer does not depend on which strictly proper scoring rule you use---each gives the same. In the first column of the table below, I list the different possible orderings of the three Hurwicz weights. In two cases, specifying that order is not sufficient to determine the maximizer. To do that, you also have to know the absolute values of some of the weights. Where necessary, I include those in the second column. 
In the third column, I specify the member of $\mathfrak{X}$ that maximizes $H^A_\mathfrak{I}$ relative to those weights.$$\begin{array}{c|c|ccc}<br />\mbox{Ordering of} & \mbox{Further properties} & c_1 & c_2 & c_3\\<br />\mbox{the weights} & \mbox{of the weights} & && \\<br />&&&\\<br />\hline <br />&&&\\<br />\alpha_1 \leq \alpha_2 \leq \alpha_3 & - & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\<br />&&&\\<br />\alpha_1 \leq \alpha_3 \leq \alpha_2 & \alpha_1 + \alpha_2 \geq \frac{2}{3} & \frac{\alpha_1 + \alpha_2}{2} & \frac{\alpha_1 + \alpha_2}{2} & \alpha_3 \\<br />&&&\\<br />\alpha_1 \leq \alpha_3 \leq \alpha_2 & \alpha_1 + \alpha_2 \leq \frac{2}{3} & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\<br />&&&\\<br />\alpha_2 \leq \alpha_1 \leq \alpha_3 & \alpha_1 \leq \frac{1}{3}& \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\<br />&&&\\<br />\alpha_2 \leq \alpha_1 \leq \alpha_3 & \alpha_1 > \frac{1}{3} & \alpha_1 & \frac{\alpha_2 + \alpha_3}{2} & \frac{\alpha_2 + \alpha_3}{2} \\<br />&&&\\<br />\alpha_2 \leq \alpha_3 \leq \alpha_1 &- & \alpha_1 & \frac{\alpha_2 + \alpha_3}{2} & \frac{\alpha_2 + \alpha_3}{2} \\<br />&&&\\<br />\alpha_3 \leq \alpha_1 \leq \alpha_2 &- & \frac{\alpha_1 + \alpha_2}{2} & \frac{\alpha_1 + \alpha_2}{2} & \alpha_3 \\<br />&&&\\<br />\alpha_3 \leq \alpha_2 \leq \alpha_1 &- & \alpha_1 & \alpha_2 & \alpha_3 <br />\end{array}$$<br />In the following diagram, we plot the different possible Hurwicz weights in a barycentric plot, so that the bottom left corner of the triangle is $(1, 0, 0)$, the bottom-right is $(0, 1, 0)$ and the top is $(0, 0, 1)$. We then divide this into four regions. If your weights $A = (\alpha_1, \alpha_2, \alpha_3)$ lie in a given region, then the triple I've placed in that region gives the credence function that maximizes the Generalized Hurwicz Score $H^A$ for those weights. Note, the bottom left triangle is $\mathfrak{X}$. 
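These values can be spot-checked numerically. Here is a small sketch of my own (not from the original post) that uses the Brier score as the strictly proper accuracy measure and searches a grid over $\mathfrak{X}$ for the credence function with the greatest generalized Hurwicz score:

```python
import itertools

def brier_accuracy(c, w):
    # Negative Brier score of credences c = (c1, c2, c3) at world w (0-indexed).
    return -sum((float(i == w) - ci) ** 2 for i, ci in enumerate(c))

def ghc(c, alphas):
    # alphas[0] weights the best-case accuracy, alphas[2] the worst case.
    accuracies = sorted((brier_accuracy(c, w) for w in range(3)), reverse=True)
    return sum(a * x for a, x in zip(alphas, accuracies))

def best_in_X(alphas, steps=30):
    # Grid search over X = {(c1, c2, c3) : c1 >= c2 >= c3}.
    grid = [k / steps for k in range(steps + 1)]
    candidates = (c for c in itertools.product(grid, repeat=3)
                  if c[0] >= c[1] >= c[2])
    return max(candidates, key=lambda c: ghc(c, alphas))

print(best_in_X((0.5, 0.3, 0.2)))  # bottom table row: expect (0.5, 0.3, 0.2)
print(best_in_X((0.2, 0.3, 0.5)))  # top table row: expect (1/3, 1/3, 1/3)
```

The first call checks the bottom row of the table ($\alpha_3 \leq \alpha_2 \leq \alpha_1$, giving $c = (\alpha_1, \alpha_2, \alpha_3)$); the second checks the top row, which demands the uniform distribution.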
Essentially, to find which member of $\mathfrak{X}$ a given weighting demands, you plot that weighting in this diagram and then find the closest member of $\mathfrak{X}$. You can use Euclidean closeness for this purpose, since all strictly proper accuracy measures will give that same result.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Ipxt8OHa02Q/Xw8oFF64x0I/AAAAAAAAE0A/IBNtFmidVm8aI8wFaLUyydHhiOkO9QkrgCLcBGAsYHQ/s1600/bary.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1469" data-original-width="1600" height="366" src="https://1.bp.blogspot.com/-Ipxt8OHa02Q/Xw8oFF64x0I/AAAAAAAAE0A/IBNtFmidVm8aI8wFaLUyydHhiOkO9QkrgCLcBGAsYHQ/s400/bary.jpg" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"></div> To finish, some interesting points about this:<br /><br />First, for any probabilistic credence function, there are Hurwicz weights for which that credence function is a maximizer. If $c = (c_1, c_2, c_3)$ and $c_{i_1} \geq c_{i_2} \geq c_{i_3}$, then the weights are $c_{i_1}, c_{i_2}, c_{i_3}$.<br /><br />Second, in my earlier paper, I noted that Maximin demands the uniform distribution. So you might see the uniform distribution as the maximally risk-averse option. And indeed you can read something like that into William James' remark that the person who always suspends judgment has an irrational aversion to being a dupe. 
But actually it turns out that quite a wide range of attitudes to risk--in this case represented by generalised Hurwicz weights in the upper trapezoid region--demand the uniform distribution.<br /><br />Third, in my first blogpost about all this, I criticized Hurwicz's original version of his criterion for being incompatible with a natural dominance norm, which says that if one option is always at least as good as another and sometimes better, then it should be strictly preferred. Suppose we apply this strengthening to Weak Dominance in our characterization of the Generalized Hurwicz Criterion in the previous blogpost. Then we obtain a slightly more demanding version of the Generalized Hurwicz Criterion that requires that each of your Hurwicz weights is strictly positive. And if we do that, then it's no longer true that any probabilistic credence function can be rationalised in this way. Only regular ones can be. So that plausible strengthening of Weak Dominance leads us to an argument for Regularity, the principle that says that you should not assign extremal credences to propositions that are not tautologies or contradictions.<br /><br />Richard Pettigrew<br /><br /><b>A Generalised Hurwicz Criterion</b> (2020-07-15)<br /><br />Here's a <a href="https://drive.google.com/file/d/1Wmmzmf2xZ_cjQ3ZBIOwmMs7ezDBmzDKe/view?usp=sharing" target="_blank">PDF</a> of this blogpost.<br /><br />In yesterday's <a href="https://m-phi.blogspot.com/2020/07/hurwiczs-criterion-of-realism-and.html" target="_blank">post</a>, I discussed Leonid Hurwicz's Criterion of Realism. This is a decision rule for situations in which your evidence is so sparse and your uncertainty so great that you cannot assign probabilities to the possible states of the world. 
It requires you to pick a real number $0 \leq \alpha \leq 1$ and then assign to each option or act the weighted average of its best-case utility and its worst-case utility, where $\alpha$ is the weight for the former and $1-\alpha$ is the weight for the latter. That is, given an option defined on a set of possible worlds $W$, let$$H^\alpha(a) := \alpha \max_{w \in W} a(w) + (1-\alpha) \min_{w \in W} a(w)$$The Hurwicz Criterion says: Pick an option that maximises $H^\alpha$.<br /><br />In that post, I also mentioned Hurwicz's own characterization of a class of preference orderings to which the orderings determined by $H^\alpha$ belong, namely, the class of all preference orderings that are indifferent between two options with the same minimum and maximum utilities. And I pointed out that there is something troubling about it: if you strengthen Hurwicz's Dominance axiom in a natural way, his axioms are no longer consistent. (As I mention in an update at the beginning of that blog, after posting it, Johan Gustafsson sent me a <a href="http://johanegustafsson.net/papers/decisions-under-ignorance-and-the-individuation-of-states-of-nature.pdf" target="_blank">draft paper</a> of his that makes this same point.) To see this, let $W = \{w_1, w_2, w_3\}$, and suppose $a$ and $a'$ are options defined on $W$ with<br />$$<br />\begin{array}{r|ccc}<br />& w_1 & w_2 & w_3\\<br />\hline <br />a & m & n & M \\<br />a' & m & n' & M<br />\end{array}<br />$$where $m < n < n' < M$. Then $H^\alpha(a) = H^\alpha(a')$, for any $0 \leq \alpha \leq 1$, but $a'$ weakly dominates $a$---that is, $a(w_i) \leq a'(w_i)$ for all $i = 1, 2, 3$ and $a(w_i) < a'(w_i)$ for some $i = 1, 2, 3$. That is, the Hurwicz Criterion is incompatible with a natural Dominance principle that says that if one option is always at least as good as another and sometimes better than it, it should be strictly preferred. 
And indeed all of the other rules in the class that Hurwicz characterised suffer from the same problem. Gustafsson uses this to mount an argument against one of Hurwicz's axioms, namely, the one that I called Coarse-Graining Invariance in the previous blogpost. <br /><br />How to address this? A natural response is to note that the Hurwicz Criterion was too narrow in its view. While it was broader than Wald's Maximin, which was also in the air at the time, and its dual, Maximax, it still only considered best- and worst-case utilities. What of the rest? Of course, Hurwicz thought that his narrow focus was justified by his axioms. But if you are moved by the argument against those axioms that Gustafsson and I favour, that justification evaporates. Thus, I introduce what I'll call the Generalised Hurwicz Criterion as follows, where we are considering only options defined on a finite set of possible states of the world $W = \{w_1, \ldots, w_n\}$:<br /><br /><b>Generalized Hurwicz Criterion (GHC)</b> Pick $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\alpha_1 + \ldots + \alpha_n = 1$, and denote this sequence of weights $A$. If $a$ is defined on $W$ and$$a(w_{i_1}) \leq a(w_{i_2}) \leq \ldots \leq a(w_{i_n})$$then let$$H^A(a) = \alpha_1a(w_{i_1}) + \ldots + \alpha_na(w_{i_n})$$Pick an option that maximises $H^A$.<br /><br />Note: it's important to reassure yourself that $H^A$ is well-defined in this statement (if ties in utility allow more than one such ordering of the worlds, each ordering gives the same value, since only the sorted utilities matter). Now, the next thing is to offer a characterization of GHC. Recall, Hurwicz himself didn't actually characterise his Criterion, but rather a larger class of which it was a member. However, <a href="https://en.wikipedia.org/wiki/John_Milnor" target="_blank">John Milnor</a>, at that point just shortly out of his undergraduate degree and embarking on his doctoral work in knot theory, did provide a characterization in his paper <a href="https://www.rand.org/pubs/research_memoranda/RM0679.html" target="_blank">'Games against Nature'</a>. 
We'll build on that result here to characterise GHC. First, two pieces of notation:<br /><ul><li>If $a$, $a'$ are options, then define $a + a'$ to be the option with $(a + a')(w) = a(w) + a'(w)$, for all $w$ in $W$.</li><li>If $a$ is an option and $m$ is a real number, then define $ma$ to be the option with $ma(w) = m \times a(w)$, for all $w$ in $W$. </li></ul><b>(A1)</b> <b>Structure</b> $\preceq$ is reflexive and transitive. <br /><br /><b>(A2) Weak Dominance </b><br /><ol><li>If $a(w) \leq a'(w)$ for all $w$ in $W$, then $a \preceq a'$.</li><li>If $a(w) < a'(w)$ for all $w$ in $W$, then $a \prec a'$.</li></ol><b>(A3) Permutation Invariance </b>For any set of worlds $W$ and any options $a, a'$ defined on $W$, if $\pi : W \cong W$ is a permutation of the worlds in $W$ and if $a'(w) = a(\pi(w))$ for all $w$ in $W$, then $a \sim a'$.<br /><br /><b>(A4) Continuity</b> Suppose $a_1, a_2, \ldots$ is a sequence of options that converges on $a$ in the limit. Then, if $a_i \prec a'$ for all $i = 1, 2, \ldots$, then $a \preceq a'$.<br /><br /><b>(A5) Linearity</b> If $m > 0$ and $a \sim a'$, then $ma \sim ma'$.<br /><br /><b>(A6) Summation</b> If $a \sim b$ and $a' \sim b'$, then $a + a' \sim b + b'$.<br /><br /><b>Theorem</b> Suppose (A1-6). Then there is some sequence $A$ of weights $0 \leq \alpha_1, \ldots, \alpha_n \leq 1$ with $\alpha_1 + \ldots + \alpha_n = 1$ such that $a \preceq a'$ iff $H^A(a) \leq H^A(a')$.<br /><br />Before we continue to the proof, one point to note: as I noted in the previous post and Johan Gustafsson noted in his draft paper, if we strengthen Weak Dominance, Hurwicz's axioms are inconsistent. However, if we strengthen it here, the axioms remain consistent, and they characterise a slightly narrower version of the Generalized Hurwicz Criterion in which all weights must be non-zero.<br /><br /><i>Proof of Theorem</i>. First, we determine the weights. We do this in $n$ steps. Here, we denote an option by the $n$-tuple of its utility values at the $n$ different worlds. 
Thus, $a = (a(w_1), \ldots, a(w_n))$.<br /><ul><li>If $\beta_n$ is the supremum of the set$$\{\alpha : (\alpha, \ldots, \alpha) \preceq (0, \ldots, 0, 1)\}$$then let $\alpha_n = \beta_n$;</li><li>If $\beta_{n-1}$ is the supremum of the set$$\{\alpha : (\alpha, \ldots, \alpha) \preceq (0, \ldots, 0, 1, 1)\}$$then let $\alpha_{n-1} = \beta_{n-1}- \alpha_n$;</li><li>If $\beta_{n-2}$ is the supremum of the set$$\{\alpha : (\alpha, \ldots, \alpha) \preceq (0, \ldots, 0, 1, 1, 1)\}$$then let $\alpha_{n-2} = \beta_{n-2} - \alpha_n - \alpha_{n-1}$;</li><li>and so on until...</li><li>If $\beta_1$ is the supremum of the set$$\{\alpha : (\alpha, \ldots, \alpha) \preceq (1, \ldots, 1, 1)\}$$then let $\alpha_1 = \beta_1 - \alpha_n - \alpha_{n-1} - \ldots - \alpha_2$.</li></ul>Now, by Weak Dominance, for each $i = 1, \ldots, n$, $0 \leq \beta_i \leq 1$, since $(0, \ldots, 0) \preceq (0, \ldots, 1, \ldots, 1) \preceq (1, \ldots, 1)$. What's more, and again by Weak Dominance, $\beta_i \leq \beta_{i-1}$. Thus, $\alpha_i \geq 0$. What's more, $\beta_1 = 1$, so $\alpha_1 + \ldots + \alpha_n = 1$.<br /><br />What's more, by Continuity:<br />\begin{eqnarray*}<br />(\alpha_n, \ldots, \alpha_n) & \sim & (0, 0, \ldots, 0, 1) \\<br />(\alpha_n + \alpha_{n-1}, \ldots, \alpha_n + \alpha_{n-1}) & \sim & (0, 0, \ldots, 1, 1) \\<br />\vdots & \vdots & \vdots \\<br />(\alpha_n + \ldots + \alpha_2, \ldots, \alpha_n+ \ldots + \alpha_2) & \sim & (0, 1, \ldots, 1, 1) \\ <br />(\alpha_n + \ldots + \alpha_1, \ldots, \alpha_n+ \ldots + \alpha_1) & \sim & (1, 1, \ldots, 1, 1) \\ <br />\end{eqnarray*}<br /><br />Now, suppose $u_1 \leq \ldots \leq u_n$, and consider the option $(u_1, \ldots, u_n)$. 
Then, by Linearity:<br /><br />$(\alpha_n (u_n-u_{n-1}), \ldots, \alpha_n(u_n-u_{n-1})) \sim (0, 0, \ldots, 0, u_n - u_{n-1})$<br /><br />$((\alpha_n + \alpha_{n-1})(u_{n-1}-u_{n-2}), \ldots, (\alpha_n + \alpha_{n-1})(u_{n-1}-u_{n-2})) \sim$<br />$(0, 0, \ldots, u_{n-1}-u_{n-2}, u_{n-1}-u_{n-2})$<br /><br />and so on, until...<br /><br />$((\alpha_n + \ldots + \alpha_2)(u_2-u_1), \ldots, (\alpha_n+ \ldots + \alpha_2)(u_2-u_1)) \sim$<br />$(0, u_2-u_1, \ldots, u_2-u_1, u_2-u_1)$<br /><br />$((\alpha_n + \ldots + \alpha_1)u_1, \ldots, (\alpha_n+ \ldots + \alpha_1)u_1) \sim (u_1, u_1, \ldots, u_1, u_1)$<br /><br />And so, by Summation:$$(u_1, \ldots, u_n) \sim (\sum_i \alpha_i u_i, \ldots, \sum_i \alpha_i u_i)$$So, if $v_1 \leq \ldots \leq v_n$, and$$\sum_i \alpha_iv_i = \sum_i \alpha_iu_i$$then$$(v_1, \ldots, v_n) \sim (\sum_i \alpha_i v_i, \ldots, \sum_i \alpha_i v_i) \sim (\sum_i \alpha_i u_i, \ldots, \sum_i \alpha_i u_i) \sim (u_1, \ldots, u_n)$$as required. $\Box$<br /><br />Richard Pettigrew<br /><br /><b>Hurwicz's Criterion of Realism and decision-making under massive uncertainty</b> (2020-07-12)<br /><br />For a PDF version of this post, click <a href="https://drive.google.com/file/d/1XEMWmypE_lptZ_Ikm7SJ3bFT-9tDKtDr/view?usp=sharing" target="_blank">here</a>.<br /><br />[UPDATE: After posting this, Johan Gustafsson got in touch and it seems he and I have happened upon similar points via slightly different routes. His paper is <a href="http://johanegustafsson.net/papers/decisions-under-ignorance-and-the-individuation-of-states-of-nature.pdf" target="_blank">here</a>. He takes his axioms from Binmore's <i>Rational Decisions</i>, who took them from Milnor's 'Games against Nature'. 
Hurwicz and Arrow also cite Milnor, but Hurwicz's original characterisation appeared before Milnor's paper, and he cites Chernoff's Cowles Commission Discussion Paper: Statistics No. 326A as the source of his axioms.]<br /><br />In 1951, Leonid Hurwicz, a Polish-American economist who would go on to share the Nobel prize for his work on mechanism design, published a series of short notes as part of the Cowles Commission Discussion Paper series, where he introduced a new decision rule for choice in the face of massive uncertainty. The situations that interested him were those in which your evidence is so sparse that it does not allow you to assign probabilities to the different possible states of the world. These situations, he thought, fall outside the remit of Savage's expected utility theory.<br /><br />The rule he proposed is called <i>Hurwicz's Criterion of Realism</i> or just <i>the Hurwicz Criterion</i>. He introduced it in the form in which it is usually stated in February 1951 in the Cowles Commission Discussion Paper: Statistics No. 356 -- the title was <a href="https://cowles.yale.edu/sites/default/files/files/pub/cdp/s-0356.pdf" target="_blank">'A Class of Criteria for Decision-Making under Ignorance'</a>. The Hurwicz Criterion says that you should choose an option that maximises what I'll call its <i>Hurwicz score</i>, which is a particular weighted average of its best-case utility and its worst-case utility. A little more formally: We follow Hurwicz and let an option be a function $a$ from a set $W$ of possible states of the world to the real numbers $\mathbb{R}$. Now, you begin by setting the weight $0 \leq \alpha \leq 1$ you wish to assign to the best-case utility of an option, and then you assign the remaining weight $1-\alpha$ to its worst-case. 
Then the Hurwicz score of option $a$ is just $$H^\alpha(a) := \alpha \max_{w \in W} a(w) + (1-\alpha) \min_{w \in W} a(w)$$<br /><br />However, reading his other notes in the Cowles series that surround this brief three-page note, it's clear that Hurwicz's chief interest was not so much in this particular form of decision rule, but rather in any such rule that determines the optimal choices solely by looking at their best- and worst-case scenarios. The Hurwicz Criterion is one such rule, but there are others. You might, for instance, weight the best- and worst-cases not by fixed constant coefficients, but by coefficients that change with the minimum and maximum values, or change with the difference between them or with their ratio. One of the most interesting contributions of these papers that surround the one in which Hurwicz gives us his Criterion is a characterization of rules that depend only on best- and worst-case utilities. Hurwicz gave rather an inelegant initial version of that characterization in Cowles Commission Discussion Paper: Statistics No. 370, published at the end of 1951 -- the title was <a href="https://cowles.yale.edu/sites/default/files/files/pub/cdp/s-0370.pdf" target="_blank">'Optimality Criteria for Decision-Making under Ignorance'</a>. Kenneth Arrow then seems to have helped clean it up, and they published the new version together in the <a href="https://www.cambridge.org/core/books/studies-in-resource-allocation-processes/appendix-an-optimality-criterion-for-decisionmaking-under-ignorance/7846B18137B686F377133D3C7AA4404A" target="_blank">Appendix</a> of their edited volume, in which they contributed most of the chapters, often with co-authors, <a href="https://www.cambridge.org/core/books/studies-in-resource-allocation-processes/B120D93AE7A55F249285BCA429E5EB87" target="_blank"><i>Studies in Resource Allocation Processes</i></a>. 
The version with Arrow is still reasonably involved, but the idea is quite straightforward, and it is remarkable how strong a restriction Hurwicz obtains from seemingly weak and plausible axioms. This really seems to me a case where axioms that seem quite innocuous on their own can combine in interesting ways to make trouble. So I thought it might be interesting to give a simplified version that has all the central ideas.<br /><br />Here's the framework:<br /><br /><i>Possibilities and possible worlds. </i>Let $\Omega$ be the set of possibilities. A possible world is a set of possibilities--that is, a subset of $\Omega$. And a set $W$ of possible worlds is a partition of $\Omega$. That is, $W$ presents the possibilities at $\Omega$ at a certain level of grain. So if $\Omega = \{\omega_1, \omega_2, \omega_3\}$, then $\{\{\omega_1\}, \{\omega_2\}, \{\omega_3\}\}$ is the most fine-grained set of possible worlds, but there are coarser-grained sets as well, such as $\{\{\omega_1, \omega_2\}, \{\omega_3\}\}$ or $\{\{\omega_1\}, \{\omega_2, \omega_3\}\}$. (This is not quite how Hurwicz understands the relationship between different sets of possible states of the world -- he talks of deleting worlds rather than clumping them together, but I think this formalization better captures his idea.)<br /><br /><i>Options</i>. For any set $W$ of possible worlds, an option defined on $W$ is simply a function from $W$ into the real numbers $\mathbb{R}$. So an option $a : W \rightarrow \mathbb{R}$ takes each world $w$ in $W$ and assigns a utility $a(w)$ to it. (Hurwicz refers to von Neumann and Morgenstern to motivate the assumption that utilities can be measured by real numbers.)<br /><br /><i>Preferences</i>. For any set $W$ of possible worlds, there is a preference relation $\preceq_W$ over the options defined on $W$. (Hurwicz states his result in terms of optimal choices rather than preferences. 
But I think it's a bit easier to see what's going on if we state it in terms of preferences. There's then a further question as to which options are optimal given a particular preference ordering, but we needn't address that here.)<br /><br />Hurwicz's goal was to lay down conditions on these preference relations such that the following would hold:<br /><br /><b>Hurwicz's Rule </b>Suppose $a$ and $a'$ are options defined on $W$. Then<br /><br /><b>(H1)</b> If<br /><ul><li>$\min_w a(w) = \min_w a'(w)$</li><li>$\max_w a(w) = \max_w a'(w)$</li></ul>then $a \sim_W a'$. That is, you should be indifferent between any two options with the same maximum and minimum.<br /><br /><b>(H2)</b> If<br /><ul><li>$\min_w a(w) < \min_w a'(w)$</li><li>$\max_w a(w) < \max_w a'(w)$</li></ul>then $a \prec_W a'$. That is, you should prefer one option to another if the worst case of the first is better than the worst case of the second and the best case of the first is better than the best case of the second.<br /><br />Here are the four conditions or axioms:<br /><br /><b>(A1)</b> <b>Structure</b> $\preceq_W$ is reflexive and transitive. <br /><br /><b>(A2) Weak Dominance </b><br /><ol><li>If $a(w) \leq a'(w)$ for all $w$ in $W$, then $a \preceq_W a'$.</li><li>If $a(w) < a'(w)$ for all $w$ in $W$, then $a \prec_W a'$.</li></ol>This is a reasonably weak version of a standard norm on preferences.<br /><br /><b>(A3) Permutation Invariance </b>For any set of worlds $W$ and any options $a, a'$ defined on $W$, if $\pi : W \cong W$ is a permutation of the worlds in $W$ and if $a'(w) = a(\pi(w))$ for all $w$ in $W$, then $a \sim_W a'$.<br /><br />This just says that it doesn't matter to you which worlds receive which utilities -- all that matters are the utilities received. <br /><br /><b>(A4) Coarse-Graining Invariance </b>Suppose $W = \{\ldots, w_1, w_2, \ldots\}$ is a set of possible worlds and suppose $a, a'$ are options on $W$ with $a(w_1) = a(w_2)$ and $a'(w_1) = a'(w_2)$. 
Then let $W' = \{\ldots, w_1 \cup w_2, \ldots\}$, so that $W'$ has the same worlds as $W$ except that, instead of $w_1$ and $w_2$, it has their union. And define options $b$ and $b'$ on $W'$ as follows: $b(w_1 \cup w_2) = a(w_1) = a(w_2)$ and $b'(w_1 \cup w_2) = a'(w_1) = a'(w_2)$, and $b(w) = a(w)$ and $b'(w) = a'(w)$ for all other worlds. Then $a \sim_W a'$ iff $b \sim_{W'} b'$.<br /><br />This says that if two options don't distinguish between two worlds, it shouldn't matter to you whether they are defined on a fine- or coarse-grained space of possible worlds.<br /><br />Then we have the following theorem:<br /><br /><b>Theorem</b> <b>(Hurwicz)</b> (A1) + (A2) + (A3) + (A4) $\Rightarrow$ (H1) + (H2).<br /><br />Here's the proof. Assume (A1) + (A2) + (A3) + (A4). First, we'll show that (H1) follows. We'll sketch the proof only for the case in which $W = \{w_1, w_2, w_3\}$, since that gives all the crucial moves. So denote an act on $W$ by a triple $(a(w_1), a(w_2), a(w_3))$. Now, suppose that $a$ and $a'$ are options defined on $W$ with the same minimum, $m$, and maximum, $M$. Let $n$ be the middle value of $a$ and $n'$ the middle value of $a'$.<br /><br />Now, first note that<br />$$(m, m, M) \sim_W (m, M, M)$$ After all, $(m, m, M) \sim_W (M, m, m)$ by Permutation Invariance. And, by Coarse-Graining Invariance, $(m, M, M) \sim_W (M, m, m)$ iff $(m, M) \sim_{W'} (M, m)$, where $W' = \{w_1, w_2 \cup w_3\}$. And, by Permutation Invariance and the reflexivity of $\sim_{W'}$, $(m, M) \sim_{W'} (M, m)$. So $(m, M, M) \sim_W (M, m, m) \sim_W (m, m, M)$, as required. And now we have, by previous results, Permutation Invariance, and Weak Dominance: <br />$$a \sim_W (m, n, M) \preceq_W (m, M, M) \sim_W (m, m, M) \preceq_W (m, n', M) \sim_W a'$$<br />and <br />$$a' \sim_W (m, n', M) \preceq_W (m, M, M) \sim_W (m, m, M) \preceq_W (m, n, M) \sim_W a$$ <br />And so, by transitivity, $a \sim_W a'$.
That gives (H1).<br /><br />For (H2), suppose $a$ has worst case $m$, middle case $n$, and best case $M$, while $a'$ has worst case $m'$, middle case $n'$, and best case $M'$. And suppose $m < m'$ and $M < M'$. Then$$a \sim_W (m, n, M) \preceq_W (m, M, M) \sim_W (m, m, M) \prec_W (m', n', M') \sim_W a'$$as required. $\Box$<br /><br />In a follow-up blog post, I'd like to explore Hurwicz's conditions (A1-4) in more detail. I'm a fan of his approach, not least because I want to use something like his decision rule within the framework of accuracy-first epistemology to understand how we select our first credences -- our ur-priors or superbaby credences (see <a href="https://www.cambridge.org/core/journals/episteme/article/jamesian-epistemology-formalised-an-explication-of-the-will-to-believe/5DD3912B582124D812DFAC948CE75BF3" target="_blank">here</a>). But I now think Hurwicz's focus on only the worst-case and best-case scenarios is too restrictive. So I have to grapple with the theorem I've just presented. That's what I hope to do in the next post. But here's a quick observation. (A1-4), while plausible at first sight, sail very close to inconsistency. For instance, (A1), (A3), and (A4) are inconsistent when combined with a slight strengthening of (A2). Suppose we add the following to (A2) to give (A2$^\star$):<br /><br />3. If $a(w) \leq a'(w)$ for all $w$ in $W$ and $a(w) < a'(w)$ for some $w$ in $W$, then $a \prec_W a'$.<br /><br />Then we know from above that $(m, m, M) \sim_W (m, M, M)$, but (A2$^\star$) entails that $(m, m, M) \prec_W (m, M, M)$, which gives a contradiction.
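To see exactly how much, and how little, Hurwicz's Rule determines, here is a minimal Python sketch of the comparison that (H1) and (H2) induce. This is my own illustration, not anything from Hurwicz's paper, and the helper name <code>hurwicz_compare</code> is hypothetical; options are represented as tuples of utilities, one per world.<br /><br />

```python
# A sketch of the preference relation pinned down by Hurwicz's Rule:
# only the worst and best cases of an option matter.

def hurwicz_compare(a, b):
    """Compare two options using only their minima and maxima.

    Returns 'indifferent' when (H1) applies, 'first' or 'second' for the
    strictly preferred option when (H2) applies, and None when the rule
    is silent (the extremes pull in opposite directions).
    """
    if min(a) == min(b) and max(a) == max(b):
        return 'indifferent'          # (H1): same worst and best case
    if min(a) < min(b) and max(a) < max(b):
        return 'second'               # (H2): b better in both extremes
    if min(b) < min(a) and max(b) < max(a):
        return 'first'
    return None

# (H1): the middle values are ignored entirely.
print(hurwicz_compare((0, 3, 10), (0, 7, 10)))   # indifferent
# (H2): a strictly better worst case AND best case wins.
print(hurwicz_compare((0, 5, 8), (1, 2, 9)))     # second
# The rule is silent when min and max disagree.
print(hurwicz_compare((0, 5, 10), (2, 3, 4)))    # None
```

Everything strictly between the extremes drops out of the comparison, which is exactly the feature of Hurwicz's Rule that I go on to question below.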
Richard Pettigrew<br /><br /><b>Update on updating -- or: a fall from favour</b> (6 July 2020)<br /><br />For a PDF version of this post, click <a href="https://drive.google.com/file/d/1ITjFYDAloTZQkKl1MVVqZkwtxfWTzWhV/view?usp=sharing" target="_blank">here</a>. <br /><br />Life comes at you fast. Last week, I wrote <a href="https://m-phi.blogspot.com/2020/07/updating-by-minimizing-expected.html" target="_blank">a blogpost</a> extolling the virtues of the following scoring rule, which I called the enhanced log rule: $$\mathfrak{l}^\star_1(x) = -\log x + x \ \ \ \ \ \mbox{and}\ \ \ \ \ \ \ \mathfrak{l}^\star_0(x) = x$$I noted that it is strictly proper and therefore furnishes an accuracy dominance argument for Probabilism. And I showed that, if we restrict attention to credence functions defined over partitions, rather than full algebras, it is the unique strictly proper scoring rule that delivers Conditionalization when you ask for the posterior that minimizes expected inaccuracy with respect to the prior and under the constraint that the posterior credence in the evidence must be 1. But then Catrin Campbell-Moore asked the natural question: what happens when you focus attention instead on full algebras rather than partitions? And looking into this revealed that things don't look so rosy for the enhanced log score. Indeed, if we focus just on the algebra built over three possible worlds, we see that every strictly proper scoring rule delivers the same updating rule, and it is not Conditionalization.<br /><br />Let's see this in more detail. First, let $\mathcal{W} = \{w_1, w_2, w_3\}$ be our set of possible worlds. And let $\mathcal{F}$ be the algebra over $\mathcal{W}$.
That is, $\mathcal{F}$ contains the singletons $\{w_1\}$, $\{w_2\}$, $\{w_3\}$, the pairs $\{w_1, w_2\}$, $\{w_1, w_3\}$, and $\{w_2, w_3\}$, and the tautology $\{w_1, w_2, w_3\}$ (together with the empty set, which contributes only a constant to the sums below and so can be ignored). Now suppose that your prior credence function is $(p_1, p_2, p_3 = 1-p_1-p_2)$. And suppose that you learn evidence $E = \{w_1, w_2\}$. Then we want to find the posterior, among those that assign credence 1 to $E$, that minimizes expected inaccuracy. Such a posterior will have the form $(x, 1-x, 0)$. Now let $\mathfrak{s}$ be the strictly proper scoring rule by which you measure inaccuracy. Then you wish to minimize:<br />\begin{eqnarray*}<br />&& p_1[\mathfrak{s}_1(x) + \mathfrak{s}_0(1-x) + \mathfrak{s}_0(0) + \mathfrak{s}_1(x+(1-x)) + \mathfrak{s}_1(x+0) + \mathfrak{s}_0((1-x)+0)] + \\<br />&& p_2[\mathfrak{s}_0(x) + \mathfrak{s}_1(1-x) + \mathfrak{s}_0(0) + \mathfrak{s}_1(x+(1-x)) + \mathfrak{s}_0(x+0) + \mathfrak{s}_1((1-x)+0)] +\\<br />&& p_3[\mathfrak{s}_0(x) + \mathfrak{s}_0(1-x) + \mathfrak{s}_1(0) + \mathfrak{s}_0(x+(1-x)) + \mathfrak{s}_1(x+0) + \mathfrak{s}_1((1-x)+0)] <br />\end{eqnarray*}<br />Now, ignore the constant terms, since they do not affect the minima; replace $p_3$ with $1-p_1-p_2$; and group terms together. Then we get:<br />\begin{eqnarray*}<br />&& \mathfrak{s}_1(x)(1+p_1 - p_2) + \mathfrak{s}_1(1-x)(1-p_1 + p_2) + \\<br />&& \mathfrak{s}_0(x)(1-p_1 + p_2) + \mathfrak{s}_0(1-x)(1+p_1 - p_2)<br />\end{eqnarray*}<br />Now, divide through by 2, which again doesn't affect the minimization, and note that $$\frac{1+p_i-p_j}{2} = p_i + \frac{1-p_i-p_j}{2}$$ Then we have<br />\begin{eqnarray*}<br />&& (p_1 + \frac{1-p_1-p_2}{2})\mathfrak{s}_1(x) + (p_2 + \frac{1-p_1-p_2}{2})\mathfrak{s}_0(x) + \\<br />&& (p_2 + \frac{1-p_1-p_2}{2})\mathfrak{s}_1(1-x) + (p_1 + \frac{1-p_1-p_2}{2})\mathfrak{s}_0(1-x) <br />\end{eqnarray*} <br />Now, $\mathfrak{s}$ is strictly proper. And $p_2 + \frac{1 -p_1 -p_2}{2} = 1 - (p_1 + \frac{1-p_1-p_2}{2})$.
So providing $p_1 + \frac{1-p_1-p_2}{2} \leq 1$ and $p_2 + \frac{1-p_1-p_2}{2} \leq 1$, the posterior that minimizes expected inaccuracy from the point of view of the prior and that assigns credence 1 to $E$ is $(x, 1-x, 0)$ where:$$x = p_1 + \frac{1-p_1-p_2}{2}\ \ \ \ \mbox{and}\ \ \ \ 1-x = p_2 + \frac{1-p_1-p_2}{2}$$And this is very much not Conditionalization. It turns out, then, that no strictly proper scoring rule gives Conditionalization on full algebras in this manner.<br /><br />Richard Pettigrew<br /><br /><b>Updating by minimizing expected inaccuracy -- or: my new favourite scoring rule</b> (3 July 2020)<br /><br />For a PDF version of this post, click <a href="https://drive.google.com/file/d/16gsOrCiW2zsSn8y4NEOe0F2JOW0VcIee/view?usp=sharing" target="_blank">here</a>. <br /><br />One of the central questions of Bayesian epistemology concerns how you should update your credences in response to new evidence you obtain. The proposal I want to discuss here belongs to an approach that consists of two steps. First, we specify the constraints that your evidence places on your posterior credences. Second, we specify a means by which to survey the credence functions that satisfy those constraints and pick one to adopt as your posterior.<br /><br />For instance, in the first step, we might say that when we learn a proposition $E$, we must become certain of it, and so it imposes the following constraint on our posterior credence function $Q$: $Q(E) = 1$. Or we might consider the sort of situation Richard Jeffrey discussed, where there is a partition $E_1, \ldots, E_m$ and credences $q_1, \ldots, q_m$ with $q_1 + \ldots + q_m = 1$ such that your evidence imposes the constraint: $Q(E_i) = q_i$, for $i = 1, \ldots, m$.
Or the situation van Fraassen discussed, where your evidence constrains your posterior conditional credences, so that there is a credence $q$ and propositions $A$ and $B$ such that your evidence imposes the constraint: $Q(A|B) = q$.<br /><br />In the second step of the approach, on the other hand, we might follow objective Bayesians like Jon Williamson, Alena Vencovská, and Jeff Paris and say that, from among those credence functions that respect your evidence, you should pick the one that, on a natural measure of informational content, contains minimal information, and which thus goes beyond your evidence as little as possible (Paris & Vencovská 1990, Williamson 2010). Or we might follow what I call the method of minimal mutilation proposed by Persi Diaconis and Sandy Zabell and pick the credence function among those that respect the evidence that is closest to your prior according to some measure of divergence between probability functions <a href="https://amstat.tandfonline.com/doi/abs/10.1080/01621459.1982.10477893" target="_blank">(Diaconis & Zabell 1982)</a>. Or we might proceed as Hannes Leitgeb and I suggested and pick the credence function that minimizes expected inaccuracy from the point of view of your prior, while satisfying the constraints the evidence imposes <a href="https://www.journals.uchicago.edu/doi/abs/10.1086/651318" target="_blank">(Leitgeb & Pettigrew 2010)</a>. In this post, I'd like to fix a problem with the latter proposal.<br /><br />We'll focus on the simplest case: you learn $E$ and this requires you to adopt a posterior $Q$ such that $Q(E) = 1$. This is also the case in which the norm governing it is least controversial. The largely undisputed norm in this case says that you should conditionalize your prior on your evidence, so that, if $P$ is your prior and $P(E) > 0$, then your posterior should be $Q(-) = P(-|E)$.
That is, providing you assigned a positive credence to $E$ before you learned it, your credence in the proposition $X$ after learning $E$ should be your prior credence in $X$ conditional on $E$.<br /><br />In order to make the maths as simple as possible, let's assume you assign credences to a finite set of worlds $\{w_1, \ldots, w_n\}$, which forms a partition of logical space. Given a credence function $P$, we write $p_i$ for $P(w_i)$, and we'll sometimes represent $P$ by the vector $(p_1, \ldots, p_n)$. Let's suppose further that your measure of the inaccuracy of a credence function is $\mathfrak{I}$, which is generated additively from a scoring rule $\mathfrak{s}$. That is,<br /><ul><li>$\mathfrak{s}_1(x)$ measures the inaccuracy of credence $x$ in a truth;</li><li>$\mathfrak{s}_0(x)$ measures the inaccuracy of credence $x$ in a falsehood;</li><li>$\mathfrak{I}(P, w_i) = \mathfrak{s}_0(p_1) + \ldots + \mathfrak{s}_0(p_{i-1}) + \mathfrak{s}_1(p_i) + \mathfrak{s}_0(p_{i+1}) + \ldots + \mathfrak{s}_0(p_n)$.</li></ul>Hannes and I then proposed that, if $P$ is your prior, you should adopt as your posterior the credence function $Q$ such that<br /><ol><li>$Q(E) = 1$;</li><li>for any other credence function $Q^\star$ for which $Q^\star(E) = 1$, the expected inaccuracy of $Q$ by the lights of $P$ is less than the expected inaccuracy of $Q^\star$ by the lights of $P$. </li></ol>Throughout, we'll denote the expected inaccuracy of $Q$ by the lights of $P$ when inaccuracy is measured by $\mathfrak{I}$ as $\mathrm{Exp}_\mathfrak{I}(Q | P)$. Thus,<br />$$ \mathrm{Exp}_\mathfrak{I}(Q | P) = \sum^n_{i=1} p_i \mathfrak{I}(Q, w_i)$$<br />At this point, however, a problem arises. There are two inaccuracy measures that tend to be used in statistics and accuracy-first epistemology.
The first is the <i>Brier inaccuracy measure</i> $\mathfrak{B}$, which is generated by the <i>quadratic scoring rule</i> $\mathfrak{q}$:<br />$$\mathfrak{q}_0(x) = x^2\ \ \ \mbox{and}\ \ \ \ \mathfrak{q}_1(x) = (1-x)^2$$<br />So<br />$$\mathfrak{B}(P, w_i) = 1-2p_i + \sum^n_{j=1} p_j^2$$<br />The second is the <i>local log inaccuracy measure</i> $\mathfrak{L}$, which is generated by what I'll call here the <i>basic log score</i> $\mathfrak{l}$:<br />$$\mathfrak{l}_0(x) = 0\ \ \ \ \mbox{and}\ \ \ \ \mathfrak{l}_1(x) = -\log x$$<br />So<br />$$\mathfrak{L}(P, w_i) = -\log p_i$$<br />The problem is that both have undesirable features for this purpose: the Brier inaccuracy measure does not deliver Conditionalization when you take the approach Hannes and I described; the local log inaccuracy measure does give Conditionalization, but while it is strictly proper in a weak sense, the basic log score that generates it is not; and relatedly, but more importantly, the local log inaccuracy measure does not furnish an accuracy dominance argument for Probabilism. Let's work through this in more detail.<br /><br />According to the standard Bayesian norm of Conditionalization, if $P$ is your prior and $P(E) > 0$, then your posterior after learning at most $E$ should be $Q(-) = P(-|E)$. That is, when I remove all credence from the worlds at which my evidence is false, in order to respect my new evidence, I should redistribute it to the worlds at which my evidence is true <i>in proportion to my prior credence in those worlds</i>.<br /><br />Now suppose that I update instead by picking the posterior $Q$ for which $Q(E) = 1$ and that minimizes expected inaccuracy as measured by the Brier inaccuracy measure.
Then, at least in most cases, when I remove all credence from the worlds at which my evidence is false, in order to respect my new evidence, I redistribute it <i>equally to the worlds at which my evidence is true</i>---not in proportion to my prior credence in those worlds, but equally to each, regardless of my prior attitude.<br /><br />Here's a quick illustration in the case in which you distribute your credences over three worlds, $w_1$, $w_2$, $w_3$, and the proposition you learn is $E = \{w_1, w_2\}$. Then we want to find a posterior $Q = (x, 1-x, 0)$ with minimal expected Brier inaccuracy from the point of view of the prior $P = (p_1, p_2, p_3)$. Then:<br />\begin{eqnarray*}<br />& & \mathrm{Exp}_\mathfrak{B}((x, 1-x, 0) | (p_1, p_2, p_3))\\<br />& = & p_1[(1-x)^2 + (1-x)^2 + 0^2] + p_2[x^2 + x^2 + 0^2] + p_3[x^2 + (1-x)^2 + 1]<br />\end{eqnarray*} <br />Differentiating this with respect to $x$ gives $$-4p_1 + 4x - 2p_3$$ which equals 0 iff $$x = p_1 + \frac{p_3}{2}$$ Thus, providing $p_1 + \frac{p_3}{2}, p_2 + \frac{p_3}{2} \leq 1$, then the posterior that minimizes expected Brier inaccuracy while respecting the evidence is $$Q = \left (p_1 + \frac{p_3}{2}, p_2 + \frac{p_3}{2}, 0 \right )$$ And this is typically not what Conditionalization demands.<br /><br />Now turn to the local log measure, $\mathfrak{L}$. Here, things are actually a little complicated by the fact that $-\log 0 = \infty$. After all, $$\mathrm{Exp}_\mathfrak{L}((x, 1-x, 0)|(p_1, p_2, p_3)) = -p_1\log x - p_2 \log (1-x) - p_3 \log 0$$ and this is $\infty$ regardless of the value of $x$. So every value of $x$ minimizes, and indeed maximizes, this expectation. As a result, we have to look at the situation in which the evidence imposes the constraint $Q(E) = 1-\varepsilon$ for $\varepsilon > 0$, and ask what happens as we let $\varepsilon$ approach 0.
Then<br />$$\mathrm{Exp}_\mathfrak{L}((x, 1-\varepsilon-x, \varepsilon)|(p_1, p_2, p_3)) = -p_1\log x - p_2 \log (1-\varepsilon-x) - p_3 \log \varepsilon$$<br />Differentiating this with respect to $x$ gives <br />$$-\frac{p_1}{x} + \frac{p_2}{1-\varepsilon - x}$$<br />which equals 0 iff <br />$$x = (1-\varepsilon) \frac{p_1}{p_1 + p_2}$$<br />And this approaches Conditionalization as $\varepsilon$ approaches 0. So, in this sense, as Ben Levinstein pointed out, the local log inaccuracy measure gives Conditionalization, and indeed Jeffrey Conditionalization or Probability Kinematics as well <a href="https://doi.org/10.1086/666064" target="_blank">(Levinstein 2012)</a>. So far, so good. <br /><br />However, throughout this post, and in the two derivations above---the first concerning the Brier inaccuracy measure and the second concerning the local log inaccuracy measure---we assumed that all credence functions must be probability functions. That is, we assumed Probabilism, the other central tenet of Bayesianism alongside Conditionalization. Now, if we measure inaccuracy using the Brier measure, we can justify that, for then we have the accuracy dominance argument, which originated mathematically with Bruno de Finetti, and was given its accuracy-theoretic philosophical spin by Jim Joyce (de Finetti 1974, Joyce 1998). That is, if your prior or your posterior isn't a probability function, then there is an alternative that is and that is guaranteed to be more Brier-accurate. However, the local log inaccuracy measure doesn't furnish us with any such argument. One very easy way to see this is to note that the non-probabilistic credence function $(1, 1, \ldots, 1)$ over $\{w_1, \ldots, w_n\}$ dominates <i>all other credence functions</i> according to the local log measure. After all, $\mathfrak{L}((1, 1, \ldots, 1), w_i) = -\log 1 = 0$, for $i = 1, \ldots, n$, while $\mathfrak{L}(P, w_i) > 0$ for any $P$ with $p_i < 1$ for some $i = 1, \ldots, n$. 
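To make the contrast with Conditionalization vivid, here is a quick brute-force numerical check, of my own devising, of the claim that minimizing expected Brier inaccuracy splits the credence in the excluded world evenly between the remaining worlds; the prior $(0.5, 0.2, 0.3)$ and the grid resolution are arbitrary choices.<br /><br />

```python
# Brute-force check: the expected-Brier minimiser redistributes the credence
# in the excluded world w3 equally, not in proportion to the prior.
# (Illustrative sketch only; the prior below is an arbitrary example.)

def expected_brier(x, p1, p2, p3):
    # Expected Brier inaccuracy of the posterior (x, 1-x, 0) by the lights
    # of the prior (p1, p2, p3), over the three-world partition.
    return (p1 * ((1 - x) ** 2 + (1 - x) ** 2 + 0 ** 2)
            + p2 * (x ** 2 + x ** 2 + 0 ** 2)
            + p3 * (x ** 2 + (1 - x) ** 2 + 1))

p1, p2, p3 = 0.5, 0.2, 0.3

# Search a fine grid of candidate posteriors (x, 1-x, 0).
grid = [i / 100000 for i in range(100001)]
x_best = min(grid, key=lambda x: expected_brier(x, p1, p2, p3))

print(round(x_best, 4))          # 0.65 = p1 + p3/2: equal redistribution
print(round(p1 / (p1 + p2), 4))  # 0.7143: what Conditionalization would give
```

The minimizer is $p_1 + p_3/2$, splitting $p_3$ evenly between the two $E$-worlds, rather than the conditionalized value $p_1/(p_1 + p_2)$.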
<br /><br />Another related issue is that the scoring rule $\mathfrak{l}$ that generates $\mathfrak{L}$ is not strictly proper. A scoring rule $\mathfrak{s}$ is said to be strictly proper if every credence expects itself to be the best. That is, for any $0 \leq p \leq 1$, $p\mathfrak{s}_1(x) + (1-p) \mathfrak{s}_0(x)$ is minimized, as a function of $x$, at $x = p$. But $-p\log x + (1-p)0 = -p\log x$ is always minimized, as a function of $x$, at $x = 1$, where $-p\log x = 0$. Similarly, an inaccuracy measure $\mathfrak{I}$ is strictly proper if, for any probabilistic credence function $P$, $\mathrm{Exp}_\mathfrak{I}(Q | P) = \sum^n_{i=1} p_i \mathfrak{I}(Q, w_i)$ is minimized, as a function of $Q$, at $Q = P$. Now, in this sense, $\mathfrak{L}$ is not strictly proper, since $\mathrm{Exp}_\mathfrak{L}(Q | P) = \sum^n_{i=1} p_i \mathfrak{L}(Q, w_i)$ is minimized, as a function of $Q$, at $Q = (1, 1, \ldots, 1)$, as noted above. Nonetheless, if we restrict our attention to probabilistic $Q$, $\mathrm{Exp}_\mathfrak{L}(Q | P) = \sum^n_{i=1} p_i \mathfrak{L}(Q, w_i)$ is minimized at $Q = P$. In sum: $\mathfrak{L}$ is only a reasonable inaccuracy measure to use if you already have an independent motivation for Probabilism. But accuracy-first epistemology does not have that luxury. One of the central roles of an inaccuracy measure in that framework is to furnish an accuracy dominance argument for Probabilism.<br /><br />So, we ask: is there a scoring rule $\mathfrak{s}$ and resulting inaccuracy measure $\mathfrak{I}$ such that:<br /><ol><li>$\mathfrak{s}$ is a strictly proper scoring rule;</li><li>$\mathfrak{I}$ is a strictly proper inaccuracy measure; </li><li>$\mathfrak{I}$ furnishes an accuracy dominance argument for Probabilism;</li><li>If $P(E) > 0$, then $\mathrm{Exp}_\mathfrak{I}(Q | P)$ is minimized, as a function of $Q$ among credence functions for which $Q(E) = 1$, at $Q(-) = P(-|E)$.</li></ol>Straightforwardly, (1) entails (2).
And, by a result due to <a href="https://ieeexplore.ieee.org/document/5238758" target="_blank">Predd, et al.</a>, (1) also entails (3) (Predd et al. 2009). So we seek $\mathfrak{s}$ with (1) and (4). Theorem 1 below shows that essentially only one such $\mathfrak{s}$ and $\mathfrak{I}$ exist and they are what I will call the <i>enhanced log score</i> $\mathfrak{l}^\star$ and the <i>enhanced log inaccuracy measure $\mathfrak{L}^\star$</i>:<br />$$\mathfrak{l}^\star_0(x) = x\ \ \ \ \mathrm{and}\ \ \ \ \mathfrak{l}^\star_1(x) = -\log x + x$$<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-7vxZ5CP5wRk/Xv7hM5GV8MI/AAAAAAAAEyU/igefz95p5w8-ofxcuRaOr8TRgRkQywZowCLcBGAsYHQ/s1600/enhanced-log.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="444" data-original-width="720" height="246" src="https://1.bp.blogspot.com/-7vxZ5CP5wRk/Xv7hM5GV8MI/AAAAAAAAEyU/igefz95p5w8-ofxcuRaOr8TRgRkQywZowCLcBGAsYHQ/s400/enhanced-log.jpeg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The enhanced log score $\mathfrak{l}^\star$. $\mathfrak{s}_0$ in yellow; $\mathfrak{s}_1$ in blue.</td></tr></tbody></table><br /><br />Before we state and prove the theorem, there are some features of this scoring rule and its resulting inaccuracy measure that are worth noting. Juergen Landes has previously identified this scoring rule, for a different purpose <a href="https://doi.org/10.1016/j.ijar.2015.05.007" target="_blank">(Proposition 9.1, Landes 2015)</a>.<br /><br /><br /><b>Proposition 1 </b><i>$\mathfrak{l}^\star$ is strictly proper</i>.<br /><br /><i>Proof.</i> Suppose $0 \leq p \leq 1$.
Then<br />$$\frac{d}{dx}\left( p\mathfrak{l}^\star_1(x) + (1-p)\mathfrak{l}^\star_0(x) \right) = \frac{d}{dx}\left( p[-\log x + x] + (1-p)x \right) = -\frac{p}{x} + 1 = 0$$ iff $p = x$. $\Box$<br /><br /><b>Proposition 2</b> <i>If $P$ is non-probabilistic, then $P^\star = \left (\frac{p_1}{\sum_k p_k}, \ldots, \frac{p_n}{\sum_k p_k} \right )$ accuracy dominates $P = (p_1, \ldots, p_n)$</i>.<br /><br /><i>Proof</i>. $$\mathfrak{L}^\star(P^\star, w_i) = -\log\left ( \frac{p_i}{\sum_k p_k} \right ) + 1 = -\log p_i + \log\sum_k p_k + 1$$ and $$\mathfrak{L}^\star(P, w_i) = -\log p_i + \sum_k p_k$$ But $\log x + 1 \leq x$, for all $x> 0$, with equality iff $x = 1$. So, if $P$ is non-probabilistic, then $\sum_k p_k \neq 1$ and $$\mathfrak{L}^\star(P^\star, w_i) < \mathfrak{L}^\star(P, w_i)$$ for $i = 1, \ldots, n$. $\Box$<br /><br /><b>Proposition 3</b> <i>If $P$ is probabilistic, $\mathfrak{L}^\star(P, w_i) = 1 + \mathfrak{L}(P, w_i)$</i>.<br /><br /><i>Proof</i>.<br />\begin{eqnarray*}<br />\mathfrak{L}^\star(P, w_i) & = & p_1 + \ldots + p_{i-1} + (-\log p_i + p_i ) + p_{i+1} + \ldots + p_n \\<br />& = & -\log p_i + 1 \\<br />& = & 1 + \mathfrak{L}(P, w_i) <br />\end{eqnarray*} <br /> $\Box$<br /><br /><b>Corollary 1</b> <i>If $P$, $Q$ are probabilistic, then</i><br />$$\mathrm{Exp}_{\mathfrak{L}^\star}(Q | P) = 1 + \mathrm{Exp}_\mathfrak{L}(Q | P)$$<br /><br /><i>Proof</i>. By Proposition 3. $\Box$<br /><br /><b>Corollary 2 </b><i>Suppose $E_1, \ldots, E_m$ is a partition and $0 \leq q_1, \ldots, q_m \leq 1$ with $\sum^m_{i=1} q_i = 1$. Then, among $Q$ for which $Q(E_i) = q_i$ for $i = 1, \ldots, m$, $\mathrm{Exp}_{\mathfrak{L}^\star}(Q |P)$ is minimized at the Jeffrey Conditionalization posterior $Q(-) = \sum^m_{i=1} q_iP(-|E_i)$</i>.<br /><br /><i>Proof</i>. This follows from Corollary 1 and Theorem 5.1 from (Diaconis & Zabell 1982).
$\Box$<br /><br />Having seen $\mathfrak{l}^\star$ and $\mathfrak{L}^\star$ in action, let's see that they are unique in having this combination of features.<br /><br /><b>Theorem 1</b> <i>Suppose $\mathfrak{s}$ is a strictly proper scoring rule and $\mathfrak{I}$ is the inaccuracy measure it generates. And suppose that, for any $\{w_1, \ldots, w_n\}$ and any $E \subseteq \{w_1, \ldots, w_n\}$, and any probabilistic credence function $P$, the probabilistic credence function $Q$ that minimizes the expected inaccuracy of $Q$ with respect to $P$ with the constraint $Q(E) = 1$, and when inaccuracy is measured by $\mathfrak{I}$, is $Q(-) = P(-|E)$. Then the scoring rule is</i><br /><i>$$\mathfrak{s}_1(x) = -\log x +x\ \ \ \ \mbox{and}\ \ \ \ \mathfrak{s}_0(x) = x$$ or any affine transformation of this</i>.<br /><br /><i>Proof. </i>First, we appeal to the following lemma (Proposition 2, Predd, et al. 2009):<br /><br /><b>Lemma 1</b><br /><br />(i) <i>Suppose $\mathfrak{s}$ is a continuous strictly proper scoring rule. Then define$$\varphi_\mathfrak{s}(x) = -x\mathfrak{s}_1(x) - (1-x)\mathfrak{s}_0(x)$$Then $\varphi_\mathfrak{s}$ is differentiable on $(0, 1)$ and convex on $[0, 1]$ and $$\mathrm{Exp}_\mathfrak{I}(Q | P) - \mathrm{Exp}_\mathfrak{I}(P | P) = \sum^n_{i=1} \varphi_\mathfrak{s}(p_i) - \varphi_\mathfrak{s}(q_i) - \varphi_\mathfrak{s}^\prime (q_i)(p_i - q_i)$$</i> (ii)<i> Suppose $\varphi$ is differentiable on $(0, 1)$ and convex on $[0, 1]$. Then let</i><br /><ul><li><i>$\mathfrak{s}^\varphi_1(x) = - \varphi(x) - \varphi'(x)(1-x)$ </i></li><li><i>$\mathfrak{s}^\varphi_0(x) = - \varphi(x) - \varphi'(x)(0-x)$</i></li></ul><i>Then $\mathfrak{s}^\varphi$ is a strictly proper scoring rule.</i><br /><i><br /></i><i>Moreover, $\mathfrak{s}^{\varphi_\mathfrak{s}} = \mathfrak{s}$.</i><br /><br />Now, let's focus on $\{w_1, w_2, w_3, w_4\}$ and let $E = \{w_1, w_2, w_3\}$. Let $p_1 = a$, $p_2 = b$, $p_3 = c$. 
Then we wish to minimize<br />$$\mathrm{Exp}_\mathfrak{I}((x, y, 1-x-y, 0) | (a, b, c, 1-a-b-c))$$<br />Now, by Lemma 1, <br />\begin{eqnarray*}<br />&& \mathrm{Exp}_\mathfrak{I}((x, y, 1-x-y, 0) | (a, b, c, 1-a-b-c)) \\<br />& = & \varphi(a) - \varphi(x) - \varphi'(x)(a-x)\\<br />& + & \varphi(b) - \varphi(y) - \varphi'(y)(b-y) \\ <br />& + & \varphi(c) - \varphi(1-x-y) - \varphi'(1-x-y)(c - (1-x-y)) \\<br />& + & \mathrm{Exp}_\mathfrak{I}((a, b, c, 1-a-b-c) | (a, b, c, 1-a-b-c)) <br />\end{eqnarray*} <br />(Here we omit the summand for the fourth world, $\varphi(1-a-b-c) - \varphi(0) - \varphi'(0)(1-a-b-c)$, since it does not depend on $x$ or $y$ and so does not affect the minimization.) Thus:<br />\begin{eqnarray*}<br />&& \frac{\partial}{\partial x} \mathrm{Exp}_\mathfrak{I}((x, y, 1-x-y, 0) | (a, b, c, 1-a-b-c))\\<br />& = & \varphi''(x)(x-a) - ((1-x-y) - c) \varphi''(1-x-y)<br />\end{eqnarray*} <br />and<br />\begin{eqnarray*}<br />&& \frac{\partial}{\partial y} \mathrm{Exp}_\mathfrak{I}((x, y, 1-x-y, 0) | (a, b, c, 1-a-b-c))\\<br />& = & \varphi''(y)(y-b) - ((1-x-y) - c) \varphi''(1-x-y)<br />\end{eqnarray*} <br />which are both 0 iff$$\varphi''(x)(x-a) = \varphi''(y)(y-b) = ((1-x-y) - c) \varphi''(1-x-y)$$ Now, suppose this is true for $x = \frac{a}{a+b+c}$ and $y = \frac{b}{a + b+ c}$. Then, for all $0 \leq a, b, c \leq 1$ with $a + b + c \leq 1$, $$a\varphi'' \left ( \frac{a}{a+b+c} \right ) = b\varphi'' \left ( \frac{b}{a+b+c} \right ) $$<br />We now wish to show that $\varphi''(x) = \frac{k}{x}$ for all $0 < x \leq 1$. If we manage that, then it follows that $\varphi'(x) = k\log x + m$ and $\varphi(x) = kx\log x + (m-k)x$. And we know from Lemma 1:<br />\begin{eqnarray*}<br />& & \mathfrak{s}_0(x) \\<br />& = & - \varphi(x) - \varphi'(x)(0-x) \\<br />& = & - [kx\log x + (m-k)x] - [k\log x + m](0-x) \\<br />& = & kx<br />\end{eqnarray*}<br />and<br />\begin{eqnarray*}<br />&& \mathfrak{s}_1(x) \\<br />& = & - \varphi(x) - \varphi'(x)(1-x) \\<br />& = & - [kx\log x + (m-k)x] - [k\log x + m](1-x) \\<br />& = & -k\log x + kx - m <br />\end{eqnarray*}<br />Now, first, let $f(x) = \varphi''\left (\frac{1}{x} \right )$.
Thus, it will suffice to prove that $f(x) = kx$. For then $\varphi''(x) = \varphi''\left (\frac{1}{\frac{1}{x}} \right ) = f \left ( \frac{1}{x} \right ) = \frac{k}{x}$, as required. And to prove this, we need only show that $f'(x)$ is a constant function, $k$ say. For then $f(x) = kx + d$; and substituting this into the identity below gives $k(a+b+c) + da = k(a+b+c) + db$ for all suitable $a$ and $b$, which forces $d = 0$. We know that, for all $0 \leq a, b, c \leq 1$ with $a + b + c \leq 1$, we have<br />$$a f \left ( \frac{a + b + c}{a} \right ) = bf \left ( \frac{a + b + c}{b} \right )$$<br />So$$<br />\frac{d}{dx} a f \left ( \frac{a + b + x}{a} \right ) = \frac{d}{dx} bf \left ( \frac{a + b + x}{b} \right )<br />$$So, for all $0 \leq a, b, c \leq 1$ with $a + b + c \leq 1$<br />$$<br />f'\left (\frac{a+b+c}{a} \right ) = f'\left (\frac{a + b + c}{b} \right )<br />$$We now show that, for all $x \geq 1$, $f'(x) = f'(2)$, which will suffice to show that it is constant. First, we consider $2 \leq x$. Then let<br />$$a = \frac{1}{x}\ \ \ \ \ b = \frac{1}{2}\ \ \ \ \ c = \frac{1}{2}-\frac{1}{x}$$<br />Then<br />$$f'(x) = f'\left (\frac{a + b + c}{a} \right ) = f'\left (\frac{a + b + c}{b} \right ) = f'(2)$$<br />Second, consider $1 \leq x \leq 2$. Then pick $2 \leq y$ such that $\frac{1}{x} + \frac{1}{y} \leq 1$. Then let<br />$$a = \frac{1}{x}\ \ \ \ \ b = \frac{1}{y}\ \ \ \ \ c = 1 - \frac{1}{x} - \frac{1}{y}$$<br />Then<br />$$f'(x) = f'\left (\frac{a + b + c}{a} \right ) = f'\left (\frac{a + b + c}{b} \right ) = f'(y) = f'(2)$$<br />as required. $\Box$<br /><br />Richard Pettigrew<br /><br /><b>Deterministic updating and the symmetry argument for Conditionalization</b> (6 December 2019)<br /><br />According to the Bayesian, when I learn a proposition to which I assign a positive credence, I should update my credences so that my new unconditional credence in a proposition is my old conditional credence in that proposition conditional on the proposition I learned.
Thus, if $c$ is my credence function before I learn $E$, and $c'$ is my credence function afterwards, and $c(E) > 0$, then it ought to be the case that $$c'(-) = c(-|E) := \frac{c(-\ \&\ E)}{c(E)}$$ There are many arguments for this Bayesian norm of updating. Some pay attention to the pragmatic costs of updating any other way (Brown 1976; Lewis 1999); some pay attention to the epistemic costs, which are spelled out in terms of the accuracy of the credences that result from the updating plans (Greaves & Wallace 2006; Briggs & Pettigrew 2018); others show that updating as the Bayesian requires, and only updating in that way, preserves as much as possible about the prior credences while still respecting the new evidence (Diaconis & Zabell 1982; Dietrich, List, and Bradley 2016). And then there are the symmetry arguments that are our focus here (Hughes & van Fraassen 1985; van Fraassen 1987; Grove & Halpern 1998).<br /><br />In <a href="https://link.springer.com/article/10.1007/s11098-019-01377-y" target="_blank">a recent paper</a>, I argued that the pragmatic and epistemic arguments for Bayesian updating are based on an unwarranted assumption, which I called <i>Deterministic Updating</i>. An <i>updating plan</i> says how you'll update in response to a specific piece of evidence. Such a plan is <i>deterministic</i> if there's a single credence function that it says you'll adopt in response to that evidence, rather than a range of different credence functions that you might adopt in response. Deterministic Updating says that your updating plan for a particular piece of evidence should be deterministic. 
That is, if $E$ is a proposition you might learn, your plan for responding to receiving $E$ as evidence should take the form:<br /><ul><li><i>If I learn $E$, I'll adopt $c'$</i> </li></ul>rather than the form:<br /><ul><li><i>If I learn $E$, I might adopt $c'$, I might adopt $c^+$, and I might adopt $c^*$</i>.</li></ul>Here, I want to show that the symmetry arguments make the same assumption.<br />Let's start by laying out the symmetry argument. Suppose $W$ is a set of possible worlds, and $F$ is an algebra over $W$. Then an <i>updating plan</i> on $M = (W, F)$ is a function $U^M$ that takes a credence function $P$ defined on $F$ and a proposition $E$ in $F$ and returns the set of credence functions that the updating plan endorses as responses to learning $E$ for those with credence function $P$. Then we impose three conditions on a family of updating plans $U$.<br /><br /><b>Deterministic Updating</b> This says that an updating plan should endorse at most one credence function as a response to learning a given piece of evidence. That is, for any $M = (W, F)$ and $E$ in $F$, $U^M$ endorses at most one credence function as a response to learning $E$. That is, $|U^M(P, E)| \leq 1$ for all $P$ on $F$ and $E$ in $F$.<br /><br /><b>Certainty</b> This says that any credence function that an updating plan endorses as a response to learning $E$ must be certain of $E$. That is, for any $M = (W, F)$, $P$ on $F$ and $E$ in $F$, if $P'$ is in $U^M(P, E)$, then $P'(E) = 1$.<br /><br /><b>Symmetry</b> This condition requires a bit more work to spell out. Very roughly, it says that the way that an updating plan would have you update should not be sensitive to the way the possibilities are represented. More precisely: Let $M = (W, F)$ and $M' = (W', F')$. Suppose $f : W \rightarrow W'$ is a surjective function. That is, for each $w'$ in $W'$, there is $w$ in $W$ such that $f(w) = w'$. And suppose for each $X$ in $F'$, $f^{-1}(X) = \{w \in W | f(w) \in X\}$ is in $F$. 
Then the worlds in $W'$ are coarse-grained versions of the worlds in $W$, and the propositions in $F'$ are coarse-grained versions of those in $F$. Now, given a credence function $P$ on $F$, let $f(P)$ be the credence function over $F'$ such that $f(P)(X) = P(f^{-1}(X))$. Then the credence functions that result from updating $f(P)$ by $E'$ in $F'$ using $U^{M'}$ are the image under $f$ of the credence functions that result from updating $P$ on $f^{-1}(E')$ using $U^M$. That is, $U^{M'}(f(P), E') = f(U^M(P, f^{-1}(E')))$.<br /><br />Now, van Fraassen proves the following theorem, though he doesn't phrase it like this because he assumes Deterministic Updating in his definition of an updating rule:<br /><br /><b>Theorem (van Fraassen)</b> <i>If $U$ satisfies Deterministic Updating, Certainty, and Symmetry, then $U$ is the conditionalization updating plan. That is, if $M = (W, F)$, $P$ is defined on $F$ and $E$ is in $F$ with $P(E) > 0$, then $U^M(P, E)$ contains only one credence function $P'$ and $P'(-) = P(-|E)$.</i><br /><br />The problem is that, while Certainty is entirely uncontroversial and Symmetry is very plausible, there is no particularly good reason to assume Deterministic Updating. But the argument cannot go through without it. To see this, consider the following updating rule:<br /><ul><li>If $0 < P(E) < 1$, then $V^M(P, E) = \{v_w | w \in W\ \&\ w \in E\}$, where $v_w$ is the credence function on $F$ such that $v_w(X) = 1$ if $w$ is in $X$, and $v_w(X) = 0$ if $w$ is not in $X$ ($v_w$ is sometimes called the <i>valuation function</i> for $w$, or the <i>omniscient credence function</i> at $w$).</li><li>If $P(E) = 1$, then $V^M(P, E) = \{P\}$.</li></ul>That is, if $P$ is not already certain of $E$, then $V^M$ takes any credence function on $F$ and any proposition in $F$ and returns the set of valuation functions for the worlds in $W$ at which that proposition is true. Otherwise, it keeps $P$ unchanged.<br /><br />It is easy to see that $V$ satisfies Certainty, since $v_w(E) = 1$ for each $w$ in $E$. To see that $V$ satisfies Symmetry, the crucial fact is that $f(v_w) = v_{f(w)}$. (Assume $0 < P(f^{-1}(E')) < 1$, so that the first clause of the definition of $V$ applies on both sides.) First, take a credence function in $V^{M'}(f(P), E')$: that is, $v_{w'}$ for some $w'$ in $E'$. Since $f$ is surjective, there is some $w$ in $f^{-1}(w')$; any such $w$ is in $f^{-1}(E')$, so $v_w$ is in $V^M(P, f^{-1}(E'))$. And $f(v_w) = v_{f(w)} = v_{w'}$, so $v_{w'}$ is in $f(V^M(P, f^{-1}(E')))$. Next, take a credence function in $f(V^M(P, f^{-1}(E')))$: that is, $f(v_w)$ for some $w$ in $f^{-1}(E')$. Then $f(v_w) = v_{f(w)}$ and $f(w)$ is in $E'$, so $f(v_w)$ is in $V^{M'}(f(P), E')$, as required.<br /><br />So $V$ satisfies Certainty and Symmetry, but it is not the Bayesian updating rule.<br /><br />Now, perhaps there is some further desirable condition that $V$ fails to meet? Perhaps. And it's difficult to prove a negative existential claim. But one thing we can do is to note that $V$ satisfies all the conditions on updating plans on sets of probabilities that Grove & Halpern explore as they try to extend van Fraassen's argument from the case of precise credences to the case of imprecise credences. All, that is, except Deterministic Updating, which they also impose. Here they are:<br /><br /><b>Order Invariance</b> This says that updating first on $E$ and then on $E \cap F$ should result in the same posteriors as updating first on $F$ and then on $E \cap F$. $V$ satisfies this because, either way, you end up with $$V^M(P, E \cap F) = \{v_w : w \in W\ \&\ w \in E \cap F\}$$<br /><br /><b>Stationarity</b> This says that updating on $E$ should have no effect if you are already certain of $E$. That is, if $P(E) = 1$, then $U^M(P, E) = \{P\}$. The second clause of our definition of $V$ ensures this.<br /><br /><b>Non-Triviality</b> This says that there's some prior that is less than certain of the evidence such that updating it on the evidence leads to some posteriors that the updating plan endorses.
That is, for some $M = (W, F)$, some $P$ on $F$, and some $E$ in $F$, $U^M(P, E) \neq \emptyset$. Indeed, $V$ satisfies this for any $P$ and any $E$ with $0 < P(E) < 1$.<br /><br />So, in sum, van Fraassen's symmetry argument for Bayesian updating shares the flaw of the pragmatic and epistemic arguments: it relies on Deterministic Updating, and that assumption is unwarranted.<br /><br /><h2>References</h2><ol class="BibliographyWrapper"><li class="Citation">Briggs, R. A., & Pettigrew, R. (2018). An accuracy-dominance argument for conditionalization. <em class="EmphasisTypeItalic">Noûs</em>. <a href="https://doi.org/10.1111/nous.12258" target="_blank">https://doi.org/10.1111/nous.12258</a></li><li class="Citation">Brown, P. M. (1976). Conditionalization and expected utility. <em class="EmphasisTypeItalic">Philosophy of Science</em>, <em class="EmphasisTypeItalic">43</em>(3), 415–419.</li><li class="Citation">Diaconis, P., & Zabell, S. L. (1982). Updating subjective probability. <em class="EmphasisTypeItalic">Journal of the American Statistical Association</em>, <em class="EmphasisTypeItalic">77</em>(380), 822–830.</li><li class="Citation">Dietrich, F., List, C., & Bradley, R. (2016). Belief revision generalized: A joint characterization of Bayes’s and Jeffrey’s rules. <em class="EmphasisTypeItalic">Journal of Economic Theory</em>, <em class="EmphasisTypeItalic">162</em>, 352–371.</li><li class="Citation">Greaves, H., & Wallace, D. (2006). Justifying conditionalization: Conditionalization maximizes expected epistemic utility. <em class="EmphasisTypeItalic">Mind</em>, <em class="EmphasisTypeItalic">115</em>(459), 607–632.</li><li class="Citation">Grove, A. J., & Halpern, J. Y. (1998). Updating sets of probabilities. In <em class="EmphasisTypeItalic">Proceedings of the 14th conference on uncertainty in AI</em> (pp. 173–182). San Francisco, CA: Morgan Kaufmann.</li><li class="Citation">Lewis, D. (1999). Why conditionalize? In <em class="EmphasisTypeItalic">Papers in metaphysics and epistemology</em> (pp. 403–407). Cambridge: Cambridge University Press.</li></ol><br />Richard Pettigrew<br /><br />CFP (Formal Philosophy, Gdansk)<div dir="ltr" style="text-align: left;" trbidi="on"><br />The International Conference for Philosophy of Science and Formal Methods in Philosophy (CoPS-FaM-19) of the Polish Association for Logic and Philosophy of Science will take place on December 4-6, 2019 at the University of Gdansk (in cooperation with the University of Warsaw).
Extended abstract submission: August 31, 2019.<br /><br />*Keynote speakers*<br />Hitoshi Omori (Ruhr-Universität Bochum)<br />Oystein Linnebo (University of Oslo)<br />Miriam Schoenfield (MIT)<br />Stanislav Speransky (St. Petersburg State University)<br />Katya Tentori (University of Trento)<br /><br />Full submission details available at:<br /><a href="http://lopsegdansk.blogspot.com/p/cops-fam-19-cfp.html">http://lopsegdansk.blogspot.com/p/cops-fam-19-cfp.html</a><br /><br /><br />*Programme Committee*<br />Patrick Blackburn (University of Roskilde)<br />Cezary Cieśliński (University of Warsaw)<br />Matteo Colombo (Tilburg University)<br />Juliusz Doboszewski (Harvard University)<br />David Fernandez Duque (Ghent University)<br />Benjamin Eva (University of Konstanz)<br />Benedict Eastaugh (LMU Munich)<br />Federico Faroldi (Ghent University)<br />Michał Tomasz Godziszewski (University of Warsaw)<br />Valentin Goranko (Stockholm University)<br />Rafał Gruszczyński (Nicolaus Copernicus University)<br />Alexandre Guay (University of Louvain)<br />Zalan Gyenis (Jagiellonian University)<br />Ronnie Hermens (Utrecht University)<br />Leon Horsten (University of Bristol)<br />Johannes Korbmacher (Utrecht University)<br />Louwe B. Kuijer (University of Liverpool)<br />Juergen Landes (LMU Munich)<br />Marianna Antonnutti Marfori (LMU Munich)<br />Frederik Van De Putte (Ghent University)<br />Jan-Willem Romeijn (University of Groningen)<br />Sonja Smets (University of Amsterdam)<br />Anthia Solaki (University of Amsterdam)<br />Jan Sprenger (University of Turin)<br />Stanislav Speransky (St. Petersburg State University)<br />Tom F. 
Sterkenburg (LMU Munich)<br />Johannes Stern (University of Bristol)<br />Allard Tamminga (University of Groningen)<br />Mariusz Urbański (Adam Mickiewicz University)<br />Erik Weber (Ghent University)<br />Leszek Wroński (Jagiellonian University)<br /><br />*Local Organizing Committee:*<br />Rafal Urbaniak<br />Patryk Dziurosz-Serafinowicz<br />Pavel Janda<br />Pawel Pawlowski<br />Paula Quinon<br />Weronika Majek<br />Przemek Przepiórka<br />Małgorzata Stefaniak</div>Rafal Urbaniak<br /><br />What is conditionalization and why should we do it?<br /><br />The three central tenets of traditional Bayesian epistemology are these:<br /><br /><b>Precision</b> Your doxastic state at a given time is represented by a credence function, $c$, which takes each proposition $X$ about which you have an opinion and returns a single numerical value, $c(X)$, that measures the strength of your belief in $X$. By convention, we let $0$ represent your minimal credence and we let $1$ represent your maximal credence.<br /><br /><b>Probabilism</b> Your credence function should be a probability function. That is, you should assign minimal credence (i.e. 0) to necessarily false propositions, maximal credence (i.e. 1) to necessarily true propositions, and your credence in the disjunction of two propositions whose conjunction is necessarily false should be the sum of your credences in the disjuncts.<br /><br /><b>Conditionalization</b> You should update your credences by conditionalizing on your total evidence.<br /><br />Note: Precision sets out the way in which doxastic states will be represented; Probabilism and Conditionalization are norms that are stated using that representation.<br /><br />Here, we will assume Precision and Probabilism and focus on Conditionalization.
In particular, we are interested in what exactly the norm says and, more specifically, in which versions of the norm are supported by the standard arguments in its favour. We will consider three versions of the norm and four arguments in its favour. For each combination, we'll ask whether the argument can support that version of the norm. In each case, we'll notice that the standard formulation of the argument relies on a particular assumption, which we call Deterministic Updating and which we formulate precisely below. We'll then ask whether the argument really does rely on this assumption, or whether it can be amended to support the norm without it. Let's meet the interpretations and the arguments informally now; then we'll be ready to dive into the details.<br /><br />Here are the three interpretations of Conditionalization. According to the first, Actual Conditionalization, Conditionalization governs your actual updating behaviour.<br /><br /><b>Actual Conditionalization (AC)</b> <br /><br />If<br /><ul><li>$c$ is your credence function at $t$ (we'll often refer to this as your prior);</li><li>the total evidence you receive between $t$ and $t'$ comes in the form of a proposition $E$ learned with certainty;</li><li>$c(E) > 0$;</li><li>$c'$ is your credence function at the later time $t'$ (we'll often refer to this as your posterior);</li></ul>then it should be the case that $c'(-) = c(-|E) = \frac{c(-\ \&\ E)}{c(E)}$.
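<br /><br />Before moving on, the arithmetic of AC is easy to sketch in code. The following is a minimal illustration of my own (the worlds and numbers are made up, not drawn from anything above): a credence function is represented as a map from possible worlds to probabilities, and a piece of evidence as the set of worlds at which it is true.

```python
def conditionalize(c, E):
    """Bayesian update: return c(- | E), where c maps worlds to
    probabilities and E is the set of worlds at which the evidence holds."""
    c_E = sum(p for w, p in c.items() if w in E)  # prior credence in E
    if c_E == 0:
        raise ValueError("conditionalization is undefined when c(E) = 0")
    # Worlds incompatible with E drop to credence 0; the rest are renormalised.
    return {w: (p / c_E if w in E else 0.0) for w, p in c.items()}

# A made-up prior over four worlds, and the evidence that w1 or w2 obtains:
prior = {"w1": 0.2, "w2": 0.3, "w3": 0.1, "w4": 0.4}
posterior = conditionalize(prior, {"w1", "w2"})
# posterior: w1 -> 0.4, w2 -> 0.6, w3 -> 0.0, w4 -> 0.0
```

Note that this function returns a single definite posterior for each piece of evidence; in the terminology introduced below, it describes a deterministic way of updating.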
<br />According to the second, Plan Conditionalization, Conditionalization governs the updating behaviour you would endorse in all possible evidential situations you might face:<br /><br /><b>Plan Conditionalization (PC)</b> <br /><br />If<br /><ul><li>$c$ is your credence function at $t$;</li><li>the total evidence you receive between $t$ and $t'$ will come in the form of a proposition learned with certainty, and that proposition will come from the partition $\mathcal{E} = \{E_1, \ldots, E_n\}$;</li><li>$R$ is the plan you endorse for how to update in response to each possible piece of total evidence,</li></ul>then it should be the case that, if you were to receive evidence $E_i$ and if $c(E_i) > 0$, then $R$ would exhort you to adopt credence function $c_i(-) = c(-|E_i) = \frac{c(-\ \&\ E_i)}{c(E_i)}$.<br /><br />According to the third, Dispositional Conditionalization, Conditionalization governs the updating behaviour you are disposed to exhibit.<br /> <br /><b>Dispositional Conditionalization (DC)</b> <br /><br />If<br /><ul><li>$c$ is your credence function at $t$;</li><li>the total evidence you receive between $t$ and $t'$ will come in the form of a proposition learned with certainty, and that proposition will come from the partition $\mathcal{E} = \{E_1, \ldots, E_n\}$;</li><li>$R$ is the plan you are disposed to follow in response to each possible piece of total evidence,</li></ul>then it should be the case that, if you were to receive evidence $E_i$ and if $c(E_i) > 0$, then $R$ would exhort you to adopt credence function $c_i(-) = c(-|E_i) = \frac{c(-\ \&\ E_i)}{c(E_i)}$.<br /><br />Next, let's meet the four arguments. Since it will take some work to formulate them precisely, I will give only an informal gloss here. 
There will be plenty of time to see them in high-definition in what follows.<br /><br /><b>Diachronic Dutch Book or Dutch Strategy Argument (DSA)</b> This purports to show that, if you violate conditionalization, there is a pair of decisions you might face, one before and one after you receive your evidence, such that your prior and posterior credences lead you to choose options when faced with those decisions that are guaranteed to be worse by your own lights than some alternative options (Lewis 1999).<br /><br /><b>Expected Pragmatic Utility Argument (EPUA)</b> This purports to show that, if you will face a decision after learning your evidence, then your prior credences will expect your updated posterior credences to do the best job of making that decision if they are obtained by conditionalizing on your priors (Brown 1976).<br /><br /><b>Expected Epistemic Utility Argument (EEUA)</b> This purports to show that your prior credences will expect your posterior credences to be best epistemically speaking if they are obtained by conditionalizing on your priors (Greaves & Wallace 2006).<br /><br /><b>Epistemic Utility Dominance Argument (EUDA)</b> This purports to show that, if you violate conditionalization, then there will be alternative priors and posteriors that are guaranteed to be better epistemically speaking, when considered together, than your priors and posteriors (Briggs & Pettigrew 2018).<br /><br /><h2>The framework</h2><br />In the following sections, we will consider each of the arguments listed above. As we will see, these arguments are concerned directly with updating plans or dispositions, rather than actual updating behaviour. That is, the items that they consider don't just specify how you in fact update in response to the particular piece of evidence you actually receive. 
Rather, they assume that your evidence between the earlier and later time will come in the form of a proposition learned with certainty (Certain Evidence); they assume the possible propositions that you might learn with certainty by the later time form a partition (Evidential Partition); and they assume that each of the propositions you might learn with certainty is one about which you had a prior opinion (Evidential Availability); and then they specify, for each of the possible pieces of evidence in your evidential partition, how you might update if you were to receive it.<br /><br />Some philosophers, like David Lewis (1999), assume that all three assumptions---Certain Evidence, Evidential Partition, Evidential Availability---hold in all learning situations. Others deny one or more. So Richard Jeffrey (1992) denies Certain Evidence and Evidential Availability; Jason Konek (2019) denies Evidential Availability but not Certain Evidence; Bas van Fraassen (1999), Miriam Schoenfield (2017), and Jonathan Weisberg (2007) deny Evidential Partition. But all agree, I think, that there are certain important situations in which all three assumptions are true: situations where there is a set of propositions that forms a partition, about each member of which you have a prior opinion, and where the possible evidence you might receive at the later time comes in the form of one of these propositions learned with certainty. Examples might include: when you are about to discover the outcome of a scientific experiment, perhaps by taking a reading from a measuring device with unambiguous outputs; when you've asked an expert a yes/no question; when you step on the digital scales in your bathroom or check your bank balance or count the number of spots on the back of the ladybird that just landed on your hand.
So, if you disagree with Lewis, simply restrict your attention to these cases in what follows.<br /><br />As we will see, we can piggyback on conclusions about plans and dispositions to produce arguments about actual behaviour in certain situations. But in the first instance, we will take the arguments to address plans and dispositions defined on evidential partitions primarily, and actual behaviour only secondarily. Thus, to state these arguments, we need a clear way to represent updating plans or dispositions. We will talk neutrally here of an updating rule. If you think conditionalization governs your updating dispositions, then you take it to govern the updating rule that matches those dispositions; if you think it governs your updating intentions, then you take it to govern the updating rule you intend to follow.<br /><br />We'll introduce a slew of terminology here. You needn't take it all in at the moment, but it's worth keeping it all in one place for ease of reference.<br /><br /><b>Agenda</b> We will assume that your prior and posterior credence functions are defined on the same set of propositions $\mathcal{F}$, and we'll assume that $\mathcal{F}$ is finite and $\mathcal{F}$ is an algebra. We say that $\mathcal{F}$ is your <i>agenda</i>. <br /><br /><b>Possible worlds</b> Given an agenda $\mathcal{F}$, the set of possible worlds relative to $\mathcal{F}$ is the set of classically consistent assignments of truth values to the propositions in $\mathcal{F}$. We'll abuse notation throughout and write $w$ for (i) a truth value assignment to the propositions in $\mathcal{F}$, (ii) the proposition in $\mathcal{F}$ that is true at that truth value assignment and only at that truth value assignment, and (iii) what we might call the omniscient credence function relative to that truth value assignment, which is the credence function that assigns maximal credence (i.e. 1) to all propositions that are true on it and minimal credence (i.e. 
0) to all propositions that are false on it. <br /><br /><b>Updating rules</b> An <i>updating rule</i> has two components:<br /><ul><li>a set of propositions, $\mathcal{E} = \{E_1, \ldots, E_n\}$. This contains the propositions that you might learn with certainty at the later time $t'$; each $E_i$ is in $\mathcal{F}$, so $\mathcal{E} \subseteq \mathcal{F}$; $\mathcal{E}$ forms a partition;</li><li>a set of sets of credence functions, $\mathcal{C} = \{C_1, \ldots, C_n\}$. For each $E_i$, $C_i$ is the set of possible ways that the rule allows you to respond to evidence $E_i$; that is, it is the set of possible posteriors that the rule permits when you learn $E_i$; each $c'$ in $C_i$ in $\mathcal{C}$ is defined on $\mathcal{F}$.</li></ul><br /><b>Deterministic updating rule</b> We say that an updating rule $R = (\mathcal{E}, \mathcal{C})$ is <i>deterministic</i> if each $C_i$ is a singleton set $\{c_i\}$. That is, for each piece of evidence there is exactly one possible response to it that the rule allows.<br /><br /><b>Stochastic updating rule</b> A <i>stochastic updating rule</i> is an updating rule $R = (\mathcal{E}, \mathcal{C})$ equipped with a probability function $P$. $P$ records, for each $E_i$ in $\mathcal{E}$ and $c'$ in $C_i$, how likely it is that you will adopt $c'$ in response to learning $E_i$. We write this $P(R^i_{c'} | E_i)$, where $R^i_{c'}$ is the proposition that says that you adopt posterior $c'$ in response to evidence $E_i$.<br /><ul><li>We assume $P(R^i_{c'} | E_i) > 0$ for all $c'$ in $C_i$. If the probability that you will adopt $c'$ in response to $E_i$ is zero, then $c'$ does not count as a response to $E_i$ that the rule allows.</li><li>Note that every deterministic updating rule is a stochastic updating rule for which $P(R^i_{c'} | E_i) = 1$ for each $c'$ in $C_i$. If $R = (\mathcal{E}, \mathcal{C})$ is deterministic, then, for each $E_i$, $C_i = \{c_i\}$.
So let $P(R^i_{c_i} | E_i) = 1$.</li></ul><br /><b>Conditionalizing updating rule</b> An updating rule $R = (\mathcal{E}, \mathcal{C})$ is a <i>conditionalizing rule</i> for a prior $c$ if, whenever $c(E_i) > 0$, $C_i = \{c_i\}$ and $c_i(-) = c(-|E_i)$.<br /><br /><b>Conditionalizing pairs</b> A pair $\langle c, R \rangle$ of a prior and an updating rule is a <i>conditionalizing pair</i> if $R$ is a conditionalizing rule for $c$.<br /><br /><b>Pseudo-conditionalizing updating rule</b> Suppose $R = (\mathcal{E}, \mathcal{C})$ is an updating rule. Then let $\mathcal{F}^*$ be the smallest algebra that contains all of $\mathcal{F}$ and also $R^i_{c'}$ for each $E_i$ in $\mathcal{E}$ and $c'$ in $C_i$. (As above, $R^i_{c'}$ is the proposition that says that you adopt posterior $c'$ in response to evidence $E_i$.) Then an updating rule $R$ is a <i>pseudo-conditionalizing rule</i> for a prior $c$ if it is possible to extend $c$, a credence function defined on $\mathcal{F}$, to $c^*$, a credence function defined on $\mathcal{F}^*$, such that, for each $E_i$ in $\mathcal{E}$ and $c'$ in $C_i$, $c'(-) = c^*(-|R^i_{c'})$. That is, each posterior is the result of conditionalizing the extended prior $c^*$ on the evidence to which it is a response and the fact that it was your response to this evidence. <br /><br /><b>Pseudo-conditionalizing pair</b> A pair $\langle c, R \rangle$ of a prior and an updating rule is a <i>pseudo-conditionalizing pair</i> if $R$ is a pseudo-conditionalizing rule for $c$.<br /><br />Let's illustrate these definitions using an example. Condi is a meteorologist. There is a hurricane in the Gulf of Mexico. She knows that it will make landfall soon in one of the following four towns: Pensacola, FL, Panama City, FL, Mobile, AL, Biloxi, MS. She calls a friend and asks whether it has hit yet. It has. Then she asks whether it has hit in Florida.
At this point, the evidence she will receive when her friend answers is either $F$---which says that it made landfall in Florida, that is, in Pensacola or Panama City---or $\overline{F}$---which says it hit elsewhere, that is, in Mobile or Biloxi. Her prior is $c$:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-l4KoBV2YGyM/XN7NZ-8qoII/AAAAAAAAB_c/PpFzZaB4ZdojM0zKfmDmOQym6HLVAW0jwCLcBGAs/s1600/Screenshot%2B2019-05-17%2Bat%2B16.02.59.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="147" data-original-width="969" height="48" src="https://3.bp.blogspot.com/-l4KoBV2YGyM/XN7NZ-8qoII/AAAAAAAAB_c/PpFzZaB4ZdojM0zKfmDmOQym6HLVAW0jwCLcBGAs/s320/Screenshot%2B2019-05-17%2Bat%2B16.02.59.png" width="320" /></a></div><br />Her evidential partition is $\mathcal{E} = \{F, \overline{F}\}$. And here are some posteriors she might adopt:<br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-QX8gJ3c3osU/XN7ONCqoFWI/AAAAAAAAB_k/dLiRx-M1M4csuinu0wDxGIRX71EFMBk7ACLcBGAs/s1600/Screenshot%2B2019-05-17%2Bat%2B16.05.03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="412" data-original-width="956" height="137" src="https://1.bp.blogspot.com/-QX8gJ3c3osU/XN7ONCqoFWI/AAAAAAAAB_k/dLiRx-M1M4csuinu0wDxGIRX71EFMBk7ACLcBGAs/s320/Screenshot%2B2019-05-17%2Bat%2B16.05.03.png" width="320" /></a></div><br />And here are four possible rules she might adopt, along with their properties:<br /><br /><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-K-mtoeYnsMM/XN7OPcTNDJI/AAAAAAAAB_o/ZaQnIvJX7Y0rnxEIUfMT-7w-S2R76-pfwCEwYBhgL/s1600/Screenshot%2B2019-05-17%2Bat%2B16.06.10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0"
data-original-height="328" data-original-width="1110" height="94" src="https://3.bp.blogspot.com/-K-mtoeYnsMM/XN7OPcTNDJI/AAAAAAAAB_o/ZaQnIvJX7Y0rnxEIUfMT-7w-S2R76-pfwCEwYBhgL/s320/Screenshot%2B2019-05-17%2Bat%2B16.06.10.png" width="320" /></a></div><br />As we will see below, for each of our four arguments for conditionalization---DSA, EPUA, EEUA, and EUDA---the standard formulation of the argument assumes a norm that we will call Deterministic Updating:<br /><br /><b>Deterministic Updating (DU)</b> Your updating rule should be deterministic.<br /><br />As we will see, this is crucial for the success of these arguments. In what follows, I will present each argument in its standard formulation, which assumes Deterministic Updating. Then I will explore what happens when we remove that assumption.<br /><br /><h2>The Dutch Strategy Argument (DSA)</h2><br />The DSA and EPUA both evaluate updating rules by their pragmatic consequences. That is, they look to the choices that your priors and/or your possible posteriors lead you to make and they conclude that they are optimal only if your updating rule is a conditionalizing rule for your prior.<br /><br /><h3>DSA with Deterministic Updating</h3><br />Let's look at the DSA first. In what follows, we'll take a decision problem to be a set of options that are available to an agent: e.g. accept a particular bet or refuse it; buy a particular lottery ticket or don't; take an umbrella when you go outside, take a raincoat, or take neither; and so on. The idea behind the DSA is this. One of the roles of credences is to help us make choices when faced with decision problems. They play that role badly if they lead us to make one series of choices when another series is guaranteed to serve our ends better. The DSA turns on the claim that, unless we update in line with Conditionalization, our credences will lead us to make such a series of choices when faced with a particular series of decision problems. 
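<br /><br />Before the formal details, here is a small numerical sketch of the mechanism (my own illustration, with made-up credences and stakes; it simplifies rather than states the theorem below). The agent's prior gives $c(X|E) = 0.5$, but she plans to adopt $c'(X) = 0.8$ upon learning $E$. She then evaluates each of two bets as exactly fair at the time it is offered, yet together they lose her money in every world where $E$ is true and gain her nothing in any world:

```python
# Worlds: w1 = E & X, w2 = E & not-X, w3 = not-E.  (Hypothetical numbers.)
prior = {"w1": 0.25, "w2": 0.25, "w3": 0.5}   # so c(E) = 0.5 and c(X|E) = 0.5
posterior = {"w1": 0.8, "w2": 0.2}            # planned response to E: c'(X) = 0.8

def eu(credence, payoff):
    """Expected utility of a payoff table (world -> utility) by this credence's lights."""
    return sum(credence[w] * payoff[w] for w in credence)

k = 10  # stake, in units of the quantity utility is linear in

# At t: she sells a bet on X conditional on E at her fair price c(X|E) * k = 5.
# She pockets 5, pays out k if X & E; the bet is called off (payoff 0) if not-E.
bet1 = {"w1": 5 - k, "w2": 5, "w3": 0}

# At t', having learned E: she buys a bet on X at her new fair price c'(X) * k = 8.
# She pays 8 and receives k if X; only the E-worlds remain live.
bet2 = {"w1": k - 8, "w2": -8}

assert eu(prior, bet1) == 0        # fair by her prior lights
assert eu(posterior, bet2) == 0    # fair by her planned posterior lights

# Yet the package guarantees a loss wherever E holds, and no gain anywhere:
package = {w: bet1[w] + bet2.get(w, 0) for w in prior}
print(package)  # {'w1': -3, 'w2': -3, 'w3': 0}
```

A conditionalizer's posterior price for the second bet would be $c(X|E) \cdot k = 5$, and no such losing package could be assembled against her; the theorem below turns this observation into a biconditional.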
<br /><br />Here, we restrict attention to a particular class of decision problems you might face. They are the decision problems in which, for each available option, its outcome at a given possible world secures for you a certain amount of a particular quantity, such as money or chocolate or pure pleasure, and your utility is linear in that quantity---that is, obtaining some amount of that quantity increases your utility by the same amount regardless of how much of the quantity you already have. The quantity is typically taken to be money, and we'll continue to talk like that in what follows. But it's really a placeholder for some quantity with this property. We restrict attention to such decision problems because, in the argument, we need to combine the outcome of one decision, made at the earlier time, with the outcome of another decision, made at the later time. So we need to ensure that the utility of a combination of outcomes is the sum of the utilities of the individual outcomes. <br /><br />Now, as we do throughout, we assume that the prior $c$ and the possible posteriors $c_1, \ldots, c_n$ permitted by a deterministic updating rule $R$ are all probability functions. And we will assume further that, when your credences are probabilistic and you face a decision problem, you should choose from the available options one of those that maximises expected utility relative to your credences.<br /><br />With this in hand, let's define two closely related features of a pair $\langle c, R \rangle$ that are undesirable from a pragmatic point of view, and might be thought to render that pair irrational.
First:<br /><br /><b>Strong Dutch Strategies</b> $\langle c, R \rangle$ is vulnerable to a <i>strong Dutch strategy</i> if there are two decision problems, $\mathbf{d}$, $\mathbf{d}'$ such that<br /><ol><li>$c$ requires you to choose option $A$ from the possible options available in $\mathbf{d}$;</li><li>for each $E_i$ and each $c'$ in $C_i$, $c'$ requires you to choose $B$ from $\mathbf{d}'$;</li><li>there are alternative options, $X$ in $\mathbf{d}$ and $Y$ in $\mathbf{d}'$, such that, at every possible world, you'll receive more utility from choosing $X$ and $Y$ than you receive from choosing $A$ and $B$. In the language of decision theory, $X + Y$ strongly dominates $A + B$.</li></ol><b>Weak Dutch Strategies</b> $\langle c, R \rangle$ is vulnerable to a <i>weak Dutch strategy</i> if there are decision problems $\mathbf{d}$ and, for each $C_i$ in $\mathcal{C}$ and each $c'$ in $C_i$, $\mathbf{d}'_{c'}$ such that<br /><ol><li>$c$ requires you to choose $A$ from $\mathbf{d}$;</li><li>for each $E_i$ and each $c'$ in $C_i$, $c'$ requires you to choose $B^i_{c'}$ from $\mathbf{d}'_{c'}$;</li><li>there are alternative options, $X$ in $\mathbf{d}$ and, for each $E_i$ and each $c'$ in $C_i$, $Y^i_{c'}$ in $\mathbf{d}'_{c'}$, such that (a) for each $E_i$, each world in $E_i$, and each $c'$ in $C_i$, you'll receive at least as much utility at that world from choosing $X$ and $Y^i_{c'}$ as you'll receive from choosing $A$ and $B^i_{c'}$, and (b) for some $E_i$, some world in $E_i$, and some $c'$ in $C_i$, you'll receive strictly more utility at that world from $X$ and $Y^i_{c'}$ than you'll receive from $A$ and $B^i_{c'}$. </li></ol>Then the Dutch Strategy Argument is based on the following mathematical fact (de Finetti 1974):<br /><br /><b>Theorem 1</b> Suppose $R$ is a deterministic updating rule.
Then:<br /><ol><li>if $R$ is not a conditionalizing rule for $c$, then $\langle c, R \rangle$ is vulnerable to a strong Dutch strategy;</li><li>if $R$ is a conditionalizing rule for $c$, then $\langle c, R \rangle$ is not vulnerable even to a weak Dutch strategy.</li></ol>That is, if your updating rule is not a conditionalizing rule for your prior, then your credences will lead you to choose a strongly dominated pair of options when faced with a particular pair of decision problems; if it is, that can't happen.<br /><br />Now that we have seen how the argument works, let's see whether it supports the three versions of conditionalization that we met above: Actual (AC), Plan (PC), and Dispositional (DC) Conditionalization. Since they speak directly of rules, let's begin with PC and DC.<br /><br />The DSA shows that, if you endorse a deterministic rule that isn't a conditionalizing rule for your prior, then there is a pair of decision problems, one that you'll face at the earlier time and the other at the later time, where your credences at the earlier time and your planned credences at the later time will require you to choose a dominated pair of options. And it seems reasonable to say that it is irrational to endorse a plan when you will be rendered vulnerable to a Dutch Strategy if you follow through on it. So, for those who endorse deterministic rules, DSA plausibly supports Plan Conditionalization.<br /><br />The same is true of Dispositional Conditionalization. Just as it is irrational to <i>plan</i> to update in a way that would render you vulnerable to a Dutch Strategy if you were to stick to the plan, it is surely irrational to be <i>disposed</i> to update in a way that renders you vulnerable in this way. So, for those whose updating dispositions are deterministic, DSA plausibly supports Dispositional Conditionalization.<br /><br />Finally, AC.
There are various ways to move from either PC or DC to AC, but each one of them requires some extra assumptions. For instance:<br /><br />(I) I might assume: (i) between an earlier and a later time, there is always a partition such that you know that the strongest piece of evidence you might receive between those times is a proposition from that partition, learned with certainty; (ii) if you know you'll receive evidence from some partition, you are rationally required to plan how you will update on each possible piece of evidence before you receive it; and (iii) if you plan how to respond to evidence before you receive it, you are rationally required to follow through on that plan once you have received it. Together with PC + DU, these give AC.<br /><br />(II) I might assume: (i) you have updating dispositions. So, if you actually update other than by conditionalization, then it must be a manifestation of a disposition other than conditionalizing. Together with DC + DU, this gives AC.<br /><br />(III) I might assume: (i) that you are rationally permitted to update in a given way only if it can be represented as the result of updating on a plan that you were rationally permitted to endorse or as the result of dispositions that you were rationally permitted to have, even if you did not in fact endorse any plan prior to receiving the evidence nor have any updating dispositions. Again, together with PC + DU or DC + DU, this gives AC.<br /><br />Notice that, in each case, it was essential to invoke Deterministic Updating (DU). As we will see below, this causes problems for AC. <br /><br /><h3>DSA without Deterministic Updating</h3><br />We have now seen how the DSA proceeds if we assume Deterministic Updating. But what if we don't?
Consider, for instance, rule $R_3$ from our list of examples above:<br />$$R_3 = (\mathcal{E} = \{F, \overline{F}\}, \mathcal{C} = \{\{c^\circ_F, c^+_F\}, \{c^\circ_{\overline{F}}, c^+_{\overline{F}}\}\})$$<br />That is, if Condi learns $F$, rule $R_3$ allows her to update to $c^\circ_F$ or to $c^+_F$. And if she receives $\overline{F}$, it allows her to update to $c^\circ_{\overline{F}}$ or to $c^+_{\overline{F}}$. Notice that $R_3$ violates conditionalization thoroughly: it is not deterministic; and, moreover, as well as not mandating the posteriors that conditionalization demands, it does not even permit them. Can we adapt the DSA to show that $R_3$ is irrational? No. We cannot use Dutch Strategies to show that $R_3$ is irrational because it isn't vulnerable to them.<br /><br />To see this, we first note that, while $R_3$ is not deterministic and not a conditionalizing rule, it is a pseudo-conditionalizing rule. And to see that, it helps to state the following representation theorem for pseudo-conditionalizing rules.<br /><br /><b>Lemma 1</b> $R$ is a pseudo-conditionalizing rule for $c$ iff<br /><ol><li>for all $E_i$ in $\mathcal{E}$ and $c'$ in $C_i$, $c'(E_i) = 1$, and</li><li>$c$ is in the convex hull of the possible posteriors that $R$ permits.</li></ol>But note:$$c(-) = 0.4c^\circ_F(-) + 0.4c^+_F(-) + 0.1c^\circ_{\overline{F}}(-) + 0.1 c^+_{\overline{F}}(-)$$<br />So $R_3$ is pseudo-conditionalizing. What's more:<br /><br /><b>Theorem 2</b><br /><ul><li>If $R$ is not a pseudo-conditionalizing rule for $c$, then $\langle c, R \rangle$ is vulnerable at least to a weak Dutch Strategy, and possibly also a strong Dutch Strategy.</li><li>If $R$ is a pseudo-conditionalizing rule for $c$, then $\langle c, R \rangle$ is not vulnerable to a weak Dutch Strategy.</li></ul>Thus, $\langle c, R_3 \rangle$ is not vulnerable even to a weak Dutch Strategy.
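<br /><br />Lemma 1 suggests a simple numerical check for pseudo-conditionalization. The sketch below uses hypothetical credence functions standing in for $c^\circ_F$, $c^+_F$, $c^\circ_{\overline{F}}$, and $c^+_{\overline{F}}$ (the actual values are defined earlier in the paper), together with the mixture weights $0.4, 0.4, 0.1, 0.1$ from the identity above:

```python
# Lemma 1 in code: a rule is pseudo-conditionalizing for a prior iff each
# permitted posterior is certain of the evidence it responds to, and the
# prior lies in the convex hull of the permitted posteriors. All credence
# values here are hypothetical stand-ins; only the weights mirror the text.

WORLDS = ["w1", "w2", "w3", "w4"]
F = {"w1", "w2"}            # the evidence proposition F
NOT_F = set(WORLDS) - F     # its complement

# Hypothetical posteriors, each certain of its evidence.
c_F_circ    = {"w1": 0.5, "w2": 0.5, "w3": 0.0, "w4": 0.0}
c_F_plus    = {"w1": 0.8, "w2": 0.2, "w3": 0.0, "w4": 0.0}
c_notF_circ = {"w1": 0.0, "w2": 0.0, "w3": 0.5, "w4": 0.5}
c_notF_plus = {"w1": 0.0, "w2": 0.0, "w3": 0.9, "w4": 0.1}

weights = [0.4, 0.4, 0.1, 0.1]
posteriors = [c_F_circ, c_F_plus, c_notF_circ, c_notF_plus]

# Define the prior as the corresponding mixture of the posteriors, so that
# condition 2 of Lemma 1 holds by construction (weights are non-negative
# and sum to 1).
prior = {w: sum(l * p[w] for l, p in zip(weights, posteriors)) for w in WORLDS}

def certain_of(credence, E):
    """Does this credence function assign probability 1 to proposition E?"""
    return abs(sum(credence[w] for w in E) - 1.0) < 1e-9

# Condition 1 of Lemma 1: each posterior is certain of its evidence.
assert all(certain_of(p, F) for p in posteriors[:2])
assert all(certain_of(p, NOT_F) for p in posteriors[2:])

print(prior)
```

By construction, both conditions of Lemma 1 hold, so the (hypothetical) rule is pseudo-conditionalizing for this prior even though it is neither deterministic nor conditionalizing.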
The DSA, then, cannot say what is irrational about Condi if she begins with prior $c$ and either endorses $R_3$ or is disposed to update in line with it. Thus, the DSA cannot justify Deterministic Updating. And without DU, it cannot support PC or DC either. After all, $R_3$ violates each of those, but it is not vulnerable even to a weak Dutch Strategy. Moreover, each of the three arguments for AC breaks down, because each depends on PC or DC. The problem is that, if Condi updates from $c$ to $c^\circ_F$ upon learning $F$, she violates AC; but there is a non-deterministic updating rule---namely, $R_3$---that allows $c^\circ_F$ as a response to learning $F$, and, for all DSA tells us, she might have rationally endorsed $R_3$ before learning $F$ or she might rationally have been disposed to follow it. Indeed, the only restriction that DSA can place on your actual updating behaviour is that you should become certain of the evidence that you learned. After all:<br /><br /><b>Theorem 3</b> Suppose $c$ is your prior and $c'$ is your posterior. Then there is a rule $R$ such that:<br /><ol><li>$c'$ is in $C_i$, and</li><li>$R$ is a pseudo-conditionalizing rule for $c$</li></ol>iff $c'(E_i) = 1$.<br /><br />Thus, at the end of this section, we can conclude that, whatever is irrational about planning to update using non-deterministic but pseudo-conditionalizing updating rules, it cannot be that following through on those plans leaves you vulnerable to a Dutch Strategy, for it does not. And similarly, whatever is irrational about being disposed to update in those ways, it cannot be that those dispositions will equip you with credences that lead you to choose dominated options, for they do not. With PC and DC thus blocked, our route to AC is therefore also blocked.<br /><br /><h2>The Expected Pragmatic Utility Argument (EPUA)</h2><br />Let's look at EPUA next. Again, we will consider how our credences guide our actions when we face decision problems.
In this case, there is no need to restrict attention to monetary decision problems. We will only consider a single decision problem, which we face at the later time, after we've received the evidence, so we won't have to combine the outcomes of multiple options as we did in the DSA. The idea is this. Suppose you will make a decision after you receive whatever evidence it is that you receive at the later time. And suppose that you will use your later updated credence function to make that choice---indeed, you'll choose from the available options by maximising expected utility from the point of view of your new updated credences. Which updating rule does your prior expect to guide that choice best?<br /><br /><h3>EPUA with Deterministic Updating</h3><br />Suppose you'll face decision problem $\mathbf{d}$ after you've updated. And suppose further that you'll use a deterministic updating rule $R$. Then, if $w$ is a possible world and $E_i$ is the element of the evidential partition $\mathcal{E}$ that is true at $w$, the idea is that we take the pragmatic utility of $R$ relative to $\mathbf{d}$ at $w$ to be the utility at $w$ of whatever option from $\mathbf{d}$ we should choose if our posterior credence function were $c_i$, as $R$ requires it to be at $w$. But of course, for many decision problems, this isn't well defined because there is no unique option in $\mathbf{d}$ that maximises expected utility by the lights of $c_i$; rather, there are sometimes many such options, and they might have different utilities at $w$. Thus, we need not only $c_i$ but also a selection function, which picks a single option from any set of options. If $f$ is such a selection function, then let $A^{\mathbf{d}}_{c_i, f}$ be the option that $f$ selects from the set of options in $\mathbf{d}$ that maximise expected utility by the lights of $c_i$.
And let<br />$$u_{\mathbf{d},f}(R, w) = u(A^{\mathbf{d}}_{c_i, f}, w).$$<br />Then the EPUA argument turns on the following mathematical fact (Brown 1976):<br /><br /><b>Theorem 4</b> Suppose $R$ and $R^\star$ are both deterministic updating rules. Then:<br /><ul><li>If $R$ and $R^\star$ are both conditionalizing rules for $c$, and $f$, $g$ are selection functions, then for all decision problems $\mathbf{d}$ $$\sum_{w \in W} c(w) u_{\mathbf{d}, f}(R, w) = \sum_{w \in W} c(w) u_{\mathbf{d}, g}(R^\star, w)$$</li><li>If $R$ is a conditionalizing rule for $c$, and $R^\star$ is not, and $f$, $g$ are selection functions, then for all decision problems $\mathbf{d}$, $$\sum_{w \in W} c(w) u_{\mathbf{d}, f}(R, w) \geq \sum_{w \in W} c(w) u_{\mathbf{d}, g}(R^\star, w)$$with strict inequality for some decision problems $\mathbf{d}$.</li></ul>That is, a deterministic updating rule maximises expected pragmatic utility by the lights of your prior just in case it is a conditionalizing rule for your prior.<br /><br />As in the case of the DSA above, then, if we assume Deterministic Updating (DU), we can establish PC and DC, and on the back of those AC as well. After all, it is surely irrational to plan to update in one way when you expect another way to guide your actions better in the future; and it is surely irrational to be disposed to update in one way when you expect another to guide you better. And as before there are the same three arguments for AC on the back of PC and DC.<br /><br /><h3>EPUA without Deterministic Updating</h3><br />How does EPUA fare when we widen our view to include non-deterministic updating rules as well? An initial problem is that it is no longer clear how to define the pragmatic utility of such an updating rule relative to a decision problem at a possible world.
Above, we said that, relative to a decision problem $\mathbf{d}$ and a selection function $f$, the pragmatic utility of rule $R$ at world $w$ is the utility of the option that you would choose when faced with $\mathbf{d}$ using the credence function that $R$ mandates at $w$ and $f$: that is, if $E_i$ is true at $w$, then<br />$$u_{\mathbf{d}, f}(R, w) = u(A^{\mathbf{d}}_{c_i, f}, w).$$<br />But, if $R$ is not deterministic, there might be no single credence function that it mandates at $w$. If $E_i$ is the piece of evidence you'll learn at $w$ and $R$ permits more than one credence function in response to $E_i$, then there might be a range of different options in $\mathbf{d}$, each of which maximises expected utility relative to a different credence function $c'$ in $C_i$. So what are we to do?<br /><br />Our response to this problem depends on whether we wish to argue for Plan or Dispositional Conditionalization (PC or DC). Suppose, first, that we are interested in DC. That is, we are interested in a norm that governs the updating rule that records how you are disposed to update when you receive certain evidence. Then it seems reasonable to assume that the updating rule that records your dispositions is stochastic. That is, for each possible piece of evidence $E_i$ and each possible response $c'$ in $C_i$ that you might adopt upon receiving that evidence, there is some objective chance that you will respond to $E_i$ by adopting $c'$. As we explained above, we'll write this $P(R^i_{c'} | E_i)$, where $R^i_{c'}$ is the proposition that you receive $E_i$ and respond by adopting $c'$.
Then, if $E_i$ is true at $w$, we might take the pragmatic utility of $R$ relative to $\mathbf{d}$ and $f$ at $w$ to be the expectation of the utility of the options that each permitted response to $E_i$ (and selection function $f$) would lead us to choose:<br />$$u_{\mathbf{d}, f}(R, w) = \sum_{c' \in C_i} P(R^i_{c'} | E_i) u(A^{\mathbf{d}}_{c', f}, w)$$<br />With this in hand, we have the following result: <br /><br /><b>Theorem 5</b> Suppose $R$ and $R^\star$ are both updating rules. Then:<br /><ul><li>If $R$ and $R^\star$ are both conditionalizing rules for $c$, and $f$, $g$ are selection functions, then for all decision problems $\mathbf{d}$, $$\sum_{w \in W} c(w) u_{\mathbf{d}, f}(R, w) = \sum_{w \in W} c(w) u_{\mathbf{d}, g}(R^\star, w)$$</li><li>If $R$ is a conditionalizing rule for $c$, and $R^\star$ is a stochastic but not conditionalizing rule, and $f$, $g$ are selection functions, then for all decision problems $\mathbf{d}$,$$\sum_{w \in W} c(w) u_{\mathbf{d}, f}(R, w) \geq \sum_{w \in W} c(w) u_{\mathbf{d}, g}(R^\star, w)$$with strict inequality for some decision problems $\mathbf{d}$.</li></ul>This shows the first difference between the DSA and EPUA. The latter, but not the former, provides a route to establishing Dispositional Conditionalization (DC). If we assume that your dispositions are governed by a chance function, and we use that chance function to calculate expectations, then we can show that your prior will expect your posteriors to do worse as a guide to action unless you are disposed to update by conditionalizing on the evidence you receive.<br /><br />Next, suppose we are interested in Plan Conditionalization (PC). In this case, we might try to appeal again to Theorem 5.
To do that, we must assume that, while there are non-deterministic updating rules that we might endorse, they are all at least stochastic updating rules; that is, they all come equipped with a probability function that determines how likely it is that I will adopt a particular permitted response to the evidence. That is, we might say that the updating rules that we might endorse are either deterministic or non-deterministic-but-stochastic. In the language of game theory, we might say that the updating strategies between which we choose are either pure or mixed. And then Theorem 5 will show that we should adopt a deterministic-and-conditionalizing rule, rather than any deterministic-but-non-conditionalizing or non-deterministic-but-stochastic rule. The problem with this proposal is that it seems just as arbitrary to restrict to deterministic and non-deterministic-but-stochastic rules as it was to restrict to deterministic rules in the first place. Why should we not be able to endorse a non-deterministic and non-stochastic rule---that is, a rule that, for at least one possible piece of evidence $E_i$ in $\mathcal{E}$, permits two or more posteriors as responses, but does not specify any chance mechanism by which we'll choose between them? But if we permit these rules, how are we to define their pragmatic utility relative to a decision problem and at a possible world?<br /><br />Here's one suggestion. Suppose $E_i$ is the proposition in $\mathcal{E}$ that is true at world $w$. And suppose $\mathbf{d}$ is a decision problem and $f$ is a selection function. Then we might take the pragmatic utility of $R$ relative to $\mathbf{d}$ and $f$ at $w$ to be the average utility of the options that each permissible response to $E_i$ and $f$ would choose when faced with $\mathbf{d}$.
That is,$$u_{\mathbf{d}, f}(R, w) = \frac{1}{|C_i|} \sum_{c' \in C_i} u(A^{\mathbf{d}}_{c', f}, w)$$where $|C_i|$ is the size of $C_i$, that is, the number of possible responses to $E_i$ that $R$ permits. If that's the case, then we have the following:<br /><br /><b>Theorem 6</b> Suppose $R$ and $R^\star$ are updating rules. Then if $R$ is a conditionalizing rule for $c$, and $R^\star$ is not deterministic, not stochastic, and not a conditionalizing rule for $c$, and $f$, $g$ are selection functions, then for all decision problems $\mathbf{d}$,<br />$$\sum_{w \in W} c(w) u_{\mathbf{d}, f}(R, w) \geq \sum_{w \in W} c(w) u_{\mathbf{d}, g}(R^\star, w)$$with strict inequality for some decision problems $\mathbf{d}$.<br /><br />Put together with Theorems 4 and 5, this shows that our prior expects us to do better by endorsing a conditionalizing rule than by endorsing any other sort of rule, whether that is a deterministic and non-conditionalizing rule, a non-deterministic but stochastic rule, or a non-deterministic and non-stochastic rule.<br /><br />So, again, we see a difference between DSA and EPUA. Just as the latter, but not the former, provides a route to establishing DC without assuming Deterministic Updating, so the latter but not the former provides a route to establishing PC without DU. And from both of those, we have the usual three routes to AC. This means that EPUA explains what might be irrational about endorsing a non-deterministic updating rule, or having dispositions that match one. If you do, there's some alternative updating rule that your prior expects to do better as a guide to future action.<br /><br /><h2>Expected Epistemic Utility Argument (EEUA)</h2><br />The previous two arguments criticized non-conditionalizing updating rules from the standpoint of pragmatic utility. The EEUA and EUDA both criticize such rules from the standpoint of epistemic utility.
The idea is this: just as credences play a pragmatic role in guiding our actions, so they play other roles as well---they represent the world; they respond to evidence; they might be more or less coherent. These roles are purely epistemic. And so just as we defined the pragmatic utility of a credence function at a world when faced with a decision problem, so we can also define the epistemic utility of a credence function at a world---it is a measure of how valuable it is to have that credence function from a purely epistemic point of view. <br /><br /><h3>EEUA with Deterministic Updating</h3><br />We will not give an explicit definition of the epistemic utility of a credence function at a world. Rather, we'll simply state two properties that we'll take measures of such epistemic utility to have. These are widely assumed in the literature on epistemic utility theory and accuracy-first epistemology, and I'll defer to the arguments in favour of them that are outlined there (Joyce 2009, Pettigrew 2016, Horowitz 2019).<br /><br />A local epistemic utility function is a function $s$ that takes a single credence and a truth value---either true (1) or false (0)---and returns the epistemic value of having that credence in a proposition with that truth value. Thus, $s(1, p)$ is the epistemic value of having credence $p$ in a truth, while $s(0, p)$ is the epistemic value of having credence $p$ in a falsehood. A global epistemic utility function is a function $EU$ that takes an entire credence function defined on $\mathcal{F}$ and a possible world and returns the epistemic value of having that credence function when the propositions in $\mathcal{F}$ have the truth values they have in that world.<br /><br /><b>Strict Propriety</b> A local epistemic utility function $s$ is <i>strictly proper</i> if each credence expects itself and only itself to have the greatest epistemic utility.
That is, for all $0 \leq p \leq 1$,<br />$$ps(1, x) + (1-p) s(0, x)$$<br />is maximised, as a function of $x$, at $x = p$.<br /><br /><b>Additivity</b> A global epistemic utility function is <i>additive</i> if, for each proposition $X$ in $\mathcal{F}$, there is a local epistemic utility function $s_X$ such that the epistemic utility of a credence function $c$ at a possible world is the sum of the epistemic utilities at that world of the credences it assigns. If $w$ is a possible world and we write $w(X)$ for the truth value (0 or 1) of proposition $X$ at $w$, this says:$$EU(c, w) = \sum_{X \in \mathcal{F}} s_X(w(X), c(X))$$<br /><br />We then define the epistemic utility of a deterministic updating rule $R$ in the same way we defined its pragmatic utility above: if $E_i$ is true at $w$, and $C_i = \{c_i\}$, then<br />$$EU(R, w) = EU(c_i, w)$$Then the standard formulation of the EEUA turns on the following theorem (Greaves & Wallace 2006):<br /><br /><b>Theorem 7</b> Suppose $R$ and $R^\star$ are deterministic updating rules. Then:<br /><ul><li>If $R$ and $R^\star$ are both conditionalizing rules for $c$, then$$\sum_{w \in W} c(w) EU(R, w) = \sum_{w \in W} c(w) EU(R^\star, w)$$</li><li>If $R$ is a conditionalizing rule for $c$ and $R^\star$ is not, then$$\sum_{w \in W} c(w) EU(R, w) > \sum_{w \in W} c(w) EU(R^\star, w)$$</li></ul>That is, a deterministic updating rule maximises expected epistemic utility by the lights of your prior just in case it is a conditionalizing rule for your prior.<br />So, as for DSA and EPUA, if we assume Deterministic Updating, we obtain an argument for PC and DC, and indirectly one for AC too.<br /><br /><h3>EEUA without Deterministic Updating</h3><br />If we don't assume Deterministic Updating, the situation here is very similar to the one we encountered above when we considered EPUA. Suppose $R$ is a non-deterministic but stochastic updating rule.
Then, as above, we let its epistemic utility at a world be the expectation of the epistemic utility that the various possible posteriors permitted by $R$ take at that world. That is, if $E_i$ is the proposition in $\mathcal{E}$ that is true at $w$, then$$EU(R, w) = \sum_{c' \in C_i} P(R^i_{c'} | E_i) EU(c', w)$$Then, we have a similar result to Theorem 5:<br /><br /><b>Theorem 8</b> Suppose $R$ and $R^\star$ are updating rules. Then, if $R$ is a conditionalizing rule for $c$, and $R^\star$ is stochastic but not a conditionalizing rule for $c$,<br />$$\sum_{w \in W} c(w) EU(R, w) > \sum_{w \in W} c(w) EU(R^\star, w)$$<br /><br />Next, suppose $R$ is a non-deterministic but also a non-stochastic rule. Then we let its epistemic utility at a world be the average epistemic utility that the various possible posteriors permitted by $R$ take at that world. That is, if $E_i$ is the proposition in $\mathcal{E}$ that is true at $w$, then<br />$$EU(R, w) = \frac{1}{|C_i|}\sum_{c' \in C_i} EU(c', w)$$And again we have a similar result to Theorem 6:<br /><br /><b>Theorem 9 </b>Suppose $R$ and $R^\star$ are updating rules, $R$ is a conditionalizing rule for $c$, and $R^\star$ is not deterministic, not stochastic, and not a conditionalizing rule for $c$. Then:<br />$$\sum_{w \in W} c(w) EU(R, w) > \sum_{w \in W} c(w) EU(R^\star, w)$$<br /><br />So the situation is the same as for EPUA. Whether we assess a rule by looking at how well the posteriors it produces guide our future actions, or how good they are from a purely epistemic point of view, our prior will expect a conditionalizing rule for itself to be better than any non-conditionalizing rule. And thus we obtain PC and DC, and indirectly AC as well.<br /><br /><h2>Epistemic Utility Dominance Argument (EUDA)</h2><br />Finally, we turn to the EUDA. In EPUA and EEUA, we assess the pragmatic or epistemic utility of the updating rule from the viewpoint of the prior.
In DSA, we assess the prior and updating rule together, and from no particular point of view; but, unlike the EPUA and EEUA, we do not assign utilities, either pragmatic or epistemic, to the prior and the rule. In EUDA, like in DSA and unlike EPUA and EEUA, we assess the prior and updating rule together, and again from no particular point of view; but unlike in DSA and like in EPUA and EEUA, we assign utilities to them---in particular, epistemic utilities---and assess them with reference to those.<br /><br /><h3>EUDA with Deterministic Updating</h3><br />Suppose $R$ is a deterministic updating rule. Then, as before, if $E_i$ is true at $w$, let the epistemic utility of $R$ be the epistemic utility of the credence function $c_i$ that it mandates at $w$: that is,$$EU(R, w) = EU(c_i, w).$$<br />But this time also let the epistemic utility of the pair $\langle c, R \rangle$ consisting of the prior and the updating rule be the sum of the epistemic utility of the prior and the epistemic utility of the updating rule: that is,$$EU(\langle c, R \rangle, w) = EU(c, w) + EU(R, w) = EU(c, w) + EU(c_i, w)$$<br />Then the EUDA turns on the following mathematical fact (Briggs & Pettigrew 2018):<br /><br /><b>Theorem 10 </b> Suppose $EU$ is an additive, strictly proper epistemic utility function. And suppose $R$ and $R^\star$ are deterministic updating rules. Then:<br /><ul><li>if $\langle c, R \rangle$ is non-conditionalizing, there is $\langle c^\star, R^\star \rangle$ such that, for all $w$ $$EU(\langle c, R \rangle, w) < EU(\langle c^\star, R^\star \rangle, w)$$</li><li>if $\langle c, R \rangle$ is conditionalizing, there is no $\langle c^\star, R^\star \rangle$ such that, for all $w$ $$EU(\langle c, R \rangle, w) < EU(\langle c^\star, R^\star \rangle, w)$$</li></ul>That is, if $R$ is not a conditionalizing rule for $c$, then together they are $EU$-dominated; if it is a conditionalizing rule, they are not.
Thus, like EPUA and EEUA and unlike DSA, if we assume Deterministic Updating, EUDA gives PC, DC, and indirectly AC.<br /><br /><h3>EUDA without Deterministic Updating</h3><br />Now suppose we permit non-deterministic updating rules as well as deterministic ones. In this case, there are two approaches we might take. On the one hand, we might define the epistemic utility of non-deterministic rules, both stochastic and non-stochastic, just as we did for EEUA. That is, we might take the epistemic utility of a stochastic rule at a world to be the expectation of the epistemic utility of the various posteriors that it permits in response to the evidence that you obtain at that world; and the epistemic utility of a non-stochastic rule at a world to be the average of those epistemic utilities. This gives us the following result:<br /><br /><b>Theorem 11 </b> Suppose $EU$ is an additive, strictly proper epistemic utility function. Then, if $\langle c, R \rangle$ is not a conditionalizing pair, there is an alternative pair $\langle c^\star, R^\star \rangle$ such that, for all $w$, $$EU(\langle c, R \rangle, w) < EU(\langle c^\star, R^\star \rangle, w)$$And this therefore supports an argument for PC and DC and indirectly AC as well.<br /><br />On the other hand, we might consider more fine-grained possible worlds, which specify not only the truth values of all the propositions in $\mathcal{F}$, but also which posterior I adopt. We can then ask: given a particular pair $\langle c, R \rangle$, is there an alternative pair $\langle c^\star, R^\star \rangle$ that has greater epistemic utility at every fine-grained world by the lights of $EU$? If we judge updating rules by this standard, we get a rather different answer.
If $E_i$ is the element of $\mathcal{E}$ that is true at $w$, and $c'$ is in $C_i$ and $c^{\star \prime}$ is in $C^\star_i$, then we write $w\ \&\ R^i_{c'}\ \&\ R^{\star i}_{c^{\star \prime}}$ for the more fine-grained possible world we obtain from $w$ by adding that $R$ updates to $c'$ and $R^\star$ updates to $c^{\star\prime}$ upon receipt of $E_i$. And let<br /><ul><li>$EU(\langle c, R \rangle, w\ \&\ R^i_{c'}\ \&\ R^{\star i}_{c^{\star \prime}} ) = EU(c, w) + EU(c', w)$</li><li>$EU(\langle c^\star, R^\star \rangle, w\ \&\ R^i_{c'}\ \&\ R^{\star i}_{c^{\star \prime}} ) = EU(c^\star, w) + EU(c^{\star\prime}, w)$</li></ul>Then:<br /><b>Theorem 12 </b>Suppose $EU$ is an additive, strictly proper epistemic utility function. Then:<br /><ul><li>If $\langle c, R \rangle$ is a pseudo-conditionalizing pair, there is no alternative pair $\langle c^\star, R^\star\rangle$ such that, for all $E_i$ in $\mathcal{E}$, $w$ in $E_i$, $c'$ in $C_i$ and $c^{\star\prime}$ in $C^\star_i$, $$EU(\langle c, R \rangle, w\ \&\ R^i_{c'}\ \&\ R^{\star i}_{c^{\star \prime}} ) < EU(\langle c^\star, R^\star \rangle, w\ \&\ R^i_{c'}\ \&\ R^{\star i}_{c^{\star \prime}})$$</li><li>There are pairs $\langle c, R \rangle$ that are non-conditionalizing and non-pseudo-conditionalizing for which there is no alternative pair $\langle c^\star, R^\star\rangle$ such that, for all $E_i$ in $\mathcal{E}$, $w$ in $E_i$, $c'$ in $C_i$ and $c^{\star\prime}$ in $C^\star_i$, $$EU(\langle c, R \rangle, w\ \&\ R^i_{c'}\ \&\ R^{\star i}_{c^{\star \prime}} ) < EU(\langle c^\star, R^\star \rangle, w\ \&\ R^i_{c'}\ \&\ R^{\star i}_{c^{\star \prime}})$$</li></ul>Interpreted in this way, then, and without the assumption of Deterministic Updating, EUDA is the weakest of all the arguments. Where DSA at least establishes that your updating rule should be pseudo-conditionalizing for your prior, even if it does not establish that it should be conditionalizing, EUDA does not establish even that. 
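<br /><br />Since Theorems 7 to 12 all assume an additive, strictly proper epistemic utility function, it may help to see such a function in action. The quadratic (Brier) score is the standard example in this literature; the following sketch checks Strict Propriety for it numerically on a grid of credences:

```python
# A numerical check that the quadratic (Brier) local score is strictly
# proper: for each credence p, the expected score p*s(1,x) + (1-p)*s(0,x)
# is uniquely maximised at x = p. The grid search is only an illustration;
# the general fact follows from the strict concavity of the expected score.

def s(truth_value, x):
    # Quadratic score: negative squared distance from the truth value.
    return -(truth_value - x) ** 2

def expected_score(p, x):
    # Expected epistemic utility of credence x, by the lights of credence p.
    return p * s(1, x) + (1 - p) * s(0, x)

grid = [i / 100 for i in range(101)]
for p in [0.0, 0.25, 0.5, 0.9]:
    best = max(grid, key=lambda x: expected_score(p, x))
    assert abs(best - p) < 1e-9, (p, best)
print("quadratic score is strictly proper on the grid")
```

Each credence $p$ on the grid expects itself, and only itself, to have the greatest epistemic utility, which is exactly what Strict Propriety requires of the local scores $s_X$.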
<br /><br /><h2>Conclusion</h2><br />One upshot of this investigation is that, so long as we assume Deterministic Updating (DU), all four arguments support the same conclusions, namely, Plan and Dispositional Conditionalization, and also Actual Conditionalization. But once we drop DU, that agreement vanishes.<br /><br />Without DU, DSA shows only that, if we plan to update using a particular rule, it should be a pseudo-conditionalizing rule for our prior; and similarly for our dispositions. As a result, it cannot support AC. Indeed, it can support only the weakest restrictions on our actual updating behaviour, since nearly any such behaviour can be seen as an implementation of a pseudo-conditionalizing rule.<br /><br />EPUA and EEUA are much more hopeful. Let's consider our updating dispositions first. It seems natural to assume that, even if these are not deterministic, they are at least governed by some objective chances. If so, this gives a natural definition of the pragmatic and epistemic utilities of my updating dispositions at a world---they are the expectations of the pragmatic and epistemic utilities of the posteriors, calculated using the objective chances. And, relative to that, we can in fact establish DU---we no longer need to assume it. With that in hand, we regain DC and two of the routes to AC.<br /><br />Next, let's consider the updating plans we endorse. It also seems natural to assume that those plans, if not deterministic, might not be stochastic either. And, if that's the case, we can take their pragmatic or epistemic utility at a world to be the average pragmatic or epistemic utility of the different possible credence functions they endorse as responses to the evidence you gain at that world. And, relative to that, we can again establish DU. And with it PC and two of the routes to AC.<br /><br />Finally, EUDA is a mixed bag.
Understanding the epistemic and pragmatic utility of an updating rule as we have just described gives us DU and with it PC, DC, and AC. But if we take a fine-grained approach, we cannot even establish that your updating rule should be a pseudo-conditionalizing rule for your prior.<br /><br /><h2>Proofs</h2>For proofs of the theorems in this post, please see the paper version <a href="https://drive.google.com/open?id=1aoWwDDOWDjF6jXCY2WyuFqkW1wPV8CsK" target="_blank">here</a>.<br /><br />Richard Pettigrew<br /><br /><b>Dutch Books, Money Pumps, and 'By Their Fruits' Reasoning</b> (5 March 2019)<br /><br />There is a species of reasoning deployed in some of the central arguments of formal epistemology and decision theory that we might call 'by their fruits' reasoning. It seeks to establish certain norms of rationality that govern our mental states by showing that, if your mental states fail to satisfy those norms, they lead you to make choices that have some undesirable feature. Thus, just as we might know false prophets by their behaviour, and corrupt trees by their evil fruit, so can we know that certain attitudes are irrational by looking not to them directly but to their consequences. For instance, the Dutch Book argument seeks to establish the norm of Probabilism for credences, which says that your credences should satisfy the axioms of the probability calculus. And it does this by showing that, if your credences do not satisfy those axioms, they will lead you to enter into a series of bets that, taken together, lose you money for sure (Ramsey 1931, de Finetti 1937). The Money Pump argument seeks to establish, among other norms, the norm of Transitivity for preferences, which says that if you prefer one option to another and that other to a third, you should prefer the first option to the third.
And it does this by showing that, if your preferences are not transitive, they will lead you, again, to make a series of choices that loses you money for sure (Davidson et al. 1955). Both of these arguments use 'by their fruits' reasoning. In this paper, I will argue that such arguments fail. I will focus particularly on the Dutch Book argument so that I can illustrate the points with examples. But the objections I raise apply equally to Money Pump arguments.<br /><br /><h2>The Dutch Book argument: an example</h2><br />Joachim is more confident that Sarah is an astrophysicist and a climate activist (proposition $A\ \&\ B$) than he is that she is an astrophysicist (proposition $A$). He is 60% confident in $A\ \&\ B$ and only 30% confident in $A$. But $A\ \&\ B$ entails $A$. So, intuitively, Joachim's credences are irrational.<br /><br />How can we establish this? According to the Dutch Book argument, we look to the choices that Joachim's credences will lead him to make. The first premise of that argument posits a connection between credences and betting behaviour. Suppose $X$ is a proposition and $S$ is a number, positive, negative, or zero. Then a £$S$ bet on $X$ is a bet that pays £$S$ if $X$ is true and £$0$ if $X$ is false. £$S$ is the stake of the bet. The first premise of the Dutch Book argument says that, if you have credence $p$ in $X$, you will buy a £$S$ bet on $X$ for anything less than £$pS$. That is, the more confident you are in a proposition, the greater a proportion of the stake you are prepared to pay to buy it. Thus, in particular:<br /><ul><li>Bet 1: Joachim will buy a £$100$ bet on $A\ \&\ B$ for £$50$;</li><li>Bet 2: Joachim will sell a £$100$ bet on $A$ for £$40$.</li></ul>The total net gain of these bets, taken together, is guaranteed to be negative. Thus, his credences will lead him to perform a pair of actions that, taken together, loses him money for sure. This is the second premise of the Dutch Book argument against Joachim.
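The guaranteed loss can be checked by brute force. The sketch below (an illustration of my own, with the stakes and prices as in the example above) computes Joachim's net gain from the pair of bets at each class of possible worlds:

```python
# Joachim's two bets: buy a £100 bet on A&B for £50; sell a £100 bet on A for £40.
# net(a, b) is his total gain at a world where A has truth value a and B has b.
def net(a, b):
    bought = (100 if (a and b) else 0) - 50  # payout minus price of the bought bet
    sold = 40 - (100 if a else 0)            # price received minus payout of the sold bet
    return bought + sold

for a in (True, False):
    for b in (True, False):
        print(a, b, net(a, b))
# A&B true: -10; A true, B false: -110; A false: -10. A loss at every world.
```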
We say that this pair of actions (buy the first bet for £$50$; sell the second for £$40$) is <i>dominated</i> by the pair of actions in which he refuses each bet (refuse the first bet; refuse the second). The latter pair is guaranteed to result in greater total value than the former pair; the latter pair results in no loss and no gain, while the former results in a loss for sure. The third premise of the Dutch Book argument contends that, since it is undesirable to choose a pair of dominated options, it is irrational to have credences that lead you to do this. Ye shall know them by their fruits.<br /><br />Thus, a Dutch Book argument has three premises. The first premise posits a connection between having a particular credence in a proposition and accepting certain bets on that proposition. The second is a mathematical theorem that shows that, if the first premise is true, and if your credences do not satisfy the probability axioms, they will lead you to make a series of choices that is dominated by some alternative series of choices you might have made instead; the third premise says that your credences are irrational if, together with the connection posited in the first premise, they lead you to choose a dominated series of options. My objection is this: there is no account of the connection between credences and betting behaviour that makes both the first and third premise plausible; those accounts strong enough to make the third premise plausible are too strong to make the first premise plausible. Our strategy will be to enumerate the possible putative accounts of that connection and show that either the first or the third premise is false when we adopt that account.<br /><br />Let $C(p, X)$ be the proposition that you have credence $p$ in proposition $X$; and let $B(x, S, X)$ be the proposition that you pay £$x$ for a £$S$ bet on $X$.
Then the first premise of the Dutch Book argument has the following form:<br /><br />For all credences $p$, propositions $X$, prices $x$, and stakes $S$, if $x < pS$<br />$$O(C(p, X) \rightarrow B(x, S, X))$$where $O$ is a modal operator. But which modal operator? Different answers to this constitute different versions of the connection between credences and betting behaviour that appears in the first and third premise of the Dutch Book argument. We will consider six different candidate operators and argue that none makes the first and third premises both true. The six candidates are: metaphysical necessity; nomological necessity; nomological high probability; deontic necessity; deontic possibility; and the modality of defeasible reasons.<br /><br /><h2>$O$ is metaphysical necessity</h2><br />We begin with metaphysical modality. According to this account, the first premise of the Dutch Book argument says that it is metaphysically impossible to have a credence of $p$ in $X$ while refusing to pay £$x$ for a £$S$ bet on $X$ (for $x < pS$). If you were to refuse such a bet, that would simply mean that you do not have that credence. This sort of account would be appealing to a behaviourist, who seeks an operational definition of what it means to have a particular precise credence in a proposition---a definition in terms of betting behaviour might well satisfy them.<br /><br />If this account were true, the third premise of the Dutch Book argument would be plausible. If having a set of mental states were to guarantee as a matter of metaphysical necessity that you'd make a dominated series of choices when faced with a particular series of decisions, that seems sufficient to show that those credences are irrational. The problem is that, as David Christensen (1996) shows, the account itself cannot be true. 
Christensen's central point is this: credences are often and perhaps typically connected to betting behaviour and decision-making more generally; but they are often and perhaps typically connected to other things as well, such as emotional states, conative states, and other doxastic states. If I have a high credence that my partner loves me, I'm likely to pay a high proportion of the stake for a bet on it; but I'm also likely to feel joy, plan to spend more time with him, hope that his love continues, and believe that we will still be together in five years' time. What's more, none of these connections is obviously more important than any other in determining that a mental state is a credence. And each might fail while the others hold. Indeed, as Christensen notes, in Dutch Book arguments, we are concerned precisely with those cases in which there is a breakdown of the rationally required connections between credences, namely, the connections described by the probability axioms. Having a credence in one proposition usually leads you to have at least as high a credence in another proposition it entails. But, as we saw in Joachim's case, this connection can break down. So, just as Joachim's case shows that it is metaphysically possible to have a particular credence that has all the other connections that we typically associate with it except the connection to other credences, so it must be at least metaphysically possible to have a credence that has all the other connections that we associate with it but not the connection to betting behaviour posited by the first premise. Such a mental state would still count as the credence in question because of all the other connections; but it wouldn't give rise to the apparently characteristic betting behaviour that is required to run the Dutch Book argument. Moreover, note that we need not assume that the credence has none of the usual connections to betting behaviour. Consider Joachim again.
Every Dutch Book to which he is vulnerable involves him buying a bet on $A\ \&\ B$ and selling a bet on $A$. That is, it involves him buying a bet on $A\ \&\ B$ with a positive stake and buying a bet on $A$ with a negative stake. So he would evade the argument if his credence in $A\ \&\ B$ were to lead him to buy the bets <i>with any stake</i> that the first premise says they will, while his credence in $A$ were only to lead him to buy the bets <i>with positive stake</i> that the first premise says they will. In this case, we'd surely say he has the credences we assign to him. But he would not be vulnerable to a Dutch Book argument.<br /><br />Thus, if $O$ is metaphysical necessity, the third premise might well be true; but the first premise is false.<br /><br /><h2>$O$ is nomological necessity</h2><br />Learning from the problems with the previous proposal, we might retreat to a weaker modality. For instance, we might suggest that $O$ is a nomological modality. There are two that it might be. We might say that the connection between credences and betting behaviour posited by the first premise is nomologically necessary---that is, it is entailed by the laws of nature. Or, we might say that it is nomologically highly probable---that is, the objective chance of the consequent given the antecedent is high. Let's take them in turn.<br /><br />First, $O$ is nomological necessity. The problem with this is the same as the problem with the suggestion from the previous section that $O$ is metaphysical necessity. Above, we imagined a mental state that had all the other features we'd typically expect of a particular credence in a proposition, except some range of connections to betting behaviour that was crucial for the Dutch Book argument. We noted that this would still count as the credence in question. All that needs to be added here is that the example we considered is not only metaphysically possible, but also nomologically possible. 
That is, this is not akin to an example in which the fine structure constant is different from what it actually is---in that case, it would be metaphysically possible, but nomologically impossible. There is no law of nature that entails that your credence will lead to particular betting behaviour.<br /><br />Thus, again, the first premise is false.<br /><br /><h2>$O$ is nomological high probability</h2><br />Nonetheless, while it is not guaranteed by the laws of nature that an individual with a particular credence in a proposition will engage in the betting behaviour posited by the first premise, it does seem plausible that they are very likely to do so---that is, the objective chance that they will display the behaviour given that they have the credence is high. In other words, while weakening from metaphysical to nomological necessity doesn't make the first premise plausible, weakening further to nomological high probability does. So let's suppose, then, that $O$ is nomological high probability. Unfortunately, this causes two problems for the third premise.<br /><br />Here's the first. Suppose I have credences in 1,000 mutually exclusive and exhaustive propositions. And suppose each credence is $\frac{1}{1,001}$. So they violate Probabilism. Suppose further that each credence is 99% likely to give rise to the betting behaviour mentioned in the first premise of the Dutch Book argument; and suppose that whether one of the credences does or not is independent of whether any of the others does or not. Then the objective chance that the set of 1,000 credences will give rise to the betting behaviour that will lose me money for sure is $0.99^{1,000} \approx 0.00004 \approx \frac{1}{23,163}$. And this tells against the third premise. After all, what is so irrational about a set of credences that will lead to a dominated series of choices less than once in every 20,000 times I face the bets described in the Dutch Book argument against me?<br /><br />Here's the second problem.
On the account we are considering, having a particular credence in a proposition makes it highly likely that you'll bet in a particular way. Let's say, then, that you violate Probabilism, and your credences do indeed result in you making a dominated series of choices. The third premise infers from this that your credences are irrational. But why lay the blame at the credences' door? After all, there is another possible culprit, namely, the probabilistic connection between the credence and the betting behaviour. Consider an analogy. Suppose that, as the result of some bizarre causal pathway, when I fall in love, it is very likely that I will feed myself a diet of nothing but mud and leaves for a week. I hate the taste of the mud and the leaves make me very sick, and so I lower my utility considerably by responding in this way. But I do it anyway. In this case, we would not, I think, say that it is irrational to fall in love. Rather, we'd say that what is irrational is my response to falling in love. Similarly, suppose I make a dominated series of choices and thus reveal some irrationality in myself. Then, for all the Dutch Book argument says, it might be that the irrationality lies not in the credences, but rather in my response to having those credences. <br /><br />Thus, on this account, the first premise is plausible, but the third premise is unmotivated, for it imputes irrationality to my credences when it might instead lie in my response to having those credences.<br /><br /><h2>$O$ is deontic necessity</h2><br />A natural response to the argument of the previous section is that the analogy between the credence-betting connection and the love-diet connection fails because the first is a rationally appropriate connection, while the latter is not. 
This leads us to suggest, along with Christensen (1996), that the connection between credences and betting behaviour at the heart of the Dutch Book argument is not governed by a descriptive modality, such as metaphysical or nomological modality, but rather by a prescriptive modality, such as deontic modality. In particular, it suggests that what the first premise says is not that someone with a particular credence in a proposition <i>will</i> or <i>might</i> or <i>probably will</i> accept certain bets on that proposition; but rather that they <i>should</i> or <i>may</i> or <i>have good but defeasible reason</i> to do so.<br /><br />Let's begin with deontic necessity. Here, my objection is that, if this is the modality at play in the first and third premise, then the argument is self-defeating. To see why, consider Joachim again. Suppose the modality is deontic necessity, and suppose that the first premise is true. So Joachim is rationally required to make a dominated series of choices---buy the £$100$ bet on $A\ \&\ B$ for £$50$; sell the £$100$ bet on $A$ for £$40$. Now suppose further that the third premise is true as well---it does, after all, seem plausible on this account of the modality involved. Then we conclude that Joachim's credences are irrational. But surely it is not rationally required to choose in line with irrational credences. Surely what is rationally required of Joachim instead is that he should correct his irrational credences so that they are now rational, and he should then choose in line with his new rational credences. Now, whatever other features they have, his new rational credences must obey Probabilism. If not, they will be vulnerable to the Dutch Book argument and thus irrational. But the Converse Dutch Book Theorem shows that, if they obey Probabilism, they will not rationally require or even permit Joachim to make a dominated series of choices.
And, in particular, they neither require nor permit him to accept both of the bets described in the original argument. But from this we can conclude that the first premise is false. Joachim's original credences do not rationally require him to accept both of the bets; instead, rationality requires him to fix up those credences and choose in line with the credences that result. But those new fixed up credences do not require what the first premise says they require. Indeed, they don't even permit what the first premise says they require. So, if the premises of the Dutch Book argument are true, Joachim's credences are irrational, and thus the first premise of the argument is false.<br /><br />Thus, on this account, the Dutch Book argument is self-defeating: if it succeeds, its first premise is false.<br /><br /><h2>$O$ is deontic possibility</h2><br />A similar problem arises if we take the modality to be deontic possibility, rather than necessity. On this account, the first premise says not that Joachim is required to make each of the choices in the dominated series of choices, but rather that he is permitted to do so. The third premise must then judge a person irrational if they are permitted to accept each choice in a dominated series of choices. If we grant that, we can conclude that Joachim's credences are irrational. And again, we note that rationality therefore requires him to fix up those credences first and then to choose in line with the fixed up credences. But just as those fixed up credences don't <i>require</i> him to make each of the choices in the dominated series, so they don't <i>permit</i> him to make them either. 
So the Dutch Book argument, if successful, undermines its first premise again.<br /><br />Again, the Dutch Book argument is self-defeating.<br /><br /><h2>$O$ is the modality of defeasible reasons</h2><br />The final possibility we will consider: Joachim's credences neither rationally require nor rationally permit him to make each of the choices in the dominated series; but perhaps we might say that each credence gives him a <i>pro tanto</i> or defeasible reason to accept the corresponding bet. That is, we might say that Joachim's credence of 60% in $A\ \&\ B$ gives him a <i>pro tanto</i> or defeasible reason to buy a £$100$ bet on $A\ \&\ B$ for £$50$, while his credence of 30% in $A$ gives him a <i>pro tanto</i> or defeasible reason to sell a £$100$ bet on $A$ for £$40$. As we saw above, those reasons must be defeasible, since they will be defeated by the fact that Joachim's credences, taken together, are irrational. Since they are irrational, he has stronger reason to fix up those credences and choose in line with the fixed up ones than he has to choose in line with his original credences. But his original credences nonetheless still provide some reason in favour of accepting the bets.*<br /><br />Rendered thus, I think the first premise is quite plausible. The problem is that the third premise is not. It must say that it is irrational to have any set of mental states where (i) each state in the set gives <i>pro tanto</i> reason to make a particular choice and (ii) taken together, that series of choices is dominated by another series of choices. But that is surely false. Suppose I believe this car in front of me is two years old and I also believe it's done 200,000 miles. The first belief gives me <i>pro tanto</i> or defeasible reason to pay £$5,000$ for it. The second gives me <i>pro tanto</i> reason to sell it for £$500$ as soon as I own it. Doing both of these things will lose me £$4,500$ for sure. But there is nothing irrational about my two beliefs.
The problem arises only if I make decisions in line with the reasons given by just one of the beliefs, rather than taking into account my whole doxastic state. If I were to attend to my whole doxastic state, I'd never pay £$5,000$ for the car in the first place. And the same might be said of Joachim. If he pays attention only to the reasons given by his credence in $A\ \&\ B$ when he considers the bet on that proposition, and pays attention only to the reasons given by his credence in $A$ when he considers the bet on that proposition, he will choose a dominated series of options. But if he looks to the whole credal state, and if the Dutch Book argument succeeds, he will see that its irrationality defeats those reasons and gives him stronger reason to fix up his credences and act in line with those. In sum, there is nothing irrational about a set of mental states each of which individually gives you <i>pro tanto</i> or defeasible reason to choose an option in a dominated series of options.<br /><br />On this account, the first premise may be true, but the third is false.<br /><br /><h2>Conclusion</h2><br />In conclusion, there is no account of the modality involved in the first and third premises of the Dutch Book argument that can make both premises true. Metaphysical and nomological necessity are too strong to make the first premise true. Nomological high probability is not, but it does not make the third premise true. Deontic necessity and possibility render the argument self-defeating, for if the argument succeeds, the first premise must be false. Finally, the modality of defeasible reasons, like nomological high probability, renders the first premise plausible. But it is not sufficient to secure the third premise.<br /><br />Before we conclude, let's consider briefly how these considerations affect money pump arguments.
The first premise of a money pump argument does not posit a connection between credences and betting behaviour, but between preferences and betting behaviour. In particular: if I prefer one option to another, there will be some small amount of money I'll be prepared to pay to receive the first option rather than the second. As with the Dutch Book argument, the question arises what the modal force of this connection is. And indeed the same candidates are available. What's more, the same considerations tell against each of those candidates. Just as credences are typically connected not only to betting behaviour but also to emotional states, intentional states, and other doxastic states, so preferences are typically connected to emotional states, intentional states, and other preferences. If I prefer one option to another, then this might typically lead me to pay a little to receive the first rather than the second; but it will also typically lead me to hope that I will receive the first rather than the second, to fear that I'll receive the second, to intend to choose the first over the second when faced with such a choice, and to have a further preference for the first and a small loss of money over the second. And again the connections to behaviour are no more central to this preference than the connections to the emotional states of hope and fear, the intentions to choose, and the other preferences. So the modal force of the connection posited by the first premise cannot be metaphysical or nomological necessity. And for the same reasons as above, it cannot be nomological high probability, deontic necessity or possibility, or the modality of defeasible reasons. In each case, the same objections hold. <br /><br />So these two central instances of 'by their fruits' reasoning fail. 
We cannot give an account of the connection between the mental states and their evil fruit that renders the argument successful.<br /><br />[* Thanks to Jason Konek for pushing me to consider this account.]<br /><h2>References</h2><br /><ul><li>Christensen, D. (1996). Dutch-Book Arguments Depragmatized: Epistemic Consistency for Partial Believers. <i>The Journal of Philosophy</i>, 93(9), 450–479.</li><li>Davidson, D., McKinsey, J. C. C., & Suppes, P. (1955). Outlines of a Formal Theory of Value, I. <i>Philosophy of Science</i>, 22(2), 140–160.</li><li>de Finetti, B. (1937). Foresight: Its Logical Laws, Its Subjective Sources. In H. E. Kyburg & H. E. Smokler (Eds.), <i>Studies in Subjective Probability</i>. Huntington, NY: Robert E. Krieger Publishing Co.</li><li>Ramsey, F. P. (1931). Truth and Probability. In <i>The Foundations of Mathematics and Other Logical Essays</i> (pp. 156–198).</li></ul>Richard Pettigrew<br /><br /><b>Credences in vague propositions: supervaluationist semantics and Dempster-Shafer belief functions</b> (2 February 2019)<br /><br />Safet is considering the proposition $R$, which says that the handkerchief in his pocket is red. Now, suppose we take <i>red</i> to be a vague concept. And suppose we favour a supervaluationist semantics for propositions that involve vague concepts. According to such a semantics, there is a set of legitimate precisifications of the concept <i>red</i>, and a proposition that involves that concept is true if it is true relative to all legitimate precisifications, false if false relative to all legitimate precisifications, and neither if true relative to some and false relative to others.
So <i>London buses are red</i> is true, <i>Daffodils are red</i> is false, and <i>Cherry blossom is red</i> is neither.<br /><br />Safet is assigning a credence to $R$ and a credence to its negation $\overline{R}$. He assigns 20% to $R$ and 20% to $\overline{R}$. Normally, we'd say that he is irrational, since his credences in mutually exclusive and exhaustive propositions don't sum to 100%. What's more, we'd demonstrate his irrationality using either<br /><br />(i) a sure loss betting argument, which shows there is a finite series of bets, each of which his credences require him to accept but which, taken together, are guaranteed to lose him money; or<br /><br />(ii) an accuracy argument, which shows that there are alternative credences in those two propositions that are guaranteed to be closer to the ideal credences.<br /><br />However, in Safet's case, both arguments fail.<br /><br />Take the sure loss betting argument first. According to that, Safet's credences require him to sell a £100 bet on $R$ for £30 and sell a £100 bet on $\overline{R}$ for £30. Thus, he will receive £60 from the sale of these two bets. Usually the argument proceeds by noting that, however the world turns out, either $R$ is true or $\overline{R}$ is true. So he will have to pay out £100 regardless. And he's therefore guaranteed to lose £40 overall. But, in a supervaluationist semantics, this assumption isn't true. If Safet's handkerchief is a sort of pinkish colour, $R$ will be neither true nor false, and $\overline{R}$ will be neither true nor false. So he won't have to pay out on either bet, and he'll gain £60 overall. <br /><br />Next, take the accuracy argument. According to that, his credences are more accurate the closer they lie to the ideal credences; and the ideal credence in a true proposition is 100% while the ideal credence in a proposition that isn't true is 0%.
Given that the measure of distance between credence functions has a particular property, we can usually show that there are alternative credences in $R$ and $\overline{R}$ that are closer to each set of ideal credences than Safet's are. For instance, if we measure the distance between two credence functions $c$ and $c'$ using the so-called squared Euclidean distance, so that $$SED(c, c') = (c(R) - c'(R))^2 + (c(\overline{R}) - c'(\overline{R}))^2$$ then credences of 50% in both $R$ and $\overline{R}$ are guaranteed to be closer than Safet's to the credences of 100% in $R$ and 0% in $\overline{R}$, which are ideal if $R$ is true, and closer than Safet's to the credences of 0% in $R$ and 100% in $\overline{R}$, which are ideal if $\overline{R}$ is true. Now, if $R$ is a classical proposition, then this covers all the bases---either $R$ is true or $\overline{R}$ is. But since $R$ has a supervaluationist semantics, there is a further possibility. After all, if Safet's handkerchief is a sort of pinkish colour, $R$ will be neither true nor false, and $\overline{R}$ will be neither true nor false. So the ideal credences will be 0% in $R$ and 0% in $\overline{R}$. And 50% in $R$ and 50% in $\overline{R}$ is not closer than Safet's to those credences. Indeed, Safet's are closer.<br /><br />So our usual arguments that try to demonstrate that Safet is irrational fail. So what happens next? The answer was given by Jeff Paris (<a href="http://www.maths.manchester.ac.uk/~jeff/papers/15.ps" target="_blank">'A Note on the Dutch Book Method'</a>). He argued that the correct norm for Safet is not Probabilism, which requires that his credence function is a probability function, and therefore declares him irrational. Instead, it is Dempster-Shaferism, which requires that his credence function is a Dempster-Shafer belief function, and therefore declares him rational.
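Both failures can be verified with a short computation. The sketch below (my illustration; the stakes, prices, and three classes of worlds are as in the example above) checks that the pair of bets yields Safet no sure loss, and that the rival 50%/50% credences fail to accuracy-dominate his:

```python
# Three classes of worlds, represented by the ideal credences (for R, for not-R):
# R true, not-R true, and the supervaluationist 'neither' case.
worlds = [(1, 0), (0, 1), (0, 0)]

# Sure loss check: Safet sells a £100 bet on R for £30 and one on not-R for £30,
# so he receives £60 and pays £100 for each proposition that turns out true.
def safet_net(r, not_r):
    return 60 - (100 if r else 0) - (100 if not_r else 0)

print([safet_net(r, nr) for r, nr in worlds])  # [-40, -40, 60]: no sure loss

# Accuracy check with squared Euclidean distance.
def sed(c, ideal):
    return sum((x - y) ** 2 for x, y in zip(c, ideal))

safet, rival = (0.2, 0.2), (0.5, 0.5)
for w in worlds:
    print(w, sed(safet, w), sed(rival, w))
# At (1, 0) and (0, 1) the rival is closer (0.5 < 0.68), but at (0, 0)
# Safet is closer (0.08 < 0.5): the rival does not accuracy-dominate him.
```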
To establish this, Paris showed how to tweak the standard sure loss betting argument for Probabilism, which depends on a background logic that is classical, to give a sure loss betting argument for Dempster-Shaferism, which depends on a background logic that comes from the supervaluationist semantics. To do this, he borrowed an insight from Jean-Yves Jaffray (<a href="https://link.springer.com/article/10.1007/BF00159221" target="_blank">'Coherent bets under partially resolving uncertainty and belief functions'</a>). Robbie Williams then appealed to Jaffray's theorem to tweak the accuracy argument for Probabilism to give an accuracy argument for Dempster-Shaferism (<a href="https://www.jstor.org/stable/41653758" target="_blank">'Generalized Probabilism: Dutch Books and Accuracy Domination'</a>). However, Jaffray's result doesn't explicitly mention supervaluationist semantics. And neither Paris nor Williams fill in the missing details. So I thought it might be helpful to lay out those details here.<br /><br />I'll start by sketching the argument. Then I'll go into the mathematical detail. So first, the law of credences that we'll be justifying. We begin with a definition. Throughout we'll consider only credence functions on a finite Boolean algebra $\mathcal{F}$. We'll represent the propositions in $\mathcal{F}$ as subsets of a set of possible worlds.<br /><br /><b>Definition (belief function)</b> Suppose $c : \mathcal{F} \rightarrow [0, 1]$. Then $c$ is a Dempster-Shafer belief function if<br /><ul><li>(DS1a) $c(\bot) = 0$</li><li>(DS1b) $c(\top) = 1$</li><li>(DS2) For any proposition $A$ in $\mathcal{F}$,$$c(A) \geq \sum_{B \subsetneqq A} (-1)^{|A-B|+1}c(B)$$</li></ul>Then we state the law:<b> </b><br /><br /><b>Dempster-Shaferism</b> $c$ should be a D-S belief function. <br /><br />Now, suppose $Q$ is a set of legitimate precisifications of the concepts that are involved in the propositions in $\mathcal{F}$. 
Essentially, $Q$ is a set of functions each of which takes a possible world and returns a classically consistent assignment of truth values to the propositions in $\mathcal{F}$. Given a possible world $w$, let $A_w$ be the strongest proposition that is true at $w$ on all legitimate precisifications in $Q$. If $A = A_w$ for some world $w$, we say that $A$ is a <i>state description for </i>$w$.<br /><br /><b>Definition (belief function$^*$) </b>Suppose $c : \mathcal{F} \rightarrow [0, 1]$. Then $c$ is a Dempster-Shafer belief function$^*$ relative to a set of precisifications if $c$ is a Dempster-Shafer belief function and<br /><ul><li>(DS3) For any proposition $A$ in $\mathcal{F}$ that is not a state description for any world, $$c(A) = \sum_{B \subsetneqq A} (-1)^{|A-B|+1}c(B)$$</li></ul><b>Dempster-Shaferism$^*$</b> $c$ should be a Dempster-Shafer belief function$^*$.<br /><br />It turns out that Dempster-Shaferism$^*$ is the strongest credal norm that we can justify using sure loss betting arguments and accuracy arguments. The sure loss betting argument is based on the following assumption: Let's say that a £$S$ bet on a proposition $A$ pays out £$S$ if $A$ is true and £0 otherwise. If your credence in $A$ is $p$, then you are required to pay any price less than £$pS$ for a £$S$ bet on $A$. With that in hand, we can show that you are immune to a sure loss betting argument iff your credence function is a Dempster-Shafer belief function$^*$. That is, if your credence function violates Dempster-Shaferism$^*$, then there is a finite set of bets on propositions in $\mathcal{F}$ such that (i) your credences require you to accept each of them, and (ii) together, they lose you money in all possible worlds. 
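Conditions (DS1a)-(DS3) lend themselves to a mechanical check. Here is a sketch in Python, with propositions represented as frozensets of worlds; the two-world model below (one world where a vague proposition $R$ is determinately true, one where it is borderline) is my own illustrative assumption:

```python
from itertools import combinations

def subsets(s):
    """All subsets of s, as frozensets."""
    s = list(s)
    return [frozenset(x) for r in range(len(s) + 1) for x in combinations(s, r)]

def ie_sum(c, A):
    """The sum over proper subsets B of A of (-1)^(|A - B| + 1) * c(B)."""
    return sum((-1) ** (len(A - B) + 1) * c[B] for B in subsets(A) if B != A)

def is_belief_function(c, worlds):
    """(DS1a), (DS1b) and the inequality (DS2)."""
    bot, top = frozenset(), frozenset(worlds)
    return (c[bot] == 0 and c[top] == 1
            and all(c[A] >= ie_sum(c, A) - 1e-9 for A in subsets(worlds)))

def is_belief_function_star(c, worlds, state_descriptions):
    """(DS1a)-(DS3): equality must hold at non-state-descriptions."""
    return (is_belief_function(c, worlds)
            and all(abs(c[A] - ie_sum(c, A)) < 1e-9
                    for A in subsets(worlds) if A not in state_descriptions))

# Worlds: 1 (R determinately true) and 2 (R borderline).
# State descriptions: A_1 = {1}, i.e. R itself, and A_2 = {1, 2}, the tautology.
worlds = {1, 2}
state_descriptions = {frozenset({1}), frozenset({1, 2})}

# Credence 1/2 in R and 0 in its negation: a belief function*.
c = {frozenset(): 0, frozenset({1}): 0.5, frozenset({2}): 0, frozenset({1, 2}): 1}
assert is_belief_function_star(c, worlds, state_descriptions)

# Raising the credence in the negation to 0.3 still satisfies (DS2), but
# violates (DS3): {2} is not a state description, so it must get exactly 0.
c2 = {**c, frozenset({2}): 0.3}
assert is_belief_function(c2, worlds)
assert not is_belief_function_star(c2, worlds, state_descriptions)
```

Note how (DS3) is what separates the two norms: the second credence function is a perfectly good Dempster-Shafer belief function, but not a belief function$^*$ relative to these precisifications.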
If your credence function satisfies Dempster-Shaferism$^*$, there is no such set of bets.<br /><br />The accuracy argument is based on the following assumption: The ideal credence in a proposition at a world is 1 if that proposition is true at the world, and 0 otherwise; and the distance from one credence function to another is measured by a particular sort of measure called a Bregman divergence. With that in hand, we can show that you are immune to an accuracy dominance argument iff your credence function is a Dempster-Shafer belief function$^*$. That is, if your credence function violates Dempster-Shaferism$^*$, then there is an alternative credence function that is closer to the ideal credence function than yours at every possible world. If your credence function satisfies Dempster-Shaferism$^*$, there is no such alternative.<br /><br />So much for the sketch of the arguments. Now for some more details. Suppose $c : \mathcal{F} \rightarrow [0, 1]$ is a credence function defined on the set of propositions $\mathcal{F}$. Often, we don't have to assume anything about $\mathcal{F}$, but in the case we're considering here, we must assume that it is a finite Boolean algebra. In both sure loss arguments and accuracy arguments, we need to define a set of functions, one for each possible world. In the sure loss arguments, these specify when certain bets pay out; in the accuracy arguments, they specify the ideal credences. In the classical case and in the supervaluationist case that we consider here, they coincide. Given a possible world $w$, we abuse notation and write $w : \mathcal{F} \rightarrow \{0, 1\}$ for the following function:<br /><ul><li>$w(A) = 1$ if $A$ is true at $w$---that is, if $A$ is true on all legitimate precisifications at $w$;</li><li>$w(A) = 0$ if $A$ is not true at $w$---that is, if $A$ is false on some (possibly all) legitimate precisifications at $w$. 
</li></ul>Then, given our assumptions, we have that a £$S$ bet on $A$ pays out £$Sw(A)$ at $w$; and we have that $w(A)$ is the ideal credence in $A$ at $w$. Now, let $\mathcal{W}$ be the set of these functions. And let $\mathcal{W}^+$ be the convex hull of $\mathcal{W}$: that is, the smallest convex set that contains $\mathcal{W}$, or, equivalently, the set of convex combinations of the functions in $\mathcal{W}$. There is then <a href="https://m-phi.blogspot.com/2013/09/the-mathematics-of-dutch-book-arguments.html" target="_blank">a general result</a> that says that $c$ is vulnerable to a sure loss betting argument iff $c$ is not in $\mathcal{W}^+$. And another general result that says that $c$ is accuracy dominated iff $c$ is not in $\mathcal{W}^+$. To complete our argument, therefore, we must show that $\mathcal{W}^+$ is precisely the set of Dempster-Shafer belief functions$^*$. That's the central purpose of this post. And that's what we turn to now.<br /><br />We start with some definitions that allow us to give an alternative characterization of the Dempster-Shafer belief functions and belief functions$^*$.<br /><br /><b>Definition (mass function)</b> Suppose $m : \mathcal{F} \rightarrow [0, 1]$. Then $m$ is a mass function if<br /><ul><li>(M1) $m(\bot) = 0$</li><li>(M2) $\sum_{A \in \mathcal{F}} m(A) = 1$</li></ul><b>Definition (mass function$^*$)</b> Suppose $m : \mathcal{F} \rightarrow [0, 1]$. 
Then $m$ is a mass function$^*$ relative to a set of precisifications if $m$ is a mass function and <br /><ul><li>(M3) For any proposition $A$ in $\mathcal{F}$ that is not the state description of any world, $m(A) = 0$.</li></ul><b>Definition ($m$ generates $c$)</b> If $m$ is a mass function and $c$ is a credence function, we say that $m$ generates $c$ if, for all $A$ in $\mathcal{F}$, $$c(A) = \sum_{B \subseteq A} m(B)$$ That is, a mass function generates a credence function iff the credence assigned to a proposition is the sum of the masses assigned to the propositions that entail it.<br /><br /><b>Theorem 1</b><br /><ul><li>$c$ is a Dempster-Shafer belief function iff there is a mass function $m$ that generates $c$.</li><li>$c$ is a Dempster-Shafer belief function$^*$ iff there is a mass function$^*$ $m$ that generates $c$.</li></ul><i>Proof of Theorem 1</i> Suppose $m$ is a mass function that generates $c$. Then it is straightforward to verify that $c$ is a Dempster-Shafer belief function. Suppose $c$ is a Dempster-Shafer belief function. Then let $$m(A) = c(A) - \sum_{B \subsetneqq A} (-1)^{|A-B|+1}c(B)$$ This is non-negative, since $c$ is a belief function. It is then straightforward to verify that $m$ is a mass function. And it is straightforward to see that $m(A) = 0$ iff $c(A) = \sum_{B \subsetneqq A} (-1)^{|A-B|+1}c(B)$. That completes the proof.<br /><br /><b>Theorem 2</b> $c$ is in $\mathcal{W}^+$ iff $c$ is a Dempster-Shafer belief function$^*$.<br /><br /><i>Proof of Theorem 2 </i>Suppose $c$ is in $\mathcal{W}^+$. So $c(-) = \sum_{w \in \mathcal{W}} \lambda_w w(-)$, for some weights $\lambda_w \geq 0$ with $\sum_{w \in \mathcal{W}} \lambda_w = 1$. Then:<br /><ul><li>if $A$ is the state description for world $w$ (that is, $A = A_w$), then let $m(A) = m(A_w) = \lambda_w$;</li><li>if $A$ is not a state description of any world, then let $m(A) = 0$.</li></ul>Then $m$ is a mass function$^*$. And $m$ generates $c$. So $c$ is a Dempster-Shafer belief function$^*$.<br /><br />Suppose $c$ is a Dempster-Shafer belief function$^*$ generated by a mass function$^*$ $m$. 
Then, for each world $w$ in $\mathcal{W}$, let $\lambda_w = m(A_w)$. Then $c(-) = \sum_{w \in \mathcal{W}} \lambda_w w(-)$. So $c$ is in $\mathcal{W}^+$.<br /><br />This completes the proof. And with the proof we have the sure loss betting argument and the accuracy dominance argument for Dempster-Shaferism$^*$ when the propositions about which you have an opinion are governed by a supervaluationist semantics.<br /><br />Richard Pettigrew
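For readers who like to watch the machinery run, here is a small numerical check of Theorems 1 and 2 in Python. The two-world model (world 1 where a vague proposition is determinately true, world 2 where it is borderline) is again my own illustrative assumption:

```python
from itertools import combinations

def subsets(s):
    """All subsets of s, as frozensets."""
    s = list(s)
    return [frozenset(x) for r in range(len(s) + 1) for x in combinations(s, r)]

# State descriptions: A_1 = {1} and A_2 = {1, 2}; recall w(A) = 1 iff A_w ⊆ A.
state_description = {1: frozenset({1}), 2: frozenset({1, 2})}
algebra = subsets({1, 2})

def w_func(A_w):
    """The function w(-) determined by the state description A_w."""
    return {A: 1.0 if A_w <= A else 0.0 for A in algebra}

# An element of W+: a convex combination of the world-functions, weights 1/2 each.
lam = {1: 0.5, 2: 0.5}
c = {A: sum(lam[w] * w_func(A_w)[A] for w, A_w in state_description.items())
     for A in algebra}

def mass(c, A):
    """Recover the mass: m(A) = c(A) - sum over B ⊊ A of (-1)^(|A-B|+1) c(B)."""
    return c[A] - sum((-1) ** (len(A - B) + 1) * c[B] for B in subsets(A) if B != A)

m = {A: mass(c, A) for A in algebra}

# As Theorems 1 and 2 predict, m is a mass function* that generates c:
assert abs(sum(m.values()) - 1) < 1e-9                       # (M2)
assert all(m[A] >= -1e-9 for A in algebra)                   # masses non-negative
assert all(abs(m[A]) < 1e-9 for A in algebra                 # (M3)
           if A not in state_description.values())
assert all(abs(c[A] - sum(m[B] for B in subsets(A))) < 1e-9  # m generates c
           for A in algebra)
```

Each state description carries exactly the weight of its world: here $m(A_1) = m(A_2) = 1/2$, recovering the convex-combination weights by Möbius inversion.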