
# LessWrong.com News

*Address:* https://www.lesswrong.com/

*Updated:* 49 minutes ago

### Covid 8/27: The Fall of the CDC

Most weeks, the disaster that was the head of the FDA not having any understanding of statistics and not making any attempt to think about the world would have been the headline. Then the CDC decided to revise its guidelines on testing from being for it to largely being against it, under pressure from the White House, and suddenly it’s the B-story.

It seems that every day there is a new thing surfacing to enrage me. The difference is that early in the pandemic, every day something would *terrify* me. I’m still periodically scared in an existential or civilization-is-collapsing-in-general kind of way, but not in a ‘the economy is about to collapse’ or ‘millions of Americans are about to die’ kind of way.

I’m not sure whether this is progress.

Either way, a reminder that I’ve started a substack on sports, sports gambling, and sports modeling to avoid cluttering up this blog, so check it out here if you have yet to do so and that is relevant to your interests.

That’s out of the way. Let’s run the numbers.

Positive Test Counts

| Date | WEST | MIDWEST | SOUTH | NORTHEAST |
|---|---|---|---|---|
| June 11-June 17 | 41,976 | 22,510 | 75,787 | 17,891 |
| June 18-June 24 | 66,292 | 26,792 | 107,221 | 15,446 |
| June 25-July 1 | 85,761 | 34,974 | 163,472 | 16,303 |
| July 2-July 8 | 103,879 | 40,139 | 202,863 | 18,226 |
| July 9-July 15 | 108,395 | 53,229 | 250,072 | 20,276 |
| July 16-July 22 | 117,506 | 57,797 | 265,221 | 20,917 |
| July 23-July 29 | 110,219 | 67,903 | 240,667 | 26,008 |
| July 30-Aug 5 | 91,002 | 64,462 | 212,945 | 23,784 |
| Aug 6-Aug 12 | 93,042 | 61,931 | 188,486 | 21,569 |
| Aug 13-Aug 19 | 80,887 | 63,384 | 156,998 | 20,857 |
| Aug 20-Aug 26 | 67,545 | 66,540 | 132,322 | 18,707 |

Apologies for the lack of charts last week – it seems they didn’t copy properly from Google Docs to WordPress and I didn’t notice.

Positive test counts are declining rapidly in the West and South, and slowly in the Northeast. Before, I would have considered that strong evidence that things are going great in those areas. Now, with testing on the decline, it’s not that simple, and we’ll have to dig into the positive test percentages and deaths to know for sure.

Deaths

| Date | WEST | MIDWEST | SOUTH | NORTHEAST |
|---|---|---|---|---|
| June 11-June 17 | 778 | 1,040 | 1,207 | 1,495 |
| June 18-June 24 | 831 | 859 | 1,204 | 1,061 |
| June 25-July 1 | 858 | 658 | 1,285 | 818 |
| July 2-July 8 | 894 | 559 | 1,503 | 761 |
| July 9-July 15 | 1,380 | 539 | 2,278 | 650 |
| July 16-July 22 | 1,469 | 674 | 3,106 | 524 |
| July 23-July 29 | 1,707 | 700 | 4,443 | 568 |
| July 30-Aug 5 | 1,831 | 719 | 4,379 | 365 |
| Aug 6-Aug 12 | 1,738 | 663 | 4,554 | 453 |
| Aug 13-Aug 19 | 1,576 | 850 | 4,264 | 422 |
| Aug 20-Aug 26 | 1,503 | 745 | 3,876 | 375 |

Deaths continue to tell the story that the South and West have turned the corner, the Northeast is still making steady progress, and the Midwest did indeed have deaths transferred from the Aug 6 – Aug 12 week to the Aug 13 – Aug 19 week and is still steady or slowly getting worse as of a few weeks ago.

Each day, we see the 7-day average deaths fall slightly. If anything, it’s a little too consistent and makes me suspicious, but that is *probably* paranoia. Probably.

Positive test percentages are actually up slightly in the South this week, down in the Northeast, and down slightly in the West:

| Percentages | Northeast | Midwest | South | West |
|---|---|---|---|---|
| 7/16 to 7/22 | 2.49% | 5.13% | 13.29% | 8.56% |
| 7/23 to 7/29 | 2.54% | 5.51% | 12.32% | 7.99% |
| 7/30 to 8/5 | 2.58% | 7.26% | 12.35% | 6.68% |
| 8/6 to 8/13 | 2.30% | 5.67% | 14.67% | 6.98% |
| 8/13 to 8/20 | 2.06% | 5.62% | 9.41% | 6.47% |
| 8/20 to 8/26 | 1.86% | 5.78% | 9.93% | 5.88% |

Hospitalizations are slightly down as well.

A decrease in testing *should* slightly increase positive test percentages, so this isn’t inconsistent with the South’s situation continuing to improve, but it’s also definitely not a good sign. A reminder that you can check the state-by-state data in my spreadsheet under the Positive Test Percentages tab here.
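To see why a decrease in testing should raise measured positivity, here is a toy sketch with invented numbers: if tests split into high-yield targeted testing (symptomatic or exposed people) and low-yield broad screening, cutting mostly the screening tests raises the measured positive rate even when nothing about the underlying epidemic changes.

```python
# Toy illustration (all numbers invented): why cutting back testing tends
# to raise the measured positive test percentage even at constant true
# prevalence. Two pools of tests: targeted (high hit rate) and broad
# screening (low hit rate).

def measured_positivity(targeted, screening, p_targeted=0.20, p_screening=0.02):
    """Blended positive test percentage for a given mix of test volumes."""
    positives = targeted * p_targeted + screening * p_screening
    return positives / (targeted + screening)

before = measured_positivity(targeted=1_000_000, screening=4_000_000)
# Cut mostly the low-yield screening tests; keep the targeted ones.
after = measured_positivity(targeted=1_000_000, screening=2_000_000)

print(f"before: {before:.1%}, after: {after:.1%}")
# before: 5.6%, after: 8.0%
```

So a mild rise in positivity alongside falling test counts is ambiguous on its own, which is why the deaths and hospitalization numbers are needed to disambiguate.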

For the first time, the Midwest did more tests this week than the West did, and they continue to have similar positive percentages on tests, so it will be interesting to see how long it takes their death differential to close.

Hawaii seems to have turned a corner, or at least it’s no longer seeing unchecked exponential growth. Arizona and Florida continue to make steady progress, with Arizona continuing to go faster. Texas news is not as good, and looks ambiguous due to its dramatic drop in testing.

South Dakota and Iowa saw big scary jumps in positive test percentages. North Dakota also got noticeably worse. Minnesota continues to be headed in the wrong direction and looks poised for a new wave of protests and unrest.

Test Counts

| Date | USA tests | Positive % | NY tests | Positive % | Cumulative Positives |
|---|---|---|---|---|---|
| June 11-June 17 | 3,453,440 | 4.6% | 442,951 | 1.1% | 0.66% |
| June 18-June 24 | 3,686,336 | 5.9% | 440,833 | 1.0% | 0.72% |
| June 25-July 1 | 4,352,981 | 7.1% | 419,696 | 1.2% | 0.82% |
| July 2-July 8 | 4,468,850 | 8.2% | 429,804 | 1.1% | 0.93% |
| July 9-July 15 | 5,209,243 | 8.4% | 447,073 | 1.1% | 1.06% |
| July 16-July 22 | 5,456,168 | 8.6% | 450,115 | 1.1% | 1.20% |
| July 23-July 29 | 5,746,056 | 7.9% | 448,182 | 1.1% | 1.34% |
| July 30-Aug 5 | 5,107,739 | 7.8% | 479,613 | 1.0% | 1.46% |
| Aug 6-Aug 12 | 5,121,011 | 7.3% | 502,046 | 0.9% | 1.58% |
| Aug 13-Aug 19 | 5,293,536 | 6.2% | 543,922 | 0.8% | 1.68% |
| Aug 20-Aug 26 | 4,785,056 | 6.0% | 549,232 | 0.8% | 1.77% |

After two weeks of what looked like stable test counts, we once again headed substantially in the wrong direction. The positive test percentage declining in spite of this is of course good news, but it raises the worry that testing has shifted to where it is least needed while being shut down where it is needed most. New York’s testing keeps expanding even as its case counts decline.
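As a sanity check on the table above, weekly positives should be roughly tests times the positive rate, and the cumulative-positives column should grow by about that amount as a share of the US population. A quick sketch (the 330 million population figure is my assumption, not from the table):

```python
# Sanity check: weekly positives ~= tests * positive rate, and the
# cumulative-positives column (share of US population) should grow by
# about that many percentage points per week.

US_POPULATION = 330_000_000  # assumption, not from the table

# (tests, positive rate) for the last two rows of the table
aug_13_19 = (5_293_536, 0.062)
aug_20_26 = (4_785_056, 0.060)

for (tests, rate), label in [(aug_13_19, "Aug 13-19"), (aug_20_26, "Aug 20-26")]:
    weekly_positives = tests * rate
    pct_point_gain = weekly_positives / US_POPULATION * 100
    print(f"{label}: ~{weekly_positives:,.0f} positives, +{pct_point_gain:.2f} pct points")
```

This matches the table: cumulative positives go from 1.58% to 1.68% (about +0.10) and then to 1.77% (about +0.09).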

The numbers look like one would have expected them to look last week, plus a decline in tests. There is slow improvement in deaths and hospitalizations. The outlook continues to be that things will continue much as they are for quite a while, in terms of the virus itself.

Thus, the news this week lies elsewhere.

#### Head of FDA Fails Statistics Forever

See this tweet from my friend Andrew Rettek, noting that the head of the FDA is far from alone in failing statistics forever.

(Or as the new cowards in charge of the page at TVTropes are now calling it, ‘Artistic License: Statistics.’ To me, that’s another sign of an alarming societal shift. It’s not ‘artistic license,’ it’s some combination of lying/fraud and a failure to understand how numbers work on a deep and essential level. There are times when it makes sense to use a little ‘artistic license’ with your physics or biology to tell a good story. But much more frequently, They Just Didn’t Care about what is true and what causes what in the physical world, the same as others who are making actual decisions with actual consequences don’t care either. If you don’t apply for the license, you don’t have one. Period. I call for the immediate and total restoration of failing forever.)

And there is no such license for statistics or mathematics. You just fail. Period.

The mistake here isn’t one that someone who knows how numbers work would ever make. It’s also not one that anyone who thinks about what their statements translate to in the real world would make.

It’s a statement one makes when one is doing word manipulations: one sees a 35% on a piece of paper, doesn’t notice at all that this comes from a drop from 11.9% to 8.7%, and decides that it means 35% of all patients will go from dying to not dying, rather than 3.2%.
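The arithmetic is worth spelling out. A quick sketch using the mortality figures quoted above (note that the headline 35% came from a particular subgroup comparison, so the gap between it and the raw relative reduction here is expected):

```python
# Relative vs. absolute risk reduction: a drop in mortality from
# 11.9% to 8.7%.
baseline, treated = 0.119, 0.087

absolute_reduction = baseline - treated             # share of ALL patients
relative_reduction = absolute_reduction / baseline  # share of would-be deaths

print(f"absolute: {absolute_reduction:.1%} of all patients")        # 3.2%
print(f"relative: {relative_reduction:.0%} of those who would die")  # 27%
```

Claiming that "35 out of 100 sick people would be saved" treats a relative reduction among the dying as an absolute reduction among everyone, which is exactly the confusion at issue.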

No, seriously, they doubled down on this pretty hardcore. Check this out from deeper in that thread:

And now the FDA's Twitter account is promoting Hahn's disinfo.

Full quote from Hahn: "Let me just put that in perspective… What that means is if… 100 people who are sick with covid-19, 35 would have been saved because of the administration of plasma." https://t.co/XL8he5YWQ6

He did eventually walk it back and apologize. But think about the mindset required to get things this wrong. To not stop and think ‘wait, I don’t think 35% of patients even die, something’s wrong’ and check again. Until verified, everything such people say has to be presumed likely to be complete nonsense or propaganda, with no relation to the physical world.

That doesn’t mean the treatment itself is bad.

#### A Miasma of Convalescent Plasma

So does convalescent plasma work?

Wired reports that 97,000 People Got Convalescent Plasma. Who Knows if It Works? The article contains lots of good information. Clearly epic ball droppings occurred. No one wanted to fund the clinical trials. If you’re wondering ‘all we had was money, what could we possibly have done to help?’: that’s something you could have done to help. We had trials ready to go, with no ethical issues; all they needed was funding. They didn’t happen, so now ‘no one knows if it works.’

Of course, that’s complete nonsense. It works.

It’s not the major breakthrough the President claimed it was. The range of possible effectiveness is ‘somewhat’ to ‘actually quite a bit’ but it’s still expensive in several senses to administer and all but certain to be supplanted by superior treatments soon. As that second link explains, convalescent plasma is a classic emergency stopgap strategy that is used until better options can come online.

Still, it works. If someone is a patient and the plasma is available, until better options come along (this is not medical advice, because legal reasons), there’s no reason I can see not to take the plasma. As everyone says, it’s a historically safe treatment and seems clearly safe here as well. In terms of how effective it is, we know that patients who took it earlier did better than those who took it later, and those who got a stronger dose did better than those who got a weaker dose.

Which likely means that even the strongest doses were probably smaller than they should have been, given that we have access to plentiful plasma. More dakka!

People can protest that ‘there are no RCTs’ all they like. That doesn’t make Bayes’ rule go away. This one is very cut and dried.

The issue is that if one is doing a tradeoff between the logistical costs of treatment versus spending those resources elsewhere, it’s hard to know what value to put down, but a good estimate does not seem that hard. The 35% number seems large in the sense that it is somewhat cherry picked, but it also seems like there’s a lot of room for improvement in the protocol in terms of timing, quantity/quality and procedure, as we learn more. So if we can get our testing done in a timely manner and get the treatment to patients quickly, my guess is this is good for at least that much. Which is a big forking deal.

Of course, a better treatment for those who get hospitalized if anything makes it harder to squash the virus, rather than easier, given the resulting changes in behavior. So this doesn’t put us any closer to normal. Still, great news as far as it goes.

#### FDA Fails To Approve Tests, People Die and Economy Collapses

A ‘mid-mortem’ came out this week on the FDA’s test approval policy. It mostly reinforces what we already knew: the FDA made an active choice to shut down testing and force all action to go through proper channels, refusing to use its discretion; it then continued to require constantly changing actions and paperwork in order to grant approval, while not prioritizing the handling of said actions and paperwork on its end. Nor did it communicate its changing priorities and rules to those who needed to follow them.

Many companies report waiting *for months* for word from the FDA, after filling out what they believed were the necessary requirements and paperwork, only to have their tests stranded in limbo.

The biggest news there is that this process wasn’t only wholly unnecessary and monumentally destructive, but according to an HHS statement on August 19, much of it was not even within the FDA’s jurisdiction to impose; many of its actions lacked a legal basis.

Once again: FDA delenda est.

#### Musician Plays For Crowd

In ‘someone actually did any study of anything at all, so I’ll take it’ news, Coronavirus: Germany puts on crowded concerts to study risks.

To quote the article: The concert study, called Restart-19, was created “to investigate the conditions under which such events can be carried out despite the pandemic,” researchers said.

As far as I can tell, what they studied was *where people stood during the concert*?

It’s not clear what else they could have done with this setup. It’s considered unethical to let anyone assume Covid-19 risk in order to help anyone else, so they made sure to test everyone for Covid-19 before the concert. At which point, they studied… the complete non-spreading of Covid-19 that took place, since everyone had tested negative? That does not seem super useful.

If you want to ‘investigate the conditions under which such events can be carried out’ and you’ve already decided the answer is ‘by testing everyone before the concert’ then I believe you have your answer. The good news is all we have to do is let people use reasonable and cheap tests, and then all that testing can be funded with less of an extra fee than the Ticketmaster ‘online purchase’ surcharge on tickets.

#### Area Couple Does Math, Also Each Other, Probably Didn’t Actually Do Math

A common question I get asked, and also that I ask myself, is what activities are how risky (Google thinks the ‘are how risky’ is a grammar error and wants to correct it to ‘are risky,’ which explains a lot of how we’re in this mess!), and whether a given activity is a sane or responsible thing to be doing given the pandemic.

What about dating and casual sex, or sex in general? (link is to Washington Post)

Several times there have been fun news items about municipalities trying to ban three ways, or encourage glory holes, or other similar things. I never thought any of those attempts would accomplish anything, but I was always both amused and delighted by such efforts. They represent an attempt to think about the physical world and the consequences of actions, and make decisions on that basis.

Setting aside that casual sex, dating, and the pursuit of happiness in general are considered by the Very Serious People to be ‘inessential,’ making it bad to ‘take risk’ on them: how risky is it, and are there steps one can take to limit that risk?

The upper bound on risk is that you’re *completely* exposed to one person. If they have it, you get it. It’s hard to think of something that would more reliably spread Covid-19 than an extended French kiss. That certainly has a leg up on ‘droplet spread’ and is a much scarier prospect than the sexual act itself.

It seems *very* reasonable to pick your spots and then say ‘you know what, I don’t want to worry about this, and it’s not clear that in practice I can reduce the risk much further with precautions that would be worth it, so we’ll just accept that risk and do whatever.’ Or perhaps you can make an effort to avoid anything touching anyone’s nose or mouth, or even leave the masks fully on while everything else comes off. To each their own.

But is it reasonable to expose yourself completely to one person in order to get lucky? If everyone is young and has been mostly isolating, and has no symptoms, this seems like it ends up pretty far down on the actual list of downsides to what you’re about to do. If the encounter would have been super exciting before, and again if everyone is otherwise being responsible and isn’t high risk, judicious amounts of this seem to me like an efficient use of one’s (and society’s) risk points. But keep in mind that doing it more with more different people dramatically increases the risk per encounter, so let’s not go too crazy.

#### Let’s Go, Mets

My team, the New York Mets, missed five games because one player and one staff member tested positive for Covid-19. Of course, this happened in Miami. Out of an abundance of caution, the team returned to New York and quarantined for five days before resuming play.

It’s good to see baseball acting responsibly. If anything, this was more caution than necessary.

Since more sports games are now being cancelled for purely symbolic protest reasons than for Covid-19 outbreaks, I’d say the restart of sports has been a smashing success.

#### The Simulacra of a Mask

While watching the disaster that is the Mets, I saw an ad imploring us all to wear face masks. It showed pictures of people wearing masks that said things like “for my husband” on them. I worry about the message ‘you wear a mask for other people’ given how people are sometimes, but mostly I don’t worry about this. Then the slogan across the screen: Show You Care. Wear a Mask.

So we’ve fully moved on to simulacra level 3.

We could go with the Level 1 message that masks prevent infection. We could go with the Level 2 message that they don’t, as many have tried at various points for various reasons, or that they are far more effective than they actually are, which has also been tried. Such messages are false, but they rest on the premise that people will decide whether masks help, and then do what is helpful and not what is unhelpful.

Instead, we are being given the message that we must signal that we care about others, therefore we must wear a mask.

The mask is being advertised as similar to not playing a basketball game to show you care about police violence. The basketball game does not cause police violence.

The message is, hey, we know that wearing masks makes today a little bit worse, but it’s worth it, because you need to virtue signal. Bad things are happening, so you need to give up your nice thing, take away nice things and pressure others to give up their nice things, or people will think you do not care. And maybe, just maybe, if enough of us show we care, then someone will care enough to *actually forking do something useful*.

This is very important, so I want to say this as clearly as I can: Fuck. That. Shit.

Wearing a mask is vital to preventing Covid-19 infection. Wear a forking mask.

But only because it *works*.

If you don’t think it works, don’t do it.

#### You Had One Job

The CDC decides third time’s the charm, tries once more with feeling, and insists it needs to join the Delenda Est club.

It’s making a strong case!

You are literally called the *Centers for Disease Control*. Your one job is to *control disease*.

So I want to know what the flying fork is going on over there, given the new guidance they issued this week. First link is the guidance, second and third are news coverage.

An attempt at a “good faith” interpretation of the new testing guidelines, that you ‘do not necessarily need a test’ even if you have been within 6 feet of a *known positive* for *fifteen minutes*, is that the CDC is *lying*. That they are doing the exact same thing with testing that was previously done with masks.

The plan would be to say:

- Testing doesn’t work for you!
- Save the testing for those who really need it!

Because there aren’t enough tests, or doing too many tests is slowing down our turnaround times.

An attempt at an *actual* good faith interpretation is that these tests aren’t actually useful. That if you get exposed, getting a test has no value until you have symptoms. The argument goes that a negative test doesn’t mean you won’t become positive later, so you have to quarantine anyway, so what’s the point in bothering. Because that’s how humans work, and that’s how human lives function. And no one else might want to know if *they* had been exposed in turn.

An attempt at a not-so-good faith interpretation is that this is the result of pressure from the White House, who want to suppress testing so the numbers look superficially better.

Those were the three I could come up with.

And we have our answer, which I saw after writing the rest of this section. The directive came under pressure from the White House.

Then there’s the other half of the new guidance, which states that travelers no longer need to quarantine on arrival. As one would expect, local officials everywhere seem to disagree, such as in California. I would be floored if New York lifted its travel restrictions.

Basically, the CDC is taking off the table the two things we actually know how to do to stop the spread, beyond modifying day-to-day normal behavior. Our four tactics are, essentially: wear a mask, socially distance, test and contact trace, and restrict travel to low-infection areas. The CDC is saying that contact tracing and a lot of testing isn’t worth it so don’t bother, and that restricting travel isn’t worth it. Except that those are *the only tactics that have been shown, anywhere, to actually work*.

We need more testing. Even the money Congress allocated for testing, paltry as it was, mostly hasn’t been spent. Test counts are *declining* rather than increasing over the last few weeks, which is alarming. And while we need more testing, it would make sense to tell people who are low priority that they can’t be tested if tests were scarce. But this is something else, and the stand against quarantines on travel makes no sense except maybe as some effort by spiteful people to get back at the blue states, either by tricking them into following it, or by getting to argue they are not ‘following CDC guidelines.’

The problem with putting the CDC in the Delenda Est club is that the CDC actually does have its one job to do. We should be doing disease control. The problem with the FDA is that *it is doing its job*. The problem with the CDC is that *it is NOT doing its job*. Big difference. So we need to fix the problems, but alas, I can’t let the CDC into the club. Maybe we need to burn it down and start over, but we’d then need to rebuild it, faster, stronger, better. Under no circumstances should we salt the earth.

But for now, we should treat the CDC as having been captured by the White House, and the White House as fully committed to misinformation. Thus, the CDC is now fully committed to misinformation. Going forward, at least until Trump leaves office, treat anything and everything from the CDC with extreme caution, as plausibly motivated propaganda efforts to re-elect the President.


### Technical model refinement formalism

Introduction

MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} This post aims to formalise the models and model changes/model splintering described in this post. As explained there, the idea is to have a meta-model sufficiently general to be able to directly capture the process of moving from one imperfect model to another.

A note on infinityFor simplicity of exposition, I'll not talk about issues of infinite sets, continuity, convergence, etc... Just assume that any infinite set that comes up is just a finite set, large enough for whatever practical purpose we need it for.

Features, worlds, environmentsA model M is defined by three object, the set F of features, the set E of environments, and a probability distribution Q. We'll define the first two in this section.

FeaturesFeatures are things that might be true or not about worlds, or might take certain values in worlds. For example, "the universe is open" is a possible feature about our universe, "the temperature is 250K" is another possible feature, but instead of returning true or false, it returns the temperature value. Adding more details, such as "the temperature is 250K in room 3, at 12:01" show that features should also be able to take inputs: features are functions.

But what about "the frequency of white light"? That's something that makes sense in many models - white light is used extensively in many contexts, and light has a frequency. The problem with that statement is that light has multiple frequencies; so we should allow features to be, at least in some cases, multivalued functions.

To top that off, sometimes there will be no correct value for a function; "the height of white light" is something that doesn't mean anything. So features have to include partial functions as well.

Fortunately, multivalued and partial functions are even simpler than functions at the formal level: they are just relations. And since the sets in the relations can consist of a single element, in even more generality, a feature is a predicate on a set. We just need to know which set.

So, formally, a feature F∈F consists of an (implicit) label defining what F is (eg "open universe", "temperature in some location") and a set on which it is a predicate. Thus, for example, the features above could be:

- Fopen universe={0} (features which are simply true or false are functions of a single element).
- Ftemperature={R+}.
- Ftemperature at location and time={L×T×R+}, for some set L of locations and T of possible times.
- Ffrequency of specific light={R+}.
- Fheight of object={O×R+} for O a set of objects.

Note these definitions are purely syntactic, not semantic: they don't have any meaning. Indeed, as sets, Ftemperature and Ffrequency of specific light are identical. Note also that there are multiple ways of defining the same things; instead of a single feature Ftemperature at location and time, we could have a whole collection of Ftemperature at land t for all (l,t)∈L×T.

WorldsIn Abram's orthodox case against utility functions he talks about the Jeffrey-Bolker axioms, which allows the construction of preferences from events *without needing full worlds at all*.

Similarly, this formalism is not focused on worlds, but it can be useful to define the full set of worlds for a model. This is simply the possible values that all features could conceivably take; so, if ¯¯¯¯¯F=⊔FF is the disjoint union of all features in F (seen as sets), the set of worlds W is just W=2¯¯¯¯F, the powerset of ¯¯¯¯¯F - equivalently, the set of all functions from ¯¯¯¯¯F to {True,False}.

So W just consists of all things that could be conceivably distinguished by the features. If we need more discrimination than this - just add more features.

EnvironmentsThe set of environments is a subset E of W, the set of worlds (though it need not be defined via W; it's a set of functions from ¯¯¯¯¯F to {True,False}.

Though this definition is still syntactic, it starts putting some restrictions on what the semantics could possibly be, in the spirit of this post.

For example, E could restrict to situations where Ftemperature is a single valued function, while Ffrequency of specific light is allowed to be multivalued. And similarly, Fheight of a specific oject takes no defined values on anything in the domain of Ffrequency of specific light.

ProbabilityThe simplest way of defining Q is as a probability distribution over E.

This means that, if E1 and E2 are subsets of E, we can define the conditional probability

Q(E1∣E2)=Q(E1∩E2∣E2).

Once we have such a probability distribution, then, if the set of features is rich enough, this puts a lot more restrictions on the meaning that these features could have, going a lot of the way towards semantics. For example, if Q captures the ideal gas laws, then there is a specific relation between temperature, pressure, volume, and amount of substance - whatever those features are labelled.

In general, we'd want Q to be expressible in a simple way from the set F of features; that's the point of having those features in the first place.

Broader definition of "probability"The plan for this meta-formalism is to allow transition from imperfect models to other imperfect models. So requiring that they have a probability distribution over all of E may be too much to ask.

In practice, all that is needed is expressions of the type Q(E1∣E2). And these may not be needed for all E1, E2. For example, to go back to the ideal gas laws, it makes perfect sense that we can deduce temperature from the other three features. But what if E2 just fixed the volume - can we deduce the pressure from that?

With Q as a prior over E, we can, by getting the pressure and amount of substance from the prior. But many models don't include these priors, and there's no reason to avoid those.

So, in the more general case, instead of E⊂W, define E⊂2W×2W, so that, for all (E1,E2)∈E, the following probability is defined:

Q(E1∣E2).

To insure consistency, we can require Q to follow axioms similar to the two-valued probabilities appendix *IV in Popper's "Logic of Scientific Discovery".

In full generality, we might need an even more general or imperfect definition of Q. But I'll leave this aside for the moment, and assume the simpler case where Q is a distribution over E.

RefinementHere we'll look at how one can improve a model. Obviously, one can get a better Q, or a more expansive E, or a combination of these. Now, we haven't talked much about the quality of Q, and we'll leave this underdefined. Say that Q∗⪰Q means that Q∗ is 'at least as good as Q'. The 'at least as good' is specified by some mix of accuracy and simplicity.

More expansive E means that the environment of the improvement can be bigger. But in order for something to be "bigger", we need some identification between the two environments (which, so far, have just been defined as subsets of the powerset of feature values).

So, let M=(F,E,Q) and M∗=(F∗,E∗,Q∗) be models, let E∗0 be a subset of E∗, and let q be a surjective map from E∗0 to E (for an e∈E, think of q−1(e)⊂E∗0, the preimage of q, as the set of all environments in E∗ that correspond to e).

We can define Q∗0 on E in the following manner: if E1 and E2 are subsets of E, define

Q∗0(E1∣E2)=Q∗(q−1(E1)∣q−1(E2)).

Then q defines M∗ as a refinement of M if:

- Q∗0⪰Q.

Here are some examples of different types of refinements:

- Q-improvement: F=F∗, E=E∗, Q∗⪰Q (eg using the sine of the angle rather than the angle itself for refraction).
- Environment extension: F=F∗, E⊊E∗, E∗0=E with q the identity, Q∗=Q on E (eg moving from a training environment to a more extensive test environment).
- Natural extension: environment extension where Q is simply defined in terms of F on E, and this extends to Q∗ on E∗ (eg extending Newtonian mechanics from the Earth to the whole of the solar system).
- Non-independent feature extension: F⊊F∗. Let πF be the map that takes an element of W∗ and maps it to W by restricting[1] to features in F. Then πF=q on E∗0, and Q∗0=Q (eg adding electromagnetism to Newtonian mechanics).
- Independent feature extension: as a non-independent feature extension, but E∗0=E∗, and the stronger condition for Q∗ that Q(E1∣E2)=Q∗(q−1(E1)∣E∗2) for any E∗2 with q(E∗2)=E2 (eg non-colliding planets modelled without rotation, changing to modelling them with (mild) rotation).
- Feature refinement: F⊊F∗ (moving from the ideal gas models to the van der Waals equation).
- Feature splintering: when there is no single natural projection E∗→E that extends q (eg Blegg and Rube generalisation, happiness and human smile coming apart, inertial mass in general relativity projected to Newtonian mechanics...)
- Reward function splintering: no single natural extension of the reward function on E from E′=q−1(E) to all of E∗ (any situation where a reward function, seen as a feature, splinters).

Let M∗={F∗,E∗,Q∗} be a refinement of M={F,E,Q} (via q), and let R be a reward function defined on E.

A *refactoring* of R on M∗, is a reward function R∗ on E∗ such that for all e∗∈E∗, R(q(e∗))=R∗(e∗)). A **natural** refactoring is an extension of R is a refactoring that satisfies some naturalness or simplicity properties. For example, if R is the momentum of an object in M, and if momentum still makes sense in M∗, then this should be a natural refactoring.

If there does not exist a unique natural refactoring of R on M∗, then the refinement from M to M∗ splinters R.

Feature splinteringLet R be the indicator function for a feature being equal to some element or in some range. If R splinters in a refinement, then so does that feature.

Note that W∗ is the set of all functions from ¯¯¯¯¯F∗ to {True,False}. Since F⊂F∗, ¯¯¯¯¯F=⊔FF⊂⊔F∗F=¯¯¯¯¯F∗. Then we can project from W∗ to W by restricting a function to its values on ¯¯¯¯¯F. ↩︎

Discuss

### Model splintering: moving from one imperfect model to another

1. The big problem

In the last few months, I've become convinced that there is a key meta-issue in AI safety; a problem that seems to come up in all sorts of areas..mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml 
{line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: 
MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: 
local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main 
Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face 
{font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}

It's hard to summarise, but my best phrasing would be:

- Many problems in AI safety seem to be variations of "this approach seems safe in this imperfect model, but when we generalise the model more, it becomes dangerously underdefined". Call this **model splintering**.
- It is intrinsically worth studying how to (safely) transition from one imperfect model to another. This is worth doing, independently of whatever "perfect" or "ideal" model might be in the background of the imperfect models.

This sprawling post will be presenting examples of model splintering, arguments for its importance, a formal setting allowing us to talk about it, and some uses we can put this setting to.

1.1 In the language of traditional ML

In the language of traditional ML, we could connect all these issues to "out-of-distribution" behaviour. These are the problems that algorithms encounter when the set they are operating on is drawn from a different distribution than the training set they were trained on.

Humans can often see that the algorithm is out-of-distribution and correct it, because we have a more general distribution in mind than the one the algorithm was trained on.

In these terms, the issues of this post can be phrased as:

- When the AI finds itself mildly out-of-distribution, how best can it extend its prior knowledge to the new situation?
- What should the AI do if it finds itself strongly out-of-distribution?
- What should the AI do if it finds itself strongly out-of-distribution, and humans don't know the correct distribution either?
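These distinctions can be made concrete with a toy sketch (the function names and the range-based scoring scheme are my own illustration, not anything from the post): score how far a new input's features fall outside the ranges seen in training.

```python
# Toy illustration: grade how far an input lies outside the feature
# ranges seen in training (the scoring scheme is invented for this sketch).

def train_ranges(training_set):
    """Record the min/max value seen for each feature during training."""
    keys = training_set[0].keys()
    return {k: (min(x[k] for x in training_set),
                max(x[k] for x in training_set)) for k in keys}

def ood_score(ranges, x):
    """0 = within training ranges; larger = further out-of-distribution."""
    score = 0.0
    for k, (lo, hi) in ranges.items():
        width = (hi - lo) or 1.0
        if x[k] < lo:
            score += (lo - x[k]) / width
        elif x[k] > hi:
            score += (x[k] - hi) / width
    return score

ranges = train_ranges([{"size": 1.0, "hue": 0.2},
                       {"size": 2.0, "hue": 0.4}])
print(ood_score(ranges, {"size": 1.5, "hue": 0.3}))  # 0.0: in distribution
print(ood_score(ranges, {"size": 3.0, "hue": 0.3}))  # 1.0: mildly out
```

Small scores are the "mildly out-of-distribution" cases where extrapolation may be fine; large scores are the cases where, as argued below, the model (and any reward defined on it) may splinter.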

Let's build a more general framework. Say that you start with some brilliant idea for AI safety/alignment/effectiveness. This idea is phrased in some (imperfect) model. Then "model splintering" happens when you or the AI move to a new (also imperfect) model, such that the brilliant idea is undermined or underdefined.

Here are a few examples:

- You design an AI CEO as a money maximiser. Given typical assumptions about the human world (legal systems, difficulties in one person achieving massive power, human fallibilities), this results in an AI that behaves like a human CEO. But when those assumptions fail, the AI can end up feeding the universe to a money-making process that produces nothing of any value.
- Eliezer defined "rubes" as smooth red cubes containing palladium that don't glow in the dark. "Bleggs", on the other hand, are furred blue eggs containing vanadium that glow in the dark. To classify these, we only need a model with two features, "rubes" and "bleggs". Then along comes a furred red egg containing vanadium that doesn't glow in the dark. The previous model doesn't know what to do with it, and if you get a model with more features, it's unclear what to do with this new object.
- Here are some moral principles from history: honour is important for anyone. Women should be protected. Increasing happiness is important. These moral principles made sense in the world in which they were articulated, where features like "honour", "gender", and "happiness" are relatively clear and unambiguous. But the world changed, and the models splintered. "Honour" became hopelessly confused centuries ago. Gender is currently finishing its long splintering (long before we got to today, gender started becoming less useful for classifying people, hence the consequences of gender splintered a long time before gender itself did). Happiness, or at least hedonic happiness, is still well defined, but we can clearly see how this is going to splinter when we talk about worlds of uploads or brain modification.
- Many transitions in the laws of physics - from the ideal gas laws to the more advanced van der Waals equations, or from Newtonian physics to general relativity to quantum gravity - will cause splintering if preferences were articulated in concepts that don't carry over well.

In all those cases, there are ways of improving the transition, without needing to go via some idealised, perfect model. We want to define the AI CEO's task in more generality, but we don't need to define this across every possible universe - that is not needed to restrain its behaviour. We need to distinguish any blegg from any rube we are likely to encounter; we don't need to define the platonic essence of "bleggness". For future splinterings - when hedonic happiness splinters, when we get a model of quantum gravity, etc. - we want to know what to do then and there, even if there are further splinterings subsequent to those.

And I think that model splintering is best addressed directly, rather than using methods that go via some idealised perfect model. Most approaches seem to go for approximating an ideal: from AIXI's set of all programs, the universal prior, KWIK ("Knowing what it knows") learning with a full hypothesis class, and Active Inverse Reward Design with its full space of "true" reward functions, to Q-learning, which assumes any Markov decision process is possible. Then the practical approaches rely on approximating this ideal.

Schematically, we can see M^∞ as the ideal, M^∞_i as M^∞ updated with information to time i, and M_i as an approximation of M^∞_i. Then we tend to focus on how well M_i approximates M^∞_i, and on how M^∞_i changes to M^∞_{i+1} - rather than on how M_i relates to M_{i+1}; that transition (the red arrow in the post's diagram) is underanalysed.

2 Why focus on the transition?

But why is focusing on the M_i → M_{i+1} transition important?

2.1 Humans reason like this

A lot has been written about image recognition programs going "out-of-distribution" (encountering situations beyond their training environment) or succumbing to "adversarial examples" (examples from one category that have the features of another). Indeed, some people have shown how to use labelled adversarial examples to improve image recognition.

You know what this reminds me of? Human moral reasoning. At various points in our lives, we humans seem to have pretty solid moral intuitions about how the world should be. And then, we typically learn more, realise that things don't fit in the categories we were used to (go "out-of-distribution") and have to update. Some people push stories at us that exploit some of our emotions in new, more ambiguous circumstances ("adversarial examples"). And philosophers use similarly-designed thought experiments to open up and clarify our moral intuitions.

Basically, we start with strong moral intuitions on under-defined features, and when the features splinter, we have to figure out what to do with our previous moral intuitions. A lot of developing moral meta-intuitions is about learning how to navigate these kinds of transitions; AIs need to be able to do so too.

2.2 There are no well-defined overarching moral principles

Moral realists and moral non-realists agree more than you'd think. In this situation, we can agree on one thing: there is no well-described system of morality that can be "simply" implemented in an AI.

To over-simplify, moral realists hope to discover this moral system, moral non-realists hope to construct one. But, currently, it doesn't exist in an implementable form, nor is there any implementable algorithm to discover/construct it. So the whole idea of approximating an ideal is wrong.

All humans seem to start from a partial list of moral rules of thumb, rules that they then have to extend to new situations. And most humans do seem to have some meta-rules for defining moral improvements, or extensions to new situations.

We don't know perfection, but we do know improvements and extensions. So methods that deal explicitly with that are useful. Those are things we can build on.

2.3 It helps distinguish areas where AIs fail from areas where humans are uncertain

Sometimes the AI goes out-of-distribution, and humans can see the error (no, flipping the lego block doesn't count as putting it on top of the other). There are also cases where humans themselves go out-of-distribution (see for example siren worlds).

It's useful to have methods available for both AIs and humans in these situations, and to distinguish them. "Genuine human preferences, not expressed in sufficient detail" is not the same as "human preferences fundamentally underdefined".

In the first case, the AI needs more human feedback; in the second case, it needs to figure out a way of resolving the ambiguity, knowing that soliciting feedback is not enough.

2.4 We don't need to make the problems harder

Suppose that quantum mechanics is the true underlying physics of the universe, with some added bits to include gravity. If that's true, why would we need a moral theory valid in every possible universe? It would be useful to have that, but it would be strictly harder than a theory valid only in the actual universe.

Also, some problems might be entirely avoided. We don't need to figure out the morality of dealing with a willing slave race - if we never encounter or build one in the first place.

So a few degrees of "extend this moral model in a reasonable way" might be sufficient, without needing to solve the whole problem. Or, at least, without needing to solve the whole problem in advance - a successful nanny AI might be built on these kinds of extensions.

2.5 We don't know how deep the rabbit hole goes

In a sort of converse to the previous point, what if the laws of physics are radically different from what we thought - what if, for example, they allow some forms of time-travel, or have some narrative features, or, more simply, what if the agent moves to an embedded agency model? What if hypercomputation is possible?

It's easy to have an idealised version of "all reality" that doesn't allow for these possibilities, so the ideal can be too restrictive, rather than too general. But the model splintering methods might still work, since they deal with transitions, not ideals.

Note that, **in retrospect**, we can always put this in a Bayesian framework, once we have a rich enough set of environments and update rules. But this is misleading: the key issue is the missing feature, and figuring out what to do with the missing feature is the real challenge. The fact that we could have done this in a Bayesian way *if we already knew that feature* is not relevant here.

Assume the blegg and rube classifier is an industrial robot performing a task. If humans filter out any atypical bleggs and rubes before it sees them, then the robot has no need for a full theory of bleggness/rubeness.

But what if the human filtering is not perfect? Then the classifier still doesn't need a full theory of bleggness/rubeness; it needs methods for dealing with the ambiguities it actually encounters.

Some ideas for AI control - low impact, AI-as-service, Oracles, ... - may require dealing with some model splintering, some ambiguity, but not the whole amount.

2.7 It points out when to be conservative

Some methods, like quantilizers or the pessimism approach, rely on the algorithm having a certain degree of conservatism. But, as I've argued, it's not clear to what extent these methods actually are conservative, nor is it easy to calibrate them in a useful way.

Model splintering situations provide excellent points at which to be conservative. Or, for algorithms that need human feedback, but not constantly, these are excellent points to ask for that feedback.

2.8 Difficulty in capturing splintering from the idealised perspective

Generally speaking, idealised methods can't capture model splintering at the point we would want them to. Imagine an ontological crisis, as we move from classical physics to quantum mechanics.

AIXI can go over the transition fine: it shifts from a Turing machine mimicking classical physics observations, to one mimicking quantum observations. But it doesn't notice anything special about the transition: changing the probability of various Turing machines is what it does with observations in general; there's nothing in its algorithm that shows that something unusual has occurred for this particular shift.

2.9 It may help amplification and distillation

This could be seen as a sub-point of the previous two sections, but it deserves to be flagged explicitly, since iterated amplification and distillation is one of the major potential routes to AI safety.

To quote a line from that summary post:

- The proposed AI design is to use a safe but slow way of scaling up an AI’s capabilities, distill this into a faster but slightly weaker AI, which can be scaled up safely again, and to iterate the process until we have a fast and powerful AI.

At both "scaling up an AI's capabilities" and "distill this into", we can ask the question: has the problem the AI is working on changed? The distillation step is more of a classical AI safety issue, as we wonder whether the distillation has caused any value drift. But at the scaling up or amplification step, we can ask: since the AI's capabilities have changed, the set of possible environments it operates in has changed as well. Has this caused a splintering where the previously safe goals of the AI have become dangerous?

Detecting and dealing with such a splintering could both be useful tools to add to this method.

2.10 Examples of model splintering problems/approaches

At a meta level, most problems in AI safety seem to be variants of model splintering, including:

- The hidden complexity of wishes.
- Ontological crises.
- Conservative/prudential behaviour in algorithms (more specifically, when the algorithm should become conservative).
- How categories are defined.
- The Goodhart problems.
- Out-of-distribution behaviour.
- Low impact and reduced side-effects approaches.
- Underdefined preferences.
- Active inverse reward design.
- Inductive ambiguity identification.
- Wireheading.
- The whole friendly AI problem itself.

I've been able to connect almost every recent post I've read in AI safety back to this central idea. Now, we have to be cautious - cure-alls cure nothing, after all - so it's not necessarily a positive sign that *everything* seems to fit into this framework.

Still, I think it's worth diving into this, especially as I've come up with a framework that seems promising for actually solving this issue in many cases.

In a similar concept-space is Abram's orthodox case against utility functions, where he talks about the Jeffrey-Bolker axioms, which allow the construction of preferences from events *without needing full worlds at all*.

This post is dedicated to explicitly modelling the transition to ambiguity, and then showing what we can gain from this explicit meta-modelling. It will do so with some formal language (made fully formal in this post), and a lot of examples.

Just as Scott argues that if it's worth doing, it's worth doing with made up statistics, I'd argue that if an idea is worth pursuing, it's worth pursuing with an attempted formalism.

Formalisms are great at illustrating the problems, clarifying ideas, and making us familiar with the intricacies of the overall concept. That's the reason that this post (and the accompanying technical post) will attempt to make the formalism reasonably rigorous. I've learnt a lot about this in the process of formalisation.

3.1 A model, in (almost) all generality

What do we mean by a model? Do we mean mathematical model theory? Are we talking about causal models, or causal graphs? AIXI uses a distribution over possible Turing machines, whereas Markov Decision Processes (MDPs) see states and actions updating stochastically, independently at each time-step. Unlike the previous two, Newtonian mechanics doesn't use time-steps but continuous time, while general relativity weaves time into the structure of space itself.

And what does it mean for a model to make "predictions"? AIXI and MDPs make predictions over future observations, and causal graphs are similar. We can also try running them in reverse, "predicting" past observations from current ones. Mathematical model theory talks about properties and the existence or non-existence of certain objects. Ideal gas laws make a "prediction" of certain properties (eg temperature) given certain others (eg volume, pressure, amount of substance). General relativity establishes that the structure of space-time must obey certain constraints.

It seems tricky to include all these models under the same meta-model formalism, but it would be good to do so. That's because of the risk of ontological crises: we want the AI to be able to continue functioning even if the initial model we gave it was incomplete or incorrect.

3.2 Meta-model: models, features, environments, probabilities

All of the models mentioned above share one common characteristic: once you know some facts, you can deduce some other facts (at least probabilistically). A prediction of the next time step, a retrodiction of the past, a deduction of some properties from others, or a constraint on the shape of the universe: all of these say that if we know some things, then this puts constraints on some other things.

So let's define F, informally, as the set of *features* of a model. This could be the gas pressure in a room, a set of past observations, the local curvature of space-time, the momentum of a particle, and so on.

So we can define a prediction as a probability distribution over a set of possible features F_1, given a base set of features F_2:

Q(F_1 ∣ F_2).

Do we need anything else? Yes, we need a set of possible environments for which the model is (somewhat) valid. Newtonian physics fails at extreme energies, speeds, or gravitational fields; we'd like to include this "domain of validity" in the model definition. This will be very useful for extending models, or transitioning from one model to another.

You might be tempted to define a set of "worlds" on which the model is valid. But we're trying to avoid that, as the "worlds" may not be very useful for understanding the model. Moreover, we don't have special access to the underlying reality; so we never know whether there actually is a Turing machine behind the world or not.

So define E, the environment on which the model is valid, *as a set of possible features*. So if we want to talk about Newtonian mechanics, F would be a set of Newtonian features (mass, velocity, distance, time, angular momentum, and so on) and E would be the set of these values where relativistic and quantum effects make little difference.

So we can see a model as

M = {F, E, Q},

for F a set of features, E a set of environments, and Q a probability distribution. This is such that, for E_1, E_2 ⊂ E, we have the conditional probability:

Q(E_1 ∣ E_2).

Though Q is defined for E, we generally want it to be usable from small subsets of the features: so Q should be simple to define from F. And we'll often define the subsets E_i in similar ways; so E_1 might be all environments with a certain angular momentum at time t = 0, while E_2 might be all environments with a certain angular momentum at a later time.

The full formal definition of these can be found here. The idea is to have a meta-model of modelling that is sufficiently general to apply to almost all models, but not one that relies on some ideal or perfect formalism.

3.3 Bayesian models within this meta-model

It's very easy to include Bayesian models within this formalism. If we have a Bayesian model that includes a set W of worlds with prior P, then we merely have to define a set of features F that is sufficient to distinguish all worlds in W: each world is uniquely defined by its feature values[1]. Then we can define E as W, and P on W becomes Q on E; the definition of terms like Q(E_1 ∣ E_2) is just P(E_1 ∩ E_2)/P(E_2), per Bayes' rules (unless P(E_2) = 0, in which case we set that to 0).
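As a minimal sketch of this embedding (the worlds and the prior below are invented for illustration), worlds can be feature tuples, E can be W, and the conditional probability is computed exactly as above:

```python
# Sketch: a Bayesian prior P over worlds, with each world a tuple of
# feature values, gives Q(E1|E2) = P(E1 ∩ E2) / P(E2) on environments.
# The worlds and probabilities are invented for illustration.

prior = {("red", "cube"): 0.45, ("blue", "egg"): 0.45,
         ("red", "egg"): 0.05, ("blue", "cube"): 0.05}

def P(event):
    """Probability of a set of worlds under the prior."""
    return sum(p for w, p in prior.items() if w in event)

def Q(e1, e2):
    """Q(E1|E2), set to 0 when P(E2) = 0, as in the text."""
    denom = P(e2)
    return P(e1 & e2) / denom if denom else 0.0

red = {w for w in prior if w[0] == "red"}
cube = {w for w in prior if w[1] == "cube"}
print(Q(cube, red))  # 0.45 / 0.5 = 0.9
```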

4 Model refinement and splinterings

This section will look at what we can do with the previous meta-model, looking at refinement (how models can improve) and splintering (how improvements to the model can make some well-defined concepts less well-defined).

4.1 Model refinement

Informally, M∗ = {F∗, E∗, Q∗} is a *refinement* of model M = {F, E, Q} if it's at least as expressive as M (it covers the same environments) and is better according to some criteria (simpler, or more accurate in practice, or some other measurement).

At the technical level, we have a map q from a subset E∗_0 of E∗ that is surjective onto E. This covers the "at least as expressive" part: every environment in E exists as (possibly multiple) environments in E∗.

Then note that, using q^{-1} as a map from subsets of E to subsets of E∗_0, we can define Q∗_0 on E via:

Q∗_0(E_1 ∣ E_2) = Q∗(q^{-1}(E_1) ∣ q^{-1}(E_2)).

Then this is a model refinement if Q∗_0 is 'at least as good as' Q on E, according to our criteria[2].
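A small worked sketch of this pullback (the refined environments, the projection q, and the probabilities are all invented for illustration):

```python
# Sketch: pull Q* back along q to get Q*_0 on the coarse environments,
# as in Q*_0(E1|E2) = Q*(q^-1(E1) | q^-1(E2)). All values are invented.

# q maps refined environments (colour, molar volume) to coarse ones (colour).
q = {("red", "small"): "red", ("red", "big"): "red",
     ("blue", "small"): "blue"}

dist_star = {("red", "small"): 0.2, ("red", "big"): 0.3,
             ("blue", "small"): 0.5}  # distribution on the refined model

def P_star(event):
    return sum(p for w, p in dist_star.items() if w in event)

def q_inv(coarse_event):
    """Preimage of a set of coarse environments under q."""
    return {w for w, c in q.items() if c in coarse_event}

def Q0(e1, e2):
    den = P_star(q_inv(e2))
    return P_star(q_inv(e1) & q_inv(e2)) / den if den else 0.0

print(Q0({"red"}, {"red", "blue"}))  # 0.5: total mass of "red"'s preimage
```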

4.2 Example of model refinement: gas laws

This post presents some subclasses of model refinement, including Q-improvements (same features, same environments, just a better Q), and adding new features to a basic model, called "non-independent feature extension" (eg adding classical electromagnetism to Newtonian mechanics).

Here's a specific gas law illustration. Let M = {F, E, Q} be a model of an ideal gas, in some set of rooms and tubes. The F consists of pressure, volume, temperature, and amount of substance, and Q is the ideal gas law. The E is the standard conditions for temperature and pressure, where the ideal gas law applies. There are multiple different types of gases in the world, but they all roughly obey the same laws.

Then compare with model M∗ = {F∗, E∗, Q∗}. The F∗ has all the features of F, but also includes the volume that is occupied by one mole of the molecules of the given substance. This allows Q∗ to express the more complicated van der Waals equations, which are different for different types of gases. The E∗ can now track situations where there are gases with different molar volumes, which include situations where the van der Waals equations differ significantly from the ideal gas law.

In this case E∗_0 ⊂ E∗, since we now distinguish environments that we previously considered identical (environments with the same features except for molar volumes). The q is just projecting down by forgetting the molar volume. Then since Q∗_0 = Q∗ (van der Waals equations averaged over the distribution of molar volumes) is at least as accurate as Q (the ideal gas law), this is a refinement.
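Numerically (using rough van der Waals constants for CO2; the figures are approximate), the two models nearly agree inside the ideal gas law's domain of validity and come apart outside it:

```python
# Ideal gas law versus van der Waals: the refined model needs the extra
# per-gas parameters a and b, which the coarse feature set cannot express.
R = 8.314  # gas constant, J/(mol·K)

def ideal_pressure(n, T, V):
    return n * R * T / V

def vdw_pressure(n, T, V, a, b):
    return n * R * T / (V - n * b) - a * (n / V) ** 2

a, b = 0.364, 4.27e-5          # approximate CO2 constants (SI units)
n, T = 1.0, 273.15             # one mole at 0 °C

# Near standard conditions (V ≈ 22.4 L) the two models agree to within ~1%:
print(ideal_pressure(n, T, 0.0224), vdw_pressure(n, T, 0.0224, a, b))

# Strongly compressed (V = 1 L), they diverge by over 10%:
print(ideal_pressure(n, T, 0.001), vdw_pressure(n, T, 0.001, a, b))
```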

4.3 Example of model refinement: rubes and bleggs

Let's reuse Eliezer's example of rubes ("red cubes") and bleggs ("blue eggs").

Bleggs are blue eggs that glow in the dark, have a furred surface, and are filled with vanadium. Rubes, in contrast, are red cubes that don't glow in the dark, have a smooth surface, and are filled with palladium.

Define M by having F = {red, smooth}, E the set of all bleggs and rubes in some situation, and Q relatively trivial: it predicts that an object is red/blue if and only if it is smooth/furred.

Define M_1 as a refinement of M, by expanding F to F_1 = {red, smooth, cube, dark}. The projection q: E_1 → E is given by forgetting about those two last features. The Q_1 is more detailed, as it now connects red-smooth-cube-dark together, and similarly for blue-furred-egg-glows.

Note that E_1 is larger than E, because it includes, e.g., environments where the cube objects are blue. However, all these extra environments have probability zero.

4.4 Reward function refactoring

Let R be a reward function on M (by which we mean that R is defined on F, the set of features in M), and M∗ a refinement of M.

A *refactoring* of R for M∗ is a reward function R∗ on the features F∗ such that for any e∗ ∈ E∗_0, R∗(e∗) = R(q(e∗)).

For example, let M and M_1 be the rube/blegg models from the previous section. Let R_red on M simply count the number of rubes - or, more precisely, count the number of objects to which the feature "red" applies.

Let R^1_red be the reward function that counts the number of objects in M_1 to which "red" applies. It's clearly a refactoring of R_red.

But so is R^1_smooth, the reward function that counts the number of objects in M_1 to which "smooth" applies. In fact, the following is a refactoring of R_red, for all α + β + γ + δ = 1:

α R^1_red + β R^1_smooth + γ R^1_cube + δ R^1_dark.

There are also some non-linear combinations of these features that refactor R_red, and many other variants (like the strange combinations that generate concepts like grue and bleen).
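The linear refactoring claim is easy to verify directly (the objects and the weight choices below are invented for illustration): on a training environment where the four features co-occur perfectly, any weighted mixture summing to 1 equals R_red.

```python
# On a training environment where red/smooth/cube/dark always co-occur,
# any mixture with weights summing to 1 refactors R_red exactly.
# The objects are invented for illustration.

training_env = [
    {"red": 1, "smooth": 1, "cube": 1, "dark": 1},   # a rube
    {"red": 0, "smooth": 0, "cube": 0, "dark": 0},   # a blegg
    {"red": 1, "smooth": 1, "cube": 1, "dark": 1},   # another rube
]

def R_red(env):
    return sum(o["red"] for o in env)

def R_mix(env, alpha, beta, gamma, delta):
    return sum(alpha * o["red"] + beta * o["smooth"]
               + gamma * o["cube"] + delta * o["dark"] for o in env)

print(R_red(training_env))                           # 2
print(R_mix(training_env, 0.25, 0.25, 0.25, 0.25))   # 2.0
print(R_mix(training_env, 1.0, -1.0, 0.5, 0.5))      # 2.0: weights sum to 1
```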

4.5 Reward function splintering

Model splintering, in the informal sense, is what happens when we pass to a new model in a way that the old features (or a reward function defined by the old features) no longer apply. It is similar to the web of connotations breaking down, an agent going out of distribution, or the definitions of rube and blegg falling apart.

- Preliminary definition: If M∗ is a refinement of M and R a reward function on M, then M∗ *splinters* R if there are multiple refactorings of R on M∗ that disagree on elements of E∗ of non-zero probability.

So, note that in the rube/blegg example, M_1 is **not** a splintering of R_red: all the refactorings are the same on all bleggs and rubes - hence on all elements of E_1 of non-zero probability.

We can even generalise this a bit. Let's assume that "red" and "blue" are not totally uniform; there exist some rubes that are "redish-purple", while some bleggs are "blueish-purple". Then let M_2 be like M_1, except that the colour feature can have four values: "red", "redish-purple", "blueish-purple", and "blue".

Then, as long as the rubes (defined, in this instance, by being smooth-dark-cubes) are either "red" or "redish-purple", and the bleggs are "blue" or "blueish-purple", then all refactorings of R_red to M_2 agree - because, on the test environment, R_red on F perfectly matches up with R^2_red + R^2_redish-purple on F_2.

So adding more features does not always cause splintering.

4.6 Reward function splintering: "natural" refactorings

The preliminary definition runs into trouble when we add more objects to the environments. Define M_3 as being the same as M_2, except that E_3 contains one extra object, o+; apart from that, the environments typically have a billion rubes and a trillion bleggs.

Suppose o+ is a "furred rube", i.e. a red-furred-dark-cube. Then R^3_red and R^3_smooth are two different refactorings of R_red that obviously disagree on any environment that contains o+. Even if the probability of o+ is tiny (but non-zero), M_3 splinters R_red.

But things are worse than that. Suppose that o+ is fully a rube: red-smooth-cube-dark, and even contains palladium. Define (R^3_red)′ as counting the number of red objects, except for o+ specifically (again, this is similar to the grue and bleen arguments against induction).

Then both (R^3_red)′ and R^3_red are refactorings of R_red, so M_3 still splinters R_red, even when we add another exact copy of the elements in the training set. Or even if we keep the training set for a few extra seconds, or add any change to the world.
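The splintering above takes only a few lines to exhibit (the objects are invented for illustration): two refactorings of R_red agree on every training object, then disagree as soon as the furred rube o+ appears.

```python
# Two refactorings of R_red that agree in training but disagree on the
# "furred rube" o_plus - so the refined model splinters R_red.
# The objects are invented for illustration.

training = [{"red": 1, "smooth": 1},    # a rube
            {"red": 0, "smooth": 0}]    # a blegg
o_plus = {"red": 1, "smooth": 0}        # red but furred

def R_red(env):
    return sum(o["red"] for o in env)

def R_smooth(env):
    return sum(o["smooth"] for o in env)

print(R_red(training) == R_smooth(training))   # True: identical in training
print(R_red(training + [o_plus]))              # 2
print(R_smooth(training + [o_plus]))           # 1: the refactorings disagree
```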

So, for any M∗ a refinement of M, and R a reward function on E, let's define "natural refactorings" of R:

- The reward function R∗ is a natural refactoring of R if it's a reward function on M∗ with:
  - R∗ ≈ R∘q on E∗_0,
  - R∗ can be defined simply from F∗ and R, and
  - the F∗ themselves are simply defined.

This leads to a full definition of splintering:

- Full definition: If M∗ is a refinement of M and R a reward function on M, then M∗ *splinters* R if 1) there are no natural refactorings of R on M∗, or 2) there are multiple natural refactorings R∗ and R∗′ of R on M∗ such that R∗ ≉ R∗′.

Notice the whole host of caveats and weaselly terms here: R∗ ≈ R∘q, "simply" (used twice), and R∗ ≉ R∗′. "Simply" might mean algorithmic simplicity, but ≈ and ≉ are measures of how much "error" we are willing to accept in these refactorings. Given that, we probably want to replace ≈ and ≉ with some *measure* of non-equality, so we can talk about the "degree of naturalness" or the "degree of splintering" of some refinement and reward function.

Note also that:

**Different choices of refinements can result in different natural refactorings.**

An easy example: it makes a big difference whether a new feature is "temperature", or "divergence from standard temperatures".

4.7 Splintering training rewards

The concept of "reward refactoring" is transitive, but the concept of "natural reward refactoring" need not be.

For example, let E_t be a training environment where red/blue ⟺ cube/egg, and E_g be a general environment where red/blue is independent of cube/egg. Let F_1 be a feature set with only red/blue, and F_2 a feature set with red/blue and cube/egg.

Then define M^1_t as using F_1 in the training environment, and M^2_g as using F_2 in the general environment; M^1_g and M^2_t are defined similarly.

For these models, M^1_g and M^2_t are both refinements of M^1_t, while M^2_g is a refinement of all three other models. Define R^1_t as the "count red objects" reward on M^1_t. This has a natural refactoring to R^1_g on M^1_g, which counts red objects in the general environment.

And R^1_g has a natural refactoring to R^2_g on M^2_g, which still just counts the red objects in the general environment.

But there is no natural refactoring from R^1_t directly to M^2_g. That's because, from F_2's perspective, R^1_t on M^1_t might be counting red objects, or might be counting cubes. This is not true for R^1_g on M^1_g, which is clearly only counting red objects.

Thus when a reward function comes from a training environment, we'd want our AI to look for splinterings **directly from a model of the training environment**, rather than from previous natural refactorings.

4.8 Splintering features and models

We can also talk about splintering features and models themselves. For M = {F, E, Q}, the easiest way is to define a reward function R_{F,S_F} as the indicator function for the feature F ∈ F being in the set S_F.

Then a refinement M∗ splinters the feature F if it splinters some R_{F,S_F}.

The refinement M∗ splinters the model M if it splinters at least one of its features.

For example, if M is Newtonian mechanics, including "total rest mass" and M∗ is special relativity, then M∗ will splinter "total rest mass". Other examples of feature splintering will be presented in the rest of this post.

4.9 Preserved background features

A reward function developed in some training environment will ignore any feature that is always present or always absent in that environment. This allows very weird situations to come up, such as training an AI to distinguish happy humans from sad humans, and it ending up replacing humans with humanoid robots (after all, both happy and sad humans were equally non-robotic, so there's no reason not to do this).

Let's try and do better than that. Assume we have a model M = {F, E, Q}, with a reward function R_τ defined on E (R_τ and E can be seen as the training data).

Then the feature-preserving reward function R^M is a function that constrains the environments to have similar feature distributions as E and Q. There are many ways this could be defined; here's one.

For an element e ∈ E, just define

R^M(e) = log(Q(e)).

Obviously, this can be improved; we might want to coarse-grain F, grouping together similar worlds, and possibly bounding this below to avoid singularities.
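As a toy sketch of this term (the environment probabilities are invented for illustration), an environment that is wildly atypical of the training distribution picks up a large penalty even if the original reward would score it highly:

```python
# The background-preservation term R_M(e) = log Q(e): environments that
# are improbable under the training model get large negative values.
# The probabilities below are invented for illustration.
import math

Q_model = {"typical_human": 0.9, "odd_human": 0.099, "robot_face": 0.001}

def R_M(e):
    """Log-probability of the environment under the training model."""
    return math.log(Q_model[e])

# A happiness reward might score a smiling robot face highly, but R_M
# flags it as a huge background shift away from "being a human at all":
print(R_M("typical_human"))   # small penalty
print(R_M("robot_face"))      # much larger penalty
```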

Then we can use this to get the feature-preserving version of R_τ, which we can define as

R^M_τ = (max R_τ − R_τ) ⋅ R^M,

for max R_τ the maximal value of R_τ on E. Other options can work as well, such as R_τ + α R^M for some constant α > 0.

Then we can ask an AI to use R^M_τ as its reward function, refactoring that rather than R_τ.

- A way of looking at it: a natural refactoring of a reward function R_τ will preserve all the implicit features that correlate with R_τ. But R^M_τ will also preserve all the implicit features that stayed constant when R_τ was defined. So if R_τ measures human happiness vs human unhappiness, a natural refactoring of it will preserve things like "having higher dopamine in their brain". But a natural refactoring of R^M_τ will also preserve things like "having a brain".

The R^M_τ is almost certainly too restrictive to be of use. For example, if time is a feature, then this will fall apart when the AI has to do something after the training period. If all the humans in a training set share certain features, humans without those features will be penalised.

There are at least two things we can do to improve this. The first is to include more positive and negative examples in the training set; for example, if we include humans and robots in our training set - as positive and negative examples, respectively - then this difference will show up in Rτ directly, so we won't need to use RMτ too much.

Another approach would be to explicitly allow certain features to range beyond their typical values in M, or allow highly correlated variables explicitly to decorrelate.

For example, if training occurs during a time period t to t′, we could explicitly allow time to range beyond these values, without penalty. Similarly, if a medical AI was trained on examples of typical healthy humans, we could decorrelate functioning digestion from brain activity, and get the AI to focus on the second[3].

This has to be done with some care, as adding more degrees of freedom adds more ways for errors to happen. I'm aiming to look further at this issue in later posts.

**5 The fundamental questions of model refinements and splintering**

We can now rephrase the out-of-distribution issues of section 1.1 in terms of the new formalism:

- When the AI refines its model, what would count as a natural refactoring of its reward function?
- If the refinements splinter its reward function, what should the AI do?
- If the refinements splinter its reward function, and also splinter the human's reward function, what should the AI do?

The rest of this post applies this basic framework, and its basic insights, to various common AI safety problems and analyses. This section is not particularly structured, and will range widely (and wildly) across a variety of issues.

**6.1 Extending beyond the training distribution**

Let's go back to the blegg and rube examples. A human supervises an AI in a training environment, labelling all the rubes and bleggs for it.

The human is using a very simple model, MH={FH,Et,Q}, with the only feature being the colour of the object, and Et being the training environment.

Meanwhile the AI, having more observational abilities and no filter as to what can be ignored, notices their colour, their shape, their luminance, and their texture. It doesn't know MH, but is using model M1AI={F1,E1t,Q1}, where F1 covers those four features (note that M1AI is a refinement of MH, but that isn't relevant here).

Suppose that the AI is trained to be a rube-classifier (and hence a blegg-classifier by default). Let RF be the reward function that counts the number of objects, with feature F, that the AI has classified as rubes. Then the AI could learn many different reward functions in the training environment; here's one:

R1=R1cube+0.5R1smooth+0.5R1dark−R1red.

Note that, even though this gets the colour reward completely wrong, this reward matches up with the human's assessment on the training environment.

Now the AI moves to the larger testing environment E2, and refines its model minimally to M2AI={F1,E2,Q1} (extending R1 to R2 in the obvious way).

In E2, the AI sometimes encounters objects that it can only see through their colour. Will this be a problem, since the colour component of R2 is pointing in the wrong direction?

No. It still has Q1, and can deduce that a red object must be cube-smooth-dark, so R2 will continue treating this as a rube[4].
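A sketch of that deduction (the joint distribution and its numbers are hypothetical): if only colour is observed, the AI can take the expectation of R1 over Q1 conditioned on that colour, and the miscalibrated colour weight washes out:

```python
# Hypothetical joint distribution Q1 learned in training: colour perfectly
# predicts the other three features.
# Feature order: (shape, texture, luminance, colour)
Q1 = {
    ("cube", "smooth", "dark", "red"): 0.5,     # rubes
    ("egg", "furred", "glowing", "blue"): 0.5,  # bleggs
}

def R1(obj):
    # The reward R1 = R_cube + 0.5*R_smooth + 0.5*R_dark - R_red from above.
    shape, texture, luminance, colour = obj
    return ((1.0 if shape == "cube" else 0.0)
            + 0.5 * (1.0 if texture == "smooth" else 0.0)
            + 0.5 * (1.0 if luminance == "dark" else 0.0)
            - (1.0 if colour == "red" else 0.0))

def expected_R1_given_colour(colour):
    # Condition Q1 on the observed colour, then take the expected reward.
    matching = {obj: p for obj, p in Q1.items() if obj[3] == colour}
    total = sum(matching.values())
    return sum(p * R1(obj) for obj, p in matching.items()) / total
```

A red object scores 1 + 0.5 + 0.5 - 1 = 1 in expectation (it is deduced to be cube-smooth-dark), versus 0 for a blue one, so the reward still treats colour-only red objects as rubes despite the wrong sign on colour.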

**6.2 Detecting going out-of-distribution**

Now imagine the AI learns about the content of the rubes and bleggs, and so refines to a new model M3AI that includes vanadium/palladium as a feature.

Furthermore, in the training environment, all rubes have palladium and all bleggs have vanadium in them. So, for M3AI a refinement of M1AI, q−1(E1AI)⊂E3AI has only palladium-rubes and vanadium-bleggs. But in E3AI, the full environment, there are rather a lot of rubes with vanadium and bleggs with palladium.

So, similarly to section 4.7, there is no natural refactoring of the rube/blegg reward in M1AI to M3AI. That's because F3AI, the feature set of M3AI, includes vanadium/palladium, which co-vary with the other rube/blegg features on the training environment q−1(E1AI), but not on the full environment E3AI.

So looking for reward splintering from the training environment is a way of detecting going out-of-distribution - even on features that were not initially detected in the training distribution, by either the human or the AI.
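One crude way to implement this detection (the data here is hypothetical): record which feature values always co-occurred in training, and flag a potential splintering when deployment breaks an association:

```python
def learned_association(samples, i, j):
    """Map each value of feature i to the set of feature-j values seen with it."""
    assoc = {}
    for s in samples:
        assoc.setdefault(s[i], set()).add(s[j])
    return assoc

def splinters(train, deploy, i, j):
    """True if some feature-i value maps to new feature-j values in deployment."""
    train_assoc = learned_association(train, i, j)
    deploy_assoc = learned_association(deploy, i, j)
    return any(deploy_assoc.get(value, set()) - seen
               for value, seen in train_assoc.items())

# Training: colour (index 0) perfectly determines metal content (index 1).
train = [("red", "palladium"), ("blue", "vanadium")] * 10
in_distribution = [("red", "palladium"), ("blue", "vanadium")]
out_of_distribution = [("red", "vanadium"), ("blue", "palladium")]
```

Here `splinters(train, out_of_distribution, 0, 1)` is true: vanadium-rubes break the training-time correlation, even though the metal content was never part of the reward signal.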

**6.3 Asking humans and Active IRL**

Some of the most promising AI safety methods today rely on getting human feedback[5]. Since human feedback is expensive (slow and hard to get, compared with almost all other aspects of algorithms), people want to get this feedback in the most efficient ways possible.

A good way of doing this would be to ask for feedback when the AI's current reward function splinters, and multiple options are possible.

A more rigorous analysis would look at the value of information, expected future splinterings, and so on. This is what they do in Active Inverse Reinforcement Learning; the main difference is that AIRL emphasises an unknown reward function with humans providing information, while this approach sees it more as a known reward function over uncertain features (or over features that may splinter in general environments).

**6.4 A time for conservatism**

I argued that many "conservative" AI optimising approaches, such as quantilizers and pessimistic AIs, don't have a good measure of when to become more conservative; their parameters q and β don't encode useful guidelines for the right degree of conservatism.

In this framework, the alternative is obvious: AIs should become conservative when their reward functions splinter (meaning that the reward function compatible with the previous environment has multiple natural refactorings), and very conservative when they splinter a lot.

This design is very similar to Inverse Reward Design. In that situation, the reward signal in the training environment is taken as *information* about the "true" reward function. Basically they take all reward functions that could have given the specific reward signals, and assume the "true" reward function is one of them. In that paper, they advocate extreme conservatism at that point, by optimising the minimum of all possible reward functions.

The idea here is almost the same, though with more emphasis on "having a true reward defined on uncertain features". Having multiple contradictory reward functions compatible with the information, in the general environment, is equivalent to having a lot of splintering of the training reward function.
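A minimal sketch of that conservatism (the reward functions and objects below are hypothetical): keep every reward function compatible with the training signal, measure splintering by how much they disagree, and score options pessimistically, IRD-style:

```python
def compatible(candidates, observed):
    # Keep reward functions that reproduce the observed training signal.
    return [R for R in candidates
            if all(abs(R(e) - r) < 1e-9 for e, r in observed.items())]

def conservative_score(option, rewards):
    # Pessimistic evaluation: worst case over compatible reward functions.
    return min(R(option) for R in rewards)

def splintering(option, rewards):
    # Disagreement among compatible rewards: a cue to get more conservative.
    values = [R(option) for R in rewards]
    return max(values) - min(values)

# Two hypothetical reward functions that agree on the training data:
R_colour = lambda obj: 1.0 if "red" in obj else 0.0
R_metal = lambda obj: 1.0 if "palladium" in obj else 0.0
observed = {"red-palladium-cube": 1.0, "blue-vanadium-egg": 0.0}
rewards = compatible([R_colour, R_metal], observed)
```

On a novel red-vanadium object the compatible rewards disagree (splintering of 1.0), so the conservative score drops to 0 and the AI should tread carefully.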

**6.5 Avoiding ambiguous distant situations**

The post "By default, avoid ambiguous distant situations" can be rephrased as: let M be a model in which we have a clear reward function R, and let M2 be a refinement of this to general situations. We expect that this refinement splinters R. Let M1 be like M2, except with E1 smaller than E2, defined such that:

- An AI could be expected to be able to constrain the world to be in E1, with high probability,
- The refinement to M1 does not splinter R.

Then that post can be summarised as:

- The AI should constrain the world to be in E1 and then maximise the natural refactoring of R in M1.

Stuart Russell writes:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

The approach in sections 4.9 and 4.10 explicitly deals with this.

**6.7 Hidden (dis)agreement and interpretability**

Now consider two agents doing a rube/blegg classification task in the training environment; each agent only models two of the features:

Despite not having a single feature in common, both agents will agree on what bleggs and rubes are, in the training environment. And when refining to a fuller model that includes all four (or five) of the key features, both agents will agree as to whether a natural refactoring is possible or not.

This can be used to help define the limits of interpretability. The AI can use its own model, and its own designed features, to define the categories and rewards in the training environment. These need not be human-parsable, but we can attempt to interpret them in human terms. And then we can give this interpretation to the AI, as a list of positive and negative examples of our interpretation.

If we do this well, the AI's own features and our interpretation will match up in the training environment. But as we move to more general environments, these may diverge. Then the AI will flag a "failure of interpretation" when its refactoring diverges from a refactoring of our interpretation.

For example, if we think the AI detects pandas by looking for white hair on the body, and black hair on the arms, we can flag lots of examples of pandas and that hair pattern (and non-pandas and unusual hair patterns). We don't use these examples for training the AI, just to confirm that, in the training environment, there is a match between "AI-thinks-they-are-pandas" and "white-hair-on-body-black-hair-on-arms".

But, in an adversarial example, the AI could detect that, while it is detecting gibbons, this no longer matches up with our interpretation. A splintering of interpretations, if you want.

**6.8 Wireheading**

The approach can also be used to detect wireheading. Imagine that the AI has various detectors that allow it to label what the features of the bleggs and rubes are. It models the world with ten features: five representing the "real world" versions of the features, and five representing the "this signal comes from my detector" versions.

This gives a total of 10 features, the 5 features "in the real world" and the 5 "AI-labelled" versions of these:

In the training environment, there was full overlap between these 10 features, so the AI might learn the incorrect "maximise my labels/detector signal" reward.

However, when it refines its model to all 10 features *and* environments where labels and underlying reality diverge, it will realise that this splinters the reward, and thus detect a possible wireheading. It could then ask for more information, or have an automated "don't wirehead" approach.
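A sketch of that check (the paired-sample representation is my own assumption): compare the "real world" features with their detector-labelled counterparts, which always coincided in training, and warn when they start to diverge:

```python
def divergence_rate(samples):
    """Each sample pairs real-world features with detector-labelled versions.
    In training these always coincided; divergence later is a wireheading cue."""
    diverging = sum(1 for real, labelled in samples if real != labelled)
    return diverging / len(samples)

def wireheading_warning(samples, tolerance=0.0):
    # Flag any divergence beyond what we are willing to tolerate.
    return divergence_rate(samples) > tolerance

training = [(("red", "cube"), ("red", "cube"))] * 100
# Deployment where the AI has tampered with one detector (hypothetical):
deployment = ([(("blue", "egg"), ("red", "cube"))] * 5
              + [(("red", "cube"), ("red", "cube"))] * 95)
```

`wireheading_warning(deployment)` fires here because 5% of detector labels no longer match underlying reality, while the training data passes cleanly.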

**6.9 Learning in virtual environments**

To get around the slowness of the real world, some approaches train AIs in virtual environments. The problem is to pass that learning from the virtual environment to the real one.

Some have suggested making the virtual environment sufficiently detailed that the AI can't tell the difference between it and the real world. But, a) this involves fooling the AI, an approach I'm always wary of, and b) it's unnecessary.

Within the meta-formalism of this post, we could train the AI in a virtual environment which it models by M, and let it construct a model M′ of the real-world. We would then motivate the AI to find the "closest match" between M and M′, in terms of features and how they connect and vary. This is similar to how we can train pilots in flight simulators; the pilots are never under any illusion as to whether this is the real world or not, and even crude simulators can allow them to build certain skills[6].

This can also be used to allow the AI to deduce information from hypotheticals and thought experiments. If we show the AI an episode of a TV series showing people behaving morally (or immorally), then the episode need not be believable or plausible, if we can roughly point to the features in the episode that we want to emphasise, and roughly how these relate to real-world features.

**6.10 Defining how to deal with multiple plausible refactorings**

The approach for synthesising human preferences, defined here, can be rephrased as:

- "Given that we expect multiple natural refactorings of human preferences, and given that we expect some of them to go disastrously wrong, here is one way of resolving the splintering that we expect to be better than most."

This is just one way of doing this, but it does show that "automating what AIs do with multiple refactorings" might not be impossible. The following subsection has some ideas for how to deal with that.

**6.11 Global, large scale preferences**

In an old post, I talked about the concept of "emergency learning", which was basically, "lots of examples, and all the stuff we know and suspect about how AIs can go wrong, shove it all in, and hope for the best". The "shove it all in" was a bit more structured than that, defining large scale preferences (like "avoid siren worlds" and "don't over-optimise") as constraints to be added to the learning process.

It seems we can do better than that here. Using examples and hypotheticals, we could construct ideas like "avoid slavery", "avoid siren worlds", or "don't over-optimise" as rewards or positive/negative examples in certain simple training environments, so that the AI "gets an idea of what we want".

We can then label these ideas as "global preferences". The idea is that they start as loose requirements (we have much more granular human-scale preferences than just "avoid slavery", for example), but, the more the world diverges from the training environment, the stricter they are to be interpreted, with the AI required to respect some softmin of all natural refactorings of these features.

In a sense, we'd be saying "prevent slavery; these are the features of slavery, and in weird worlds, be especially wary of these features".
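One possible reading of that softmin (the functional form and the strictness schedule are my assumptions, not the post's): weight each refactoring's score so that, as divergence from the training environment grows, the combination approaches a hard minimum:

```python
import math

def softmin(values, beta):
    # beta -> 0 recovers the mean; beta -> infinity recovers the hard minimum.
    weights = [math.exp(-beta * v) for v in values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def global_preference_score(refactoring_scores, divergence):
    # Hypothetical schedule: strictness grows with divergence from training.
    return softmin(refactoring_scores, beta=1.0 + 10.0 * divergence)
```

With scores [0.0, 1.0] from two refactorings of, say, "avoid slavery", a world close to training (divergence 0) blends them, while a weird world (divergence 1) scores close to the worst case, matching the "stricter in weird worlds" intent.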

**6.12 Avoiding side-effects**

Krakovna et al. presented a paper on avoiding side-effects from AI. The idea is to have an AI maximising some reward function, while reducing side effects. So the AI would not smash vases or let them break, nor would it prevent humans from eating sushi.

In this environment, we want the AI to avoid knocking the sushi off the belt as it moves:

Here, in contrast, we'd want the AI to remove the vase from the belt before it smashes:

I pointed out some issues with the whole approach. Those issues were phrased in terms of sub-agents, but my real intuition is that syntactic methods are not sufficient to control side effects. In other words, the AI can't learn to do the right thing with sushi and vases, unless it has some idea of what these objects mean to us: we prefer sushi to be eaten and vases not to be smashed.

This can be learnt if the AI has enough training examples, learning that eating sushi is a general feature of the environments it operates in, while vases being smashed is not. I'll return to this idea in a later post.

**6.13 Cancer patients**

The ideas of this post were present in implicit form in the idea of training an AI to cure cancer patients.

Using examples of successfully treated cancer patients, we noted they all shared some positive features (recuperating, living longer) and some incidental or negative features (complaining about pain, paying more taxes).

So, using the approach of section 4.9, we can designate that we want the AI to cure cancer; this will be interpreted as increasing all the features that correlate with that.

Using the explicit decorrelation of section 4.10, we can also explicitly remove the negative options from the desired feature sets, thus improving the outcomes even more.

**6.14 The genie and the burning mother**

In Eliezer's original post on the hidden complexity of wishes, he talks of the challenge of getting a genie to save your mother from a burning building:

So you hold up a photo of your mother's head and shoulders; match on the photo; use object contiguity to select your mother's whole body (not just her head and shoulders); and define the future function using your mother's distance from the building's center. [...]

You cry "Get my mother out of the building!", for luck, and press Enter. [...]

BOOM! With a thundering roar, the gas main under the building explodes. As the structure comes apart, in what seems like slow motion, you glimpse your mother's shattered body being hurled high into the air, traveling fast, rapidly increasing its distance from the former center of the building.

How could we avoid this? What you want is your mother out of the building. The feature "mother in building" must absolutely be set to false; this is a priority call, overriding almost everything else.

Here we'd want to load examples of your mother outside the building, so that the genie/AI learns the features "mother in house"/"mother out of house". Then it will note that "mother out of house" correlates with a whole lot of other features - like mother being alive, breathing, pain-free, often awake, and so on.

All those are good things. But there are some other features that don't correlate so well - such as the time being earlier, your mother not remembering a fire, not being covered in soot, not worried about her burning house, and so on.

As in the cancer patient example above, we'd want to preserve the features that correlate with the mother out of the house, while allowing decorrelation with the features we don't care about or don't want to preserve.

**6.15 Splintering morally relevant categories: honour, gender, and happiness**

If the Antikythera mechanism had been combined with the Aeolipile to produce an ancient Greek AI, and Homer had programmed it (among other things) to "increase people's honour", how badly would things have gone?

If Babbage had completed the analytical engine as Victorian AI, and programmed it (among other things) to "protect women", how badly would things have gone?

If a modern programmer were to combine our neural nets into a superintelligence and program it (among other things) to "increase human happiness", how badly will things go?

There are three morally relevant categories here, and it's illustrative to compare them: honour, gender, and hedonic happiness. The first has splintered, the second is splintering, and the third will likely splinter in the future.

I'm not providing solutions in this subsection, just looking at where the problems can appear, and encouraging people to think about how they would have advised Homer or Babbage to define their concepts. Don't think "stop using your concepts, use ours instead", because our concepts/features will splinter too. Think "what's the best way they could have extended their preferences even as the features splinter?"

**6.15.1 Honour**

If we look at the concept of honour, we see a concept that has already splintered.

That article reads like a meandering mess. Honour is "face", "reputation", a "bond between an individual and a society", "reciprocity", a "code of conduct", "chastity" (or "virginity"), a "right to precedence", "nobility of soul, magnanimity, and a scorn of meanness", "virtuous conduct and personal integrity", "vengeance", "credibility", and so on.

What a basket of concepts! They only seem vaguely connected together; and even places with strong honour cultures differ in how they conceive of honour, from place to place and from epoch to epoch[7]. And yet, if you asked most people within those cultures about what honour was, they would have had a strong feeling it was a single, well defined thing, maybe even a concrete object.

**6.15.2 Gender**

In his post "The Categories Were Made for Man, Not Man for the Categories", Scott writes:

Absolutely typical men have Y chromosomes, have male genitalia, appreciate manly things like sports and lumberjackery, are romantically attracted to women, personally identify as male, wear male clothing like blue jeans, sing baritone in the opera, et cetera.

But Scott is writing this in the 21st century, long after the gender definition has splintered quite a bit. In middle-class Victorian England[8], the gender divide was much stronger - in that, from one component of the divide, you could predict a lot more. For example, if you knew someone wore dresses in public, you knew that, almost certainly, they couldn't own property if they were married, nor could they vote, they would be expected to be in charge of the household, might be allowed to faint, and were expected to guard their virginity.

We talk nowadays about gender roles multiplying or being harder to define, but they've actually been splintering for a lot longer than that. Even though we could *define* two genders in 1960s Britain, at least roughly, that definition was a lot less informative than it was in Victorian-middle-class-Britain times: it had many fewer features strongly correlated with it.

**6.15.3 Happiness**

On to happiness! Philosophers and others have been talking about happiness for centuries, often contrasting "true happiness", or flourishing, with hedonism, or drugged out stupor, or things of that nature. Often "true happiness" is a life of duty to what the philosopher wants to happen, but at least there is some analysis, some breakdown of the "happiness" feature into smaller component parts.

Why did the philosophers do this? I'd wager that it's because the concept of happiness was already somewhat splintered (as compared with a model where "happiness" is a single thing). Those philosophers had experience of joy, pleasure, the satisfaction of a job well done, connection with others, as well as superficial highs from temporary feelings. When they sat down to systematise "happiness", they could draw on the features of their own mental model. So even if people hadn't systematised happiness themselves, when they heard of what philosophers were doing, they probably didn't react as "What? Drunken hedonism and intellectual joy are not the same thing? How dare you say such a thing!"

But looking into the future, into a world that an AI might create, we can foresee many situations where the implicit assumptions of happiness come apart, and only some remain. I say "we can foresee", but it's actually very hard to know exactly how that's going to happen; if we knew it exactly, we could solve the issues now.

So, imagine a happy person. What do you think they have in life that isn't a trivial synonym of happiness? I'd imagine they have friends, are healthy, think interesting thoughts, have some freedom of action, may work on worthwhile tasks, may be connected with their community, probably make people around them happy as well. Getting a bit less anthropomorphic, I'd also expect them to be a carbon-based life-form, to have a reasonable mix of hormones in their brain, to have a continuity of experience, to have a sense of identity, to have a personality, and so on.

Now, some of those features can clearly be separated from "happiness". Even ahead of time, I can confidently say that "being a carbon-based life-form" is not going to be a critical feature of "happiness". But many of the other ones are not so clear; for example, would someone without continuity of experience or a sense of identity be "happy"?

Of course, I can't answer that question. Because the question has no answer. We have our current model of happiness, which co-varies with all those features I listed and many others I haven't yet thought of. As we move into more and more bizarre worlds, that model will splinter. And whether we assign the different features to "happiness" or to some other concept, is a choice we'll make, not a well-defined solution to a well-defined problem.

However, even at this stage, some answers are clearly better than others; statues of happy people should not count, for example, nor should written stories describing very happy people.

**6.16 Apprenticeship learning**

In apprenticeship learning (or learning from demonstration), the AI would aim to copy what experts have done. Inverse reinforcement learning can be used for this purpose, by guessing the expert's reward function, based on their demonstrations. It looks for key features in expert trajectories and attempts to reproduce them.

So, if we had an automatic car driving people to the airport, and fed it some trajectories (maybe ranked by speed of delivery), it would notice that passengers would also arrive alive, with their bags, without being pursued by the police, and so on. This is akin to section 4.9; the car would not accelerate blindly to get there as fast as possible.

But the algorithm has trouble getting to truly super-human performance[9]. It's far too conservative, and, if we loosen the conservatism, it doesn't know what's acceptable and what isn't, or how to trade these off: since, in the training data, all passengers survived, the car was always painted yellow, and their luggage was intact, it has no reason to prefer human survival to taxi-colour. It doesn't even have a reason to have a specific feature resembling "passenger survived" at all.

This might be improved by the "allow decorrelation" approach from section 4.10: we specifically allow it to maximise speed of transport, while keeping the other features (no accidents, no speeding tickets) intact. As in section 6.7, we'll attempt to check that the AI does prioritise human survival, and that it will warn us if a refactoring moves it away from this.

Now, sometimes worlds w1,w2∈W may be indistinguishable for any feature set. But in that case, they can't be distinguished by any observations, either, so their relative probabilities won't change: as long as it's defined, P(w1|o)/P(w2|o) is constant for all observations o. So we can replace w1 and w2 with {w1,w2}, of prior probability P({w1,w2})=P(w1)+P(w2). Doing this for all indistinguishable worlds (which form an equivalence class) gives W′, a set of distinguishable worlds, with a well defined P on it. ↩︎
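The merging construction in this footnote can be sketched directly (world names, priors, and the observation-signature below are hypothetical):

```python
def merge_indistinguishable(worlds, prior, signature):
    """Group worlds sharing an observation-signature (an equivalence class);
    sum their priors to get a well-defined P on the merged set W'."""
    classes = {}
    for w in worlds:
        classes.setdefault(signature(w), []).append(w)
    return {frozenset(ws): sum(prior[w] for w in ws)
            for ws in classes.values()}

# w1 and w2 produce identical observations, so no evidence can separate them:
prior = {"w1": 0.2, "w2": 0.3, "w3": 0.5}
signature = {"w1": "obs-A", "w2": "obs-A", "w3": "obs-B"}.get
merged = merge_indistinguishable(prior, prior, signature)
```

The merged world {w1, w2} gets prior 0.2 + 0.3 = 0.5, and the total probability over W' still sums to one.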

It's useful to contrast a refinement with the "abstraction" defined in this sequence. An abstraction throws away irrelevant information, so is not generally a refinement. Sometimes they are exact opposites, as the ideal gas law is an abstraction of the movement of all the gas particles, while the opposite would be a refinement.

But they are not exact opposites either. Starting with the neurons of the brain, you might abstract them to "emotional states of mind", while a refinement could also add "emotional states of mind" as new features (while also keeping the old features). A splintering is more the opposite of an abstraction, as it signals that the old abstraction features are not sufficient.

It would be interesting to explore some of the concepts in this post with a mixture of refinements (to get the features we need) and abstractions (to simplify the models and get rid of the features we don't need), but that is beyond the scope of this current, already over-long, post. ↩︎

Specifically, we'd point - via labelled examples - at a cluster of features that correlate with functioning digestion, and another cluster of features that correlate with brain activity, and allow those two clusters to decorrelate with each other. ↩︎

It is no coincidence that, if R and R′ are rewards on M, that are identical on E, and if R∗ is a refactoring of R, then R∗ is also a refactoring of R′. ↩︎

Though note there are some problems with this approach, both in theory and in practice. ↩︎

Some more "body instincts" skills require more realistic environments, but some skills and procedures can perfectly well be trained in minimal simulators. ↩︎

You could define honour as "behaves according to the implicit expectations of their society", but that just illustrates how time-and-place dependent honour is. ↩︎

It's not impossible to get superhuman performance from apprenticeship learning; for example, we could select the best human performance on a collection of distinct tasks, and thus get the algorithm to have an overall performance that no human could ever match. Indeed, one of the purposes of task decomposition is to decompose complex tasks in ways that allow apprenticeship-like learning to have safe and very superhuman performance on the whole task. ↩︎


### Are We Right about How Effective Mockery Is?

Crossposted from Figuring Figuring.

*I did a second survey that fixed some of the flaws of the first survey. The results from the second survey significantly color the interpretation of the results from the first survey given in the first “Conclusion and Discussion” section. Please continue reading past the section titled “Second Survey” to get a full picture of the results from all surveys*.

*Intro*

A couple days ago a friend of mine on facebook asked about arguments in favor of mockery. They pointed out that they had noticed a lot of facebook posts mocking people for not wearing masks in the covid-19 era, and wondered whether this was an effective way to change people's behaviors.

I said in the comment section of that post that I would make a survey that worked as follows. Roughly half of the survey takers would be randomly assigned to answer the following questions:

- Do you think that mockery is an effective way to change people’s minds?
- Do you think that mockery is an effective way to change people’s behaviors?

The other half would be randomly assigned to answer these questions:

- Has being mocked ever caused you to change your mind about something?
- Has being mocked ever caused you to change your habits or behaviors?

No survey respondent was permitted to see all four questions. The possible answers to each question were “Yes”, “No”, and “Not sure”.

I made this survey using GuidedTrack. I posted it on my facebook wall, and also posted it to Positly and paid people to participate.

A total of 145 people responded to any of the questions on Positly. 74 were asked the first set of questions, and 71 were asked the second set of questions. A total of 66 people responded to any of the questions on facebook. 31 were asked the first set of questions, and 35 were asked the second set of questions.

Before I go on to tell you the results of the survey and the predictions me and some of my friends made, you might want to make your own predictions. I suggest you quickly scribble them down. Some particular questions you might want to make predictions about:

- Did more people answer yes to the first set of questions than the second set of questions, or is the reverse true, or were they about the same?
- Were facebook respondents (presumably people who are friends, or friends of friends of mine on facebook) more or less likely to say yes to the first set of questions?
- Were facebook respondents more or less likely to say yes to the second set of questions?
- What did I predict about the previous two questions?

There may be other fun questions to predict, and I’d be curious to hear how you did in the comments. Predictions from me and my friends coming up, so make sure you make your predictions beforehand. Again, I suggest that you write them down. You may also want to write down your reasoning beforehand.

_

_

_

_

_

_

_

_

_

__

_

_

_

_

_

_

____

__

_

_

__

_

_____

_

_

_

_

_

_

_

_

_

_

_

_

____

_

_

_

_

_

**Predictions**

Ok, last chance to make predictions before you hear some spoilers…

Alright.

I predicted that many more people would answer yes to the first set of questions (ie, the questions about whether mockery is effective) than to the second set of questions. I also predicted more people would say no to the second set of questions than to the first.

I’m not sure exactly what my theory was when I made that prediction—I made the prediction in the same comment that I suggested the survey, but I came up with two post hoc hypotheses that might explain the result I predicted. I do know that part of the reason I made that prediction is that mockery is fun, but admitting that fun is the main reason we do it rather than because of its positive effects on other people’s behavior feels kind of icky. So we use its effectiveness as an excuse.

One hypothesis is that we overestimate the effectiveness of mockery. This would make sense of the predicted result because it would be evidence that we all think mockery works on others, but none of us thinks it works on us.

The second hypothesis I made up to explain this predicted result was that while we know that mockery works on other people, we are hesitant to admit that it works on us, because that is a bit embarrassing. Perhaps people are also not that great at telling what actually caused them to change their minds or behaviors.

These hypotheses are not mutually exclusive.

In a private conversation, my friend predicted using similar reasoning that actually people would tend to answer the second set of questions (ie, those about how often we change our own minds as a result of mockery) affirmatively. Saying that you think mockery is effective feels kind of icky, but saying that you think you have never had your mind or behavior changed because of mockery seems kind of arrogant.

Seeing how such similar reasoning could be used to predict a totally different result made me feel a bit nervous.

Another friend of mine predicted that my facebook friends would be less likely to change their minds because of mockery than randomly selected survey participants. Positly users aren’t quite randomly selected, but they’re closer to randomly selected than whatever people happened to come across my facebook post.

**Results**

Sure enough, I was totally wrong.

Twice as many Positly respondents said that mockery has worked on them as said that mockery is effective. Positly respondents were slightly more likely to say that mockery is effective for changing behavior than for changing people’s minds, both for themselves and for others.

I think this is strong evidence against the hypotheses I suggested, and some evidence in favor of the hypothesis my friend suggested in conversation.

My facebook acquaintances showed a slightly smaller gap. When it comes to behavior, only 1.2 times as many respondents said that mockery has worked on them as said that mockery is effective on others. However, when it comes to changing minds, still about twice as many said that mockery has worked on them as said that mockery is effective on others.

I was surprised by this, as I tend to think of myself as preferring people who do not use mockery, and not using mockery while also thinking it is effective is a hard pair of things for humans to do simultaneously.

Of the 35 facebook respondents who were asked the second set of questions, 37% said they had changed their mind because of mockery, while 59% of Positly respondents said the same. This seems like decent evidence to me that my friend was right about my facebook acquaintances being less likely to change their minds because of mockery.

No respondent said that mockery was effective for changing their own or other people's minds while also answering that it was not effective for changing their own or other people's behaviors.

**Other Responses**

Here is a list of some of the things that people said they changed because of mockery. This was an optional part of the survey. The entries are slightly edited for brevity, to protect anonymity, and to avoid repeats.

- Economic beliefs
- Working out habits
- Avoiding people who mock them
- Writing about topics on fb
- My own appearance
- How good other people are
- Picking my nose in public
- Crying in public
- Basic cultural rules, like where to sit, how to join a conversation, etc.
- Philosophical or ethical beliefs
- Individualism as an ethical stance
- Fashion
- Music
- Using an old fashioned word
- Beliefs about what is socially acceptable
- Conversational habits
- Wearing briefs instead of boxers
- Stopped whistling
- Stopped/started wearing shorts
- Eating habits
- Lost weight
- Stopped playing sports
- Being late
- Hairstyle
- Stopped watching anime
- Mocked for being autistic, so changed the way I interact with people.
- Started wearing make up.
- Mocked for being outgoing, became less outgoing and confident.
- Started liking Trump
- Left Mormon religion
- Started thinking more before speaking
- Started brushing teeth more
- Stopped being conservative

Here is a list of some of the things that people said mockery was effective for changing in other people. This was also an optional part of the survey. The entries in this list have been edited as in the previous list.

- Weird opinions
- The way people think
- The way others dress
- Haircuts
- Mask wearing
- How often someone complains
- Making someone hide their opinions
- Arrogance
- Getting people to stop doing things around you
- Getting someone to stop writing things in public

I think these lists are similar enough in content to rule out another explanation of this data. You might have thought that people think mocking people is an effective way to get other people to change certain kinds of things, but when they think about what sorts of things they have changed themselves because of mockery, the two categories do not have much of an intersection. These lists make that seem unlikely to me.

**Discussion and Conclusion (1)**

These results seem like some evidence to me that people in general underestimate the effectiveness of mockery for getting others to change their minds. This is of course not necessarily an argument for using more mockery. I, for one, take the results of this survey to be a further reason that we should not mock people.

If you thought mockery was just some harmless fun you can have with your in group, as I sort of did, you might have thought that the costs to those being mocked are actually not that great. But it seems like mockery can make someone leave their religion, stop writing in public, change their political preferences, etc. I would strongly prefer for people to make decisions about those sorts of things using object level reasoning rather than reasoning about what will cause them to be mocked less. I will now much more than before see mockery as deliberate enemy action designed to interrupt other people’s cognition—not something to be taken as a joke, especially not in the context of conversations about important topics.

**Second Survey**

*This section was written after getting the data for my second survey which was inspired by some criticisms of the questions in the first. Everything above was written before getting that data.*

On the other hand, the questions I asked people in the original survey were not exactly analogous to each other. Firstly, people might have answered the first set of questions considering that although mockery is rarely effective for changing the behaviors or beliefs of those being mocked, it might work on bystanders who watch the mocking happen. Secondly, people answering the second set of questions with a “yes” might be thinking that “yes, mockery has *ever* caused me to change my mind” but that does not mean it is very effective.

To correct for this, I made a second survey. Half of respondents were asked the following two questions:

- How often has mockery worked as a means of getting you to change your mind about something?
- How often has mockery worked as a means of getting you to change your habits or behavior?

The other half were asked the following two questions:

- How often does mockery work as a means of getting someone to change their mind about something?
- How often does mockery work as a means of getting someone to change their habits or behavior?

The possible responses were: “very often”, “often”, “sometimes”, “rarely”, and “almost never”.

The survey was published on Positly.

I will give you some room to make predictions before showing the results.

*(spoiler space)*

**Second Survey Results**

There were a total of 115 respondents: 57 were asked the first set of questions and 58 were asked the second set. Here are the results compared across groups.

Mapping “very often” to 4, “often” to 3, “sometimes” to 2, “rarely” to 1, and “almost never” to 0, this gives a mean response for group 1 question 1 of 1.4562, and a mean response for group 2 question 1 of 1.5345, meaning that respondents overall thought that mockery was slightly more effective on others than on themselves.

Using the same mapping, the average response for group 1 question 2 was 1.5789, and the average response for group 2 question 2 was 1.6667. Again, respondents overall thought that mockery was slightly more effective on others than on themselves.
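The mapping and means above can be reproduced with a short sketch. The response lists here are illustrative stand-ins, not the survey's actual data:

```python
# Map the Likert labels to scores and compute a mean response.
SCORES = {"very often": 4, "often": 3, "sometimes": 2, "rarely": 1, "almost never": 0}

def mean_response(responses):
    """Average Likert score for a list of response labels."""
    return sum(SCORES[r] for r in responses) / len(responses)

# Hypothetical responses, not the real tallies:
group1_q1 = ["sometimes", "rarely", "often", "almost never"]
print(mean_response(group1_q1))  # (2 + 1 + 3 + 0) / 4 = 1.5
```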

**Discussion and Conclusion (2)**

These results contradict my original interpretation of the first survey’s data. The second survey suggests that people are in general pretty well calibrated about the effectiveness of mockery, or perhaps slightly underestimate it. I conclude that much of the effect observed in the results of survey one’s data was caused by the two effects discussed at the beginning of the “Second Survey” section and not the result of people genuinely underestimating the effectiveness of mockery.

However, I think I am still going to take mockery more seriously than I did before, mostly because I still think this survey showed me that mockery is more effective than I thought it was. The list of personal examples people gave was fairly chilling. I also imagine people cave to mockery a lot more than they are able to notice or willing to admit on a survey. Furthermore, I don’t think it was a coincidence that it was mostly me and my weirdest friends who (incorrectly) predicted that people would say that mockery is much more effective on others than on themselves. Probably, we weirdos have grown numb to mockery’s sting, and fallen out of touch with what being mocked feels like for most people.

**I Would Like to Thank**

Ozzie Gooen for inspiring me to make these surveys with his facebook post.

Frank Bellamy and Julia Kris Dzweiria for pointing out the asymmetry of the questions in the original survey.

Beth Kessler and Aryeh Englander for useful discussion.

And Spencer Greenberg as well as the whole of the Positly and Guidedtrack teams for making it much easier to run surveys like these.

Discuss

### Introduction To The Infra-Bayesianism Sequence


**Prelude:**

Diffractor and Vanessa proudly present: The thing we've been working on for the past five months. I initially decided that Vanessa's scattered posts about incomplete models were interesting, and could benefit from being written up in a short centralized post. But as we dug into the mathematical details, it turned out it didn't really work, and then Vanessa ran across the true mathematical thing (which had previous ideas as special cases) and scope creep happened.

This now looks like a new, large, and unusually tractable vein of research. Accordingly, this sequence supersedes all previous posts about incomplete models, and by now we've managed to get quite a few interesting results, and have ideas for several new research directions.

Diffractor typed everything up and fleshed out the proof sketches; Vanessa originated almost all of the ideas and theorems. It was a true joint effort: this sequence would not exist if either of us were absent. Alex Mennen provided feedback on drafts, making them much more comprehensible than they would otherwise be, and Turntrout and John Maxwell also helped a bit with editing.

Be aware that this sequence of posts has the math-textbook issue of requiring you to load a tower of novel concepts that build on each other into your head, and it cannot be read in a single sitting. We will be doing a group readthrough on MIRIxDiscord where we can answer questions and hopefully find collaborators; PM me for a link.

TLDR: Infra-Bayesianism is a new approach to epistemology / decision theory / reinforcement learning theory, which builds on "imprecise probability" to solve the problem of prior misspecification / grain-of-truth / nonrealizability which plagues Bayesianism and Bayesian reinforcement learning. Infra-Bayesianism also naturally leads to an implementation of UDT, and (more speculatively at this stage) has applications to multi-agent theory, embedded agency and reflection. This post is the first in a sequence which lays down the foundation of the approach.

**Introduction:**

Learning theory traditionally deals with two kinds of setting: "realizable" and "agnostic" or "non-realizable". In realizable settings, we assume that the environment can be described perfectly by a hypothesis inside our hypothesis space (AIXI is an example of this). We then expect the algorithm to converge to acting as if it already knew the correct hypothesis. In non-realizable settings, we make no such assumption, and instead expect the algorithm to converge to the best approximation of the true environment within the available hypothesis space.

As long as the computational complexity of the environment is greater than that of the learning algorithm, the algorithm cannot use an easy-to-compute hypothesis that describes the environment perfectly, so we are in the non-realizable setting. When we discuss AGI, this is necessarily the case, since the environment is the entire world: a world that, in particular, contains the agent itself and can support other agents that are even more complex, much as the halting oracles needed to run Solomonoff induction appear nowhere among the hypotheses Solomonoff considers. The realizable setting is therefore usually only a toy model. So, instead of seeking guarantees of good behavior under the assumption that the environment is easy to compute, we'd like to get good behavior simply assuming that the environment has some easy-to-compute properties that can be exploited.

For offline and online learning there are classical results in the non-realizable setting, in particular VC theory naturally extends to the non-realizable setting. However, for reinforcement learning there are few analogous results. Even for passive Bayesian inference, the best non-realizable result found in our literature search is Shalizi's which relies on ergodicity assumptions about the true environment. Since reinforcement learning is the relevant setting for AGI and alignment theory, this poses a problem.

Logical inductors operate in the nonrealizable setting, and the general reformulation of them in Forecasting Using Incomplete Models is of interest for broader lessons applicable to acting in an unknown environment. In said paper, reality can be drawn from any point in the space of probability distributions over infinite sequences of observations, Δ(Oω). Almost all of the points in this space aren't computable, and because of that, we shouldn't expect convergence to the true environment, as occurs in the realizable setting where the true environment lies in your hypothesis space.

However, even if we can't hope to learn the *true* environment, we can at least hope to learn some *property* of the true environment, like "every other bit is a 0", and have our predictions reflect that if it holds. A hypothesis in this setting is a closed convex subset of Δ(Oω) which can be thought of as "I don't know what the true environment is, but it lies within this set". The result obtained in the above-linked paper was, if we fix a countable family of properties that reality may satisfy, and define the inductor based on them, then for all of those which reality fulfills, the predictions of the inductor converge to that closed convex set and so fulfill the property in the limit.
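As a concrete (hypothetical) instance of such a hypothesis, the property "every other bit is a 0" could be formalized as the set of distributions that assign probability one to conforming sequences:

```latex
% A hypothetical formalization: the closed convex set of distributions
% concentrated on sequences whose even-indexed bits are all 0.
B \;=\; \bigl\{\, \mu \in \Delta(O^{\omega}) \;:\;
  \mu\bigl(\{\, o \in O^{\omega} : o_{2k} = 0 \text{ for all } k \,\}\bigr) = 1 \,\bigr\}
```

Convexity holds because any mixture of distributions each assigning probability one to the conforming sequences also assigns them probability one.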

**What About Environments?**

However, this just involves sequence prediction. Ideally, we'd want some space that corresponds to environments that you can interact with, instead of an environment that just outputs bits. And then, given a suitable set B in it... Well, we don't have a fixed environment to play against. The environment could be *anything*, even a worst-case one within B. We have Knightian uncertainty over our set of environments; it is *not* a probability distribution over environments. So, we might as well go with the maximin policy.

$$\operatorname{argmax}_{\pi} \inf_{e \in B} \mathbb{E}_{\pi \cdot e}[U]$$

Where π⋅e is the distribution over histories produced by policy π interacting with environment e. U is just some utility function.
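As a minimal illustration, here is a one-step sketch of maximin policy selection, under the simplifying assumption that each environment in a finite set B is reduced to a map from action to expected utility (names like `B` and `maximin_policy` are illustrative, not from the post):

```python
def maximin_policy(actions, B):
    """argmax over actions of the worst-case expected utility inf_{e in B}."""
    return max(actions, key=lambda a: min(env[a] for env in B))

# Knightian uncertainty between two environments:
B = [{"left": 1.0, "right": 0.0},   # environment where "left" pays off
     {"left": 0.2, "right": 0.6}]   # environment where "right" pays off

best = maximin_policy(["left", "right"], B)
# "left" guarantees 0.2 against Murphy, "right" only 0.0, so maximin picks "left".
```

Note that maximin can pick an action whose best case is worse, as long as its guaranteed floor is higher.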

When we refer to "Murphy", this is referring to whatever force is picking the worst-case environment to be interacting with. Of course, if you aren't playing against an adversary, you'll do better than the worst-case utility that you're guaranteed. Any provable guarantees come in the form of establishing lower bounds on expected utility if a policy is selected.

The problem of generating a suitable space of environments was solved in Reinforcement Learning With Imperceptible Rewards. If two environments are indistinguishable by any policy, they are identified; a mixture of environments corresponds to picking one of the component environments with the appropriate probability at the start of time; and there was a notion of updating.

However, this isn't good enough. We could find no good update rule for a set of environments, so we had to go further.

Which desiderata should be fulfilled to make maximin policy selection over a set of environments (actually, we'll have to generalize further than this) work successfully? We'll have three starting desiderata.

**Desideratum 1:** There should be a sensible notion of what it means to update a set of environments or a set of distributions, which should also give us dynamic consistency. Starting with a policy π, a policy π′ which acts like π except that, after history h, it acts better (according to the beliefs of the agent after observing h) than π usually does after h, should also do better as viewed from the start.

**Desideratum 2:** Our notion of a hypothesis (set of environments) in this setting should collapse "secretly equivalent" sets, such that any two distinct hypotheses behave differently in *some* relevant aspect. This will require formalizing what it means for two sets to be "meaningfully different", finding a canonical form for an equivalence class of sets that "behave the same in all relevant ways", and then proving some theorem that says we got everything.

**Desideratum 3:** We should be able to formalize the "Nirvana trick" (elaborated below) and cram any UDT problem where the environment cares about what you *would* do, into this setting. The problem is that we're just dealing with sets of environments which only depend on what you do, not what your policy is, which hampers our ability to capture policy-dependent problems in this framework. However, since Murphy looks at your policy and then picks which environment you're in, there *is* an acausal channel available for the choice of policy to influence which environment you end up in.

The "Nirvana trick" is as follows. Consider a policy-dependent environment, a function Π×(A×O)<ω×A→ΔO (i.e., the probability distribution over the next observation depends on the history so far, the action you selected, and your policy). We can encode a policy-dependent environment as a set of policy-independent environments that don't care about your policy, by hard-coding every possible deterministic policy into the policy slot, making a family of functions of type (A×O)<ω×A→ΔO, which is the type of policy-independent environments. It's similar to taking a function f(x,y), and plugging in all possible x to get a family of functions that only depend on y.

Also, we will impose a rule that, if your action ever violates what the hard-coded policy predicts you do, you attain Nirvana (a state of high or infinite reward). Then, Murphy, when given this set of environments, will go "it'd be bad if they got high or infinite reward, thus I need to pick an environment where the hard-coded policy matches their *actual* policy". When playing against Murphy, you'll act like you're selecting a policy for an environment that *does* pay attention to what policy you pick. As-stated, this doesn't quite work, but it can be repaired.
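Here's a hypothetical one-shot sketch of the Nirvana trick, under the illustrative assumptions that a "policy" is just a single action and Nirvana counts as infinite reward: hard-coding each possible policy yields policy-independent environments, and Murphy's minimization forces the hard-coded policy to match your actual one.

```python
NIRVANA = float("inf")

def nirvana_family(policy_dependent_env, actions):
    """Hard-code each possible policy, yielding policy-independent environments
    that grant Nirvana whenever the actual action deviates from the hard-coded one."""
    family = []
    for hardcoded in actions:
        def env(action, hardcoded=hardcoded):
            if action != hardcoded:
                return NIRVANA   # deviating from the prediction attains Nirvana
            return policy_dependent_env(hardcoded, action)
        family.append(env)
    return family

def murphy_value(action, family):
    """Murphy picks the worst-case environment; Nirvana-granting ones never win."""
    return min(env(action) for env in family)

# Example: an environment that rewards you iff it predicts you take "b".
pd_env = lambda policy, action: 1.0 if policy == "b" else 0.0
fam = nirvana_family(pd_env, ["a", "b"])
# Taking "b": only the env with hard-coded "b" avoids Nirvana, so you get 1.0,
# exactly as if you had selected the policy "b" in the policy-dependent environment.
```

The `hardcoded=hardcoded` default argument pins down the loop variable in each closure; without it, every `env` would see the last value of `hardcoded`.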

There's two options. One is making Nirvana count as infinite reward. We will advance this to a point where we can capture any UDT/policy-selection problem, at the cost of some mathematical ugliness. The other option is making Nirvana count as 1 reward forever afterward, which makes things more elegant, and it is much more closely tied to learning theory, but that comes at the cost of only capturing a smaller (but still fairly broad) class of decision-theory problems. We will defer developing that avenue further until a later post.

**A Digression on Deterministic Policies**

We'll be using deterministic policies throughout. The reason for using deterministic policies instead of probabilistic policies (despite the latter being a larger class), is that the Nirvana trick (with infinite reward) doesn't work with probabilistic policies. Also, probabilistic policies don't interact well with embeddedness, because it implicitly assumes that you have a source of random bits that the rest of the environment can never interact with (except via your induced action) or observe.

Deterministic policies can emulate probabilistic policies by viewing probabilistic choice as deterministically choosing a finite bitstring to enter into a random number generator (RNG) in the environment, and then you get some bits back and act accordingly.

However, we aren't assuming that the RNG is a good one. It could be insecure or biased or nonexistent. Thus, we can model cases like Death In Damascus or Absent-Minded Driver where you left your trusty coin at home and don't trust yourself to randomize effectively. Or a nanobot that's too small to have a high-bitrate RNG in it, so it uses a fast insecure PRNG (pseudorandom number generator). Or game theory against a mindreader that can't see your RNG, just the probability distribution over actions you're using the RNG to select from, like an ideal CDT opponent. It can also handle cases where plugging certain numbers into your RNG chip causes lots of heat to be released, or maybe the RNG is biased towards outputting 0's in strong magnetic fields. Assuming you have a source of true randomness that the environment can't read isn't general enough!

**Motivating Sa-Measures**

Sets of probability distributions or environments aren't enough; we need to add in some extra data. This can be best motivated by thinking about how updates should work in order to get dynamic consistency.

Throughout, we'll be using a two-step view of updating, where first, we chop down the measures accordingly (the "raw update"), and then we renormalize back up to 1.

So, let's say we have a set of two probability distributions μ1 and μ2. We have Knightian uncertainty within this set, we genuinely don't know which one will be selected, it may even be adversarial. μ1 says observation o has 0.5 probability, μ2 says observation o has 0.01 probability. And then you see observation o! The wrong way to update would be to go "well, both probability distributions are consistent with observed data, I guess I'll update them individually and resume being completely uncertain about which one I'm in", you don't want to ignore that one of them assigns 50x higher probability to the thing you just saw.

However, neglecting renormalization, we can do the "raw update" to each of them individually, and get m1 and m2 (finite measures, not probability distributions), where m1 has 0.5 measure and m2 has 0.01 measure.

Ok, so instead of a set of *probability distributions*, since that's insufficient for updates, let's consider a set of measures m, instead. Each individual measure in that set can be viewed as λμ, where μ is a probability distribution, and λ≥0 is a scaling term. Note that λ is not uniform across your set, it varies depending on which point you're looking at.
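To make the raw update concrete, here's a hypothetical sketch (the encoding of distributions as dicts over two-step histories is illustrative): each distribution gets chopped down by the probability it assigned to the observation, and the surviving mass λ is kept as data rather than renormalized away.

```python
def raw_update(distributions, obs):
    """For each distribution over two-step histories, keep only histories
    beginning with obs; return (lam, conditional) where lam = P(obs)."""
    out = []
    for mu in distributions:
        lam = sum(p for h, p in mu.items() if h[0] == obs)
        cond = {h[1]: p / lam for h, p in mu.items() if h[0] == obs} if lam > 0 else {}
        out.append((lam, cond))
    return out

mu1 = {("o", "a"): 0.25, ("o", "b"): 0.25, ("x", "a"): 0.5}   # assigns P(o) = 0.5
mu2 = {("o", "a"): 0.01, ("x", "a"): 0.99}                    # assigns P(o) = 0.01
measures = raw_update([mu1, mu2], "o")
# The lambdas are 0.5 and 0.01: the 50x ratio survives, instead of being
# thrown away by renormalizing each distribution individually.
```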

However, this *still* isn't enough. Let's look at a toy example for how to design updating to get dynamic consistency. We'll see we need to add *one more* piece of data. Consider two environments where a fair coin is flipped, you see it and then say "heads" or "tails", and then you get some reward. The COPY Environment gives you 0 reward if you say something different than what the coin shows, and 1 reward if you match it. The REVERSE HEADS Environment always gives you 0.5 reward if the coin comes up tails, but if it comes up heads, saying "tails" gets you 1 reward and "heads" gets you 0 reward. We have Knightian uncertainty between the two environments.

For finding the optimal policy, we can observe that saying "tails" when the coin is tails helps out in COPY and doesn't harm you in REVERSE HEADS, so that's a component of an optimal policy.

Saying "tails" no matter what the coin shows means you get 0.5⋅0+0.5⋅1=0.5 utility on COPY, and 0.5⋅1+0.5⋅0.5=0.75 utility on REVERSE HEADS. Saying "tails" when the coin is tails and "heads" when the coin is heads means you get 0.5⋅1+0.5⋅1=1 utility on COPY and 0.5⋅0+0.5⋅0.5=0.25 utility on REVERSE HEADS. Saying "tails" no matter what has a better worst-case value, so it's the optimal maximin policy.

Now, if we see the coin come up heads, how should we update? The *wrong* way to do it would be to go "well, both environments are equally likely to give this observation, so I've got Knightian uncertainty re: whether saying heads or tails gives me 1 or 0 utility, both options look equally good". This is because, according to past-you, regardless of what you did upon seeing the coin come up "tails", the maximin expected values of saying "heads" when the coin comes up heads, and saying "tails" when the coin comes up heads, are unequal. Past-you is yelling at you from the sidelines not to just shrug and view the two options as equally good.

Well, let's say you *already* know that you would say "tails" when the coin comes up tails and are trying to figure out what to do now that the coin came up heads. The proper way to reason through it is going "I have Knightian uncertainty between COPY which has 0.5 expected utility assured off-history since I say "tails" on tails, and REVERSE HEADS, which has 0.25 expected utility assured off-history. Saying "heads" now that I see the coin on heads would get me (0.5×1)+0.5=1 expected utility in COPY and (0.5×0)+0.25=0.25 utility in REVERSE HEADS, saying "tails" would get me (0.5×0)+0.5=0.5 utility in COPY and (0.5×1)+0.25=0.75 utility in REVERSE HEADS, I get higher worst-case value by saying "tails"." And then you agree with your past self re: how good the various decisions are.

Huh, the proper way of doing this update to get dynamic consistency requires keeping track of the fragment of expected utility we get off-history.

Similarly, if you messed up and precommitted to saying "heads" when the coin comes up tails (a bad move), we can run through a similar analysis and show that keeping track of the expected utility off-history leads you to take the action that past-you would advise, after seeing the coin come up heads.
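The arithmetic above can be checked mechanically. A small sketch (reward tables transcribed from the example; function names are illustrative) confirming that carrying the off-history utility term b makes the post-update maximin decision agree with the ex-ante one:

```python
# reward[env][coin][what_you_say]
reward = {
    "COPY":          {"heads": {"heads": 1, "tails": 0}, "tails": {"heads": 0, "tails": 1}},
    "REVERSE_HEADS": {"heads": {"heads": 0, "tails": 1}, "tails": {"heads": 0.5, "tails": 0.5}},
}

def value(env, say_on_heads, say_on_tails):
    return 0.5 * reward[env]["heads"][say_on_heads] + 0.5 * reward[env]["tails"][say_on_tails]

# Ex ante: maximin over full policies (what to say on heads, what to say on tails).
policies = [(h, t) for h in ("heads", "tails") for t in ("heads", "tails")]
best = max(policies, key=lambda p: min(value(e, *p) for e in reward))
# "say tails no matter what" is maximin-optimal, as computed above.

# After seeing heads, knowing we'd say "tails" on tails: b is the expected
# utility already secured off-history (the tails branch).
b = {e: 0.5 * reward[e]["tails"]["tails"] for e in reward}   # COPY: 0.5, REVERSE: 0.25

def updated_value(env, say):
    return 0.5 * reward[env]["heads"][say] + b[env]

best_now = max(("heads", "tails"), key=lambda s: min(updated_value(e, s) for e in reward))
# best_now is "tails", agreeing with past-you: dynamic consistency.
```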

So, with the need to keep track of that fragment of expected utility off-history to get dynamic consistency, it isn't enough to deal with finite measures m; that still isn't keeping track of the information we need. What we need is (m,b), where m is a finite measure, and b is a number ≥0. That b term keeps track of the expected value off-history so we make the right decision after updating. (We're glossing over the distinction between probability distributions and environments here, but it's inessential.)

We will call such an (m,b) pair an "affine measure", or "a-measure" for short. The reason for this terminology is that a measure can be thought of as a linear function from the space of continuous functions to R. But then there's this +b term stuck on that acts as utility, and a linear function plus a constant is an affine function. So, that's an a-measure: a pair of a finite measure and a b term where b≥0.

But wait, we can go even further! Let's say our utility function of interest is bounded. Then we can do a scale-and-shift until it's in [0,1].

Since our utility function is bounded in [0,1]... what would happen if you let in measures with negative parts, but only if they're paired with a sufficiently large b term? Such a thing is called an sa-measure, for signed affine measure. It's a pair of a finite signed measure and a b term that's as-large-or-larger than the amount of negative measure present. No matter your utility function, even if it assigns 0 reward to outcomes with positive measure and 1 reward to outcomes with negative measure, you're still assured nonnegative expected value because of that +b term. It turns out we actually *do* need to expand in this direction to keep track of equivalence between sets of a-measures, get a good tie-in with convex analysis because signed measures are dual to continuous functions, and have elegant formulations of concepts like minimal points and the upper completion.
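A quick numerical sketch of why the b term compensates for negative mass, under the illustrative assumption of a two-point outcome space: even the utility function most hostile to the negative part can't push the expected value below zero.

```python
m = {"x": 0.7, "y": -0.3}   # signed measure with 0.3 of negative mass
b = 0.3                     # b >= the negative part, so (m, b) is an sa-measure

def expectation_plus_b(f):
    """E_m[f] + b for a utility function f with values in [0, 1]."""
    return sum(m[o] * f(o) for o in m) + b

# Worst case for nonnegativity: f = 0 on positive-mass points, 1 on negative ones.
worst = expectation_plus_b(lambda o: 1.0 if m[o] < 0 else 0.0)
# worst is exactly 0 here: the +b term just barely covers the negative mass.
```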

Negative measures may be a bit odd, but as we'll eventually see, we can ignore them and they only show up in intermediate steps, not final results, much like negative probabilities in quantum mechanics. And if negative measures ever become relevant for an application, it's effortless to include them.

**Belief Function Motivation**

Also, we'll have to drop the framework we set up at the beginning where we're considering sets of environments, because working with sets of environments has redundant information. As an example, consider two environments where you pick one of two actions, and get one of two outcomes. In environment e0, regardless of action, you get outcome 0. In environment e1, regardless of action, you get outcome 1. Then, we should be able to freely add an environment e2, where action 0 implies outcome 0, and where action 1 implies outcome 1. Why?

Well, if your policy is to take action 0, e2 and e0 behave identically. And if your policy is to take action 1, e2 and e1 behave identically. So, adding an environment like this doesn't affect anything, because it's a "chameleon environment" that will perfectly mimic *some* preexisting environment regardless of which policy you select. In particular, if you consider the function mapping an action to the set of possible probability distributions over outcomes, adding e2 didn't change that function at all. Put another way, since there's no way to detect whether such an environment was added to the set (no matter what you do, it mimics a preexisting environment), we might as well add it, and seek some alternate formulation instead of "set of environments" that doesn't have these unobservable degrees of freedom in it.

To eliminate this redundancy, the *true* thing we should be looking at isn't a set of environments, but the "belief function" from policies to sets of probability distributions over histories. This is the function produced by having a policy interact with your set of environments and plotting the probability distributions you could get. Given certain conditions on a belief function, it is possible to recover a set of environments from it, but belief functions are more fundamental. We'll provide tools for taking a wide range of belief functions and turning them into sets of environments, if it is desired.
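A tiny sketch of the e0/e1/e2 example above, illustrating that the belief function is unchanged by adding the chameleon environment (the dict-of-frozensets encoding is illustrative):

```python
def e0(action): return {0: 1.0}        # outcome 0 regardless of action
def e1(action): return {1: 1.0}        # outcome 1 regardless of action
def e2(action): return {action: 1.0}   # outcome copies the action

def belief_function(envs, actions=(0, 1)):
    """Map each action to the set of outcome distributions the envs can produce."""
    return {a: frozenset(tuple(sorted(e(a).items())) for e in envs) for a in actions}

# e2 always mimics e0 (on action 0) or e1 (on action 1), so adding it
# leaves the belief function exactly the same.
```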

Well, actually, from our previous discussion, sets of probability distributions are insufficient; we need a function from policies to sets of sa-measures. But that's material for later.

**Conclusion**

So, our fundamental mathematical object that we're studying to get a good link to decision theory is not sets of probability distributions, but sets of sa-measures. And instead of sets of environments, we have functions from policies to sets of sa-measures over histories. This is because probability distributions alone aren't flexible enough for the sort of updating we need to get dynamic consistency, and in addition to this issue, sets of environments have the problem where adding a new environment to your set can be undetectable in any way.

In the next post, we build up the basic mathematical details of the setting, until we get to a duality theorem that reveals a tight parallel between sets of sa-measures fulfilling certain special properties, and probability distributions, allowing us to take the first steps towards building up a version of probability theory fit for dealing with nonrealizability. There are analogues of expectation values, updates, renormalizing back to 1, priors, Bayes' Theorem, Markov kernels, and more. We use the "infra" prefix to refer to this setting. An infradistribution is the analogue of a probability distribution. An infrakernel is the analogue of a Markov kernel. And so on.

The post after that consists of extensive work on belief functions and the Nirvana trick to get the decision-theory tie-ins, such as UDT behavior while still having an update rule, and the update rule is dynamically consistent. Other components of that section include being able to specify your entire belief function with only part of its data, and developing the concept of Causal, Pseudocausal, and Acausal hypotheses. We show that you can encode almost any belief function as an Acausal hypothesis, and you can translate Pseudocausal and Acausal hypotheses to Causal ones by adding Nirvana appropriately (kinda). And Causal hypotheses correspond to actual sets of environments (kinda). Further, we can mix belief functions to make a prior, and there's an analogue of Bayes for updating a mix of belief functions. We cap it off by showing that the starting concepts of learning theory work appropriately, and show our setting's version of the Complete Class Theorem.

Later posts (not written yet) will be about the "1 reward forever" variant of Nirvana and InfraPOMDP's, developing inframeasure theory more, applications to various areas of alignment research, the internal logic which infradistributions are models of, unrealizable bandits, game theory, attempting to apply this to other areas of alignment research, and... look, we've got a lot of areas to work on, alright?

If you've got the relevant math skills, as previously mentioned, you should PM me or Turntrout to get a link to the MIRIxDiscord server and participate in the group readthrough. You're more likely than usual to be able to contribute to advancing the research further; there's a lot of shovel-ready work available.

**Links to Further Posts:**


### Basic Inframeasure Theory


Our task in this post will be to develop the basic theory and notation for inframeasures and sa-measures. The proofs and concepts require some topology and functional analysis. We assume the reader is familiar with topology and linear algebra but not functional analysis, and will explain the functional analysis concepts in more detail. If you wish to read through these posts, PM me to get a link to MIRIxDiscord; we'll be doing a group readthrough where I or Vanessa can answer questions. Here's the previous post, and here are the proof sections. Beware: the proof sections are hard.

**Notation Reference**

Feel free to skip this segment and refer back to it when needed. Duplicate the tab, keep one of them on this section, and you can look up notation here.

X,Y: some compact metric space. Points in this are denoted by x or y.

d: Some distance metric; the usual one is the KR-metric between signed measures, defined as d(m,m′):=sup_{f_lip}|∫_X f_lip(x)dm−∫_X f_lip(x)dm′| where f_lip is a function X→[−1,1] with a Lipschitz constant of 1 or less. In other words: what's the biggest distance between the values the two measures output when you feed in a function that's limited in its ability to distinguish nearby measures, due to having a small Lipschitz constant?

M±(X): The Banach space of finite signed measures over X, equipped with the KR-metric/norm defined as above (the norm is derived from the metric by assessing the distance between the signed measure and the 0 measure). Elements are denoted by m. By Jordan decomposition, we can uniquely split m as m+ + m−, where the former is all-positive and the latter is all-negative.

C(X),C(X,[0,1]): The Banach space of continuous functions X→R. The latter one is the space of continuous functions bounded in [0,1]. Elements of C(X,[0,1]) are typically denoted by f.

m(f): We can interpret signed measures as continuous linear functionals on C(X). This is given by ∫_X f(x)dm. If m were actually a probability distribution μ, this would just be Eμ(f). These are generalized expectations.

b: used to refer to the number component of an a-measure or sa-measure.

Msa(X): The closed convex cone of sa-measures. An sa-measure is a pair (m,b) where b+m−(1)≥0. Elements of this (sa-measures) are denoted by M.

f+: A positive functional. A continuous linear functional that's nonnegative for all sa-measures.

B: A set of sa-measures.

EB(f): The expectation of a function f relative to a set of sa-measures. Defined as inf(m,b)∈B(m(f)+b).

Bmin,Buc: The set of minimal points of B (points that can't be written as a different point in B plus a nonzero sa-measure), and the upper completion of B (B+Msa(X)), respectively.

Ma(X): The closed convex cone of a-measures. An a-measure is a pair (m,b) where m is a measure (no negative component) and b≥0. It can also be written as (λμ,b) where λ≥0, and μ is a probability distribution.

λ: Given an a-measure, its λ value is the λ from writing the a-measure as (λμ,b). At the end, it's used for lambda-notation to describe complicated functions. Context distinguishes the two uses.

λ⊙: Either the minimal upper bound on the λ values of the minimal points of a set B, or the Lipschitz constant of a function h (there's a close link between the two).

¯B, c.h(B): The closure and convex hull of a set, respectively.

H: An infradistribution. A set of sa-measures fulfilling the properties of nonemptiness, closure, convexity, upper-completeness, positive-minimals, (weak)-bounded minimals, and normalization.

□X: The set of infradistributions over X. □bX is the set of bounded infradistributions over X.

ΔX: The space of probability distributions over some set X.

h: The function induced by an H that goes f↦EH(f). Or, just a function C(X,[0,1])→[0,1] that's concave, monotone, uniformly continuous, and normalized; such functions are linked to infradistributions.

g,g∗: If you're seeing it in the context of a pushforward, g is a continuous function X→Y, and g∗ is the induced function □X→□Y. If you're seeing g in the context of updating, it's a continuous function in C(X,[0,1]).

ζ: Used to denote probability distributions used for mixing stuff together, a probability distribution over the natural numbers or a finite subset of them in all cases. ζi is the probability on the number i.

i: The index variable for mixing, like indexing infradistributions or points or functions.

EζHi: A mix of sets, defined as the set of every point that can be constructed by selecting a point from each Hi and mixing according to ζ.

L: A function in C(X,[0,1]), thought of as the indicator function for a fuzzy set, that we use for updating.

f★Lg: The function made by gluing f and g together via L, defined as (f★Lg)(x):=L(x)f(x)+(1−L(x))g(x).

supp(f): The support of a function f, the set of x where f(x)>0.

1E: The indicator function that's 1 on set E, 0 everywhere else.

PgH(L): The probability of L relative to the function g according to the infradistribution H, defined as EH(1★Lg)−EH(0★Lg).

H|gL: The infradistribution H updated on the fuzzy set L relative to the function g.

K: An infrakernel, a function fulfilling some continuity properties, of type signature X→□Y.

**Basic concepts**

Time to start laying our mathematical groundwork, like the spaces we'll be working in.

X is some compact metric space, equipped with the Borel σ-algebra (which, in this case, is the same as the Baire σ-algebra). We could probably generalize this further to complete metric spaces, but things get significantly trickier (one of many directions for future research), and compact metric spaces are quite well-behaved.

Concrete examples of compact metric spaces include the space of infinite bitstrings, the color space for a mantis shrimp, the surface of a sphere, a set of finitely many points, the unit interval, the space of probability distributions over a compact metric space, and countable products of compact metric spaces.

Let's recap some functional analysis terminology for those seeing it for the first time, along with a bit of the notation we're using; you can skip this part if you already know it. Vector spaces in full generality may lack much of the nice structure present in Rn that's used in Linear Algebra. Going from the strongest and most useful structure to the weakest, there's a chain of implication: inner product, norm, metric, topology. If you have an inner product, you can get a norm. If you have a norm, you can get a metric from that via d(x,y)=||x−y||, and if you have a metric, you can get a topology from that (with a basis of open balls centered at points). The structure must be imposed on the vector space, and there may be some freedom in doing so, like how Rn can have the L1, L2, or L∞ norm.

A Banach space is a vector space equipped with a norm (a notion of size for vectors), that's also closed under taking limits, just like R is.

The term "functional" is used for a function to R. So, a continuous linear functional on a vector space V is a function that's: linear, continuous, and has type signature V→R.

The term "dual space of V" is used for the vector space of continuous linear functionals V→R.

The space C(X) is the Banach space of continuous functions X→R. We'll also use C(X,[0,1]) to denote the subset that just consists of continuous functions bounded in [0,1]. The dual space of C(X) is M±(X), the Banach space of finite signed measures over X.

Moving on from the functional analysis terminology, let's consider finite signed measures m, elements of M±(X). A signed measure m can be uniquely split into a positive part and a negative part, m+ + m−, by the Jordan Decomposition Theorem. The "finite" part just means that m+ doesn't assign any set ∞ measure and m− doesn't assign any set −∞ measure.

Now, we said that the space of finite signed measures was the dual space of C(X) (continuous functions X→R). So... how does a signed measure correspond to a continuous linear functional over C(X)? Well, it corresponds to m(f):=∫_X f(x)dm.

If m was a probability distribution μ, then μ(f) would be Eμ(f), so this is just like taking the expected value, but generalizing to negative regions in m. We'll be using this notation m(f) a whole lot. Because finite signed measures perfectly match up with continuous linear functionals C(X)→R, we can toggle back and forth between whichever view is the most convenient in the moment, viewing continuous linear functionals on C(X) as finite signed measures, and vice-versa.
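As a quick concrete sketch of the m(f) notation on a finite space (a toy representation invented here for illustration; the dict encoding and point names are not from the post):

```python
# Sketch: the m(f) notation on a finite space, with a signed measure as a
# dict from points to (possibly negative) masses.

def m_of_f(m, f):
    """m(f) = integral of f dm = sum_x f(x) * m({x})."""
    return sum(f(x) * mass for x, mass in m.items())

# When m is a probability distribution mu, m(f) is the usual E_mu(f):
mu = {"heads": 0.5, "tails": 0.5}
f = lambda x: 1.0 if x == "heads" else 0.0
assert m_of_f(mu, f) == 0.5

# The same formula works when some masses are negative:
m = {"heads": 0.7, "tails": -0.2}
assert m_of_f(m, f) == 0.7
assert abs(m_of_f(m, lambda x: 1.0) - 0.5) < 1e-9  # m(1): total signed mass
```

Toggling between "signed measure" and "linear functional" is then just a matter of which side of `m_of_f` you focus on.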

The ambient vector space we work in is M±(X)⊕R: pairs of a signed measure and a number. ⊕ is the direct sum, which is basically the Cartesian product for vector spaces. The direct sum of Banach spaces is a Banach space, with the norm defined in the obvious way as ||(m,b)||=||m||+|b|.

We should take a moment to talk about which norm/metric we're using. The norm/metric we're using on M±(X) is the KR (Kantorovich-Rubinstein) norm/metric.

**Definition 1: KR-metric**

*The metric defined by *d(m,m′):=sup_{f_lip}|m(f_lip)−m′(f_lip)|*, where *f_lip* is a continuous function *X→[−1,1]* with a Lipschitz constant of 1 or less. The KR-norm is *||m||=d(m,0)*.*

But why are we using this unfamiliar KR-metric, instead of the trusty old total variation distance? Well, the KR-metric, since it can only query the measure with functions that aren't too steep or big, says that two distributions are close if, y'know, they're close in the intuitive sense. If we have a bunch of Dirac-delta distributions at 0.9, 0.99, 0.999..., then according to the KR-metric, they limit to a Dirac-delta distribution at 1. According to total variation distance, all these distributions are at distance 2 from each other and don't converge at all. Similarly, if we've got two probability distributions over histories for an environment we're in, and they behave *very* similarly and then start diverging after the gazillionth timestep, the KR-metric would go "hm, those two distributions are very close to each other", while total variation distance says they're very different. Also, if we've got finitely many points at distance 1 from each other, the KR-metric and total variation distance match up with each other (up to a constant). But total variation distance is a bit too restrictive for the continuous case.
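The Dirac-delta example can be sketched numerically. For Dirac deltas, the sup over functions bounded in [−1,1] with Lipschitz constant ≤1 has a simple closed form, min(|x−y|, 2), and distinct deltas are at total variation distance 2 under the signed-measure norm convention (these closed forms are standard facts, stated here as a toy illustration rather than anything proved in the post):

```python
# Sketch: KR distance vs total variation distance between Dirac deltas.

def kr_dirac(x, y):
    # sup of |f(x) - f(y)| over f with |f| <= 1 and Lipschitz constant <= 1
    return min(abs(x - y), 2.0)

def tv_dirac(x, y):
    # Distinct Dirac deltas are at total variation distance 2
    # (mass 1 moved each way under the signed-measure norm).
    return 0.0 if x == y else 2.0

points = [0.9, 0.99, 0.999, 0.9999]
# KR: the deltas approach the delta at 1...
assert all(kr_dirac(p, 1.0) < 0.11 for p in points)
# ...while in total variation they stay at distance 2 and never converge.
assert all(tv_dirac(p, 1.0) == 2.0 for p in points)
```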

There's a sense in which convergence in total variation distance is too strict, because it requires a "perfect match" to exist in your hypothesis space, while convergence in KR-distance is just right for nonrealizability because, instead of requiring a "perfect match", it just requires that you get sufficiently close. Instead of getting accurate predictions for the rest of forever, the requirement is something more like "you'll be accurate for the next zillion timesteps, and the time horizon where you start being inaccurate gets further and further away over time". You can draw an analogy to how utility functions with time discounting don't care much about what happens at very late times.

Going with the KR-metric means we've got very nice dual spaces and compactness properties, while with total variation distance, Wikipedia doesn't even know what the dual space *is*.

So, tl;dr, the KR-metric is a much better choice for our setting, and we're working in M±(X)⊕R as our vector space, which is equipped with the KR-norm and is closed under limits.

**Definition 2: Sa-Measure**

*A point *M∈M±(X)⊕R*, which, when written as a pair of a signed measure and a number *(m,b)*, has *b+m−(1)≥0*. The set of sa-measures is denoted by *Msa(X)*.*

**Definition 3: A-Measure**

*A point *M∈M±(X)⊕R*, which, when written as a pair of a signed measure and a number *(m,b)*, has *m* as a measure, and *b≥0*. The set of a-measures is denoted by *Ma(X)*.*

Note that Ma(X) and Msa(X) are both closed convex cones. A closed convex cone is a subset of a vector space that is closed under multiplication by any a≥0, closed under addition, and closed under limits. For visual intuition, imagine a literal cone with its point at 0 in R3 that's heading off in some direction, and see how it fulfills those three properties.
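On a finite space, the two membership conditions are easy to sketch, representing a point (m,b) as a (dict, float) pair (a toy encoding invented here for illustration):

```python
# Sketch: membership tests for the cones M_sa(X) and M_a(X) on a finite space.

def neg_part_total(m):
    """m^-(1): the (nonpositive) total mass of the negative part of m."""
    return sum(mass for mass in m.values() if mass < 0)

def is_sa_measure(m, b):
    # sa-measure condition: b + m^-(1) >= 0 (b covers the negative mass).
    return b + neg_part_total(m) >= 0

def is_a_measure(m, b):
    # a-measure condition: m is an actual measure (no negative part) and b >= 0.
    return all(mass >= 0 for mass in m.values()) and b >= 0

m, b = {"x": 0.5, "y": -0.3}, 0.4
assert is_sa_measure(m, b) and not is_a_measure(m, b)   # 0.4 + (-0.3) >= 0
assert not is_sa_measure(m, 0.2)                        # 0.2 + (-0.3) < 0
# Cone property check: scaling by a >= 0 preserves membership.
assert is_sa_measure({k: 2 * v for k, v in m.items()}, 2 * b)
```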

**Basic Inframeasure Conditions**

Before proceeding further, we should mention that Theorems 1, 2, and 3 are fairly elementary and have probably been proved in more generality in some different paper on convex analysis. We just call them theorems because they're important, not necessarily original. Past that, things get more novel. Sets of distributions instead of distributions have been considered before under the name "Imprecise Probability", as have nonlinear expectations and some analogues to probability theory, Shige Peng wrote a book on the latter. We found out about this after coming up with it independently. The main innovations that have not been found elsewhere are augmenting the sets of probability distributions with extra data (ie, our sa-measures) to get a dynamically consistent update rule, and how to deal with environments/link the setting to reinforcement learning. Let's move on.

Let B be some arbitrary set of sa-measures. We're obviously nowhere near calling it an infradistribution, because we haven't imposed any properties on it. And different B may have the same behavior; we're nowhere near our second desideratum of collapsing equivalence classes of B's with the same behavior. Well, nonemptiness should be a fairly obvious property to add.

**Condition 1: Nonemptiness:** B≠∅

From here, let's see how we can enlarge B without affecting its behavior. But hang on, what do we even mean by "behavior"??

**Definition 4: Expectation w.r.t. a Set of Sa-Measures**

EB(f):=inf(m,b)∈B(m(f)+b)* where *f∈C(X,[0,1])* and *B* is nonempty.*

This is what we mean by "behavior", all these values should be left unchanged, regardless of f.
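Definition 4 is easy to compute on a finite space with a finite B (a toy representation; the set B and the point names here are made up for illustration):

```python
# Sketch: E_B(f) is the worst-case (infimum) value of m(f) + b over B.

def m_of_f(m, f):
    return sum(f(x) * mass for x, mass in m.items())

def E(B, f):
    # inf over (m, b) in B of m(f) + b; min suffices for finite B.
    return min(m_of_f(m, f) + b for (m, b) in B)

B = [({"x": 1.0, "y": 0.0}, 0.0),   # all mass on x
     ({"x": 0.0, "y": 1.0}, 0.0)]   # all mass on y
f = lambda z: 1.0 if z == "x" else 0.0
assert E(B, f) == 0.0               # Murphy puts the mass where f is 0
assert E(B, lambda z: 1.0) == 1.0   # every element has total value 1
```

Two sets "behave the same" exactly when this function of f agrees on them.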

**Proposition 1:** *If *f∈C(X,[0,1])* then *f+:(m,b)↦m(f)+b* is a positive functional for *Msa(X)*.*

A positive functional for Msa(X) is a continuous linear function M±(X)⊕R→R that is nonnegative everywhere on Msa(X).

This suggests two more conditions besides Nonemptiness.

**Condition 2: Closure:** B=¯B

**Condition 3: Convexity:** B=c.h(B)

Why can we impose closure and convexity? Taking the closure of B wouldn't affect any expectation values, it'd only swap inf for min in some cases because (m,b)↦m(f)+b is continuous. Also, since everything we're querying our set B with is inf of a linear functional by Proposition 1, we can take the convex hull without changing anything. So, swapping B out for its closed convex hull, no expectation values change at all.

But wait, we aren't querying B with *all* linear functionals, we're *only* querying it with positive functionals that are constructed from an f in C(X,[0,1]). Or does this class of positive functionals go further than we think? Yes, it does, actually.

**Theorem 1, Functionals To Functions:** *Every positive functional on *Msa(X)* can be written as *(m,b)↦c(m(f)+b)*, where *c≥0*, and *f∈C(X,[0,1])

Nice! We actually are querying our set with all positive functionals, because we've pretty much got everything with just f∈C(X,[0,1]), and everything else is just a scalar multiple of that.

**Upper Completion and Minimal Points**

If you have a point M∈B, and some other point M∗ that's an sa-measure, we might as well add M+M∗ to B. Why? Well, given some positive functional f+ (and everything we're querying our set B with is a positive functional, by Proposition 1):

f+(M+M∗)=f+(M)+f+(M∗)≥f+(M)

By linearity and positive functionals being nonnegative on sa-measures, your new point M+M∗ has equal or greater value than M, so when we do infM∈Bf+(M), the addition of the new point didn't change anything at all, regardless of which positive functional/continuous function (by Theorem 1) we're using. So then, let's add in *all* the points like this! It's free. This would be done via Minkowski sum.

B+Msa(X)={M|M=MB+M∗,MB∈B,M∗∈Msa(X)}

**Definition 5: Upper Completion**

*The upper completion of a set *B*, *Buc*, is *Buc:=B+Msa(X)

**Condition 4: Upper Completeness:** B=B+Msa(X)

Ok, so we add in all those points. Since we're adding two nonempty convex sets, the result is also convex. As for closure...

**Lemma 2:** *The upper completion of a closed set of sa-measures is closed.*

However, B+Msa(X) isn't *quite* what we wanted. Maybe there are more points we could add! We want to add in every sa-measure we possibly can to our set, as long as it doesn't affect the essential "behavior"/worst-case values. So, we *should* be able to add in a point if every positive functional/continuous function (Proposition 1 and Theorem 1) goes "the value of the point you're looking at is undershot by this preexisting point over here". This more inclusive notion of adding points to make B as big as possible (adding any more points would start affecting the "behavior" of our set) would be:

Add a point (m′,b′) to B if, for all f in C(X,[0,1]), there's a (m,b) in B where m′(f)+b′≥m(f)+b

Actually, this gets us *nothing* over just taking the upper completion/adding Msa(X)! Check out the next result.

**Proposition 2:** *For closed convex nonempty* B,

B+Msa(X)={M|∀f+∃M′∈B:f+(M)≥f+(M′)}

Combining Proposition 2 and Theorem 1, our notion of upper closure is exactly the same as "add all the points you possibly can that don't affect the inf(m,b)∈B(m(f)+b) value for any f".

Along with the notion of the upper completion comes the notion of a minimal point.

**Definition 6: Minimal Point**

*A minimal point of a closed nonempty set of sa-measures *B* is a point *M∈B* where, if *M=MB+M∗*, and *MB∈B*, and *M∗∈Msa(X)* then *M∗=0*. The set of minimal points is denoted *Bmin

So, minimal points can't be written as a different point in the same set plus a nonzero sa-measure. It's something that can't possibly have been added by the upper-completion if it wasn't there originally. We'll show a picture of the upper completion and minimal points (these two notions generalize to any closed subset of any closed cone), to make things more concrete.

**Theorem 2, Minimal Decomposition:** *Given a nonempty closed set of sa-measures *B*, the set of minimal points *Bmin* is nonempty and all points in *B* are above a minimal point.*

This means that we can take any point M∈B and decompose it into Mmin+M∗, where Mmin∈Bmin, and M∗ is an sa-measure. We use this a whole lot in proofs. The proof of this uses the axiom of choice in the form of Zorn's Lemma, but separability may let us find some way to dodge the use of the full axiom of choice.

**Proposition 3:** *Given a *f∈C(X,[0,1])*, and a *B* that is nonempty closed, *inf(m,b)∈B(m(f)+b)=inf(m,b)∈Bmin(m(f)+b)

So when evaluating EB(f), we can just minimize within the set of minimal points. Minimal points are the *only* thing that matters for the "behavior" of B.
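For a finite B over a finite space, the "above" relation and the minimal points can be computed directly, and Proposition 3 checked by hand (a toy sketch; the representation, tolerances, and example set are invented for illustration):

```python
# Sketch: minimal points of a finite set B of sa-measures. M is "above" M'
# when M - M' is a nonzero sa-measure; minimal points have nothing below them.

def neg_part_total(m):
    return sum(v for v in m.values() if v < 0)

def is_sa_measure(m, b):
    return b + neg_part_total(m) >= -1e-12  # small tolerance for floats

def diff(M, Mp):
    (m, b), (mp, bp) = M, Mp
    keys = set(m) | set(mp)
    return ({k: m.get(k, 0.0) - mp.get(k, 0.0) for k in keys}, b - bp)

def is_above(M, Mp):
    d, db = diff(M, Mp)
    nonzero = abs(db) > 1e-12 or any(abs(v) > 1e-12 for v in d.values())
    return nonzero and is_sa_measure(d, db)

def minimal_points(B):
    return [M for M in B if not any(is_above(M, Mp) for Mp in B)]

def E(B, f):
    return min(sum(f(x) * v for x, v in m.items()) + b for (m, b) in B)

B = [({"x": 1.0}, 0.0),
     ({"x": 1.0}, 0.5),            # first point plus (0, 0.5): not minimal
     ({"x": 1.0, "y": 0.2}, 0.0)]  # first point plus 0.2*delta_y: not minimal
assert minimal_points(B) == [({"x": 1.0}, 0.0)]
# Proposition 3: expectations only depend on the minimal points.
assert E(B, lambda z: 1.0) == E(minimal_points(B), lambda z: 1.0)
```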

**Proposition 4:** Given a nonempty closed convex B, Bmin=(Buc)min and (Bmin)uc=Buc

The set of minimal points is left unchanged when you take the upper completion, and taking the upper completion of the set of minimal points equals taking the upper completion of B. This is fairly intuitive from the picture.

**Theorem 3, Full Characterization:** *If the nonempty closed convex sets *A* and *B* have *Amin≠Bmin*, then there is some *f∈C(X,[0,1])* where *EA(f)≠EB(f)

**Corollary 1:** *If two nonempty closed convex upper-complete sets *A* and *B* are different, then there is some *f∈C(X,[0,1])* where *EA(f)≠EB(f)

Looking back at our second desideratum, it says "Our notion of a hypothesis in this setting should collapse "secretly equivalent" sets, such that any two distinct hypotheses behave differently in *some* relevant aspect. This will require formalizing what it means for two sets to be "meaningfully different", finding a canonical form for an equivalence class of sets that "behave the same in all relevant ways", and then proving some theorem that says we got everything."

And we did exactly that. Also, Theorem 3 and the other results justify the view of the minimal points as the "unique identifier" of a closed convex set. If two closed convex sets have the same minimal-point ID, then taking the upper completion gets you the same set, and they behave the same w.r.t all the queries we can throw at them. If two sets have a different minimal-point ID, then when we take the upper completion, they're different, and there's some query that distinguishes them.

**Minimal Point Conditions**

Well, these are the basics. But we can impose some more conditions. We don't really like these signed measures; it'd be nice to work *exclusively* with positive measures, if possible. The minimal points are all we really need to care about by Proposition 3, so let's require that they're all in the smaller cone Ma(X), which has no negative-measure shenanigans afoot. Renormalization may fail if there are minimal points with negative parts; this is analogous to how you can renormalize a positive measure back to a probability distribution, but a signed measure may not be renormalizable back to 1.

**Condition 5: Minimal-positivity:** Bmin⊆Ma(X)

Further, things can get a bit tricky in various places if the minimal points don't lie in some compact set. Compactness arguments let you show that you don't have to close your set after updating, and they show up a lot in our proofs of properties of belief functions. However, this next condition isn't essential, just convenient, and it's worthwhile to look at what happens when it's dropped, for future research.

**Condition 6a: Minimal-boundedness:** *There is a compact set *C* s.t. *Bmin⊆C*.*

**Proposition 5:** *Let *μ* denote an arbitrary probability distribution. If *Bmin⊆Ma(X)*, then the condition "there is a *λ⊙* where, *∀(λμ,b)∈Bmin:λ≤λ⊙*" is equivalent to "there is a compact *C* s.t. *Bmin⊆C*"*

We mostly use this formulation of Minimal-boundedness instead of the compact set requirement. We only have to bound the scale-terms on the minimal points and we have this property. Again, it's not *essential*, but very convenient.

Is there a weakening of bounded-minimals? Yes, there is. I haven't figured out what it means for the set of minimal points, but it's more mathematically essential, and I don't think it can be dropped if some other post wants to go further than we did. It can't be motivated at this point; we'll have to wait until we get to Legendre-Fenchel Duality.

**Condition 6b: Weak minimal-boundedness:** *The function *f↦EB(f)* is uniformly continuous.*

**Normalization**

So, we have almost everything. Nonemptiness, closure, convexity, upper-completeness, minimal-positivity, and minimal-boundedness/weak minimal-boundedness are our conditions so far.

However, there's one more condition. What's the analogue of renormalizing a measure back to 1 in this situation? Well, for standard probability distributions μ, Eμ(0)=0 and Eμ(1)=1. This can be cleanly ported over: we shall require that EB(0)=0 and EB(1)=1, and that's our analogue of normalization. Unpacking what the expectation means, this corresponds to: there are minimal points (λμ,b) where b is arbitrarily close to 0, and there's some minimal point (λμ,b) where λ+b=1, with no points having a lower λ+b value.

**Condition 7: Normalization:** EB(1)=1,EB(0)=0

Let's recap all the conditions. We'll be using H for something fulfilling all of the following conditions except maybe 6a.

**1: Nonemptiness:** H≠∅

**2: Closure:** H=¯H

**3: Convexity:** H=c.h(H)

**4: Upper Completeness:** H=H+Msa(X)

**5: Positive-Minimals:** Hmin⊆Ma(X)

**6a: Bounded-Minimals:** ∃λ⊙:(λμ,b)∈Hmin→λ≤λ⊙

**6b: Weak Bounded-Minimals:** The function f↦EH(f) is uniformly continuous.

**7: Normalization:** EH(0)=0,EH(1)=1

**Definition 7: (Bounded)-Infradistribution/Inframeasure**

*An inframeasure is a set of sa-measures that fulfills conditions 1-5 and 6b. An infradistribution *H* is a set of sa-measures that fulfills conditions 1-5, 6b, and 7. The "bounded" prefix refers to fulfilling condition 6a. The set of infradistributions is denoted as *□X*, the set of bounded infradistributions is denoted as *□bX*. *

Now, how do we get normalization if it isn't already present? Closure, Convexity, and Upper Completeness can be introduced by closure, convex hull, and upper completion, respectively. How do we turn an inframeasure into an infradistribution?

Well, just take every (m,b) in your set, and map it to: (1/(EB(1)−EB(0)))⋅(m, b−EB(0))

This may seem a bit mysterious. This normalization can be thought of as analogous to rescaling a utility function to be in [0,1] via scale-and-shift.

What we're doing first is shifting everything down by EB(0), which is as much as we can possibly manage without making b go negative anywhere. The utility-function analogue would be: if your expected utilities are (0.4,0.5,0.6), this is like shifting them down to (0,0.1,0.2).

The second thing we do is go "ok, what's the lowest value at 1? Let's scale that back up to 1". Well, it'd be EB(1)−EB(0) (remember, we shifted first), so we multiply everything by (EB(1)−EB(0))−1 to get our set of interest.
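The shift-then-scale recipe can be sketched on a finite space (a toy sketch using an invented finite representation; the example set B is made up for illustration):

```python
# Sketch: renormalizing a finite set B of a-measures so that E(0)=0, E(1)=1.
# Each (m, b) maps to (m, b - E_B(0)) scaled by 1/(E_B(1) - E_B(0)).

def m_of_f(m, f):
    return sum(f(x) * v for x, v in m.items())

def E(B, f):
    return min(m_of_f(m, f) + b for (m, b) in B)

def renormalize(B):
    e0, e1 = E(B, lambda x: 0.0), E(B, lambda x: 1.0)
    if e1 == e0:
        # The Proposition 6 failure case: nothing Murphy does matters.
        raise ZeroDivisionError("renormalization fails")
    s = 1.0 / (e1 - e0)
    return [({x: s * v for x, v in m.items()}, s * (b - e0)) for (m, b) in B]

B = [({"x": 0.5}, 0.1), ({"y": 0.8}, 0.1)]
Bn = renormalize(B)
assert abs(E(Bn, lambda x: 0.0)) < 1e-12         # normalization: E(0) = 0
assert abs(E(Bn, lambda x: 1.0) - 1.0) < 1e-12   # normalization: E(1) = 1
```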

Hang on, what if there's a divide-by-zero error? Well... yes, that can happen. For it to happen, we'd need EB(0)=EB(1). Does this correspond to any sensible condition which hopefully doesn't happen often?

**Proposition 6:** EB(0)=EB(1)* occurs iff there's only one minimal a-measure, of the form *(0,b)*.*

Put another way, divide by zero errors occur exactly when Murphy is like "oh cool, no matter what function they pick, the worst thing I can do is give them b value, and then nothing happens so they can't pick up any more value than that", so nothing matters at all. This is exactly like Bayesian renormalization failing when you condition on a probability-0 event (note that the measure component is 0 before rescaling). You give up and cry because, in the worst case, nothing you do matters.

**Proposition 7:** *Renormalizing a (bounded) inframeasure produces a (bounded) infradistribution, if renormalization doesn't fail.*

And we're done for now, we've made it up to infradistributions. Now, how can we analyze them?

**Legendre-Fenchel Duality**

There's a powerful way of transforming an infradistribution into another form, letting us look at the same thing in two completely different mathematical contexts. We can build up a sort of dictionary of what various concepts are in the two settings, or develop concepts in one setting and figure out what they correspond to in the other.

An example of this sort of thing is Stone Duality, where you can represent a topological space as its lattice of open sets, to translate a huge array of concepts back and forth between topology and lattice theory and work in whichever setting is more convenient. And, working with special lattices that can't always translate to topological spaces, you get locales and pointless topology! Duality theorems are highly fruitful.

**Theorem 4, LF-duality, Sets to Functionals:** *If *H* is an infradistribution/bounded infradistribution, then *h:f↦EH(f)* is concave, monotone, uniformly continuous/Lipschitz over *C(X,[0,1])*, *h(0)=0,h(1)=1*, and *range(f)⊈[0,1]→h(f)=−∞

So, expectation w.r.t an infradistribution is concave (not linear, as probability distributions are) and if f≥f′ then EH(f)≥EH(f′) (monotonicity). Paired with normalization, this means every appropriate f has EH(f)∈[0,1].

You get concavity from convexity, the −∞ thing from upper-completeness, monotonicity matches up with "all minimal points are a-measures", Lipschitz corresponds to "all minimal points have λ≤λ⊙", uniform continuity corresponds to the weak-minimal-bound condition, and h(0)=0,h(1)=1 obviously corresponds to normalization. This is moderately suggestive; the conditions we're imposing on the set side manifest as natural conditions on the concave functional we get from H.

Is there a reverse direction? How do we start with an h:C(X,[0,1])→[0,1] that fulfills the conditions, and get an infradistribution from that?

**Theorem 5, LF-Duality, Functionals to Sets:** *If *h* is a function *C(X)→R* that is concave, monotone, uniformly-continuous/Lipschitz, *h(0)=0,h(1)=1*, and *range(f)⊈[0,1]→h(f)=−∞*, then it specifies an infradistribution/bounded infradistribution by: *{(m,b)|b≥(h′)∗(m)}

*Where *h′* is the function given by *h′(−f)=−h(f)*, and *(h′)∗* is the **convex conjugate** of *h′*. Also, going from an infradistribution to an *h* and back recovers exactly the infradistribution, and going from an *h* to an infradistribution and back recovers exactly *h*.*

Another name for the convex conjugate is the Legendre-Fenchel transformation, so that's where we get the term Legendre-Fenchel Duality from. If you want, you can shorten it to LF-duality.

So, (bounded)-infradistributions are isomorphic to concave monotone normalized Lipschitz/uniformly continuous functions h:C(X,[0,1])→R. This is the LF-Duality. We can freely translate concepts back and forth between "sets of sa-measures that fulfill some conditions" and "concave functionals on C(X,[0,1]) that fulfill some conditions".

In particular, actual probability distributions correspond to linear, monotone, 1-Lipschitz, normalized functionals C(X,[0,1])→R. So, in the other half of the LF-duality, probability distributions and infradistributions are very nearly the same sort of thing; the only difference is that the former are linear and the latter are concave! This is the essential reason why so many probability theory concepts have analogues for infradistributions.

But... what does the LF-transformation actually do? Well, continuous linear functionals C(X)→R are equivalent to signed measures, and continuous linear functionals M±(X)→R (with our KR-norm) correspond to continuous functions over X. A hyperplane in M±(X)⊕R corresponds to a point in C(X)⊕R and hyperplanes in the latter correspond to points in the former. Points in H turn into hyperplanes above h, points on-or-below the graph of h turn into hyperplanes below H.

What this transform basically does is take each suitable f, check its value w.r.t h, convert the (f,h(f)) pair into a hyperplane in M±(X)⊕R, and go "alright, whichever set h came from must be above this hyperplane". Eventually all the hyperplanes are drawn and you've recovered your infradistribution as the region above the graph of the hyperplanes.
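To make this concrete, here's a numeric sketch (my own toy illustration, not from the post) of the convex conjugate on a 1-D grid: each point of a convex function becomes a hyperplane, and taking the biconjugate recovers the function, which is the points-to-hyperplanes-and-back picture in miniature.

```python
import numpy as np

# Toy 1-D Legendre-Fenchel transform: f*(m) = sup_x (m*x - f(x)),
# computed by brute force on a grid. For convex f, the biconjugate f**
# recovers f, mirroring the points-to-hyperplanes-and-back picture.
xs = np.linspace(-2.0, 2.0, 401)
ms = np.linspace(-4.0, 4.0, 401)
f = xs ** 2                                          # convex: f(x) = x^2
f_star = np.max(ms[:, None] * xs[None, :] - f[None, :], axis=1)
f_bistar = np.max(xs[:, None] * ms[None, :] - f_star[None, :], axis=1)

# Known closed form: the conjugate of x^2 is m^2 / 4
assert np.max(np.abs(f_star - ms ** 2 / 4)) < 1e-3
assert np.max(np.abs(f_bistar - f)) < 1e-3           # biconjugation recovers f
```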

Viewing □X as suitable concave functionals on C(X,[0,1]), we now have a natural notion of distance between infradistributions, analogous to total variation distance:

d(H1,H2):=supf∈C(X,[0,1])|EH1(f)−EH2(f)|

At this point, the use of our uniform-continuity condition is clearer. A uniform limit of Lipschitz functions may not be Lipschitz. However, a uniform limit of uniformly continuous functions is uniformly continuous, so the space □X is complete (has all its limit points).
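As a sanity check on this metric, here's a crude numeric estimate in a finite toy encoding of my own (an "infradistribution" is a finite list of (m,b) pairs with EH(f)=min over pairs of m·f+b, and the sup over f is replaced by random search):

```python
import numpy as np

def expect(pairs, f):
    # Toy finite-support expectation: worst case over the (m, b) pairs
    return min(m @ f + b for m, b in pairs)

rng = np.random.default_rng(0)
H1 = [(np.array([0.7, 0.3]), 0.0)]   # an ordinary distribution, as one pair
H2 = [(np.array([0.5, 0.5]), 0.0)]
# d(H1,H2) = sup over f in C(X,[0,1]) of |E_H1(f) - E_H2(f)|, by random search:
d = max(abs(expect(H1, f) - expect(H2, f))
        for f in rng.uniform(0.0, 1.0, size=(2000, 2)))
# For ordinary distributions this recovers total variation distance, 0.2 here
assert d <= 0.2 + 1e-9 and d > 0.18
```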

A lot of probability-theory concepts carry over to infradistributions. We have analogues of products, Markov kernels, pushforwards, semidirect products, expectations, probabilities, updating, and mixtures. These are most naturally defined in the concave-functional side of the duality first, and then you can conjecture how they work on sets, and you know you got the right set (modulo closure, convex hull, and upper completion) if the expectations w.r.t your set match up with the defining property in the concave functional picture. The post was getting long enough as is, so we won't cover most of them in detail, and will leave developing inframeasure theory more fully to later posts.

**Pushforwards and Mixing**

Let's look at the first one, pushforwards of a (bounded) infradistribution via a continuous g:X→Y. The standard probability theory analogue of a pushforward is starting with a probability distribution over X, and going "what probability distribution over Y is generated by selecting a point from X according to its probability distribution and applying g?"

On the positive functional level, this is defined by: (g∗(h))(f):=h(f∘g)

Let's take a guess as to what this is on the set level. The obvious candidate is: Take the (m,b) in H, do the usual pushforward of the measure via g to get a signed measure over Y, and keep the b term the same, getting a function g∗:Msa(X)→Msa(Y). If g is something that maps everything to the same point, our resulting measures will only be supported on one point, so we'd have to take upper-completion again in that case. But maybe if g is surjective we don't need to do that?

Let g∗(H) be the set produced by applying g∗ to H and taking the upper completion.

**Proposition 8:** *If f∈C(X,[0,1]) and g:X→Y is continuous, then Eg∗(H)(f)=EH(f∘g).*

**Proposition 9:** *g∗(H) is a (bounded) inframeasure if H is, and the image of H doesn't require upper completion if g is surjective.*

Proposition 8 certifies that our unique defining (g∗(h))(f)=h(f∘g) property is fulfilled by doing this to the set of sa-measures, so we have the proper set-analogue of the pushforward. And Proposition 9 says we were right: the only thing that may go wrong is upper completion. Generally, doing operations like this may not exactly make an inframeasure, but you're only a closed convex hull and upper completion away from your canonical form, as long as you check that the expectations match up with what you'd expect on the concave functional side.
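Proposition 8 can be checked numerically in a finite toy encoding (my own construction, not the post's: X has three points, Y has two, and expectations are minima over (m,b) pairs):

```python
import numpy as np

def expect(pairs, f):
    return min(m @ f + b for m, b in pairs)    # worst-case expectation

g = [0, 0, 1]                                  # g: X -> Y, merging two points

def pushforward(pairs):
    out = []
    for m, b in pairs:
        my = np.zeros(2)
        for x, y in enumerate(g):
            my[y] += m[x]                      # push each point's mass through g
        out.append((my, b))                    # the b term is kept as-is
    return out

H = [(np.array([0.5, 0.3, 0.2]), 0.0), (np.array([0.2, 0.2, 0.6]), 0.1)]
f_Y = np.array([0.8, 0.4])
f_X = np.array([f_Y[y] for y in g])            # f . g
# Proposition 8: E_{g*(H)}(f) = E_H(f . g)
assert np.isclose(expect(pushforward(H), f_Y), expect(H, f_X))
```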

Alright now, what about mixing? Like, mixing hypotheses to make a prior, according to some probability distribution ζ over the indices. The concave functional analogue is (Eζhi)(f):=Eζ(hi(f))

This works just fine with no extra conditions if we have uniform continuity, but for Lipschitzness, we need an extra condition. Letting λ⊙i be the Lipschitz constant for the functional hi, we need ∑iζiλ⊙i<∞ in order for the result to be Lipschitz. The set version should just be mixing of sets.

**Definition 8: Mixing Sets**

*Given a countable family of inframeasures Hi where i∈I⊆N, and a probability distribution ζ∈ΔI, then: *EζHi:={M|∃c∈∏iHi:∑iζic(i)=M}

*I.e., EζHi is the set of sa-measures that can be made by choosing one sa-measure from each Hi and mixing them together w.r.t. ζ.*

Try sketching out two sets on a piece of paper and figuring out what the 50/50 mix of them would be. This corresponds to "Murphy can pick whatever they want from each set, but they're constrained to play the points they selected according to the probability assigned to each Hi".

We should note that for bounded inframeasures, letting λ⊙i be the bound on the λ value of the minimal points of Hi by minimal-boundedness, we want ∑iζiλ⊙i<∞ to preserve our minimal-bounded condition for the mix.

**Proposition 10:** EEζHi(f)=Eζ(EHi(f))

**Proposition 11:** *A mixture of infradistributions is an infradistribution. If it's a mixture of bounded infradistributions with minimal point λ bounds of λ⊙i, and ∑iζiλ⊙i<∞, then the mixture is a bounded infradistribution.*

Again, Proposition 10 certifies that we got the right notion on the set level, and Proposition 11 certifies that we don't need to do any closure or upper completion cleanup. Now, how do mixtures interact with pushforwards?

**Proposition 12:** g∗(Eζ(Hi))=Eζ(g∗(Hi))

So, it doesn't matter whether you mix before or after pushing your infradistribution through a continuous function. The proof of this is quite nontrivial if we were to do it on the set level because of exhaustive verification of each of the conditions, but now that we've shown that we have the right set-level analogue of mixing and pushforwards, we can just work entirely in the concave-functional picture and knock the proof out in a couple of lines, so our duality is doing useful work.
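Proposition 10 can also be replayed in the same finite toy encoding (lists of (m,b) pairs with expectations as minima — my construction, not the post's):

```python
import numpy as np

def expect(pairs, f):
    return min(m @ f + b for m, b in pairs)    # worst-case expectation

H1 = [(np.array([0.6, 0.4]), 0.0), (np.array([0.3, 0.3]), 0.2)]
H2 = [(np.array([0.5, 0.5]), 0.0), (np.array([0.1, 0.8]), 0.05)]
zeta = [0.3, 0.7]
# Definition 8: every way of picking one sa-measure from each set and mixing
mix = [(zeta[0] * m1 + zeta[1] * m2, zeta[0] * b1 + zeta[1] * b2)
       for m1, b1 in H1 for m2, b2 in H2]
f = np.array([0.9, 0.2])
# Proposition 10: E_{mix}(f) = zeta-weighted mix of the E_{H_i}(f)
assert np.isclose(expect(mix, f),
                  zeta[0] * expect(H1, f) + zeta[1] * expect(H2, f))
```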

**Updating**

Let's move on to updates. We should mention that we'll be taking a two-step view of updating. First, there is chopping down the measure so that only the measure within the set we're updating on remains, and then there's renormalizing the measure back up to 1. Thinking of updating as a single unitary process instead of as separated into these two phases will confuse you.

First, what sorts of events can we update on? Well, we only have expectations of continuous functions, and in a classical setting, we can get probabilities from expectations (and then updates from that) by considering the expectation of an indicator function that's 1 on a set and 0 everywhere else. Sadly, in this setting, we can only take the expectation of a continuous function. A sharp indicator function for an event will be discontinuous unless the set we're updating on is clopen. Fortunately, the specific application of "conditioning on a finite history" (with discrete action and observation spaces) only involves conditioning on clopen sets, because the set of histories which have a certain finite history as a prefix are clopen.

But for more generality, with continuous sets of observations, we have a problem. Our restriction to only being able to update on *clopen* sets seems rather severe.

But, really, this isn't a problem. If you flip it around from "we can only update on clopen sets" to "huh, we can update on any continuous indicator function, we're basically updating on fuzzy sets now", things get more interesting. Like, maybe you observe the position of something through a noisy sensor? That'd be a fuzzy update. Also, remember how, in logical induction, since the traders weren't allowed to make discontinuous trades, MIRI had to work with fuzzy sets. We can justify this (and the restriction to only being able to do sharp updates) via topology.

In many cases, open sets correspond to properties that can be verified if true, but you might be out of luck if it's false. As an example, (0,∞) is open because, if you're reading the output bits one at a time, and you get 0.0000..., but it shifts to not-zero bits, you can verify that, but if it's 0.0000... forever, you're out of luck.

Conversely, closed sets correspond to properties that can be disproved if false, but you might be out of luck if it's true. (−∞,0] is an example of this, just consider getting 0.0000... and it continues forever.

Doing a crisp (not fuzzy) update requires ruling out everything outside the set we update on, and ruling in everything inside the set we update on, so this is an intuition for why we can update in the usual way on clopen sets, and can't do crisp updates on anything else. Clopen sets correspond to "verifiable if true, falsifiable if false".

Let's formalize how fuzzy updates work. A "likelihood function" is a continuous L:X→[0,1]. We interpret it as the indicator function for a fuzzy set.

A nifty feature of a continuous likelihood function L is that it lets us glue two continuous functions f,g∈C(X,[0,1]) into another continuous function.

**Definition 9: L-Gluing**

*Given three continuous functions L,f,g:X→[0,1], f glued to g via L, written f★Lg, is the function defined as: *f★Lg:=Lf+(1−L)g

So, f★Lg behaves like f inside the region we're updating on, behaves like g outside that region, and gets mixed in regions we're unsure of.
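On a finite grid, gluing is just a one-liner (toy numbers of my own):

```python
import numpy as np

L = np.array([1.0, 1.0, 0.5, 0.0, 0.0])   # fuzzy indicator: in, in, unsure, out, out
f = np.array([0.9, 0.8, 0.6, 0.3, 0.1])
g = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
glued = L * f + (1 - L) * g               # Definition 9: f inside, g outside
assert np.allclose(glued, [0.9, 0.8, 0.4, 0.2, 0.2])
```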

Now, let's try to motivate how updates work. Consider a toy example where there is a set of environments E. There is an agent with a fixed utility function which will observe some sensory data at the start, and select what to do. π⋅e is the distribution over outcomes produced by the policy π interacting with the environment e.

Suppose the agent observes the sensory data ρ at the start. Let L:E→[0,1] be the function e↦Pe(ρ) which gives the probability of observing ρ in each environment. Having observed ρ, the agent no longer has any control over what it would do had it not observed ρ, so let π¬ρ, the portion of its policy that is defined when it does not observe ρ, be fixed at whatever it is. The agent now must optimize over its choice of πρ, the portion of its policy that is defined when it observes ρ.

Let g:supp(1−L)→[0,1] be the function mapping an environment to the agent's utility conditional on ¬ρ, under the assumption that it follows the fixed policy π¬ρ; that is, g(e):=Eπ¬ρ⋅e(U|¬ρ)

For each possible choice of on-history policy πρ, we get a corresponding function f:supp(L)→[0,1] as the agent's utility conditional on ρ if it follows the policy πρ.

So, how do we design an update to get dynamic consistency? Well, just update such that the expectation of f post-update is the same as the expectation of f★Lg pre-update. The reason for this is that f★Lg rewrites as:

(f★Lg)(e)=L(e)f(e)+(1−L(e))g(e)

=Pe(ρ)⋅Eπρ⋅e(U|ρ)+Pe(¬ρ)⋅Eπ¬ρ⋅e(U|¬ρ)=Eπ⋅e(U)

So, your choice of πρ affects f, *that* can freely vary, but you can't control the probability of the observation (corresponding to your L), or your off-observation policy π¬ρ (corresponding to your g).

Going back to the positive functional picture with h (our analogue of an expectation), we can thus try to define an update relative to L (likelihood function) and g (off-event value), as:

hgL(f):=h(f★Lg), where f∈C(cl(supp(L)),[0,1]) (cl denoting topological closure)

But wait, this doesn't work; the resulting thing isn't normalized! This is more like a raw update (probability-theory analogue: just chopping down your measure) with no renormalization back to 1 afterwards. By monotonicity, we know that ∀f:h(0★Lg)≤h(f★Lg)≤h(1★Lg). Our renormalization is "subtract until h(0)=0, then scale until h(1)=1", so our update should *actually* be "do that, then renormalize".

Attempt 2: (h|gL)(f):=(1/(hgL(1)−hgL(0)))(h(f★Lg)−hgL(0))=(h(f★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg))

Squinting at this, this is: "do a raw update. Then subtract the excess value off (first step of renormalizing), and then rescale according to h(1★Lg)−h(0★Lg) which... well, ignoring the g part, looks kinda like "expectation of 1 - expectation of 0"... oh, it's like using the gap between the measure on X and the measure on ∅ to define how much you need to scale up a measure to be a probability distribution! That term on the bottom is the analogue of the probability of the event you're updating on, relative to the function g!"

Let's flesh this out a bit more in preparation for defining updates on the set side of the duality.

Backing up to our definition of renormalization, we required that EH(1)=1, and EH(0)=0. If we pivot to viewing the function inside the expectation as our L that gives an indicator function for something, the normalization condition says something like "probability of X is 1, probability of ∅ is 0".

Let's generalize this and define probabilities of fuzzy events relative to a function g.

**Definition 10: Probabilities of Fuzzy Sets**

*Given two functions g,L∈C(X,[0,1]), and a nonempty set of sa-measures B, the probability of L (interpreted as a fuzzy set) w.r.t B and g is: *PgB(L):=EB(1★Lg)−EB(0★Lg)

If g is 0, then for an infradistribution H, by normalization, and unpacking what that ★ means, we get P0H(L)=EH(L). This is much like how the probability of a set is the expectation of the indicator function for the set, just interpret the L as the indicator function for a fuzzy set on the right, and as a fuzzy set on the left. So, for g=0, it behaves like an actual probability.

However, when g≠0, then this is better interpreted as caring-measure. PgB(L) is the difference between the best score you can get vs Murphy and the worst score you can get vs Murphy if you know how things go outside of L (interpreted as a fuzzy set). This g-dependent "probability" is actually "how much value is at stake here/how much do I care about what happens inside set L".

And, further, our scaling term for renormalizing an inframeasure B that isn't an infradistribution yet was (EB(1)−EB(0))−1. So, using this reformulation, our rescaling term turns into (PgB(1X))−1 regardless of g. So, our renormalization term is "rescale according to the probability you assign to any event at all occurring".

Alright, we have enough tools to define updating on the set level. For the raw update (no rescaling yet), we need to chop the measure down according to L. We should also fold in the off-L value (requires specifying g) to the b term, by our dynamic consistency example. And then we do appropriate scale-and-shift terms to subtract as much as we can off the b term, and rescale according to the probability of L relative to the g we're also updating on.

Let's use m⋅L to denote the original finite signed measure but scaled down by the function L. If we view L as an indicator function, this is just slicing out all of the measure outside of L, ie, usual conditioning.

**Definition 11: Updating**

*H|gL, H updated on g and L, is the set made by mapping H through the following function and taking closure and upper-completion:* (m,b)↦(1/PgH(L))(m⋅L, b+m(0★Lg)−EH(0★Lg))

Closure is unnecessary if H is a bounded inframeasure, and upper-completion is unnecessary if your L is the indicator function for a clopen set.

Roughly, this is: Chop down the measure according to your fuzzy set, m(0★Lg) is the fragment of expected value you get outside of your fuzzy set so you fold that into the b term. For rescaling, when we unpack EH(0★Lg), it's just inf(m′,b′)∈H(m′(0★Lg)+b′), so that's the maximum amount of value we can take away from the second vector component without ever making it go negative. And then rescale "how much do I care about this situation" (PgH(L)) back up to 1.

We'll now show that in the special case where our H has a single minimal point of the form (μ,0) (a conventional probability distribution), this is just normal Bayesian updating. Updating an infradistribution with one minimal point is the same as just updating that one minimal point and taking the upper completion. And let's take 1S, the indicator function for a set S⊆X, as our L. So, by the definition of update, (μ,0) updates to

(1/PgH(1S))(μ⋅1S, 0+μ(0★1Sg)−EH(0★1Sg))

By unpacking the definition of PgH(1S), we get

(1/(EH(1★1Sg)−EH(0★1Sg)))(μ⋅1S, μ(0★1Sg)−EH(0★1Sg))

Now, because (μ,0) is the only minimal point, it is the thing that determines all the expectation values w.r.t the set H, so EH(f)=μ(f)+0 for all f, and we get

(1/(μ(1★1Sg)−μ(0★1Sg)))(μ⋅1S, μ(0★1Sg)−μ(0★1Sg))

Canceling, and noting that f★1Sg unpacks as 1Sf+(1−1S)g, we have:

(1/(μ(1S+(1−1S)g)−μ((1−1S)g)))(μ⋅1S, 0)

Using linearity, and canceling, we have:

(1/μ(1S))(μ⋅1S, 0)

Now, μ(1S) is the same as Eμ(1S) which is the same as Pμ(S), so we have

(1/Pμ(S))(μ⋅1S, 0)

Which is just (μ|S,0). So, updating our probability distribution in this way produces... exactly a conventional update.
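The derivation above can be replayed numerically in a finite toy encoding of my own: a single minimal point (μ,0), a sharp indicator 1S, and an arbitrary off-event g, recovering ordinary conditioning.

```python
import numpy as np

mu = np.array([0.1, 0.2, 0.3, 0.4])        # the single minimal point (mu, 0)
S = np.array([0, 0, 1, 1], dtype=bool)     # update on the sharp set S
one_S = S.astype(float)
g = np.array([0.5, 0.7, 0.0, 0.0])         # off-event value; cancels out below

def glue(f):                               # f glued to g via L = 1_S
    return one_S * f + (1.0 - one_S) * g

E = lambda f: mu @ f                       # E_H determined by the one minimal point
P = E(glue(np.ones(4))) - E(glue(np.zeros(4)))   # P^g_H(1_S)
m_new = (mu * one_S) / P                   # updated measure; the b term cancels to 0

assert np.isclose(P, mu[S].sum())          # the "probability" is just mu(S)
assert np.allclose(m_new[S], mu[S] / mu[S].sum())   # ordinary Bayesian update
```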

Note that this updating process lands us in a different vector space. Our new vector space is M±(cl(supp(L)))⊕R, where cl denotes topological closure. It still fulfills the nice properties we expect, because a closed subset of a compact space is compact, so every nice property still carries over. And it still has a closed convex cone of sa-measures w.r.t. the new space; abbreviate that as Msa(L).

What properties can we show about updates?

**Proposition 13:** *When updating a bounded infradistribution over Msa(X), if the renormalization doesn't fail, you get a bounded infradistribution over the set Msa(L). (For infradistributions in general, you may have to take the closure.)*

**Proposition 14:** EH(f★Lg)=EH(0★Lg)+PgH(L)⋅EH|gL(f)

Ok, so it's sensible on the set level. And for proposition 14, it means we can break down the expectation of two functions glued together by L into the expectation of g outside of L, and the probability of L relative to g times the updated expectation of f. We get something interesting when we reshuffle this. It rearranges to:

EH|gL(f)=(1/PgH(L))(EH(f★Lg)−EH(0★Lg))

Further unpacking the probability, we get

EH|gL(f)=(EH(f★Lg)−EH(0★Lg))/(EH(1★Lg)−EH(0★Lg))

And then, translating to concave functionals via LF-duality, we get:

(h|gL)(f)=(h(f★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg))

So, this shows that we got the right update! What happens when we update twice?

**Proposition 15:** (H|gL)|g′L′=H|g″(LL′), where g″=g★((1−L)/(1−LL′))g′, i.e. the gluing likelihood is (1−L)/(1−LL′)

So, updating twice in a row produces the same effect as one big modified update. It may be a bit clearer if we express it as:

**Corollary 2:** Regardless of L and L′ and g, (H|gL)|gL′=H|g(LL′)

**Corollary 3:** *If Y and Z are clopen sets, then, glossing over the difference between indicator functions and sets, *(H|gY)|gZ=H|g(Y∩Z)

Now, what happens when you update a prior/mixture of infradistributions? We get something that recovers Bayesian updating.

**Theorem 6, InfraBayes:**

(EζHi)|gL=Eζ(PgHi(L)⋅(Hi|gL))/Eζ(PgHi(L))*, if the update doesn't fail.*

This means that when we update a prior, it's the same as updating everything individually, and then mixing those with probabilities weighted by the probability the infradistribution assigned to L, just like standard Bayes!

In particular, if some hypothesis goes "nothing matters anymore" and gives up and cries after seeing L, then its probability term is 0, so it will drop out of the updated prior entirely, and now you're only listening to hypotheses that think what you do matters. Thus, with a fairly broad prior, we don't have to worry about the agent giving up on life because nothing matters post-update, just as long as *some* component in its prior gives it the will to continue living/says different policies have different values. Well, actually, we need to show an analogue of this for belief functions, but it pretty much works there too.

**Additional Constructions**

There are more probability theory analogues, but they are primarily material for a future post. We'll just give their forms in the concave functional view. If they look unmotivated, just note that they match up with the standard probability-theory notions if we use infradistributions corresponding to actual probability distributions. We'll be using lambda-notation for functions. If you haven't seen it before, λx.f(x,2) is the function that takes in an x, and returns f(x,2). λz.(λa.a+z) is the function that maps z to the function that maps a to a+z.

**Definition 12: Product**

*If h1∈□X and h2∈□Y, the product h1×h2∈□(X×Y) is given by: *(h1×h2)(f):=h1(λx.h2(λy.f(x,y)))

These products are noncommutative! The product of bounded infradistributions is a bounded infradistribution.

There's also infrakernels, the infra-analogue of Markov kernels. A Markov kernel is a function X→ΔY that maps each x to some probability distribution over Y. Concrete example: the function mapping income to a probability distribution over house size.

**Definition 13: Infrakernel**

*An infrakernel is a function X→□Y that is:*

*1: Pointwise convergent. For all sequences xn limiting to x and all f∈C(Y,[0,1]), limn→∞K(xn)(f)=K(x)(f)*

*2: Uniformly equicontinuous. For all ϵ, there is a δ where if |f−f′|<δ, then ∀x:|K(x)(f)−K(x)(f′)|<ϵ*

*If there is some λ⊙>0 where property 2 works with δ=ϵ/λ⊙, then it is a Lipschitz infrakernel.*

The first two conditions let you preserve uniform continuity, and the strengthening of the second condition lets you preserve Lipschitzness/being a bounded inframeasure.

Now, we can define the semidirect product, h⋉K. The semidirect product in the probability-theory case is... Consider the aforementioned Markov kernel mapping income to a probability distribution over house size. Given a starting probability distribution over income, the semidirect product of (income distribution) with (house size kernel) would be the joint distribution over income and house size. It's a critically important concept that isn't discussed much.

**Definition 14: Semidirect Product**

*If h∈□X, and K is an infrakernel X→□Y, the semidirect product h⋉K∈□(X×Y) is given by: *(h⋉K)(f):=h(λx.K(x)(λy.f(x,y)))

Products are just a special case of this, where K(x)=h2, regardless of x. If h is a bounded infradistribution and K is a Lipschitz infrakernel, then the semidirect product is a bounded infradistribution.
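In the linear (ordinary probability) special case, the semidirect product formula reduces to nested expectations over a Markov kernel; here's a toy check with made-up numbers:

```python
import numpy as np

h = np.array([0.4, 0.6])          # distribution over X = {0, 1}
K = np.array([[0.9, 0.1],         # K(0): distribution over Y
              [0.2, 0.8]])        # K(1)
f = np.array([[1.0, 0.0],         # f(x, y) on X x Y
              [0.5, 0.25]])

inner = (K * f).sum(axis=1)       # lambda x. E_{K(x)}(lambda y. f(x, y))
semidirect = h @ inner            # the semidirect product formula

joint = h[:, None] * K            # the explicit joint distribution on X x Y
assert np.isclose(semidirect, (joint * f).sum())
```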

The pushforward of a probability distribution is... given the house size Markov kernel, and a distribution over income, the pushforward is the induced distribution over house size. We earlier gave pushforwards for a continuous function g:X→Y. What's the analogue of the pushforward for an infrakernel? Or, can you do a pushforward w.r.t. a Markov kernel?

**Definition 15: Pushforward**

*If h∈□X, and K is an infrakernel X→□Y, the pushforward K∗(h)∈□Y is given by: *K∗(h)(f):=h(λx.K(x)(f))* And if k is a continuous (in the KR-metric) Markov kernel X→ΔY, the pushforward k∗(h)∈□Y is given by: *k∗(h)(f):=h(λx.Ek(x)(f))

An interesting note about this is, if h is a bounded infradistribution, then we need Lipschitz infrakernels to preserve that property for pushforwards, but we do not need any additional condition on a Markov kernel to preserve boundedness besides continuity. Exercise: Try to figure out why.

Really, everything originates from the semidirect product. The product is the special case of a semidirect product for a constant infrakernel, the pushforward is a semidirect product that's projected down to the Y coordinate, the pushforward w.r.t a Markov kernel is a special case of pushforward w.r.t. an infrakernel, and the pushforward w.r.t. a continuous function is a special case of pushforward w.r.t. a Markov kernel.

And that's about it for now, see you in the next post!


### Belief Functions And Decision Theory

{font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: 
MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}

So, if you haven't read the first two posts, do so now.

In this post, we'll be going over the basic theory of belief functions, which are functions that map policies to sets of sa-measures, much like how an environment can be viewed as a function that maps policies to probability distributions over histories. Also, we'll be showing some nifty decision theory results at the end. The proofs for this post are in the following three posts (1,2,3), though they're quite difficult and it's inessential to read them.

Now, it's time to address desideratum 1 (dynamic consistency) and desideratum 3 (how do we formalize the Nirvana trick to capture policy selection problems) from the first post. We'll be taking the path where Nirvana counts as infinite reward, instead of counting as 1 reward forever. The latter option is cleaner mathematically, ties into learning theory better, and makes it clearer what a pseudocausal hypothesis is, but this post was long enough as-is, so we'll only present the path where Nirvana counts as infinite reward, as that allows for more flexibility in encoding policy selection problems and a full tie-in with UDT.

**Definition 1: Nirvana**

*A special observation *N*, which may occur at any time, counts as infinite reward and brings the history to an end. An a-measure or sa-measure may be described as Nirvana-free when the set of histories ending in Nirvana has measure 0. The set of Nirvana-free a-measures is denoted as *NF*.*

Our ambient mathematical setting is one where, at each point, there are finitely many discrete actions or observations available (no continuous observation spaces here, though the available space of actions or observations may vary depending on the history so far); in any situation, Nirvana may occur, and it ends the history; and all policies are deterministic (refer back to the first post to see why this isn't much of an imposition).

We'll be working with bounded inframeasures (like a bounded infradistribution in all ways except normalization), instead of inframeasures, because it gets harder if we don't. Also, we'll be working within the cone of a-measures, but use the notion of upper-completion that uses the cone of sa-measures. This odd restriction is in place because Nirvana doesn't interact well with negative measures, as that would be −∞ reward and Murphy minimizes your reward, which forces a lot of awkward kludges in various settings.

If we had Nirvana count as 1 reward forever, we could drop almost all of these issues, but we're trying to replicate UDT here. When cramming a UDT problem into this setting, we do it by hard-coding every possible policy into an environment that depends on a policy, which produces a family of environments, and add the constraint that you attain Nirvana if you violate the hard-coded policy.

In order for Murphy to go "dang, I'd better make the hard-coded policy match the agent's own" (simulating a policy-dependent environment), you need the loss incurred by the environment mispredicting you to be smaller than the gain in value from attaining Nirvana when the environment mispredicts you. Infinite reward gets this easily, while 1 reward forever isn't quite potent enough to stamp out mispredictions in all circumstances; it's limited to problem setups where predictors are predicting you in circumstances that don't have a tiny probability of arising and don't happen too far in the future.

**Notation**

Feel free to skip this part and refer back to it as necessary, or open up a separate tab so you don't have to scroll all the way back up to this. A lot of it won't make sense now; it's just a reference. We have all the notation from the previous notation section, in addition to:

h: A finite history, ending in an observation. Similarly, a is an action.

γ: A number in [0,1), the time-discount parameter. This is used to define the distance between histories (for defining the KR-metric), as well as defining the distance between partial policies. We can also use γ as a superscript in a utility function to specify the time discount.

Π+,Π: The space of partial policies and full policies, respectively. All policies are deterministic. A partial policy could be thought of as a full policy, except that what it does may become undefined in some situations. Partial policies make a poset where the ordering is given by one partial policy being an extension of another. Full policies are maximal in this poset. There's also a distance metric between policies and partial policies, given by γt where t is the shortest time where they behave differently (whether by disagreeing with each other on what to do, or by one policy being defined where the other isn't).

πst,πpa,π: A policy-stub, partial-policy, and full policy, respectively. Full policies are pretty obvious. A policy-stub is a partial policy which is guaranteed to be undefined after some finite time, though the time at which it becomes undefined may be history-dependent. A partial policy is the most inclusive concept, including both policy stubs and full policies, and is a tree of actions and observations that dead-ends at histories where you don't specify what is to be done after that history.

π∅: The empty policy/empty stub. The maximally undefined partial policy that specifies absolutely nothing about what the policy does.

F(πpa),FNF(πpa): The outcome set of a partial policy. All the ending observation leaves on the tree given by πpa. This has the property that no matter what, if you play that partial policy, you'll get one of those outcomes. If πpa is infinite down some paths, then full outcomes (infinite histories) down those paths are also taken to be in this set. FNF(πpa) is the same, but omits every Nirvana observation.

Ma(F(πpa)),Msa(F(πpa)): The a-measures and sa-measures over the set F(πpa). An important part of this is that Msa(F(πpa)) can't assign negative measure to any Nirvana event, which is an important limitation, otherwise we could have −∞ expected value and the agent would give up and cry since it plans for the worst-case.

prπhipa,πlopa∗: The function with type signature Ma(F(πhipa))→Ma(F(πlopa)) (or the Nirvana-free variant) defined as the pushforward of the function mapping each bitstring in F(πhipa) to its unique prefix in F(πlopa), which happens exactly when πhipa≥πlopa. To be more concrete, the function that takes in a measure on 10-bit bitstrings and crunches it down to a measure on 8-bit prefixes is an instance of this sort of projection.

e: An environment.

Θ: A belief function, mapping each πpa to a set of a-measures over F(πpa). If these fulfill enough conditions, we call them hypotheses.

NF: The set of Nirvana-free a-measures for a given F(πpa), i.e. everything where the measure component doesn't assign nonzero measure to any Nirvana observation. Technically, there's a dependency on πpa, but we suppress this in the notation since the definition of the set is pretty much the same for all the πpa.

λ⊙,b⊙: Minimal points of a-measures can be broken down as (λμ,b) where λ,b≥0 and μ is a probability distribution. These are the minimal upper bounds on λ and b for all the Θ(πpa) sets, relative to a given Θ.

π¬h: Some partial policy which specifies everything the agent does except for what it does after history h.

EΘ(πpa)(f): Like our standard expectations relative to a set of sa-measures, just remember that Nirvana means infinite reward, so this can be rephrased as EΘ(πpa)∩NF(f).

PgΘ,π¬h(h): The quantity maxπ>π¬hEΘ(π)(1★hg)−EΘ(π¬h)(0★hg). This is the rescaling term in updates; it can be (very) roughly thought of as the probability of getting the finite history h relative to an off-h policy and a belief function.

¯¯¯¯¯¯¯¯c.h: The closed convex hull of a set.

Ma(∞): The space of a-measures on all infinite histories (no Nirvana in here, it'd end the history early). Obeying a deterministic policy only lets you reach a certain portion of this space.

{≤⊙}: Given a Θ and the associated λ⊙ and b⊙ upper bounds on the minimal points of all Θ(πpa), this is the set of all a-measures over F(πpa) where λ+b≤λ⊙+b⊙. Again, like NF, there's an implicit dependence on πpa we're suppressing here. Used to clip away part of our set for some arguments.

of: An outcome function. A function mapping each πpa to a point in Θ(πpa) such that prπhipa,πlopa∗(of(πhipa))=of(πlopa).

SM: An a-surmeasure or sa-surmeasure. Like an a-measure or sa-measure, except that the measure component can specify that some Nirvana events occur with arbitrarily-small-but-nonzero measure, where this quantity is denoted as 0+.

S: A set of a-environments/a-survironments. They can be written as (λe,b), where b is our constant term, λ is "how much we care about the environment", and e is an environment.

Iπlopa,πhipa∗: There's also an I∗s variant. This is the Nirvana injection up: a measure is injected up by going "if we need to extend a piece of measure on a history h to longer histories, we just say that h is guaranteed to end in Nirvana after you take an action". I∗s does the same thing, except that it also assigns 0+ measure (instead of the usual 0 measure) to every Nirvana event outside of F(πlopa).

↑,↓: The functions mapping a Θst or Θω defined only over policy stubs or full policies to a Θ over all partial policies, via:

↑(Θst)(πpa):=⋂πst≤πpa(prπpa,πst∗)−1(Θst(πst))

↓(Θω)(πpa):=¯¯¯¯¯¯¯¯c.h(⋃π≥πpaprπ,πpa∗(Θω(π)))

→ω,→st: The functions taking a Θ and restricting it to be only defined over full policies or only defined over stubs.

→c,→sc: The functions taking a hypothesis over stubs Θst that's pseudocausal/acausal and translating it to a causal/surcausal hypothesis over stubs, via:

→c(Θ)(πst):=¯¯¯¯¯¯¯¯c.h(⋃πlost≤πstIπlost,πst∗(Θst(πlost)))

And similar for →sc, just swap I∗ for I∗s.

→NF: The function taking a causal/surcausal hypothesis Θ and translating it to a pseudocausal/acausal hypothesis by just intersecting everything with NF, ie, clipping away every a-measure with nonzero measure on some Nirvana event.

π¬h∙πpa: The partial policy made by taking π¬h which specifies everything except what happens after observation h, and going "obey πpa after observation h". Just glue the two partial policies together.

c: The bookkeeping function which clips off the h prefix from all the histories in your measure after you update on h. Technically, there's an h dependence on this, but we suppress it in the notation.

gh: The adjusted function g over histories to compensate for clipping off the h prefix on histories after updating, via gh(h′):=g(hh′).

R: When used as a superscript, it means something has been renormalized, when used as normal, it means regret.

U: Some utility function. Uγ denotes a utility function with time discount parameter γ.

**Foundations:**

**Definition 2: A-history**

*An a-history is a finite history ending with an action.*

**Definition 3: O-history**

*An o-history is a finite history ending with an observation, or an infinite history. The empty history will also count as an o-history.*

**Definition 4: Partial Policy**

*A partial policy *πpa* is a partial function from finite o-histories (that can be continued with an action) to actions, s.t. if *πpa(h)* is well-defined, and *h′a* is an a-history that is a prefix of *h*, then *πpa(h′)=a*. The set of these is denoted *Π+*.*

Note that in order to specify the behavior of a partial policy in a situation, you need to specify its behavior on all prefixes of the situation; i.e., the policy must act in such a way that it's *possible* for the situation to come about. Also, because you can only specify one action in any given situation, the policy can't be defined on any o-history where the past actions don't match up with what the partial policy says to do.
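As a concrete illustration (not from the original post), here's a minimal Python sketch: a partial policy is encoded as a dict from o-histories, represented as tuples (a0, o0, a1, o1, ...), to actions, and we check the prefix condition of Definition 4. The encoding and action/observation labels are illustrative assumptions.

```python
def is_partial_policy(pi):
    """Check Definition 4: whenever pi(h) is defined, every a-history
    prefix h'a of h must satisfy pi(h') == a (so the policy is also
    defined on all prefixes, and agrees with the actions already taken)."""
    for h in pi:
        # even indices of h hold actions: h[i] is the action taken
        # right after the o-history h[:i]
        for i in range(0, len(h), 2):
            if pi.get(h[:i]) != h[i]:
                return False
    return True
```

For example, a dict that specifies an action at ('L', 'x') without specifying anything at the empty history fails the check, as does one whose root action makes the longer history impossible.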

**Definition 5: Policy Stub**

*A policy stub *πst* is a partial policy for which there is a finite *n* where, if *|h|>n*, then *πst(h)=⊥*. It is undefined beyond a certain time.*

**Definition 6: Full Policy**

*A full policy *π* is a partial policy that cannot have its domain where it is defined enlarged, without ceasing to be a partial policy. The set of these is *Π*. This is the same as a policy in the usual sense of the word.*

So, visualizing a partial policy as a big tree, policy stubs are the finite trees, full policies are the trees infinite down every branch, and partial policies are the all-encompassing category which also include trees that are infinite down some branches but not others.

**Definition 7: Empty Policy**

*The empty policy *π∅* is the partial policy/policy stub that is completely undefined. For all o-histories *h*, even including the empty history, *π∅(h)=⊥*.*

We can equip Π+ with a distance metric. The distance metric is γt(πpa,π′pa), where γ<1, and t(πpa,π′pa) is the "time of first difference", the length of the shortest o-history where the action of πpa doesn't match up with the action of π′pa, or where one of the πpa is defined and the other one isn't.
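The metric can be sketched in a few lines of Python. This is a hypothetical illustration where a partial policy is a dict from o-history tuples (a0, o0, a1, o1, ...) to actions, time is measured in full action-observation steps, and γ = 0.5 is an arbitrary choice:

```python
GAMMA = 0.5  # illustrative discount; any gamma < 1 works

def first_difference_time(pi1, pi2):
    """Timestep of the shortest o-history where the two partial
    policies disagree, or where one is defined and the other isn't."""
    t = float('inf')
    for h in set(pi1) | set(pi2):
        if pi1.get(h) != pi2.get(h):       # disagree, or one undefined
            t = min(t, len(h) // 2)        # two tuple entries per timestep
    return t

def policy_distance(pi1, pi2, gamma=GAMMA):
    t = first_difference_time(pi1, pi2)
    return 0.0 if t == float('inf') else gamma ** t
```

Two policies that agree forever are at distance 0; two that differ at the very first action are at the maximum distance γ^0 = 1.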

We can also equip Π+ with a partial order. The partial order is πhipa≥πlopa if the two policies never disagree on which action to take, and πhipa is defined on more histories than πlopa is. inf can be taken of arbitrary collections of partial policies (intersection), and sup (union) can be taken of arbitrary collections of partial policies as long as they're all below some full policy π.

The full policies are the maximal elements in this poset, the empty policy is the bottom element in this poset, and every partial policy πpa is associated with a sequence of stubs πnpa that are like "specify all of what this partial policy does up until time n, then be undefined afterwards". The chain πnpa ascends up towards πpa, and has the property that any stub πst≤πpa has some πnpa where πst≤πnpa.
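The stub chain πnpa is just truncation. A hypothetical sketch, again encoding a partial policy as a dict from o-history tuples (a0, o0, ...) to actions:

```python
def truncate(pi, n):
    """The stub pi^n: keep pi's choices only before timestep n
    (o-histories with fewer than 2n tuple entries), undefined after."""
    return {h: a for h, a in pi.items() if len(h) < 2 * n}
```

Each truncate(pi, n) lies below pi in the partial order, and the chain ascends back up to pi as n grows.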

**Outcome Sets and Projection Mappings**

The "outcome set" of a partial policy, F(πpa), is... well, let's start with a visualization and then give the formal definition.

Consider the branching tree of alternating actions and observations. The partial policy fixes a subtree of this tree, where every child of an observation node is an action given by πpa, and you extend no further if the policy becomes undefined at an observation. Then the outcome set F(πpa) would be all the leaves and infinite paths of that tree. If you follow the behavior of the partial policy, your history is *guaranteed* to go through one of the leaves. Thus, if your partial policy interacts with an environment, you'd get a probability distribution over F(πpa) (for the history is guaranteed to have something in F(πpa) as a prefix). Observe the picture.

**Definition 8: Outcome Set**

*The outcome set *F(πpa)* is the set of o-histories *h* that are not in the domain of *πpa*, but all prefix a-histories *h′a* of *h*, have *a=πpa(h′)*. By this definition, *F(π∅)* contains one element, the empty history. *FNF(πpa)* is the subset of *F(πpa)* consisting of all o-histories which don't end in Nirvana.*

F(πpa) and FNF(πpa) will take the role of our arbitrary compact metric spaces X that we defined infradistributions over. The metric is d(h,h′)=γt(h,h′), where γ<1 and t(h,h′) is the first timestep where the histories differ. So two histories are close if they agree for a long time before differing.
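For a finite stub, the outcome set can be enumerated directly. A minimal sketch under illustrative assumptions (a policy stub as a dict from o-history tuples (a0, o0, ...) to actions, and a fixed two-element observation set):

```python
def outcome_set(pi, observations=('x', 'y')):
    """F(pi) for a policy stub pi: the o-histories where pi becomes
    undefined, reached by following pi's own actions from the root."""
    leaves = []
    def walk(h):
        if h not in pi:            # pi undefined here: h is an outcome
            leaves.append(h)
            return
        a = pi[h]                  # the action pi dictates at h
        for o in observations:     # branch over every next observation
            walk(h + (a, o))
    walk(())
    return leaves
```

Any history produced by playing the stub is guaranteed to pass through exactly one of these leaves, which is why an environment plus a stub yields a probability distribution over this set.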

However, there's a technicality we need. While Ma(F(πpa)) and Ma(FNF(πpa)) and Msa(FNF(πpa)) work exactly as you'd expect, Msa(F(πpa)) can't contain signed measures that assign negative mass to a Nirvana event. This is because a positive functional (that respects Nirvana) would think that such a measure may have −∞ value. And requiring that the amount of Nirvana-measure is equal to or greater than the amount of negative-Nirvana measure interacts badly with updates. This issue causes minor kludges all over the place; we'll reiterate that Nirvana counting as 1 reward is much more elegant.

We'll also need the notion of a projection (this is going to be our analogue of the g∗ mapping from the basic theory).

**Definition 9: Projection Mapping**

*If *πhipa≥πlopa*, then the function *prπhipa,πlopa∗:Msa(F(πhipa))→Msa(F(πlopa))* is defined by:*

prπhipa,πlopa∗(m,b):=(m′,b)

*Where, if *h* is a prefix of something in *F(πlopa)*, then *m′(h)=m(h)*. This function also maps a-measures to a-measures and Nirvana-free stuff to Nirvana-free stuff.*

To expand more on this, if πhipa≥πlopa, then there's a natural surjection from F(πhipa) to F(πlopa) (which also acts as a surjection from FNF(πhipa) to FNF(πlopa)). Every history in F(πhipa) has a unique prefix in F(πlopa), so just send a history to its unique prefix. This is prπhipa,πlopa.

To visualize it, take your tree induced by πlopa. F(πlopa) is the coat of observation leaf nodes and all the infinite paths. Extending πlopa requires the new tree πhipa to poke out through some observation leaf nodes in F(πlopa), so every leaf node in F(πhipa) is attached to a part of the πhipa tree that pokes out through the original coat of leaf nodes F(πlopa) somewhere.

prπhipa,πlopa∗ is just the pushforward mapping induced by this, like how we got a function g∗ from sa-measures to sa-measures from a function g:X→Y back in the last post. pr∗ is the *exact* function you'd come up with if you went "given a measure on 10-bit bitstrings, how do I craft a measure on 7-bit prefixes" and generalized it.

Some important properties of this mapping: it's the identity when mapping from a partial policy to itself, it's linear, it preserves the λ and b values of an a-measure when you project it down, and it commutes: it doesn't matter what sequence of projections you take to get from a high partial policy to a low partial policy, they'll all be equal.
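The pushforward itself is tiny in code. A hypothetical sketch, where an a-measure is a pair (dict from o-history tuples to mass, b) and the lower outcome set is given as a set of tuples:

```python
def project(a_measure, F_lo):
    """Pushforward pr_*: move each history's mass to its unique prefix
    in the lower outcome set F_lo; the b term rides along unchanged,
    so both lambda (total mass) and b are preserved."""
    m, b = a_measure
    m_lo = {}
    for h, mass in m.items():
        # the unique prefix of h lying in F_lo
        prefix = next(h[:i] for i in range(len(h) + 1) if h[:i] in F_lo)
        m_lo[prefix] = m_lo.get(prefix, 0.0) + mass
    return (m_lo, b)
```

This is exactly the "measure on 10-bit bitstrings crunched down to a measure on 8-bit prefixes" operation from the notation section, generalized to outcome sets.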

**Definition 10: A-Environment**

*An a-environment is a point written as *(λe,b)*, where *b,λ≥0*, and *e* is an environment. This is parallel to an a-measure.*

The intuition for this is that λ tells you how much you care about the environment e/the probability of the environment existing in the first place if λ<1, and b is the utility you're guaranteed/portion of expected utility you get if the environment doesn't exist.

We can think of querying an a-environment with a partial policy to get an a-measure via (λ(πpa⋅e),b), where πpa⋅e is the probability distribution over F(πpa) induced by πpa interacting with the environment. We'll use this as an intuition pump a lot.
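Querying can be sketched concretely. In this hypothetical encoding, an a-environment is a triple (lam, env, b) where env maps an a-history (ending in the action just taken) to a probability distribution over the next observation, and a policy stub is a dict from o-history tuples to actions:

```python
def query(a_env, pi, observations=('x', 'y')):
    """Interact an a-environment (lam, env, b) with a policy stub pi,
    returning the a-measure (lam * (pi . env), b) over F(pi)."""
    lam, env, b = a_env
    m = {}
    def walk(h, p):
        if h not in pi:                    # reached an outcome leaf of F(pi)
            m[h] = m.get(h, 0.0) + lam * p
            return
        a = pi[h]
        probs = env(h + (a,))              # P(next obs | a-history)
        for o in observations:
            walk(h + (a, o), p * probs[o])
    walk((), 1.0)
    return (m, b)
```

Note that the measure component comes out with total mass λ, matching the "how much we care about the environment" reading of λ.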

**Belief Functions**

**Definition 11: Belief Function**

*A belief function *Θ* is a function that assigns all *πpa* to a nonempty set of a-measures over *F(πpa)* (or the Nirvana-free variant).*

The reason we're using a-measures instead of sa-measures is that we really don't need negative measures except for defining the upper completion, and negative measures *really* don't interact well with Nirvana being infinite reward (a drawback of this approach).

From the a-environment view, you can consider taking a set of a-environments, letting them all interact with a partial policy, and then plotting the resulting a-measures over F(πpa) you get, that'd be your Θ induced by a set of a-environments. But we'll just work with belief functions Θ more generally. We'll also abuse notation and use NF (Nirvana-free) as an abbreviation for Ma(FNF(πpa)) sometimes, when we don't want to specify an ambient space, and are just doing an intersection to clip away Nirvana.

Also, EΘ(πpa)(f)=EΘ(πpa)∩NF(f), because any probability of Nirvana counts as infinite value, so Murphy will only pick from Nirvana-free a-measures.
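Concretely, that identity just says the minimization can skip every Nirvana-positive a-measure. A minimal sketch under illustrative assumptions (Θ(πpa) as a finite list of a-measures, each a pair of a mass dict over o-history tuples and a b term, with 'N' marking Nirvana):

```python
NIRVANA = 'N'

def expectation(a_measure, f):
    m, b = a_measure
    return sum(mass * f(h) for h, mass in m.items()) + b

def is_nirvana_free(a_measure):
    m, _ = a_measure
    return all(mass == 0 or NIRVANA not in h for h, mass in m.items())

def infra_expectation(theta_pi, f):
    """E_Theta(pi)(f): Murphy picks the worst a-measure, and anything
    with positive Nirvana-measure has infinite value to the agent, so
    only the Nirvana-free a-measures compete for the minimum."""
    return min(expectation(M, f) for M in theta_pi if is_nirvana_free(M))
```

An a-measure putting mass on a Nirvana history simply never wins the minimization, no matter how low its non-Nirvana value is.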

**Belief Function Conditions**

Let's go over the various conditions for a belief function to behave well. The first 6 are analogues of the corresponding conditions for an infradistribution, and the last 3+2 are special conditions for how the different Θ(πpa) sets interact with each other.

**Condition 1: Nirvana-free Nonemptiness:** ∀πpa:Θ(πpa)∩NF≠∅

This is parallel to condition 1: Nonemptiness, for infradistributions.

It's nontrivial in the Nirvana-containing case, because it says "regardless of your policy, Murphy can find an environment to play that avoids Nirvana", as should be the case if we don't want the agent ignoring all else to optimize its chances of getting an infinite reward that never actually happens.

**Condition 2: Closure: **∀πpa:Θ(πpa)=¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯Θ(πpa)

This is parallel to condition 2: Closure, for infradistributions.

This also works with the Nirvana-free interpretation of Θ, because Ma(FNF(πpa)) is closed.

**Condition 3: Convexity:** ∀πpa:Θ(πpa)=c.h(Θ(πpa))

This is parallel to condition 3: Convexity, for infradistributions.

Again, Ma(FNF(πpa)) is convex so we can do this in that setting too. c.h is the convex hull.

**Condition 4: Nirvana-Free Upper Completeness:**

∀πpa:Θ(πpa)∩NF=((Θ(πpa)∩NF)+Msa(FNF(πpa)))∩Ma(F(πpa))

This is roughly parallel to condition 4: Upper Completeness, for infradistributions. This says that the Nirvana-free part of Θ(πpa) should be upper-complete w.r.t. the cone of Nirvana-free sa-measures. This is a bit of a hack, because Nirvana counts as infinite reward.

You probably have several questions. The first one: why do we restrict to a-measures? Well, our minimal points should be a-measures (by Condition 5: Positive Minimals for infradistributions), and those are the only thing that matters, so it's no big deal that we trim off all the sa-measures. Plus, some later tools like Nirvana injections don't work well with sa-measures.

Second question: Why do we need the cone of Nirvana-free sa-measures? Can't we just add the cone of Nirvana-free a-measures? Well, it's possible to add an sa-measure to an a-measure and get an a-measure, so just adding the cone of a-measures *wouldn't* capture enough points; we could add more.

Third question: Why do we require everything to be Nirvana-free? Can't we just require upper completion in general? Well, there's a weird thing where the projection of an upper-complete set isn't necessarily upper-complete when there's Nirvana about. But, in the Nirvana-free setting, the projection of an upper-complete set is upper-complete. This feature messes with several downstream things. Murphy will only pick from Nirvana-free a-measures, so it's no big loss.

Again, we see the pattern that having Nirvana be infinite forces a bunch of weird minor issues all over the place.

**Condition 5: Bounded Minimals:**

∃λ⊙,b⊙:∀πpa:(λμ,b)∈(Θ(πpa))min→λ≤λ⊙∧b≤b⊙

This is roughly parallel to condition 6: Bounded-Minimals for infradistributions. Note that said bound must be uniform among all the Θ(πpa) sets. We need to bound the b term uniformly as well to make some proofs work out.

**Condition 6: Normalization:** minπEΘ(π)(0)=0∧maxπEΘ(π)(1)=1

This is not the standard renormalization, but analogous. It also has the bad property that a mixture of normalized belief functions is not normalized. In our defense, it's all scale-and-shift anyways, you can just renormalize after mixing if you want.

Why is this defined in such a way? Well, it's because we're actually dealing with a bunch of different sets of a-measures, one for each partial policy, and since they're different, we can't get a guarantee that they're *all* normalized at once. You can't do individual normalizations, that'd break dynamic consistency. So we do a sort of collective normalization to Θ as a whole, to preserve the natural analogue of the normalization condition in this setting. Worst-case value should be 0 and best-case value should be 1, and this attains that.
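As a toy illustration of this collective scale-and-shift (my own finite encoding, not from the post: each Θ(π) is abbreviated to a list of (λ, b) pairs, which is enough to evaluate the constant functions, since E(0) = min b and E(1) = min λ+b when each μ is a probability distribution):

```python
# Minimal sketch of collective normalization, assuming each Theta(pi)
# is a finite list of a-measures abbreviated to (lambda, b) pairs.
def e0(ms):
    return min(b for lam, b in ms)       # worst-case value of the 0 function

def e1(ms):
    return min(lam + b for lam, b in ms)  # worst-case value of the 1 function

def normalize(theta):
    """One affine transform applied to every a-measure in every set,
    so that min over policies of E(0) is 0 and max of E(1) is 1."""
    worst = min(e0(ms) for ms in theta.values())
    best = max(e1(ms) for ms in theta.values())
    scale = best - worst
    return {pi: [(lam / scale, (b - worst) / scale) for lam, b in ms]
            for pi, ms in theta.items()}

theta = {"pi1": [(0.5, 0.2)], "pi2": [(0.8, 0.4)]}
normed = normalize(theta)
print(min(e0(ms) for ms in normed.values()),
      max(e1(ms) for ms in normed.values()))  # 0.0 1.0
```

Note that the *same* affine transform hits every Θ(π), which is why dynamic consistency survives, while normalizing each set individually would not.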

Now, we can move on to three additional conditions on how the various Θ(πpa) sets link together, instead of merely checking the usual infradistribution conditions for individual Θ(πpa).

**Consistency and Extreme Points**

**Condition 7: Consistency:**

∀πpa:Θ(πpa)=⋂πst≤πpa(prπpa,πst∗)−1(Θ(πst))

∀πpa:Θ(πpa)=¯¯¯¯¯¯¯¯c.h(⋃π≥πpaprπ,πpa∗(Θ(π)))

This seems a bit mystifying, but it pretty much says that the behavior of the finite levels (the policy-stubs πst which only specify things up to a finite time), and the infinite levels (the full policies π), uniquely pin each other down. This condition lets you recover Θ uniquely via only knowing it on policy-stubs or only knowing it on full policies.

The intuition for the first equality will be taken from the view where Θ tells you what you get via querying a set of a-environments with a partial policy and recording the a-measures you get. So, to begin with, if π≥πpa, then querying an a-environment with π and projecting your result down via prπ,πpa∗, gets you the same thing as querying that environment with πpa. And if the first query gave you a Nirvana-free thing, the second one must have done so too because π≥πpa. Therefore, prπ,πpa∗(Θ(π))⊆Θ(πpa). Also, if the set of a-environments is convex (the corresponding property to all the Θ(πpa) being convex), we should be able to mix a-environments, which mixes the a-measures you get via a query in the same way. So,

Θ(πpa)⊇c.h(⋃π≥πpaprπ,πpa∗(Θ(π)))

And also, Θ(πpa) is closed, so Θ(πpa) contains the closed convex hull as well.

In the Nirvana-containing case, we can go "hey, if there's a point in Θ(πpa) that the closed convex hull doesn't nab, can't we just query the associated superenvironment with some random π≥πpa to get something in Θ(π) that projects down to our point of interest?" And this gets you equality.

If we slice out the Nirvana, this equality is a significantly stronger condition. We could imagine taking an a-environment that's Nirvana-free for a long time, but past a certain point, regardless of policy, Nirvana is guaranteed. The a-measures made by querying this a-environment with all partial policies would show up in the small policy-stubs (being Nirvana-free), but because Nirvana eventually shows up regardless of policy, the a-measures would get clipped out of our Θ(πpa) sets because they have Nirvana in them and we're only recording Nirvana-free a-measures.

So, in the Nirvana-free case, this condition corresponds to saying that, given some Nirvana-free a-measure produced by a partial policy interacting with an a-environment, there should be *some* way to extend the partial policy to a full policy that's Nirvana-free as well. Or, if that can't be done, the a-environment should be a mixture of a-environments where you *can* extend your policy in a Nirvana-free way.

In the Nirvana-free case, this "Θ(πpa) is the closed convex hull of projections from above" property is basically saying "don't clog Θ of the partial policy with extraneous crap that didn't arise from appropriate Nirvana-free a-measures for full policies".

After all, we want to recover the whole thing from either the behavior of Θ on full policies, or the policy-stubs, and adding extra Nirvana-free stuff into the policy-stubs would mess with our ability to do this.

Now, let's look at the other one, Θ(πpa)=⋂πst≤πpa(prπpa,πst∗)−1(Θ(πst))

What's the motive for this one? Well, let's cover both directions. In one direction, if you were like "this intersection of preimages of projections doesn't have enough points, I'm gonna throw in another point M outside the intersection", then there'd be some stub πst≤πpa, where your added point would project down to be outside of Θ(πst), which, oops, can't be done. Remember, if you got your extra point M via querying an a-environment with πpa, then you should be able to query the same a-environment with πst which is shorter, to get the projection of M which had *better* be in Θ(πst). But it's not.

In the other direction, this is sort of an "instantiating points at infinity" condition. A point in that intersection would project down to Θ(πst) for all appropriate stubs. We've got a sequence of points in those sets, one for each stub. So, you can't actually detect the addition of these points at infinity, because their finite-time behavior will *always* flawlessly match up with some preexisting thing. Remember, we want to add as many points as we can that don't affect any observable behavior.

Thus, the consistency condition can be viewed as a mix of:

1: Projecting Θ(πpa) down should make a subset of Θ(πst) because if you can query an a-environment with a long policy, you can also query it with a short one.

2: The finite levels aren't clogged with extra crap that didn't come from infinite levels.

3: A point is present at infinity if it's present at all the finite levels, because no finite-time observations can make any headway towards ruling that point out.

**Condition 8: Extreme Point Condition:**

(M∈(Θ(πst))xmin∩NF)→(∃π>πst,M′:M′∈Θ(π)∩NF∧prπ,πst∗(M′)=M)

(Θ(πst))xmin is the set of extreme minimal points of Θ(πst).

To unpack this, an extreme point of a set B is a point that can't be written as a mixture of distinct points in B. Given a tetrahedron, the vertices would be extreme points, and given a sphere, the surface would be the extreme points.
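To see this definition in action, here's a hypothetical finite 2-D sketch (not the post's setting): a point of a finite set is extreme iff it's not a convex mixture of the other points, and in the plane Carathéodory's theorem means it suffices to test triangles (possibly degenerate) of other points.

```python
from itertools import combinations

def cross(o, u, v):
    """Signed area test: which side of line o->u the point v is on."""
    return (u[0]-o[0])*(v[1]-o[1]) - (u[1]-o[1])*(v[0]-o[0])

def in_triangle(p, a, b, c):
    """Inclusive point-in-triangle test via signed areas."""
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = any(d < 0 for d in (d1, d2, d3))
    has_pos = any(d > 0 for d in (d1, d2, d3))
    return not (has_neg and has_pos)

def is_extreme(p, points):
    """p is extreme iff it can't be written as a convex mixture of the
    other points. In the plane, Caratheodory's theorem says checking
    triangles of other points suffices (assumes >= 3 other points)."""
    others = [q for q in points if q != p]
    return not any(in_triangle(p, a, b, c)
                   for a, b, c in combinations(others, 3))

# Corners of a square plus its center: only the corners are extreme,
# since the center is a mixture of the corners.
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
print([is_extreme(p, pts) for p in pts])  # [True, True, True, True, False]
```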

So this says that, for stubs, if a point M is an extreme point, minimal, and Nirvana-free, then it's the projection of a Nirvana-free point from a full policy above the stub.

In the Nirvana-free case, we get this for free from consistency. An extreme point can't be written as a mixture of other points, so it wasn't introduced when we took the convex hull of the union of projections (closure is more fiddly and we'll be ignoring it), so it had to come from the projection of a point from above.

*However*, in the Nirvana-containing case, this condition has two interpretations. The first is that shearing off all the a-measures with Nirvana occurring in them should get you a well-formed hypothesis. Our first condition, Nirvana-Free Nonemptiness, can be thought of as a condition in that vein, which ensures that shearing off the Nirvana doesn't leave you with an empty set. This condition is similar. Shearing off all the Nirvana had better preserve consistency! And this condition ensures exactly that.

Another characterization is that Nirvana cannot become inevitable without already happening. Minimal points are the essential elements of an inframeasure that can't be stripped away without affecting the expectation values, and extreme points are those that cannot be introduced by taking convex hull if they weren't already present. So, this condition says that if a point is minimal and extreme (can't be made by mixing and can't be removed without affecting anything else), then if Nirvana/infinite reward is guaranteed to have a shot at occurring in the future, then we might as well assume it has already occurred and said point can't be Nirvana-free.

**Hausdorff-Continuity**

Let's introduce some new concepts for this one.

**Definition 12: A-Measures on Nirvana-Free Histories**

Ma(∞)* is the set of a-measures on infinite o-histories. A finite history you can extend doesn't count. A finite history that ends with Nirvana doesn't count. An infinite history counts. This is all histories of this form, not just the ones compatible with some policy or other.*

**Definition 13: Clip Set**

*Given a belief function *Θ* that fulfills bounded-minimals, there are *λ⊙* and *b⊙* upper bounds on the *λ* and *b* values of minimal points. *{≤⊙}* is the set of a-measures *(λμ,b)* where *λ+b≤λ⊙+b⊙

**Definition 14: Preimage to Infinity**

(pr∞,πpa∗)−1* is the function that maps a set *B* in *Ma(FNF(πpa))* to the set of a-measures in *Ma(∞)* with a measure of 0 on histories which don't have a history in *FNF(πpa)* as a prefix (so they can project down onto *Ma(FNF(πpa))* safely), and which project down to *B*.*

This allows us to take the preimage of *all* our Θ(πpa)∩NF sets and have them all coexist in the same space, for ready comparison.

Now, let's define our next condition.

**Condition 9: Hausdorff Continuity:** πpa↦(pr∞,πpa∗)−1(Θ(πpa)∩NF∩{≤⊙}) is continuous.

"Wait a minute", you might ask, "this maps partial policies to *sets*, what notion of continuity are you using here??"

Well, given two sets B and B′ in a metric space, the Hausdorff-distance between these sets is the maximum distance you'd have to travel to get to a point in one set, if you started at an arbitrary point in the other set. Two sets being close in Hausdorff-distance means that every point from B is close to a point in B′, and vice-versa. From earlier, we've got a distance metric between partial policies. Further, because Π+ (the set of partial policies) is *compact*, said function isn't just continuous, but *uniformly* continuous.
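For concreteness, here's the Hausdorff distance computed on finite point sets (a generic sketch with Euclidean distance; the post's version is between sets of a-measures under the KR metric):

```python
import math

def hausdorff(B, Bp):
    """Hausdorff distance between two finite point sets: the farthest
    you might have to travel to reach the other set, starting from the
    worst-case point of either set."""
    def directed(X, Y):
        return max(min(math.dist(x, y) for y in Y) for x in X)
    return max(directed(B, Bp), directed(Bp, B))

# Two nearby squares: every point of one is close to a point of the other.
B  = [(0, 0), (1, 0), (0, 1), (1, 1)]
Bp = [(0.1, 0), (1.1, 0), (0.1, 1), (1.1, 1)]
print(hausdorff(B, Bp))  # ~0.1
```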

Why do we need Nirvana-freeness? Well, Ma(∞) doesn't include any Nirvana in it, so projecting it down can't make any a-measure that has Nirvana in it.

Why must we clip off part of our set? For full policies, (pr∞,π∗)−1(Θ(π)∩NF)=Θ(π)∩NF

Now, Θ(π)∩NF is upper-closed w.r.t FNF(π). So there are points in that set with a *gigantic* amount of measure on infinite histories compatible with π, and 0 everywhere else. Moving to a slightly different π′, its associated set has points with a gigantic amount of measure on infinite histories compatible with π′, and 0 everywhere else. Basically, the thing going wrong if we don't clip off the upper completion "tail" of our set is that the upper-completion tail is pointing in *slightly* different directions for different policies, so the Hausdorff-distance gets arbitrarily large in the upper-completion tails. So we need to clip off that part. This doesn't clip off any minimal points, which are the important ones and everything else is extraneous.

This condition may seem quite mysterious. Speaking roughly, it says that for a Nirvana-free a-measure M∈Θ(π)∩NF, if we make a perturbation of our policy at a very late time to get π′, there should be some nearby Nirvana-free a-measure M′∈Θ(π′)∩NF.

This is a continuity condition on Murphy. If your policy trembles a bit to some nearby policy, then Murphy should only have to change the environment a little bit to get something Nirvana-free for the new policy. Another thing that may make the situation clearer is providing an example of what this condition rules out.

Let's say you're doing iterated Newcomb. If Omega fills the box only if it predicts you 1-box on the next 100 rounds, that's permitted by this condition. Altering your policy on round 1 million only alters the environment past round 1 million-100, which isn't much of a change w.r.t the KR metric we're using (it's insensitive to changes at late times). Omega can even look at your *full* policy, as long as really distant changes in your policy only have a small influence on the probabilities of what happens now, that'd still fulfill this condition. What this *does* forbid is the variant of iterated Newcomb where Omega fills the box only if it predicts you *always* 1-box. Because an arbitrarily late policy perturbation leads to a big change in what happens now.
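The "insensitive to changes at late times" point can be made concrete with a toy discounted metric (my own stand-in for the KR metric, with an assumed discount factor): streams of outcomes that differ only after a late time t are close, while a difference at time 0 is big.

```python
# Toy discounted distance between outcome streams, standing in for the
# KR metric's insensitivity to late times. GAMMA < 1 is an assumed
# discount factor.
GAMMA = 0.9

def discounted_distance(stream1, stream2):
    """Sum of GAMMA^t over the times t where the streams disagree."""
    return sum(GAMMA ** t
               for t, (x, y) in enumerate(zip(stream1, stream2))
               if x != y)

base          = [0] * 100 + [1] * 10
flipped_late  = [0] * 100 + [0] * 10   # differs only from step 100 on
flipped_early = [1] + [0] * 99 + [1] * 10  # differs only at step 0

print(discounted_distance(base, flipped_late))   # tiny: ~0.9^100 scale
print(discounted_distance(base, flipped_early))  # 1.0
```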

So, despite looking hard to fulfill, this condition is an "acausal influences from very distant futures are weak" condition which is fairly natural. Another way of thinking about it is that you should be able to figure out what a policy-dependent environment is doing by just making bounded precommitments, instead of precommitments that extend for all time.

Why do we need this condition? Well, to prove that there's an optimal policy. Surprisingly enough, it's also essential for the proof of the consistency condition when we're trying to recover the whole Θ from a belief function defined only on stubs or full policies.

So, the Hausdorff-continuity condition is more natural than it seems, and essentially says that extremely late policy changes don't change your situation now very much.

**Causal, Pseudocausal, Acausal**

But there's two more conditions, to define special sorts of hypotheses! Causal, Pseudocausal, and Acausal respectively. Causal hypotheses can have Nirvana in them, Pseudocausal and Acausal must be Nirvana-free. Acausal doesn't have any extra conditions on it besides the previous nine and being Nirvana-free, but Causal and Pseudocausal do need one extra condition each.

**Definition 15: Outcome Function**

*An outcome function *of* is a function mapping each *πpa* to a point in *Ma(F(πpa))*, such that:*

∀πlopa,πhipa≥πlopa:prπhipa,πlopa∗(of(πhipa))=of(πlopa)

**Condition C: Causality**

∀πpa,M∈Θ(πpa)∃of:of(πpa)=M∧(∀π′pa:of(π′pa)∈Θ(π′pa))

The intuition for this is that, if you fixed an a-environment and queried it with all possible partial policies, you'd produce an outcome function. Similarly, an outcome function uniquely picks out an a-environment because you know what occurs in all possible situations and all the projections commute. So, the causality condition indirectly says "every point in all of your belief function sets arose from querying an a-environment". This lets us easily go from the belief function to a set of a-environments, and similarly, querying a set of a-environments with partial policies will make a causal hypothesis.

**Condition P: Pseudocausality**

∀πpa,π′pa:((m,b)∈Θ(πpa)∧supp(m)⊆FNF(π′pa))→(m,b)∈Θ(π′pa)

This is the Nirvana-free analogue of causality. Let's say we've got some a-environment that we query with πpa, and the resulting distribution is Nirvana-free, and supported over FNF(π′pa). Now, if we query said a-environment with π′pa instead... we don't have to worry about introducing Nirvana! Yup, π′pa behaves a bit differently. But, it only behaves differently than πpa on histories that have 0 probability of happening, so it should induce the exact same a-measure. Causality implies this condition, but this is weaker than causality. This is the property you'd get if you were taking a set of a-environments and querying it and reporting only the Nirvana-free a-measures. The special property of this is that, if pseudocausality is fulfilled, we can add Nirvana in such a way to craft a causal hypothesis (which is linked to actual sets of a-environments).

Recap! Ok, so our conditions are:

**1: Nirvana-free Nonemptiness**

This is easy, we shouldn't have the agent chasing after infinite reward.

**2,3,4: Closure, Convexity, Nirvana-Free Upper Completion**

These collapse equivalence classes of a-measures that have the exact same behavior, and are parallel to the conditions for inframeasures.

**5: Minimal-Boundedness**

Again, this is parallel to the condition for bounded inframeasures, just adding in a b bound as well. It makes proofs much nicer.

**6: Normalization**

We impose a sort of collective normalization for the belief function as a whole, parallel to the corresponding condition for an infradistribution. This is not preserved when we mix belief functions.

**7: Consistency**

You should be able to recover the finite behavior of Θ from the infinite behavior of Θ, or vice-versa. Projecting big partial policies down to low levels makes subsets, the finite levels aren't clogged with extraneous crap, the infinite levels have all the points that match up with finite behavior.

**8: Extreme Point Condition**

Shearing the Nirvana off a consistent Nirvana-containing hypothesis should produce a consistent hypothesis.

**9: Hausdorff-Continuity**

Policy changes at extremely late times don't have much effect on what happens now.

**C: Causality**

Your belief function is induced by querying a set of a-environments.

**P: Pseudocausality**

Your belief function is induced by querying a set of a-environments and only recording the Nirvana-free results.

**Definition 16: Causal/Pseudocausal/Acausal/Surcausal Hypothesis**

*A hypothesis is a belief function *Θ* fulfilling the 9 conditions. A causal hypothesis may include Nirvana and fulfills Causality. A pseudocausal hypothesis is Nirvana-free and fulfills Pseudocausality. An acausal hypothesis is Nirvana-free. A surcausal hypothesis is a causal hypothesis over a-surmeasures (to be defined)*

What's the essential difference between pseudocausal and acausal hypotheses? Well, it's a natural dividing line between two importantly distinct classes of decision-theory problems. In the former, no matter how you're mispredicted, you must have some chance of actually entering the situation where you're mispredicted. In the other class, you could be mispredicted and then locked out so you can't show the misprediction is wrong. More on this in the next section.

As we'll show later, you can turn any pseudocausal hypothesis into an equivalent causal one by adding Nirvana appropriately, and turn any causal hypothesis into an actual set of a-environments. But we have a hard time viewing acausal hypotheses as coming from a set of a-environments, at least without further work. So, the next section will be about poking at a toy case where we have an acausal hypothesis we can't turn into a causal one, and seeing how we have to adapt things to view an agent as believing that it's in some set of environments, instead of the agent working purely with belief functions that don't have a representation as a set of environments.

**Acausal Hypotheses and Surmeasures**

Let's take the example of Transparent Newcomb, where Omega fills the transparent box with a hundred dollars if it predicts you'd 1-box upon seeing the transparent box filled, and has ϵ probability of making a mistake (the other box contains 1 dollar). This is a policy-dependent environment, which depends on your policy, as well as your action and the past history. The way we can encode this as a set of environments is by hardcoding all possible policies into the policy slot to make a set of environments, and send you to Nirvana if you ever act contrary to the hardcoded policy. This forces Murphy to make the hardcoded policy match your own, lest you go to Nirvana. The ϵ chance of error guarantees that you have a shot at Nirvana if the hardcoded policy doesn't match your own. Translating this set of environments to a Nirvana-free belief function yields a pseudocausal hypothesis. XOR blackmail and Counterfactual Mugging are also pseudocausal hypotheses.

But what about Transparent Newcomb with a *perfect* predictor? In that case, we have a very serious problem with our standard way of viewing a policy-dependent environment as a set of environments via hardcoding in all policies and sending you to Nirvana if you violate the hardcoded policy. Maybe you 1-box upon seeing a full box, 2-box upon seeing it empty, but the hardcoded policy in the environment is "you 2-box in both cases". You're stuck 2-boxing and getting 1 dollar, you can never get into a situation where you get the Nirvana reward from calling out Omega on its misprediction. In fact, both "1-box upon full, 2-box upon empty" and "always 2-box" are optimal policies here, so this encoding of a policy-dependent environment doesn't let us get UDT behavior. If we don't want to permanently give up on viewing belief functions as sets of a-environments, how can we go forward?

Well, there's two ways to deal with this. One is encoding ϵ-exploration appropriately, the other doesn't require ϵ-exploration and is more interesting.

For the first one, assume a situation where you have an "intended policy" (deterministic), and you have an internal sense of what you intended to do, but maybe exploration overwrites that. Omega has a hard-coded prediction of your policy, and an exploration coin is flipped on each round. If the coin says "act as normal", things proceed as normal, if the coin says "explore", then Omega does the opposite of what it would normally do re: box-filling or not since it knows you'll do the opposite of what you'd normally do. Then, if the action you intended upon seeing the box's condition doesn't match up with Omega's prediction of what you intended, you go to Nirvana. In either case, the real action is determined by the exploration coin and what you intended to do. If we try going "Omega's prediction is that you intend to 2-box regardless of whether the box is full or empty", and our true intent is 1-boxing if full, 2-boxing if empty, and the coin says to explore, then Omega goes "they do the opposite of what they'd normally do, better fill the box". Then, as you reach for both boxes (because of exploration), the mismatch between your intent of 1-boxing and Omega's prediction that you intended to 2-box manifests and you attain Nirvana in the exploration step. This particular way of setting up things guarantees some probability of getting into a situation where you prove Omega wrong if there's a mismatch between your policy and the hardcoded policy, so we can turn it into a set of environments. But now, let's look at a different way of resolving the problem.

What happens if we just shove Perfect Transparent Newcomb into the form of a suitable Nirvana-free Θ? Neglecting some conditions like upper-completion, Θ(π1,1) and Θ(π1,2) both have the single point where the box is guaranteed-full and you 1-box for a hundred bucks. Θ(π2,1) has the single point where the box is guaranteed-empty and you 1-box for nothing. Θ(π2,2) has the single point where the box is guaranteed-empty and you 2-box for 1 dollar.

But this violates the pseudocausality condition. The outcome "the box is empty, and you 2-box" is in Θ(π2,2), and supported over FNF(π1,2). However, this distribution isn't present in Θ(π1,2). The pseudocausality condition wants us to add the bad outcome back in!
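To make the violation mechanical, here's a direct check of condition P on this toy encoding (my own finite representation, not from the post: a policy is a pair of actions for the full/empty observations, and a distribution is a frozenset of (history, probability) pairs):

```python
def compatible_histories(policy):
    """F^NF(pi) for deterministic Transparent Newcomb: the two
    (observation, action) histories the policy can produce."""
    a_full, a_empty = policy
    return {("full", a_full), ("empty", a_empty)}

def pseudocausal(theta):
    """Condition P: any distribution m in Theta(pi) whose support fits
    inside F^NF(pi') must also appear in Theta(pi')."""
    for dists in theta.values():
        for pi2, dists2 in theta.items():
            for m in dists:
                support = {h for h, p in m if p > 0}
                if support <= compatible_histories(pi2) and m not in dists2:
                    return False
    return True

# Perfect Transparent Newcomb, neglecting upper completion etc.
# Keys are policies (action on full box, action on empty box).
theta = {
    ("1", "1"): {frozenset({(("full", "1"), 1.0)})},
    ("1", "2"): {frozenset({(("full", "1"), 1.0)})},
    ("2", "1"): {frozenset({(("empty", "1"), 1.0)})},
    ("2", "2"): {frozenset({(("empty", "2"), 1.0)})},
}
# False: e.g. the (2,2) outcome is supported inside F^NF(pi_{1,2})
# but absent from Theta(pi_{1,2}).
print(pseudocausal(theta))
```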

One more stab. Let's look at the non-closed set of environments (with Nirvana) corresponding to ϵ-noise Transparent Newcomb, where 0.01≥ϵ>0. The value 0.01 isn't that important, the key part is just that we are in a situation where there's some noise in the prediction and we know it's small but we don't know how small. We can try taking the closure either before or after we turn it into a Θ, and see what happens.

Taking the closure in the space of a-environments, and turning it into a Θ ends up adding in the bad distribution where it's assumed we 2-box, and sadly, this is Nirvana-free for a 1-boxing strategy. So we get the same failure mode as before.

Something rather interesting happens if we take the closure of the Θ(π) sets after translating this problem into a Nirvana-free belief function though. For Θ(π1,2), we've got a bunch of distributions going "1−ϵ-probability on a full box and we 1-box, ϵ-probability on an empty box and we 2-box". Taking the closure won't add the bad distribution, it'll only add in the desired Perfect Transparent Newcomb distribution where it's full and you 1-box, guaranteed. But we still don't have pseudocausality.

This mismatch points towards needing a condition that's something like "you can't add in a Nirvana-free a-measure to Θ(π) if it's a limit of Nirvana-containing a-measures, you can only add it in if it's a limit of Nirvana-free a-measures". Murphy can't take *arbitrary* limits to make an inconvenient a-measure for you, they can only take limits of things that don't give you a shot at infinite value.

Our fundamental problem is that the expected value of 0.01≥ϵ>0 Newcomb, if Omega falsely has "2-box on full-box" locked in as the strategy, and we 1-box in that situation instead, goes "infinity, infinity, infinity... 1 dollar" as ϵ shrinks to 0. Limits can lead outside the set of Nirvana-containing a-measures.

What would happen if we decreed that the set of Nirvana-containing a-measures was closed? The limit point of our sequence should still have infinite value, so it should be something like "probability 1 of an empty box and we 2-box, probability 0+ of a full box and we 1-box leading to Nirvana, of infinite value", where 0+ is interpreted as an infinitesimal quantity. Murphy still avoids this, because Nirvana can't *truly* be ruled out. Or, maybe we could slap some distance metric on a-measures that goes "this one has some probability of Nirvana here, this one doesn't, so they don't get arbitrarily close".

**Definition 17: Surmeasure**

*A surmeasure is a pair of a measure and a function mapping Nirvana-histories to "possible" or "impossible", where Nirvana-histories with positive measure must be labeled "possible". Possible Nirvana-histories with 0 measure are assigned a value of *0+*. A-surmeasures and sa-surmeasures are defined in the obvious way from this, and denoted by *SM*.*

And a survironment is like an environment, except it can have 0+ chance of getting "impossible" observations and from there you're guaranteed to get Nirvana at some point.

**Definition 18: Surtopology**

*The surtopology is the topology over the space of (s)a-(sur)measures with a subbasis of the usual open balls around points, and the sets *{(m,b)|m(h)=0}* where *h* ranges over the histories that end with Nirvana.*

**Definition 19: Surmetric**

*The surmetric is the metric over the space of (s)a-(sur)measures defined as:*

ds((m1,b1),(m2,b2))=max(d(m1,m2)+|b1−b2|,γt(m1,m2))

*where *d* is the usual KR-metric, *γ<1*, and *t* is the minimum length of a Nirvana-containing history that is possible/positive measure according to *m1* and impossible/0 measure according to *m2* (or vice-versa)*
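A minimal sketch of the surmetric on finite-support measures (my own toy encoding: a measure is a dict from history strings to mass, with 'N' marking Nirvana at the end of a history; total variation stands in for the KR metric, and GAMMA is an assumed discount):

```python
GAMMA = 0.5  # assumed gamma < 1

def surmetric(m1, b1, m2, b2):
    """max of the base distance and GAMMA^t, where t is the length of
    the shortest Nirvana history that's possible (positive measure) in
    one measure and impossible (0 measure) in the other."""
    base = sum(abs(m1.get(h, 0.0) - m2.get(h, 0.0))
               for h in set(m1) | set(m2)) + abs(b1 - b2)
    mismatched = [len(h) for h in set(m1) | set(m2)
                  if h.endswith("N") and (m1.get(h, 0) > 0) != (m2.get(h, 0) > 0)]
    if not mismatched:
        return base
    return max(base, GAMMA ** min(mismatched))

# A sequence losing its Nirvana-mass converges in the ordinary metric,
# but in the surmetric it stays bounded away from the Nirvana-free
# limit: every term prints GAMMA^2 = 0.25 here.
m_lim = {"b": 1.0}
for eps in (0.1, 0.01, 0.001):
    m_eps = {"aN": eps, "b": 1 - eps}
    print(surmetric(m_eps, 0.0, m_lim, 0.0))
```

This is exactly the "points with any shot at Nirvana don't get arbitrarily close to points with none" behavior that motivates the definition.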

Murphy will avoid all points with *any* chance of Nirvana. Remember, we only need to deal with these shenanigans if our decision-theory problem has the environment *completely* lock us out of being able to disprove a false prediction. Or, if you really want to deal with these decision-theoretic cases and don't want to deal with surmeasures, you could just stick with ordinary acausal hypotheses and avoid attempting to cram them into a causal hypothesis/set of environments.

Interestingly enough, these three ways of dodging the problem are equivalent!

**Theorem 1:** *The surmetric on the space of sa-measures *Msa(F(πpa))* induces the surtopology. The Cauchy completion of *Msa(F(πpa))* w.r.t the surmetric is exactly the space of sa-surmeasures.*

If you didn't quite follow the previous discussion, the tl;dr version of it is that the standard way of getting a set of environments from a set of policy-dependent environments, where you hardcode in all possible policies and you go to Nirvana if you violate the hardcoded policy, only works when you're guaranteed to have some chance at Nirvana if you're mispredicted. Some decision theory problems with perfect predictors violate this property, but in that case, we can still make it work by going "oh, the probability of getting into a situation where you're mispredicted isn't 0, it's just arbitrarily small". Letting in outcomes that are possible but have arbitrarily low measure, gets us survironments and surmeasures, so we can view all policy-dependent decision theory problems this way, as arising from a set of survironments (with that 0+ chance of some outcomes), instead of just a vanilla set of environments.

**Hypothesis Translation Theorems**

Alright, now that we've got all that stuff out of the way, we can start presenting a bunch of theorems!

Let ↑ be the function mapping a Θst defined only over policy-stubs to a Θ over all partial policies, via ↑(Θst)(πpa):=⋂πst≤πpa(prπpa,πst∗)−1(Θst(πst))

And let ↓ be the function mapping a Θω defined only over full policies to a Θ over all partial policies, via ↓(Θω)(πpa):=¯¯¯¯¯¯¯¯c.h(⋃π≥πpaprπ,πpa∗(Θω(π)))

Let →st and →ω be the functions mapping a Θ to the same Θ but only defined over policy stubs/full policies respectively.

**Theorem 2: Isomorphism Theorem:** *For (causal, pseudocausal, acausal, surcausal) *Θst* or *Θω* which fulfill finitary or infinitary analogues of all the defining conditions, *↑(Θst)* and *↓(Θω)* are (causal, pseudocausal, acausal, surcausal) hypotheses. Also, *↑* and *→st* define an isomorphism between *Θ* and *Θst*, and *↓* and *→ω* define an isomorphism between *Θ* and *Θω*.*

This says we can uniquely recover the whole Θ from just behavior on policy-stubs, or full policies, as long as analogues of the hypothesis conditions hold for the stubs or full policies, and the full policy behavior and policy-stub behavior will uniquely pin each other down. The interesting part of this is recovering consistency (the key element of the isomorphism) from conditions which aren't consistency.

Now for the next one. Let SΘ be the set of a-environments (or a-survironments) (λe,b) s.t. for all πpa, (λ(πpa⋅e),b)∈Θ(πpa). Let ΘS, where S is an arbitrary set of a-environments, be the belief function where ΘS(πpa):={(λμ,b)|∃(λe,b)∈S:πpa⋅e=μ}

πpa⋅e is the probability distribution over F(πpa) produced by πpa interacting with the environment e.

**Proposition 1:** *If *Θ* fulfills the causality condition, nonemptiness, closure, and convexity, then *SΘ* is a nonempty, closed, convex set of a-environments or a-survironments and *ΘSΘ=Θ*. Also, *S⊆SΘS*.*

So, we can go from causal hypotheses to actual sets of a-environments (or a-survironments, as the case may be). If we're fortunate enough to have our set of a-environments induce a causal hypothesis (the hard part isn't causality, you get that for free, the hard part is stuff like Hausdorff-continuity and the extreme point property), then going back to environments may introduce additional points corresponding to the long-ago mentioned "chameleon environments" which mimic the behavior of preexisting environments for every policy. Since SΘ contains *every* such mimic, going from a belief function to a set of a-environments yields a canonical choice, removing the redundancy in sets of environments we mentioned back in the first post.

So, we can freely switch between finitary and infinitary versions of causal, pseudocausal, acausal, and surcausal hypotheses, they're isomorphic. And we can translate between causal/surcausal hypotheses and actual sets of environments. What about linking pseudocausal/acausal to causal/surcausal by strategic addition of Nirvana? Can we do that? Yes.

**Definition 20: Nirvana Injection**

*If *πhipa≥πlopa*, then *Iπlopa,πhipa* is the function *F(πlopa)→F(πhipa)* where*

h∈F(πhipa)→Iπlopa,πhipa(h)=h

h∉F(πhipa)→Iπlopa,πhipa(h)=hπhipa(h)N

Iπlopa,πhipa∗:Ma(F(πlopa))→Ma(F(πhipa))* is the pushforward induced by *Iπlopa,πhipa*.*

**Definition 21: Nirvana Injection (Surcausal)**

Iπlopa,πhipa∗s* is the same as *Iπlopa,πhipa∗*, except it labels every Nirvana-history not in *F(πlopa)* as possible.*

Note that these injections go up, not down, as projections do. I∗ is a sort of Nirvana-adding injection up where we cap off every history we need to extend with "Nirvana happens immediately afterwards, guaranteed". I∗s is defined identically, except that if a Nirvana-history would normally get 0 measure, it's assigned 0+ measure instead. It's more aggressive about adding Nirvana everywhere it can.

→c is the function mapping a pseudocausal Θst only defined on policy stubs, to a causal Θst only defined on policy stubs, via: →c(Θst)(πst):=¯¯¯¯¯¯¯¯c.h(⋃πlost≤πstIπlost,πst∗(Θst(πlost)))

And →sc is defined identically, just with I∗s, it maps acausal to surcausal.

→NF is just "intersect with NF", it turns a causal/surcausal Θst to a pseudocausal/acausal Θst.

Now that we have the machinery to go from pseudocausal/acausal to causal/surcausal, and back, we have:

**Theorem 3.1: Pseudocausal Translation:** *For all pseudocausal *Θst* hypotheses defined only on policy stubs, *→c(Θst)* is a causal hypothesis only defined on policy stubs. And,*→NF(→c(Θst))=Θst*. For all causal *Θst* hypotheses defined only on policy stubs, *→NF(Θst)* is a pseudocausal hypothesis only defined on policy stubs.*

**Theorem 3.2: Acausal Translation: ***For all acausal *Θst* hypotheses defined only on policy stubs, *→sc(Θst)* is a surcausal hypothesis only defined on policy stubs. And, *→NF(→sc(Θst))=Θst*. For all surcausal *Θst* hypotheses defined only on policy stubs, *→NF(Θst)* is an acausal hypothesis only defined on policy stubs.*

These two are quite powerful. First, they let us freely translate between pseudocausal/acausal and causal/surcausal. Second, there's the philosophical meaning.

If you think that Nirvana looks like an ugly hack, but aren't attached to viewing everything as a set of a-environments, then you can just go "Awesome, we can just strip off all the Nirvana and work with Nirvana-free belief functions directly!"

If you think Nirvana looks like a sensible way of making Murphy not pick certain outcomes to encode policy selection problems, but really want to view everything as coming from a set of a-environments, then you can go "ooh, by adding in Nirvana, we can get all these wonky belief functions into a causal form, and from there to a set of a-environments/a-survironments!"

I should also remark that in the "Nirvana is 1 reward forever" setting, this turns into an isomorphism and we get a parallel with the original formulation of an infradistribution where all possible points have been added to the set as long as they don't affect expected values.

Ok, so now that we can freely translate things... Well, there were an awful lot of conditions for something to be a hypothesis/belief function. Let's say our Θ is some random-ass batch of a-measures for each policy. Which conditions do we *need* to check in order to ensure that this is legitimate, and which hypothesis conditions can be massaged into existence if they weren't present already?

**Proposition 2:** *Given a Nirvana-free *Θ?ω* which is defined over full policies but not necessarily a hypothesis (may violate some conditions), the minimal constraints we must check of *Θ?ω* to turn it into an acausal hypothesis via convex hull, closure, upper-completion, renormalization, and using the Isomorphism Theorem to extend to all partial policies are: Nonemptiness, Restricted Minimals, Hausdorff-Continuity, and non-failure of renormalization.*

**Proposition 3:** *Given a Nirvana-free *Θ?ω* which can be turned into an acausal hypothesis, turning it into a *Θ* has: *EΘ(π)(f)=α(EΘ?ω(π)(f)−β)* for all *π* and *f*.*

So, when we turn an arbitrary batch of points for each policy into an acausal belief, it leaves the worst-case values the same, modulo a scale α and shift β. The process is basically "take closed convex hull, take upper completion, renormalize, generate the finite levels from the infinite ones by isomorphism"

Taking a break to recap, we can uniquely specify a hypothesis by only presenting Θ for policy-stubs, or full policies, and it works with any of our hypothesis variants. Causal and surcausal hypotheses let you craft an actual set of a-environments/a-survironments that they came from. We can freely translate pseudocausal/acausal hypotheses to causal/surcausal hypotheses, showing that either Nirvana or a-environments can be dropped, depending on philosophical taste. And finally, you only need to verify four conditions to craft an acausal hypothesis from some completely arbitrary Θ?ω, and the behavior matches up modulo a scale-and-shift.

Hm, what else might we want? Probably some result that says we got everything. Fortunately, there is:

**Proposition 4:** *For all hypotheses *Θ* and *Θ′

(∀π,f:EΘ(π)(f)=EΘ′(π)(f))↔(→NF(Θ)=→NF(Θ′))

In other words, if, regardless of the utility function we picked, Murphy can force the same minimum value for both belief functions, then when we slice away the Nirvana, they're identical, full stop.

Er... we wanted to do maximin policy selection, and since we're working with deterministic policies, it's unclear that an argmax exists, ie, that there are even maximizing policies at all. Fortunately, there are!

**Proposition 5:** *For all hypotheses *Θ*, and all continuous functions *g* from policies to functions *f∈C((A×O)ω,[0,1])*, then the set *argmaxπEΘ(π)(g(π))* exists and is closed.*

Of course, we only really need one function, our utility function of choice. I figured it was worthwhile to show it in more generality, maybe we'll need it one day for learning utility functions over time.

**Mixing and Updates**

"But wait", I hear you ask, "What about updates or mixing belief functions to make a prior? We didn't cover those at all!"

Well, for mixing, we've got three issues. We can easily mix belief functions for Nirvana-free Θω, by normal set mixing. (EζΘi)(π):=Eζ(Θi(π))
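As a concrete sketch of this set mixing, here's a toy model with finite sets of a-measures over a finite history space (all names and numbers hypothetical): the mixture set contains every point Σiζi(mi,bi) with one a-measure drawn from each component set, matching (EζΘi)(π):=Eζ(Θi(π)).

```python
from itertools import product

def mix_sets(component_sets, weights):
    """Set mixing: the zeta-weighted mixture contains every point
    sum_i zeta_i * (m_i, b_i), with one a-measure (m_i, b_i) drawn from
    each component set.  Measures are dicts over a finite history space."""
    mixed = set()
    for combo in product(*component_sets):
        m, b = {}, 0.0
        for w, (mi, bi) in zip(weights, combo):
            for hist, mass in mi.items():
                m[hist] = m.get(hist, 0.0) + w * mass
            b += w * bi
        mixed.add((frozenset(m.items()), b))
    return mixed

# Two hypothetical one-policy belief sets over histories "h0", "h1":
theta1 = [({"h0": 1.0}, 0.0), ({"h1": 1.0}, 0.1)]
theta2 = [({"h0": 0.5, "h1": 0.5}, 0.0)]
mix = mix_sets([theta1, theta2], [0.5, 0.5])
```

Note this is mixing of the raw sets only; closed convex hull and renormalization (the (EζΘi)R step below) are elided.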

Our first issue is that mixing on the full-policy level, and regenerating the lower levels via the Isomorphism theorem, is *not* the same as mixing the sets at the lower levels. If you really care about what happens at lower levels, it's (EζΘi)(πpa)=¯¯¯¯¯¯¯¯c.h(⋃π≥πpa(Eζ(prπ,πpa∗(Θi(π)))))

Our second issue is that mixing normalized belief functions may fail to be normalized. We'll use EζΘi for the raw mix, and (EζΘi)R for the renormalized mix.

And our third issue is that mixing breaks a condition in the causal case. So, if you want to mix causal hypotheses to make a prior, you should strip off the Nirvana, translate to a pseudocausal hypothesis, mix there, and then go back via Pseudocausal Translation. This is another issue that we expect to vanish if Nirvana just counts as 1 reward forever.

We do need the usual ∑iζiλ⊙i<∞ condition for our mixture of belief functions to be a belief function. Finally, to state the next theorem, we need to define what a nontrivial Θ is.

**Definition 22: Nontriviality**

*A hypothesis *Θ* is nontrivial iff there exists some policy *π* where *EΘ(π)(1)≠EΘ(π)(0)*. This is a very weak condition.*

Nontriviality of some Θi (a component hypothesis that you're mixing together to make a prior) is a sufficient (though not necessary) condition for the prior you make by mixing together the Θi to be able to be turned into a hypothesis by renormalizing.

**Proposition 6:** *For pseudocausal and acausal hypotheses *Θi* where *∑iζiλ⊙i<∞* and there exists a nontrivial *Θi*, then mixing them and renormalizing produces a pseudocausal or acausal hypothesis.*

**Proposition 7:** *For pseudocausal and acausal hypotheses, *E(EζΘi)(πpa)(f)=Eζ(EΘi(πpa)(f))

**Proposition 8:** *For pseudocausal and acausal hypotheses,*

prπhipa,πlopa∗((EζΘi)(πhipa))=Eζ(prπhipa,πlopa∗(Θi(πhipa)))

Ok, so mixing works sensibly. What about updates? Updates are somewhat complicated, because we have to slice away part of the policy tree, and incorporate a partial policy off-history into our update (again, for dynamic consistency reasons, we want to keep track of what's happening off-history, and we need to know our off-history policy in order to do that).

Also, there's renormalization to consider, and Nirvana is a bit of an issue. If Nirvana happens off-h, and we update, then that would fold a slice of infinite value into the b term, which can't happen. Our first order of business is taking care of the bookkeeping for getting rid of the h prefix.

**Definition 23: Partial Policies After **h

Π+|h* is the set of partial policies starting at *h*. For a policy in this set, you can translate to *Π+* by adding a prefix of *h* onto all the o-histories and specifying that the partial policy behaves appropriately to make the finite history *h* occur. In the reverse direction, you can take a partial policy capable of producing *h*, remove all o-histories without *h* as a prefix, and remove the *h* prefix from whatever is left.*

**Definition 24: History-Clip**

c* is the partial function that takes an o-history and clips the *h* prefix off, and is otherwise undefined. There's an implicit dependence on the history *h* that we're suppressing in the notation.*

**Definition 25: Off-History Policy**

π¬h* is some partial policy in *Π+* that's defined everywhere except on o-histories with *h* as a prefix, and is also consistent with the history *h*. It specifies everything except what happens after *h*, and can produce *h*.*

**Definition 26: Policy Gluing**

π¬h∙πpa* is the partial policy made by extending *πpa∈Π+|h* to be a partial policy in *Π+* and unioning that with *π¬h*. More simply, it's the policy made by "specify what you do off-h, and glue that onto what you do on-h".*

Also, now that we know how normalization works, we can define an analogue of our PgH(L) scale term that showed up in updating an infradistribution.

**Definition 27: Probability**

PgΘ,π¬h(h)* is the probability of history *h* relative to the belief function *Θ*, off-h policy *π¬h*, and function *g*.*

PgΘ,π¬h(h):=maxπ>π¬hEΘ(π)(1★hg)−EΘ(π¬h)(0★hg)

An important thing to note about these is that they're not additive, and it's probably better to think about them as the scale term necessary to renormalize after updating than as a true probability. However, various results end up using this where the analogous result in the standard Bayesian setting would have standard probability, so it plays a very similar role. And it recovers the usual notion of probability in the special case of one environment.

The rationale for this definition is that we want the worst possible case to have a value of 0, and the best possible case to have a value of 1. This requires rescaling by the gap between the worst-case and best-case outcomes. Further, we know some information about our policy (it behaves as π¬h does off-h, so the policy we selected must be above that), and our utility function off-history (it's g). Thus, the best-case is maxπ>π¬hEΘ(π)(1★hg), and the worst-case is minπ>π¬hEΘ(π)(0★hg). Further, we can prove minπ>π¬hEΘ(π)(0★hg)=EΘ(π¬h)(0★hg).

So, that's where this definition comes from. The gap between best-case and worst-case payoffs plays the analogous role to probability, and knowing π¬h and g gives you some info on what your best-case and worst-case outcomes are.
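The single-environment special case mentioned above can be sketched directly: with one environment, the off-history g-terms in the best-case and worst-case expectations coincide and cancel, leaving ordinary probability. A minimal toy version (history names hypothetical):

```python
def history_probability(env, h):
    """With a single environment, the best-case/worst-case gap
    E(1*^h g) - E(0*^h g) collapses: the off-history g-terms are the
    same in both expectations and cancel, leaving the ordinary
    probability mass on histories with h as a prefix."""
    return sum(p for hist, p in env.items() if hist.startswith(h))

# Hypothetical environment as a distribution over finite histories.
env = {"ha": 0.25, "hb": 0.25, "xc": 0.5}
p_h = history_probability(env, "h")
```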

One more note. We'll use m|h to refer to the measure m⋅1h, chopping down the measure to only the histories with h as a prefix. No renormalization back up afterwards.

*Finally*, we can define an update.

**Definition 28: Belief Function Update**

(Θ|g,π¬h,h)(πpa)* is the set made by intersecting *Θ(π¬h∙πpa)* with*

{(m,b)|m(nirvana occurs without h as a prefix)=0}

*pushing it through the following function, and taking the closure.*

(m,b)↦1PgΘ,π¬h(h)(c(m|h),b+m(0★hg)−EΘ(π¬h)(0★hg))

This process is: In order to find out what happens for a partial policy πpa post-h, we glue it to a complete specification of what happens off-h π¬h, clip away a bit of Θ(π¬h∙πpa) so that infinite reward off-h doesn't contaminate the b term, do some bookkeeping to prune the h prefixes off the post-h conditional measure we get, update, and renormalize. Then close if needed.
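Here's a minimal sketch of the pointwise map in Definition 28, for a-measures over a finite set of history strings. The Nirvana-clipping intersection step is elided, and all names and numbers are hypothetical:

```python
def update_point(m, b, h, g, worst_case, P):
    """Push one a-measure (m, b) through the map in Definition 28:
    keep the h-prefixed part of m with the prefix clipped off, fold the
    off-h contribution m(0*^h g) into b, subtract the worst-case term
    E_Theta(pi_{not-h})(0*^h g), and rescale by the 'probability' P."""
    m_h = {hist[len(h):]: mass / P
           for hist, mass in m.items() if hist.startswith(h)}
    off_h = sum(mass * g(hist)
                for hist, mass in m.items() if not hist.startswith(h))
    return m_h, (b + off_h - worst_case) / P

g = lambda hist: 0.5               # hypothetical off-history utility
new_m, new_b = update_point({"ha": 0.5, "xb": 0.5}, 0.0, "h",
                            g, worst_case=0.25, P=0.5)
```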

With all this extra detail, it's highly prudent to check the basic properties of an update. Fortunately, we have analogues of everything.

**Proposition 9:** *For causal, surcausal, pseudocausal and acausal hypotheses, updating them produces a causal, surcausal, pseudocausal or acausal hypothesis as long as renormalization doesn't fail.*

**Proposition 10:** *For causal, pseudocausal, acausal, and surcausal hypotheses,*

EΘ(π¬h∙πpa)(f★hg)=EΘ(π¬h)(0★hg)+PgΘ,π¬h(h)⋅E(Θ|g,π¬h,h)(πpa)(fh)

If you're wondering what fh is, it's the restriction of f to outcomes with h as a prefix, and then the h prefix gets clipped off. This is another bookkeeping thing.

**Proposition 11:** *If *hh′* is a valid o-history, then for causal, pseudocausal, acausal, and surcausal hypotheses, *(Θ|g,π¬h,h)|gh,π¬h′,h′=Θ|g,(π¬h∙π¬h′),hh′

Ah good, we have analogues of all the usual update properties. What about our Bayes Theorem analogue?

**Theorem 4: Belief Function Bayes:** *For pseudocausal and acausal hypotheses, if there's some *i* s.t. *Θi|g,π¬h,h* is well-defined and nontrivial, then*

(Eζ(Θi))R|g,π¬h,h=(Eζ(PgΘi,π¬h(h)⋅(Θi|g,π¬h,h))/Eζ(PgΘi,π¬h(h)))R

So, mixing hypotheses to make a prior, normalizing it, and updating, is the same as mixing the updated hypotheses in accordance with the probability they put on the observation, and rescaling that. Compare to Bayesian updating going "if you mix hypotheses to make a prior, and update, that's the same as mixing the updated hypotheses in accordance with the probability they put on the observation". Also, this means we don't generally have to worry about the agent giving up and crying when something in its prior says "nothing you do matters", because it'll just drop out of the updated prior completely and we can ignore it. So, as long as something in the prior says that what the agent does matters, we're good.
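The re-weighting structure here parallels ordinary Bayes. A toy sketch of just the weight bookkeeping (hypothetical numbers), showing how a component with "probability" 0 on the observation drops out of the updated prior:

```python
def updated_prior_weights(zeta, probs):
    """Re-weight mixture components by the 'probability' each one assigns
    to the observed history, then renormalize so weights sum to 1.
    Components assigning probability 0 vanish from the updated prior."""
    raw = [z * p for z, p in zip(zeta, probs)]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical prior over three component hypotheses, and the
# 'probability' each assigns to the observed history h:
w = updated_prior_weights([0.25, 0.25, 0.5], [0.5, 0.0, 0.25])
```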

**Decision Theory**

Ok, the first, most basic thing we want is dynamic consistency. Does dynamic consistency work? Yes it does!

**Theorem 5: Dynamic Consistency:** *Given a hypothesis *Θ* (causal, pseudocausal, acausal, surcausal), and an arbitrary policy *π* and utility function *U*, then, with *πh* being the continuation of *π* post-update, *π¬h* being the off-h behavior of *π*, and *π∗* being such that *

E(Θ|U,π¬h,h)(πh)(Uh)⪋E(Θ|U,π¬h,h)(π∗)(Uh)

*then *EΘ(π)(U)⪋EΘ(π¬h∙π∗)(U)

Ok, ⪋ may be confusing. This is shorthand for "if we have a < in the first part, then we have a < in the second part, and likewise for = and >".

So, viewed from the start, you perfectly agree with future-you-that-updated-on-h's judgement. If they think a policy does better than the default candidate you stuck them with, you should defer to them. If they think a policy does worse than the default you stuck them with, you also agree with them on that too. So, there's no situations where you disapprove of future-you's decisions when they update on a history.

Ok, that's cool, but how does this thing stack up against UDT? Well, let's take an arbitrary set S of policy-selection environments. A policy selection environment is a function Π×(A×O)<ω×A→ΔO that is continuous in Π, so very late policy changes don't affect the probability distribution now very much. This doesn't necessarily assume you're predicted accurately! If the environment ignores your policy, it's just an ordinary environment, and if there's a noisy predictor, well, that's why we have ΔO. This covers any reasonable way in which the environment could depend on your policy.

S does need to have a uniform modulus of continuity, though. Ie, for all ϵ, there needs to be *some* ridiculously large time n such that any two policies that are identical up till time n make every policy selection environment in S vary in its behavior *now* by at most ϵ. If we're just encoding finitely many policy-selection environments, we automatically get this property.

**Theorem 6: Maximin UDT:** *Translating a set *S* of policy selection environments with a uniform modulus of continuity to an acausal hypothesis *Θ* (via making a *Θ?ω* where *Θ?ω(π):={(m,b)|b=0,∃e∈S:π⋅e=m}*, and turning it into an acausal hypothesis by the procedure in Proposition 2) always works. Also, for all utility functions* U,

argmaxπinfe∈SEπ⋅e(U)=argmaxπEΘ(π)(U)

So, this thing acts as maximin UDT on sets of policy-dependent environments, and in the case where there's a single policy-dependent environment, it replicates UDT perfectly. Nice!
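A minimal sketch of maximin policy selection over a finite set S, with the expected utilities Eπ⋅e(U) just tabulated as toy numbers (the policy and environment labels are hypothetical, Newcomb-flavored for color):

```python
# Expected utilities E_{pi . e}(U), tabulated directly as toy numbers.
payoff = {
    ("one-box", "accurate-omega"): 0.9,
    ("one-box", "broken-omega"):   0.4,
    ("two-box", "accurate-omega"): 0.1,
    ("two-box", "broken-omega"):   0.5,
}
policies = ["one-box", "two-box"]
envs = ["accurate-omega", "broken-omega"]

def maximin_policy(policies, envs, payoff):
    """argmax_pi of inf_e E_{pi . e}(U): pick the policy whose
    worst-case expected utility over the environment set is largest."""
    return max(policies, key=lambda p: min(payoff[(p, e)] for e in envs))

best = maximin_policy(policies, envs, payoff)
```

With a single environment in S, the min over environments is trivial and this reduces to plain argmax, matching the single-policy-dependent-environment UDT case.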

Let's ponder the bestiary of decision-theory problems. Note that the following analysis is cheating, in a sense: it just assumes we have the proper hypothesis corresponding to the problem encoded in our head, and doesn't cover learnability of the problem if we don't start out with the correct model. Much more interesting is whether we can learn a given decision theory problem, given some suitably broad prior, if we're repeatedly faced with it. Also, the "you should know your entire policy off-history" thing is just for defining the theoretical gold standard; real agents don't fulfill this property.

5-and-10 falls immediately.

Troll bridge is a bit interesting because we always have deterministic policies. We could stipulate that we always have an accurate internal sense of which action we meant to take, but it goes through an RNG first and we may end up exploring with that. We cross (the internal event of crossing means you probably cross, with a small chance of staying on the shore; the internal event of not-crossing means you probably don't cross, with a small probability of getting blown up on the bridge).

Doing what UDT would means we also get XOR blackmail, Counterfactual Mugging, Newcomb, Transparent Newcomb, and the version of Death In Damascus where you have to pay for a random coin to escape Death, along with many others.

We haven't yet crammed Absent-Minded Driver into this setting, because of the assumption that you know your past history. Maybe more on that later?

More interesting is which of these decision theory problems are pseudocausal vs acausal. Counterfactual Mugging, Newcomb, Death In Damascus, and XOR Blackmail are pseudocausal. Transparent Newcomb is also pseudocausal if there's some small probability of Omega making a mistake. Really, the issue is with problems where you have 0 probability of getting into a situation where you can prove a false prediction wrong, like perfect Transparent Newcomb. Perfect predictors aren't a problem, as long as you're dealing with a situation where you have to make a bad situation worse locally in exchange for making the situation impossible in the first place. They're only a problem (ie, make an acausal instead of pseudocausal hypothesis) for decision theory problems where you're like "I'll make this good situation worse locally in exchange for making the situation possible in the first place", because false predictions lock you out in that case.

Also, there's material on learnability and infra-POMDPs and Game Theory and "we have a theorem prover, what do we do with that" deferred for a later post. We're still working on this stuff, and the field feels unusually broad and tractable.

Let's finish up by presenting some material on learnability, and a proof sketch for the analogue of the Complete Class Theorem.

**Definition 29: **Uγ

*Use *h:n* to refer to the prefix string of the first *n* actions and observations for an infinite history *h*. Given a reward function *r:(A×O)<ω→[0,1]*, and a time-discount parameter *γ∈[0,1)*, *Uγ* is the utility function generated from *r* and *γ* via: *Uγ(h):=(1−γ)∑∞n=0γnr(h:n)

If γ is near 1, this is low time-discount. An important notation note: U is used to denote some generic utility function, while the family U is more like a utility function template, with the reward function specified but still missing the time-discount factor γ necessary to make it an actual utility function.
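A quick sketch of Definition 29 with the infinite sum truncated at a finite horizon (the truncation error is geometrically small; the constant reward function and history encoding are hypothetical toys):

```python
def U_gamma(r, prefix_of, gamma, horizon=2000):
    """U_gamma(h) = (1 - gamma) * sum_{n>=0} gamma^n * r(h:n), truncated
    at `horizon`; the discarded tail is at most gamma**horizon.
    `r` maps finite history prefixes to rewards in [0, 1]; `prefix_of`
    lazily produces the length-n prefix of the infinite history."""
    return (1 - gamma) * sum(gamma ** n * r(prefix_of(n))
                             for n in range(horizon))

# Toy check: constant reward 1 on every prefix gives utility ~1 for any gamma.
r = lambda prefix: 1.0                  # hypothetical reward function
prefix_of = lambda n: ("a", "o") * n    # hypothetical constant history
u = U_gamma(r, prefix_of, gamma=0.9)
```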

Let's define regret in the standard Bayesian case. For some environment e and utility function U, the regret of policy π is R(π,e,U):=maxπ∗(Eπ∗⋅e(U)−Eπ⋅e(U))

Ie, regret is the gap between the best possible value you could get on the environment, and the value you actually got.

In the Bayesian case, learnability of a set of environments ei w.r.t a family of utility functions U is: There exists some γ-indexed family of policies, πγ, where: ∀i:limγ→1R(πγ,ei,Uγ)=0

So, this means that there's some policy which takes in the time discount and decides what to do, and regardless of environment, you can find a really low time discount where the policy nearly matches the performance of a policy which starts out knowing what the true environment is.

Learnability can be thought of as a sort of weak no-traps condition. A trap is something that makes you lose a non-negligible amount of value in the low-time-discount limit. If you can learn a class of environments, that means that there's some policy that (in the low-time-discount limit), doesn't run into traps and figures out how to be near-optimal for the environment, no matter which environment in the class it finds itself in.
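To make the learnability picture concrete, here's a toy sketch: two deterministic two-armed environments, and a policy family that tries each arm once and then commits to the better one. Its discounted regret is O(1−γ), so it vanishes as γ→1 (all numbers hypothetical):

```python
def regret(env, gamma):
    """Discounted regret of a try-both-arms-then-commit learner on a
    deterministic two-armed environment env = (reward_A, reward_B).
    The optimal policy plays the better arm from step 0."""
    r_a, r_b = env
    best = max(env)
    # Learner: arm A at step 0, arm B at step 1, the better arm forever after.
    learner = (1 - gamma) * (r_a + gamma * r_b) + gamma ** 2 * best
    # Equals (1-gamma)*(best-r_a) + gamma*(1-gamma)*(best-r_b) -> 0 as gamma -> 1.
    return best - learner

envs = [(0.3, 0.8), (0.9, 0.2)]         # hypothetical environments
worst_regret = [max(regret(e, g) for e in envs) for g in (0.9, 0.99, 0.999)]
```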

Also, there's a notion of a Bayes-optimal family of policies. Given some nondogmatic (nothing with probability 0) prior ζ over environments ei, which can be viewed as one big environment eζ, a Bayes-optimal family of policies π∗γ is one where, for all γ<1, π∗,γ∈argmaxπEπ⋅eζ(Uγ)

One of the most basic results of learnability in the standard Bayesian setting is that if a collection of environments is learnable *at all*, then if we make any nondogmatic prior out of them, the Bayes-optimal family of policies for that prior will learn that collection of environments as well.

Now let's transplant these to our setting.

**Definition 30: Regret**

*The regret of the policy *π* on the hypothesis *Θ* with utility function *U* is*

R(π,Θ,U):=maxπ∗(EΘ(π∗)(U)−EΘ(π)(U))

Regret, as usual, is "what's the gap between the value Murphy forced for you, and the value Murphy forced for the optimal policy?" And then learnability of a family of hypotheses Θi turns into:

**Definition 31: Learnability**

*For a family of utility functions *U*, a family of hypotheses *Θi* is learnable if there is a *γ*-indexed family of policies *πγ*, where *∀i:limγ→1R(πγ,Θi,Uγ)=0

So, on every hypothesis in the family, you converge to having similar worst-case value as the optimal policy which already knows which component hypothesis it's in.

Fixing some prior (EζΘi)R, and abbreviating it as Θζ, there's a notion of a Bayes-optimal family of policies for the prior.

**Definition 32: InfraBayes-Optimal Family**

An InfraBayes-optimal family of policies π∗γ w.r.t. a prior Θζ is one where, for all γ<1, π∗,γ∈argmaxπEΘζ(π)(Uγ)

So, we've got nice parallels of everything. But do we still have the same result that an InfraBayes-optimal family for a prior will learn a collection of hypotheses if it's learnable? Yes.

**Proposition 12:** *If a collection of hypotheses *Θi* is learnable, then any InfraBayes-optimal family for a prior on them learns the collection of hypotheses as well.*

Now, what about the Complete Class Theorem? Let's say there are finitely many states S, observations O, and actions A, a function ob:S→ΔO that tells you the probabilities of observations given the state, and a payoff function P:S×A→[0,1]. Then we can view a policy π:O→A as a function fπ:S→[0,1], given by s↦Eob(s)(P(s,π(o))).
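A minimal sketch of this policy-to-payoff-function translation, with finite toy states, observations, and actions (all names and probabilities hypothetical):

```python
def f_pi(policy, ob, P, states):
    """Translate a deterministic policy pi: O -> A into its
    state-to-expected-payoff function s -> E_{o ~ ob(s)} P(s, pi(o))."""
    return {s: sum(prob * P[(s, policy[o])] for o, prob in ob[s].items())
            for s in states}

# Hypothetical toy instance: two states, two observations, two actions.
states = ["s0", "s1"]
ob = {"s0": {"o0": 0.75, "o1": 0.25},   # observation probabilities per state
      "s1": {"o0": 0.25, "o1": 0.75}}
P = {("s0", "a0"): 1.0, ("s0", "a1"): 0.0,
     ("s1", "a0"): 0.0, ("s1", "a1"): 1.0}
payoffs = f_pi({"o0": "a0", "o1": "a1"}, ob, P, states)
```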

The standard formulation of the Complete Class Theorem permits mixtures between policies, and anything on the Pareto-frontier has some probability distribution over states where it's optimal. In our case, we (probably) have something much stronger. We stick with our deterministic policies.

**Definition 33: Convexity Property**

*A total order *>* on equivalence classes of policies *π* (*π* and *π′* are in the same equivalence class iff *fπ=fπ′*) fulfills the convexity property if, viewing the *fπ* as points in *[0,1]S*,*

fπ′∈c.h({fπ′′|π′′>π})+[0,∞)S→π′>π

To sum up what we've done so far, we associated each policy with its corresponding function from states to expected payoffs. We consider policies equivalent if they have the same expected payoff for all states, and then stick a total order on them. The convexity property says that if π′ Pareto-dominates, or equals, a state-to-payoff function made by mixing policies that outperform π (according to the total order), then it must outperform π as well. For any Pareto-optimal policy, we can make an ordering like this that puts it as the highest-ranked policy.

**Complete Class Conjecture: ***If we have a total order *>* over equivalence classes of policies that fulfills the convexity property, then there exists an infradistribution over states, *H*, where*

π′>π→EH(fπ′)>EH(fπ)

So, we can (probably, still a conjecture) match a wide variety of rankings of policies. Such an infradistribution can be thought of as a set of a-environments (with the appropriate mix of states comprising the environment), and then thought of as a causal hypothesis without Nirvana.

Well, what's the proof sketch of this? Why do we think it's true? By Theorem 5 of the previous post, there's a duality between infradistributions and concave monotone Lipschitz normalized functionals on [0,1]S (the space of continuous functions on states). So, viewing our policies as points in [0,1]S, we just need to craft a suitable function. We can do something like this.

The points are the scores of the policies in the states, and the number over each is its ranking in the total order, from worst to best. What we do is basically make a topographic map that's compatible with the total order, like we did here, except in more dimensions. Assigning each contour line from worst to best an elevation/value, and filling in the rest of the values by linear interpolation, we can get a quasiconcave monotone Lipschitz function (quasiconcave just means all superlevel sets are convex, which we obviously have) where 0 in all coordinates is assigned the lowest value, 1 in all coordinates is assigned the highest value, and the function respects the total ordering we have.

Then, we just have to turn this into a concave function and renormalize to get a function of the proper form, which we can turn into an infradistribution. Fortunately, there's a process called concavification, which lets you adjust the values assigned to the topographic contour lines to make a concave function. It's detailed in this paper. The one issue is that the conditions you have to verify are pretty nightmarish. Two of the three look fine, the nightmarish one is the third condition in Theorem 3 about having uniformly bounded variation of the logarithm of the upper derivative along all strictly increasing geodesic paths when we warp our function in a certain way. That's why this is still a conjecture.

There is, however, a weaker version of the Complete Class Theorem we were able to prove, the one where we can pick any Pareto-optimal policy and come up with an infradistribution over states where it is indeed optimal.

**Complete Class Theorem Weak Version: ***Given any Pareto-optimal policy *π*, there is an infradistribution *H* over states, where *∀π′:fπ′≠fπ:EH(fπ)>EH(fπ′)

A proof sketch for why this is true is: Because we are able to translate from concave Lipschitz monotone normalized functionals over [0,1]S to infradistributions over states, we just have to get a concave Lipschitz monotone functional where our policy π is optimal, and then it can be normalized back up to 1, and then it can be turned into an infradistribution over states by LF-duality. Said concave Lipschitz monotone functional is:

h(f):=mins∈S(f(s)−Eo∼ob(s)P(s,π(o)))

If f is some fπ′ (the payoff function induced by a policy π′), then this can be rewritten as:

h(fπ′):=mins∈S(fπ′(s)−fπ(s))

And so, now it's clearer what this means: the infradistribution picks the state where π has the largest advantage over π′. π is Pareto-optimal, so there's always a state where π does strictly better than π′.
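A toy numerical check of this functional (payoff vectors hypothetical): h assigns 0 to the Pareto-optimal policy's own payoff function and a strictly negative value to the competitors.

```python
def h(f, f_base):
    """h(f) = min over states of f(s) - f_pi(s): a min of affine
    functions, hence concave, and monotone in f, with the base
    (Pareto-optimal) policy's own payoff function scoring exactly 0."""
    return min(f[s] - f_base[s] for s in f_base)

# Hypothetical payoff vectors over two states; f_opt is Pareto-optimal.
f_opt  = {"s0": 0.75,  "s1": 0.5}
f_alt1 = {"s0": 0.625, "s1": 0.625}
f_alt2 = {"s0": 0.875, "s1": 0.375}
scores = (h(f_opt, f_opt), h(f_alt1, f_opt), h(f_alt2, f_opt))
```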

And that's about it for this post. On to further results and glory! (in progress)

Discuss

### Proofs Section 1.1 (Initial results to LF-duality)

Fair upfront warning: This is not a particularly readable proof section (though much better than Section 2 about belief functions). There's dense notation, logical leaps due to illusion of transparency since I've spent a month getting fluent with these concepts, and a relative lack of editing since it's long. If you really want to read this, I'd suggest PM-ing me to get a link to MIRIxDiscord, where I'd be able to guide you through it and answer questions.

**Proposition 1:** *If *f∈C(X,[0,1])* then *f+:(m,b)↦m(f)+b* is a positive functional on *Msa(X)*.*

Proof Sketch: We just check three conditions: linearity, being nonnegative on Msa(X), and continuity.

Linearity proof. Using a,a′ for constants,

f+(a(m,b)+a′(m′,b′))=f+(am+a′m′,ab+a′b′)=(am+a′m′)(f)+ab+a′b′

=a(m(f)+b)+a′(m′(f)+b′)=af+(m,b)+a′f+(m′,b′)

So we have verified that f+(aM+a′M′)=af+(M)+a′f+(M′) and we have linearity.

Positivity proof: An sa-measure M, writeable as (m,b), has m uniquely writeable as a pair of finite measures m+ (all the positive regions) and m− (all the negative regions) by the Jordan Decomposition Theorem, with b+m−(1)≥0. So,

f+(M)=m(f)+b=m+(f)+m−(f)+b≥0+m−(1)+b≥0

The first ≥ is by 1≥f≥0: the expectation m+(f) is nonnegative, and since m− is a negative measure, m−(f)≥m−(1). The second ≥ is by the condition on how m− relates to b.

Continuity proof: Fix a sequence (mn,bn) converging to (m,b). Obviously the b part converges, so now we just need to show that mn(f) converges to m(f). The metric we have on the space of finite signed measures is the KR-metric, and convergence in it gives exactly that mn(f)→m(f) for continuous f. This only works for continuous f, not general f.
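For a finite X these objects are easy to play with directly. A toy sketch (names and numbers hypothetical) checking nonnegativity on an sa-measure and linearity under scaling:

```python
def f_plus(m, b, f):
    """The positive functional (m, b) -> m(f) + b, for a finite signed
    measure m represented as a dict over a finite space X."""
    return sum(mass * f(x) for x, mass in m.items()) + b

# Hypothetical sa-measure: the negative part has m-(1) = -0.25, and
# b = 0.25 offsets it, so the sa-measure condition b + m-(1) >= 0 holds.
m = {"x0": 0.5, "x1": -0.25}
b = 0.25
f = lambda x: 1.0 if x == "x0" else 0.5   # hypothetical f in C(X, [0,1])
val = f_plus(m, b, f)
doubled = f_plus({x: 2 * mass for x, mass in m.items()}, 2 * b, f)
```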

**Theorem 1:** *Every positive functional on *Msa(X)* can be written as *(m,b)↦c(m(f)+b)*, where *c≥0*, and *f∈C(X,[0,1])

Proof Sketch: The first part is showing that it's impossible to have a positive functional where the b term doesn't matter, without the positive functional being the one that maps everything to 0. The second part of the proof is recovering our f by applying the positive functional to Dirac-delta measures δx, to see what the function must be on point x.

Part 1: Let's say f+ isn't 0, ie there's some nonzero (m,b) pair where f+(m,b)>0, and yet f+(0,1)=0 (which, by linearity, means that f+(0,b)=0 for all b). We'll show that this situation is impossible.

Then, 0<f+(m,b)=f+(m+,0)+f+(m−,b) by our starting assumption, and Jordan decomposition of m, along with linearity of positive functionals. Now, f+(m−,b)+f+(−2(m−),0)=f+(−(m−),b) because positive functionals are linear, and everything in that above equation is an sa-measure (flipping a negative measure makes a positive measure, which doesn't impose restrictions on the b term except that it be ≥0). And so, by nonnegativity of positive functionals on sa-measures, f+(m−,b)≤f+(−(m−),b). Using this, we get

f+(m+,0)+f+(m−,b)≤f+(m+,0)+f+(−(m−),b)

=f+(m+,0)+f+(−(m−),0)+f+(0,b)=f+(m+,0)+f+(−(m−),0)

Another use of linearity was invoked for the first = in the second line, and then the second = made use of our assumption that f+(0,b)=0 for all b.

At this point, we have derived that 0<f+(m+,0)+f+(−(m−),0). Both of these are positive measures. So, there exists some positive measure m′ where f+(m′,0)>0.

Now, observe that, for all b, 0=f+(0,b)=f+(m′,0)+f+(−(m′),b)

Let b be sufficiently huge to make (−(m′),b) into an sa-measure. Also, since f+(m′,0)>0, f+(−(m′),b)<0, which is impossible because positive functionals are nonnegative on all sa-measures. Contradiction. Due to the contradiction, if there's a nonzero positive functional, it must assign f+(0,1)>0, so let f+(0,1) be our c term.

Proof part 2: Let's try to extract our f. Let f(x):=f+(δx,0)/f+(0,1). This is just recovering the value of the hypothesized f on x by feeding our positive functional the measure δx that assigns 1 value to x and nothing else, and scaling. Now, we just have to verify that this f is continuous and in [0,1].

For continuity, let xn limit to x. By the KR-metric we're using, (δxn,0) limits to (δx,0). By continuity of f+, f+(δxn,0) limits to f+(δx,0). Therefore, f(xn) limits to f(x) and we have continuity.

For a lower bound, f≥0, because f(x) is a ratio of two nonnegative numbers, and the denominator isn't 0.

Now we just have to show that f≤1. For contradiction, assume there's an x where f(x)>1. Then f+(δx,0)/f+(0,1)>1, so f+(δx,0)>f+(0,1), and in particular, f+(0,1)−f+(δx,0)<0.

But then, f+(−(δx),1)+f+(δx,0)=f+(0,1), so f+(−(δx),1)=f+(0,1)−f+(δx,0)<0

However, (−(δx),1) is an sa-measure, because (−δx)(1)+1=0, and must have nonnegative value, so we get a contradiction. Therefore, f∈C(X,[0,1]).

To wrap up, we can go:

f+(m,b)=f+(m,0)+f+(0,b)=(f+(0,1)/f+(0,1))(∫X(f+(δx,0))dm+f+(0,b))

=f+(0,1)(∫X(f+(δx,0)/f+(0,1))dm+f+(0,b)/f+(0,1))=c(∫Xf(x)dm+b)=c(m(f)+b)

And c≥0, and f∈C(X,[0,1]), so we're done.
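As a concrete sanity check of Theorem 1's conclusion, here's a minimal sketch on a hypothetical 3-point space X, representing measures as lists of point masses (all names here are made up for illustration, not from the post): a functional of the form f+(m,b)=c(m(f)+b) with c≥0 and f valued in [0,1] is nonnegative on sa-measures, and f is recovered pointwise as f(x)=f+(δx,0)/f+(0,1), exactly as in Proof part 2.

```python
# Finite-X sketch of Theorem 1 (hypothetical X = {0,1,2}).
c = 2.0
f = [0.25, 1.0, 0.0]          # a [0,1]-valued function on X

def f_plus(m, b):
    # the positive functional c*(m(f) + b); m is a signed measure
    # given by its values on the three points
    return c * (sum(mi * fi for mi, fi in zip(m, f)) + b)

def is_sa_measure(m, b):
    # sa-measure condition: b + m-(1) >= 0
    return b + sum(min(mi, 0.0) for mi in m) >= 0

# nonnegativity on a few sa-measures
examples = [([1.0, -0.5, 0.2], 0.5), ([0.0, 0.0, 0.0], 1.0), ([-1.0, 2.0, -1.0], 2.0)]
assert all(is_sa_measure(m, b) for m, b in examples)
assert all(f_plus(m, b) >= 0 for m, b in examples)

# recovering f pointwise from the functional, as in the proof
recovered = [f_plus([1.0 if i == j else 0.0 for i in range(3)], 0) / f_plus([0, 0, 0], 1)
             for j in range(3)]
assert recovered == f
```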

**Lemma 1: Compactness Lemma:** *Fixing some nonnegative constants λ◯ and b◯, the set of sa-measures where m+(1)∈[0,λ◯] and b∈[0,b◯] is compact. Further, if a set lacks an upper bound on m+(1) or on b, it's not compact.*

Proof Sketch: We fix an arbitrary sequence of sa-measures, and then use the fact that closed intervals are compact and the space ΔX is a compact complete metric space to isolate a suitable convergent subsequence. Since all sequences have a limit point, the set is compact. Then, we go in the other direction, and get a sequence with no limit points assuming either a lack of upper bounds on m+(1), or a lack of upper bounds on b.

Proof: Fix some arbitrary sequence Mn wandering about within this space, which breaks down into (m+n,0)+(m−n,bn), and then, since all measures are just a probability distribution scaled by the constant m(1), it further breaks down into (m+n(1)⋅μn,0)+(m−n(1)⋅μ′n,bn). Since bn+m−n(1)≥0, m−n(1) must be bounded in [−b◯,0].

Now, what we can do is extract a subsequence where bn, m+n(1), m−n(1), μn, and μ′n all converge, by Tychonoff's Theorem (finite product, no axiom of choice required). Our three number sequences are all confined to a bounded interval, and our two probability sequences are wandering around within ΔX, which is a compact complete metric space if X is. The limit of this subsequence is a limit point of the original sequence, since all its components are arbitrarily close to the components that make up Mn for large enough n in our subsequence.

The limiting value of m+(1) and b both obey their respective bounds, and the cone of sa-measures is closed, so the limit point is an sa-measure and respects the bounds too. Therefore the set is compact, because all sequences of points in it have a limit point.

In the other direction, assume a set B has unbounded b values. Then we can fix a sequence (mn,bn)∈B where bn increases without bound, so the sa-measures can't converge. The same applies to all subsequences, so there's no limit point, so B isn't compact.

Now, assume a set B has bounded b values, call the least upper bound b⊙, but the value of m+(1) is unbounded. Fix a sequence (mn,bn)∈B where m+n(1) is unbounded above. Assume a convergent subsequence exists. Since bn+m−n(1)≥0, m−n(1) must be bounded in [−b⊙,0]. Then, because mn(1)=m+n(1)+m−n(1)≥m+n(1)−b⊙, and b⊙ is finite while m+n(1) is unbounded above, mn(1) must be unbounded above. However, in order for the mn to limit to some m, limn→∞mn(1)=m(1), which results in a contradiction. Therefore, said convergent subsequence doesn't exist, and B is not compact.

Put together, we have a necessary-and-sufficient condition for a closed subset of Msa(X) to be compact. There must be an upper bound on b and m+(1), respectively.
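On a finite space, the Jordan decomposition used throughout is just splitting coordinates by sign, and the Compactness Lemma's criterion is a pair of box constraints. A minimal sketch under that finite-X assumption (names illustrative):

```python
# Finite-X sketch: Jordan decomposition m = m+ + m-, and the
# Compactness Lemma's criterion bounding m+(1) and b.

def jordan(m):
    # split a signed measure (list of point masses) by sign
    m_plus = [max(mi, 0.0) for mi in m]
    m_minus = [min(mi, 0.0) for mi in m]
    return m_plus, m_minus

def in_compact_box(m, b, lam, b_max):
    # the Lemma 1 criterion: m+(1) in [0, lam], b in [0, b_max]
    m_plus, _ = jordan(m)
    return 0.0 <= sum(m_plus) <= lam and 0.0 <= b <= b_max

m = [0.5, -0.25, 1.0]
m_plus, m_minus = jordan(m)
assert m_plus == [0.5, 0.0, 1.0] and m_minus == [0.0, -0.25, 0.0]
assert [p + q for p, q in zip(m_plus, m_minus)] == m    # m = m+ + m-
assert in_compact_box(m, 0.25, lam=2.0, b_max=1.0)      # m+(1)=1.5 <= 2, b=0.25 <= 1
assert not in_compact_box(m, 0.25, lam=1.0, b_max=1.0)  # m+(1)=1.5 > 1
```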

**Lemma 2:** *The upper completion of a closed set of sa-measures is closed.*

Proof sketch: We'll take a convergent sequence (mn,bn) in the upper completion of B that limits to (m,b), and show that, in order for it to converge, the same sorts of bounds as the Compactness Lemma uses must apply. Then, breaking down (mn,bn) into (mBn,bBn)+(m∗n,b∗n), where (mBn,bBn)∈B, and (m∗n,b∗n) is an sa-measure, we'll transfer these Compactness-Lemma-enabling bounds to the sequences (mBn,bBn) and (m∗n,b∗n), to get that they're both wandering around in a compact set. Then, we just take a convergent subsequence of both, add the two limit points together, and get our limit point (m,b), witnessing that it's in the upper completion of B.

Proof: Let (mn,bn)∈B+Msa(X) limit to some (m,b). A convergent sequence (plus its one limit point) is a compact set of points, so, by the Compactness Lemma, there must be a b◯ and λ◯ that are upper bounds on the bn and m+n(1) values, respectively.

Now, for all n, break down (mn,bn) as (mBn,bBn)+(m∗n,b∗n), where (mBn,bBn)∈B, and (m∗n,b∗n) is an sa-measure.

Because bBn+b∗n=bn≤b◯, we can bound the bBn and b∗n quantities by b◯. This transfers into a −b◯ lower bound on mB−n(1) and m∗−n(1), respectively.

Now, we can go:

mB+n(1)+mB−n(1)+m∗+n(1)+m∗−n(1)=mBn(1)+m∗n(1)=mn(1)

=m+n(1)+m−n(1)≤m+n(1)≤λ◯

Using worst-case values for mB−n(1) and m∗−n(1), we get:

mB+n(1)+m∗+n(1)−2b◯≤λ◯

mB+n(1)+m∗+n(1)≤λ◯+2b◯

So, λ◯+2b◯ is an upper bound on both mB+n(1) and m∗+n(1).

Due to the sequences (mBn,bBn) and (m∗n,b∗n) respecting bounds on b and m+(1) (b◯ and λ◯+2b◯ respectively), and wandering around within the closed sets B and Msa(X) respectively, we can use the Compactness Lemma and Tychonoff's theorem (finite product, no axiom of choice needed) to go "hey, there's a subsequence where both (mBn,bBn) and (m∗n,b∗n) converge, call the limit points (mB,bB) and (m∗,b∗). Since B and Msa(X) are closed, (mB,bB)∈B, and (m∗,b∗)∈Msa(X)."

Now, does (mB,bB)+(m∗,b∗)=(m,b)? Well, for any ϵ, there's some really large n where d((mBn,bBn),(mB,bB))<ϵ, d((m∗n,b∗n),(m∗,b∗))<ϵ, and d((mn,bn),(m,b))<ϵ. Then, we can go:

d((m,b),(mB,bB)+(m∗,b∗))≤d((m,b),(mn,bn))+d((mn,bn),(mB,bB)+(m∗,b∗))

=d((m,b),(mn,bn))+d((mBn,bBn)+(m∗n,b∗n),(mB,bB)+(m∗,b∗))

=d((m,b),(mn,bn))+||((mBn,bBn)+(m∗n,b∗n))−((mB,bB)+(m∗,b∗))||

=d((m,b),(mn,bn))+||((mBn,bBn)−(mB,bB))+((m∗n,b∗n)−(m∗,b∗))||

≤d((m,b),(mn,bn))+||(mBn,bBn)−(mB,bB)||+||(m∗n,b∗n)−(m∗,b∗)||

=d((m,b),(mn,bn))+d((mBn,bBn),(mB,bB))+d((m∗n,b∗n),(m∗,b∗))<3ϵ

So, regardless of ϵ, d((m,b),(mB,bB)+(m∗,b∗))<3ϵ, so (mB,bB)+(m∗,b∗)=(m,b). So, we've written (m,b) as a sum of an sa-measure in B and an sa-measure, certifying that (m,b)∈B+Msa(X), so B+Msa(X) is closed.

**Proposition 2:** *For closed convex nonempty* B,B+Msa(X)={M|∀f+∃M′∈B:f+(M)≥f+(M′)}

Proof sketch: Show both subset inclusion directions. One is very easy, then we assume the second direction is false, and invoke the Hahn-Banach theorem to separate a point in the latter set from the former set. Then we show that the separating functional is a positive functional, so we have a positive functional where the additional point underperforms everything in B+Msa(X), which is impossible by the definition of the latter set.

Easy direction: We will show that B+Msa(X)⊆{M|∀f+∃M′∈B:f+(M)≥f+(M′)}

This is because any M∈B+Msa(X) can be written as M=MB+M∗. Let MB be our M′ of interest. Then, it is indeed true that for all f+, f+(M)=f+(MB)+f+(M∗)≥f+(MB)

Hard direction: Assume by contradiction that

B+Msa(X)⊂{M|∀f+∃M′∈B:f+(M)≥f+(M′)}

Then there's some M where ∀f+∃M′∈B:f+(M)≥f+(M′) and M∉B+Msa(X). B+Msa(X) is the upper completion of a closed set, so by Lemma 2, it's closed, and since it's the Minkowski sum of convex sets, it's convex.

Now, we can use the variant of the Hahn-Banach theorem from the Wikipedia article on "Hahn-Banach theorem", in the "separation of a closed and compact set" section. Our single point M is compact, convex, nonempty, and disjoint from the closed convex set B+Msa(X). Banach spaces are locally convex, so we can invoke Hahn-Banach separation.

Therefore, there's some continuous linear functional ϕ s.t. ϕ(M)<infM′∈(B+Msa(X))ϕ(M′)

We will show that this linear functional is actually a positive functional!

Assume there's some sa-measure M∗ where ϕ(M∗)<0. Then we can pick any MB∈B, and consider ϕ(MB+cM∗), where c is extremely large. MB+cM∗ lies in B+Msa(X), but it would also produce an extremely negative value for ϕ which undershoots ϕ(M), which is impossible. So ϕ is a positive functional.

However, ϕ(M)<infM′∈(B+Msa(X))ϕ(M′), so ϕ(M)<infM′∈Bϕ(M′). But also, M fulfills the condition ∀f+∃M′∈B:f+(M)≥f+(M′), because of the set it came from. So, there must exist some M′∈B where ϕ(M)≥ϕ(M′). But, we have a contradiction, because ϕ(M)<infM′∈Bϕ(M′).

So, there cannot be any point in {M|∀f+∃M′∈B:f+(M)≥f+(M′)} that isn't in B+Msa(X). This establishes equality.
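The easy direction of Proposition 2 can be checked numerically on a finite hypothetical X (a sketch, with made-up points; the positive functionals take the c(m(f)+b) form from Theorem 1): adding an sa-measure M∗ to a point MB can only increase the value of every positive functional.

```python
# Finite-X sketch of Proposition 2's easy direction.
import random

random.seed(0)

def apply_functional(f, c, m, b):
    # positive functional c*(m(f) + b), c >= 0, f in C(X,[0,1])
    return c * (sum(mi * fi for mi, fi in zip(m, f)) + b)

MB = ([1.0, -0.5], 0.5)        # a point of B (an sa-measure: 0.5 - 0.5 >= 0)
Mstar = ([-0.25, 2.0], 0.25)   # an sa-measure: 0.25 - 0.25 >= 0
M = ([a + b for a, b in zip(MB[0], Mstar[0])], MB[1] + Mstar[1])

# every sampled positive functional scores M at least as high as MB
for _ in range(100):
    f = [random.random() for _ in range(2)]
    c = random.random()
    assert apply_functional(f, c, *M) >= apply_functional(f, c, *MB)
```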

**Lemma 3:** *For any closed set B⊆Msa(X) and point M∈B, the set ({M}−Msa(X))∩B is nonempty and compact.*

Proof: It's easy to verify nonemptiness, because M is in the set. Also, it's closed because it's the intersection of two closed sets. B was assumed closed, and the other part is the Minkowski sum of {M} and −Msa(X), which is closed if −Msa(X) is, because it's just a shift of −Msa(X) (via a single point). −Msa(X) is closed because it's -1 times a closed set.

We will establish a bound on the m+(1) and b values of anything in the set, which lets us invoke the Compactness Lemma to show compactness, because it's a closed subset of a compact set.

Note that if M′∈({M}−Msa(X))∩B, then M′=M−M∗, so M′+M∗=M. Rewrite this as (m′,b′)+(m∗,b∗)=(m,b)

Because b′+b∗=b, we can bound b′ and b∗ by b. This transfers into a −b lower bound on m′−(1) and m∗−(1). Now, we can go:

m′+(1)+m′−(1)+m∗+(1)+m∗−(1)=m′(1)+m∗(1)=m(1)

=m+(1)+m−(1)≤m+(1)

Using worst-case values for m′−(1) and m∗−(1), we get:

m′+(1)+m∗+(1)−2b≤m+(1)

m′+(1)≤m′+(1)+m∗+(1)≤m+(1)+2b

So, we have an upper bound of m+(1)+2b on m′+(1), and an upper bound of b on b′. Further, (m′,b′) was arbitrary in ({M}−Msa(X))∩B, so we have our bounds. This lets us invoke the Compactness Lemma, and conclude that said closed set is compact.

**Lemma 4:** *If ≥ is a partial order on B where M′≥M iff there's some sa-measure M∗ where M=M′+M∗, then*

∃M′>M↔(M∈B∧∃M′≠M:M′∈({M}−Msa(X))∩B)↔M is not minimal in B

Proof: ∃M′>M↔∃M′≠M:M′≥M

Also, M′≥M↔(M′,M∈B∧∃M∗:M=M′+M∗)

Also, ∃M∗:M=M′+M∗↔∃M∗:M−M∗=M′↔M′∈({M}−Msa(X))

Putting all this together, we get

(∃M′>M)↔(M∈B∧∃M′≠M:M′∈({M}−Msa(X))∩B)

And we're halfway there. Now for the second half.

M is not minimal in B↔M∈B∧(∃M′∈B:M′≠M∧(∃M∗:M=M′+M∗))

Also, ∃M∗:M=M′+M∗↔∃M∗:M−M∗=M′↔M′∈({M}−Msa(X))

Putting this together, we get

M is not minimal in B↔(M∈B∧∃M′≠M:M′∈({M}−Msa(X))∩B)

And the result has been proved.

**Theorem 2:** *Given a nonempty closed set B, the set of minimal points Bmin is nonempty and all points in B are above a minimal point.*

Proof sketch: First, we establish a partial order that's closely tied to the ordering on B, but flipped around, so minimal points in B are maximal elements. We show that it is indeed a partial order, letting us leverage Lemma 4 to translate between the partial order and the set B. Then, we show that every chain in the partial order has an upper bound via Lemma 3 and compactness arguments, letting us invoke Zorn's lemma to show that everything in the partial order is below a maximal element. Then, we just do one last translation to show that minimal points in B perfectly correspond to maximal elements in our partial order.

Proof: first, impose a partial order on B, where M′≥M iff there's some sa-measure M∗ where M=M′+M∗. Notice that this flips the order. If an sa-measure is "below" another sa-measure in the sa-measure addition sense, it's above that sa-measure in this ordering. So a minimal point in B would be maximal in the partial order. We will show that it's indeed a partial order.

Reflexivity is immediate. M=M+(0,0), so M≥M.

For transitivity, assume M′′≥M′≥M. Then there's some M∗ and M′∗ s.t. M=M′+M∗, and M′=M′′+M′∗. Putting these together, we get M=M′′+(M∗+M′∗), and adding sa-measures gets you an sa-measure, so M′′≥M.

For antisymmetry, assume M′≥M and M≥M′. Then M=M′+M∗, and M′=M+M′∗. By substitution, M=M+(M∗+M′∗), so M′∗=−M∗. For all positive functionals, f+(M′∗)=f+(−M∗)=−f+(M∗), and since positive functionals are always nonnegative on sa-measures, the only way this can happen is if M∗ and M′∗ are 0, showing that M=M′.

Anyways, since we've shown that it's a partial order, all we now have to do is show that every chain has an upper bound in order to invoke Zorn's lemma to show that every point in B lies below some maximal element.

Fix some ordinal-indexed chain Mγ, and associate each of them with the set Sγ=({Mγ}+(−Msa(X)))∩B, which is compact by Lemma 3 and always contains Mγ.

The collection of Sγ also has the finite intersection property, because, fixing finitely many of them, we can consider a maximal γ∗, and Mγ∗ is in every associated set by:

Case 1: Some other Mγ equals Mγ∗, so Sγ=Sγ∗ and Mγ∗∈Sγ∗=Sγ.

Case 2: Mγ∗>Mγ, and by Lemma 4, Mγ∗∈({Mγ}−Msa(X))∩B.

Anyways, since all the Sγ are compact, and have the finite intersection property, we can intersect them all and get a nonempty set containing some point M∞. M∞ lies in B, because all the sets we intersected were subsets of B. Also, because M∞∈({Mγ}−Msa(X))∩B for all γ in our chain, then if M∞≠Mγ, Lemma 4 lets us get M∞>Mγ, and if M∞=Mγ, then M∞≥Mγ. Thus, M∞ is an upper bound for our chain.

By Zorn's Lemma, because every chain has an upper bound, there are maximal elements in B, and every point in B has a maximal element above it.

To finish up, use Lemma 4 to get: M is maximal↔¬∃M′>M↔M is minimal in B

**Proposition 3:** Given an f∈C(X,[0,1]), and a B that is nonempty closed, inf(m,b)∈B(m(f)+b)=inf(m,b)∈Bmin(m(f)+b)

Direction 1: since Bmin is a subset of B, we get one direction easily, that

inf(m,b)∈B(m(f)+b)≤inf(m,b)∈Bmin(m(f)+b)

Direction 2: Take an M∈B. By Theorem 2, there is an Mmin∈Bmin s.t. M=Mmin+M∗. Applying our positive functional m(f)+b (by Proposition 1), we get that m(f)+b≥mmin(f)+bmin. Because every point in B has a point in Bmin which scores as low or lower according to the positive functional,

inf(m,b)∈B(m(f)+b)≥inf(m,b)∈Bmin(m(f)+b)

And this gives us our desired equality.
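For a finite B on a finite hypothetical X, Proposition 3 can be checked directly (a sketch with made-up points): minimal points are the ones that can't be written as another point plus an sa-measure, and they already attain the inf of m(f)+b.

```python
# Finite-X, finite-B sketch of Proposition 3.
def is_sa(m, b):
    return b + sum(min(mi, 0.0) for mi in m) >= 0

def minus(M, Mp):
    return ([a - b for a, b in zip(M[0], Mp[0])], M[1] - Mp[1])

B = [([1.0, 0.0], 0.0),        # minimal
     ([0.0, 1.0], 0.0),        # minimal
     ([1.5, 0.5], 0.25),       # = first point + sa-measure ([0.5,0.5],0.25)
     ([0.0, 1.0], 1.0)]        # = second point + (0,1)

def is_minimal(M):
    # M is minimal iff no other point of B plus an sa-measure gives M
    return not any(Mp != M and is_sa(*minus(M, Mp)) for Mp in B)

B_min = [M for M in B if is_minimal(M)]
assert B_min == [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]

# the inf of m(f)+b over B is already attained on B_min
f = [0.75, 0.25]
vals = lambda S: min(sum(mi * fi for mi, fi in zip(m, f)) + b for m, b in S)
assert vals(B) == vals(B_min)
```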

**Proposition 4:** *Given a nonempty closed convex B, Bmin=(Buc)min and (Bmin)uc=Buc*

Proof: First, we'll show Bmin=(Buc)min. We'll use the characterization in terms of the partial order ≤ we used for the Zorn's Lemma proof of Theorem 2. If a point M is in Buc, then it can be written as M=MB+M∗, so M≤MB. Since all points added in Buc lie below a preexisting point in B (according to the partial order from Theorem 2), the set of maximals (i.e., the set of minimal points) is completely unchanged when we add all the new points to the partial order via upper completion, so Bmin=(Buc)min.

For the second part, one direction is immediate. Bmin⊆B, so (Bmin)uc⊆Buc. For the reverse direction, take a point M∈Buc. It can be decomposed as MB+M∗, and then by Theorem 2, MB can be decomposed as Mmin+M′∗, so M=Mmin+(M∗+M′∗), so it lies in (Bmin)uc, and we're done.

**Theorem 3:** *If the nonempty closed convex sets A and B have Amin≠Bmin, then there is some f∈C(X,[0,1]) where EA(f)≠EB(f)*

Proof sketch: We show that upper completion is idempotent, and then use that to show that the upper completions of A and B are different. Then, we can use Hahn-Banach to separate a point of A from Buc (or vice-versa), and show that the separating functional is a positive functional. Finally, we use Theorem 1 to translate from a separating positive functional to different expectation values of some f∈C(X,[0,1])

Proof: Phase 1 is showing that upper completion is idempotent. (Buc)uc=Buc. One direction of this is easy, Buc⊆(Buc)uc. In the other direction, let M∈(Buc)uc. Then we can decompose M into M′+M∗, where M′∈Buc, and decompose that into MB+M′∗ where MB∈B, so M=MB+(M∗+M′∗) and M∈Buc.

Now for phase 2, we'll show that the minimal points of one set aren't in the upper completion of the other set. Assume, for contradiction, that this is false, so Amin⊆Buc and Bmin⊆Auc. Then, by idempotence, Proposition 4, and our subset assumption,

Auc=(Amin)uc⊆(Buc)uc=Buc

Swapping the A and B, the same argument holds, so Auc=Buc, so (Buc)min=(Auc)min.

Now, using this and Proposition 4, Bmin=(Buc)min=(Auc)min=Amin.

But wait, we have a contradiction, we said that the minimal points of B and A weren't the same! Therefore, either Bmin⊈Auc, or vice-versa. Without loss of generality, assume that Bmin⊈Auc.

Now for phase 3, Hahn-Banach separation to get a positive functional with different inf values. Take a point MB in Bmin that lies outside Auc. Now, use the Hahn-Banach separation of {MB} and Auc used in the proof of Proposition 2, to get a linear functional ϕ (which can be demonstrated to be a positive functional by the same argument as the proof of Proposition 2) where: ϕ(MB)<infM∈Aucϕ(M). Thus, infM∈Bϕ(M)<infM∈Aϕ(M), so infM∈Bϕ(M)≠infM∈Aϕ(M)

Said positive functional can't be 0, otherwise both sides would be 0. Thus, by Theorem 1, ϕ((m,b))=a(m(f)+b) where a>0, and f∈C(X,[0,1]). Swapping this out, we get:

inf(m,b)∈Ba(m(f)+b)≠inf(m′,b′)∈Aa(m′(f)+b′)

inf(m,b)∈B(m(f)+b)≠inf(m′,b′)∈A(m′(f)+b′)

and then this is EB(f)≠EA(f). So, we have crafted our f∈C(X,[0,1]) which distinguishes the two sets, and we're done.

**Corollary 1:** *If two nonempty closed convex upper-complete sets A and B are different, then there is some f∈C(X,[0,1]) where EA(f)≠EB(f)*

Proof: Either Amin≠Bmin, in which case we can apply Theorem 3 to separate them, or their sets of minimal points are the same. In that case, by Proposition 4 and upper completion, A=Auc=(Amin)uc=(Bmin)uc=Buc=B and we have a contradiction because the two sets are different.

**Theorem 4:** *If H is an infradistribution/bounded infradistribution, then h:f↦EH(f) is concave in f, monotone, uniformly continuous/Lipschitz, h(0)=0, h(1)=1, and if range(f)⊈[0,1], h(f)=−∞*

Proof sketch: h(0)=0,h(1)=1 is trivial, as is uniform continuity from the weak bounded-minimal condition. For concavity and monotonicity, it's just some inequality shuffling, and for h(f)=−∞ if f∈C(X),f∉C(X,[0,1]), we use upper completion to have its worst-case value be arbitrarily negative. Lipschitzness is much more difficult, and comprises the bulk of the proof. We get a duality between minimal points and hyperplanes in C(X)⊕R, show that all the hyperplanes we got from minimal points have the same Lipschitz constant upper bound, and then show that the chunk of space below the graph of h itself is the same as the chunk of space below all the hyperplanes we got from minimal points. Thus, h has the same (or lesser) Lipschitz constant as all the hyperplanes chopping out stuff above the graph of h.

Proof: For normalization, h(1)=EH(1)=1 and h(0)=EH(0)=0 by normalization for H. Getting the uniform continuity condition from the weak-bounded-minimal condition on an infradistribution H is also trivial, because the condition just says f↦EH(f) is uniformly continuous, and that's just h itself.

Let's show that h is concave over C(X,[0,1]), first. We're shooting for h(pf+(1−p)f′)≥ph(f)+(1−p)h(f′). To show this,

h(pf+(1−p)f′)=EH(pf+(1−p)f′)=inf(m,b)∈H(m(pf+(1−p)f′)+b)

=inf(m,b)∈H(p(m(f)+b)+(1−p)(m(f′)+b))

≥pinf(m,b)∈H(m(f)+b)+(1−p)inf(m′,b′)∈H(m′(f′)+b′)

=pEH(f)+(1−p)EH(f′)=ph(f)+(1−p)h(f′)

And concavity has been proved.

Now for monotonicity. By Proposition 3 and Proposition 1,

∀f:inf(m,b)∈H(m(f)+b)=inf(m,b)∈Hmin(m(f)+b)

Now, let's say f′≥f. Then:

EH(f)=inf(m,b)∈H(m(f)+b)=inf(m,b)∈Hmin(m(f)+b)≤inf(m,b)∈Hmin(m(f′)+b)

=inf(m,b)∈H(m(f′)+b)=EH(f′)

And we're done. The critical inequality in the middle came from all minimal points in an infradistribution having no negative component by positive-minimals, so swapping out a function for a greater function produces an increase in value.

Time for range(f)⊈[0,1]→h(f)=−∞. Let's say there exists an x s.t. f(x)>1. We can take an arbitrary sa-measure (m,b)∈H, and consider (m,b)+c(−δx,1), where δx is the point measure that's 1 on x, and c is extremely huge. The latter part is an sa-measure. But then, (m−cδx)(f)+(b+c)=m(f)+b+c(1−δx(f))=m(f)+b+c(1−f(x)). Since f(x)>1, and c is extremely huge, this is extremely negative. So, since there's sa-measures that make the function as negative as we wish in H by upper-completeness, inf(m,b)∈H(m(f)+b)=−∞. A very similar argument can be done if there's an x where f(x)<0, we just add in (cδx,0) to force arbitrarily negative values.
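The argument above can be traced numerically on a finite hypothetical X (a sketch with made-up values): if f(x)>1 somewhere, adding c(−δx,1), which is an sa-measure, to any (m,b) shifts the value by c(1−f(x)), so it drops without bound as c grows.

```python
# Finite-X sketch of the range(f) outside [0,1] argument in Theorem 4.
f = [1.5, 0.5]                 # f(x0) = 1.5 > 1
x = 0                          # the offending point
m, b = [1.0, 1.0], 0.0         # an arbitrary a-measure in H

def value(m, b):
    return sum(mi * fi for mi, fi in zip(m, f)) + b

def add_scaled_delta(m, b, c):
    # (m, b) + c*(-delta_x, 1)
    m2 = list(m)
    m2[x] -= c
    return m2, b + c

base = value(m, b)             # = 2.0
for c in [1.0, 10.0, 100.0]:
    v = value(*add_scaled_delta(m, b, c))
    assert v == base + c * (1 - f[x])   # shifts by c*(1 - 1.5), decreasing without bound
assert value(*add_scaled_delta(m, b, 100.0)) < -40
```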

Now for Lipschitzness, which is by far the worst of all. A minimal point (m,b) induces an affine function hm,b (kinda like a hyperplane) of the form hm,b(f)=m(f)+b. Regardless of (m,b), as long as it came from a minimal point in H, hm,b≥h for functions with range in [0,1], because

hm,b(f)=m(f)+b≥inf(m,b)∈H(m(f)+b)=EH(f)=h(f)

Ok, so if a point is on-or-below the graph of h over C(X,[0,1]), then it's on-or-below the graph of hm,b for all (m,b)∈Hmin.

What about the other direction? Is it possible for a point (f,b′) to be strictly above the graph of h and yet ≤ all the graphs of hm,b? Well, no. Invoking Proposition 3,

b′>h(f)=EH(f)=inf(m,b)∈H(m(f)+b)=inf(m,b)∈Hmin(m(f)+b)=inf(m,b)∈Hmin(hm,b(f))

So, there exists a minimal point (m,b)∈Hmin where b′>hm,b(f), so (f,b′) lies above the graph of hm,b.

Putting these two parts together, h's hypograph over C(X,[0,1]) is the same as the intersection of the hypographs of all these hm,b. If we can then show all the hm,b have a Lipschitz constant bounded above by some constant, then we get that h itself is Lipschitz with the same constant.

First, a minimal (m,b) must have m having no negative parts, so it can be written as λμ, and by bounded-minimals (since we have a bounded infradistribution), λ≤λ⊙. Now,

|hm,b(f)−hm,b(f′)|=|m(f)+b−m(f′)−b|=|m(f−f′)|≤m(|f−f′|)

=(λμ)(|f−f′|)=λ(μ)(|f−f′|)≤λsupx∈X|f(x)−f′(x)|≤λ⊙supx∈X|f(x)−f′(x)|

So, we get that: |hm,b(f)−hm,b(f′)|/supx∈X|f(x)−f′(x)|≤λ⊙supx∈X|f(x)−f′(x)|/supx∈X|f(x)−f′(x)|=λ⊙

Note that supx∈X|f(x)−f′(x)| is our distance metric between functions in C(X). This establishes that regardless of which minimal point we picked, hm,b is Lipschitz with Lipschitz constant ≤λ⊙, and since h=inf(m,b)∈Hminhm,b, then h itself has the same bound on its Lipschitz constant.
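The Lipschitz bound at the end is easy to check numerically on a finite hypothetical X (a sketch, with made-up λ and μ): for hm,b(f)=m(f)+b with m=λμ, the change in hm,b is at most λ times the sup-distance between the functions.

```python
# Finite-X sketch of the Lipschitz bound on h_{m,b} in Theorem 4.
lam = 2.0
mu = [0.25, 0.75]              # a probability distribution on X
m = [lam * p for p in mu]      # m = lam * mu
b = 0.5

def h_mb(f):
    return sum(mi * fi for mi, fi in zip(m, f)) + b

f, fp = [0.9, 0.1], [0.2, 0.6]
sup_dist = max(abs(a - c) for a, c in zip(f, fp))   # sup-norm distance = 0.7
assert abs(h_mb(f) - h_mb(fp)) <= lam * sup_dist    # Lipschitz constant <= lam
```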

**Lemma 5:** ∀m:inff∈C(X,[0,1])(m(f))=m−(1)

Proof sketch: We'll work in the Banach space L1(|m|) of L1 measurable functions w.r.t. the absolute value of the signed measure m. Then, we consider the discontinuous (but L1) function that's 1 everywhere where m is negative. Continuous functions are dense in L1 measurable functions, so we can fix a sequence of continuous functions limiting to said indicator function. Then we just have to check that f↦m(f) is a bounded linear functional, and we get that there's a sequence of continuous functions f′n where m(f′n) limits to the measure of the indicator function that's 1 where everything is negative. Which is the same as the measure of the "always 1" function, but only on the negative parts, and we're done.

Consider the Banach space L1(|m|) of measurable functions w.r.t. the absolute value of the signed measure m, i.e., |m|=m+−m−, which is a measure. It has a norm given by ||f||=∫X|f|d|m|. To begin with, we can consider the L1 indicator function 1m− that's 1 where the measure is negative. Note that

m(1m−)=∫X1m−dm=∫X1m−dm++∫X1m−dm−

=∫X0dm++∫X1dm−=∫X1dm−=m−(1)

Because continuous functions are dense in L1, we can fix a sequence of continuous functions fn limiting to 1m−. Then, just clip those continuous functions to [0,1], making a continuous function f′n. They'll get *closer* to 1m− that way, so the sequence f′n of continuous functions X→[0,1] limits to 1m− too.

We'll take a detour and show that m is a bounded linear functional L1(|m|)→R, with a Lipschitz constant of 1 or less.

First, m(af+a′f′)=am(f)+a′m(f′), trivially, establishing linearity.

As for the boundedness, if ||f||≤1, then ∫X|f|d|m|≤1, so:

1≥∫X|f|d|m|=∫Xsup(f,0)d|m|−∫Xinf(f,0)d|m|

=∫Xsup(f,0)dm++∫Xsup(f,0)d|m−|−∫Xinf(f,0)dm+−∫Xinf(f,0)d|m−|

=∣∣∫Xsup(f,0)dm+∣∣+∣∣−∫Xsup(f,0)d|m−|∣∣+∣∣∫Xinf(f,0)dm+∣∣+∣∣−∫Xinf(f,0)d|m−|∣∣

≥∣∣∫Xsup(f,0)dm+−∫Xsup(f,0)d|m−|+∫Xinf(f,0)dm+−∫Xinf(f,0)d|m−|∣∣

=∣∣∫Xsup(f,0)dm+∫Xinf(f,0)dm∣∣=∣∣∫Xfdm∣∣=|m(f)|

So, m(f)∈[−1,1]. An f having a norm of 1 or less gets mapped to a number with a norm of 1 or less, so the Lipschitz constant of f↦m(f) is 1 or less. This implies continuity.

Now that we have all requisite components, fix some ϵ. There's some n where, for all greater n, d(1m−,f′n)<ϵ. Mapping them through f↦m(f), due to having a Lipschitz constant of 1 or less, then means that ϵ>|m(f′n)−m(1m−)|=m(f′n)−m(1m−)=m(f′n)−m−(1), because the value of 1-but-only-on-negative-parts is as-or-more negative than f′n on the measure, due to f′n being bounded in [0,1]. Summarizing, ϵ>m(f′n)−m−(1) for all n beyond a certain point, so, for all n beyond a certain point, m(f′n)<ϵ+m−(1)

So we have a sequence of functions in C(X,[0,1]) where m(f′n) limits to m−(1), and our signed measure was arbitrary. Therefore, we have our result that ∀m:inff∈C(X,[0,1])m(f)=m−(1).
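On a finite space the continuity issues in Lemma 5's proof disappear, and the claim can be checked by brute force (a sketch with a made-up signed measure): the minimizer of m(f) over [0,1]-valued f is the indicator of the set where m is negative, and its value is m−(1).

```python
# Finite-X sketch of Lemma 5: inf over f in [0,1]^X of m(f) equals m-(1).
import itertools

m = [0.5, -0.25, -1.0, 2.0]

def m_of(f):
    return sum(mi * fi for mi, fi in zip(m, f))

indicator_neg = [1.0 if mi < 0 else 0.0 for mi in m]
m_minus_1 = sum(mi for mi in m if mi < 0)           # m-(1) = -1.25

assert m_of(indicator_neg) == m_minus_1

# a grid search over [0,1]-valued f confirms nothing does better
grid = [0.0, 0.5, 1.0]
best = min(m_of(f) for f in itertools.product(grid, repeat=len(m)))
assert best == m_minus_1
```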

**Theorem 5:** If h is a function C(X)→R that is concave, monotone, uniformly-continuous/Lipschitz, h(0)=0, h(1)=1, and range(f)⊈[0,1]→h(f)=−∞, then it specifies an infradistribution/bounded infradistribution by: {(m,b)|b≥(h′)∗(m)}, where h′ is the function given by h′(−f)=−h(f), and (h′)∗ is the convex conjugate of h′. Also, going from an infradistribution to an h and back recovers exactly the infradistribution, and going from an h to an infradistribution and back recovers exactly h.

Proof sketch: This is an extremely long one. Phases 1 and 2 are showing isomorphism. One direction is reshuffling the definition of H until we get the definition of the set built from h′ via convex conjugate, showing that going from H to h and back recovers your original set. In the other direction, we show that expectations w.r.t. the set we built from h match up with h exactly.

Phase 3 is cleanup of the easy conditions. Nonemptiness is pretty easy to show, the induced set being a set of sa-measures is harder to show and requires moderately fancier arguments, and closure and convexity require looking at basic properties of functions and the convex conjugate. Upper completeness takes some equation shuffling to show but isn't too bad. The weak-minimal bound property is immediate, and normalization is fairly easy.

That just leaves the positive-minimal property and the bounded-minimal properties, respectively, which are nightmares. A lesser nightmare and a greater nightmare. For phase 4 to lay the groundwork for these, we establish an isomorphism between points in H and hyperplanes which lie above the graph of h, as well as a way of certifying that a point in H isn't minimal by what its hyperplane does.

Phase 5 is, for showing positive-minimals, we can tell whether a hyperplane corresponds to an a-measure, and given any hyperplane above the graph of h, construct a lower one that corresponds to a lower point in H that *does* correspond to an a-measure.

Phase 6 is, for bounded-minimals, we take a hyperplane that *may* correspond to a minimal point, but which is too steep in certain directions. Then, we make an open set that fulfills the two roles of: if you enter it, you're too steep, or you overshoot the hyperplane of interest that you're trying to undershoot. Some fancy equation crunching and one application of Hahn-Banach later, we get a hyperplane that lies above h and doesn't enter our open set we crafted. So, in particular, it undershoots our hyperplane of interest, and isn't too steep. This certifies that our original "too steep" hyperplane didn't actually correspond to a minimal point, so all minimal points must have a bound on their λ values by the duality between hyperplanes above h and points in H.

Fix the convention that supf or inff is assumed to mean f∈C(X); we'll explicitly specify when f has bounds.

Phase 1: Let's show isomorphism. Our first direction is showing H to h and back is H exactly. By upper completion, and Proposition 2, we can also characterize H as

{M|∀f+∃M′∈H:f+(M)≥f+(M′)}

Using Theorem 1 to express all positive functionals as arising from an f∈C(X,[0,1]), and observing that the constant a in front doesn't change which stuff scores lower than which other stuff, so we might as well characterize everything in terms of f, H can also be expressed as

{(m,b)|∀f∈C(X,[0,1]):m(f)+b≥inf(m′,b′)∈H(m′(f)+b′)}

We can swap out C(X,[0,1]) for C(X), because, from the −∞ argument in Theorem 4, f going outside [0,1] means that inf(m′,b′)∈H(m′(f)+b′)=−∞. And then, our H can further be reexpressed as

{(m,b)|∀f:m(f)+b≥EH(f)}={(m,b)|∀f:b≥EH(f)−m(f)}

={(m,b)|b≥supf(EH(f)−m(f))}

Also, EH(f)=h(f)=−h′(−f), so we can rewrite this as:

{(m,b)|b≥sup−f(m(−f)−h′(−f))}={(m,b)|b≥supf(m(f)−h′(f))}

and, by the definition of the convex conjugate (sup characterization), and the space of finite signed measures being the dual space of C(X), and m(f) being a functional applied to an element, this is {(m,b)|b≥(h′)∗(m)}. So, our original set H is identical to the convex-conjugate set, when we go from H to h back to a set of sa-measures.
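Phase 1's containment can be illustrated by brute force on a finite hypothetical X with a finite H (a sketch; the grid approximation of the sup is mine, not from the post): with h(f)=inf(m,b)∈H(m(f)+b), every point of H satisfies b≥supf(h(f)−m(f)), the convex-conjugate bound, while a point with too small a b is excluded.

```python
# Finite-X, finite-H sketch of the conjugate duality in Phase 1.
import itertools

H = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]
grid = [i / 4 for i in range(5)]                   # f-values in [0,1]
fs = list(itertools.product(grid, repeat=2))

def h(f):
    # h(f) = inf over H of m(f) + b
    return min(sum(mi * fi for mi, fi in zip(m, f)) + b for m, b in H)

def conj(m):
    # grid approximation of (h')*(m) = sup_f (h(f) - m(f))
    return max(h(f) - sum(mi * fi for mi, fi in zip(m, f)) for f in fs)

for m, b in H:
    assert b >= conj(m)                            # H sits inside the conjugate set

# a point with b below the conjugate bound is excluded: (0, 0.5) would need b >= 1
assert conj([0.0, 0.0]) == 1.0
assert not 0.5 >= conj([0.0, 0.0])
```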

Proof Phase 2: In the reverse direction for isomorphism, assume that h fulfills the conditions. We want to show that E{(m,b)|b≥(h′)∗(m)}(f)=h(f), so let's begin.

E{(m,b)|b≥(h′)∗(m)}(f)=inf(m,b):b≥(h′)∗(m)(m(f)+b)

Given an m, we have a natural candidate for minimizing the b, just set it equal to (h′)∗(m). So then we get infm(m(f)+(h′)∗(m))=infm((h′)∗(m)−m(−f))

And this is just... −(h′)∗∗(−f) (proof by Wikipedia article, check the inf characterization), and, because h is continuous over C(X,[0,1]), and concave, and −∞ everywhere outside the legit functions, h′ is continuous over C(X,[−1,0]), and convex, and ∞ everywhere outside the legit functions. So, in particular, h′ is convex and lower-semicontinuous and proper, so h′=(h′)∗∗ by the Fenchel-Moreau Theorem. From that, we get

E{(m,b)|b≥(h′)∗(m)}(f)=−(h′)∗∗(−f)=−h′(−f)=h(f)

and we're done with isomorphism. Now that isomorphism has been established, let's show the relevant conditions hold. Namely: nonemptiness, closure, convexity, upper completion, normalization, and weak-bounded-minimals (phase 3), then positive-minimals (phase 5) and bounded-minimals (assuming h is Lipschitz) (phase 6) to finish off. The last two will be extremely hard.

Begin phase 3. Weak-bounded-minimals is easy by isomorphism. For the H′ we constructed, if f↦EH′(f) weren't uniformly continuous, then because EH′(f) equals h(f), we'd get a failure of uniform continuity for h, contradicting our assumptions.

By the way, the convex conjugate, (h′)∗(m), can be expressed as (by Wikipedia, sup characterization) supf(m(f)−h′(f))=supf(m(−f)−h′(−f))=supf(h(f)−m(f)). We can further restrict f to functions with range in [0,1], because if it was anything else, we'd get −∞. We'll be using (h′)∗(m)=supf∈C(X,[0,1])(h(f)−m(f)) (or the supf variant) repeatedly.

For nonemptiness, observe that (0,1) is present in H′, because

(h′)∗(0)=supf∈C(X,[0,1])(h(f)−0(f))=supf∈C(X,[0,1])h(f)=1

This is from our format of the convex conjugate, and h being normalized and monotone, so the highest it can be is 1 and it attains that value. Therefore, 1≥(h′)∗(0), so (0,1) is in the H′ we constructed.

For showing that our constructed set H′ lies in Msa(X), we have that, for an arbitrary (m++m−,b)∈H′, it has (by our characterization of (h′)∗(m))

b+m−(1)≥supf∈C(X,[0,1])(h(f)−(m++m−)(f))+m−(1)

≥supf∈C(X,[0,1])(−(m++m−)(f))+m−(1)

=m−(1)−inff∈C(X,[0,1])((m++m−)(f))=m−(1)−m−(1)=0

This is by the lower bound on b being (h′)∗(m++m−) and unpacking the convex conjugate, h(f)≥h(0)=0 by monotonicity and normalization, a reexpression of sup, and Lemma 5, respectively. b+m−(1)≥0 so it's an sa-measure.

For closure and convexity, by monotonicity of h, we have 0=−h(0)≥−h(f)≥−h(1)=−1 and h is continuous on C(X,[0,1]), concave, and −∞ everywhere else by assumption, so h′ is proper, continuous on C(X,[−1,0]), convex, and lower-semicontinuous in general because of the ∞ everywhere else, so, by the Wikipedia page on "Closed Convex Function", h′ is a closed convex function, and then by the Wikipedia page on "Convex Conjugate" in the Properties section, (h′)∗ is convex and closed. From the Wikipedia page on "Closed Convex Function", this means that the epigraph of (h′)∗ is closed, and also the epigraph of a convex function is convex. This takes care of closure and convexity for our H′

Time for upper-completeness. Assume that (m,b) lies in the epigraph. Our task now is to show that (m,b)+(m∗,b∗) lies in the epigraph. This is equivalent to showing that b+b∗≥(h′)∗(m+m∗). Note that b∗≥−m∗−(1), because (m∗,b∗) is an sa-measure. Let's begin.

(h′)∗(m+m∗)=supf∈C(X,[0,1])(h(f)−(m+m∗)(f))

=supf∈C(X,[0,1])(h(f)−m(f)−m∗+(f)−m∗−(f))≤supf∈C(X,[0,1])(h(f)−m(f)+b∗)

=b∗+supf∈C(X,[0,1])(h(f)−m(f))=b∗+(h′)∗(m)≤b∗+b

This was done by unpacking the convex conjugate, splitting up m∗ into m∗+ and m∗−, locking two of the components in the sup to be an upper bound (which also gives the sup more flexibility on maximizing the other two components, so this is greater), packing up the convex conjugate, and using that b≥(h′)∗(m) because (m,b)∈H′

Normalization of the resulting set is easy. Going from h to a (maybe)-inframeasure H′ back to h is identity as established earlier, so all we have to do is show that a failure of normalization in a (maybe)-inframeasure makes the resulting h not normalized. Thus, if our h is normalized, and it makes an H′ that isn't normalized, then going back makes a non-normalized h, which contradicts isomorphism. So, assume there's a failure of normalization in H′. Then EH′(0)≠0, or EH′(1)≠1, so either h(0)≠0 or h(1)≠1 and we get a failure of normalization for h which is impossible. So H′ must be normalized.

Begin phase 4. First, continuous affine functionals ϕ that lie above the graph of h perfectly correspond to sa-measures in H′. This is because the continuous dual space of C(X) is the space of finite signed measures, so we can interpret ϕ−ϕ(0) as a finite signed measure, and ϕ(0) as the b term. In one direction, given an (m,b)∈H′,

ϕ(f)=m(f)+b≥inf(m,b)∈H′(m(f)+b)=EH′(f)=h(f)

so every point in H′ induces a continuous affine functional C(X)→R whose graph is above h.

In the other direction, from earlier, we can describe H′ as: {(m,b)|b≥supf(h(f)−m(f))}

and then, for (ϕ−ϕ(0),ϕ(0)),

supf(h(f)−(ϕ−ϕ(0))(f))=supf(h(f)−ϕ(f)+ϕ(0))≤ϕ(0)

because ϕ(f)≥h(f). So continuous affine functionals whose graph lies above the graph of h correspond to points in H′.

So, we have a link between affine functionals that lie above the graph of h, and points in H′. What would a minimal point correspond to? Well, a non-minimal point corresponds to (m,b)+(m∗,b∗), where the latter component is nonzero. There's *some* f+ where f+((m,b)+(m∗,b∗))>f+(m,b) due to the latter component being nonzero, and for all f+, f+((m,b)+(m∗,b∗))≥f+(m,b). Using Theorem 1 to translate positive functionals to f, this means that the ϕ induced by (m,b) lies *below* the affine functional induced by (m,b)+(m∗,b∗) over the f∈C(X,[0,1]). So, if there's a different affine functional ψ s.t. ∀f∈C(X,[0,1]):h(f)≤ψ(f)≤ϕ(f), then ϕ must correspond to a nonminimal point.

Further, we can characterize whether ϕ corresponds to an a-measure or not. For a measure, if you increase the function you're feeding in, you increase the value you get back out: f′≥f→ϕ(f′)≥ϕ(f). For a signed measure with some negative component, Lemma 5 says we can find some f′∈C(X,[0,1]) on which the measure attains a negative value, so you can add one of those f′ to your f and get ϕ(f+f′)<ϕ(f). So, a ϕ corresponds to an a-measure exactly when it's monotone.
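A tiny numeric illustration of this monotonicity criterion (the two-point X and the specific numbers are assumptions for illustration): a functional whose measure component has a negative part fails monotonicity, so it cannot come from an a-measure.

```python
# Signed measure on a two-point space with a negative component
# (illustrative assumption), plus a b term.
m, b = (0.5, -0.3), 1.0

def phi(f):
    return sum(mi * fi for mi, fi in zip(m, f)) + b

f_low  = (0.0, 0.0)
f_high = (0.0, 1.0)             # f_high >= f_low pointwise...
print(phi(f_low), phi(f_high))  # ...yet phi(f_high) < phi(f_low)
```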

Phase 5: Proving positive-minimals. With these links in place, this means we just have to take any old point that's an sa-measure in H′, get a ϕ from it, it'll fulfill certain properties, and use those properties to find a ψ that lies below ϕ and above h on C(X,[0,1]) and is monotone, certifying that ψ corresponds to a point *below* our minimal-point of interest that's still in H′ but is an a-measure, so we have a contradiction.

To that end, fix a ϕ that corresponds to some point in H′ that's not an a-measure (in particular, it has a negative component), it lies above the graph of h.

Now, translate ϕ to a (mϕ,bϕ), where bϕ=ϕ(0), and mϕ(f)=ϕ(f)−ϕ(0). ϕ is minimized at some f. Since our ϕ corresponds to something that's not an a-measure, (mϕ)−(1)<0

Let our affine continuous functional ψ be defined as ψ(f)=(mϕ)+(f)+ϕ(0)+(mϕ)−(1).

In order to show that ψ corresponds to an a-measure below (mϕ,bϕ) in H′, we need three things. One is that ψ is monotone (is an a-measure), two is that it lies below ϕ over C(X,[0,1]), and three is that it lies above h. Take note of the fact that ϕ(0)+(mϕ)−(1)≥0, because ϕ(0)=bϕ and (mϕ,bϕ) is an sa-measure.

For monotonicity of ψ, it's pretty easy. If f′≥f, then

ψ(f′)=ψ(f+(f′−f))=(mϕ)+(f+(f′−f))+ϕ(0)+(mϕ)−(1)

≥(mϕ)+(f)+ϕ(0)+(mϕ)−(1)=ψ(f)

and we're done with that part.

For being less than or equal to ϕ over C(X,[0,1]) (we know it's not the same as ϕ because ϕ isn't monotone and ψ is),

ψ(f)=(mϕ)+(f)+ϕ(0)+(mϕ)−(1)≤(mϕ)+(f)+ϕ(0)+(mϕ)−(f)

=mϕ(f)+ϕ(0)=ϕ(f)−ϕ(0)+ϕ(0)=ϕ(f)

For being ≥h over C(X,[0,1]), it takes a somewhat more sophisticated argument. By Lemma 5, regardless of ϵ, there exists an f′ where mϕ(f′)<(mϕ)−(1)+ϵ. Then, we can go:

ψ(f)+ϵ>ψ(f)+mϕ(f′)−(mϕ)−(1)

=(mϕ)+(f)+ϕ(0)+(mϕ)−(1)+mϕ(f′)−(mϕ)−(1)=(mϕ)+(f)+ϕ(0)+mϕ(f′)

=(mϕ)+(f+f′)+ϕ(0)+(mϕ)−(f′)≥(mϕ)+(sup(f,f′))+ϕ(0)+(mϕ)−(sup(f,f′))

=mϕ(sup(f,f′))+ϕ(0)=ϕ(sup(f,f′))≥h(sup(f,f′))≥h(f)

The last steps were done via the definition of ϕ, ϕ≥h, and h being monotonic.

So, ψ(f)+ϵ>h(f) for all ϵ>0 and all f∈C(X,[0,1]), getting ψ(f)≥h(f) for f∈C(X) (because h is −∞ everywhere else)

Thus, ψ specifies an a-measure (ψ being monotone) that is below the sa-measure encoded by ϕ (by ϕ≥ψ over C(X,[0,1])), yet ψ≥h, so said point is in H′. This witnesses that there can be no minimal points in H′ that aren't a-measures. That just leaves getting the slope bound from Lipschitzness, the worst part of this whole proof.

Phase 6: Let λ⊙ be the Lipschitz constant for h. Fix a ϕ that corresponds to a minimal point with λ>λ⊙. This violates the Lipschitz bound when traveling from 0 to 1, so the Lipschitz bound is violated in some direction. Further, the graph of ϕ touches the graph of h at some point f∗∈C(X,[0,1]), because if it didn't, you could shift ϕ down further until it *did* touch, witnessing that the point ϕ came from wasn't minimal (you could sap more from the b term).

Now, if this point is minimal, it should be *impossible* to craft a ψ which is ≤ϕ over C(X,[0,1]), ≥h, and different from ϕ. We shall craft such a ψ, witnessing that said point isn't actually minimal. Further, said ψ won't violate the Lipschitz bound in any direction. Thus, all affine functionals corresponding to minimal points must obey the Lipschitz bound and be monotone, so they're a-measures with λ≤λ⊙.

In order to do this, we shall craft three sets in C(X)⊕R. A, B1, and B2.

Set A is {(f,b)|f∈C(X,[0,1]),b≤h(f)}. Pretty much, this set is the hypograph of h. It's obviously convex because h is concave, and the hypograph of a concave function is convex. It's closed because h is continuous.

Set B1 is {(f,b)|f∈C(X,(0,1)),b>ϕ(f)}. This could be thought of as the interior of the epigraph of ϕ restricted to C(X,[0,1]). Undershooting this means you never exceed ϕ over C(X,[0,1]). First, it's open. This is because, due to f being continuous over a compact set X, the maximum and minimum are attained, so any f∈C(X,(0,1)) is bounded strictly below 1 and strictly above 0, so we've got a little bit of room to freely wiggle f in any direction. Further, since ϕ−ϕ(0) is a continuous linear functional on C(X), which is a Banach space, it's a bounded linear functional and has *some* Lipschitz constant (though it may exceed λ⊙), so we have a little bit of room to freely wiggle b as well. So B1 is open.

Also, B1 is convex, because a mixture of f and f′ that are bounded away from 0 and 1 is also bounded away from 0 and 1, and pb+(1−p)b′>pϕ(f)+(1−p)ϕ(f′)=ϕ(pf+(1−p)f′).

Set B2 is {(f,b)|b>λ⊙d(f,f∗)+ϕ(f∗)}. This could be thought of as an open cone with a point (it's missing that exact point, though) at (f∗,ϕ(f∗)), that opens straight up, and certifies a failure of the λ⊙ bound on slope. If an affine function includes the point (f∗,ϕ(f∗)) in its graph, then if it increases faster than λ⊙ in any direction, it'll land in this set. It's open because, given a point in it, we can freely wiggle the f and b values around a little bit in any direction, and stay in the set. Now we'll show it's convex. Given an (f,b) and (f′,b′) in it, due to C(X) being a Banach space (so it has a norm), we want to check whether pb+(1−p)b′>λ⊙d(pf+(1−p)f′,f∗)+ϕ(f∗).

Observe that (using the defining axioms for a norm)

pb+(1−p)b′>p(λ⊙d(f,f∗)+ϕ(f∗))+(1−p)(λ⊙d(f′,f∗)+ϕ(f∗))

=λ⊙(pd(f,f∗)+(1−p)d(f′,f∗))+ϕ(f∗)=λ⊙(p||f−f∗||+(1−p)||f′−f∗||)+ϕ(f∗)

=λ⊙(||pf−pf∗||+||(1−p)f′−(1−p)f∗||)+ϕ(f∗)

≥λ⊙(||pf−pf∗+(1−p)f′−(1−p)f∗||)+ϕ(f∗)

=λ⊙(||pf+(1−p)f′−f∗||)+ϕ(f∗)=λ⊙d(pf+(1−p)f′,f∗)+ϕ(f∗)

So, B2 is convex.

Ok, so we've got a convex closed set and two convex opens. Now, consider B:=c.h(B1∪B2). The convex hull of an open set is open. We will show that A∩B=∅.

Assume this is false, and that they overlap. The point where they overlap can then be written as a convex mixture of points from B1∪B2. However, B1 and B2 are both convex, so we can reduce it to a case where we're mixing one point (f,b) from B1 and one point (f′,b′) in B2. And (pf+(1−p)f′,pb+(1−p)b′)∈A.

If p=0, then we've just got a single point in B2. Also, ϕ(f∗)=h(f∗).

b′>λ⊙d(f′,f∗)+ϕ(f∗)=λ⊙d(f′,f∗)+h(f∗)≥h(f′)

This is because ϕ(f∗)=h(f∗) and h has a Lipschitz constant of λ⊙, so it can't increase as fast as we're demanding as we move from f∗ to f′, which stays in C(X,[0,1]). So (f′,b′)∉A.

If p=1, then we've just got a single point in B1. Then b>ϕ(f)≥h(f), so again, (f,b)∉A.

For the case where p isn't 0 or 1, we need a much more sophisticated argument. Remembering that (f,b)∈B1, and (f′,b′)∈B2, we will show that (pf+(1−p)f∗,pb+(1−p)ϕ(f∗)) lies *strictly* above the graph of h. Both f and f∗ lie in C(X,[0,1]), so their mix lies in the same set, so we don't have to worry about h being undefined there. Also, remember that ϕ≥h over C(X,[0,1]). Now,

pb+(1−p)ϕ(f∗)>pϕ(f)+(1−p)ϕ(f∗)=ϕ(pf+(1−p)f∗)≥h(pf+(1−p)f∗)

The critical > is by the definition of B1, and (f,b)∈B1. So, the b term is strictly too high for this point (different than the one we care about) to land on the graph of h.

With the aid of this, we will consider "what slope do we have as we travel from (pf+(1−p)f∗,pb+(1−p)ϕ(f∗)) to (pf+(1−p)f′,pb+(1−p)b′)"? Said slope is

((pb+(1−p)b′)−(pb+(1−p)ϕ(f∗)))/d(pf+(1−p)f′,pf+(1−p)f∗)=((1−p)(b′−ϕ(f∗)))/(||(pf+(1−p)f′)−(pf+(1−p)f∗)||)

=((1−p)(b′−ϕ(f∗)))/((1−p)||f′−f∗||)=(b′−ϕ(f∗))/d(f′,f∗)>(λ⊙d(f′,f∗)+ϕ(f∗)−ϕ(f∗))/d(f′,f∗)=λ⊙

That critical > is by (f′,b′)∈B2 and the definition of B2.

So, if we start at (pf+(1−p)f∗,pb+(1−p)ϕ(f∗)) (and pf+(1−p)f∗ lies in C(X,[0,1])), we're above the graph of h. Then, we travel to (pf+(1−p)f′,pb+(1−p)b′), where pf+(1−p)f′∈C(X,[0,1]) by assumption that this point is in A, but while doing this, we ascend *faster* than λ⊙, the Lipschitz constant for h. So, our point of interest (pf+(1−p)f′,pb+(1−p)b′) lies above the graph of h and can't lie in A, and we have a contradiction.

Putting all this together, A∩B=∅. Since B is open, and they're both convex and nonempty, we can invoke Hahn-Banach (first version of the theorem in the "Separation of Sets" section) and conclude they're separated by some continuous linear functional ψL. Said linear functional must increase as b does, because (0,0)∈A, and (0,b) (for some sufficiently large b) lies in B2, thus in B. This means that given any f and a∈R to specify a level, we can find a unique b where ψL(f,b)=a.

So, any level set of this continuous linear functional we crafted can also be interpreted as an affine functional. There's a critical value of the level set that achieves the separation, ψL(f∗,ϕ(f∗)). This is because (f∗,ϕ(f∗))=(f∗,h(f∗))∈A, but (f∗,ϕ(f∗)+ϵ) is in B2, thus in B, for all ϵ>0. So we've uniquely pinned down which affine function ψ we're going for. Since the graph of ψ is a hyperplane separating A and B (it may touch the set A, just not cut into it, but it doesn't touch B), from looking at the definitions of A and B1 and B2, we can conclude:

From the definition of A, ψ(f)≥h(f), so ψ≥h over C(X,[0,1]).

From the definition of B1, ψ(f)≤ϕ(f) over C(X,(0,1)), and they're both continuous, so we can extend ψ(f)≤ϕ(f) to C(X,[0,1]) by continuity, so ψ≤ϕ over C(X,[0,1]).

Also, h(f∗)≤ψ(f∗)≤ϕ(f∗)=h(f∗), so ψ(f∗)=ϕ(f∗), and this, paired with the ability of B2 to detect whether an affine function exceeds the λ⊙ slope bound (as long as the graph of said function goes through (f∗,ϕ(f∗))), means that the graph of ψ not entering B2 certifies that its Lipschitz constant is λ⊙ or less. Since ϕ *does* enter B2 due to violating the Lipschitz constant bound, this also certifies that ϕ≠ψ.

Putting it all together, given a ϕ which corresponds to a minimal point and violates the Lipschitz bound, we can find a ψ below it that's also above h, so said minimal point isn't actually minimal.

Therefore, if you were to translate a minimal point in the induced H into an affine function above h, it'd have to A: not violate the Lipschitz bound (otherwise we could undershoot it) and B: be monotone (otherwise we could undershoot it). Being monotone certifies that it's an a-measure, and having a Lipschitz constant of λ⊙ or less certifies that the λ of the a-measure is λ⊙ or less. We're finally done!

The next proofs are here.

Discuss

### Proofs Section 1.2 (Mixtures, Updates, Pushforwards)

The previous proofs are here.

**Proposition 5:** *If *Bmin⊆Ma(X)*, then the condition "there is a *λ⊙* where, *∀(λμ,b)∈Bmin:λ≤λ⊙*" is equivalent to "there is a compact *C* s.t. *Bmin⊆C*"*

Proof sketch: One direction is immediate from the Compactness Lemma. For showing that just a bound on the λ values suffices to be contained in a compact set, instead of a bound on the λ and b values to invoke the Compactness Lemma, we use a proof by contradiction where we can get a bound on the b values of the minimal points from just a bound on the λ values.

Proof: In one direction, assume there's a compact C s.t. Bmin⊆C, and yet there's no upper-bounding λ⊙ on the λ values. This is impossible by the Compactness Lemma, since (λμ)+(1)=λμ+(1)=λμ(1)=λ.

In the other direction, assume there's a λ⊙ bound on λ for the minimal points. Fix some arbitrary (λμ,b)∈Bmin for the rest of the proof. Now, we will show that all minimal points (λ′μ′,b′)∈Bmin have λ′≤λ⊙, and b′≤λ⊙+b, letting us invoke the Compactness Lemma to get that everything is in a suitable compact set C. The first bound is obvious. Since λ′ came from a minimal point, it must have λ⊙ as an upper bound.

For the other one, by contradiction, let's assume that there's a minimal point (λ′μ′,b′) where b′>λ⊙+b. Then, we can write (λ′μ′,b′) as: (λμ,b)+(−λμ,λ⊙)+(λ′μ′,b′−λ⊙−b)

The first component, (λμ,b), is our fixed minimal point of interest. The second component is an sa-measure, because λ⊙−λ≥0, due to the λ⊙ upper bound on the λ value of minimal points. The third component is also a nonzero sa-measure, because λ′ is nonnegative (it came from a minimal point), and by assumption, b′>λ⊙+b. Hang on, we wrote a minimal point (λ′μ′,b′) as another minimal point (λμ,b), plus two sa-measures (one of which is nonzero), so (λ′μ′,b′) can't be minimal, and we have a contradiction.
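Summing the three components coordinate-wise confirms that the decomposition above really recovers the point in question:

```latex
(\lambda\mu,\,b) + (-\lambda\mu,\,\lambda^{\odot}) + (\lambda'\mu',\,b'-\lambda^{\odot}-b)
\;=\; \big(\lambda\mu - \lambda\mu + \lambda'\mu',\;\; b + \lambda^{\odot} + b' - \lambda^{\odot} - b\big)
\;=\; (\lambda'\mu',\,b')
```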

Therefore, all (λ′μ′,b′)∈Bmin have b′≤λ⊙+b. Now that we have bounds on λ and b for minimal points, we can invoke the Compactness Lemma to conclude that everything is in a compact set.

**Proposition 6:** *EB(0)=EB(1) only occurs when there's only one minimal point, of the form (0,b).*

Proof: Unpacking the expectations, and in light of Proposition 3,

EB(1)=inf(λμ,b)∈Bmin(λμ(1)+b)=inf(λμ,b)∈Bmin(λ+b) and EB(0)=inf(λμ,b)∈Bmin(λμ(0)+b)=inf(λμ,b)∈Bminb

So, take a minimal a-measure (λμ,b) that minimizes λ+b. One must exist because we have λ and b bounds, so by the Compactness Lemma, we can restrict our attention to an actual compact set, and continuous functions from a compact set to R have a minimum, so there's an actual minimizing minimal point.

λ must be 0, because otherwise EB(1)=λ+b>b≥EB(0), which contradicts EB(1)=EB(0). Further, since b=λ+b=EB(1)=EB(0), said b must be the lowest b possible amongst minimal points.

So, we have a minimal point of the form (0,b) where b is the lowest possible b amongst the minimal points. Any other distinct minimal point must be of the form (λ′μ′,b′), where b′≥b. This other minimal point can be written as (0,b)+(λ′μ′,b′−b), where the latter component is an sa-measure, so it's not minimal. Thus, there's only one minimal a-measure and it's of the form (0,b).

**Proposition 7:** *Renormalizing a bounded inframeasure produces a bounded infradistribution, if renormalization doesn't fail.*

Proof sketch: Our first order of business is showing that our renormalization process doesn't map anything outside the cone of sa-measures. A variant of this argument establishes that the preimage of a minimal point in BR must be a minimal point in B, which quickly establishes positive-minimals and bounded-minimals for BR. Then, we verify the other conditions of a bounded infradistribution. Nonemptiness, closure, and convexity are very easy, upper-closure is shown by adding appropriately-scaled sa-measures such that, after renormalization, they hit whatever sa-measure you want. Then, finally, we just have to verify that our renormalization procedure is the right one to use, that it makes EBR(1)=1 and EBR(0)=0.

Proof: First up, we need to show that after renormalization, nothing gets mapped outside the cone of sa-measures. Observe that the renormalization process is injective. If two points are distinct, after a scale-and-shift, they'll still be distinct.

Let B be our original set and BR be our renormalized set. Take a point in BR, given by (m,b). Undoing the renormalization, we get (EB(1)−EB(0))(m,b)+(0,EB(0))∈B.

By decomposition into a minimal point and something else via Theorem 2, we get that

(EB(1)−EB(0))(m,b)+(0,EB(0))=(mmin,bmin)+(m∗,b∗)

where (mmin,bmin)∈Bmin. Renormalizing back, we get that

(m,b)=(1/(EB(1)−EB(0)))((mmin,bmin−EB(0))+(m∗,b∗))

bmin≥EB(0), obviously, because EB(0) is the minimal b value amongst the minimal points. So, the first component is an a-measure, the second component is an sa-measure, so adding them is an sa-measure, and then we scale by a nonnegative constant, so (m,b) is an sa-measure as well.

This general line of argument also establishes positive-minimals and bounded-minimals, as we'll now show. If the (m∗,b∗) isn't 0, then we just wrote (m,b) as

(1/(EB(1)−EB(0)))(mmin,bmin−EB(0))+(1/(EB(1)−EB(0)))(m∗,b∗)

And the first component lies in BR, but the latter component is nonzero, witnessing that (m,b) isn't minimal. So, if (m,b) is minimal in BR, then (m∗,b∗)=0, so it must be the image of a single minimal point (mmin,bmin)∈Bmin by injectivity. Ie, the preimage of a minimal point in BR is a minimal point in B.

Scale-and-shift maps a-measures to a-measures, showing positive-minimals, and the positive scale constant of (EB(1)−EB(0))−1 just scales up the λ⊙ upper bound on the λ values of the minimal points in B, showing bounded-minimals.

For the remaining conditions, nonemptiness, closure, and convexity are trivial. We're taking a nonempty closed convex set and doing a scale-and-shift so it's nonempty closed convex.

Time for upper-completeness. Letting B be our original set and BR be our renormalized set, take a point MR+M∗ in (BR)uc. By injectivity, MR has a single preimage point M∈B (M is MR scaled by EB(1)−EB(0), with EB(0) added back to the b term). Now consider M+(EB(1)−EB(0))M∗. This lies in B by upper-completeness, and renormalizing it back produces MR+M∗, which is in BR, so BR is upper-complete.

That just leaves showing that after renormalizing, we're normalized.

EBR(1)=inf(λμ,b)∈BR(λ+b)=inf(λ′μ′,b′)∈B(1/(EB(1)−EB(0)))(λ′+b′−EB(0))

=(1/(EB(1)−EB(0)))(inf(λ′μ′,b′)∈B(λ′+b′)−EB(0))=(EB(1)−EB(0))/(EB(1)−EB(0))=1

For the other part,

EBR(0)=inf(λμ,b)∈BRb=inf(λ′μ′,b′)∈B(1/(EB(1)−EB(0)))(b′−EB(0))

=(1/(EB(1)−EB(0)))(inf(λ′μ′,b′)∈Bb′−EB(0))=(EB(0)−EB(0))/(EB(1)−EB(0))=0

And we're done.
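To make the scale-and-shift concrete, here is a sketch of Proposition 7's renormalization on a one-point space, where an sa-measure reduces to a pair (λ, b) and, for constant f, EB(f) is just an inf of λf+b (the particular set B is an illustrative assumption):

```python
# Renormalization sketch on a one-point space X (an illustrative
# assumption): an sa-measure is a pair (lam, b), and for a constant
# function f, E_B(f) = inf over points of (lam * f + b).
B = [(1.0, 0.5), (0.5, 0.25)]

def E(points, f):
    return min(lam * f + b for lam, b in points)

scale = E(B, 1) - E(B, 0)  # E_B(1) - E_B(0)
BR = [(lam / scale, (b - E(B, 0)) / scale) for lam, b in B]
print(E(BR, 1), E(BR, 0))  # 1.0 0.0, as Proposition 7 requires
```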

**Lemma 6:** g∗ *is a continuous linear operator.*

Proof sketch: First show linearity, then continuity, for the operator that just maps a signed measure through g, using some equation-crunching and characterizations of continuity. Then, since g∗ is just the pair of that and the identity function, it's trivial to show that it's linear and continuous.

We'll use g′∗ to refer to the function M±(X)→M±(Y) defined by (g′∗(m))(Z)=m(g−1(Z)), where Z is a measurable subset of Y and g∈C(X,Y). Ie, this specifies what the measure g′∗(m) is in terms of telling you what value it assigns to all measurable subsets of Y.

We'll use g∗ to refer to the function M±(X)⊕R→M±(Y)⊕R given by g∗(m,b)=(g′∗(m),b).

Our first order of business is establishing the linearity of g′∗. Observe that, for all measurable Z⊆Y, and a,a′ being real numbers, and m,m′ being signed measures over X,

(g′∗(am+a′m′))(Z)=(am+a′m′)(g−1(Z))=am(g−1(Z))+a′m′(g−1(Z))

=ag′∗(m)(Z)+a′g′∗(m′)(Z)=(ag′∗(m)+a′g′∗(m′))(Z)

So, g′∗(am+a′m′)=ag′∗(m)+a′g′∗(m′) and we have linearity of g′∗.

Now for continuity of g′∗. Let mn limit to m. The sequence g′∗(mn) converging to g′∗(m) in our metric on M±(Y) is equivalent to: ∀f∈C(Y):limn→∞g′∗(mn)(f)=g′∗(m)(f)

So, if g′∗(mn) fails to converge to g′∗(m), then there is some continuous function f∈C(Y) that witnesses the failure of convergence. But, because g is a continuous function X→Y, then f∘g∈C(X), and also mn(f∘g)=g′∗(mn)(f), so:

limn→∞g′∗(mn)(f)=limn→∞mn(f∘g)=m(f∘g)=g′∗(m)(f)

The key step in the middle is that mn limits to m, so mn(f∘g) limits to m(f∘g), by our characterization of continuity. Thus, we get a contradiction, our f that witnesses the failure of convergence actually *does* converge. Therefore, g′∗(mn) limits to g′∗(m) if mn limits to m, so g′∗ is continuous.

To finish up, continuity for g∗ comes from the product of two continuous functions being continuous (g′∗ which we showed already, and idR because duh), and linearity comes from:

g∗(a(m,b)+a′(m′,b′))=g∗(am+a′m′,ab+a′b′)=(g′∗(am+a′m′),ab+a′b′)

=(ag′∗(m)+a′g′∗(m′),ab+a′b′)=a(g′∗(m),b)+a′(g′∗(m′),b′)=ag∗(m,b)+a′g∗(m′,b′)

**Proposition 8:** *If *f∈C(X,[0,1])* and *g* is a continuous function *X→Y*, then *Eg∗(H)(f)=EH(f∘g)

Eg∗(H)(f)=inf(m,b)∈(g∗(H))(m(f)+b)=inf(m,b)∈H(g′∗(m)(f)+b)

=inf(m,b)∈H(m(f∘g)+b)=EH(f∘g)
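Proposition 8 can be sanity-checked numerically on finite spaces (the particular X, Y, g, H, and f below are illustrative assumptions, not from the post):

```python
# Numeric check of Proposition 8: X = {0,1,2}, Y = {0,1}, with
# g(0)=0 and g(1)=g(2)=1 (all choices here are illustrative).
g = [0, 1, 1]
H = [((0.5, 0.25, 0.25), 0.0), ((0.2, 0.1, 0.3), 0.4)]

def push(m):
    # g'_*(m) assigns to each y the m-mass of g^{-1}(y)
    out = [0.0, 0.0]
    for x, mx in enumerate(m):
        out[g[x]] += mx
    return tuple(out)

def E(points, f):
    # E_H(f) = inf over (m, b) of (m(f) + b)
    return min(sum(mi * fi for mi, fi in zip(m, f)) + b for m, b in points)

f = (0.25, 0.75)                        # a function on Y
f_g = tuple(f[g[x]] for x in range(3))  # f o g, a function on X
gH = [(push(m), b) for m, b in H]
print(E(gH, f), E(H, f_g))              # the two expectations agree
```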

**Proposition 9:** g∗(H)* is a (bounded) inframeasure if *H* is, and it doesn't require upper completion if *g* is surjective.*

Proof sketch: Nonemptiness is obvious, and showing that it maps sa-measures to sa-measures is also pretty easy. Closure takes a rather long argument that the image of any closed subset of sa-measures over X, through g∗, is closed, which is fairly tedious. We may or may not invoke upper completion afterwards, but if we do, we can just appeal to the lemma that the upper completion of a closed set is closed. Convexity is immediate from linearity of g∗.

For upper completion, we can just go "we took the upper completion" if g isn't surjective, but we also need to show that we don't need to take the upper completion if g is surjective, which requires crafting a measurable inverse function to g via the Kuratowski-Ryll-Nardzewski selection theorem, in order to craft suitable preimage points.

Then we can use LF-Duality to characterize the induced h function, along with Proposition 8, which lets us get positive-minimals, bounded-minimals, and normalization fairly easily, wrapping up the proof.

Proof: Nonemptiness is obvious. For showing that it takes sa-measures to sa-measures, take an (m,b)∈H, and map it through to get (g′∗(m),b)∈g∗(H). (m,b) is an sa-measure, so b+m−(1)≥0. Now, we can use Lemma 5 to get:

b+(g′∗(m))−(1)=b+inff∈C(Y,[0,1])g′∗(m)(f)=b+inff∈C(Y,[0,1])m(f∘g)

≥b+inff′∈C(X,[0,1])m(f′)=b+m−(1)≥0

So the b term is indeed big enough that the image of (m,b) is an sa-measure.

For closure, fix a sequence of (mn,bn)∈g∗(H) limiting to some (m,b), with preimage points (m′n,b′n)∈H. Due to convergence of (mn,bn) there must be some b◯ bound on the bn. g∗ preserves those values, so b◯ is an upper bound on the b′n. Since the (m′n,b′n) are sa-measures, −b◯ is a lower bound on the m′−n(1) values. Since mn converges to m, mn(1) converges to m(1), so there's a λ◯ upper bound on the mn(1) values. Further,

λ◯≥mn(1)=g′∗(m′n)(1)=m′n(1∘g)=m′n(1)=m′+n(1)+m′−n(1)≥m′+n(1)−b◯

So, for all n, m′+n(1)≤λ◯+b◯, so we have an upper bound on the b′n and m′+n(1) values. Now we can invoke the Compactness Lemma to conclude that there's a convergent subsequence of the (m′n,b′n), with a limit point (m′,b′), which must be in H since H is closed. By continuity of g∗ from Lemma 6, g∗(m′,b′) must equal (m,b), witnessing that (m,b)∈g∗(H). So, g∗(H) is closed. Now, if we take upper completion afterwards, we can just invoke Lemma 2 to conclude that the upper completion of a closed set of sa-measures is closed.

Also, g∗ is linear from Lemma 6, so it maps convex sets to convex sets getting convexity.

Now for upper completion. Upper completion is immediate if g isn't surjective, because we had to take the upper completion there. Showing we don't need upper completion if g is surjective is trickier. We must show that g∗ is a surjection from Msa(X) to Msa(Y).

First, we'll show that g(U), where U is an open subset of X, is a measurable subset of Y. In metrizable spaces (of which X is one), every open set is an Fσ set, ie, it can be written as a countable union of closed sets. Because our space is compact, all those closed sets are compact. And the continuous image of a compact set is a compact set, ie closed. Therefore, g(U) is a countable union of closed sets, ie, measurable.

X is a Polish space (all compact metric spaces are Polish), it has the Borel σ-algebra, and we'll use the function g−1. Note that g−1(y) is closed and nonempty for all y∈Y due to g being a continuous surjection. Further, the set {y:g−1(y)∩U≠∅} equals g(U) for all open sets U. In one direction, if the point y is in the first set, then there's some point x∈U where g(x)=y. In the other direction, if a point y is in g(U), then there's some point x∈U where g(x)=y so g−1(y)∩U is nonempty.

Thus, g−1 is weakly measurable, because for all open sets U of X, {y:g−1(y)∩U≠∅}=g(U) and g(U) is measurable. Now, by the Kuratowski-Ryll-Nardzewski Measurable Selection Theorem, we get a measurable function g◊ from Y to X where g◊(y)∈g−1(y) so g(g◊(y))=y, and g◊ is an injection.

So, we can push any sa-measure of interest (m∗,b∗) through g◊∗ (which preserves the amount of negative measure due to being an injection), to get an sa-measure that, when pushed through g∗ recovers (m∗,b∗) exactly. Thus, if g∗(m,b)∈g∗(H), and you want to show g∗(m,b)+(m∗,b∗)∈g∗(H), just consider

g∗((m,b)+g◊∗(m∗,b∗))=g∗(m,b)+g∗(g◊∗(m∗,b∗))=g∗(m,b)+(m∗,b∗)

So, since (m,b)+g◊∗(m∗,b∗)∈H due to upper-completeness, then g∗((m,b)+g◊∗(m∗,b∗))=g∗(m,b)+(m∗,b∗)∈g∗(H) And we have shown upper-completeness of g∗(H) if g is a surjection.

We should specify something about using LF-Duality here. If you look back through the proof of Theorem 5 carefully, the only conditions you really *need* for isomorphism are (on the set side) g∗(H) being closed, convex, and upper complete (in order to use Proposition 2 to rewrite g∗(H) appropriately for the subsequent arguments, we have these properties), and (on the functional side), f↦Eg∗(H)(f) being concave (free), −∞ if range(f)⊈[0,1] (by proof of Theorem 4, comes from upper completeness), and continuous over f∈C(Y,[0,1]) (showable by Proposition 8 that Eg∗(H)(f)=EH(f∘g), and the latter being continuous since H is an infradistribution)

It's a bit of a pain to run through this argument over and over again, so we just need to remember that if you can show closure, convexity, upper completeness, and the expectations to be continuous, that's enough to invoke LF-Duality and clean up the minimal point conditions. We did that, so we can invoke LF-Duality now.

Time for normalization. From Proposition 8, the g∗(h) function we get from f↦Eg∗(H)(f) is uniquely characterized as: g∗(h)(f)=h(f∘g). So,

Eg∗(H)(1)=g∗(h)(1)=h(1∘g)=h(1)=EH(1)=1

Eg∗(H)(0)=g∗(h)(0)=h(0∘g)=h(0)=EH(0)=0

and normalization is taken care of.

For bounded-minimals/weak-bounded-minimals, since g∗(H) is the LF-dual of g∗(h), we can appeal to Theorem 5 and just check whether g∗(h) is Lipschitz/uniformly continuous. If d(f,f′)<δ, then d(f∘g,f′∘g)<δ according to the sup metric on C(Y,[0,1]) and C(X,[0,1]), respectively, which (depending on whether we're dealing with Lipschitzness or uniform continuity) implies that |h(f∘g)−h(f′∘g)|<λ⊙δ, or ϵ for uniform continuity. So, we get: |g∗(h)(f)−g∗(h)(f′)|=|h(f∘g)−h(f′∘g)|<λ⊙δ (or ϵ for uniform continuity), thus establishing that f and f′ being sufficiently close means that g∗(h) doesn't change much, which, by Theorem 5, implies bounded-minimals/weak-bounded-minimals in g∗(H).

For positive-minimals it's another Theorem 5 argument. If f′≥f, then f′∘g≥f∘g, so: g∗(h)(f′)−g∗(h)(f)=h(f′∘g)−h(f∘g)≥0 And we have monotonicity for g∗(h), which, by Theorem 5, translates into positive-minimals on g∗(H).

**Lemma 7:** *If *M∈(EζHi)min*, then for all decompositions of *M* into *Mi*, *Mi∈(Hi)min

This is easy. Decompose M into EζMi. To derive a contradiction, assume there exists a nonminimal Mi that decomposes into Mmini+M∗i where M∗i≠0. Then,

M=EζMi=Eζ(Mmini+M∗i)=Eζ(Mmini)+Eζ(M∗i)

Thus, we have decomposed our minimal point into another point which is also present in EζHi, and a nonzero sa-measure because there's a nonzero M∗i so our original "minimal point" is nonminimal. Therefore, all decompositions of a minimal point in the mixture set must have every component part being minimal as well.

**Proposition 10:** EEζHi(f)=Eζ(EHi(f))

EEζHi(f)=inf(m,b)∈EζHi(m(f)+b)=inf(mi,bi)∈ΠiHi((Eζmi)(f)+Eζbi)

=inf(mi,bi)∈ΠiHi(Eζ(mi(f))+Eζ(bi))=inf(mi,bi)∈ΠiHiEζ(mi(f)+bi)

=Eζ(inf(mi,bi)∈Hi(mi(f)+bi))=Eζ(EHi(f))

Done.
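A quick numeric check of Proposition 10 with two components (the two-point X, the sets Hi, ζ, and f are illustrative assumptions): the inf of the mixture equals the mixture of the infs, since the inf over the product set decomposes coordinate-wise.

```python
import itertools

# Mixing two finite sets of (m, b) pairs over a two-point X with
# weights zeta = (0.5, 0.5) (all choices here are illustrative).
H1 = [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.25)]
H2 = [((0.5, 0.5), 0.125)]
zeta = (0.5, 0.5)
f = (0.75, 0.25)

def ev(m, b):
    return sum(mi * fi for mi, fi in zip(m, f)) + b

# inf over the mixture set vs. mixture of the per-component infs
mixed_inf = min(zeta[0] * ev(*p1) + zeta[1] * ev(*p2)
                for p1, p2 in itertools.product(H1, H2))
inf_mixed = zeta[0] * min(ev(*p) for p in H1) + zeta[1] * min(ev(*p) for p in H2)
print(mixed_inf, inf_mixed)  # equal, as Proposition 10 states
```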

**Proposition 11:** *A mixture of infradistributions is an infradistribution. If it's a mixture of bounded infradistributions with Lipschitz constants on their associated *h* functions of *λ⊙i*, and *∑iζiλ⊙i<∞*, then the mixture is a bounded infradistribution.*

Proof sketch: Nonemptiness, convexity, upper completion, and normalization are pretty easy to show. Closure is a nightmare.

The proof sketch of Closure is: Take a sequence (mn,bn) limiting to (m,b). Since each approximating point is a mixture of points from the Hi, we can shatter each of these (mn,bn)∈EζHi into countably many (mi,n,bi,n)∈Hi. This defines a sequence in *each* Hi (not necessarily convergent). Then, we take some bounds on the (mn,bn) and manage to translate them into (rather weak) i-dependent bounds on the (mi,n,bi,n) sequence. This lets us invoke the Compactness Lemma and view everything as wandering around in a compact set, regardless of Hi. Then, we take the product of these compact sets to view everything as a *single* sequence in the product of compact sets, which is compact by Tychonoff's theorem. This is only a countable product of compact metric spaces, so we don't need full axiom of choice. Anyways, we isolate a convergent subsequence in there, which makes a convergent subsequence in each of the Hi. And then, we can ask "what happens when we mix the limit points in the Hi according to ζ?" Well, what we can do is just take a partial sum of the mixture of limit points, like the i from 0 to 1 zillion. We can establish that (m,b) gets arbitrarily close to the upper completion of a partial sum of the mixture of limit points, so (m,b) lies above all the partial sums of our limit points. We show that the partial sums don't have multiple limits, then, we just do one more invocation of Lemma 3 to conclude that the mixture of limit points lies below (m,b). Finally, we appeal to upper completion to conclude that (m,b) is in our mixed set of interest. Whew!

Once those first 4 are out of the way, we can then invoke Theorem 5 to translate to the h view, and mop up the remaining minimal-point conditions.

First, nonemptiness. By Theorem 5, we can go "hm, the hi are monotone on C(X,[0,1]), and −∞ everywhere else, and hi(1)=1, so the affine functional ϕ:ϕ(f)=1 lies above the graph of hi". This translates to the point (0,1) being present in all the Hi. Then, we can just go: Eζ(0,1)=(0,1), so we have a point in our EζHi set.

For normalization, appeal to Proposition 10 and normalization for all the Hi. EEζHi(1)=Eζ(EHi(1))=Eζ(1)=1 and EEζHi(0)=Eζ(EHi(0))=Eζ(0)=0.

Convexity is another easy one. Take M,M′∈EζHi. They shatter into Mi,M′i∈Hi. Then, we can just go:

p(m,b)+(1−p)(m′,b′)=pEζ(mi,bi)+(1−p)Eζ(m′i,b′i)=Eζ(p(mi,bi)+(1−p)(m′i,b′i))

and then, by convexity of the Hi, p(mi,bi)+(1−p)(m′i,b′i)∈Hi, so we wrote p(m,b)+(1−p)(m′,b′) as a mixture of points in Hi.

Upper completion is another easy one, because, if (m,b)∈EζHi, then you can go

(m,b)+(m∗,b∗)=Eζ(mi,bi)+Eζ(m∗,b∗)=Eζ((mi,bi)+(m∗,b∗))

And ((mi,bi)+(m∗,b∗))∈Hi by upper completion.

That leaves the nightmare of closure. Fix a sequence Mn∈Eζ(Hi) limiting to M. You can think of the Mn as (mn,bn). We can shatter the Mn into Mi,n∈Hi, where Mi,n can be thought of as (mi,n,bi,n).

Now, since the Mn converge to something, there must be upper bounds on the bn and mn(1) terms of the sequence; call them b◯ and λ◯. Now, for all n and all i′, b◯≥bn=∑iζibi,n≥ζi′bi′,n, so, for all n and i, bi,n≤b◯/ζi.

Also, for all n and i′, λ◯+b◯≥mn(1)+bn=∑i(ζi(mi,n(1)+bi,n))≥ζi′(mi′,n(1)+bi′,n), and reshuffling, we get (λ◯+b◯)/ζi′≥mi′,n(1)+bi′,n, which then makes (λ◯+b◯)/ζi′≥m+i′,n(1)+(m−i′,n(1)+bi′,n). Further, due to (mi′,n,bi′,n) being an sa-measure, bi′,n+m−i′,n(1)≥0, so for all n and i, m+i,n(1)≤(λ◯+b◯)/ζi.

Ok, so taking stock of what we've shown so far: for all i, the sequence Mi,n is roaming about within Hi∩{(m,b)|b≤b◯/ζi,m+(1)≤(λ◯+b◯)/ζi}. And, by the Compactness Lemma, this set is compact, since it's got bounds (weak bounds, but bounds nonetheless). Defining

¯Mn∈∏i(Hi∩{(m,b)|b≤b◯/ζi,m+(1)≤(λ◯+b◯)/ζi})

where ¯Mn(i):=Mi,n, we can view everything as one single sequence ¯Mn wandering around in the product of compact sets. By Tychonoff's theorem (we've only got a countable product of compact metric spaces, so we don't need full axiom of choice; dependent choice suffices), we can fix a convergent subsequence of this, and the projections of this subsequence to every Hi converge.

Ok, so we've got a subsequence of n where, regardless of i, the Mi,n converge to some Mi∈Hi (by closure of Hi). How does that help us? We don't even know if mixing these limit points converges to something or runs off to infinity. Well... fix any j you like; we'll just look at the partial sum of the first j components. Also fix any ϵ you please. On our subsequence of interest, the Mn converge to M, and in all i, the Mi,n converge to Mi. So, let n be large enough (and in our subsequence) that d(Mn,M)<ϵ and ∀i≤j:d(Mi,n,Mi)<ϵ; we can always find such an n.

Now, ∑i≤jζiMi+∑i>jζiMi,n is a well-defined point (because it's a finite sum of points plus a convergent series, as witnessed by the well-definedness of Mn, which breaks down as ∑iζiMi,n). It also lies in the upper completion of the single point ∑i≤jζiMi. We'll show that this point is close to M. Since we're working in a space with a norm,

d(M+M∗,M′+M∗)=||(M+M∗)−(M′+M∗)||=||M−M′||=d(M,M′)

This will come in handy in the later equations.

d(∑i≤jζiMi+∑i>jζiMi,n,M)≤d(∑i≤jζiMi+∑i>jζiMi,n,Mn)+d(Mn,M)

<d(∑i≤jζiMi+∑i>jζiMi,n,∑iζiMi,n)+ϵ=d(∑i≤jζiMi,∑i≤jζiMi,n)+ϵ

≤∑i≤jd(ζiMi,ζiMi,n)+ϵ=∑i≤j||ζiMi−ζiMi,n||+ϵ=∑i≤jζi||Mi−Mi,n||+ϵ

=∑i≤jζid(Mi,Mi,n)+ϵ<∑i≤jζiϵ+ϵ≤ϵ+ϵ=2ϵ

So, M is less than 2ϵ away from the upper completion of the point ∑i≤jζiMi, which is a closed set (the Minkowski sum of a closed and a compact set is closed). ϵ can be shrunk to 0 with increasing n, so M has distance 0 from the upper completion of said partial sum, and thus lies above the partial sum!

Abbreviating ∑i≤jζiMi as Mj, we get that all the Mj lie in {M}−Msa(X), and are all sa-measures. Thus, *if* the sequence Mj converges to a unique point, then said limit point is ∑iζiMi, and all the Mi∈Hi, so ∑iζiMi would lie in EζHi. Further, by Lemma 3, ∑iζiMi∈{M}−Msa(X), since that set is compact, so M lies above ∑iζiMi, and would lie in EζHi by upper-completeness.

So, all that's left to wrap up our closure argument is showing that the sequence Mj has a *single* limit point. Since it's wandering around in ({M}−Msa(X))∩Msa(X) which is compact by Lemma 3, there are convergent subsequences. All we have to show now is that all convergent subsequences must have the same limit point.

Assume this is false, and there are two distinct limit points of the sequence Mj; call them M∞ and M′∞. Because it's impossible for two points to *both* be above one another (in the minimal-point/adding-points sense) without the two points being identical, either M∞∉{M′∞}−Msa(X), or vice-versa. Without loss of generality, assume M∞∉{M′∞}−Msa(X). Since the latter is a closed set, M∞ must be ϵ away from it for some ϵ>0. Fix some j from the subsequence that M∞ is a limit point of, where d(Mj,M∞)<ϵ/2. There must be some *strictly greater* j′ from the subsequence that M′∞ is a limit point of.

Mj′=∑i≤j′ζiMi=∑i≤jζiMi+∑j<i≤j′ζiMi=Mj+∑j<i≤j′ζiMi

Further, the ζi are nonzero. Also, no Mi can be the 0 point, because Mi∈Hi, and if Mi=(0,0), then EHi(1)=0, which is impossible by normalization. So, Mj lies strictly below Mj′. Also, Mj′ lies below M′∞, because for all j∗>j′,

Mj∗=∑i≤j∗ζiMi=∑i≤j′ζiMi+∑j′<i≤j∗ζiMi=Mj′+∑j′<i≤j∗ζiMi

so Mj∗∈{Mj′}+Msa(X) for all j∗>j′. The sequence that limits to M′∞ is roaming around in this set, which is closed because the sum of a compact set (a single point) and a closed set is closed. So, M′∞ lies above Mj′, which lies above Mj. Thus, Mj∈{M′∞}−Msa(X). However, Mj is ϵ/2 or less distance from M∞, which must be ϵ distance from {M′∞}−Msa(X), and we have a contradiction.

Ok, so the sequence of partial sums Mj has a *single* limit point, which is ∑iζiMi, and all the Mi∈Hi, so ∑iζiMi∈EζHi, and by Lemma 3, ∑iζiMi∈{M}−Msa(X), since that set is compact, so M lies above ∑iζiMi, and lies in EζHi by upper-completeness. We're done!

For minimals, by our argument about what it takes to invoke LF-Duality in Proposition 9, we only need convexity, closure, and upper completion (which we have), and that the h induced by EζHi is continuous. By Proposition 10, EEζHi(f)=Eζ(EHi(f))=Eζ(hi(f))=(Eζhi)(f). We might as well go for uniform continuity, since all the Hi are infradistributions, and so fulfill weak-bounded-minimals, so their hi are uniformly continuous. Then, this continuity lets you invoke LF-Duality, and transfer uniform continuity for the h induced by EζHi to weak-bounded-minimals for EζHi.

For uniform continuity/weak-bounded-minimals, given an arbitrary ϵ, we can pick a finite j where ∑i>jζi<ϵ/2, and a finite δ where, for all hi with i≤j, d(f,f′)<δ implies |hi(f)−hi(f′)|<ϵ/2. Monotonicity and normalization for the hi ensure that, no matter what, hi(f)∈[0,1], so regardless of f and f′, |hi(f)−hi(f′)|≤1. Then, we can go: Ok, if d(f,f′)<δ, then

|Eζ(hi(f))−Eζ(hi(f′))|≤Eζ|hi(f)−hi(f′)|

=∑i≤jζi|hi(f)−hi(f′)|+∑i>jζi|hi(f)−hi(f′)|

<∑i≤jζi(ϵ/2)+∑i>jζi<∑iζi(ϵ/2)+ϵ/2=ϵ/2+ϵ/2=ϵ

And by our earlier argument, we invoke LF-Duality and pick up weak-bounded-minimals.

For positive-minimals, we can just observe that, if f′≥f, then

(Eζhi)(f′)=Eζ(hi(f′))≥Eζ(hi(f))=(Eζhi)(f)

By monotonicity for the hi because Hi had positive-minimals. Going back to EζHi, since its associated h is monotone, it must have positive-minimals as well.

For bounded minimals assuming the Lipschitz constants aren't too big, fix some ϵ. We know that ∑iζiλ⊙i<∞, where λ⊙i is the Lipschitz constant of hi. So, if d(f,f′)<ϵ, then:

|Eζ(hi(f))−Eζ(hi(f′))|≤Eζ|hi(f)−hi(f′)|=∑iζi|hi(f)−hi(f′)|<∑iζiλ⊙iϵ

So, ∑iζiλ⊙i is a finite constant, and is an upper bound on the Lipschitz constant of the mixture of the hi, so the h corresponding to EζHi has a Lipschitz constant, which, by Theorem 5, translates to bounded-minimals. And we're done.
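This Lipschitz-constant bound on the mixture can be illustrated numerically. A minimal sketch, assuming a hypothetical finite example where each hi(f)=λi·min(f), which is monotone, concave, and λi-Lipschitz in the sup-norm:

```python
import numpy as np

# The mixture sum_i zeta_i * h_i of lambda_i-Lipschitz functionals is Lipschitz
# with constant at most sum_i zeta_i * lambda_i.
rng = np.random.default_rng(1)
k, n = 50, 10
zeta = rng.random(k); zeta /= zeta.sum()       # mixture weights, sum to 1
lam = rng.random(k) * 3.0                      # Lipschitz constants lambda_i

def mix(f):
    # (E_zeta h_i)(f), with h_i(f) = lambda_i * min(f)
    return sum(zeta[i] * lam[i] * f.min() for i in range(k))

f = rng.random(n)
fp = f + rng.uniform(-0.1, 0.1, n)             # a nearby function
gap = abs(mix(f) - mix(fp))
bound = (zeta * lam).sum() * np.abs(f - fp).max()
assert gap <= bound + 1e-12
```

Each component contributes at most ζiλi·d(f,f′) to the gap, and the contributions sum to the claimed constant.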

**Proposition 12:** g∗(Eζ(Hi))=Eζ(g∗(Hi))

Let's use Theorem 5 to translate this into the concave functional setting. We want to show that g∗(Eζhi)=Eζ(g∗(hi)) Now, given any function f∈C(Y,[0,1]),

(g∗(Eζhi))(f)=(Eζhi)(f∘g)=Eζ(hi(f∘g))=Eζ(g∗(hi)(f))=(Eζ(g∗(hi)))(f)

and we're done! The two concave functionals corresponding to those two sets are the same, so the sets themselves are the same.

**Lemma 8:** *The "raw update" *ugL:Msa(X)→Msa(L)* defined by *(m,b)↦(m⋅L,b+m(0★Lg))* is a continuous linear operator.*

For linearity,

ugL(a(m,b)+a′(m′,b′))=ugL(am+a′m′,ab+a′b′)

=((am+a′m′)⋅L,ab+a′b′+(am+a′m′)(0★Lg))

=(a(m⋅L)+a′(m′⋅L),ab+a′b′+am(0★Lg)+a′m′(0★Lg))

=a(m⋅L,b+m(0★Lg))+a′(m′⋅L,b′+m′(0★Lg))=augL(m,b)+a′ugL(m′,b′)

Now for continuity. mn⋅L limits to m⋅L if, for all f∈C(supp(L)) (where supp(L) denotes the closed support of L), (mn⋅L)(f) limits to (m⋅L)(f). Observe that (m⋅L)(f)=m(f★L0), and f★L0 is continuous.

Now, for any f we can go

limn→∞((mn⋅L)(f))=limn→∞(mn(f★L0))=m(f★L0)=(m⋅L)(f)

establishing continuity in the first vector component, by mn limiting to m. For the second vector component,

m(0★Lg)+b=limn→∞(mn(0★Lg))+limn→∞bn=limn→∞(mn(0★Lg)+bn)

So we have continuity in the second vector component as well, and we're done.
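The raw update and its linearity are easy to check in a finite-dimensional stand-in. A minimal sketch, assuming a hypothetical 5-point space where a signed measure m is a vector, L and g are vectors in [0,1]^n, and (f★Lg)=L·f+(1−L)·g pointwise:

```python
import numpy as np

def star(f, L, g):
    # f star_L g = L*f + (1-L)*g, pointwise
    return L * f + (1 - L) * g

def raw_update(m, b, L, g):
    # (m, b) |-> (m * L, b + m(0 star_L g))
    return m * L, b + m @ star(np.zeros_like(m), L, g)

# Linearity check (Lemma 8): u(a*M + a'*M') = a*u(M) + a'*u(M')
rng = np.random.default_rng(2)
n = 5
L, g = rng.random(n), rng.random(n)
m, b = rng.standard_normal(n), rng.random()
mp, bp = rng.standard_normal(n), rng.random()
a, ap = 2.0, -0.5
lhs = raw_update(a * m + ap * mp, a * b + ap * bp, L, g)
u1, u2 = raw_update(m, b, L, g), raw_update(mp, bp, L, g)
assert np.allclose(lhs[0], a * u1[0] + ap * u2[0])
assert np.isclose(lhs[1], a * u1[1] + ap * u2[1])
```

In finite dimensions continuity is automatic for a linear map, so only the linearity computation needs checking here.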

**Lemma 9:** (ugL(H))min⊆ugL(Hmin)

As a recap, the raw update function ugL is: (m,b)↦(m⋅L,b+m(0★Lg))

Take a point (m,b)∈(ugL(H))min. Now there must be a preimage point (m′,b′)∈H that, when we apply ugL, produces (m,b). Because (m′,b′) is in an infradistribution, we can decompose it into a minimal point and something else, (m′,b′)=(mmin,bmin)+(m∗,b∗). Then,

(m,b)=ugL((m′,b′))=ugL((mmin,bmin)+(m∗,b∗))=ugL(mmin,bmin)+ugL(m∗,b∗)

This was done by using linearity of ugL via Lemma 8.

Note that, since we have written (m,b) as a sum of a different point also in ugL(H) and an sa-measure, but (m,b) is minimal in ugL(H), the sa-measure must be 0, so (m,b)=ugL(mmin,bmin)∈ugL(Hmin), and we're done.

**Proposition 13:** *When updating a bounded infradistribution over *Msa(X)*, if the renormalization doesn't fail, you get a bounded infradistribution over the set *Msa(L)*. (for infradistributions in general, you may have to take the closure)*

Proof sketch: It doesn't matter whether you take upper-completion before or after renormalization, so we can appeal to Proposition 7: Renormalizing a bounded inframeasure produces a bounded infradistribution (if the renormalization doesn't fail).

So, we just have to show nonemptiness, convexity, upper-completion (trivial), positive-minimals/bounded minimals (by Lemma 9, the preimage of a minimal point contains a minimal point, so we can transfer over the properties from the minimal point in the preimage), and closure. The set of minimal points in H is contained in a compact set, so we can take a sequence in (ugL(H))uc, split into a component in ugL(H) and something else, take preimage points, get minimals below all of them, isolate a convergent subsequence, map the limit point back through, and show that the limit point lands under your point of interest. That establishes all conditions for a bounded inframeasure, so then we just have to check that our renormalization is the right one to do.

Proof: Nonemptiness is trivial, since ugL isn't a partial function. Upper-completion is also trivial, because we explicitly took the upper completion. For convexity, observe that ugL is a linear operator by Lemma 8, so it maps convex sets to convex sets, and the Minkowski sum of two convex sets is convex. ugL maps sa-measures to sa-measures, because

b+m(0★Lg)+(m⋅L)−(1)=b+m(0★Lg)+(m−⋅L)(1)

=b+m(0★Lg)+m−(1★L0)=b+m+(0★Lg)+m−(0★Lg)+m−(1★L0)

≥b+m−(1★Lg)≥b+m−(1)≥0

For positive-minimals and bounded-minimals, we invoke Lemma 9, (ugL(H))min⊆ugL(Hmin). All minimal points in ugL(H) must have a preimage minimal in H, which is an a-measure. Chopping down a measure by L keeps it a measure, so we still have no negative components post-update, and all minimal points in ugL(H) are a-measures. Similarly, chopping down a measure by L reduces the λ value, and we had an upper bound of λ⊙ originally, so the upper bound still works post-update. This gets bounded-minimals.

This just leaves closure. Fix a sequence Mn in (ugL(H))uc limiting to M. The Mn break down into ugL(M′n)+M∗n, where M′n∈H. M′n further breaks down into Mminn+M∗∗n, where Mminn∈Hmin. By Proposition 5, the Mminn sequence is wandering around in a compact set, since we have bounded-minimals on H, so there's a convergent subsequence, which has a limit point Mmin. Map that convergent subsequence and limit point through ugL, which is continuous by Lemma 8, to get a sequence of points ugL(Mminn) limiting to ugL(Mmin)∈ugL(H). Fix some really big n where d(M,Mn)<ϵ and d(ugL(Mminn),ugL(Mmin))<ϵ.

Now, ugL(Mmin)+ugL(M∗∗n)+M∗n lies in the upper completion of the point ugL(Mmin). We'll show that this sum of 3 terms is close to M. Since we're working in a Banach space, d(x+y,z+y)=d(x,z), by norm arguments.

d(ugL(Mmin)+ugL(M∗∗n)+M∗n,M)≤d(ugL(Mmin)+ugL(M∗∗n)+M∗n,Mn)+d(Mn,M)

<d(ugL(Mmin)+ugL(M∗∗n)+M∗n,ugL(M′n)+M∗n)+ϵ

=d(ugL(Mmin)+ugL(M∗∗n),ugL(M′n))+ϵ=d(ugL(Mmin)+ugL(M∗∗n),ugL(Mminn+M∗∗n))+ϵ

=d(ugL(Mmin)+ugL(M∗∗n),ugL(Mminn)+ugL(M∗∗n))+ϵ=d(ugL(Mmin),ugL(Mminn))+ϵ<2ϵ

So, M is within 2ϵ of the upper completion of {ugL(Mmin)} for all ϵ, and it's a closed set, so M lies above ugL(Mmin)∈ugL(H), so M∈(ugL(H))uc, and we have closure.

Now that all prerequisite conditions have been established, we just need to show that 1/PgH(L) and EH(0★Lg) are the proper renormalization constants to use.

The proper renormalization to use is 1/(E(ugL(H))uc(1)−E(ugL(H))uc(0)) for the scale, and E(ugL(H))uc(0) for the shift. So let's unpack these quantities.

E(ugL(H))uc(0)=EugL(H)(0)=inf(m,b)∈ugL(H)b=inf(m,b)∈H(b+m(0★Lg))=EH(0★Lg)

So, our shift constant checks out, it's the proper shift constant to use. In the other direction,

E(ugL(H))uc(1)=EugL(H)(1)=inf(m,b)∈ugL(H)(m(1)+b)

=inf(m,b)∈H((m⋅L)(1)+b+m(0★Lg))=inf(m,b)∈H(m(1★L0)+b+m(0★Lg))

=inf(m,b)∈H(m(1★Lg)+b)=EH(1★Lg)

For the scale constant, observe that 1/(E(ugL(H))uc(1)−E(ugL(H))uc(0))=1/(EH(1★Lg)−EH(0★Lg))=1/PgH(L)

So our scale constant is also the right scale constant to use. Now, we can invoke Proposition 7: Renormalizing a bounded inframeasure produces a bounded infradistribution if the renormalization doesn't fail.
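The unpacking of the updated expectations can be verified on a finite toy H. A minimal sketch, assuming a hypothetical set of a-measures (m,b) over a 4-point space:

```python
import numpy as np

# Check that E_{u(H)}(1) = E_H(1 star_L g) and E_{u(H)}(0) = E_H(0 star_L g),
# so the scale and shift constants above are the right ones.
rng = np.random.default_rng(3)
n, k = 4, 8
H = [(rng.random(n), rng.random()) for _ in range(k)]   # a-measures (m, b)
L, g = rng.random(n), rng.random(n)
star = lambda f: L * f + (1 - L) * g                    # f star_L g

def E(points, f):
    return min(m @ f + b for m, b in points)

# Raw update of each point: (m, b) |-> (m * L, b + m(0 star_L g))
updated = [(m * L, b + m @ star(np.zeros(n))) for m, b in H]
assert np.isclose(E(updated, np.ones(n)), E(H, star(np.ones(n))))
assert np.isclose(E(updated, np.zeros(n)), E(H, star(np.zeros(n))))
```

Upper completion doesn't change the expectations, so checking against the raw-updated points suffices.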

**Proposition 14:** EH(f★Lg)=EH(0★Lg)+PgH(L)EH|gL(f)

Proof: if PgH(L)≠0, then

EH(0★Lg)+PgH(L)EH|gL(f)=EH(0★Lg)+PgH(L)(inf(m,b)∈H|gL(m(f)+b))

=EH(0★Lg)+PgH(L)(inf(m,b)∈H((1/PgH(L))(m⋅L)(f)+(1/PgH(L))(b+m(0★Lg)−EH(0★Lg))))

=EH(0★Lg)+inf(m,b)∈H((m⋅L)(f)+b+m(0★Lg)−EH(0★Lg))

=inf(m,b)∈H((m⋅L)(f)+b+m(0★Lg))

=inf(m,b)∈H(m(f★L0)+b+m(0★Lg))=inf(m,b)∈H(m(f★Lg)+b)=EH(f★Lg)

Now, if PgH(L)=0, then EH(1★Lg)=EH(0★Lg). For any f∈C(X,[0,1]), (1★Lg)≥(f★Lg)≥(0★Lg) pointwise, so by monotonicity for the h induced by H, h(1★Lg)≥h(f★Lg)≥h(0★Lg), and since h(1★Lg)=h(0★Lg), we get h(f★Lg)=h(0★Lg). Therefore,

EH(0★Lg)+PgH(L)EH|gL(f)=EH(0★Lg)+0=EH(f★Lg)

and we get our same result.
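Proposition 14 can also be spot-checked numerically on a finite toy H. A minimal sketch, assuming a hypothetical set of a-measures over a 4-point space, with the renormalized update set built explicitly:

```python
import numpy as np

# Check E_H(f star_L g) = E_H(0 star_L g) + P^g_H(L) * E_{H|L}(f).
rng = np.random.default_rng(3)
n = 4
H = [(rng.random(n), rng.random()) for _ in range(6)]   # a-measures (m, b)
L, g, f = rng.random(n), rng.random(n), rng.random(n)
star = lambda f: L * f + (1 - L) * g

def E(points, f):
    return min(m @ f + b for m, b in points)

E0 = E(H, star(np.zeros(n)))
P = E(H, star(np.ones(n))) - E0                          # P^g_H(L), nonzero here
# Renormalized update: (m, b) |-> (m*L/P, (b + m(0 star_L g) - E0) / P)
updated = [(m * L / P, (b + m @ star(np.zeros(n)) - E0) / P) for m, b in H]
assert np.isclose(E(H, star(f)), E0 + P * E(updated, f))
```

The shift and scale cancel inside the infimum exactly as in the chain of equalities above.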

**Proposition 15:** (H|gL)|g′L′=H|g″(LL′), *where* g″:=g★((1−L)/(1−LL′))g′

Proof sketch: First, we do some shuffling around of the stars to get a lemma that will help. Then, we can use the link between updated sets and their associated concave functionals h, getting the identity purely on the concave functional level, where it's much easier to approach.

Proof: First, the star shuffling. For any f,g,g′,L,L′∈C(X,[0,1]), we'll show that

f★(LL′)(g★((1−L)/(1−LL′))g′)=(f★L′g′)★Lg.

Let's begin. First, let's deal with points x where L(x)=L′(x)=1, because that gets you a divide-by-zero error.

(f★(LL′)(g★((1−L)/(1−LL′))g′))(x)=L(x)L′(x)f(x)+(1−L(x)L′(x))(g★((1−L)/(1−LL′))g′)(x)

=L(x)L′(x)f(x)+0+0=L(x)L′(x)f(x)+L(x)⋅0⋅g′(x)+0⋅g(x)

=L(x)L′(x)f(x)+L(x)(1−L′(x))g′(x)+(1−L(x))g(x)

=L(x)(L′(x)f(x)+(1−L′(x))g′(x))+(1−L(x))g(x)

=((L′f+(1−L′)g′)★Lg)(x)=((f★L′g′)★Lg)(x)

and we're done with the divide-by-zero case. In the other case, we can safely assume there's no divide-by-zero errors.

f★(LL′)(g★((1−L)/(1−LL′))g′)=LL′f+(1−LL′)(g★((1−L)/(1−LL′))g′)

=LL′f+(1−LL′)(((1−L)/(1−LL′))g+(1−(1−L)/(1−LL′))g′)

=LL′f+(1−LL′)(((1−L)/(1−LL′))g+((1−LL′−1+L)/(1−LL′))g′)

=LL′f+(1−L)g+(1−LL′−1+L)g′=LL′f+(1−L)g+L(1−L′)g′

=L(L′f+(1−L′)g′)+(1−L)g=(L′f+(1−L′)g′)★Lg=(f★L′g′)★Lg
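The star identity just derived can be spot-checked numerically. A minimal sketch, assuming hypothetical random vectors on a 6-point space, with L and L′ kept below 1 to avoid the divide-by-zero case handled separately above:

```python
import numpy as np

# Check f star_{LL'} (g star_{(1-L)/(1-LL')} g') = (f star_{L'} g') star_L g.
rng = np.random.default_rng(4)
f, g, gp = rng.random(6), rng.random(6), rng.random(6)
L, Lp = rng.uniform(0, 0.99, 6), rng.uniform(0, 0.99, 6)
star = lambda f, L, g: L * f + (1 - L) * g
inner = star(g, (1 - L) / (1 - L * Lp), gp)
lhs = star(f, L * Lp, inner)
rhs = star(star(f, Lp, gp), L, g)
assert np.allclose(lhs, rhs)
```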

Ok, so we've established our crucial f★(LL′)(g★((1−L)/(1−LL′))g′)=(f★L′g′)★Lg identity. Let's proceed. Updates for concave functionals are: (h|gL)(f)=(h(f★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg))

Importing Proposition 14, EH(f★Lg)=EH(0★Lg)+PgH(L)EH|gL(f), and rearranging it (and unpacking the definition of PgH(L)), we get EH|gL(f)=(EH(f★Lg)−EH(0★Lg))/(EH(1★Lg)−EH(0★Lg))

So, updating fulfills the positive functional definition of update, because this transfers into (h|gL)(f)=(h(f★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg)), which is exactly our concave functional definition of updating. So, in order to verify that the two updates equal the one big update, we can just show that their concave functional definitions are equivalent. (H|gL)|g′L′ would, on the concave functional level, turn into:

((h|gL)|g′L′)(f)=((h|gL)(f★L′g′)−(h|gL)(0★L′g′))/((h|gL)(1★L′g′)−(h|gL)(0★L′g′))

=((h((f★L′g′)★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg))−(h((0★L′g′)★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg)))/((h((1★L′g′)★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg))−(h((0★L′g′)★Lg)−h(0★Lg))/(h(1★Lg)−h(0★Lg)))

=(h((f★L′g′)★Lg)−h(0★Lg)−h((0★L′g′)★Lg)+h(0★Lg))/(h((1★L′g′)★Lg)−h(0★Lg)−h((0★L′g′)★Lg)+h(0★Lg))

=(h((f★L′g′)★Lg)−h((0★L′g′)★Lg))/(h((1★L′g′)★Lg)−h((0★L′g′)★Lg))

and now we can use our earlier star identity to rewrite as:

=(h(f★(LL′)g″)−h(0★(LL′)g″))/(h(1★(LL′)g″)−h(0★(LL′)g″)), where g″ abbreviates g★((1−L)/(1−LL′))g′,

=(h|g″(LL′))(f)

establishing our identity of updating twice, vs one big update of a different form.

**Corollary 2:** *Regardless of L, L′, and g,* (H|gL)|gL′=H|g(LL′)

Just use Proposition 15, and notice that g★((1−L)/(1−LL′))g=((1−L)/(1−LL′))g+(1−(1−L)/(1−LL′))g=g, getting us our result.

**Corollary 3:** *If *Y* and *Z* are clopen sets, then, abusing notation by glossing over the difference between indicator functions and sets,* (H|gY)|gZ=H|g(Y∩Z)

Invoke Corollary 2, and observe that 1Y⋅1Z=1Y∩Z.

**Lemma 10:** PgEζHi(L)=Eζ(PgHi(L))

Proof: Invoke Proposition 10 to go:

PgEζHi(L)=EEζHi(1★Lg)−EEζHi(0★Lg)=Eζ(EHi(1★Lg))−Eζ(EHi(0★Lg))

=Eζ(EHi(1★Lg)−EHi(0★Lg))=Eζ(PgHi(L))

**Theorem 6:** (EζHi)|gL=Eζ(PgHi(L)⋅(Hi|gL))/Eζ(PgHi(L)), *if the update doesn't fail.*

Proof: Let ζ′ be defined as ζ′i:=ζiPgHi(L)/∑jζjPgHj(L). It is a well-defined probability distribution: if all the PgHi(L) were 0, then Eζ(PgHi(L))=0, and so by Lemma 10, PgEζHi(L)=0, which would cause the update to fail.

The left-hand-side corresponds to (Eζhi)|gL on the concave functional level, and the right-hand-side corresponds to Eζ′(hi|gL) on the concave functional level. Let's begin unpacking. Lemma 10 will be used throughout, as well as the definition of PgHi(L).

(Eζ′(hi|gL))(f)=Eζ′((hi|gL)(f))=∑i((ζiPgHi(L)/∑jζjPgHj(L))⋅(hi(f★Lg)−hi(0★Lg))/(hi(1★Lg)−hi(0★Lg)))

=∑i((ζiPgHi(L)/∑jζjPgHj(L))⋅(hi(f★Lg)−hi(0★Lg))/PgHi(L))=∑i(ζi(hi(f★Lg)−hi(0★Lg))/∑jζjPgHj(L))

=(∑iζi(hi(f★Lg)−hi(0★Lg)))/Eζ(PgHi(L))=Eζ(hi(f★Lg)−hi(0★Lg))/PgEζHi(L)

=(Eζ(hi(f★Lg))−Eζ(hi(0★Lg)))/(Eζ(hi(1★Lg))−Eζ(hi(0★Lg)))=((Eζhi)(f★Lg)−(Eζhi)(0★Lg))/((Eζhi)(1★Lg)−(Eζhi)(0★Lg))=((Eζhi)|gL)(f)

So, (Eζhi)|gL=Eζ′(hi|gL) as desired, which shows our result.
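Theorem 6 can be sanity-checked at the concave-functional level on finite toys. A minimal sketch, assuming two hypothetical finite sets of a-measures over a 4-point space, with hi(f)=min(m,b)(m(f)+b):

```python
import numpy as np

# Check (E_zeta h_i)|gL = E_zeta'(h_i|gL) with zeta'_i = zeta_i P_i / sum_j zeta_j P_j.
rng = np.random.default_rng(5)
n = 4
zeta = np.array([0.4, 0.6])
Hs = [[(rng.random(n), rng.random()) for _ in range(5)] for _ in range(2)]
L, g, f = rng.random(n), rng.random(n), rng.random(n)
star = lambda f: L * f + (1 - L) * g

def h(H, f):
    return min(m @ f + b for m, b in H)

def upd(H, f):                       # (h_i | g L)(f)
    lo, hi = h(H, star(np.zeros(n))), h(H, star(np.ones(n)))
    return (h(H, star(f)) - lo) / (hi - lo)

P = np.array([h(H, star(np.ones(n))) - h(H, star(np.zeros(n))) for H in Hs])
zp = zeta * P / (zeta * P).sum()     # the reweighted distribution zeta'

hmix = lambda f: sum(z * h(H, f) for z, H in zip(zeta, Hs))   # E_zeta h_i
lo, hi = hmix(star(np.zeros(n))), hmix(star(np.ones(n)))
lhs = (hmix(star(f)) - lo) / (hi - lo)                        # (E_zeta h_i)|gL
rhs = sum(zp[i] * upd(Hs[i], f) for i in range(2))            # E_zeta'(h_i|gL)
assert np.isclose(lhs, rhs)
```

The reweighting by zeta' exactly cancels the per-component normalizers, as in the algebra above.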


### Proofs Section 2.1 (Theorem 1, Lemmas)

Fair upfront warning: This is not a particularly readable proof section. There's a bunch of dense notation, logical leaps due to illusion of transparency since I've spent months getting fluent with these concepts, and a relative lack of editing since it's long. If you really want to read this, I'd suggest PM-ing me to get a link to MIRIxDiscord, where I'd be able to guide you through it and answer questions. This post will be recapping the notions and building up an arsenal of lemmas, the next one will show the isomorphism theorem, translation theorems, and behavior of mixing, and the last one is about updates and the decision-theory results. It's advised to have them open in different tabs and go between them as needed.
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}

With that said, let's establish some notation and mental intuition. I'll err on the side of including more stuff, because of the illusion of transparency. First, visualize the tree of alternating actions and observations in an environment. A full policy π can be viewed as that tree with some branches pruned off, specifying every history that's *possible* with your policy of interest. All our policies are deterministic. A policy stub πst is a policy tree that's been mostly pruned down (it doesn't extend further than some finite time n). A partial policy πpa is just any policy tree in any state of specification or lack thereof, from tiny stubs to full policies to trees that are infinite down some branches but not others.

π∅ denotes the empty policy (a stub) which specifies nothing about what a policy does, and π¬h is some partial policy which specifies everything (acts like a full policy) everywhere except on history h and afterwards.

There's a distance metric on histories, as well as a distance metric on partial policies. Both of them are of the form γt where γ<1, and t is the "time of first difference". For histories, it's "what's the first time these histories differ", for policies, it's "what's the shortest time by which one partial policy is defined and the other is undefined, or where the policies differ on what to do". So, thinking of the distance as getting smaller as the time of first difference gets bigger is a reliable guide.
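As a concrete toy version of the history metric, assuming histories are tuples of action/observation symbols and taking γ=0.5 (the constant, the function names, and the symbols are all illustrative, not from the original construction):

```python
# A minimal sketch of the gamma^t distance on histories, assuming
# histories are finite tuples of action/observation symbols.
GAMMA = 0.5

def first_difference(h1, h2):
    """Index of the first position where the histories differ. If one is
    a strict prefix of the other, the first difference is at the first
    position where only one of them is defined."""
    for t, (a, b) in enumerate(zip(h1, h2)):
        if a != b:
            return t
    if len(h1) != len(h2):
        return min(len(h1), len(h2))
    return float("inf")  # identical histories

def history_distance(h1, h2):
    t = first_difference(h1, h2)
    return 0.0 if t == float("inf") else GAMMA ** t

# Later first difference => smaller distance.
assert history_distance(("a", "o1"), ("a", "o2")) < history_distance(("a",), ("b",))
```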

The outcome set F(πpa) is... take the tree corresponding to πpa, and it's the set of all the observation leaf nodes and infinite paths. No matter what, if you're interacting with an environment and acting according to πpa, the history you get is *guaranteed* to have, as a prefix, something in F(πpa). FNF(πpa) is that same set but minus all the Nirvana observations. Nirvana is a special observation which can occur at any time, counts as infinite reward, and ends the history. This is our compact metric space of interest that we're using to define a-measures and sa-measures. We assume that there's only finitely many discrete actions/observations available at any given point in time.

In this setting, sa-measures and a-measures over FNF(πpa) are defined as usual (a pair of a signed measure m and a number b where b+m−(1)≥0 for sa-measures, and a measure m with no negative parts and b≥0, respectively), because there's no infinite reward shenanigans. Sa-measures over F(πpa) require a technicality, though, which is that no nirvana event can have negative measure. λ will denote the total amount of measure you have. So, for a probability distribution, λ will be 1. We'll just use this for a-measures, and talk freely about the λ and b values of an a-measure. We use the KR-metric for measuring the distance between sa-measures (or a-measures), which is like "if two measures are really similar for a long time and then start diverging at late times, they're pretty similar." It's also equivalent to the earthmover distance, which is "how much effort does it take to rearrange the pile-of-dirt-that-is-this-measure into the pile-of-dirt-that-is-that-measure."
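The earthmover picture can be made concrete in the simplest possible setting: for two finitely-supported probability distributions on the real line, the earthmover (Wasserstein-1) distance is the integral of the absolute difference of their CDFs. This is only a one-dimensional toy to build intuition, not the actual KR metric on spaces of histories:

```python
# A 1-D toy of the earthmover distance: for finitely-supported
# probability distributions on the real line, Wasserstein-1 equals the
# integral of |CDF difference|. The data here is made up.
def earthmover_1d(mu, nu):
    """mu, nu: dicts mapping points on the line to probability mass."""
    xs = sorted(set(mu) | set(nu))
    total, cdf_gap = 0.0, 0.0
    for x_prev, x in zip(xs, xs[1:]):
        cdf_gap += mu.get(x_prev, 0.0) - nu.get(x_prev, 0.0)
        total += abs(cdf_gap) * (x - x_prev)
    return total

# Moving all the mass a distance of 2 costs 2.
assert earthmover_1d({0.0: 1.0}, {2.0: 1.0}) == 2.0
```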

One important note. While m(f) is "what's the expectation of the continuous function f over histories, according to the measure we have", we frequently abuse notation and use m(h) to refer to what technically should be "what's the expectation of the indicator function for "this history has h as a prefix" w.r.t the measure". The reason we can do this is because the indicator function for the finite history h is a continuous function! So we can just view it as "what's the measure assigned to history h". Similarly, f★hg is the continuous function that's f on histories with h as a prefix and g on histories without h as a prefix.
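A toy version of the m(h) abuse of notation, with a measure stored on complete histories and m(h) read off as the expectation of the prefix-indicator (the example measure is made up):

```python
# Sketch of the m(h) notation: a measure over complete histories, with
# m(h) read as the total mass of histories extending the prefix h.
def prefix_mass(m, h):
    """m: dict mapping complete histories (tuples) to mass."""
    return sum(mass for hist, mass in m.items() if hist[:len(h)] == h)

m = {("a", "o1"): 0.25, ("a", "o2"): 0.5, ("b", "o1"): 0.25}
assert prefix_mass(m, ("a",)) == 0.75  # mass of everything starting with "a"
```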

For a given Ma(F(πpa)) or the nirvana-free variant, NF is just the subset of that where the measure components of the a-measures assign 0 measure to Nirvana occurring. They're safe from infinite reward. We suppress the dependency on πpa. Similarly,

EB(f)=inf(m,b)∈B∩NF(m(f)+b)

because if a Nirvana-containing measure was selected by Murphy, you'd get infinite expected value, so Murphy won't pick anything with Nirvana in it. Keep that in mind.
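A minimal sketch of that expectation, assuming a finite toy set B of a-measures over finite histories, with Nirvana marked by an "N" symbol (all the data and names here are illustrative):

```python
# Toy version of E_B(f): Murphy minimizes m(f)+b over B, but skips any
# a-measure giving Nirvana positive mass (infinite reward for us).
def expectation(B, f):
    """B: list of (m, b), m a dict over histories; f: dict of payoffs."""
    def nirvana_free(m):
        return all(mass == 0 for h, mass in m.items() if "N" in h)
    values = [sum(m[h] * f[h] for h in m) + b for m, b in B if nirvana_free(m)]
    return min(values)

B = [
    ({("o",): 1.0, ("N",): 0.0}, 0.5),  # nirvana-free, value 1.5
    ({("o",): 1.0, ("N",): 0.0}, 1.0),  # nirvana-free, value 2.0
    ({("o",): 0.5, ("N",): 0.5}, 0.0),  # Nirvana has positive mass: skipped
]
f = {("o",): 1.0, ("N",): 0.0}
assert expectation(B, f) == 1.5  # Murphy never picks the Nirvana option
```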

There's a fiddly thing to take into account about upper completion. We're usually working in the space of a-measures Ma(F(πpa)) or the nirvana-free equivalent. But the variant of upper completion we impose on our sets is: take the nirvana-free part of your set of interest, take the upper completion w.r.t the cone of nirvana-free sa-measures, then intersect with a-measures again. So, instead of the earlier setting where we could have any old sa-measure in our set and we could add any old sa-measure to them, now, since we're working purely in the space of a-measures and only demanding upper closure of the nirvana-free part, our notion of upper completion is something more like "start with a nirvana-free a-measure, you can add a nirvana-free sa-measure to it, and adding them has to make a nirvana-free a-measure".

Even worse, this is the notion of upper completion we impose, but for checking whether a point counts as minimal, we use the cone of sa-measures (with nirvana). So, for certifying that a point is non-minimal, we have to go "hey, there's another a-measure where we can add an sa-measure and make our point of interest". A different notion of upper completion here.

And, to make matters even worse, sometimes we do arguments involving the cone of sa-measures or nirvana-free sa-measures and don't impose the a-measure restriction. I'll try to clarify which case we're dealing with, but I can't guarantee it'll all be clear or sufficiently edited.

There's a partial ordering on partial policies, which is πpa≥π′pa if the two policies never disagree on which action to take, and πpa is defined on more histories than π′pa is (is a bigger tree). So, instead of viewing a partial policy as a tree, we can view the set of partial policies as a big poset. The full policies π are at the top, the empty policy π∅ is at the bottom. Along with this, we've got two important notions. One is the fundamental sequence of a partial policy. Envisioning it at the tree level, πnpa is "the tree that is πpa, just cut off at level n". Envisioning it at the poset level, the sequence πnpa is a chain of points in our poset ascending up to the point πpa.
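The fundamental sequence is easy to sketch if a partial policy is modelled as a dict from observation-histories to actions (a hypothetical encoding, purely for illustration):

```python
# Sketch of the fundamental sequence: pi^n_pa is pi_pa cut off at
# level n. A partial policy is modelled as a dict from
# observation-histories (tuples) to actions; truncation keeps only the
# entries defined before time n. Encoding and names are made up.
def truncate(policy, n):
    return {h: a for h, a in policy.items() if len(h) < n}

pi = {(): "a0", ("o1",): "a1", ("o2",): "a0", ("o1", "o1"): "a1"}
assert truncate(pi, 1) == {(): "a0"}        # tiny stub at the bottom
assert truncate(pi, 3) == pi                # the chain ascends up to pi
```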

Also synergizing with the partial-order view, we've got functions heading down the partial policy poset. prπhipa,πlopa∗ is the function that takes an a-measure or sa-measure over F(πhipa), and is like "ok, everything in F(πhipa) has a unique prefix in F(πlopa), push your measure component down, keep the b term the same". A good way of intuiting this is that this sort of projection describes what happens when you crunch down a measure over 10-bit-bitstrings to a measure over 8-bit-bitstrings. So view your poset of partial policies as being linked together by a bunch of projection arrows heading down.
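A sketch of that projection in the bitstring picture, in miniature (3-bit strings crunched down to 2-bit prefixes; the encoding is illustrative):

```python
# Projection pr_*: push the measure component of an a-measure down to
# unique prefixes, keeping the b term the same.
def project(a_measure, prefix_len):
    m, b = a_measure
    m_lo = {}
    for h, mass in m.items():
        p = h[:prefix_len]
        m_lo[p] = m_lo.get(p, 0.0) + mass
    return (m_lo, b)  # b is untouched

m = {"000": 0.25, "001": 0.25, "010": 0.25, "110": 0.25}
m_lo, b = project((m, 1.0), 2)
assert m_lo == {"00": 0.5, "01": 0.25, "11": 0.25}
```

Note that the total measure is preserved, matching the later fact that projections preserve the λ and b values.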

There's a function Θ mapping each partial policy πpa to a set of a-measures over F(πpa) (or the nirvana-free variant), fulfilling some special properties. Maybe Θ is only defined over policy stubs or full policies, in which case we use Θst or Θω, respectively. So, the best mental visualization/sketching aid for a lot of the proofs is your partial policies of interest with an ordering on them where bigger=up and smaller=down, and a set/point for each one. That organizes things fairly well, and is how many of these proofs were created.

Every Θ (or the stub/full policy analogue) is associated with a λ⊙ and b⊙ value, which is "smallest upper bound on the λ of the minimal points of the Θ(πpa) sets" and "smallest upper bound on the b of the minimal points of the Θ(πpa) sets". Accordingly, the set {≤⊙} is defined as {(λμ,b)|λ+b≤λ⊙+b⊙}, and is a way of slicing out a bounded region of a set that contains all minimal points, if we need to do compactness arguments.

Finally, we'll reiterate two ultra-important results from basic inframeasure theory that get used a *lot* in here and will be tossed around casually for arguments without citing where they came from. There's the Compactness Lemma, which says that if you have a bound on the λ values and the b values of a closed set of a-measures, the set is compact. There's also Theorem 2, which says that you can break down any sa-measure into (minimal point + sa-measure), we use that decomposition a whole lot.

Other results we'll casually use: projections commute (projecting down and then down again is the same as doing one big projection down); projections are linear (it doesn't matter whether you mix before or after projecting); projections don't expand distance (if two a-measures are ϵ apart before being projected down, they'll be ϵ or less apart after being projected down); if two a-measures are distinct in Ma(F(πpa)), then the measure components differ at *some* finite time (or the b terms differ), so we can project down to some finite πnpa (same thing, just end history at time n) and they'll still be different; and projections preserve the λ and b value of an a-measure.
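The projections-commute fact can be checked concretely in the same bitstring picture as before (again an illustrative encoding, not the actual construction):

```python
# Projecting 3-bit strings down to 2 bits and then to 1 bit matches
# projecting straight down to 1 bit: projections commute.
def project(m, k):
    out = {}
    for h, mass in m.items():
        out[h[:k]] = out.get(h[:k], 0.0) + mass
    return out

m = {"000": 0.25, "011": 0.25, "101": 0.5}
assert project(project(m, 2), 1) == project(m, 1)
```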

One last note, Ma(∞) is the space of a-measures on nirvana-free histories. This is *all* histories, not just the ones compatible with a specific policy. And a surmeasure SM is like a measure, but it can assign 0+ value to a nirvana event, marking it as "possible" even though it has 0 (arbitrarily low) measure.

Now, we can begin. Our first order of business is showing how the surtopology/surmetric/surmeasures are made and link together, but the bulk of this is the Isomorphism theorem, which takes about 20 lemmas of setup to compress all the tools we need for it, and then the proof itself is extremely long. After that, things go a bit faster.

**Lemma 1:** d(x,x′):=max(d1(x,x′),d2(x,x′)) *is a metric if *d1* is a metric and *d2* is a pseudometric.*

For identity of indiscernibles, d(x,x′)=0→d1(x,x′)=0→x=x′ because d1 is a metric, and in the reverse direction, if x=x′, then d2(x,x′)=0 (pseudometrics have 0 distance from a point to itself) and d1(x,x′)=0, so d(x,x′)=0.

For symmetry, both metrics and pseudometrics have symmetry, so

d(x,x′)=max(d1(x,x′),d2(x,x′))=max(d1(x′,x),d2(x′,x))=d(x′,x)

For triangle inequality, both metrics and pseudometrics fulfill the triangle inequality, so

d(x,z)=max(d1(x,z),d2(x,z))≤max(d1(x,y)+d1(y,z),d2(x,y)+d2(y,z))

≤max(d1(x,y),d2(x,y))+max(d1(y,z),d2(y,z))=d(x,y)+d(y,z)

And we're done.
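A numeric spot-check of Lemma 1, using d1=|x−y| (a metric) and d2(x,y)=|x²−y²| (a pseudometric, since d2(1,−1)=0; both choices are just illustrative stand-ins):

```python
# Brute-force check that d = max(d1, d2) satisfies the triangle
# inequality when d1 is a metric and d2 is a pseudometric.
import itertools
import random

def d1(x, y): return abs(x - y)          # a metric on the reals
def d2(x, y): return abs(x * x - y * y)  # a pseudometric: d2(1, -1) = 0
def d(x, y):  return max(d1(x, y), d2(x, y))

random.seed(0)
pts = [random.uniform(-2, 2) for _ in range(30)]
for x, y, z in itertools.product(pts, repeat=3):
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-12
```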

**Lemma 2:** *The surmetric is a metric.*

To recap, the surmetric over sa-measures is

ds((m,b),(m′,b′)):=max(d(m,m′)+|b−b′|,γt(m,m′))

where γ<1, and t(m,m′) is the minimum length of a Nirvana-containing history that has positive measure according to m and 0 measure according to m′ (or vice-versa). We'll show that γt(m,m′) acts as a pseudometric, and then invoke Lemma 1.

The first three conditions of nonnegativity, γt(m,m)=γ∞=0, and symmetry are immediate. That just leaves checking the triangle inequality. Let t1:=t(m1,m2), t2:=t(m2,m3), and t3:=t(m1,m3).

Assume t3<min(t1,t2). Then, going from m1 to m2, all changes in the possibility of a Nirvana-history take place strictly *after* t3, and going from m2 to m3, all changes in the possibility of a Nirvana-history also take place strictly after t3, so m1 and m3 behave identically (w.r.t. Nirvana-possibilities) up to and including time t3, which is impossible, because t3 is "what's the shortest time where m1 and m3 disagree on the possibility of a Nirvana-history".

Therefore, t3≥min(t1,t2), and this case is ruled out.

In one case, t3>min(t1,t2). Without loss of generality, assume the minimum is t1, so t1<t3. Then, γt3<γt1≤γt1+γt2 and the triangle inequality is shown.

In the other case, t3=min(t1,t2). Without loss of generality, assume the minimum is t1. Then, γt3=γt1≤γt1+γt2, and the triangle inequality is shown.

Therefore, γt(m,m′) is a pseudometric. Now, we can invoke Lemma 1 to show that ds is a metric.

**Theorem 1:** *The surmetric on the space of sa-measures *Ma(F(πpa))* induces the surtopology. The Cauchy completion of *Ma(F(πpa))* w.r.t the surmetric is exactly the space of sa-surmeasures.*

Proof sketch: First, use the metric to get an entourage (check the Wikipedia page on "Uniform Space"), and use the entourage to get a topology. Then, we go in both directions, and check that entourage-open sets are open according to the surtopology and the surtopology subbasis sets are entourage-open, to conclude that the topology induced by the surmetric is exactly the surtopology. Then, for the Cauchy completion, we'll show a bijection between equivalence classes of Cauchy sequences w.r.t. the surmetric and sa-surmeasures.

The surmetric is ds((m,b),(m′,b′)):=max(d(m,m′)+|b−b′|,γt(m,m′)) where γ<1, and t is the minimum length of a Nirvana-containing history that has positive measure according to m and 0 measure according to m′ (or vice-versa).

From the Wikipedia page on "Uniform Space", a fundamental system of entourages for Msa(F(πpa)) is given by

{(M,M′)∈Msa(F(πpa))×Msa(F(πpa)):ds(M,M′)≤ϵ}

A set O is open w.r.t. the uniformity iff for all M∈O, there exists an entourage V where V[M] lies entirely within O (wikipedia page). Because V is a subset of Msa(F(πpa))×Msa(F(πpa)), V[M] is the set of all second components that V pairs with the given sa-measure M.

So, let's say O is open w.r.t. the uniformity. Then, for all M∈O, there's an entourage V where V[M] lies entirely within O. A fundamental system of entourages has the property that every entourage is a superset of some set from the fundamental system. Thus, from our earlier definition of the fundamental system, there exists some ϵM where

{M′∈Msa(F(πpa)):ds(M,M′)≤ϵM}⊆O

We'll construct an open set from the surtopology that is a subset of this set and contains M, as follows. First, observe that if ds(M,M′)≤ϵM, then d(M,M′)≤ϵM and γt(m,m′)≤ϵM. For the latter, there are finitely many nirvana-containing histories with a length less than logγ(ϵM)+1, and if an M′ matches M w.r.t. which nirvana-containing histories of that finite set are possible or impossible, then γt(m,m′)<ϵM (because M and M′ then only differ on which Nirvana-histories are possible at very late times).

Accordingly, intersect the following sets:

1: the open ball centered at M with a size of ϵM

2: For all the nirvana-histories hN where |hN|≤logγ(ϵM)+1 and m(hN)>0, intersect all the sets of a-measures where that history has positive measure. These are open because they're the complements of "this finite history has zero measure", which is a closed set of sa-measures.

3: For all the nirvana-histories hN where |hN|≤logγ(ϵM)+1 and m(hN)=0, intersect all the sets of a-measures where that nirvana-history has 0 measure. These are open because they are subbasis sets for the surtopology.

We intersected finitely many open sets, so the result is open. Due to 2 and 3 and our earlier discussion, any M′ in the intersection must have γt(m,m′)<ϵM. Due to 1, d(M,M′)<ϵM.

This finite intersection of open sets (in the surtopology) produces an open set that contains M (obviously) and is a subset of {M′∈Msa(F(πpa)):ds(M,M′)≤ϵM}, which is a subset of V[M] which is a subset of O.

Because this argument can be applied to *every* point M∈O to get an open set (in the surtopology) that contains M and is a subset of O, we can make O itself by just unioning all our open sets together, which shows that O is open in the surtopology.

In the reverse direction, let's show that all sets in the subbasis of the surtopology are open w.r.t. the uniform structure.

First, we'll address the open balls around a point M. Every point M′ in such an open ball has some ϵM′-sized open ball which fits entirely within the original open ball. Then, we can just consider our entourage V being

{(M,M′′)∈Msa(F(πpa))×Msa(F(πpa)):ds(M,M′′)≤ϵM′/2}

And then V[M′] is all points that are ϵM′/2 or less away from M′ according to the surmetric, and ds(M′,M′′)≥d(M′,M′′), so this is a subset of the ϵM′-sized ball around M′, which is a subset of the ball around M.

The extra measure we added in total on step n is (note that no nirvana-history can have a length of 0, so we start at 1, and t denotes timesteps in the history)

∑t≥1∑hN:|hN|=t(2−(n+t)/#(t))=∑t≥1#(t)⋅(2−(n+t)/#(t))≤∑t≥12−(n+t)=2−n∑t≥12−t≤2−n

So, as n increases, the deviation of this sequence of sa-measures from the limit sa-surmeasure approaches 0 w.r.t. the usual metric, and every component in this sequence agrees with the others and the limit sa-surmeasure on which nirvana events are possible or impossible, so it's a Cauchy sequence limiting to the sa-surmeasure of interest.
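The geometric-series bound in that computation can be sanity-checked numerically (the truncation at T terms is just so the sum is finite; the function name is made up):

```python
# The extra measure added on step n: summing 2^-(n+t) over t >= 1 gives
# 2^-n * sum_t 2^-t <= 2^-n, regardless of how many nirvana histories
# #(t) exist at each length (the #(t) factors cancel).
def extra_measure(n, T=40):
    return sum(2 ** -(n + t) for t in range(1, T + 1))

assert extra_measure(3) <= 2 ** -3
```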

Thus, all parts have been shown. The surmetric induces the surtopology, and the Cauchy completion of the sa-measures w.r.t. the surmetric is the set of sa-surmeasures. The same proof works if you want a-surmeasures (get it from the a-measures), or surmeasures (get it from the measures).

Alright, now we can begin the lemma setup for the Isomorphism Theorem, which is the Big One. See you again at Lemma 21.

**Lemma 3:** *If *B⊆Ma(FNF(πst))* and *B* is nonempty, closed, convex, nirvana-free upper-complete, and has bounded-minimals, then *c.h(Bmin)=c.h(Bxmin)

So, first, Bxmin refers to the set of extreme minimal points of B. An extreme point of B is one that cannot be written as a mixture of other points in B.

Proof Sketch: One subset direction, c.h(Bmin)⊇c.h(Bxmin) is immediate. For the other direction, we need a way to write a minimal point as a finite mixture of extreme minimal points. What we do is first show that all extreme points in B must lie below the λ⊙+b⊙ bound by crafting a way to write them as a mix of different points with upper completion if they violate the bound. Then, we slice off the top of B to get a compact convex set with all the original minimal (and extreme) points in it. Since πst is a policy stub, FNF(πst) has finitely many possible outcomes, so we're working in a finite-dimensional vector space. In finite dimensions, a convex compact set is the convex hull of its extreme points, which are all either (extreme points in B originally), or (points on the high hyperplane we sliced at). Further, a minimal point can only be made by mixing together other minimal points. Putting this together, our minimal point of interest can be made by mixing together extreme minimal points, and the other subset direction is immediate from there.

Proof: As stated in the proof sketch, one subset direction is immediate, so we'll work on the other one. To begin with, fix a Mex that is extreme in B. It's an a-measure. If Mex has λ+b>λ⊙+b⊙, then it's not minimal (B has bounded-minimals), so we can decompose it into a minimal point Mmin respecting the bound and some nonzero sa-measure M∗. Mex=Mmin+M∗. Now, consider the point Mmin+(m∗−,−m∗−(1)) instead. We're adding on the negative part of m∗, and *just* enough of a b term to compensate, so it's an sa-measure. The sum of these two points is an a-measure, because we already know from Mex being an a-measure that the negative part of m∗ isn't enough to make any negative parts when we add it to mmin.

Anyways, summing the two parts like that saps a bit from the λ value of Mmin, but adds an equal amount on the b value, so it lies below the λ⊙+b⊙ "barrier", and by nirvana-free upper-completeness, it also lies in B. Then, we can express Mex as

Mex=Mmin+M∗=Mmin+(m∗,b∗)=Mmin+(m∗−,−m∗−(1))+(m∗+,b∗+m∗−(1))

=0.5(Mmin+(m∗−,−m∗−(1)))+0.5(Mmin+(m∗−,−m∗−(1))+2(m∗+,b∗+m∗−(1)))

Now, we already know that Mmin+(m∗−,−m∗−(1)) is an a-measure, and (m∗+,b∗+m∗−(1)) is an a-measure (no negative parts, end term is ≥0). So, we just wrote our extreme point as a mix of two distinct a-measures, so it's not extreme. Contradiction. Therefore, all extreme points have λ+b≤λ⊙+b⊙.
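The algebra of that mixture is just the identity 0.5A+0.5(A+2B)=A+B applied to a-measures; a quick finite-dimensional spot-check (the coordinate vectors stand in for (measure components, b) and are made up):

```python
# Spot-check of writing M_ex = M_lo + M_plus as the even mixture
# 0.5*M_lo + 0.5*(M_lo + 2*M_plus), showing a point with a nonzero
# "upper" part is a mix of two distinct points, hence not extreme.
def mix(p, M1, M2):
    return tuple(p * x + (1 - p) * y for x, y in zip(M1, M2))

def add(M1, M2):
    return tuple(x + y for x, y in zip(M1, M2))

M_lo = (2.0, 1.0, 1.0)    # stands in for M_min + (m*_-, -m*_-(1))
M_plus = (1.0, 0.0, 2.0)  # stands in for (m*_+, b* + m*_-(1))
M_ex = add(M_lo, M_plus)
assert mix(0.5, M_lo, add(M_lo, tuple(2 * x for x in M_plus))) == M_ex
```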

Let's resume. From bounded-minimals, we know that B has a suitable bound on λ+b, so the minimal points respect the λ⊙+b⊙ bound. Take B and chop it off at some high hyperplane, like λ+b≤2(λ⊙+b⊙) (the constant 2 isn't that important; it just has to be above 1 so we net all the extreme points and minimal points). Call the resulting set C. Due to the bound, and B being closed, C is compact by the Compactness Lemma. It's also convex.

Now, invoke the Krein-Milman theorem to get C=c.h(extreme(C)) (we're in a finite-dimensional space, since we're working with a stub and there are finitely many observation leaf nodes, so we don't need to take the closure afterwards; check the bottom of the Wikipedia page for the Krein-Milman theorem). The only extreme points in C are either points that were originally extreme in B, or points on the high hyperplane that we chopped at.

Fix some Mmin∈Bmin. Bmin⊆C, so Mmin∈C. Thus, Mmin can be written as a finite mixture of points from extreme(C). However, because Mmin is minimal, it can only be a mixture of minimal points, as we will show now.

Decompose Mmin into EζMi, and then decompose the Mi into Mmini+M∗i. To derive a contradiction, assume there exists some i′ where Mi′ isn't minimal, so that M∗i′ isn't 0. Then,

Mmin=EζMi=Eζ(Mmini+M∗i)=Eζ(Mmini)+Eζ(M∗i)

Thus, we have decomposed our minimal point into another point which is also present in B, and a nonzero sa-measure because M∗i′ is nonzero, so our original minimal point is actually nonminimal, and we have a contradiction. Therefore, all decompositions of a minimal point into a mix of points must have every component point being minimal as well.

So, when we decomposed Mmin into a mix of points in extreme(C), all the extreme points we decomposed it into are minimal, so there's no component on the high hyperplane. Mmin was arbitrary in Bmin, establishing that Bmin⊆c.h(Bxmin). Therefore, c.h(Bmin)⊆c.h(Bxmin).

So we have our desired result.

**Lemma 4:** *If *πpa≥πhist≥πlost*, and *A⊆Ma(F(πhist))* and *B⊆Ma(F(πlost))* (also works with the nirvana-free variants) and *prπhist,πlost∗(A)⊆B* then *(prπpa,πhist∗)−1(A)⊆(prπpa,πlost∗)−1(B) *This works for surmeasures too.*

A point M in the preimage of A has prπpa,πhist∗(M)∈A, and by projections commuting and projecting down further landing you in B, we get prπpa,πlost∗(M)∈B, so M is in the preimage of B too.

**Lemma 5:** *Given a partial policy *πpa* and stub *πst*, if *πpa≥πst*, then *∃n:πnpa≥πst

πst is a stub that specifies less about what the policy does than πpa, and because it's a stub it has a minimum time beyond which it's guaranteed to be undefined, so just let that be your n. πnpa then specifies everything that πst does, and maybe more, because it has all the data of πpa up till time n.

**Lemma 6:** *If *πpa* is a partial policy, and* ∀πlost,πhist≥πlost:prπhist,πlost∗(Θst(πhist))⊆Θst(πlost) *holds, then, for all *m*, *⋂πst≤πpa(prπpa,πst∗)−1(Θst(πst))=⋂n≥m(prπpa,πnpa∗)−1(Θst(πnpa)) *This works for surmeasures too.*

First, all the πnpa≤πpa are stubs, so we get one subset direction immediately.

⋂πst≤πpa(prπpa,πst∗)−1(Θst(πst))⊆⋂n≥m(prπpa,πnpa∗)−1(Θst(πnpa))

In the other direction, use Lemma 5 to find a πnpa≥πst, with n≥m, and then pair

∀πlost,πhist≥πlost:prπhist,πlost∗(Θst(πhist))⊆Θst(πlost)

with Lemma 4 to deduce that

(prπpa,πnpa∗)−1(Θst(πnpa))⊆(prπpa,πst∗)−1(Θst(πst))

Due to being able to take any stub preimage and find a smaller preimage amongst the fundamental sequence for πpa (with an initial segment clipped off) we don't need anything other than the preimages of the fundamental sequence (with an initial segment clipped off), which establishes the other direction and thus our result.

**Lemma 7:** *If *M* is an a-measure and *prπpa,πst∗(M)=M′* and *M′=M′lo+M′∗* and *M′lo* is an a-measure, then there exists a *Mlo∈Ma(F(πpa))* (or the nirvana-free variant) s.t. *prπpa,πst∗(Mlo)=M′lo* and there's an sa-measure *M∗* s.t. *Mlo+M∗=M*. This works for a-surmeasures and sa-surmeasures too.*

What this essentially says is "let's say we start with a M and project it down to M′, and then find a point M′lo below M′. Can we "go back up" and view M′lo as the projection of some point below M? Yes". It's advised to sketch out the setup of this one, if not the proof itself.

Proof sketch: To build our Mlo and M∗, the b components are preserved, but crafting the measure component for them is tricky. They've gotta project down to M′lo and M′∗ so those two give us our base case to start working from with the measures (and automatically get the "must project down appropriately" requirement) and then we can recursively build up by extending both of them with the conditional probabilities that M gives us. However, we must be wary of division-by-zero errors and accidentally assigning negative measure on Nirvana, which complicates things considerably. Once we've shown how to recursively build up the measure components of our Mlo and M∗, we then need to check four things. That they're both well formed (sum of measure on 1-step extensions of a history=measure on the history, no semimeasures here), that they sum up to make M, the measure component of Mlo can't go negative anywhere (to certify that it's an a-measure), and that the b term attached to M∗ is big enough to cancel out the negative regions (to certify that it's an sa-measure).

Proof: Let Mlo=(mlo,blo) where blo is the b term of M′lo. Let M∗=(m∗,b∗) where b∗ is the b term of M′∗. Recursively define mlo and m∗ on h that are prefixes of something in F(πpa) (or the nirvana-free variant) as follows:

If h is a prefix of something in F(πst) (or the nirvana-free variant), mlo(h)=m′lo(h) and m∗(h)=m′∗(h). That defines our base case. Now for how to inductively build up by mutual recursion. Let's use haN for a nirvana-history and hao for a non-nirvana history.

If m∗(h)<0, then

mlo(haN)=m(haN), mlo(hao)=m(hao)−m∗(h)/#o

m∗(haN)=0, m∗(hao)=m∗(h)/#o

#o is the number of non-nirvana observations that can come after ha.

If m∗(h)≥0 and m(h)>0, then

mlo(hao)=(m(hao)/m(h))⋅mlo(h), m∗(hao)=(m(hao)/m(h))⋅m∗(h)

and the same holds for defining mlo(haN) and m∗(haN).

If m∗(h)≥0 and m(h)=0, then mlo(hao)=m∗(hao)=0

We need to verify that these sum up to m, that they're both well-formed signed measures, that mlo has no negative parts, and that the b value for M∗ is big enough. mlo having no negative parts is immediate by the way we defined it, because it's nonnegative on all the base cases since m′lo came from an a-measure, and m came from an a-measure as well which lets you use induction to transfer that property all the way up the histories.

To verify that they sum up to m, observe that for base-case histories in F(πst),

mlo(h)+m∗(h)=m′lo(h)+m′∗(h)=m′(h)=m(h)

For non-base-case histories hao we can use induction (assume it's true for h) and go:

Negative case, m∗(h)<0.

mlo(haN)+m∗(haN)=m(haN)+0=m(haN)

mlo(hao)+m∗(hao)=m(hao)−m∗(h)/#o+m∗(h)/#o=m(hao)

Nonnegative case, no division by zero.

mlo(hao)+m∗(hao)=(m(hao)/m(h))(mlo(h)+m∗(h))=(m(hao)/m(h))⋅m(h)=m(hao)

Zero case: m(hao)=0 because m(h)=0 and m came from an a-measure and has no negative parts. mlo(hao)+m∗(hao)=0+0=0=m(hao)

Ok, so we've shown that mlo+m∗=m.

What about checking that they're well-formed signed measures? To do this, it suffices to check that summing their measure-mass over the one-step extensions of h (the haoi, and haN if present) recovers the measure-mass on h. This works over the base case, so we just have to check the induction steps.

In the negative case, for m∗,

m∗(haN)+∑im∗(haoi)=∑i(m∗(h)/#o)=m∗(h)

and for mlo

mlo(haN)+∑imlo(haoi)=m(haN)+∑i(m(haoi)−m∗(h)/#o)=m(h)−m∗(h)=mlo(h)

In the nonnegative case, no division by zero, then

mlo(haN)+∑imlo(haoi)=(m(haN)/m(h))⋅mlo(h)+∑i(m(haoi)/m(h))⋅mlo(h)

=(mlo(h)/m(h))(m(haN)+∑im(haoi))=(mlo(h)/m(h))⋅m(h)=mlo(h)

And similar for m∗.

In the zero case where m(h)=0, we need to show that mlo(h) and m∗(h) will *also* be zero. Winding h back, there's some longest prefix h′ where m(h′)>0. Now, knowing that m(h′)=mlo(h′)+m∗(h′), we have two possible options here.

In the first case, m∗(h′)≥0, so mlo(h′ao) (advancing one step) is:

mlo(h′ao)=(m(h′ao)/m(h′))⋅mlo(h′)=(0/m(h′))⋅mlo(h′)=0

And similar for m∗(h′ao), so they're both 0, along with m′, on h′ao, and then the zero case transfers the "they're both zero" property all the way up to h.

In the second case, m∗(h′)<0 and mlo(h′)>0. Then, proceeding forward, m∗(h′ao)<0, and this keeps holding all the way up to h, so we're actually in the negative case, not the zero case.

So, in the case where m∗(h)≥0, m(h)=0 implies mlo(h)=m∗(h)=0. Then it's easy: mlo(haN)+∑imlo(haoi)=0=mlo(h), and the same for m∗.

Also, m∗, by the way we defined it, never puts negative measure on a nirvana event, so we're good there, they're both well-formed signed measures. For the b∗ value being sufficient to compensate for the negative-measure of m∗, observe that the way we did the extension, the negative-measure for m∗ is the same as the negative measure for m′∗, and b∗=b′∗, and the latter is sufficient to cancel out the negative measure for m′∗, so we're good there.

We're done now, and this can be extended to a-surmeasures by taking the 0+ nirvana-events in m and saying that all those nirvana-events have 0+ measure in mlo.

**Lemma 8:** *Having a *λ+b* bound on a set of a-surmeasures is sufficient to ensure that it's contained in a compact set w.r.t the surtopology.*

This is the analogue of the Compactness Lemma for the sur-case. We'll keep it in the background instead of explicitly invoking it each time we go "there's a bound, therefore compactness". It's important.

Proof sketch: Given a sequence, the bound gets convergence of the measure part by the Compactness Lemma, and then we use Tychonoff to show that we can get a subsequence where the a-surmeasures start agreeing on which nirvana events are possible or impossible, for all nirvana events, so their first time of disagreement gets pushed arbitrarily far out, forcing convergence w.r.t. the surmetric.

Proof: given a sequence of a-surmeasures SMn, and rounding them off to their "standard" part (slicing off the 0+ probability), we can get a convergent subsequence, where the measure part and b part converges, by the Compactness Lemma since we have a λ+b bound, which translates into bounds on λ and b.

Now, all we need is a subsequence of that subsequence that ensures that, for each nirvana-event, the sequence of a-surmeasures starts agreeing on whether it's possible or impossible. There are countably many finite histories, and each nirvana-history is a finite history, so we index our nirvana events by natural numbers, and we can view our sequence as wandering around within {0,1}ω, where the t'th coordinate keeps track of whether the t'th nirvana event is possible or impossible. {0,1}ω is compact by Tychonoff's Theorem, so we can find a convergent subsequence, which corresponds to a sequence of a-surmeasures that, for any nirvana event, eventually start agreeing on whether it's possible or impossible. There's finitely many nirvana events before a certain finite time, so if we go far enough out in the n, the a-surmeasures agree on what nirvana events are possible or impossible for a very long time, and so the surdistance shrinks to 0 and they converge, establishing that all sequences have a convergent subsequence, so the set is compact.
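The pigeonhole/diagonalization step can be illustrated with a finite stand-in for {0,1}ω (a hypothetical encoding: vectors of 0/1 flags standing for "is the t'th nirvana event possible", with "occurs most often" standing in for "occurs infinitely often"):

```python
def agreeing_subsequence(vectors, n_coords):
    """Refine the index set coordinate by coordinate, keeping the branch of
    {0,1} with the most occurrences. The surviving indices pick out vectors
    that agree on every one of the first n_coords coordinates."""
    indices = list(range(len(vectors)))
    for t in range(n_coords):
        vals = [vectors[i][t] for i in indices]
        keep = 1 if vals.count(1) >= vals.count(0) else 0
        indices = [i for i in indices if vectors[i][t] == keep]
    return indices
```

In the infinite setting, each coordinate's refinement keeps an infinite branch, and diagonalizing across coordinates yields the convergent subsequence that Tychonoff guarantees.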

**Lemma 9:** *Given a *πpa* and a sequence of nonempty compact sets *Bn∈Ma(F(πnpa))* (or the nirvana-free variant) where *∀n:prπn+1pa,πnpa∗(Bn+1)⊆Bn* then there is a point *M∈Ma(F(πpa))* (or the nirvana-free variant) where *∀n:prπpa,πnpa∗(M)∈Bn*. This also works with a-surmeasures.*

To sketch this one out: it's essentially "if the sets Bn get smaller and smaller, but stay nonempty, as you ascend up the chain πnpa towards πpa, and are nested in each other, then there's something at the πpa level that projects down into all of them".

Proof sketch: Projection preserves λ and b, the Compactness Lemma says that compactness means you have a λ and b bound, so the preimage of a compact set is compact. Then, we just have to verify the finite intersection property to show that the intersection of the preimages is nonempty, which is pretty easy since all our preimages are nested in each other like an infinite onion.

Proof: Consider the intersection ⋂n(prπpa,πnpa∗)−1(Bn). Because the Bn are all compact, they have a λ and b bound. Projection preserves the λ and b values, so the preimage of Bn has a λ and b bound, and therefore lies in a compact set (by Lemma 8 for the sur-case). The preimage of a closed set is also closed, so all these preimages are compact. This is then an intersection of a family of compact sets, so we just need to check the finite intersection property. Fixing finitely many m, we can find an n above them all, pick an arbitrary point in the preimage of Bn, and invoke Lemma 4 on Bn to conclude that said point lies in all lower preimages, thus demonstrating the finite intersection property. Therefore, the intersection is nonempty.
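A toy version of the finite-intersection argument, with nested compact intervals in R standing in for the nested compact preimages (purely illustrative, not the actual a-measure spaces):

```python
def point_in_nested_intervals(intervals):
    """intervals: list of (lo, hi) pairs, each contained in the previous one.
    Compactness plus nestedness means the intersection is nonempty; the sup
    of the lower endpoints is a witness point."""
    lo = max(a for a, _ in intervals)
    hi = min(b for _, b in intervals)
    assert lo <= hi, "nested nonempty compact sets must intersect"
    return lo
```

The proof above runs the same argument one abstraction level up: compactness turns "every finite subfamily intersects" into "the whole family intersects".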

**Lemma 10:** Given a sequence of nonempty closed sets Bn where prπn+1pa,πnpa∗(Bn+1)⊆Bn, and a sequence of points Mn∈(prπpa,πnpa∗)−1(Bn), all limit points of the sequence Mn (if they exist) lie in ⋂n(prπpa,πnpa∗)−1(Bn) (works in the a-surmeasure case)

Proof: Assume a limit point exists, isolate a subsequence limiting to it. By Lemma 4, the preimages are nested in each other. Also, the preimage of a closed set is closed. Thus, for our subsequence, past n, the points are in the preimage of Bn and don't ever leave, so the limit point is in the preimage of Bn. This argument works for all n, so the limit point is in the intersection of the preimages.

The next three Lemmas are typically used in close succession to establish nirvana-free upper-completeness for projecting down a bunch of nirvana-free upper complete sets, and taking the closed convex hull of them, which is an operation we use a lot. The first one says that projecting down a nirvana-free upper-complete set is upper-complete, the second one says that convex hull preserves the property, and the third one says that closure preserves the property. The first one requires building up a suitable measure via recursion on conditional probabilities, the second one requires building up a whole bunch of sa-measures via recursion on conditional probabilities and taking limits of them to get suitable stuff to mix together, and the third one also requires building up a whole bunch of sa-measures via recursion on conditional probabilities and then some fanciness with defining a limiting sequence.

**Lemma 11:** *In the nirvana-free setting, a projection of an upper-complete set is upper-complete.*

Proof sketch: To be precise about exactly what this says, since we're working with a-measures, it says "if you take the fragment of the upper completion composed of a-measures, and project it down, then the thing you get is: the fragment of the upper completion of (projected set) composed of a-measures". Basically, since we're not working in the full space of sa-measures, and just looking at the a-measure part of the upper completion, that's what makes this one tricky and not immediate.

The proof path is: Take an arbitrary point Mlo in the projection of B which has been crafted by projecting down Mhi. Given an arbitrary M′lo:=Mlo+M∗lo (assuming it's an a-measure) which lies in the upper completion of the projection of B, we need to certify that it's in the projection of B to show that the projection is upper-complete. In order to do this, we craft an M∗hi and M′hi (an a-measure) s.t. Mhi+M∗hi=M′hi (certifying that M′hi is in B since B is upper complete), and M′hi projects down to hit our M′lo point of interest.

These a-measures are crafted by starting with the base case of M∗lo and M′lo, and recursively building up the conditional probabilities in accordance with the conditional probabilities of Mhi. Then we just verify the basic conditions like the measures being well-formed, M′hi being an a-measure, M∗hi having a big enough b term, and Mhi+M∗hi=M′hi, to get our result. Working in the Nirvana-free case is nice since we don't need to worry about assigning negative measure to Nirvana.

Proof: Let B⊆Ma(FNF(πhipa)) be our upper-complete set. We want to show that prπhipa,πlopa∗(B) is upper-complete. To that end, fix a Mlo in the projection of B that's the projection of a Mhi∈B. Let M′lo:=Mlo+M∗lo, where M′lo is an a-measure. Can we find an a-measure M′hi in B that projects down to M′lo? Let's define M∗hi and M′hi as follows:

Let M′hi=(m′hi,b′hi) where b′hi is b′lo. Let M∗hi=(m∗hi,b∗hi) where b∗hi is b∗lo. Recursively define m′hi and m∗hi on h that are prefixes of something in FNF(πhipa) as follows:

If h is a prefix of something in F(πlopa), m′hi(h)=m′lo(h) and m∗hi(h)=m∗lo(h).

Otherwise, recursively define the measure components m′hi and m∗hi as:

If mhi(h)>0, then

m′hi(hao) = (mhi(hao)/mhi(h))·m′hi(h), m∗hi(hao) = (mhi(hao)/mhi(h))·m∗hi(h)

If mhi(h)=0, then m′hi(hao)=m∗hi(hao)=0.

We need to verify that m′hi has no negative parts so it's fitting for an a-measure, that mhi+m∗hi=m′hi, that the b value for M∗hi works, and that they're both well-formed signed measures. The first part is easy to establish: m′hi(h)=m′lo(h)≥0 in the base cases since M′lo is an a-measure, and a quick induction, as well as mhi coming from the a-measure Mhi (so no negatives anywhere), establishes that m′hi has no negative parts.

To verify that they sum up to m′hi, observe that for base-case histories in FNF(πlopa), mhi(h)+m∗hi(h)=mlo(h)+m∗lo(h)=m′lo(h)=m′hi(h). Then, in the general case, we can use induction (assume it's true for h) and go:

If mhi(h)>0, then

mhi(hao)+m∗hi(hao) = mhi(hao) + (mhi(hao)/mhi(h))·m∗hi(h) = (mhi(hao)/mhi(h))·(mhi(h)+m∗hi(h))

= (mhi(hao)/mhi(h))·m′hi(h) = m′hi(hao)

If mhi(h)=0, then mhi(hao)=0, so

mhi(hao)+m∗hi(hao)=0+0=0=m′hi(hao)

What about checking that they're well-formed measures? To do this, it suffices to check that summing up their measure-mass over haoi gets the measure-mass over h. If mhi(h)>0, then:

∑im′hi(haoi) = ∑i (mhi(haoi)/mhi(h))·m′hi(h) = (mhi(h)/mhi(h))·m′hi(h) = m′hi(h)

And similar for m∗.

If mhi(h)=0, then trace back to the longest prefix h′ where mhi(h′)>0; then mhi(h′ao)=0, so:

m′hi(h′ao) = (mhi(h′ao)/mhi(h′))·m′hi(h′) = (0/mhi(h′))·m′hi(h′) = 0

And the same for m∗hi(h′ao). This extends forward up to h, so mhi(h)=0 implies m′hi(h)=m∗hi(h)=0. And we get:

∑im′hi(haoi)=0=m′hi(h)

and the same for m∗hi(h).

For the b value being sufficient, observe that the way we did the extension, the negative-measure for m∗hi is the same as the negative measure for m∗lo, and b∗hi=b∗lo, and the latter is sufficient to cancel out the negative measure for m∗lo, so we're good there. And for projecting down appropriately, observe that m∗hi and m′hi copy m∗lo and m′lo wherever they get a chance to do so.

Thus, M′hi=Mhi+M∗hi so M′hi lies in the upper completion of B, so it's in B, and projects down onto M′lo, certifying that M′lo lies in the projection of B, so the projection of an upper-complete set of a-measures is upper-complete.

**Lemma 12:** *In the nirvana-free setting, the convex hull of an upper-closed set is upper-closed.*

Proof sketch: Again, this is tricky since we're working with a-measures. We have a point M in the convex hull, which shatters into EζMi where the Mi are in the (non-convex) set B itself. If M+M∗ is an a-measure, we want to find M∗i s.t. Mi+M∗i is an a-measure, and mixing these makes M+M∗. However, it's really hard to define these M∗i directly, so we craft approximations indexed by n where Mi+M∗i,n is an a-measure, and mixing the m∗i,n matches m∗ perfectly up till time n. We get weak bounds on the amount of positive measure and the b term to invoke the compactness lemma, and then use Tychonoff to get a suitable convergent subsequence for all the i to define our final M∗i that, when combined with Mi, makes an a-measure. Mixing these together replicates the measure component of M∗, but not the b term. However, that's easily fixable by adding a bit of excess to one of the b terms, and we're done. We took an arbitrary M+M∗ where M∈c.h(B), and crafted M∗i where, for all i, Mi+M∗i is an a-measure (and in B by its upper-completeness) and it mixes together to make M+M∗, certifying that the convex hull is upper-closed.

Our way of constructing the m∗i,n is basically: start at length n, use mi(h)/m(h) as your scale factor for filling in the measure of histories, extend down, and then extend up with the conditional probabilities of Mi.

Proof: Take a point M∈c.h(B). It decomposes into EζMi for finitely many i, where Mi∈B. Fix some arbitrary M+M∗ (as long as it's an a-measure). We'll craft M∗i where Mi+M∗i∈B by upper completeness of B, and the M∗i mix together to make M∗ itself, certifying that M+M∗ lies in c.h(B).

Let m∗i,n be defined by: If h is of length n, or lies in FNF(πpa) and is shorter than length n,

m∗i,n(h) = (mi(h)/m(h))·m∗(h)

(m∗i,n(h) defaults to m∗(h) if m(h)=0)

And then it's defined for shorter h via:

m∗i,n(h)=∑jm∗i,n(haoj)

The b value is (summing over the base-case histories, where h is of length n or lies in FNF(πpa) and is shorter than length n): b∗i,n = −∑h min(0, m∗i,n(h))

And if mi(h)>0, extend to bigger histories via

m∗i,n(hao) = (mi(hao)/mi(h))·m∗i,n(h)

And if mi(h)=0, it's m∗i,n(hao) = m∗i,n(h)/#o

We've got a few things to show. First is showing that m∗i,n is a well-defined signed measure. ∑jm∗i,n(haoj)=m∗i,n(h) trivially if h is shorter than length n.

Otherwise (assuming mi(h)>0),

∑jm∗i,n(haoj) = ∑j (mi(haoj)/mi(h))·m∗i,n(h) = (mi(h)/mi(h))·m∗i,n(h) = m∗i,n(h)

If mi(h)=0, then ∑jm∗i,n(haoj) = ∑j m∗i,n(h)/#o = m∗i,n(h)

Ok, so it's well-defined. Also, past n, it doesn't add any more negative measure. If it's negative on a length-n history, it'll stay that negative forevermore and never go positive, so the b value we stuck on it is a critical b value (exactly sufficient to cancel out the areas of negative measure).

We do need to show that m∗i,n+mi is an a-measure. For the h of length n or in FNF(πpa) and shorter, we can split into two cases. For the case where m(h)>0,

m∗i,n(h)+mi(h) = (mi(h)/m(h))·m∗(h) + mi(h) = mi(h)·(m∗(h)/m(h) + 1)

And because m∗(h)+m(h)≥0 (they sum to make an a-measure), m∗(h)≥−m(h), so m∗(h)/m(h)≥−1, so we get a nonnegative number times a nonnegative number, so m∗i,n(h)+mi(h)≥0.

Now for the case where m(h)=0. In that case mi(h)=0 because m came from an a-measure and is the mix of the mi. Also, because m(h)+m∗(h)≥0 due to M+M∗ being an a-measure, m∗(h)≥0. Then, m∗i,n(h)+mi(h)=m∗(h)≥0

Now for the short histories. By induction down,

m∗i,n(h)+mi(h) = ∑jm∗i,n(haoj)+∑jmi(haoj) = ∑j(m∗i,n(haoj)+mi(haoj)) ≥ 0

Now for the long histories. If mi(h)>0,

m∗i,n(hao)+mi(hao) = mi(hao) + (mi(hao)/mi(h))·m∗i,n(h) = mi(hao)·(1 + m∗i,n(h)/mi(h))

And then, by induction up we can assume m∗i,n(h)+mi(h)≥0, so m∗i,n(h)≥−mi(h), so m∗i,n(h)/mi(h)≥−1, so it's a multiplication of a nonnegative number and a nonnegative number, so we're good there on showing nonnegativity.

If mi(h)=0, then mi(hao)=0. Since m∗i,n(h)+mi(h)≥0, we have m∗i,n(h)≥0. Then, m∗i,n(hao)+mi(hao) = m∗i,n(h)/#o ≥ 0

Ok, so, for all n, Mi+M∗i,n is an a-measure.

One last thing we'll want to show is that, for histories h of length n or shorter, (Eζm∗i,n)(h)=m∗(h)

First, for the histories of length n or in FNF(πpa) (assuming m(h)>0):

(Eζm∗i,n)(h) = Eζ(m∗i,n(h)) = Eζ((mi(h)/m(h))·m∗(h)) = (m∗(h)/m(h))·Eζ(mi(h)) = (m∗(h)/m(h))·(Eζmi)(h)

= (m∗(h)/m(h))·m(h) = m∗(h)

Assuming m(h)=0, then m∗i,n(h)=m∗(h) immediately giving you your result. So, since the mixture of the m∗i,n mimics m∗ on everything of length n or shorter in FNF(πpa), the "sum up the stuff ahead of you" thing makes it mimic m∗ on all histories of length n or shorter.
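A quick numeric sanity check of this mixture identity (toy numbers, a hypothetical two-component mixture, all variable names invented for illustration):

```python
# With m(h) = E_zeta m_i(h) and m*_{i,n}(h) = (m_i(h)/m(h)) * m*(h),
# the zeta-mixture of the m*_{i,n}(h) recovers m*(h) exactly.
zeta = [0.25, 0.75]
m_i_h = [0.4, 0.8]                               # m_i(h) for each component
m_h = sum(z * mi for z, mi in zip(zeta, m_i_h))  # m(h) = E_zeta m_i(h)
m_star_h = -0.3                                  # any value of m*(h)
m_star_i_h = [(mi / m_h) * m_star_h for mi in m_i_h]
mixed = sum(z * ms for z, ms in zip(zeta, m_star_i_h))
# mixed equals (m*(h)/m(h)) * E_zeta m_i(h) = m*(h)
```

The cancellation is exactly the one in the displayed chain of equalities: the scale factor m∗(h)/m(h) pulls out of the expectation, and Eζmi(h) collapses back to m(h).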

Further, m∗ is nonpositive on a history of length n or a shorter history in FNF(πpa) iff for all i, m∗i,n is nonpositive on it. So, since the b values for m∗i,n are the negative component, Eζb∗i,n is the negative-measure of m∗ up till time n, which is less negative than the negative-measure of m∗, so the mixture of the b∗i,n terms undershoots b∗.

Consider the sequence M∗i,n (the sequence is in n; i is fixed). It's a sequence of sa-measures. To show that there's a limit point, we need a bound on the positive value part and the negative value part (our b is critical, it can't go any smaller, so bounding the negative value part bounds the b). Fixing an n, the "boundary" of stuff of length n or shorter and in FNF(πpa) suffices to establish what the mass of the negative part and positive part are. We either mimic m∗ if m(h)=0, or mi(h)=0 while m(h)>0 so m∗i,n(h)=0, or both quantities are positive, so we have a scale term of mi(h)/m(h) = mi(h)/Eζmi(h) ≤ mi(h)/(ζi·mi(h)) = 1/ζi

So, our amount of positive and negative measure on m∗i,n on the "n boundary" is at most 1/ζi times the positive and negative measure on m∗ at the "n boundary", which is less than or equal to the amount of positive and negative measure that m∗ has overall. So, that gets us bounds on the positive and negative parts of m∗i,n of m∗+(1)/ζi and m∗−(1)/ζi, respectively (which bounds our b∗i,n terms).

Now, we can consider our sequence (m∗i,n,b∗i,n) as a sequence ¯Mn in

∏i(Msa(FNF(πpa)) ∩ {(m,b) | m+(1) ≤ m∗+(1)/ζi, b ≤ −m∗−(1)/ζi})

where ¯Mn(i) = (m∗i,n,b∗i,n)

By the Compactness Lemma, these sets are compact, and by Tychonoff, the product is compact. So, there's a convergent subsequence, and the limit point projects down to the coordinates to make an M∗i for each i. The set of sa-measures that, when added to an a-measure, make another a-measure, is closed, and regardless of n, Mi+M∗i,n is an a-measure, so Mi+M∗i is an a-measure.

Eζm∗i,n mimics m∗ up till timestep n, so Eζm∗i=m∗. And because, at each step, the mixture of the b values undershoots b∗, Eζb∗i≤b∗.

So, our final batch of sa-measures is M∗i for i>0, and for i=0, it's M∗0+(0,(b∗−Eζb∗i)/ζ0). Now, all these M∗i are sa-measures that, when added to Mi, make a-measures, and one of them has some extra b term on it, which doesn't impede it from being an a-measure. By upper-completeness of B, they're all in B, and mixing them makes M+M∗ exactly, because

ζ0(M0+M∗0+(0,(b∗−Eζb∗i)/ζ0)) + ∑i>0 ζi(Mi+M∗i) = Eζ(Mi)+Eζ(M∗i)+(0,b∗−Eζb∗i)

= M + Eζ(m∗i,b∗i) + (0,b∗−Eζb∗i) = M + (Eζm∗i,Eζb∗i) + (0,b∗−Eζb∗i)

= M + (Eζm∗i,b∗) = M + (m∗,b∗) = M + M∗

M+M∗ was an arbitrary a-measure above an M∈c.h(B), and we showed it's a mix of a-measures in B (using that B is upper-complete), so c.h(B) is upper-complete.

**Lemma 13:** *In the nirvana-free setting, the closure of an upper-closed set of a-measures is upper-closed.*

Proof sketch: We have an M∈¯B, and a sequence Mn∈B limiting to M. Let M′:=M+M∗ be an arbitrary a-measure above M. We must craft a sequence limiting to M′. What we do is make a bunch of M∗n,j with the special property that mn+m∗n,j perfectly mimics m′ up till time j, by basically going "copy m′−mn for time j or before, and complete with the conditional probabilities of mn so mn+m∗n,j doesn't go negative". And then the b term is set to mimic the b term of M∗, or set to cancel out the amount of negative measure, whichever is greater. The reason we only copy up till time j, instead of skipping to the chase and just going "copy m′−mn, stick on whichever b term you need", is that it affords us finer control and understanding over what our b terms are doing.

Then, we let j increase as ρ(n) to get a sequence of one variable, Mn+M∗n,ρ(n) where ρ(n) is selected to diverge to infinity at a suitable rate to get convergence to M′ itself. Again, no matter what ρ(n) is, as long as it diverges to infinity as n does, we get convergence of the measure term to the measure term m′, the hard part is selecting ρ to appropriately control what the b term is doing. Once ρ(n) is suitably defined, then we can get upper and lower bounds on how the b term of the sum compares to the b term of M′, and show convergence.

Proof: Ok, so B is upper-closed, and we want to show upper-closure of ¯¯¯¯B. Thus, we have an M∈¯¯¯¯B, a sequence of points Mn∈B that limit to M, and if M′=M+M∗ is an a-measure, we want to show that M′ is in ¯¯¯¯B. This is going to require a rather intricate setup to get our limit of interest. In this case, we'll be using both n and j as limit parameters.

Let m∗n,j be defined up till time j by: If h is of length j or shorter, m∗n,j(h)=m′(h)−mn(h)

Extend to longer histories via (if mn(h)>0)

m∗n,j(hao) = (mn(hao)/mn(h))·m∗n,j(h)

And if mn(h)=0, go with m∗n,j(hao) = m∗n,j(h)/#o

The b value is defined as (summing over histories of length j, or in FNF(πpa) and shorter): max(b∗, −∑h min(0, m∗n,j(h)))

We've got a few things to show. First is showing that m∗n,j is a well-defined signed measure. If h is length j or shorter,

∑im∗n,j(haoi)=∑i(m′(haoi)−mn(haoi))=m′(h)−mn(h)=m∗n,j(h)

Otherwise (assuming mn(h)>0),

∑im∗n,j(haoi) = ∑i (mn(haoi)/mn(h))·m∗n,j(h) = (mn(h)/mn(h))·m∗n,j(h) = m∗n,j(h)

Assuming mn(h)=0, ∑im∗n,j(haoi) = ∑i m∗n,j(h)/#o = m∗n,j(h)

Ok, so it's well-defined. Also, past j, it doesn't add any more negative measure. If it's negative on a length j history, it'll stay that negative forevermore and never go positive, so the b value we stuck on it is either a critical b value, or greater than that. In particular, this implies that our definition of b∗n,j can be reexpressed as: b∗n,j=max(b∗,−m∗−n,j(1))

We do need to show that m∗n,j+mn is an a-measure. For the h of length j or in FNF(πpa) and shorter, we can go: m∗n,j(h)+mn(h) = m′(h)−mn(h)+mn(h) = m′(h) ≥ 0, because M′ is an a-measure. This also means that mn+m∗n,j perfectly mimics m′ up till time j.

Now for the long histories. Assume mn(h)>0.

m∗n,j(hao)+mn(hao) = mn(hao) + (mn(hao)/mn(h))·m∗n,j(h) = mn(hao)·(1 + m∗n,j(h)/mn(h))

And then, by induction up we can assume m∗n,j(h)+mn(h)≥0, so m∗n,j(h)≥−mn(h), so m∗n,j(h)/mn(h)≥−1, so it's a multiplication of a nonnegative number and a nonnegative number, so we're good there.

Assume mn(h)=0. Then mn(hao)=0 and m∗n,j(hao)+mn(hao) = m∗n,j(h)/#o

And then, since m∗n,j(h)+mn(h)≥0 (induction up), and mn(h)=0, m∗n,j(h)≥0, and we get our result, showing that Mn+M∗n,j is an a-measure, and by upwards closure of B, all the Mn+M∗n,j lie in B.

There's one more fiddly thing to take care of. What we'll be doing is letting j increase as ρ(n), to get a function of 1 variable, and showing that Mn+M∗n,ρ(n) limits to M′. So we should think carefully about what we want out of ρ.

First, let m∗j− be the measure gotten by restricting m∗ to only histories which are in FNF(πpa) with a length of <j with negative measure, and histories where their length-j prefix has negative measure. This is kinda like a bounded way of slicing out areas with negative measure from m∗, falling short of the optimal decomposition m∗−.

Also, (m∗n,j)−(1), as n increases and j remains fixed, limits to m∗j−(1). The reason for this is that for histories of length j or shorter,

m∗n,j(h)−m∗(h)=(m′(h)−mn(h))−(m′(h)−m(h))=m(h)−mn(h)

and the end term limits to 0 because mn limits to m. So, past a sufficiently large n, m∗n,j comes *extremely* close to mimicking m∗ for the first j steps. So, dialing up n far enough, the negative-measure of m∗n,j comes really close to the negative measure mass in m∗ as evaluated up till time j, due to the aforementioned mimicry.

Further, m∗j−(1)≥m∗−(1), because only evaluating up till length j isn't as good at slicing out areas of negative measure as the optimal decomposition of m∗ into positive and negative components.

With all this, our rule for ρ(n) will be:

ρ(n) := sup{j≤n | ∀n′≥n: |m∗−n′,j(1) − m∗j−(1)| ≤ 2^(−j)}

ρ(n) never decreases. What this is basically doing is going "ok, I'll step up j, but *only* when there's a guarantee that I'll mimic m∗ up till timestep j (re: amount of negative measure) sufficiently closely forever afterward". ρ(n) eventually diverges to infinity, though it might do so very slowly, because for all the j, m∗−n,j(1) limits to m∗j−(1), so eventually we get to a large enough n that the defining condition is fulfilled and we can step up j.
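A finite-horizon sketch of this schedule (hypothetical encoding: `err[n_][j]` stands for |m∗−n′,j(1) − m∗j−(1)|, tabulated up to finite horizons in both indices, and converging to 0 in n_ for each fixed j):

```python
def rho(n, err):
    """Largest j <= n whose error is guaranteed <= 2^-j from n onward
    (0 if none), approximating the sup in the definition above."""
    horizon_j = len(err[0]) - 1
    best = 0
    for j in range(1, min(n, horizon_j) + 1):
        if all(err[n_][j] <= 2 ** (-j) for n_ in range(n, len(err))):
            best = j  # take the sup over all qualifying j <= n
    return best
```

With errors that shrink in n, the schedule steps j up only once the mimicry guarantee locks in, and it never steps back down.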

Now we can *finally* define our sequence as: M∗n,ρ(n)+Mn. We'll show that this limits to M′. First, it's always an a-measure, and always in B because it's in the upper completion of B which is upper-complete. For a fixed n, m∗n,ρ(n)+mn *always* flawlessly matches m′ up till time ρ(n). Since ρ(n) diverges to infinity, eventually we get a flawless match up till any finite time we name, so the measure components do converge to m′ as n limits to infinity.

But what about the b component? Well, the b component of the sum is: max(b∗,−m∗−n,ρ(n)(1))+bn. Let's bound it. Obviously, a lower bound is b∗+bn.

An upper bound is a bit more interesting.

max(b∗, −m∗−n,ρ(n)(1)) + bn ≤ max(b∗+2^(−ρ(n)), −m∗−n,ρ(n)(1)) + bn

By the way we defined ρ(n), we can go to −m∗ρ(n)−(1) with a small constant overhead, and then swap out −m∗ρ(n)−(1) (amount of negative measure up till time ρ(n)) for −m∗−(1) (total amount of negative measure) which is greater.

max(b∗+2^(−ρ(n)), −m∗−n,ρ(n)(1)) + bn ≤ max(b∗+2^(−ρ(n)), −m∗ρ(n)−(1)+2^(−ρ(n))) + bn

≤ max(b∗+2^(−ρ(n)), −m∗−(1)+2^(−ρ(n))) + bn ≤ max(b∗+2^(−ρ(n)), b∗+2^(−ρ(n))) + bn

= b∗ + 2^(−ρ(n)) + bn

(because b∗ must be equal to or exceed the amount of negative measure in m∗ for M∗ to be a legit sa-measure)

So, our bounds on the b term for M∗n,ρ(n)+Mn are b∗+bn on the low end, and b∗+2^(−ρ(n))+bn on the high end. As n limits to infinity, so does ρ(n), so that term vanishes, and bn limits to b, so our limiting b value is b∗+b=b′, and we're done. We built a sequence of a-measures in B limiting to M′, certifying that it's in ¯B, and M′ was arbitrary above some M∈¯B. Thus, the closure of an upper-complete set is upper-complete.

The next one is a story proof, because I couldn't figure out how to make it formal. It essentially says that given two points near each other, their nirvana-free upper-completions (the set of a-measures, if it was a set of sa-measures, it'd be immediate to show) are close to each other.

**Lemma 14:** *For stubs, in the nirvana-free setting, if *M* and *M′* are *ϵ* apart, then the Hausdorff-distance between *{M}uc∩Ma(FNF(πst))* and *{M′}uc∩Ma(FNF(πst))* is *ϵ* or less.*

Ok, I don't really know how to make this formal, so all I have is a story-proof. The KR-metric (what's the maximum difference between two measures w.r.t 1-Lipschitz bounded functions) is the same as (or at least within a constant of) the earthmover distance. The earthmover distance is "interpret your measure as piles (or pits) of dirt on various spots. It takes ϵ effort to move 1 unit of dirt ϵ distance. Also, ϵ effort lets you create or destroy ϵ units of dirt. What's the minimum amount of effort it takes to rearrange one pile of dirt into the other pile of dirt?". So our proof will be a story about moving dirt.
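To make the dirt-moving picture slightly more concrete, here's the standard one-dimensional special case (purely illustrative; the actual proof works over history spaces): for equal-mass piles on evenly spaced points, the minimum effort is the total mass carried past each gap.

```python
def earthmover_1d(pile_a, pile_b):
    """Minimum effort to rearrange pile_a into pile_b, for equal-mass piles
    sitting on points 0, 1, 2, ... (moving one unit of dirt one step costs
    one unit of effort)."""
    assert abs(sum(pile_a) - sum(pile_b)) < 1e-9, "equal total mass assumed"
    effort, carried = 0.0, 0.0
    for a, b in zip(pile_a, pile_b):
        carried += a - b        # net surplus that must cross to the right
        effort += abs(carried)  # carrying it one step costs its mass
    return effort
```

The general case adds creation/destruction of dirt at unit cost, which is what lets the b terms soak up mass mismatches in the story above.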

Let's just examine the measure components of M and M′. Since the earthmover distance is ϵ (it might be less because of different b), it takes ϵ effort to rearrange the dirt pile of m into the dirt pile of m′ in an optimal way. Let's say M+M∗ is an a-measure (no negative parts ie no dirt pits). We need some M′∗ to add to M′ to make an a-measure within ϵ of M+M∗.

The procedure to construct M′∗ is as follows: b′∗=b∗. For the measure component, start with the m∗ pile (there may be dirt pits ie areas of negative measure). Now, keep a close tab on the process of rearranging m into m′. One crumb of dirt at a time is moved, or dirt is created/destroyed. The rule is:

Let's say a crumb of dirt is moved from h to h′, created at h, or destroyed at h. If the pile-being-rearranged into m′ has more measure on h than the size of the pit (negative measure) on h for the pile-being-rearranged into m′∗, sit around and do nothing. If moving or destroying that crumb would make h have negative measure (the dirt pile on h for the pile-being-turned-into m′ would become smaller than the size of the hole on h for the pile-being-turned-into m′∗), then take the latter pile and move a crumb from h′ into the pit for h (at the same expenditure of effort), or create a crumb of dirt there instead. Once you're done, that's your m′∗. Keep the b value the same.

Now, this process does several things. First, very little effort was expended (ϵ or less), to reshuffle m∗ into m′∗ because you're either sitting around, or mimicking the same low-effort dirt moving process in reverse. Second, b∗ stays as a viable bound throughout, because whenever you move a crumb, you're dropping it into a pit, so an increase in negative measure at one spot is balanced out by a decrease in negative measure for the spot you moved the crumb to. Also, you never destroy crumbs, only create them. Also, in the whole process by which we rearrange m+m∗ into m′+m′∗, we always preserve the invariant that (pile on h + pit on h ≥0), so m′+m′∗ is a measure, not a signed measure.

For the final bit, we can imagine reshuffling m+m∗ into m′+m′∗ as a whole. Then, either a crumb is moved from point A to point B, or you move a crumb from point A to point B, and a crumb back from point B to point A so you can skip that step. Or, a crumb is created at point A, or a crumb is both destroyed and created at A, so you can skip that step. So, the dirt-moving procedure to turn m+m∗ into m′+m′∗ spends *as much or less* effort than the procedure to turn m into m′, which takes ϵ effort.

Putting it all together, we took an arbitrary point in the upper-completion of M, and it only takes ϵ or less effort to shift the b a little bit and reshuffle the measures to get a point in the upper-completion of M′.

The argument works in reverse, just switch the labels, to establish that the two upper-completions are ϵ apart or less.

For the next one, we have two ways of expressing uniform Hausdorff-continuity for a belief function. As a recap, Ma(∞) is the set of a-measures over all outcomes (regardless of whether or not they could have come from a single policy or not), and all belief functions have a critical λ⊙+b⊙ parameter that controls the λ and b values of the set of minimal points regardless of πpa. {≤⊙} is the set {(λμ,b)|λ+b≤λ⊙+b⊙}. They are:

1: For all nonzero ϵ, there exists a nonzero δ where d(πpa,π′pa)<δ implies (pr∞,πpa∗)−1(Θ(πpa)∩NF∩{≤⊙}) has a Hausdorff-distance of ϵ from the corresponding set for π′pa.

2: For all nonzero ϵ, there exists a nonzero δ where d(πpa,π′pa)<δ implies: If (λμ,b)∈(pr∞,πpa∗)−1(Θ(πpa)∩NF), then (λμ,b) has a distance of ϵ(λ+1) or less from the set (pr∞,π′pa∗)−1(Θ(π′pa)∩NF) (and symmetrically for the other set)

**Lemma 15:** *The two ways of expressing the Hausdorff-continuity requirement are equivalent for a belief function *Θ* or *Θω* obeying nirvana-free nonemptiness, closure, nirvana-free upper-completion, and bounded-minimals.*

Proof sketch: We start with the second λ-dependent distance condition and derive the first. Roughly, that {≤⊙} restriction means the tail where the λ values are high gets clipped so the two sets are within a constant of ϵ away from each other. In the other direction... Well, we start with a point M in one preimage and do a bunch of projecting points down and finding minimals and taking preimages and using earlier lemmas and our first distance condition, and eventually end up with a fancy diagram, and finish up with an argument that two points are close to each other, so M and one other point are "similarly close". This isn't good exposition, but I've got diagrams to keep a mental picture of the dozen different points and how they relate to each other in working memory.

Folk Result (from Vanessa): if two measures m and m′ are ϵ-distance apart in the KR metric, then if you extend m in some way, and extend m′ with the same conditional probabilities, then the two resulting measures remain ϵ-apart. We'll be using this in both directions.

Proof direction 1: Ok, we'll show the second way implies the first way, first. Fix some ϵ, and let the δ (distance between two partial policies) be low enough to guarantee that the distance parameter between the two preimages (according to definition 2, which has the λ-dependent distance guarantee) is (ϵ/(2(1+λ⊙)))·(1+λ). We can always do this. λ⊙ is fixed by our belief function.

Keep this image in mind while reading the following arguments. The upper left set is the preimage of Θ(πpa)∩NF, the upper right set is the preimage of Θ(π′pa)∩NF, and the bottom right set is Θ(π′pa)∩NF itself.

Now, any M in the preimage of Θ(πpa)∩NF∩{≤⊙} has a λ⊙ upper bound on its λ value because projection preserves λ. By the λ-dependent distance condition, it's within ϵ(1+λ⊙)/(2(1+λ⊙))=ϵ/2 of the preimage of Θ(π′pa)∩NF, so we can hop over that far and get a point M′.

Admittedly, moving over to the nearby point M′ may involve violating the λ⊙ bound by ϵ/2 or less, but if that happens, we can project our M′ point down to Θ(π′pa)∩NF making M′lo, find a point M′min (nirvana-free) in Θ(π′pa)∩NF∩{≤⊙} (bounded-minimals) below M′lo where M′min+M∗=M′lo by upper-completion for Θ(π′pa), and then consider the reexpression of M′lo as M′min+(m∗−,−m∗−(1))+(m∗+,b∗+m∗−(1))

The sum of the first two terms is a nirvana-free a-measure (because M′lo is an a-measure, adding on the negative component does nothing) that lies below M′lo and respects the λ⊙+b⊙ bound (exactly as much as it adds to b, it takes away from the measure). Then you can add in *most* of the third term, going *just* up to the bound, to get a point M′′lo only ϵ/2 (at most) away from M′lo, which respects the bound (so it lies in Θ(π′pa)∩NF∩{≤⊙}).

Now, you can complete M′′lo with the conditional probabilities of the measure part of M′, to make a point M′′ in the preimage of Θ(π′pa)∩NF∩{≤⊙} that's ϵ/2 or less distance away from M′.

Going from M to M′, and from M′ to M′′, is ϵ/2 distance each time, so we found a δ distance between any two partial policies πpa and π′pa that ensures the preimages of Θ(πpa)∩NF∩{≤⊙} and Θ(π′pa)∩NF∩{≤⊙} are only ϵ apart.

Proof Direction 2: Keep tabs on the following diagram to see how the 12 different points in 5 different sets relate to each other.

This δ gives us an n via n=logγ(δ), which is the first time the partial policies start disagreeing on what to do. The upper left and upper right sets are the preimages of Θ(πpa)∩NF and Θ(π′pa)∩NF respectively, and the middle-left and middle-right sets are the sets themselves. For the bottom set: take the inf of πpa and π′pa; it's another partial policy that's fully defined before time n, because those policies agree up till that time. Chop it off at time n to get a stub, and call this stub πlopa. The bottom set is Ma(FNF(πlopa)).
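To make the δ-to-n conversion concrete, here's a small sketch (assumed convention, matching the post's usage: the distance between two partial policies is γ^n, with n the first timestep where they disagree; the action strings and γ value are invented for illustration):

```python
import math

gamma = 0.5  # illustrative time-discount parameter, with γ < 1

def first_disagreement(pi1, pi2):
    """First timestep at which two (flattened) policies pick different actions."""
    for t, (a, b) in enumerate(zip(pi1, pi2)):
        if a != b:
            return t
    return min(len(pi1), len(pi2))

pi, pi_prime = "LLRRL", "LLRLL"       # hypothetical action sequences
n = first_disagreement(pi, pi_prime)  # they agree up until t = 3
delta = gamma ** n                    # d(π, π′) = γ^n

# Recovering n from δ, as the proof does: n = log_γ(δ).
assert abs(n - math.log(delta, gamma)) < 1e-9
```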

Now, follow the diagram in conjunction with the following proof. We start with an arbitrary M in the preimage of Θ(πpa)∩NF. We project it down to Θ(πpa)∩NF to make Mmid. Due to bounded-minimals, we can find a minimal point below it, M∗mid which obeys the λ⊙+b⊙ bound. Now, we go in two directions. One is projecting Mmid and M∗mid down to make Mlo and M∗lo, the latter of which lies below Mlo. Let's just keep those two in mind. In the other direction, since M∗mid obeys the λ⊙+b⊙ bound and lies in Θ(πpa)∩NF∩{≤⊙}, we can find a point M∗ in the preimage that obeys the λ⊙+b⊙ bound, and so, there's another point

M′∗ in (pr∞,π′pa∗)−1(Θ(π′pa)∩NF∩{≤⊙})

that's only ϵ/2 or less away, by our version of the Hausdorff condition that only works on the clipped version of the preimages. M′∗ projects down to Θ(π′pa)∩NF to make M′∗mid, and projects down further to make M′∗lo.

Now, projections preserve or contract distances; M∗lo is the projection of M∗, M′∗lo is the projection of M′∗, and M∗ and M′∗ are only ϵ/2 apart, so M∗lo and M′∗lo are only ϵ/2 apart, and Mlo lies above M∗lo. Now, we can invoke Lemma 14 to craft an M′lo that's above M′∗lo and within ϵ/2 of Mlo. Then, we can observe that Θ(π′pa)∩NF is nirvana-free and nirvana-free upper-complete. So, by Lemma 11, its projection down is nirvana-free and nirvana-free upper-complete. M′∗lo is the projection down of M′∗mid, and M′lo is above M′∗lo, so M′lo is in the projection of Θ(π′pa)∩NF, and we can craft a point M′mid∈Θ(π′pa)∩NF that projects down accordingly. And then go a level up to the preimage of Θ(π′pa)∩NF, and make a preimage point M′ by extending m′mid with the conditional probabilities of m up till time n whenever you get a chance, and then doing whatever; that'll be our M′ point of interest. The diagram sure came in handy, huh?

We still need to somehow argue that M and M′ are close to each other in a λ-dependent way (the λ of M). And the only tool we have is that Mlo and M′lo are within ϵ/2 of each other, and M and M′ project down onto them. So how do we do that? Well, notice that before time n, m′ and m are either in a part of the action-observation tree that πlopa has opinions on, where they're ϵ/2-apart, or m′ is copying the conditional probabilities of m. So, if we were to chop m and m′ off at timestep n, the two measures would be within ϵ/2 of each other.

However, after timestep n, things go to hell; they both end up diverging and doing their own thing.

Now, we can give the following dirt-reshuffling procedure to turn m into m′. You've got piles of dirt on each history, corresponding to the measure component of M. You can "coarse-grain" and imagine all your distinct and meticulous, but close-together, piles of dirt on histories with a prefix of h, where |h|=n, as just one big pile on h. So, you follow the optimal dirt-reshuffling procedure for turning m (clipped off at length n) into m′ (clipped off at length n), which takes ϵ/2 effort or less. Then, we un-coarse-grain and go "oh damn, we've gotta sort out all our little close-together piles now to make m′ exactly! We're not done yet!"

But we've got something special. When we're sorting out all our little close-together piles... said piles are the extensions of a finite history with length n. All those extensions will agree for the first n timesteps. And the distance between histories is γ^n where n is the first timestep they disagree, right? And further, n was logγ(δ), so whenever we move a bit of dirt somewhere else to rearrange all our close-together piles, we're only moving it δ distance! So, in the worst case of doing a complete rearrangement, we've gotta move our *whole* quantity of dirt δ distance, at a cost of δλ′ effort (λ′ being the total amount of measure for m′).

Let's try to bound this, shall we? Our first phase of dirt rearrangement (and adjusting the b values) took ϵ/2 effort or less, and our second phase took δλ′ effort or less. Now, we can observe two crucial facts. The first is that, at the outset, we insisted that δ was <ϵ/2. Our second crucial fact is that λ′ and λ can't be more than ϵ/2 apart, because projection preserves λ values, and M and M′ project down to Mlo and M′lo respectively, which are ϵ/2 or less apart. So, the total amount of measure they have can't differ by more than ϵ/2. This lets us get:

d(M,M′) ≤ ϵ/2+δλ′ ≤ ϵ/2+δ(ϵ/2+λ) < ϵ/2+(ϵ/2)(ϵ/2+λ) < ϵ/2+(ϵ/2)(1+2λ) = ϵ+ϵλ = ϵ(1+λ)

And so, given any ϵ, there's a δ where if d(πpa,π′pa)<δ, then for any point M in the preimage of Θ(πpa)∩NF, there's a point M′ in the preimage of Θ(π′pa)∩NF s.t. d(M,M′)<ϵ(1+λ), deriving our second formulation of Hausdorff-continuity from our first one. And we're done! Fortunately, the next one is easier.
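That chain of inequalities can be sanity-checked numerically (pure arithmetic over a grid of hypothetical parameter values, with δ<ϵ/2 and λ′ at most ϵ/2 above λ, as the proof requires, and ϵ<2):

```python
def bound_holds(eps, lam):
    """Check ε/2 + δλ' < ε(1 + λ) in the worst case allowed by the proof."""
    delta = 0.99 * (eps / 2)       # any δ strictly below ε/2
    lam_prime = lam + eps / 2      # λ' at most ε/2 above λ (projection preserves λ)
    return eps / 2 + delta * lam_prime < eps * (1 + lam)

assert all(bound_holds(e, l)
           for e in (0.01, 0.1, 0.5, 1.0)
           for l in (0.0, 0.5, 1.0, 10.0))
```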

**Lemma 16:** *If *Mn* limits to *M*, and *Mlon* are all below their corresponding *Mn* and obey a *λ⊙+b⊙* bound, then all limit points of *Mlon* lie below *M*. This works for a-surmeasures too.*

Proof sketch: We've got a λ⊙+b⊙ bound, so we can use the Compactness Lemma or Lemma 8 to get a convergent subsequence. Now, this is a special proof because we don't have to be as strict as we usually are about working only with a-measures and sa-measures only showing up as intermediate steps. What we do is take a limit point of the low sequence, and add some sa-measure to it that makes the resulting sa-measure close to M, so M is close to the upper completion of our limit point. We can make it arbitrarily close, and the upper completion of a single point is closed, so M actually does lie above our limit point and we're done. To do our distance fiddling argument in the full generality that works for sur-stuff, we do need to take a detour and show that for surmeasures, ds(x+y,z+y)≤ds(x,z).

Proof: The Mlon obey the λ⊙+b⊙ bound, so convergent subsequences exist by the compactness lemma or Lemma 8. Pick out a convergent subsequence to work in, giving you a limit point Mlo. All the Mn can be written as Mlon+M∗n.

We'll take a brief detour, and observe that if we're just dealing with sa-measures, then, since we're in a Banach space, d(x+y,z+y)=d(x,z). But what about the surmetric? Well, the surmetric is the max of the usual metric and γ raised to the power of "first time the measure components start disagreeing on what nirvana events are possible or impossible". Since sa-measures and sa-surmeasures can't assign negative probability to Nirvana, adding an sa-surmeasure adds *more* nirvana spots into both surmeasure components! In particular, they won't disagree more, and may disagree less, since adding that sa-surmeasure in may stick nirvana on a spot that they disagree on, so now they both agree that Nirvana happens there. So, since the standard distance component stays the same and the nirvana-sensitive component stays the same or gets closer, ds(x+y,z+y)≤ds(x,z). We'll be using this.
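Here's a toy numeric check of that monotonicity property (everything in it is invented for illustration: "surmeasures" are modeled as nonnegative mass vectors over indexed nirvana spots, total variation stands in for the usual metric, and a spot counts as nirvana-possible when its mass is positive):

```python
import random

gamma = 0.5

def tv(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def first_possibility_disagreement(a, b):
    """First spot where the vectors disagree on whether nirvana is possible."""
    for i, (p, q) in enumerate(zip(a, b)):
        if (p > 0) != (q > 0):
            return i
    return None

def ds(a, b):
    """Toy surmetric: max of the base metric and γ^(first disagreement)."""
    i = first_possibility_disagreement(a, b)
    return max(tv(a, b), 0.0 if i is None else gamma ** i)

def add(a, b):
    return [p + q for p, q in zip(a, b)]

# Adding the same nonnegative y can only add nirvana-possible spots to both
# sides, so disagreements can vanish but never appear: ds(x+y, z+y) <= ds(x, z).
random.seed(0)
for _ in range(1000):
    x, z, y = ([random.choice([0.0, random.random()]) for _ in range(6)]
               for _ in range(3))
    assert ds(add(x, y), add(z, y)) <= ds(x, z) + 1e-12
```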

Let n be large enough that d(Mn,M)<ϵ and d(Mlon,Mlo)<ϵ (same for the surmetric). Now, consider the point Mlo+M∗n. It is an sa-measure or sa-surmeasure that lies above Mlo, and we'll show that it's close to M. Whether we're working with the sa-measures or sa-surmeasures,

d(Mlo+M∗n,M)≤d(Mlo+M∗n,Mn)+d(Mn,M)<d(Mlo+M∗n,Mlon+M∗n)+ϵ

≤d(Mlo,Mlon)+ϵ<2ϵ

So, M is <2ϵ distance from the upper completion of Mlo in the space of sa-measures/sa-surmeasures, for all ϵ. Said upper completion is the sum of a closed set (cone of sa-measures/sa-surmeasures) and a compact set (a single point) so it's closed, so M (an a-measure/a-surmeasure) lies above Mlo (an a-measure/a-surmeasure that was an arbitrary limit point of the Mlon) and we're done.

The next three, Lemmas 17, 18, and 19, are used to set up the critical Lemma 20 which we use a lot.

**Lemma 17:** *The function *Π→Ma(∞)* of the form *π↦Θω(π)∩NF∩{≤⊙}* has closed graph assuming Hausdorff-continuity for *Θω* on policies, and that *Θω(π)* is closed for all *π*. Also works for a *Θ* that fulfills the stated properties.*

Let πn limit to π, and let Mn∈Θω(πn)∩NF∩{≤⊙} limit to M. We'll show that M∈Θω(π)∩NF∩{≤⊙} (the definition of closed graph). Take some really big n that guarantees that d(πn,π)<δ and d(Mn,M)<ϵ. Then we go:

The distance from M to Mn is ϵ or less, and since Mn∈Θω(πn), we can invoke uniform Hausdorff continuity and conclude Mn is only ϵ or less away from a point in Θω(π)∩NF∩{≤⊙}. So, the distance from M to Θω(π)∩NF∩{≤⊙} is ≤2ϵ. This argument works for all ϵ, so it's at distance 0 from Θω(π)∩NF∩{≤⊙}, and that set is closed because it's an intersection of closed sets, so M∈Θω(π)∩NF∩{≤⊙}, and we have closed-graph.

**Lemma 18: **⋃π≥πstprπ,πst∗(Θω(π)∩NF∩{≤⊙}) *is compact as long as *Θω(π)* is closed for all *π* and *Θω* fulfills the Hausdorff-continuity property on policies. Also works for a *Θ* that fulfills the stated properties.*

The set of π≥πst is closed in the topology on Π, because a limit of policies above πst will still be above πst. More specifically, because it's a closed subset of a compact space, it's compact. Also, remembering that projection preserves λ and b, we can consider the set {≤⊙} (for Ma(∞)) which is compact.

Take the product of those two compact sets to get a compact set in Π×Ma(∞), and intersect it with the graph of our function mapping π to Θω(π)∩NF∩{≤⊙} (which is closed by Lemma 17) to get a compact set. Project it down to the Ma(∞) coordinate (still compact, since projection to a coordinate is continuous), and everything in that will be safe to project down to Ma(FNF(πst)), getting you exactly the set ⋃π≥πstprπ,πst∗(Θω(π)∩NF∩{≤⊙}), which is compact because it's the image of a compact set through a continuous function.

**Lemma 19:**

¯¯¯¯¯¯¯¯c.h(⋃π≥πstprπ,πst∗(Θω(π)∩NF))=(c.h(⋃π≥πstprπ,πst∗(Θω(π)∩NF∩{≤⊙})))uc

*Where the upper completion is with respect to the cone of nirvana-free sa-measures and then we intersect with the set of nirvana-free a-measures, and *Θω(π)* is closed and nirvana-free upper-complete for all *π* and *Θω* fulfills the Hausdorff-continuity property on policies and the bounded-minimals property. Also works for a *Θ* that fulfills the stated properties.*

One direction of this,

¯¯¯¯¯¯¯¯c.h(⋃π≥πstprπ,πst∗(Θω(π)∩NF))⊇(c.h(⋃π≥πstprπ,πst∗(Θω(π)∩NF∩{≤⊙})))uc

is pretty easy. Everything in the convex hull of the clipped projections lies in the closed convex hull of the full projections, and then, from lemmas 11, 12, and 13, the closed convex hull of these projections is nirvana-free upper complete since Θω(π) is for all π, so that gets the points added by upper completion as well, establishing one subset direction.

Now for the other direction,

¯¯¯¯¯¯¯¯c.h(⋃π≥πstprπ,πst∗(Θω(π)∩NF))⊆(c.h(⋃π≥πstprπ,πst∗(Θω(π)∩NF∩{≤⊙})))uc

Let M lie in the closed convex hull. There's a sequence Mn that limits to it, where all the Mn are made by taking M∞i,n from above, projecting down and mixing. By bounded minimals, we can find some M∞,mini,n∈Θω(π)∩NF below the M∞i,n, and they're minimal points so they all obey the λ⊙+b⊙ bound. Now, project the M∞,mini,n down, and mix in the same way, to get an a-measure Mlon below Mn, which lies in the convex hull of clipped projections.

From Lemma 16, we can take a limit point of Mlon to get a Mlo below M. Now, we just have to show that Mlo lies in the convex hull set in order to get M by upper completion. Mlo is a limit of points from the convex hull set, so we just have to show that said convex hull set is closed. The thing we're taking the convex hull of is compact (Lemma 18), and in finite dimensions (because we're working in a stub), the convex hull of a compact set is compact. Thus, Mlo lies in the convex hull, and is below M, so M lies in the upper completion of the convex hull and we're done.

**Lemma 20: ** c.h(⋃π≥πstprπ,πst∗(Θω(π)∩NF)) *is closed, if *Θω(π)* is closed and nirvana-free upper-complete for all *π* and *Θω* fulfills the Hausdorff-continuity property on policies and the bounded-minimals property. Also works for a *Θ* that fulfills the stated properties.*

By Lemmas 11 and 12, said convex hull is nirvana-free upper-complete. Any point in the closure of the convex hull, by Lemma 19, lies above some finite mixture of nirvana-free stuff from above that respects the λ⊙+b⊙ bound, projected down. However, since the convex hull is upper-complete, our arbitrary point in the closure of the convex hull is captured by the convex hull alone.

**Lemma 21:** *If *Θ* is consistent and nirvana-free upper-complete for *Θ(π)*, and obeys the extreme point condition, and obeys the Hausdorff-continuity condition on policies, then* Θ(πpa)∩NF=¯¯¯¯¯¯¯¯c.h(⋃π≥πpaprπ,πpa∗(Θ(π)∩NF)) *and* Θ(πst)∩NF=c.h(⋃π>πstprπ,πst∗(Θ(π)∩NF))*. This works in the sur-case too.*

Proof sketch: One subset direction is pretty dang easy from Consistency. The other subset direction for stubs (that any nirvana-free point in Θ(πst) lies in the convex hull of projections from above) is done by taking your point M of interest, finding a minimal point below it, using Lemma 3 to split your minimal points into finitely many minimal extreme points, and using the extreme point condition to view them as coming from policies above, so the minimal point has been captured by the convex hull, and then Lemmas 11 and 12 say that the convex hull of those projections is nirvana-free upper-complete, so our M is captured by the convex hull.

Getting it for partial policies is significantly more complex. We take our M and project it down into Θ(πnpa) for some very large n. Then, using our result for stubs, we can view our projected point Mn as a mix of nirvana-free stuff from policies above πnpa. If n is large enough, πnpa is very close to πpa itself, so we can perturb our points at the infinite level a little bit to be associated with policies above πpa with Hausdorff-Continuity, and then we can project down and mix, and show that this point (in the convex hull of projections of nirvana-free stuff from above) is close to M itself, getting a sequence that limits to M, witnessing that it's in the closed convex hull of projections of nirvana-free stuff from above. It's advised to diagram the partial policy argument; it's rather complicated.

Ok, so one direction is easy, Θ(πpa)∩NF⊇¯¯¯¯¯¯¯¯c.h(⋃π≥πpaprπ,πpa∗(Θ(π)∩NF)) (and likewise for stubs). Consistency implies that Θ(πpa) (or Θ(πst)) is the closed convex hull of projections down from above, so the closed (or vanilla) convex hulls of the projections of nirvana-free stuff from above are a subset of the nirvana-free part of Θ(πpa) (or Θ(πst)).

For the other direction... we'll show the stub form, which is easier, and build on that. We're shooting for Θ(πst)∩NF⊆c.h(⋃π>πstprπ,πst∗(Θ(π)∩NF))

Fix some M∈Θ(πst)∩NF. Find a minimal point Mmin below it, which must be nirvana-free, because you can't make Nirvana vanish by adding sa-measures. Invoke Lemma 3 to write Mmin as a finite mixture of minimal extreme points in the nirvana-free part of Θ(πst). These must be minimal and extreme and nirvana-free in Θ(πst), because you can't mix nirvana-containing points and get a nirvana-free point, nor can you add something to a nirvana-containing point without getting a nirvana-containing point. By the extreme point condition, there are nirvana-free points from above that project down to those extreme points. Mixing them back together witnesses that Mmin lies in the convex hull of projections of nirvana-free stuff from above. M is nirvana-free and lies above Mmin, so it's captured by the convex hull (with Lemmas 11 and 12).

Now for the other direction with partial policies, that

Θ(πpa)∩NF⊆¯¯¯¯¯¯¯¯c.h(⋃π≥πpaprπ,πpa∗(Θ(π)∩NF))

Fix some M∈Θ(πpa)∩NF. We can project M down to all the πnpa to get Mn which are nirvana-free and in Θ(πnpa) by Consistency.

Our task is to express M as a limit of some sequence of points that are finite mixtures of nirvana-free stuff projected from policies above πpa. Also, remember the link between "time of first difference n" and the δ distance between two partial policies: δn=γ^n where γ<1. Each δn induces an ϵn number for the purposes of Hausdorff-continuity.

First, πnpa is a stub which, as we have already established, has its nirvana-free part equal to the convex hull of projections of nirvana-free stuff down from above. So, Mn is made by taking finitely many M∞i,n∈Θ(πi)∩NF where πi≥πnpa, projecting down, and mixing. By linearity of projection, we can mix the M∞i,n before projecting down and hit the same point; call this mix M∞n.

Since the distance between πnpa and πpa is δn or less, each policy πi has another policy within δn that's above πpa, and by uniform Hausdorff-continuity (the variant from Lemma 15) we only have to perturb the M∞i,n by ϵn(1+λi,n) to get M′∞i,n in Θ(π′i)∩NF where π′i≥πpa for all i.

Mixing these in the same proportion makes an M′∞n within ϵn(1+λ) of M∞n, which projects down to Θ(πpa)∩NF (because mix-then-project is the same as projecting the M′∞i,n then mixing, and the convex hull of projections of stuff from above is a subset of Θ(πpa) by consistency). The projection of M′∞n we'll call M′n. It lies in the convex hull of the projections of nirvana-free stuff from above.

Now, to finish up, we just need to show that M′n limit to M, witnessing that M is in the closed convex hull of projections of nirvana-free stuff from above. Since M′n is the projection of M′∞n, which is ϵn(1+λ) away from M∞n, and projection doesn't increase distance, and the projection of M∞n is Mn, we can go

d(prπpa,πnpa∗(M′n),Mn)=d(pr∞,πnpa∗(M′∞n),pr∞,πnpa∗(M∞n))≤d(M′∞n,M∞n)<ϵn(1+λ)

So, we can conclude that, restricting to before time n, the measure components of M′n and M are fairly similar (ϵn(1+λ) distance), and so are the b components. Then some stuff happens after time n. Because our distance evaluation is done with Lipschitz functions, it really doesn't care much what happens at late times. So, in the n→∞ limit, the difference between the b terms vanishes, and the measure components agree increasingly well: they limit to perfect agreement for times before n, and since the other stuff happens at increasingly late times (n is diverging), the measure components converge.
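The "late differences are cheap" point can be made concrete with a small sketch (assumed conventions, invented for illustration: histories are strings, d(h,h′)=γ^(first disagreement), and the cost of a prefix-matched coupling upper-bounds the KR distance):

```python
import itertools

gamma, n = 0.5, 3

def hist_dist(h1, h2):
    """γ^(first index where the histories disagree); 0 if they agree."""
    for i, (a, b) in enumerate(zip(h1, h2)):
        if a != b:
            return gamma ** i
    return 0.0

# Two measures over length-5 histories: identical up to time n, then
# deterministic (and different) tails -- "doing their own thing" afterward.
prefixes = ["".join(p) for p in itertools.product("01", repeat=n)]
weight = {p: 1.0 / len(prefixes) for p in prefixes}
m = {p + "00": w for p, w in weight.items()}
m_prime = {p + "11": w for p, w in weight.items()}

# Couple histories sharing a prefix; this coupling's transport cost
# upper-bounds the KR distance between m and m'.
cost = sum(weight[p] * hist_dist(p + "00", p + "11") for p in prefixes)
assert cost <= gamma ** n + 1e-12  # disagreeing only after time n costs at most γ^n per unit mass
```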

So, we just built M as a limit of points from the convex hull of the projections of nirvana-free parts down from above, and we're done.

Alright, we're back. We've finally accumulated a large enough lemma toolkit to attack our major theorem, the Isomorphism theorem. Time for the next post!


### 19th-century progress studies

I wrote earlier about an 1857 plan for the transcontinental railroad (as yet unbuilt), and how it advocated funding the project through the sale of stock to a broad base of public investors. However, another aspect of the plan was even more striking: it explicitly advocated studying the history of progress, in order to judge what was possible and desirable for the railroad. It was like a 19th-century form of progress studies.

The author of the plan, Theodore Judah, first addresses the question of how fast the trains will travel from St. Louis to San Francisco: three days, he estimates. But then he points out that this estimate is only based on the speeds attainable by “our present class of engines.” The next paragraphs exemplify the 19th-century spirit of progress:

But—

Is there to be no improvement in our present class of engines? have we reached a point in the stage of progress where we must stop, beyond which we cannot go? Are we willing to admit that fifty miles per hour is the limit to speed? Are we contented, and do we desire to go no faster?

No—

However well we may be satisfied with the present rate of speed in traveling, we dare not admit the principle—we wish to go as fast as we can. Improvements are progressive and the future is before us. No, we have not arrived at the limit, at a final stopping place; we are only at a station, a way station—we have paused, but not to remain. We do not travel fast enough, nor will we, until a speed of one hundred miles per hour is attained with as much ease, and as little risk, as at present.

In defense of this idea, he explicitly appeals to history:

Does the idea seem preposterous? Is it foolish, visionary? Is it absurd?

Let us inquire into the matter a little: **let us extract a few notes from the history of progress.**

What follows is a long passage about the history of roads and horse travel in England starting in the 16th century. He points out that horseback riding was slow, uncomfortable, and dangerous. He describes the establishment of the English post office in 1660, which greatly sped up mail delivery; and the development of stage coach service, which was an improvement on riding horseback.

He also points out that advances in travel were often unappreciated, even opposed. He quotes several paragraphs from a 1673 pamphlet decrying the stage coaches and pining for the old days of horseback. The pamphlet claimed that horseback was a healthier mode of travel, and advocated restrictions on coach service. He tells the story of an early railroad promoter, Thomas Gray, who was called a fool in the 1820s and who failed to raise funding for his proposed system of railways. He quotes authors as late as 1829, the year that locomotives were demonstrated to be practical in the famous Rainhill trials, writing of “the ridiculous expectations… that we shall see engines traveling at the rate of twelve, sixteen, eighteen or twenty miles per hour” and calling grand plans for railroad networks “visionary schemes unworthy of notice.” He concludes:

Such opinions thus expressed by authorities of such eminence, in opposition to what is now an every day reality, may well induce the most intelligent and far sighted to hesitate in making dogmatical assertions as to what may or may not be the revelations of the future.

Judah was writing less than thirty years into the great age of railroads, but already he spoke of what had been accomplished so far in glowing terms:

… those short twenty-six years are a living monument in the progress of time more grand, lofty and noble than the proudest pyramid which the world has yet gazed upon.

Twenty-five thousand miles of railroad have been built in this country or an average of one thousand miles per year.

Where is the man who can sum up the grand, mighty benefits which have, in consequence, accrued to mankind. If the man could be found with a mind vast enough to comprehend and with talent sufficient to compass them, he could write a tale in comparison with which the mightiest achievements of the collective world would sink into utter insignificance. No one appreciates the innumerable blessings which have flowed in consequence, for the story has never been told; it is not understood.

(Already, in 1857, he is saying that progress is not understood or appreciated!)

But it is one of the first steps of the newly awakened young giant, Progress, and shall we measure his glorious march by a few strides? No: he may pause to rest, but it is to recruit his powers for new conquests, and among them some of us will yet see the realization of our preposterous, absurd idea, viz, “traveling by railroad at the rate of 100 miles per hour, with the same safety as present,” is not near so startling or absurd a proposition, in this age, as was that of 20 miles per hour only thirty years ago.

Incidentally, many trains in Europe and Asia today travel at well over 100 miles per hour.

It is amazing to think of Judah’s tremendous confidence in progress, given how much of it was yet to come. In 1857, electric light and power were still decades away. The Bessemer process for making cheap steel had not yet been introduced on a large scale. Synthetic fertilizer had not yet been invented. The germ theory had not yet been discovered, and mortality from infectious disease was still staggering by today’s standards. Telephone was still a dream; radio and television a fantasy. And of course, the internal combustion engine, and its twin children the automobile and airplane, were well in the future.

And yet, here is Judah, already confident in the march of capital-P Progress, and pointing to history to justify his vision and to dismiss skeptics.


### What is the interpretation of the do() operator?

To reductively explain causality, it has to be explained in non-causal terms, most likely in terms of total probability distributions. Pearl explains causality in terms of causal graphs, which are created by conditionalizing the probability distribution on not
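A minimal simulation sketch of the distinction the question is about (the structural model, probabilities, and variable names here are invented for illustration, not taken from the post): conditioning on X filters samples and so inherits the confounder's influence on X, while do(X) overrides X's mechanism, severing the Z→X edge.

```python
import random

random.seed(0)

def sample(do_x=None):
    """Toy structural causal model: Z -> X, Z -> Y, X -> Y."""
    z = random.random() < 0.5
    if do_x is None:
        x = random.random() < (0.9 if z else 0.1)  # Z confounds X
    else:
        x = do_x                                   # intervention: ignore Z
    y = random.random() < 0.4 + 0.4 * z + 0.2 * x
    return z, x, y

N = 100_000
obs = [sample() for _ in range(N)]
p_y_cond = sum(y for _, x, y in obs if x) / sum(x for _, x, _ in obs)

intv = [sample(do_x=True) for _ in range(N)]
p_y_do = sum(y for _, _, y in intv) / N

# Conditioning overstates X's effect, because X = 1 mostly co-occurs with Z = 1.
assert p_y_cond > p_y_do + 0.05
```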
@font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')}
@font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')}
@font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold}
@font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')}
@font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic}
@font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')}
@font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')}
@font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')}
@font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')}
@font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold}
@font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')}
@font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic}
@font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')}
@font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')}
@font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic}
@font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')}
@font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')}
@font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')}
@font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')}
@font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')}
@font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')}
@font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')}
@font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold}
@font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}
event, but do(event). What does this mean? It's easy enough to explain *in causal terms*: You make it so event occurs without changing any of its causal antecedents. But of course that fails to explain causality. How could it be explained without that?

Discuss

### WebBundles and URL Randomization

*Disclosure: I work on ads at Google; this is a personal post.*

Peter Snyder of Brave recently wrote a post on the proposed WebBundle specification arguing that it makes URL randomization easier and so is harmful to content blocking. The core of the post is incorrect, however, based on a misunderstanding of what can be done today and what WebBundles make easier. Snyder and I discussed this some on HN, and I wanted to take a step back and try to write up the issues more clearly.

A WebBundle (see explainer) allows you to serve a single file that the browser can treat as multiple files. This solves many long-standing problems, some of which I've been working around for a long time. A classic one is that many sites will have a large number of small files. Everyone who visits the site will need essentially all of these files, and you lose a lot of performance in individually requesting each one. When I worked on open source web page optimization software (mod_pagespeed) it could combine CSS, JS, and images so you could request a single 'bundle' for each, but it was not ideal:

- Instead of one bundle containing the resources the site needs, you have at least three: CSS, JS, images.

- There are differences in error handling and recovery that mean you can't just concatenate CSS or JS files. Even with code to work around these issues, there were still differences that meant automatic bundling was a bit risky.

- Image combining (spriting) is especially difficult, and our tool was less a fully automatic image combiner than a tool to take the toil out of doing it manually.

Instead of using a dynamic tool like mod_pagespeed, you can also do this kind of concatenation at build time (webpack, browserify, parcel, rollup, etc.), but this doesn't fix any of these problems.

A WebBundle is a much more straightforward approach: you tell the browser explicitly about your bundling, instead of trying to glue a bunch of things together and pass them off to the browser as one.

I'm no longer working on mod_pagespeed, but these issues still come up; in my work now I'm interested in using WebBundles to allow a single ad response to provide multiple ads.

My understanding of Snyder's view, from their post, HN comments, and spec comments, is that bundles make it much easier to randomize URLs, and bypass URL-based content blockers. Specifically, that if you deliver a single bundle with both HTML and resources, it becomes very easy for you to randomize URLs.

This claim is based, however, on a misunderstanding of what bundling makes easier. Sites that deliver both the content and the ads, such as Facebook, often already use this kind of randomization to make things harder for ad blockers. Since they're delivering the HTML, they can randomize all their URLs on every deploy, and it's hard for blockers to keep up. If blockers did catch up, they could use cookies to randomize their URLs on a per-user basis, or even encrypt them before sending to the client. All of these approaches work much better for them than sending a bundle with the HTML and JS in a single file, because that would mean the JS could not be cached on the client. (If they were willing to give up on caching they could already use HTTP/2's server push or plain inlining to send the JS and HTML at once.)
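To make that existing dynamic concrete, here is a toy sketch of per-user URL randomization with a keyed hash, something a first party that serves its own HTML can already do today, without bundles. All names here (the secret, the `/assets/` path) are hypothetical, not any site's actual scheme:

```python
import hashlib
import hmac

SECRET = b"rotated-per-deploy-key"  # hypothetical server-side secret, rotated each deploy

def randomized_url(user_id: str, resource: str) -> str:
    """Derive a per-user, per-deploy resource URL that a static
    blocklist cannot predict, while staying stable enough to cache."""
    digest = hmac.new(SECRET, f"{user_id}:{resource}".encode(), hashlib.sha256)
    return f"/assets/{digest.hexdigest()[:16]}.js"

# The same user keeps hitting the same (cacheable) URL...
assert randomized_url("alice", "ads.js") == randomized_url("alice", "ads.js")
# ...but different users see different URLs, so one shared filter rule can't cover them all.
assert randomized_url("alice", "ads.js") != randomized_url("bob", "ads.js")
```

The point is that this trick requires controlling the HTML that references the URL, which is exactly what cross-party ad integrations lack.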

On the other hand, sites like this are few. It is worth it for Facebook or Google Search to run their own ads, but most sites instead use an ad network. These networks typically integrate client-side, with JS. For example, to put AdSense ads on your page you include something like:

```html
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<ins class="adsbygoogle"
     style="display:block"
     data-ad-client="ca-pub-NNNNNNNN"
     data-ad-slot="NNNNNNNN"
     data-ad-format="auto"
     data-full-width-responsive="true">
</ins>
```

This loads adsbygoogle.js (the ad JS), which reads configuration from the <ins> and handles requesting and rendering the ads. Ad blockers recognize adsbygoogle.js by its URL and prevent it from loading and running.
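That URL-based recognition can be sketched roughly like this. This is a toy illustration only; real blockers use large curated filter lists (such as EasyList) with a much richer rule syntax:

```python
import re

# Toy stand-in for a blocker's filter list; real lists have tens of
# thousands of rules and dedicated matching syntax.
BLOCK_PATTERNS = [
    re.compile(r"pagead2\.googlesyndication\.com/pagead/js/adsbygoogle\.js"),
    re.compile(r"/ads/.*\.js$"),
]

def should_block(url: str) -> bool:
    """Return True if any filter pattern matches the requested URL."""
    return any(p.search(url) for p in BLOCK_PATTERNS)

assert should_block("https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js")
assert not should_block("https://example.com/site.css")
```

A consistent integration-point URL is what makes this matching possible, which is the crux of the argument below.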

Neither the publisher site nor the ad network can randomize the URL of the ad JS on their own: it's a cross-party integration point. If they worked closely together, perhaps by moving integration from client-side to server-side, then they could randomize URLs easily even with today's technology. A server-side integration is much more technically difficult, however, which is why I think we rarely see them. Bundles don't change the nature of this situation: a client-side integration still needs a consistent (and, hence, blockable) URL to load the ad JS, while a server-side integration doesn't become any easier.

Snyder is concerned about a world in which ad blockers aren't able to operate because they can't recognize ad resources by URL. While I think this is a reasonable concern, the WebBundle proposal is orthogonal to this problem, and does not bring us any closer to that sort of world.

*Comment via: facebook*

Discuss

### [AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents


**Newsletter #114**

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

HIGHLIGHTS

The Alignment Problem for Bayesian History-Based Reinforcement Learners *(Tom Everitt et al)* (summarized by Rohin): After forgetting its existence for quite a while, I've finally read through this technical report (which won first place in round 2 of the AI alignment prize (AN #3)). It analyzes the alignment problem from an AIXI-like perspective, that is, by theoretical analysis of powerful Bayesian RL agents in an online POMDP setting.

In this setup, we have a POMDP environment, in which the environment has some underlying state, but the agent only gets observations of the state and must take actions in order to maximize rewards. The authors consider three main setups: 1) rewards are computed by a preprogrammed reward function, 2) rewards are provided by a human in the loop, and 3) rewards are provided by a *reward predictor* which is trained interactively from human-generated data.

For each setup, they consider the various objects present in the formalism, and ask how these objects could be corrupted, misspecified, or misleading. This methodology allows them to identify several potential issues, which I won't get into as I expect most readers are familiar with them. (Examples include wireheading and threatening to harm the human unless they provide maximal reward.)

They also propose several tools that can be used to help solve misalignment. In order to prevent reward function corruption, we can have the agent *simulate* the future trajectory, and *evaluate* this future trajectory with the current reward, removing the incentive to corrupt the reward function. (This was later developed into current-RF optimization (AN #71).)
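A minimal sketch of the current-RF idea, with a toy world model and toy rewards standing in for the paper's formalism: candidate plans are scored by simulating their trajectories and evaluating them with the *current* reward function, so tampering with the future reward function earns the agent nothing:

```python
# Toy stand-ins: a "plan" is simply its trajectory of states, and the
# rewards below are illustrative, not from the paper.

def simulate(plan):
    """Toy world model: the plan already is its state trajectory."""
    return plan

def current_reward(state):
    """The agent's current, uncorrupted reward function."""
    return {"work": 1.0, "corrupt_reward_fn": 0.0}.get(state, 0.0)

def plan_value(plan, reward_fn):
    """Score a plan by evaluating its simulated trajectory with a
    fixed reward function, ignoring any future reward changes."""
    return sum(reward_fn(s) for s in simulate(plan))

plans = [["work", "work"], ["corrupt_reward_fn", "work"]]
best = max(plans, key=lambda p: plan_value(p, current_reward))
assert best == ["work", "work"]  # corrupting the reward function is never rewarded
```

Because the corrupted reward function is never consulted, the corruption step is wasted effort from the planner's perspective.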

Self-corruption awareness refers to whether or not the agent is aware that its policy can be modified. A self-corruption *unaware* agent is one that behaves as though its current policy function will never be changed, effectively ignoring the possibility of corruption. It is not clear whether awareness or unawareness is more desirable: while a self-corruption unaware agent will be more corrigible (in the MIRI sense), it also will not preserve its utility function, as it believes that even if the utility function changes the policy will not change.

Action-observation grounding ensures that the agent only optimizes over policies that work on histories of observations and actions, preventing agents from constructing entirely new observation channels ("delusion boxes") which mislead the reward function into thinking everything is perfect.

The interactive setting in which a reward predictor is trained based on human feedback offers a new challenge: that the human data can be corrupted or manipulated. One technique to address this is to get *decoupled* data: if your corruption is determined by the current state s, but you get feedback about some different state s', as long as s and s' aren't too correlated it is possible to mitigate potential corruptions.

Another leverage point is how we decide to use the reward predictor. We could consider the *stationary* reward function, which evaluates simulated trajectories with the *current* reward predictor, i.e. assuming that the reward predictor will never be updated again. If we combine this with self-corruption unawareness (so that the agent also never expects its policy to change), then the incentive to corrupt the reward predictor's data is removed. However, the resulting agent is *time-inconsistent*: it acts as though its reward never changes even though in practice it does, and so it can make a plan and start executing it, only to switch over to a new plan once the reward changes, over and over again.

The *dynamic* reward function avoids this pitfall by evaluating the kth timestep of a simulated trajectory while also taking an expectation over the future data that the reward predictor will get. This agent is no longer time-inconsistent, but it is now incentivized to manipulate the data. This can be fixed by building a single integrated Bayesian agent, which maintains a single model that predicts both the reward data and the environment. The resulting agent is time-consistent, utility-preserving, and has no direct incentive to manipulate the data. (This is akin to the setup in assistance games / CIRL (AN #69).)

One final approach is to use a *counterfactual* reward function, in which the data is simulated in a counterfactual world where the agent executed some known safe default policy. This no longer depends on the current time, and is not subject to data corruption since the data comes from a hypothetical that is independent of the agent's actual policy. However, it requires a good default policy that does the necessary information-gathering actions, and requires the agent to have the ability to simulate human feedback in a counterfactual world.

**Read more:** Tom Everitt's PhD thesis

**Rohin's opinion:** This paper is a great organization and explanation of several older papers (that haven't been summarized in this newsletter because they were published before 2018 and I read them before starting this newsletter), and I wish I had read it sooner. It seems to me that the integrated Bayesian agent is the clear winner -- the only downside is the computational cost, which would be a bottleneck for any of the models considered here.

One worry I have with this sort of analysis is that the guarantees you get out of it depend quite a lot on how you model the situation. For example, let's suppose that after I sleep I wake up refreshed and more capable of intellectual work. Should I model this as "policy corruption", or as a fixed policy that takes as an input some information about how rested I am?

TECHNICAL AI ALIGNMENT

ITERATED AMPLIFICATION

Universality Unwrapped *(Adam Shimi)* (summarized by Rohin): This post explains the ideas behind universality and ascription universality, in a more accessible way than the original posts and with more detail than my summary.

Mesa-Search vs Mesa-Control *(Abram Demski)* (summarized by Rohin): This post discusses several topics related to mesa optimization, and the ideas in it led the author to update towards thinking inner alignment problems are quite likely to occur in practice. I’m not summarizing it in detail here because it’s written from a perspective on mesa optimization that I find difficult to inhabit. However, it seems to me that this perspective is common so it seems fairly likely that the typical reader would find the post useful.

Radical Probabilism *(Abram Demski)* (summarized by Rohin): The traditional Bayesian treatment of rational agents assumes that the only way an agent can get new information is by getting some new observation that is known with probability 1. However, we would like a theory of rationality that can allow for agents that also get more information by thinking longer. In such a situation, some of the constraints imposed by traditional Bayesian reasoning no longer apply. This detailed post explores what constraints remain, and what types of updating are allowable under this more permissive definition of rationality.

**Read more:** The Bayesian Tyrant

**Rohin's opinion:** I particularly enjoyed this post; it felt like the best explanation in relatively simple terms of a theory of rationality that is more suited to bounded agents that cannot perfectly reason about an environment larger than they are. (Note “simple” really is relative; the post still assumes a lot of technical knowledge about traditional Bayesianism.)

My AI Timelines Have Sped Up *(Alex Irpan)* (summarized by Nicholas): Alex Irpan updates his predictions of when AGI will arrive to be sooner:

10% chance by 2035 (previously 2045)

50% chance by 2045 (previously 2050)

90% chance by 2070

The main reasons why are:

- Alex is now more uncertain because the pace of research over the past five years has been more surprising than expected: faster in some domains, but slower in others.

- Accounting for improvements in tooling. New libraries like TensorFlow and PyTorch have accelerated progress. Even CNNs can be used as a “tool” that provides features for downstream tasks like robotic control.

- He previously thought that labeled data might be a bottleneck, based on scaling laws showing that data needs might increase faster than compute; however, semi- and unsupervised learning have improved significantly, GPT-3 being the latest example of this.

- Alex now believes that compute will play a larger role and that compute can scale faster than algorithms because there is large worldwide consumer demand.

The post ends with a hypothetical description of how AGI may happen soon that I will leave out of the summary but recommend reading.

**Nicholas's opinion:** My personal opinion on timelines is that I think it is much more informative to draw out the full CDF/PDF of when we will get to AGI instead of percentages by different years. It isn’t included in the post, but you can find Alex’s here. I end up placing higher likelihood on AGI happening sooner than Alex does, but I largely agree with his reasoning.

More uncertainty than the original prediction seems warranted to me; the original prediction had a very high likelihood of AGI between 2045-2050 that I didn’t understand. Of the rest of the arguments, I agree most strongly with the section on tooling providing a speedup. I’d even push the point farther to say that there are many inputs into current ML systems, and all of them seem to be improving at a rapid clip. Hardware, software tools, data, and the number of ML researchers all seem to be on track to improve significantly over the next decade.

MISCELLANEOUS (ALIGNMENT)

The Problem with Metrics is a Fundamental Problem for AI *(Rachel Thomas et al)* (summarized by Flo): The blog post lists five problems of current AI that are exacerbated by the cheap cost and easy scaling of AI systems combined with the common belief that algorithms are objective and error-free:

1. It is often hard for affected people to address problems in algorithmic decisions

2. The complexity of AI problems can easily lead to a diffusion of responsibility

3. AI can encode biases and sometimes magnify them via feedback loops

4. Big tech companies lack accountability

5. Current AI systems usually focus exclusively on optimizing metrics.

The paper then dives deeper into the last point. The authors review a series of case studies and draw four conclusions:

1. Measured metrics are usually only a proxy for what we really care about: Youtube's terminal goal is certainly not to maximize viewing time, and society does not inherently care about student test scores.

2. Metrics can and will be gamed: Soviet workers would often achieve their production targets at the cost of some unmeasured aspects of performance, reported waiting times in the English healthcare system were distorted once targets were set for them, and evaluating teachers by test scores has led to cheating scandals in the US.

3. Metrics tend to overemphasise short-term concerns, as these are often easier to measure. This can be seen in businesses like Facebook and Wells Fargo that have faced political backlash, worse access to talent pools, or lawsuits because of an excessive focus on click-through rates and quarterly earnings.

4. Tech firms often focus on metrics associated with addictive environments. For example, "engagement" metrics are used as proxies for user preferences but rarely reflect them accurately in contexts that were optimized for these metrics.

The authors then propose three remedies: using multiple metrics to get a more holistic picture and make gaming harder, combining metrics with qualitative accounts, and involving domain experts and stakeholders who would be personally affected by the deployed system.

**Read more:** I’m an AI researcher, and here’s what scares me about AI

**Flo's opinion:** I found this interesting to read, as it does not really seem to be written from the perspective of AI safety but still lists some problems that are related to AI safety and governance. Just think of an AI system tasked to help with realizing human preferences magnifying "biases" in its preference elicitation via unwanted feedback loops, or about the lack of firms' accountability for socioeconomic disturbances their AI systems could create, which the windfall clause (AN #88) was envisioned to mitigate.

REINFORCEMENT LEARNING

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey *(Sanmit Narvekar et al)* (summarized by Zach): For a variety of learning problems, the training process is organized so that new concepts and tasks leverage previously learned information. This can serve as a broad definition of curriculum learning. This paper gives an overview of curriculum learning and a framework to organize various approaches to the curriculum learning problem. One central difficulty is that there is a broad class of methods that can be considered curricula. At one extreme, we have curricula where new tasks are created to speed up learning. At the other extreme, some curricula simply reorder experience samples; the prioritized replay buffer is one such reordering method. Thus, to cover as much of the literature as possible, the authors outline a framework for curriculum learning and then use that structure to classify various approaches. In general, the definition, learning, construction, and evaluation of curricula are all covered in this work. This is done by breaking the curriculum learning problem into three steps: task generation, sequencing, and transfer learning. Using this problem decomposition, the authors give an overview of work addressing each component.
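As a toy illustration of the sequencing step only (the stub learner below is invented for this sketch and is not any method from the survey): train on each task until a success threshold is met, carrying accumulated skill forward to the next task:

```python
import random

random.seed(0)

def train_episode(task_difficulty, skill):
    """Stub learner: succeeds more often as skill exceeds difficulty."""
    return random.random() < min(0.95, 0.5 + (skill - task_difficulty) * 0.1)

def run_curriculum(difficulties, threshold=0.8, window=50, max_episodes=200_000):
    """Simple threshold-based sequencing: stay on a task until the
    cumulative success rate clears the threshold, then advance."""
    skill = 0.0
    for d in difficulties:
        successes = 0
        for episodes in range(1, max_episodes + 1):
            if train_episode(d, skill):
                successes += 1
                skill += 0.01  # crude stand-in for a learning update
            if episodes >= window and successes / episodes >= threshold:
                break
    return skill

final_skill = run_curriculum([1.0, 2.0, 3.0])
assert final_skill > 0.0  # skill accumulated across the task sequence
```

Real curricula replace the stub with an actual RL training loop and much smarter sequencing; the survey's taxonomy covers how those choices are made.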

**Zach's opinion:** Before I read this, I thought of curricula as 'hacks' used to improve training. However, the authors' presentation of connections with transfer learning and experience replay has significantly changed my opinion. In particular, the phrasing of curriculum learning as a kind of 'meta-MDP' seems particularly interesting to me. Moreover, there seem to be interesting challenges in this field. One such challenge is that there does not seem to be much theory about *why* curricula work, which could indicate a point of departure for people interested in safety research. Knowing more about the theory could help answer safety questions. For example, how do we design curricula so that we can guarantee/check the agent is behaving correctly at each step?

Looking for adversarial collaborators to test our Debate protocol *(Beth Barnes)* (summarized by Rohin): OpenAI is looking for people to help test their debate (AN #86) protocol, to find weaknesses that allow a dishonest strategy to win such debates.

**Podcast:** An audio podcast version of the **Alignment Newsletter** is available, recorded by Robert Miles.


### Excusing a Failure to Adjust

Here is a handle for a mistake I make all the time that works really well for me… that is, the *handle* works really well for me. I make this mistake fairly often, but of course I notice it in other people more often.

Imagine that your credence in a belief is represented by a physical slider. When you learn a new piece of evidence or argument or whatever, there are two kinds of mental motions you can react with. You can adjust the slider up or down, or you can explain why you don’t have to adjust the slider at all.

Sometimes the right thing to do really is to not adjust the slider. Sometimes people will think that you should adjust the slider when really you shouldn’t for reasons that are difficult to explain. But this is far far rarer than cases where you should adjust the slider, but instead you focus on coming up with reasons to not adjust the slider. On some days, it seems like 90% of human cognition is coming up with reasons to not adjust the slider.

And of course, for beliefs that you like, you will come up with excuses for refusing to adjust the slider down way more often than you will come up with excuses for refusing to adjust the slider up. And vice versa for beliefs that you dislike.

Probably, you would be better off banning that mental motion altogether. Nearly every new piece of evidence should cause some adjustment, even if it is very small. Most conversations should be about how much to adjust, not about whether to adjust. But banning that mental motion is difficult, and before you can, you must learn to notice the motion, and learn how it feels different from other motions.
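For what it's worth, the slider metaphor maps neatly onto Bayesian updating in odds form (my gloss, not the author's): any evidence whose likelihood ratio differs from 1 moves the slider at least a little, and only a likelihood ratio of exactly 1 licenses leaving it untouched.

```python
def update_credence(prior, likelihood_ratio):
    """Bayesian update in odds form: posterior odds = prior odds * likelihood ratio,
    where likelihood_ratio = P(evidence | belief) / P(evidence | not belief)."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

# Even weak evidence (likelihood ratio 1.2) nudges a 50% slider up to ~54.5%.
nudged = update_credence(0.50, 1.2)

# Only a likelihood ratio of exactly 1 leaves the slider where it was.
unchanged = update_credence(0.50, 1.0)
```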

This is the best tool I have found for noticing so far. I ask myself: “Am I trying to explain why I don’t need to adjust the slider?” Something different might work better for you, but I highly recommend at least trying to notice the difference.


### nostalgebraist: Recursive Goodhart's Law

Would kind of like to excerpt the whole post, but that feels impolite, so I'll just quote the first four paragraphs and then suggest reading the whole thing:

There’s this funny thing about Goodhart’s Law, where it’s easy to say “being affected by Goodhart’s Law is bad” and “it’s better to behave in ways that aren’t as subject to Goodhart’s Law,” but it can be very *hard* to explain why these things are true to someone who doesn’t already agree.

Why? Because any such explanation is going to involve some step where you say, “see, if you do that, *the results are worse*.” But this requires some standard by which we can judge results … and any such standard, when examined closely enough, has Goodhart problems of its own.

There are times when you can’t convince someone without a formal example or something that amounts to one, something where you can say “see, Alice’s cautious heuristic strategy wins her $X while Bob’s strategy of computing the global optimum under his world model only wins him the smaller $Y, which is objectively worse!”

But if you’ve gotten to this point, you’ve conceded that there’s *some* function whose global optimum is the one true target. It’s hard to talk about Goodhart at all without something like this in the background – how can “the metric fails to capture the true target” be the problem unless there is some true target?
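To make the quoted point concrete, here is a toy example of my own (not from the post): a proxy metric that only counts raw effort, while the true value has diminishing and eventually negative returns. Note that judging the proxy-optimizer "worse" requires access to the true-value function itself, which is exactly the regress nostalgebraist is pointing at.

```python
def true_value(x):
    """The 'one true target': returns diminish and eventually go negative."""
    return x - x ** 2 / 10.0

def proxy_metric(x):
    """The measurable proxy: it only counts raw effort."""
    return x

candidates = [i * 0.5 for i in range(41)]          # x in [0, 20]
best_by_proxy = max(candidates, key=proxy_metric)  # picks x = 20.0
best_by_value = max(candidates, key=true_value)    # picks x = 5.0

# true_value(20.0) is -20.0 while true_value(5.0) is 2.5, so maximizing the
# proxy is "objectively worse". But that verdict only exists relative to
# true_value, the very standard Goodhart says we cannot fully specify.
```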


### Predictions for Neuralink Working Device

Elon Musk just announced a "live webcast of a working Neuralink device" Friday at 3PM PT. In the same vein as last year's Neuralink prediction post, feel free to specify what is likely/unlikely to be revealed, giving your credence. We will likely do a "scoring of the predictions" after the event (like in last year's post).


### On Suddenly Not Being Able to Work

Cross-posted, as always, from Putanumonit.

Since I started Putanumonit five years ago I’ve rarely ever gone two weeks without posting anything. I’ve now gone a full month.

It’s not that I have nothing to write about. I have several half-finished drafts, Twitter threads that can be expanded into posts, marked-up books for review, and more. This post is none of the above; it’s about the reason why I have not been writing. And chiefly, it’s a way to just get back to writing by any means possible, to make this a habit again.

If this sounds like a boring post that you don’t want to read, you are probably correct. You can instead watch my video interview with Chase Harris about COVID, Rationality, social reality, and more.

The reason I haven’t blogged in a month is that I haven’t been able to do my job. My work laptop is plugged in, my colleagues are online, my deadlines are approaching fast, and yet I have consistently failed to put in the required (and not very large) number of hours it would take me to catch up on my work. I am not sure why this has been happening, and I have observed this happening as if from the outside with some bemusement.

My brain keeps “work” and “blogging” very near each other on the mental shelf. Both are obligations of a sort. Both are done on the same screen with the same keyboard sitting on the same chair. And so, since my rising panic at falling behind on work is ever-present, whenever I sit down to write I think I should do work instead — it pays my rent and has specific people that depend on it, as opposed to an abstract readership. So whenever I try to blog I switch to trying to work. Then I fail and end up doing neither.

This experience of not being able to do work is very strange. On the face of it, nothing is different in August from March and April when I was working from home quite productively on similar projects. I am seeing more friends now and spending more time outdoors but these usually happen in the evening, when I have already spent the day mysteriously not working.

I don’t particularly dread my work or feel stressed about it. It’s as if the job is shrouded in fog — it’s hard to get started and when I do I’m easily distracted by other things which capture my attention for unusually long periods of time. I have also noticed in the last two weeks that my stomach acts up whenever I log in to the work laptop, even several times a day, making me delay the start of work while I run to the bathroom. I thought “gut feeling” was just a metaphor.

I talked to a friend who had similar psychosomatic anxiety responses to working on her thesis, to the point of suffering loss of vision and other strange symptoms. Surprisingly, the solution she chose was just to power through. Even more surprisingly, it worked — she now holds a PhD. This is basically my mainline strategy for now.

Other friends suggested other approaches, although all of them would require not working for a while (on purpose). Unfortunately, I feel like I can’t afford to do that with the amount of work I have piled up and the upcoming deadlines. If I somehow get a lot done and meet the deadlines, I actually feel that it would be much easier to go back to working normally. The more I have to do the bigger the anxiety around my work builds, and thus the harder it is to do it. I realized this a short while ago, but the trap had already shut above me.

I’m really taking a big Rationality “L” on this one. It feels like a failure of planning, introspection, and organization that __I would have expected to avoid__. I guess anything other than asking myself constantly *“am I failing at life in some dumb and obvious way”* is unforgivable hubris.

One thing that’s important to know about my job is that it is utterly lacking in intrinsic motivation. The work is not very fun, challenging, impactful, or meaningful. __No one is cheering me on as I do it__ or admiring me for it. It is something that the world needs in the abstract sense of someone willing to pay for it in a fair market, but I would certainly not care whether it was done or not if it wasn’t my responsibility.

I’m quite grateful to my company for paying us a fair salary instead of trying to sell us on some vision of how our work is world-changing or pride-worthy. We’re paid in real cash rather than made-up meaning — it’s an honest deal and I appreciate it. At least, part of me does.

The problem is that my reasons for doing the work are very distant from my subconscious systems of motivation. Even the main reason, money, is not strongly stimulating to me these days. I’m not shopping or eating at restaurants or planning vacations, my cost of living has gone way down because COVID. If I was fired tomorrow my lifestyle wouldn’t change. My job also fit into some broader plans I had for my life this year and next, plans that have been cancelled or thrown up in the air because COVID as well. This uncertainty about the future is probably also sucking away my motivation.

It is strange to think how much cognitive effort it takes to remember why I want to do the work. Perhaps more effort than it takes to actually do it!

When I first joined the company I committed to working for two years, then for another two (for work visa purposes). Having a hard contract that I intended to honor prevented me from spending mental energy on thinking of alternatives and second-guessing myself. I have a choice now, and __choices are bad__.

My friend recently tweeted that he’s *addicted to indecision*, and I think I know what he’s talking about. Having choices feels bad, but giving up choices also feels bad. The longer you spend with a choice, the more it feels like whatever decision you make has to be absolutely awesome. If the decision space is vague and no choice is clearly superior you can get stuck in a way that feels a lot like addiction.

Several of my young friends are dealing with the same thing. They’re feeling anxious about having a life full of choices ahead of them, choices about careers, relationships, places to live. For many, all their choices seem worse now than they did in 2019, and not any more concrete or dependable. Others are waiting for a partner / friend / employer to make a choice and are resenting the other person’s addiction to indecision.

Another friend said:

Each decision is like a little death. But putting off the choice isn’t actually letting me live forever.


### I Started a Sports and Gambling Substack

The substack is here.

The latest post is here and contains predictions for tomorrow’s baseball games. I also posted predictions for yesterday’s games here, and here you can find my offensive team ratings. Pitcher ratings will be posted soon.

I wanted to try out the Substack system and software as a potential landing point, as per my discussion regarding the forced switch to the new WordPress editor. I also wanted to avoid conflating the sports modeling and gambling talk with my rationality blog, as I expect them to be relevant to the interests of different groups of people.

If you have meaningful feedback for what the model has to say, or are finding these posts interesting or useful, let me know and also hopefully subscribe to the substack. If not, I’ll likely periodically mention it is there, but will do my best to keep the two efforts distinct.


### Sunday, 20/8/30, 12pm PDT – Tagging Celebration: Habryka/Crawford + Party

**FB Event**

Woop! Woop! This past Saturday, gallant taggers brought our immense first tagging campaign to a close. The archives have been swept and every post with over 25 karma has been given at least one tag. To date, 15,440 tags have been applied to 8,022 posts.

See the full celebration post here.

While this is only the beginning for tagging, what a beginning it is! We think this deserves some celebration. We're preparing a two-part event:

**Part 1: Oliver Habryka and Jason Crawford discuss Intellectual Progress**

**https://us02web.zoom.us/j/82547847213**

Back in 2017, when the future of LessWrong was in doubt, Habryka led the revival efforts, pulling together a team and creating LessWrong 2.0. Central to Habryka's vision of LessWrong was that it would accelerate intellectual progress on the important problems facing humanity. *Intellectual progress* here simply means progress on building knowledge and understanding.

For several years, Jason has been seeking to answer the question of which factors have led to human progress in general: how did we go from living at the mercy of nature with merely stone tools and fire to buildings, electricity, medicine, legal systems, etc? Jason's research, and info about the broader Progress Studies movement, can be found at his blog, Roots of Progress (with many pieces crossposted to LessWrong).

Of course, the history of human progress, in general, is tightly woven with humanity's intellectual progress. When our understanding of the world increased, so did our ability to shape it (for better or worse).

To celebrate the new tagging system (itself designed with the goal of intellectual progress) we've decided to have Habryka and Jason chat about questions such as:

- What historical factors were important for intellectual progress?
- What conditions are most important to create now in order to get intellectual progress?

**Starting at 12:00 PDT, Habryka and Jason will chat on Zoom for ~1 hour, including some Q&A**

**Part 2: Party in Rational Woods (our Topia location)**

**https://topia.io/rational-woods** (feel free to check it out now)

Following the talk, we will migrate to a social environment that more easily allows people to strike up small group conversations. Raemon has created the marvelous Rational Woods on the Topia platform. You can check it out now for social hangouts. There are some neat features, so make sure to click on things.

This section will begin following the main talk and will last for several hours, so long as people are having a good time. Conceivably, we could play Breakfast is Served or Person Do Thing with LessWrong tags if there's enough enthusiasm. Not sure if that'll work well or not.

Also, I hope everyone can treat our top taggers as VIPs at the party. They deserve it. Three cheers for them!

**The Details**

**When: Sunday, 30th August, 12:00PM PDT / 3:00PM EDT / 7:00PM UTC**

**Where: **

**FB Event: ****https://www.facebook.com/events/2827544884191033/**

