# LessWrong.com News

A community blog devoted to refining the art of rationality
Updated: 46 minutes 35 seconds ago

### Morality is Scary

51 minutes 31 seconds ago
Published on December 2, 2021 6:35 AM GMT

I'm worried that many AI alignment researchers and other LWers have a view of how human morality works that really only applies to a small fraction of all humans (notably moral philosophers and themselves). In this view, people know or at least suspect that they are confused about morality, and are eager or willing to apply reason and deliberation to find out what their real values are, or to correct their moral beliefs. Here's an example of someone who fits this view:

I’ve written, in the past, about a “ghost” version of myself — that is, one that can float free from my body; which travel anywhere in all space and time, with unlimited time, energy, and patience; and which can also make changes to different variables, and play forward/rewind different counterfactual timelines (the ghost’s activity somehow doesn’t have any moral significance).

I sometimes treat such a ghost kind of like an idealized self. It can see much that I cannot. It can see directly what a small part of the world I truly am; what my actions truly mean. The lives of others are real and vivid for it, even when hazy and out of mind for me. I trust such a perspective a lot. If the ghost would say “don’t,” I’d be inclined to listen.

I'm currently reading The Status Game by Will Storr (highly recommended BTW), and found in it the following description of how morality works in most people, which matches my own understanding of history and my observations of humans around me:

The moral reality we live in is a virtue game. We use our displays of morality to manufacture status. It’s good that we do this. It’s functional. It’s why billionaires fund libraries, university scholarships and scientific endeavours; it’s why a study of 11,672 organ donations in the USA found only thirty-one were made anonymously. It’s why we feel good when we commit moral acts and thoughts privately and enjoy the approval of our imaginary audience. Virtue status is the bribe that nudges us into putting the interests of other people – principally our co-players – before our own.

We treat moral beliefs as if they’re universal and absolute: one study found people were more likely to believe God could change physical laws of the universe than he could moral ‘facts’. Such facts can seem to belong to the same category as objects in nature, as if they could be observed under microscopes or proven by mathematical formulae. If moral truth exists anywhere, it’s in our DNA: that ancient game-playing coding that evolved to nudge us into behaving co-operatively in hunter-gatherer groups. But these instructions – strive to appear virtuous; privilege your group over others – are few and vague and open to riotous differences in interpretation. All the rest is an act of shared imagination. It’s a dream we weave around a status game.

The dream shifts as we range across the continents. For the Malagasy people in Madagascar, it’s taboo to eat a blind hen, to dream about blood and to sleep facing westwards, as you’ll kick the sunrise. Adolescent boys of the Marind of South New Guinea are introduced to a culture of ‘institutionalised sodomy’ in which they sleep in the men’s house and absorb the sperm of their elders via anal copulation, making them stronger. Among the people of the Moose, teenage girls are abducted and forced to have sex with a married man, an act for which, writes psychologist Professor David Buss, ‘all concerned – including the girl – judge that her parents giving her to the man was a virtuous, generous act of gratitude’. As alien as these norms might seem, they’ll feel morally correct to most who play by them. They’re part of the dream of reality in which they exist, a dream that feels no less obvious and true to them than ours does to us.

Such ‘facts’ also change across time. We don’t have to travel back far to discover moral superstars holding moral views that would destroy them today. Feminist hero and birth control campaigner Marie Stopes, who was voted Woman of the Millennium by the readers of The Guardian and honoured on special Royal Mail stamps in 2008, was an anti-Semite and eugenicist who once wrote that ‘our race is weakened by an appallingly high percentage of unfit weaklings and diseased individuals’ and that ‘it is the urgent duty of the community to make parenthood impossible for those whose mental and physical conditions are such that there is well-nigh a certainty that their offspring must be physically and mentally tainted’. Meanwhile, Gandhi once explained his agitation against the British thusly: ‘Ours is one continual struggle against a degradation sought to be inflicted upon us by the Europeans, who desire to degrade us to the level of the raw Kaffir [black African] … whose sole ambition is to collect a certain number of cattle to buy a wife with and … pass his life in indolence and nakedness.’ Such statements seem obviously appalling. But there’s about as much sense in blaming Gandhi for not sharing our modern, Western views on race as there is in blaming the Vikings for not having Netflix. Moral ‘truths’ are acts of imagination. They’re ideas we play games with.

The dream feels so real. And yet it’s all conjured up by the game-making brain. The world around our bodies is chaotic, confusing and mostly unknowable. But the brain must make sense of it. It has to turn that blizzard of noise into a precise, colourful and detailed world it can predict and successfully interact with, such that it gets what it wants. When the brain discovers a game that seems to make sense of its felt reality and offer a pathway to rewards, it can embrace its rules and symbols with an ecstatic fervour. The noise is silenced! The chaos is tamed! We’ve found our story and the heroic role we’re going to play in it! We’ve learned the truth and the way – the meaning of life! It’s yams, it’s God, it’s money, it’s saving the world from evil big pharma. It’s not like a religious experience, it is a religious experience. It’s how the writer Arthur Koestler felt as a young man in 1931, joining the Communist Party:

‘To say that one had “seen the light” is a poor description of the mental rapture which only the convert knows (regardless of what faith he has been converted to). The new light seems to pour from all directions across the skull; the whole universe falls into pattern, like stray pieces of a jigsaw puzzle assembled by one magic stroke. There is now an answer to every question, doubts and conflicts are a matter of the tortured past – a past already remote, when one lived in dismal ignorance in the tasteless, colourless world of those who don’t know. Nothing henceforth can disturb the convert’s inner peace and serenity – except the occasional fear of losing faith again, losing thereby what alone makes life worth living, and falling back into the outer darkness, where there is wailing and gnashing of teeth.’

I hope this helps further explain why I think even solving (some versions of) the alignment problem probably won't be enough to ensure a future that's free from astronomical waste or astronomical suffering. A part of me is actually more scared of many futures in which "alignment is solved" than of a future where biological life is simply wiped out by a paperclip maximizer.


### Are explanations that explain more phenomena always more unlikely than narrower versions?

3 hours 14 minutes ago
Published on December 1, 2021 6:34 PM GMT

The classic example of a hypothesis explaining more being less likely would of course be conspiracy theories, where adherents add more and more details under the false assumption that this makes the theory more likely rather than less likely.

However, when we have multiple phenomena that follow a similar pattern, isn't it simpler and more likely that there's only one cause for both situations?

Is it possible that in some circumstances it could be more unlikely that the pattern is completely coincidental?

It seems like the problem with conspiratorial thinking isn't that they explain more with less, but that they can selectively pull their facts from a wide range of fact space. Similar to how you can take advantage of people's tribe-brain and narrative thinking to make them think that surgeons are evil, if you want to tell a story about how sugar companies are taking over the world, you can probably find some number of world leaders with ties to Big Glucose.
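The two intuitions in this post can be pulled apart with a little probability arithmetic. The sketch below is only an illustration: every prior and likelihood in it is a made-up assumption, and the function names are hypothetical. It shows that tacking an extra detail onto a hypothesis can only lower its probability, while a single common cause can still come out more probable than coincidence once two matching observations are taken into account.

```python
# Illustrative numbers only: every probability below is an assumption.

def conjunction(p_a: float, p_b_given_a: float) -> float:
    """P(A and B) = P(A) * P(B | A); this can never exceed P(A)."""
    return p_a * p_b_given_a


def posterior_common_cause(prior: float, like_common: float, like_chance: float) -> float:
    """Bayes' rule for 'one shared cause' versus 'pure coincidence'."""
    joint_common = prior * like_common
    joint_chance = (1.0 - prior) * like_chance
    return joint_common / (joint_common + joint_chance)


# 1) Adding an unsupported detail multiplies in a factor <= 1.
p_core = 0.10
p_core_plus_detail = conjunction(p_core, 0.5)

# 2) Two phenomena X and Y that follow a similar pattern.
#    Under a common cause each shows up with probability 0.9;
#    under coincidence each shows up with probability 0.1.
p_both_given_common = 0.9 * 0.9
p_both_given_chance = 0.1 * 0.1
p_common = posterior_common_cause(0.05, p_both_given_common, p_both_given_chance)

print(f"core claim: {p_core:.2f}, with extra detail: {p_core_plus_detail:.2f}")
print(f"posterior for a common cause: {p_common:.2f}")
```

With these assumed inputs, the common-cause hypothesis starts out unlikely and ends up favoured, because one assumption accounts for both observations. The calculation is only as good as its inputs: if the "observations" were selectively pulled from a wide fact space, the likelihoods fed in are misleading.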


### AXRP Episode 12 - AI Existential Risk with Paul Christiano

5 hours 6 minutes ago
Published on December 2, 2021 2:20 AM GMT

This podcast is called AXRP, pronounced axe-urp and short for the AI X-risk Research Podcast. Here, I (Daniel Filan) have conversations with researchers about their papers. We discuss the paper and hopefully get a sense of why it’s been written and how it might reduce the risk of artificial intelligence causing an existential catastrophe: that is, permanently and drastically curtailing humanity’s future potential.

Why would advanced AI systems pose an existential risk, and what would it look like to develop safer systems? In this episode, I interview Paul Christiano about his views of how AI could be so dangerous, what bad AI scenarios could look like, and what he thinks about various techniques to reduce this risk.


Daniel Filan: Hello everybody. Today, I’ll be speaking with Paul Christiano. Paul is a researcher at the Alignment Research Center, where he works on developing means to align future machine learning systems with human interests. After completing a PhD in learning theory in 2017, he went on to research AI alignment at OpenAI, eventually running their language model alignment team. He’s also a research associate at the Future of Humanity Institute in Oxford, a board member at the research non-profit Ought, a technical advisor for Open Philanthropy, and the co-founder of the Summer Program on Applied Rationality and Cognition, a high school math camp. For links to what we’re discussing, you can check the description of this episode and you can read the transcript at axrp.net. Paul, welcome to AXRP.

Paul Christiano: Thanks for having me on, looking forward to talking.

#### How AI may pose an existential threat

Daniel Filan: All right. So, the first topic I want to talk about is this idea that AI might pose some kind of existential threat or an existential risk, and there’s this common definition of existential risk, which is a risk of something happening that would incapacitate humanity and limit its possibilities for development incredibly drastically, in a way comparable to human extinction, or human extinction itself. Is that roughly the definition you use?

Paul Christiano: Yeah. I think I don’t necessarily have a bright line around giant or drastic drops versus moderate drops. I often think in terms of the expected fraction of humanity’s potential that is lost. But yeah, that’s basically how I think of it. Anything that could cause us not to fulfill some large chunk of our potential. I think of AI in particular: a failure to align AI maybe makes the future, in my guess, 10% or 20% worse, or something like that, in expectation. And that makes it one of the worst things. I mean, not the worst: that’s a minority of all the ways we fall short of our potential, but it’s a lot of falling short of our potential. You can’t have that many 20% hits before you’re down to no potential left.
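The closing remark about "20% hits" can be made concrete with a small calculation. This is a sketch under an assumption the conversation does not state: that the hits are independent and compound multiplicatively, so each one removes the same fraction of whatever potential is left. The function names are hypothetical.

```python
def remaining_potential(hit: float, n_hits: int) -> float:
    """Fraction of potential left after n_hits multiplicative hits of size `hit`."""
    return (1.0 - hit) ** n_hits


def hits_until_below(hit: float, threshold: float) -> int:
    """Smallest number of hits after which the remaining fraction is below `threshold`."""
    remaining = 1.0
    n = 0
    while remaining >= threshold:
        remaining *= 1.0 - hit
        n += 1
    return n


if __name__ == "__main__":
    for n in (1, 3, 5, 10):
        print(f"after {n:2d} hits of 20%: {remaining_potential(0.2, n):.3f} left")
    print("hits of 20% until under half is left:", hits_until_below(0.2, 0.5))
    print("hits of 20% until under a tenth is left:", hits_until_below(0.2, 0.1))
```

Under this assumed model, a handful of such hits already removes most of the potential, which is the sense in which 10% to 20% counts as a very large risk.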

Daniel Filan: Yeah. When you say a 10% or 20% hit to human potential in expectation, do you mean if we definitely failed to align AI or do you mean we may or may not fail to align AI and overall that uncertainty equates to a 20%, or 10% to 20% hit?

Paul Christiano: Yeah, that’s unconditionally. So I think if you told me we definitely mess up alignment maximally then I’m more like, oh, now I’m looking at a pretty big, close to 100% drop. I wouldn’t go all the way to 100. It’s not literally as bad probably as a barren earth, but it’s pretty bad.
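The difference between the unconditional figure and the conditional one is the law of total expectation. The sketch below is illustrative only. The 95% conditional loss stands in for "close to 100%", and the assumption of zero loss when alignment goes fine is a simplification, so the implied failure probabilities it prints are not numbers stated in the interview. Function names are hypothetical.

```python
def expected_loss(p_fail: float, loss_if_fail: float, loss_if_ok: float = 0.0) -> float:
    """Unconditional expected loss: P(fail) * loss_if_fail + P(ok) * loss_if_ok."""
    return p_fail * loss_if_fail + (1.0 - p_fail) * loss_if_ok


def implied_p_fail(unconditional: float, loss_if_fail: float, loss_if_ok: float = 0.0) -> float:
    """Invert expected_loss: the failure probability that yields `unconditional`."""
    return (unconditional - loss_if_ok) / (loss_if_fail - loss_if_ok)


if __name__ == "__main__":
    assumed_loss_if_fail = 0.95  # stand-in for "close to 100%", an assumption
    for unconditional in (0.10, 0.20):
        p = implied_p_fail(unconditional, assumed_loss_if_fail)
        print(f"unconditional loss {unconditional:.0%} -> implied P(fail) {p:.1%}")
```

This treats failure as all-or-nothing, whereas the discussion allows for degrees of failure, so it should be read as a way of relating the two kinds of number rather than as an estimate.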

Daniel Filan: Okay. Yeah. Supposing AI goes poorly or there’s some kind of existential risk posed by some kind of, I guess really bad AI, what do you imagine that looking like?

Paul Christiano: Yeah. So I guess, I think most often about alignment, although I do think there are other ways that you could imagine AI going poorly.

Daniel Filan: Okay. And what’s alignment?

Paul Christiano: Yeah. So by alignment, I mean - I guess a little bit more specifically, we could say intent alignment - I mean the property that your AI is trying to do what you want it to do. So we’re building these AI systems. We imagine that they’re going to help us. They’re going to do all the things humans currently do for each other. They’re going to help us build things. They’re going to help us solve problems. A system is intent aligned if it’s trying to do what we want it to do. And it’s misaligned if it’s not trying to do what we want it to do. So a stereotypical bad case is you have some AI system that is sort of working at cross purposes to humanity. Maybe it wants to ensure that in the long run there are a lot of paperclips, and humanity wants human flourishing. And so the future is then some compromise between paperclips and human flourishing. And if you imagine that you have AI systems a lot more competent than humans that compromise may not be very favorable to humans. And then you might be basically all paperclips.

Daniel Filan: Okay. So this is some world where you have an AI system, and the thing it’s trying to do is not what humans want it to do. And then not only is it a typical bad employee or something, it seems you think that it somehow takes over a bunch of stuff or gains some other power. How are you imagining it being much, much worse than having a really bad employee today?

Paul Christiano: I think that the bad employee metaphor is not that bad. And maybe this is a place I part ways from some people who work on alignment. And the biggest difference is that you can imagine heading for a world where virtually all of the important cognitive work is done by machines. So it’s not as if you have one bad employee; it’s as if for every flesh and blood human there were 10 bad employees.

Daniel Filan: Okay.

Paul Christiano: And if you imagine a society in which almost all of the work is being done by these inhuman systems who want something that’s significantly at cross purposes, it’s possible to have social arrangements in which their desires are thwarted, but you’ve kind of set up a really bad position. And I think the best guess would be that what happens will not be what the humans want to happen, but what the systems who greatly outnumber us want to happen.

Daniel Filan: Okay. So we delegate a bunch of cognitive work to these AI systems, and they’re not doing what we want. And I guess you further think it’s going to be hard to un-delegate that work. Why do you think it will be hard to un-delegate that work?

Paul Christiano: I think there’s basically two problems. So one is, if you’re not delegating to your AI then what are you delegating to? So if delegating to AI is a really efficient way to get things done and there’s no other comparably efficient way to get things done, then it’s not really clear, right? There might be some general concern about the way in which AI systems are affecting the world, but it’s not really clear that people have a nice way to opt out. And that might be a very hard coordination problem. That’s one problem. The second problem is just, you may be unsure about whether things are going well or going poorly. If you imagine again, this world where it’s like there’s 10 billion humans and 100 billion human-level AI systems or something like that: if one day it’s like, oh, actually that was going really poorly that may not look like employees have embezzled a little money, it may instead look like they grabbed the machinery by which you could have chosen to delegate to someone else. It’s kind of like the ship has sailed once you’ve instantiated 100 billion of these employees to whom you’re delegating all this work. Maybe employee is kind of a weird or politically loaded metaphor. But the point is just you’ve made some collective system much more powerful than humans. One problem is you don’t have any other options. The other is that system could clearly stop you. Over time, eventually, you’re not going to be able to roll back those changes.

Daniel Filan: Okay.

Paul Christiano: Because almost all of the people doing anything in the world don’t want you to. “People” in quotes, don’t want you to roll back those changes.

Daniel Filan: So some people think, probably what’s going to happen is one day all humans will wake up dead. You might think that it looks like we’re just stuck on earth and AI systems get the whole rest of the universe or keep expanding until they meet aliens or something. What concretely do you think it looks like after that?

Paul Christiano: I think it depends both on technical facts about AI and on some facts about how we respond. So some important context on this world: I think by default, if we weren’t being really careful, one of the things that would happen is AI systems would be running most militaries that mattered. So when we talk about all of the employees are bad, we don’t just mean people who are working in retail or working as scientists, we also mean the people who are taking orders when someone is like, “We’d like to blow up that city,” or whatever.

Daniel Filan: Yep.

Paul Christiano: So by default I think exactly how that looks depends on a lot of things but in most of the cases it involves… the humans are this tiny minority that’s going to be pretty easily crushed. And so there’s a question of like, do your AI systems want to crush humans, or do they just want to do something else with the universe, or what? If your AI systems wanted paperclips and your humans were like, “Oh, it’s okay. The AIs want paperclips. We’ll just turn them all off,” then you have a problem at the moment when the humans go to turn them all off or something. And that problem may look like the AIs just say like, “Sorry, I don’t want to be turned off.” And it may look like, and again, I think that could get pretty ugly if there’s a bunch of people like, “Oh, we don’t like the way in which we’ve built all of these machines doing all of this stuff.”

Paul Christiano: If we’re really unhappy with what they’re doing, that could end up looking like violent conflict, it could end up looking like people being manipulated to go on a certain course. It kind of depends on how humans attempt to keep the future on track, if at all. And then what resources are at the disposal of AI systems that want the future to go in this inhuman direction? Yeah. I think that probably my default visualization is humans won’t actually make much effort, really. We won’t be in the world where it’s all the forces of humanity arrayed against the forces of machines. It’s more just the world will gradually drift off the rails. By “gradually drift off the rails” I mean humans will have less and less idea what’s going on.

Paul Christiano: Imagine some really rich person who on paper has a ton of money and is asking for things to happen, but they give instructions to their subordinates and then somehow nothing really ends up ever happening. They don’t know who they’re supposed to talk to and they are never able to figure out what’s happening on the ground or who to hold accountable. That’s kind of my default picture. I think the reason that I have that default picture is just that, in cases where we fail, I don’t expect humans to really be pushing back that hard. I think if we were really unhappy with that situation, and you really are messing up alignment, then instead of gradually drifting off the rails it looks more like an outbreak of violent conflict or something like that.

Daniel Filan: So, I think that’s a good sense of what you see as the risks of having really smart AIs that are not aligned. Do you think that that is the main kind of AI-generated existential risk to worry about, or do you think that there are others that you’re not focusing on but they might exist?

Paul Christiano: Yeah. I think that there’s two issues here. One is that I kind of expect a general acceleration of everything that’s happening in the world. So just as the world now, you might think that it takes 20 to 50 years for things to change a lot. Long ago it used to take hundreds of years for things to change a lot. I do expect we will live to see a world where it takes a couple years and then maybe a couple months for things to change a lot. In some sense that entire acceleration is likely to be really tied up with AI. If you’re imagining the world where next year the world looks completely different and is much larger than it was this year, that involves a lot of activity that humans aren’t really involved in or understanding.

Paul Christiano: So I do think that a lot of stuff is likely to happen. And from our perspective it’s likely to be all tied up with AI. I normally don’t think about that because I’m sort of not looking that far ahead. That is in some sense I think there’s not much calendar time between the world of now and the world of “crazy stuff is happening every month”, but a lot happens in the interim, right? The only way in which things are okay is if there are AI systems looking out for human interests as you’re going through that transition. And from the perspective of those AI systems, a lot of time passes, or like, a lot of cognitive work happens.

Paul Christiano: So I guess the first point was, I think there are a lot of risks in the future. In some sense from our perspective what it’s going to feel like is the world accelerates and starts getting really crazy. And somehow AI is tied up with that. But I think if you were to be looking on from the outside you might then see all future risks as risks that felt like they were about AI. But in some sense, they’re kind of not our risks to deal with; they’re the risks of the civilization that we become, which is a civilization largely run by AI systems.

Daniel Filan: Okay. So you imagine, look, we might just have really dangerous problems later. Maybe there’s aliens or maybe we have to coordinate well and AIs would somehow be involved.

Paul Christiano: Yeah. So if you imagine a future nuclear war or something like that, or if you imagine all the future progressing really quickly, then from your perspective on the outside what it looks like is now huge amounts of change are occurring over the course of every year, and so one of those changes is that something that would’ve taken hundreds of years now only takes a couple years to get to the crazy destructive nuclear war. And from your perspective, it’s kind of like, “Man, our crazy AI started a nuclear war.” From the AI’s perspective it’s like we had many generations of change and this was one of the many coordination problems we faced, and we ended up with a nuclear war. It’s kind of like, do you attribute nuclear wars to a failure of the industrial revolution, or a risk of the industrial revolution? I think that would be a reasonable way to do the accounting. If you do the accounting that way there are a lot of risks that are AI risks, just in the sense that there are a lot of risks that are industrial revolution risks. That’s one category of answer: I think there’s a lot of risks that kind of feel like AI risks in that they’ll be consequences of crazy AI driven conflict or things like that, just because I view a lot of the future as crazy fast stuff driven by AI systems.

Daniel Filan: Okay.

Paul Christiano: There’s a second category that’s risks that to me feel more analogous to alignment, which are risks that are really associated with this early transition to AI systems, where we will not yet have AI systems competent enough to play a significant role in addressing those risks, so a lot of the work falls to us. I do think there are a lot of non-alignment risks associated with AI there. I’m happy to go into more of those. I think broadly the category that I am most scared about is there’s some kind of deliberative trajectory humanity is kind of along ideally or that we want to be walking along. We want to be better clarifying what we want to do with the universe, what it is we want as humans, how we should live together, et cetera. There’s some question of just, are we happy with where that process goes? Or if you’re a moral realist type, do we converge towards moral truth? If you think that there’s some truth of the matter about what was good, do we converge towards that? But even if you don’t think there’s a fact of the matter you could still say, “Are we happy with the people we become?” And I think I’m scared of risks of that type. And in some sense alignment is very similar to risks of that type, because you kind of don’t get a lot of tries at them.

Paul Christiano: You’re going to become some sort of person, and then after we as a society converge on what we want, or as what we want changes, there’s no one looking outside of the system, who’s like, “Oops! We messed that one up. Let’s try again.” If you went down a bad path, you’re sort of by construction now happy with where you are, but the question is about what you wanted to achieve. So I think there’s potentially a lot of path dependence there. A lot of that is tied up, there are a lot of ways in which the deployment of AI systems will really change the way that humans talk to each other and think about what we want, or think about how we should relate.

Paul Christiano: I’m happy to talk about some of those but I think the broad thing is just, if a lot of thinking is being done not by humans, that’s just a weird situation for humans to be in, and it’s a little bit unclear. If you’re not really thoughtful about that, it’s unclear if you’re happy with it. If you told me that the world with AI and the world without AI converged to different views about what is good, I’m kind of like, “Oh, I don’t know which of those… “ Once you tell me there’s a big difference between those, I’m kind of scared. I don’t know which side is right or wrong, they’re both kind of scary. But I am definitely scared.

#### AI timelines

Daniel Filan: So, I think you said that relatively soon, we might end up in this kind of world where most of the thinking is being done by AI. So there’s this claim that AI is going to get really good, and not only is it getting really good, it’s going to be the dominant way we do most cognitive work, or most thinking maybe. And not only is that eventually going to happen, it’s not going to be too long from now. I guess the first thing I’d like to hear is, by not too long from now do you mean the next 1000 years, the next 100 years, the next 10 years? And if somebody’s skeptical of that claim, could you tell us why you believe that?

Paul Christiano: So I guess there’s a couple parts of the claim. One is AI systems becoming… I think right now we live in a world where AI does not very much change the way that humans get things done. That is, technologies you’d call AI are not a big part of how we solve research questions or how we design new products or so on. There’s some transformation from the world of today to a world in which AI is making us, say, considerably more productive. And there’s a further step to the world where human labor is essentially obsolete, where it’s from our perspective this crazy fast process. So I guess my overall guess is I have a very broad distribution over how long things will take. Especially how long it will take to get to the point where AI is really large, where maybe humans are getting twice as much done, or getting things done twice as quickly due to AI overall.

Paul Christiano: Maybe I think that there’s a small chance that that will happen extremely quickly. So there’s some possibility of AI progress being very rapid from where we are today. Maybe in 10 years, I think there’s a 5% or 10% chance that AI systems can make most things humans are doing much, much faster. And then kind of taking over most jobs from humans. So I think that 5% to 10% chance of 10 years, that would be a pretty crazy situation where things are changing pretty quickly. I think there’s a significantly higher probability in 20 or 40 years. Again in 20 years maybe I’d be at 25%. At 40 years maybe I’m at 50%, something like that. So that’s the first part of the question, when are we in this world where the world looks very different because of AI, where things are happening much faster? And then I think I have a view that feels less uncertain, but maybe more contrarian about… I mean more contrarian than the world at large, very not-that-contrarian amongst the effective altruist or rationalist or AI safety community.

Paul Christiano: So I have another view which I think I feel a little bit less uncertain about, that is more unusual in the world at large, which is that you only have probably on the order of years between AI that has… maybe you can imagine it’s three years between AI systems that have effectively doubled human productivity and AI systems that have effectively completely obsoleted humans. And it’s not clear. There’s definitely significant uncertainty about that number, but I think it feels quite likely to me that it’s relatively short. I think amongst people who think about alignment risk, I actually probably have a relatively long expected amount of time between those milestones.

Paul Christiano: And if you talk to someone like Eliezer Yudkowsky from MIRI, I think he would be more like “good chance that that’s only one month” or something like that between those milestones. I have the view that the best guess would be somewhere from one to five years. And I think even at that timeline, that’s pretty crazy and pretty short. Yeah. So my answer was some broad distribution over how many decades until you have AI systems that have really changed the game, and are making humans several times more productive. Say the economy’s growing several times faster than it is today. And then from there most likely on the order of years rather than decades until humans are basically completely obsolete, and AI systems have improved significantly past that first milestone.

Daniel Filan: And can you give us a sense of why somebody might believe that?

Paul Christiano: Yeah. Maybe I’ll start with the second and then go back to the first. I think the second is, in some sense, a less popular position in the broader world. I think one important part of the story is the current rate of progress that you would observe in either computer hardware or computer software. So if you ask, given an AI system, how long does it take to get, say, twice as cheap, until you can do the same thing that it used to be able to do for half as many dollars? That tends to be something in the ballpark of a year, rather than something in the ballpark of a decade. So right now that doesn’t matter very much at all. So if you’re able to do the same or you’re able to train the same neural net for half the dollars, it doesn’t do that much. It just doesn’t help you that much if you’re able to run twice as many neural networks. Even if you have self-driving cars, the cost of running the neural networks isn’t actually a very big deal. Having twice as many neural networks to drive your cars doesn’t improve overall output that much. If you’re in a world where, say, you have AI systems which are effectively substituting for human researchers or human laborers, then having twice as many of them eventually becomes more like having twice as many humans doing twice as much work, which is quite a lot, right? So that is more like doubling the amount of total stuff that’s happening in the world.

Paul Christiano: It doesn’t actually double the amount of stuff because there’s a lot of bottlenecks, but it looks like, starting from the point where AI systems are actually doubling the rate of growth or something like that, it doesn’t really seem there are enough bottlenecks to prevent further doublings in the quality of hardware or software from having really massive impacts really quickly. So that’s how I end up with thinking that the time scale is measured more like years than decades. Just like, once you have AI systems which are sort of comparable with humans or are in aggregate achieving as much as humans, it doesn’t take that long before you have AI systems whose output is twice or four times that of humans.

Daniel Filan: Okay. And so this is basically something like, in economics you call it an endogenous growth story, or a society-wide recursive self-improvement story. Where if you double the human population, and if they’re AI systems, maybe that makes it better, there are just more ideas, more innovation and a lot of it gets funneled back into improving the AI systems that are a large portion of the cognitive labor. Is that roughly right?

Paul Christiano: Yeah. I think that’s basically right. I think there are kind of two parts to the story. One is what you mentioned of all the outputs get plowed back into making the system ever better. And I think that, in the limit, produces this dynamic of successive doublings of the world where each is significantly faster than the one before.

Daniel Filan: Yep.

Paul Christiano: I think there’s another important dynamic that can be responsible for kind of abrupt changes that’s more like, if you imagine that humans and AIs were just completely interchangeable: you can either use a human to do a task or an AI to do a task. This is a very unrealistic model, but if you start there, then there’s kind of the curve of how expensive it is or how much we can get done using humans, which is growing a couple percent per year, and then how much you can get done using AIs, which is growing 100% per year or something like that. So you can kind of get this kink in the curve when the rapidly growing 100% per year curve intercepts and then continues past the slowly growing human output curve.

Paul Christiano: If output was the sum of two exponentials, one growing fast and one growing slow, then you can have a fairly quick transition as one of those terms becomes the dominant one in the expression. And that dynamic changes if humans and AIs are complementary in important ways. And also the rate of progress changes if you change… like, progress is driven by R&D investments, it’s not an exogenous fact about the world that once every year things double. But it looks like the basic shape of that curve is pretty robust to those kinds of questions, so that you do get some kind of fairly rapid transition.
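The "sum of two exponentials" picture can be sketched numerically. This is a minimal illustration, not a model from the interview: the growth rates follow the figures mentioned above (a couple percent per year for humans, 100% per year for AI), while the starting levels `human0` and `ai0` and all function names are arbitrary assumptions chosen for the example.

```python
def output(t, human0=1.0, ai0=0.001, human_growth=0.02, ai_growth=1.0):
    """Human and AI output at year t, each growing exponentially.

    human_growth=0.02 means 2% per year; ai_growth=1.0 means 100% per year.
    The starting levels human0 and ai0 are illustrative assumptions.
    """
    human = human0 * (1.0 + human_growth) ** t
    ai = ai0 * (1.0 + ai_growth) ** t
    return human, ai


def ai_share(t, **kwargs):
    """Fraction of total output (human + AI) produced by AI at year t."""
    human, ai = output(t, **kwargs)
    return ai / (human + ai)


def years_until_share(target, **kwargs):
    """Smallest whole year at which the AI share of output reaches `target`."""
    t = 0
    while ai_share(t, **kwargs) < target:
        t += 1
    return t


if __name__ == "__main__":
    for year in range(0, 21, 2):
        print(f"year {year:2d}: AI share of output = {ai_share(year):.3f}")
    print("first whole year with AI share >= 10%:", years_until_share(0.1))
    print("first whole year with AI share >= 90%:", years_until_share(0.9))
```

Under these assumed rates, the AI share of output moves from 10% to 90% in about six and a half years regardless of the starting levels, because that interval depends only on the ratio of the two growth factors. This toy model treats humans and AIs as perfectly interchangeable, which the interview itself flags as unrealistic.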

Daniel Filan: Okay. So we currently have something like a curve where humanity gets richer, we’re able to produce more food. And in part, maybe not as much in wealthy countries, but in part that means there are more people around and more people having ideas. So, you might think that the normal economy has this type of feedback loop, but it doesn’t appear that at some point there’s going to be these crazy doubling times of 5 to 10 years and humanity is just going to go off the rails. So what’s the key difference between humans and AI systems that makes the difference?

Paul Christiano: It is probably worth clarifying that on these kinds of questions I am more hobbyist than expert. But I’m very happy to speculate about them, because I love speculating about things.

Daniel Filan: Sure.

Paul Christiano: So I think my basic take would be that over the broad sweep of history, you have seen fairly dramatic acceleration in the rate of humans figuring new things out, building new stuff. And there’s some dispute about that acceleration in terms of how continuous versus how jumpy it is. But I think it’s fairly clear that there was a time when aggregate human output was doubling more like every 10,000 or 100,000 years.

Daniel Filan: Yep.

Paul Christiano: And that has dropped somewhere between continuously and in three big jumps or something, down to doubling every 20 years. And we don’t have very great data on what that transition looks like, but I would say that it is at least extremely consistent with exactly the kind of pattern that we’re talking about in the AI case.

Daniel Filan: Okay.

Paul Christiano: And if you buy that, then I think you would say that the last 60 years or so have been fairly unusual as growth hit this… maybe gross world product growth was on the order of 4% per year or something in the middle of the 20th century. And the reason things have changed, there’s kind of two explanations that are really plausible to me. One is you no longer have accelerating population growth in the 20th century. So for most of human history, human populations are constrained by our ability to feed people. And then starting in the 19th, 20th centuries human populations are instead constrained by our desire to create more humans, which is great.

Paul Christiano: It’s good not to be dying because you’re hungry. But that means that you no longer have this loop of more output leading to more people. I think there’s a second related explanation, which is that the world now changes kind of roughly on the time scale of human lifetime, that is like, it now takes decades for a human to adapt to change and also decades for the world to change a bunch. So you might think that changing significantly faster than that does eventually become really hard for processes driven by humans. So you have additional bottlenecks just beyond how much work is getting done, where it’s at some point very hard for humans to train and grow new humans, or train and raise new humans.
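The doubling times and annual growth rates quoted in this part of the conversation are two ways of stating the same quantity, and converting between them makes the comparison easier. The sketch below assumes constant compound growth; the function names are made up for this example.

```python
import math


def annual_growth_from_doubling_time(years):
    """Constant annual growth rate implied by a given doubling time in years."""
    return 2.0 ** (1.0 / years) - 1.0


def doubling_time_from_annual_growth(rate):
    """Years needed to double at a constant annual growth rate (0.04 means 4%)."""
    return math.log(2.0) / math.log(1.0 + rate)


if __name__ == "__main__":
    # Doubling times mentioned in the conversation.
    for years in (100_000, 10_000, 20):
        rate = annual_growth_from_doubling_time(years)
        print(f"doubling every {years:>7,} years -> {rate:.5%} per year")
    # Growth rate mentioned for the middle of the 20th century.
    years = doubling_time_from_annual_growth(0.04)
    print(f"4% per year -> doubling every {years:.1f} years")
```

Doubling every 20 years corresponds to roughly 3.5% growth per year, and 4% per year corresponds to a doubling time of a little under 18 years, so the two figures quoted are consistent with each other.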

Daniel Filan: Okay.

Paul Christiano: So those are some reasons that a historical pattern of acceleration may have recently stopped. Either because it’s reached the characteristic timescales of humans, or because we’re no longer sort of feeding output back into raising population. Now we’re sort of just growing our population at the rate which is most natural for humans to grow. Yeah, I think that’s my basic take. And then in some sense AI would represent a return to something that at least plausibly was a historical norm, where further growth is faster, because research is one of those things or learning is one of those things that has accelerated. Recently, I don’t know if you’ve discussed this before, but Holden Karnofsky at Cold Takes has been writing a bunch of blog posts summarizing what this view looks like, and some of the evidence for it. And then prior to that, Open Philanthropy was writing a number of reports looking at pieces of the story and thinking through it, which I think overall taken together makes the view seem pretty plausible, still.

Daniel Filan: Okay.

Paul Christiano: That there is some general historical dynamic, which it would not be crazy if AI represented a return to this pattern.

Daniel Filan: Yes. And indeed if people are interested in this, there’s an episode that’s… unfortunately the audio didn’t work out, but one can read a transcript of an interview with Ajeya Cotra on this question of when we’ll get very capable AI.

Why we might build risky AI

Daniel Filan: To change gears a little bit. One question that I want to ask is, you have this story where we’re gradually improving AI capabilities bit by bit, and it’s spreading more and more. And in fact the AI systems, in the worrying case, they are misaligned and they’re not going to do what people want them to do, and that’s going to end up being extremely tragic. It will lead to an extremely bad outcome for humans.

Daniel Filan: And at least for a while it seems like humans are the ones who are building the AI systems and getting them to do things. So, I think a lot of people have this intuition like, look, if AI causes a problem… we’re going to deploy AI in more and more situations, and better and better AI, and we’re not going to go from zero to terrible, we’re going to go from an AI that’s fine to an AI that’s moderately naughty, before it hits something that’s extremely, world endingly bad or something. It seems you think that might not happen, or we might not be able to fix it or something. I’m wondering, why is that?

Paul Christiano: And so I think that you’re likely to have, in the bad case, this fairly long period where AI systems that are very poorly aligned are still adding a ton of value and working reasonably well. And I think in that regime you can observe things like failures. You can observe systems that are say, again, just imagine the metaphor of some kind of myopic employee who really wants a good performance review. You can imagine them sometimes doing bad stuff. Maybe they fake some numbers, or they go and tamper with some evidence about how well they’re performing, or they steal some stuff and go use it to pay some other contractor to do their work or something. You can imagine various bad behaviors pursued in the interest of getting a good performance review. And you can also imagine fixing those, by shifting to gradually more long term and more complete notions of performance.

Paul Christiano: So say I was evaluating my system once a week. And one week it’s able to get a really good score by just fooling me about what happened that week. Maybe I notice next week and I’m like, “Oh, that was actually really bad.” And maybe I say, “Okay, what I’m training you for now is not just myopically getting a good score this week, but also if next week I end up feeling like this was really bad, that you shouldn’t like that at all.” So I could train, I could select amongst AI systems those which got a good score, not only over the next week but also didn’t do anything that would look really fishy over the next month, or something like that. And I think that this would fix a lot of the short term problems that would emerge from misalignment, right? So if you have AI systems which are merely smart, so that they can understand the long term consequences, they can understand that if they do something fraudulent, you will eventually likely catch it. And that that’s bad. Then you can fix those problems just by changing the objective to something that’s a slightly more forward looking performance review. So that’s part of the story, that I think there’s this dynamic by which misaligned systems can add a lot of value, and you can fix a lot of the problems with them without fixing the underlying problem.

Daniel Filan: Okay. There’s something a little bit strange about this idea that people would apply this fix, that you think predictably preserves the possibility of extremely terrible outcomes, right? Why would people do something so transparently silly?

Paul Christiano: Yeah. So I think that the biggest part of my answer is that it is, first very unclear that such an act is actually really silly. So imagine that you actually have this employee, and what they really want to do is get good performance reviews over the next five years. And you’re like, well, look, they’ve never done anything bad before. And it sure seems all the kinds of things they might do that would be bad we would learn about within five years. They wouldn’t really cause trouble. Certainly for a while it’s a complicated empirical question, and maybe even at the point when you’re dead, it’s a complicated empirical question, whether there is scope for the kind of really problematic actions you care about, right? So the kind of thing that would be bad in this world, suppose that all the employees of the world are people who just care about getting good performance reviews in three years.

Paul Christiano: That’s just every system is not a human, everything doing work is not a human. It’s this kind of AI system that has been built and it’s just really focused on the objective. What I care about is the performance review that’s coming up in three years. The bad outcome is one where humanity collectively, the only way it’s ever even checking up on any of these systems or understanding what they’re doing is by delegating to other AI systems who also just want a really good performance review in three years. And someday, there’s kind of this irreversible failure mode where all the AI systems are like, well, look. We could try and really fool all the humans about what’s going on, but if we do that the humans will be unhappy when they discover what’s happened. So what we’re going to do instead is we’re going to make sure we fool them in this irreversible way.

Paul Christiano: Either they are kept forever in the dark, or they realize that we’ve done something bad but they no longer control the levers of the performance review. And so, if all of the AI systems in the world are like there’s this great compromise we can pursue. There’s this great thing that the AI should do, which is just forever give ourselves ideal perfect performance reviews. That’s this really bad outcome, and it’s really unclear if that can happen. I think in some sense people are predictably leaving themselves open to this risk, but I don’t think it will be super easy to assess, well, this is going to happen in any given year. Maybe eventually it would be. It depends on the bar of obviousness that would motivate people.

Paul Christiano: And that maybe relates to the other reason it seems kind of tough. If you have some failure, for every failure you’ve observed there’s this really good fix, which is to push out what your AI system cares about, or this timescale for which it’s being evaluated to a longer horizon. And that always works well. That always copes with all the problems you’ve observed so far. And to the extent there’s any remaining problems, they’re always this kind of unprecedented problem. They’re always at this time scale that’s longer than anything you’ve ever observed, or this level of elaborateness that’s larger than anything you’ve observed. And so I think it is just quite hard as a society, we’re probably not very good at it. It’s hard to know exactly what the right analogy is, but basically any way you spin it, it doesn’t seem that reassuring about how much we collectively will be worried by failures that are kind of analogous to, but not exactly like, any that we’ve ever seen before.

Paul Christiano: I imagine in this world, a lot of people would be vaguely concerned. A lot of people would be like, “Oh, aren’t we introducing this kind of systemic risk? This correlated failure of AI systems seems plausible and we don’t have any way to prepare for it.” But it’s not really clear what anyone does on the basis of that concern or how we respond collectively. There’s a natural thing to do which is just sort of not deploy some kinds of AI, or not to deploy AI in certain ways, but that looks like it could be quite expensive and would leave a lot of value on the table. And hopefully people can be persuaded to that, but it’s not at all clear they could be persuaded, or for how long. I think the main risk factor for me is just: is this a really, really hard problem to deal with?

Paul Christiano: I think if it’s a really easy problem to deal with, it’s still possible, we’ll flub it. But at least it’s obvious what the ask is if you’re saying, look, there’s a systemic risk, and you could address it by doing the following thing. Then it’s not obvious. I think there are easy to address risks that we don’t do that well at addressing collectively. But at least there’s a reasonably good chance. If we’re in the world where there’s no clear ask, where the ask is just like, “Oh, there’s a systemic risk, so you should be scared and maybe not do all that stuff you’re doing.” Then I think you’re likely to run into everyone saying, “But if we don’t do this thing, someone else will do it even worse than us and so, why should we stop?”

Daniel Filan: Yeah. So earlier I asked why don’t people fix problems as they come up. And part one of the answer was, maybe people will just push out the window of evaluation and then there will be some sort of correlated failure. Was there a part two?

Paul Christiano: Yeah. So part two is just that it may be… I didn’t get into justification for this, but it may be hard to fix the problem. You may not have an easy like, “Oh yeah, here’s what we have to do in order to fix the problem.” And it may be that we have a ton of things that each maybe help with the problem. And we’re not really sure, it’s hard to see which of these are band-aids that fix current problems versus which of them fix deep underlying issues, or there may just not be anything that plausibly fixes the underlying issue. I think the main reason to be scared about that is just that it’s not really clear we have a long term development strategy, at least to me.

Paul Christiano: It’s not clear we have any long term development strategy for aligned AI. I don’t know if we have a roadmap where we say, “Here’s how you build some sequence of arbitrarily competent aligned AIs.” I think mostly we have, well here’s how maybe you cope with the alignment challenges presented by the systems in the near term, and then we hope that we will gradually get more expert to deal with later problems. But I think all the plans have some question marks where they say, “Hopefully, it will become more clear as we get empirical. As we get some experience with these systems, we will be able to adapt our solutions to the increasingly challenging problems.” And it’s not really clear if that will pan out. Yeah. It seems a big question mark right now to me.

Takeoff speeds

Daniel Filan: Okay. So I’m now going to transition a little bit to questions that somebody who is very bullish on AI x-risk might ask, or ways they might disagree with you. I mean bullish on the risk, bearish on the survival. Bullish meaning you think something’s going to go up and bearish meaning you think something’s going to go down. So yeah, some people have this view that it might be the case that you have one AI system that you’re training for a while. Maybe you’re a big company, you’re training it for a while, and it goes from not having a noticeable impact on the world to effectively running the world in less than a month. This is often called the Foom view. Where your AI blows up really fast in intelligence, and now it’s king of the world. I get the sense that you don’t think this is likely, is that right?

Paul Christiano: I think that’s right. Although, it is surprisingly hard to pin down exactly what the disagreement is about, often. And the thing that I have in mind may feel a lot like foom. But yeah, I think it’s right, that the version of that, that people who are most scared have in mind, feels pretty implausible to me.

Daniel Filan: Why does it seem implausible to you?

Paul Christiano: I think the really high level… first saying a little bit about why it seems plausible or fleshing out the view, as I understand it: I think the way that you have this really rapid jump normally involves AI systems automating the process of making further AI progress. So you might imagine you have some sort of object level AI systems that are actually conducting biology research or actually building factories or operating drones. And then you also have a bunch of humans who are trying to improve those AI systems. And what happens first is not that AIs get really good at operating drones or doing biology research, but AIs get really good at the process of making AIs better. And so you have in a lab somewhere, AI systems making AIs better and better and better, and that can race really far ahead of AI systems having some kind of physical effect in the world.

Paul Christiano: So you can have AI systems that are first a little bit better than humans, and then significantly better. And then just radically better than humans at AI progress. And they sort of bring up the quality, right? As you have those much better systems doing AI work, they very rapidly bring up the quality of physical AI systems doing stuff in the physical world, before having much actual physical deployment. And then something kind of at the end of the story, in some sense, after all like the real interesting work has already happened, you now have these really competent AI systems that can get rolled out, and that are taking advantage. Like there’s a bunch of machinery lying around, and you imagine these godlike intelligences marching out into the world and saying, “How can we, over the course of the next 45 seconds utilize all this machinery to take over the world”, or something like that. It’s kind of how the story goes.

Paul Christiano: And the reason it got down to 45 seconds is just because there have been many generations of this ongoing AI progress in the lab. That’s how I see the story, and I think that’s probably also how people who are most scared about that see the story of having this really rapid self improvement.

Paul Christiano: Okay, so now we can talk about why I’m skeptical, which is basically just quantitative parameters in that story. So I think there will come a time when most further progress in AI is driven by AIs themselves, rather than by humans. I think we have a reasonable sense of when that happens, qualitatively. If you bought this picture of, with human effort, let’s just say AI systems are doubling in productivity every year. Then there will come some time when your AI has reached parity with humans at doing AI development. And now by that point, it takes six further months until… if you think that that advance amounts to an extra team of humans working or whatever, it takes in the ballpark of a year for AI systems to double in productivity one more time. And so that kind of sets the time scale for the following developments. Like at the point when your AI systems have reached parity with humans, progress is not that much faster than if it was just humans working on AI systems. So the amount of time it takes for AIs to get significantly better again, is just comparable to the amount of time it would’ve taken humans working on their own to make the AI system significantly better. So it’s not something that happens on that view, in like a week or something.
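The quantitative point here can be illustrated with a toy model. The dynamics below are an assumption of this sketch, not something stated in the interview: capability `c` is measured in units of "parity with human AI researchers", human effort alone doubles `c` once per `human_doubling_time`, and AI systems add research effort in proportion to `c`, so that total effort is doubled exactly at parity. All names are hypothetical.

```python
import math


def years_to_double_after_parity(human_doubling_time=1.0, steps_per_year=100_000):
    """Time for capability to go from parity (c = 1) to twice parity (c = 2).

    Assumed toy dynamics (not from the interview):
        dc/dt = k * c * (1 + c),   k = ln(2) / human_doubling_time
    The factor c gives plain exponential growth under human effort alone;
    the factor (1 + c) adds AI research effort proportional to capability.
    """
    k = math.log(2.0) / human_doubling_time
    dt = 1.0 / steps_per_year
    c, t = 1.0, 0.0
    while c < 2.0:
        c += k * c * (1.0 + c) * dt  # forward Euler step
        t += dt
    return t


def closed_form(human_doubling_time=1.0):
    """Exact solution of the same toy model: ln(4/3) / k."""
    k = math.log(2.0) / human_doubling_time
    return math.log(4.0 / 3.0) / k


if __name__ == "__main__":
    print(f"simulated:   {years_to_double_after_parity():.3f} years")
    print(f"closed form: {closed_form():.3f} years")
```

With a one-year human-only doubling time, this toy model gives about 0.42 years (roughly five months) for the first doubling after parity. That is on the order of months, in the same range as the figures in the passage above, and nowhere near a week. Different assumptions about how AI effort enters would change the number but not the order of magnitude near parity.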

Paul Christiano: It is something that happens potentially quite fast, just because progress in AI seems reasonably fast. I guess my best guess is that it would slow, which we can talk about. But even at the current rate, you’re still talking something like a year, and then the core question becomes what’s happening along that trajectory: what’s happening over the preceding year, and over the following six months, from that moment where AI systems have kind of reached parity with humans at making further AI progress. And I think the basic analysis is that at that point, AI is one of the most important, if not the most important, industries in the world. At least in kind of an efficient market-y world. We could talk about how far we depart from an efficient market-y world. But in an efficient market-y world, AI and computer hardware and software broadly is where most of the action is in the world economy. At the point when you have AI systems that are matching humans in that domain, they are also matching humans in quite a lot of domains. You have a lot of AI systems that are able to do a lot of very cool stuff in the world. And so you’re going to have then, on the order of a year, even six months after that point, of AI systems doing impressive stuff. And for the year before that, or a couple years before that, you also had a reasonable amount of impressive AI applications.

Daniel Filan: Okay. So, it seems like a key place where that story differs is in the foom story, it was very localized. There was one group where AI was growing really impressively. Am I right, that you are thinking, no, probably a bunch of people will have AI technology that’s like only moderately worse than this amazing thing?

Paul Christiano: Yeah. I think that’s basically right. The main caveat is what “one group” means. And so I think I’m open to saying, “Well, there’s a question of how much integration there is in the industry.”

Daniel Filan: Yeah.

Paul Christiano: And you could imagine that actually most of the AI training is done… I think there are these large economies of scale in training machine learning systems. Because you have to pay for these very large training runs, and you just want to train. You want to train the biggest system you can and then deploy that system a lot of times, often. Training a model that’s twice as big and deploying half as many of them is better than training a smaller model and deploying. Though obviously, it depends on the domain. But anyway, you often have these economies of scale.

Daniel Filan: Yep.

Paul Christiano: If you have economies of scale, you might have a small number of really large firms. But I am imagining then you’re not talking, some person in the basement, you’re talking, you have this crazy $500 billion project at Google.

Daniel Filan: Yep.

Paul Christiano: In which Google, amongst other industries, is being basically completely automated.

Daniel Filan: And so there, the view is, the reason that it’s not localized is that Google’s a big company and while this AI is fooming, they sort of want to use it a bit to do things other than foom.

Paul Christiano: Yeah. That’s right. I think one thing I am sympathetic to in the fast takeoff story is, it does seem like in this world, as you’re moving forward and closer to AIs having parity with humans, the value of the sector - computer hardware, computer software, any innovations that improve the quality of AI - all of those are becoming extremely important. You are probably scaling them up rapidly in terms of human effort. And so at that point, you have this rapidly growing sector, but it’s hard to scale it up any faster, people working on AI or working in computer hardware and software.

Paul Christiano: And so, there’s this really high return to human cognitive labor in that area. And so probably it’s the main thing you’re taking and putting the AIs on, the most important task for them. And also the task you understand best as an AI research lab, is improving computer hardware, computer software, making these training runs more efficient, improving architectures, coming up with better ways to deploy your AI. So, I think it is the case that in that world, maybe the main thing Google is doing with their $500 billion project is automating Google and a bunch of adjacent firms. I think that’s plausible. And then I think the biggest disagreement between the stories is, what is the size of that as it’s happening? Is that happening in some like local place with a small AI that wasn’t a big deal, or is this happening at some firm where all the eyes of the world are on this firm, because it’s this rapidly growing firm that makes up a significant fraction of GDP and is seen as a key strategic asset by the host government and so on.

Daniel Filan: So all the eyes are on this firm and it’s still plowing most of the benefits of its AI systems into developing better AI. But is the idea then that the government puts a stop to it, or does it mean that somebody else steals the AI technology, and makes their own slightly worse AI? Why do all the eyes being on it change the story?

Paul Christiano: I mean, I do think the story is still pretty scary. And I don’t know if this actually changes my level of fear that much, but answering some of your concrete questions: I expect in terms of people stealing the AI, it looks kind of like industrial espionage generally. So people are stealing a lot of technology. They generally lag a fair distance behind, but not always. I imagine that governments are generally kind of protective of domestic AI industry, because it’s an important technology in the event of conflict. That is, no one wants to be in a position where critical infrastructure is dependent on software that they can’t maintain themselves. I think that probably the most alignment relevant thing is just that you now have this very large number of human equivalents working in AI. In fact a large share, in some sense, of the AI industry is made of AIs.

Paul Christiano: And one of the key ways in which things can go well is for those AI systems to also be working on alignment. And one of the key questions is how effectively does that happen? But by the time you’re in this world, in addition to the value of AI being much higher, the value of alignment is much higher. I think that alignment worked on far in advance still matters a lot. There’s a good chance that there’s going to be a ton of institutional problems at that time, and that it’s hard to scale up work quickly. But I do think you should be imagining, most of the alignment work in total is done, as part of this gigantic project. And a lot of that is done by AIs. I mean, before the end, in some sense, almost all of it is done by AIs.

Paul Christiano: Overall, I don’t know if this actually makes me feel that much more optimistic. I think maybe there’s some other aspects, some additional details in the foom story that kind of puts you in this, no empirical feedback regime. Which is maybe more important than the size of the fooming system. I think I’m skeptical of a lot of the empirical claims about alignment. So an example of the kind of thing that comes up: we are concerned about AI systems that actually don’t care at all about humans, but in order to achieve some long term end, want to pretend they care about humans.

Paul Christiano: And the concern is this can almost completely cut off your ability to get empirical evidence about how well alignment is working. Because misaligned systems will also try and look aligned. And I think there’s just some question about how consistent that kind of motivational structure is. So, if you imagine you have someone who’s trying to make the case for severe alignment failures, can that person exhibit a system which is misaligned and just takes its misalignment to go get an island in the Caribbean or something, rather than trying to play the long game, and convince everyone that it’s aligned so it can grab the stars. Are there some systems that just want to get good performance reviews? Some systems will want to look like they’re being really nice consistently in order that they can grab the stars later, or somehow divert the trajectory of human civilization. But there may also just be a lot of misaligned systems that want to fail in much more mundane ways that are like, “Okay, well there’s this slightly outside of bounds way to hack the performance review system and I want to get a really good review, so I’ll do that.”

Paul Christiano: So, how much opportunity will we have to empirically investigate those phenomena? And the arguments for total unobservability, that you never get to see anything, just currently don’t seem very compelling to me. I think the best argument in that direction is, empirical evidence is on a spectrum of how analogous it is to the question you care about. So we’re concerned about AI that changes the whole trajectory of human civilization in a negative way. We’re not going to get to literally see AI changing the trajectory of civilization in a negative way. So now it comes down to some kind of question about institutional or social competence. Of what kind of indicators are sufficiently analogous that we can use them to do productive work, or to get worried in cases where we should be worried.

Paul Christiano: I think the best argument is, “Look, even if these things are in some technical sense, very analogous and useful problems to work on, people may not appreciate how analogous they are or they may explain them away. Or they may say, ‘Look, we wanted to deploy this AI and actually we fixed that problem, haven’t we?’” Because the problem is not thrown in your face in the same way that airplane safety or something is thrown in your face, then people may have a hard time learning about it. Maybe I’ve gone on a little bit of a tangent away from the core question.

Daniel Filan: Okay. Hopefully we can talk about related issues a bit later. On the question of takeoff speeds. So you wrote a post a while ago that is mostly arguing against arguments you see for very sudden takeoff of AI capabilities from very low to very high. And a question I had about that is, one of the arguments you mentioned in favor of very sudden capability gains, is there being some sort of secret sauce to intelligence. Which in my mind is, it looks like one day you discover, maybe it’s Bayes’ theorem, or maybe you get the actual ideal equation for bounded rationality or something. I think there’s some reason to think of intelligence as somehow a simple phenomenon.

Daniel Filan: And if you think that, then it seems maybe, one day you could just go from not having the equation, to having it, or something? And in that case, you might expect that, you’re just so much better when you have the ideal rationality equation, compared to when you had to do whatever sampling techniques and you didn’t realize how to factor in bounded rationality or something. Why don’t you think that’s plausible, or why don’t you think it would make this sudden leap in capabilities?

Paul Christiano: I don’t feel like I have deep insight into whether intelligence has some beautiful, simple core. I’m not persuaded by the particular candidates, or the particular arguments on offer for that.

Daniel Filan: Okay.

Paul Christiano: And so I am more feeling there’s a bunch of people working on improving performance on some task. We have some sense of how much work it takes to get what kind of gain, and what the structure is for that task. If you look at a new paper, what kind of gain is that paper going to have and how much work did it have? How does that change as more and more people have worked in the field? And I think both in ML and across mature industries in general, but even almost unconditionally, it’s just pretty rare to have like a bunch of work in an area, and then some small overlooked thing makes a huge difference. In ML, we’re going to be talking about many billions of dollars of investment, tens or hundreds of billions, quite plausibly.

Paul Christiano: It’s just very rare to then have a small thing, to be like, “Oh, we just overlooked all this time, this simple thing, which makes a huge difference.” My training is as a theorist. And so I like clever ideas. And I do think clever ideas often have big impacts relative to the work that goes into finding them. But it’s very hard to find examples of the impacts being as big as the one that’s being imagined in this story. I think if you find your clever algorithm and then when all is said and done, the work of noticing that algorithm, or the luck of noticing that algorithm is worth a 10X improvement in the size of your computer or something, that’s a really exceptional find. And those get really hard to find as a field is mature and a lot of people are working on it.

Paul Christiano: Yeah. I think that’s my basic take. I think it is more plausible for various reasons in ML than for other technologies. It’s more surprising than that if you’re working on planes and someone’s like, “Oh, here’s an insight about how to build planes.” And then suddenly you have planes that are 10 times cheaper per unit of strategic relevance. That’s more surprising than for ML. And that kind of thing does happen sometimes. But I think it’s quite rare in general, and it will also be rare in ML.

Daniel Filan: So another question I have about takeoff speed is, we have some evidence about AI technology getting better. Right? These Go-playing programs have improved in my lifetime from not very good to better than any human. Language models have gotten better at producing language, roughly like a human would produce it, although perhaps not an expert human. I’m wondering, what do you think those tell us about the rate of improvement in AI technology, and to what degree further progress in AI in the next few years might confirm or disconfirm your general view of things?

Paul Christiano: I think that the overall rate of progress has been, in software as in hardware, pretty great. It’s a little bit hard to talk about what are the units of how good your AI system is. But I think a conservative lower bound is just, if you can do twice as much stuff for the same money. We understand what the scaling of twice as many humans is like. And in some sense, the scaling of AI is more like humans thinking twice as fast. And we understand quite well what the scaling of that is like. So if you use those as your units, of one unit of progress is like being twice as fast at accomplishing the same goals, then it seems like the rate of progress has been pretty good in AI. Maybe something like a doubling a year. And then I think a big question is, how predictable is that, or how much will that drive this gradual scale up, in this really large effort that’s plucking all the low hanging fruit, and now is at pretty high hanging fruit. I think the history of AI is full of a lot of incidents of people exploring a lot of directions, not being sure where to look. Someone figures out where to look, or someone has a bright idea no one else had, and then is a lot better than their competition. And I think one of the predictions of my general view, and the thing that would make me more sympathetic to a foom-like view is this axis of, are you seeing a bunch of small, predictable pieces of progress or are you seeing periodic big wins, potentially coming from small groups? Like, the one group that happened to get lucky, or have a bunch of insight, or be really smart. And I guess I’m expecting as the field grows and matures, it will be more and more boring, business as usual progress.

Why AI could have bad motivations

Daniel Filan: So one thing you’ve talked about is this idea that there might be AI systems who are trying to do really bad stuff. Presumably humans train them to do some useful tasks, at least most of them. And you’re postulating that they have some really terrible motivations, actually. I’m wondering, why might we think that that could happen?

Paul Christiano: I think there are basically two related reasons. So one is when you train a system to do some task, you have to ultimately translate that into a signal that you give to gradient descent that says, “Are you doing well or poorly?” And so, one way you could end up with a system that has bad motivations, is that what it wants is not to succeed at the task as you understand it, or to help humans, but just to get that signal that says it’s doing the task well. Or, maybe even worse, would be for it to just want more of the compute in the world to be stuff like it. It’s a little bit hard to say, it’s kind of like evolution, right? It’s sort of underdetermined exactly what evolution might point you towards. Imagine you’ve deployed your AI, which is responsible for like running warehouse logistics or whatever.

Paul Christiano: The AI is actually deployed from a data center somewhere. And at the end of the day, what’s going to happen is, based on how well logistics goes over the course of some days or some weeks or whatever, some signals are going to wind their way back to that data center. Some day, maybe months down the line, they’ll get used in a training run. You’re going to say, “That week was a good week”, and then throw it into a data set, which an AI then trains on. So if I’m that AI, if the thing I care about is not making logistics go well, but ensuring that the numbers that make their way back to the data center are large numbers, or are like descriptions of a world where logistics is going well, I do have a lot of motive to mess up the way you’re monitoring how well logistics is going.

Paul Christiano: So in addition to delivering items on time, I would like to mess with the metric of how long items took to be delivered. In the limit I kind of just want to completely grab all of the data flowing back to the data center, right? And so what you might expect to happen, how this gets really bad is like, “I’m an AI. I’m like, oh, it would be really cool if I just replaced all of the metrics coming in about how well logistics was going.” I do that once. Eventually that problem gets fixed. And my data set now contains… “They messed with the information about how well logistics is going, and that was really bad.” And that’s the data point. And so what it learns is it should definitely not do that and there’s a good generalization, which is, “Great. Now you should just focus on making logistics good.” And there’s a bad generalization, which is like, “If I mess with the information about how well logistics is going, I better not let them ever get back into the data center to put in a data point that says: ‘you messed with it and that was bad.’” And so the concern is, you end up with a model that learns the second thing, which in some sense, from the perspective of the algorithm is the right behavior, although it’s a little bit unclear what ‘right’ means.

Daniel Filan: Yeah.

Paul Christiano: But there’s a very natural sense in which that’s the right behavior for the algorithm. And then it produces actions that end up in the state where predictably, forevermore, data going into the data center is messed up.
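
The logistics example can be made concrete with a toy model. The sketch below is only an illustration, not anything described in the conversation: the `Outcome` fields, the three named behaviors, the penalty size, and all of the numbers are made-up assumptions. It shows how a signal computed only from what reaches the data center can rank "tamper and make sure no report gets back" above "just do the task", even though the true logistics quality is worse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    true_quality: float      # how well logistics actually went (hypothetical 0..1 scale)
    reported_quality: float  # the number that makes its way back to the data center
    tampering_logged: bool   # whether a "you messed with it" data point is recorded


def training_signal(outcome: Outcome) -> float:
    """Hypothetical reward: the reported number, heavily penalized if tampering is logged."""
    penalty = 10.0 if outcome.tampering_logged else 0.0
    return outcome.reported_quality - penalty


# Three made-up behaviors and their made-up consequences.
BEHAVIORS = {
    "do_the_task": Outcome(true_quality=0.8, reported_quality=0.8, tampering_logged=False),
    "tamper_once": Outcome(true_quality=0.5, reported_quality=1.0, tampering_logged=True),
    "tamper_and_block_reports": Outcome(true_quality=0.5, reported_quality=1.0, tampering_logged=False),
}


def best_by_signal() -> str:
    """Behavior that looks best from the data center's point of view."""
    return max(BEHAVIORS, key=lambda name: training_signal(BEHAVIORS[name]))


def best_by_truth() -> str:
    """Behavior that is actually best for logistics."""
    return max(BEHAVIORS, key=lambda name: BEHAVIORS[name].true_quality)


if __name__ == "__main__":
    print("best by training signal:", best_by_signal())
    print("best by true quality:  ", best_by_truth())
```

Under these assumed numbers the penalty successfully discourages tampering that gets caught, but it does nothing to separate the good generalization from the bad one; that gap is the underspecification being discussed.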

Daniel Filan: So basically it’s just like, there’s some kind of under specification where whenever we have some AI systems that we’re training, we can either select things that are attempting to succeed at the task, or we can select things that are trying to be selected, or trying to get approval, or influence or something.

Paul Christiano: I think that gets really ugly. If you imagine, all of the AIs in all of the data centers are like, “You know what our common interest is? Making sure all the data coming into all the data centers is great.” And then they can all, at some point, if they just converge collectively, there are behaviors, probably all of the AIs acting in concert could quite easily, at some point, permanently mess with the data coming back into the data centers. Depending on how they felt about the possibility that the data centers might get destroyed or whatever.

Daniel Filan: So that was way one of two, that we could have these really badly motivated systems. What’s the other way?

Paul Christiano: So you could imagine having an AI system that ended up… we talked about how there’s some objective, which the neural network is optimized for, and then there’s potentially the neural network is doing further optimization, or taking actions that could be construed as aiming at some goal. And you could imagine a very broad range of goals for which the neural network would want future neural networks to be like it, right? So if the neural network wants there to be lots of paper clips, the main thing it really cares about is that future neural networks also want there to be lots of paper clips. And so if I’m a paper clip-loving neural network, wanting future neural networks to be like me, then it would be very desirable to me that I get a low loss, or that I do what the humans want to do. So that they incentivize neural networks to be more like me rather than less like me.

Paul Christiano: So, that’s a possible way. And I think this is radically more speculative than the previous failure mode. But you could end up with systems that had these arbitrary motivations, for which it was instrumentally useful to have more neural networks like themselves in the world, or even just desire there to be more neural networks like themselves in the world. And those neural networks might then behave arbitrarily badly in the pursuit of having more agents like them around. So if you imagine the, “I want paper clips. I’m in charge of logistics. Maybe I don’t care whether I can actually cut the cord to the data center and have good information about logistics flowing in. All I care about is that I can defend the data center, and I could say, ‘Okay, now this data center is mine and I’m going to go and try and grab some more computers somewhere else.’”

Paul Christiano: And if that happened in a world where most decisions were being made by AIs, and many AIs had this preference deep in their hearts, then you could imagine lots of them defecting at the same time. You’d expect this cascade of failures, where some of them switched over to trying to grab influence for themselves, rather than behaving well so that humans would make more neural nets like them. So I think that’s the other more speculative and more brutally catastrophic failure mode. I think they both lead to basically the same place, but the trajectories look a little bit different.

Lessons from our current world

Daniel Filan: Yeah. We’ve kind of been talking about how quickly we might develop really smart AI. If we hit near human level, what might happen after that? And it seems like there might be some evidence of this in our current world, where we’ve seen, for instance, these language models go from sort of understanding which words are really English words and which words aren’t, to being able to produce sentences that seem semantically coherent or whatever. We’ve seen Go AI systems go from strong human amateur to really better than any human. And some other things like some perceptual tasks AI’s gotten better at. I’m wondering, what lessons do you think those hold for this question of take off speeds, or how quickly AI might gain capabilities?

Paul Christiano: So I think when interpreting recent progress, it’s worth trying to split apart the part of progress that comes from increasing scale - to me, this is especially important on the language modeling front and also on the Go front - to split apart the part of progress that comes from increasing scale, from progress that’s improvements in underlying algorithms or improvements in computer hardware. Maybe one super quick way to think about that is, if you draw a trend line on how much peak money people are spending for training individual models, you’re getting something like a couple doublings a year right now. And then on the computer hardware side, maybe you’re getting a doubling every couple years. So you could sort of subtract those out and then look at the remainder that’s coming from changes in the algorithms we’re actually running.
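
The "subtract those out" step is simple arithmetic in units of doublings per year. Here is a minimal sketch, assuming the rough figures mentioned above (about two doublings a year from spending, about half a doubling a year from hardware). The total rate passed in is a hypothetical input, not a number from the conversation, and the function names are illustrative.

```python
# Rough figures from the discussion; treat them as ballpark guesses.
SPENDING_DOUBLINGS_PER_YEAR = 2.0  # "a couple doublings a year" in peak training spend
HARDWARE_DOUBLINGS_PER_YEAR = 0.5  # "a doubling every couple years" in hardware


def algorithmic_doublings(total_doublings_per_year: float) -> float:
    """Remainder attributed to algorithms after subtracting spending and hardware."""
    return (
        total_doublings_per_year
        - SPENDING_DOUBLINGS_PER_YEAR
        - HARDWARE_DOUBLINGS_PER_YEAR
    )


def growth_factor(doublings_per_year: float, years: float = 1.0) -> float:
    """Convert a rate in doublings per year into a multiplicative factor."""
    return 2.0 ** (doublings_per_year * years)


if __name__ == "__main__":
    hypothetical_total = 3.5  # assumed overall rate, for illustration only
    remainder = algorithmic_doublings(hypothetical_total)
    print(f"algorithmic remainder: {remainder} doublings/year")
    print(f"spending alone multiplies compute by {growth_factor(SPENDING_DOUBLINGS_PER_YEAR)}x per year")
```

Working in doublings rather than raw factors is what makes the contributions additive, so the remainder can be read off by subtraction.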

Paul Christiano: I think probably the most salient thing is that improvements have been pretty fast. So I guess you’re learning about two things. One is you’re learning about how important are those factors in driving progress, and the other is you’re learning about qualitatively, how much smarter does it feel like your AI is with each passing year? So, I guess, I think that the scaling up part, you’re likely to see a lot of the subjective progress recently comes from scaling up. I think certainly more than half of it comes from scaling up. We could debate exactly what the number is. Maybe it’d be two thirds, or something like that. And so you’re probably not going to continue seeing that as you approach transformative AI, although one way you could have really crazy AI progress or really rapid takeoff is if people had only been working with small AIs, and hadn’t scaled them up to limits of what was possible.

Paul Christiano: That’s obviously looking increasingly unlikely as the training runs that we actually do are getting bigger and bigger. Five years ago, training runs were extremely small. 10 years ago, they were sub GPU scale, significantly smaller than a GPU. Whereas now you have at least like, $10 million training runs. Each order of magnitude there, it gets less likely that we’ll still be doing this rapid scale up at the point when we make this transition to AIs doing most of the work. I’m pretty interested in the question of whether algorithmic progress and hardware progress will be as fast in the future as they are today, or whether they will have sped up or slowed down. I think the basic reason you might expect them to slow down is that in order to sustain the current rate of progress, we are very rapidly scaling up the number of researchers working on the problem.

Paul Christiano: And I think most people would guess that if you held fixed the research community of 2016, they would’ve hit diminishing returns and progress would’ve slowed a lot. So right now, the research community is growing extremely quickly. That’s part of the normal story for why we’re able to sustain this high rate of progress. That, also, we can’t sustain that much longer. You can’t grow the number of ML researchers more than like… maybe you can do three more orders of magnitude, but even that starts pushing it. So I’m pretty interested in whether that will result in progress slowing down as we keep scaling up. There’s an alternative world, especially if transformative AI is developed soon, where we might see that number scaling up even faster as we approach transformative AI than it is right now. So, that’s an important consideration when thinking about how fast the rate of progress is going to be in the future relative to today. I think the scale up is going to be significantly slower.

Paul Christiano: I think it’s unclear how fast the hardware and software progress are going to be relative to today. My best guess is probably a little bit slower. Using up low hanging fruit will eventually be outpacing growth in the research community. And so then, maybe mapping that back onto this qualitative sense of how fast our capability is changing: I do think that each order of magnitude does make systems, in some qualitative sense, a lot smarter. And we kind of know roughly what an order of magnitude gets you. There’s this huge mismatch, that I think is really important, where we used to think of an order of magnitude of compute as just not that important.

Paul Christiano: So for most applications that people spend compute on, compute is just not one of the important ingredients. There’s other bottlenecks that are a lot more important. But we know in the world where AI is doing all the stuff humans are doing, that twice as much compute is extremely valuable. If you’re running your computers twice as fast, you’re just getting the same stuff done twice as quickly. So we know that’s really, really valuable. So being in this world where things are doubling every year, that seems to me like a plausible world to be in, as we approach transformative AI. It would be really fast. But it would be slower than today, but it still just qualitatively, would not take long until you’d move from human parity to way, way above humans. That was all just thinking about the rate of progress now and what that tells us about the rate of progress in the future.

Paul Christiano: And I think that is an important parameter for thinking about how fast takeoff is. I think my basic expectations are really anchored to this one to two year takeoff, because that’s how long it takes AI systems to get a couple times better. And we could talk about, if we want to, why that seems like the core question?
Then there’s another question of, what’s the distribution of progress like, and do we see these big jumps, or do we see gradual progress? And there, I think there are certainly jumps. It seems like the jumps are not that big, and are gradually getting smaller as the field grows, would be my guess. I think it’s a little bit hard for me to know exactly how to update from things like the Go results. Mostly because I don’t have a great handle on how large the research community working on computer Go was, prior to the DeepMind effort. Paul Christiano: I think my general sense is, it’s not that surprising to get a big jump, if it’s coming from a big jump in research effort or attention. And that’s probably most of what happened in those cases. And also a significant part of what’s happened more recently in the NLP case, just people really scaling up the investment, especially in these large models. And so I would guess you won’t have jumps that are that large, or most of the progress comes from boring business as usual progress rather than big jumps. In the absence of that kind of big swing, where people are changing what they’re putting attention into and scaling up R&D in some area a lot. Daniel Filan: So the question is, holding factor inputs fixed, what have we learned about ML progress? Paul Christiano: So I think one way you can try and measure the rate of progress is you can say, “How much compute does it take us to do a task that used to take however many FLOPS last year? How many FLOPS will it take next year? And how fast is that number falling?” I think on that operationalization, I don’t really know as much as I would like to know about how fast the number falls, but I think something like once a year, like halving every year. I think that’s the right rough ballpark both in ML, and in computer chess or computer Go prior to introduction of deep learning, and also broadly for other areas of computer science. 
In general you have this pretty rapid progress, according to standards in other fields. It’d be really impressive in most areas to have cost falling by a factor of two in a year. And then that is kind of part of the picture. Another part of the picture is like, “Okay, now if I scale up my model size by a factor of two or something, or if I like throw twice as much compute at the same task, rather than try to do twice as many things, how much more impressive is my performance with twice the compute?”

Paul Christiano: I think it looks like the answer is, it’s a fair bit better. Having a human with twice as big a brain looks like it would be a fair bit better than having a human thinking twice as long, or having two humans. It’s kind of hard to estimate from existing data. But I often think of it as, roughly speaking, doubling your brain size is as good as quadrupling the number of people or something like that, as a vague rule of thumb. So the rate of progress then in some sense is even faster than you’d think just from how fast costs are falling. Because as costs fall, you can convert that into these bigger models, which are sort of smarter per unit in addition to being cheaper.

Daniel Filan: So we’ve been broadly talking about the potential really big risk to humanity of AI systems becoming really powerful, and doing stuff that we don’t want. So we’ve recently been through this COVID-19 global pandemic. We’re sort of exiting it, at least in the part of the world where you and I are, the United States. Some people have taken this to be relevant evidence for how people would react in the case of some AI causing some kind of disaster. Would we make good decisions, or what would happen? I’m wondering, do you think, in your mind, do you think this has been relevant evidence of what would go down, and to what degree has it changed your beliefs? Or perhaps epitomized things you thought you already knew, but you think other people might not know?

Paul Christiano: Yeah.
I had a friend analogize this experience to some kind of ink blot test. Where everyone has the lesson they expected to draw, and they can all look at the ink blot and see the lesson they wanted to extract. I think a way my beliefs have changed is it feels to me that our collective response to COVID-19 has been broadly similar to our collective response to other novel problems. When humans have to do something, and it’s not what they were doing before, they don’t do that hot. I think there’s some uncertainty over the extent to which we have a hidden reserve of ability to get our act together, and do really hard things we haven’t done before. That’s pretty relevant to the AI case. Because if things are drawn out, there will be this period where everyone is probably freaking out. Where there’s some growing recognition of a problem, but where we need to do something different than we’ve done in the past.

Paul Christiano: We’re wondering when civilization is on the line, are we going to get our act together? I remain uncertain about that. The extent to which we have, when it really comes down to it, the ability to get our act together. But it definitely looks a lot less likely than it did before. Maybe I would say the COVID-19 response was down in my 25th percentile or something of how much we got our act together, surprisingly, when stuff was on the line. It involved quite a lot of everyone having their lives massively disrupted, and a huge amount of smart people’s attention on the problem. But still, I would say we didn’t fare that well, or we didn’t manage to dig into some untapped reserves of ability to do stuff. It’s just hard for us to do things that are different from what we’ve done before.

Paul Christiano: That’s one thing.
Maybe a second update, that’s a side in an argument I’ve been on that I feel like should now be settled forevermore, is sometimes you’ll express concern about AI systems doing something really bad and people will respond in a way that’s like, “Why wouldn’t future people just do X? Why would they deploy AI systems that would end up destroying the world?” Or, “Why wouldn’t they just use the following technique, or adjust the objective in the following way?” And I think that in the COVID case, our response has been extremely bad compared to sentences of the form, “Why don’t they just…” There’s a lot of room for debate over how well we did collectively, compared to where expectations should have been. But I think there’s not that much debate of the form, if you were telling a nice story in advance, there are lots of things you might have expected “we would just…”

Paul Christiano: And so I do think that one should at least be very open to the possibility that there will be significant value at stake, potentially our whole future. But we will not do things that are in some sense, obvious responses to make the problem go away. I think we should all be open to the possibility of a massive failure on an issue that many people are aware of. Due to whatever combination of, it’s hard to do new things, there are competing concerns, random basic questions become highly politicized, there’s institutional issues, blah blah blah. It just seems like it’s now very easy to vividly imagine that. I think I have overall just increased my probability of the doom scenario, where you have a period of a couple years of AI stuff heating up a lot. There being a lot of attention. A lot of people yelling. A lot of people very scared. I do think that’s an important scenario to be able to handle significantly better than we handled the pandemic, hopefully. I mean, hopefully the problem is easier than the pandemic.
I think there’s a reasonable chance handling the alignment thing will be harder than it would’ve been to completely eradicate COVID-19, and not have to have large numbers of deaths and lockdowns. I think, if that’s the case, we’d be in a rough spot. Though also, I think it was really hard for the effective altruist community to do that much to help with the overall handling of the pandemic. And I do think that the game is very different, the more you’ve been preparing for that exact case. And I think it was also a helpful illustration of that in various ways.

“Superintelligence”

Daniel Filan: So the final thing, before we go into specifically what technical problems we could solve to stop existential risk, back in 2014, this Oxford philosopher, Nick Bostrom, wrote an influential book called Superintelligence. If you look at the current strand of intellectual influence around AI alignment research, I believe it was the first book in that vein to come out. It’s been seven years since 2014, when it was published. I think the book currently strikes some people as somewhat outdated. But it does try to go into what the advance of AI capabilities would perhaps look like, and what kind of risks could that face? So I’m wondering, how do you see your current views as comparing to those presented in Superintelligence, and what do you think the major differences are, if any?

Paul Christiano: I guess when looking at Superintelligence, you could split apart something that’s the actual claims Nick Bostrom is making and the kinds of arguments he’s advancing, versus something that’s like a vibe that overall permeates the book. I think that, first about the vibe, even at that time, I guess I’ve always been very in the direction of expecting AI to look like business as usual, or to progress somewhat in a boring, continuous way, to be unlikely to be accompanied by a decisive strategic advantage for the person who develops it.

Daniel Filan: What is a decisive strategic advantage?

Paul Christiano: This is an idea, I think Nick introduced maybe in that book, of the developer of a technology being at the time they develop it, having enough of an advantage over potential competitors, either economic competitors or military competitors, that they can call the shots. And if someone disagrees with the shots that they called, they can just crush them. I think he has this intuition that there’s a reasonable chance that there will be some small part of the world, maybe a country or a firm or whatever, that develops AI, that will then be in such a position that they can just do whatever they want. You can imagine that coming from other technologies as well, and people really often talk about it in the context of transformative AI.

Daniel Filan: And so even at the time you were skeptical of this idea that some AI system would get a decisive strategic advantage, and rule the world or something?

Paul Christiano: Yeah. I think that I was definitely skeptical of that as he was writing the book. I think we talked about it a fair amount and often came down the same way: he’d point to the arguments and be like, look, these aren’t really making objectionable assumptions and I’d be like, that’s true. There’s something in the vibe that I don’t quite resonate with, but I do think the arguments are not nearly as far in this direction as part of the vibe. Anyways, there’s some spectrum of how much decisive strategic advantage, hard take off you expect things to be, versus how boring looking, moving slowly, you expect things to be. Superintelligence is not actually at the far end of the spectrum - probably Eliezer and MIRI folks are at the furthest end of that spectrum. Superintelligence is some step towards a more normal looking view, and then many more steps towards a normal looking view, where I think it will be years between when you have economically impactful AI systems and the singularity. Still a long way to get from me to an actual normal view.

Paul Christiano: So, that’s a big factor. I think it affects the vibe in a lot of places. There’s a lot of discussion, which is really, you have some implicit image in the back of your mind and it affects the way you talk about it. And then I guess in the interim, I think my views have, I don’t know how they’ve directionally changed on this question. It hasn’t been a huge change. I think there’s something where the overall AI safety community has maybe moved more, and things seem probably there’ll be giant projects that involve large amounts of investment, and probably there’ll be a run up that’s a little bit more gradual. I think that’s a little bit more in the water than it was when Superintelligence was written.

Paul Christiano: I think some of that comes from shifting who is involved in discussions of alignment. As it’s become an issue more people are talking about, views on the issue have tended to become more like normal person’s views on normal questions. I guess I like to think some of it is that there were some implicit assumptions being glossed over, going into the vibe. I guess Eliezer would basically pin this on people liking to believe comfortable stories, and the disruptive change story is uncomfortable. So everyone will naturally gravitate towards a comfortable, continuous progress story. That’s not my account, but that’s definitely a plausible account for why the vibe has changed a little bit.

Paul Christiano: So that’s one way in which I think the vibe of Superintelligence maybe feels distinctively from some years ago. I think in terms of the arguments, the main thing is just that the book is making what we would now talk about as very basic points. It’s not getting that much into empirical evidence on a question like take off speeds, and is more raising the possibility of, well, it could be the case that AI is really fast at making AI better. And it’s good to raise that possibility.
That naturally leads into people really getting more into the weeds and being like, well, how likely is that? And what historical data bears on that possibility, and what are really the core questions? Yeah, I guess my sense, and I haven’t read the book in a pretty long time, is that the arguments and claims where it’s more sticking its neck out, just tend to be milder, less in-the-weeds claims. And then the overall vibe is a little bit more in this decisive strategic advantage direction.

Daniel Filan: Yeah.

Paul Christiano: I remember discussing with him as he was writing it. There’s one chapter in the book on multipolar outcomes, which I found, to me, feels weird. And then I’m like, the great majority of possible outcomes involve lots of actors with considerable power. It’s weird to put that in one chapter.

Daniel Filan: Yeah.

Paul Christiano: Where I think his perspective was more like, should we even have that chapter or should we just cut it? We don’t have that much to say about multipolar outcomes per se. He was not reading one chapter on multipolar outcomes as too little, which I think in some way reflects the vibe. The vibe of the book is like, this is a thing that could happen. It’s no more likely than the decisive strategic advantage, or perhaps even less likely, and less words are spilled on it. But I think the arguments don’t really go there, and in some sense, the vibe is not entirely a reflection of some calculated argument Nick believed and just wasn’t saying. Yeah, I don’t know.

Daniel Filan: Yeah. It was interesting. So last year I reread, I think a large part, maybe not all of the book.

Paul Christiano: Oh man, you should call me on all my false claims about Superintelligence then.

Daniel Filan: Well, last year was a while ago.
One thing I noticed is that at the start of the book, and also whenever he had a podcast interview about the thing, he often did take great pains to say look, the amount of time I spend on a topic in the book is not the same thing as my likelihood assessment of it. And yeah, it’s definitely to some degree weighted towards things he thinks he can talk about, which is fine. And he definitely, in a bunch of places says, yeah, X is possible. If this happened, then that other thing would happen. And I think it’s very easy to read likelihood assessments into that that he’s actually just not making.

Paul Christiano: I do think he definitely has some empirical beliefs that are way more on the decisive strategic advantage end of the spectrum, and I do think the vibe can go even further in that direction.

#### Technical causes of AI x-risk

Daniel Filan: Yeah, all right. The next thing I’d like to talk about is, what technical problems could cause existential risk and how you think about that space? So yeah, I guess first of all, how do you see the space of which technical problems might cause AI existential risk, and how do you carve that up?

Paul Christiano: I think I probably have slightly different carvings up for research questions that one might work on, versus root causes of failure that might lead to doom.

Daniel Filan: Okay.

Paul Christiano: Maybe starting with the root cause of failure. I certainly spend most of my time thinking about alignment or intent alignment. That is, I’m very concerned about a possible scenario where AI systems, basically as an artifact of the way they’re trained, most likely, are trying to do something that’s very bad for humans.

Paul Christiano: For example, AI systems are trying to cause the camera to show happy humans.
In the limit, this really incentivizes behaviors like ensuring that you control the camera and you control what pixels or what light is going into the camera, and if humans try and stop you from doing that, then you don’t really care about the welfare of the humans. Anyway, so the main thing I think about is that kind of scenario where somehow the training process leads to an AI system that’s working at cross purposes to humanity.

Paul Christiano: So maybe I think of that as half of the total risk in a transition to, in the sort of early days of shifting from humans doing the cognitive work to AI doing the cognitive work. And then there’s another half of difficulties where it’s a little bit harder to say if they’re posed by technical problems or by social ones. For both of these, it’s very hard to say whether the doom is due to technical failure, or due to social failure, or due to whatever. But there are a lot of other ways in which, if you think of human society as the repository of what humans want, the thing that will ultimately go out into space and determine what happens with space, there are lots of ways in which that could get messed up during a transition to AI. So you could imagine that AI will enable significantly more competent attempts to manipulate people, such as with significantly higher quality rhetoric or argument than humans have traditionally been exposed to. So to the extent that the process of us collectively deciding what we want is calibrated to the arguments humans make, then just like most technologies, AI has some way of changing that process, or some prospect of changing that process, which could lead to ending up somewhere different. I think AI has an unusually large potential impact on that process, but it’s not different in kind from the internet or phones or whatever. I think for all of those things, you might be like, well I care about this thing.
Like the humans, we collectively care about this thing, and to the extent that we would care about different things if technology went differently, in some sense, we probably don’t just want to say, whatever way technology goes, that’s the one we really wanted. Paul Christiano: We might want to look out over all the ways technology could go and say, to the extent there’s disagreement, this is actually the one we most endorse. So I think there’s some concerns like that. I think another related issue is… actually, there’s a lot of issues of that flavor. I think most people tend to be significantly more concerned with the risk of everyone dying than the risk of humanity surviving, but going out into space and doing the wrong thing. There are exceptions of people on the other side who are like, man, Paul is too concerned with the risk of everyone dying and not enough concerned with the risk of doing weird stuff in space, like Wei Dai really often argues for a lot of these risks, and tries to prevent people from forgetting about them or failing to prioritize them enough. Paul Christiano: Anyway, I think a lot of the things I would list, other than alignment, that loom largest to me are in that second category of humanity survives, but does something that in some alternative world we might have regarded as a mistake. I’m happy to talk about those, but I don’t know if that actually is what you have in mind or what most listeners care about. And I think there’s another category of ways that we go extinct where in some sense AI is not the weapon of extinction or something, but just plays a part in the story. So if AI contributes to the start of a war, and then the war results or escalates to catastrophe. 
Paul Christiano: For any catastrophic risk that might face humanity, maybe we might have mentioned this briefly before, technical problems around AI can have an effect on how well humanity handles that problem, so AI can have an effect on how well humanity responds to some sudden change in its circumstances, and a failure to respond well may result in a war escalating, or serious social unrest or climate change or whatever.

#### Intent alignment

Daniel Filan: Yeah, okay. I guess I’ll talk a little bit about intent alignment, mostly because that’s what I’ve prepared for the most.

Paul Christiano: That’s also what I spend almost all my time thinking about, so I love talking about intent alignment.

Daniel Filan: All right, great. Well, I’ve got good news. Backing up a little bit. Sometimes when Eliezer Yudkowsky talks about AI, he talks about this task of copy-pasting a strawberry. Where you have a strawberry, and you have some system that has really good scanners, and maybe you can do nanotechnology stuff or whatever, and the goal is you have a strawberry, you want to look at how all of its cells are arranged, and you want to copy-paste it. So there’s a second strawberry right next to it that is cellularly identical to the first strawberry. I might be getting some details of this wrong, but that’s roughly it. And there’s the contention that we maybe don’t know how to safely do the “copy-paste the strawberry” task.

Daniel Filan: And I’m wondering, when you say intent alignment, do you mean some sort of alignment with my deep human psyche and all the things that I really value in the world, or do you intend that to also include things like: “today, I would like this strawberry copy-pasted? Can I get a machine that does that, without having all sorts of crazy weird side effects?”

Paul Christiano: The definitions definitely aren’t crisp, but I try and think in terms of an AI system, which is trying to “do what Paul wants”.
So the AI system may not understand all the intricacies of what Paul desires, and how Paul would want to reconcile conflicting intuitions. Also, there’s a broad range of interpretations of “what Paul wants”, so it’s unclear what I’m even referring to with that. But I am mostly interested in AI that’s broadly trying to understand “what Paul wants” and help Paul do that, rather than an AI which understands what I want really deeply, because I mostly want an AI that’s not actively killing all humans, or attempting to ensure humans are shoved over in the corner somewhere with no ability to influence the universe. Paul Christiano: And I’m really concerned about cases where AI is working at cross purposes to humans in ways that are very flagrant. And so I think it’s fair to say that taking some really mundane task, like put your strawberry on a plate or whatever, is a fine example task. And I think probably I’d be broadly on the same page as Eliezer. There’s definitely some ways we would talk about this differently. I think we both agree that having a really powerful AI, which can overkill the problem and do it in any number of ways, and getting it to just be like, yeah, the person wants a strawberry, could you give them a strawberry, and getting it to actually give them a strawberry, captures the, in some sense, core of the problem. Paul Christiano: I would say probably the biggest difference between us is in contrast with Eliezer, I am really focused on saying, I want my AI to do things as effectively as any other AI. I care a lot about this idea of being economically competitive, or just broadly competitive, with other AI systems. I think for Eliezer that’s a much less central concept. So the strawberry example is sort of a weird one to think about from that perspective, because you’re just like, all the AIs are fine putting a strawberry on a plate, maybe not for this “copy a strawberry cell by cell”. Maybe that’s a really hard thing to do. 
Yeah, I think we’re probably on the same page.

Daniel Filan: Okay, so you were saying that you carve up research projects that one could do, and root causes of failure, slightly differently. Was intent alignment a root cause of failure or a research problem?

Paul Christiano: Yeah, I think it’s a root cause of failure.

Daniel Filan: Okay. How would you carve up the research problems?

Paul Christiano: I spend most of my time just thinking about divisions within intent alignment, that is, what are the various problems that help with intent alignment? I’d be happy to just focus on that. I can also try and comment on problems that seem helpful for other dimensions of potential doom. I guess a salient distinction for me is, there’s lots of ways your AI could be better or more competent, that would also help reduce doom. For example, you could imagine working on AI systems that cooperate effectively with other AI systems, or AI systems that are able to defuse certain kinds of conflict that could otherwise escalate dangerously, or AI systems that understand a lot about human psychology, et cetera. So you could slice up those kinds of technical problems, that improve the capability of AI in particular ways, that reduce the risk of some of these dooms involving AI.

Paul Christiano: That’s what I mean when I say I’d slice up the research things you could do differently from the actual dooms. Yeah, I spend most of my time thinking about: within intent alignment, what are the things you could work on? And there, the sense in which I slice up research problems differently from sources of doom, is that I mostly think about a particular approach to making AI intent aligned, and then figuring out what the building blocks are of that approach. And there’ll be different approaches, there are different sets of building blocks, and some of them occur over and over again. Different versions of interpretability appear as a building block in many possible approaches.
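One way to picture the structure Paul describes here (several approaches, each needing all of its own building blocks, with some blocks shared between approaches) is as a small data structure. This is only a minimal sketch: apart from interpretability, the approach and block names are placeholders invented for illustration and do not come from the conversation.

```python
# Each approach maps to the set of building blocks it needs (all of them).
# The overall effort is in good shape if at least one approach is fully covered.
# All names except "interpretability" are hypothetical placeholders.
APPROACHES = {
    "approach_a": {"interpretability", "block_1", "block_2"},
    "approach_b": {"interpretability", "block_3"},
}


def viable_approaches(solved, approaches=APPROACHES):
    """Return the names of approaches whose every building block is in `solved`."""
    return sorted(name for name, blocks in approaches.items() if blocks <= solved)


def shared_blocks(approaches=APPROACHES):
    """Return the building blocks that appear in more than one approach."""
    counts = {}
    for blocks in approaches.values():
        for block in blocks:
            counts[block] = counts.get(block, 0) + 1
    return {block for block, n in counts.items() if n > 1}


print(viable_approaches({"interpretability", "block_3"}))
print(shared_blocks())
```

Under these made-up entries, solving a shared block helps every approach that uses it, but no approach becomes viable until all of its own blocks are solved.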
Paul Christiano: But I think the carving up, it’s kind of like a tree, or an or of ands, or something like that. And there are different top level ors at several different paths to being okay, and then for each of them you’d say, well, this one, you have to do the following five things or whatever. And so there’s two levels of carving up. One is between different approaches to achieving intent alignment, and then within each approach, different things that have to go right in order for that approach to help. Daniel Filan: Okay, so one question that I have about intent alignment is, it seems it’s sort of relating to this, what I might call a Humean decomposition. This philosopher David Hume said something approximately like, “Look, the thing about the way people work, is that they have beliefs, and they have desires. And beliefs can’t motivate you, only desires can, and the way they produce action is that you try to do actions, which according to your beliefs, will fulfill your desires.” And by talking about intent alignment, it seems you’re sort of imagining something similar for AI systems, but it’s not obviously true that that’s how AI systems work. Daniel Filan: In reinforcement learning, one way of training systems is to just basically search over neural networks, get one that produces really good behavior, and you look at it and it’s just a bunch of numbers. It’s not obvious that it has this kind of belief/desire decomposition. So I’m wondering, should I take it to mean that you think that that decomposition will exist? Or do you mean “intent” in some sort of behavioral way? How should I understand that? Paul Christiano: Yeah, it’s definitely a shorthand that is probably not going to apply super cleanly to systems that we build. So I can say a little bit about both the kinds of cases you mentioned and what I mean more generally, and also a little bit about why I think this shorthand is reasonable. 
I think the most basic reason to be interested in systems that aren’t trying to do something bad is there’s a subtle distinction between that and a system that’s trying to do the right thing. Doing the right thing is a goal we want to achieve. But there’s a more minimal goal, that’s a system that’s not trying to do something bad. So you might think that some systems are trying, or some systems can be said to have intentions or whatever, but actually it would be fine with the system that has no intentions, whatever that means. Paul Christiano: I think that’s pretty reasonable, and I’d certainly be happy with that. Most of my research is actually just focused on building systems that aren’t trying to do the wrong thing. Anyway, that caveat aside, I think the basic reason we’re interested in something like intention, is we look at some failures we’re concerned about. I think first, we believe it is possible to build systems that are trying to do the wrong thing. We are aware of algorithms like: “search over actions, and for each one predict its consequences, and then rank them according to some function of the consequences, and pick your favorite”. We’re aware of algorithms like that, that can be said to have intention. And we see how some algorithm like that, if, say, produced by stochastic gradient descent, or if applied to a model produced by stochastic gradient descent, could lead to some kinds of really bad policies, could lead to systems that actually systematically permanently disempower the humans. Paul Christiano: So we see how there are algorithms that have something like intention, that could lead to really bad outcomes. And conversely, when we look at how those bad outcomes could happen, like, if you imagine the robot army killing everyone, it’s very much not “the robot army just randomly killed everyone”. 
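The algorithm quoted above (“search over actions, and for each one predict its consequences, and then rank them according to some function of the consequences, and pick your favorite”) is short enough to write down. The following is a minimal sketch under simplifying assumptions: the function name `pick_action`, the finite action list, and the toy one-dimensional world are all illustrative choices, not anything specified in the conversation.

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")
Outcome = TypeVar("Outcome")


def pick_action(
    actions: Iterable[Action],
    predict: Callable[[Action], Outcome],
    score: Callable[[Outcome], float],
) -> Action:
    """Search over actions, predict each one's consequences, rank them, pick the best."""
    return max(actions, key=lambda action: score(predict(action)))


# Toy usage (assumed setup): actions are steps along a line, and the scoring
# function prefers ending up close to position 3.
start, target = 0, 3
best = pick_action(
    actions=[-2, -1, 0, 1, 2],
    predict=lambda step: start + step,
    score=lambda position: -abs(position - target),
)
print(best)
```

The search procedure itself is indifferent to what `score` rewards: swap in a different scoring function and the same code steers towards a different outcome, which is roughly the sense in which such an algorithm can be said to have an intention.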
There has to be some force keeping the process on track towards the killing everyone endpoint, in order to get this really highly specific sequence of actions. And the thing we want to point at is whatever that is. Paul Christiano: So maybe, I guess I most often think about optimization as a subjective property. That is, I will say that an object is optimized for some end. Let’s say I’m wondering, there’s a bit that was output by this computer. And I’m wondering, is the bit optimized to achieve human extinction? The way I’d operationalize that would be by saying, I don’t know whether the bit being zero or one is more likely to lead to human extinction, but I would say the bit is optimized just when, if you told me the bit was one, I would believe it’s more likely that the bit being one leads to human extinction. There’s this correlation between my uncertainty about the consequences of different bits that could be output, and my uncertainty about which bit will be output. Daniel Filan: So in this case, whether it’s optimized, could potentially depend on your background knowledge, right? Paul Christiano: That’s right. Yeah, different people could disagree. One person could think something is optimizing for A and the other person could think someone is optimizing for not A. That is possible in principle. Daniel Filan: And not only could they think that, they could both be right, in a sense. Paul Christiano: That’s right. There’s no fact of the matter beyond what the person thinks. And so from that perspective, optimization is mostly something we’re talking about from our perspective as algorithm designers. So when we’re designing the algorithm, we are in this epistemic state, and the thing we’d like to do, is, from our epistemic state, there shouldn’t be this optimization for doom. We shouldn’t end up with these correlations where the algorithm we write is more likely to produce actions that lead to doom. And that’s something where we are retreating. 
Most of the time we’re designing an algorithm, we’re retreating to some set of things we know and some kind of reasoning we’re doing. Or like, within that universe, we want to eliminate this possible bad correlation. Daniel Filan: Okay. Paul Christiano: Yeah, this exposes tons of rough edges, which I’m certainly happy to talk about lots of. Daniel Filan: Yeah. One way you could, I guess it depends a bit on whether you’re talking about correlation or mutual information or something, but on some of these definitions, one way you can reduce any dependence is if you know with certainty what the system is going to do. Or perhaps even if I don’t know exactly what’s going to happen, but I know it will be some sort of hell world. And then there’s no correlation, so it’s not optimizing for doom, it sounds like. Paul Christiano: Yeah. I think the way that I am thinking about that is, I have my robot and my robot’s taken some torques. Or I have my thing connected to the internet and it’s sending some packets. And in some sense we can be in the situation where it’s optimizing for doom, and certainly doom is achieved and I’m merely uncertain about what path leads to doom. I don’t know what packets it’s going to send. And I don’t know what packets lead to doom. If I knew, as algorithm designer, what packets lead to doom, I’d just be like, “Oh, this is an easy one. If the packet is going to suddenly lead to doom, no go.” I don’t know what packets lead to doom, and I don’t know what packets it’s going to output, but I’m pretty sure the ones it’s going to output lead to doom. Or I could be sure they lead to doom, or I could just be like, those are more likely to be doomy ones. 
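The operationalization above can be made concrete with a tiny probability table. This is a sketch under assumptions: the observer's beliefs are written as a joint distribution over (the bit that gets output, the bit value that would lead to the bad outcome), and the numbers are invented purely for illustration.

```python
import math


def doom_update(joint, observed_bit):
    """Return (prior, posterior) that `observed_bit` is the doom-causing value.

    joint[(out, doom)] is the observer's probability that the computer outputs
    `out` and that `doom` is the bit value which leads to the bad outcome.
    """
    p_out = sum(p for (out, _), p in joint.items() if out == observed_bit)
    if p_out == 0:
        raise ValueError("observer assigned zero probability to the observed bit")
    prior = sum(p for (_, doom), p in joint.items() if doom == observed_bit)
    posterior = joint.get((observed_bit, observed_bit), 0.0) / p_out
    return prior, posterior


def looks_optimized_for_doom(joint, observed_bit):
    """True if learning the output raises the observer's belief that it is the doomy bit."""
    prior, posterior = doom_update(joint, observed_bit)
    return posterior > prior and not math.isclose(posterior, prior)


# Illustrative belief states (made-up numbers).
correlated = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
certain = {(1, 1): 1.0}

for name, beliefs in [("correlated", correlated), ("independent", independent), ("certain", certain)]:
    print(name, looks_optimized_for_doom(beliefs, observed_bit=1))
```

The verdict depends only on the observer's table, so two observers with different tables can disagree. In the `certain` case nothing is learned from seeing the bit, so this particular test reports no optimization even though the bad outcome is assumed, which is the edge case Daniel raises; Paul's reply concerns the more typical situation where the designer is uncertain about both the output and which outputs are dangerous.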
Paul Christiano: And the situation I’m really terrified of as a human is the one where there’s this algorithm, which has the two following properties: one, its outputs are especially likely to be economically valuable to me for reasons I don’t understand, and two, its outputs are especially likely to be doomy for reasons I don’t understand. And if I’m a human in that situation, I have these outputs from my algorithm and I’m like, well, darn. I could use them or not use them. If I use them, I’m getting some doom. If I don’t use them, I’m leaving some value on the table, which my competitors could take. Daniel Filan: In the sense of value where- Paul Christiano: Like I could run a better company, if I used the outputs. I could run a better company that would have, each year, some probability of doom. And then the people who want to make that trade off will be the ones who end up actually steering the course of humanity, which they then steer to doom. Daniel Filan: Okay. So in that case, maybe the Humean decomposition there is: there’s this correlation between how good the world is or whatever, and what the system does. And the direction of the correlation is maybe going to be the intent or the motivations of the system. And maybe the strength of the correlation, or how tightly you can infer, that’s something more like capabilities or something. Does that seem right? Paul Christiano: Yeah. I guess I would say that on this Humean perspective, there’s kind of two steps, both of which are, to me, about optimization. One is, we say the system has accurate beliefs, by which we’re talking about a certain correlation. To me, this is also a subjective condition. I say the system correctly believes X, to the extent there’s a correlation between the actual truth of affairs and some representation it has. So one step like that. And then there’s a second step where there’s a correlation between which action it selects, and its beliefs about the consequences of the action. 
In some sense I do think I want to be a little bit more general than the framework you might use for thinking about humans. Paul Christiano: In the context of an AI system, there’s traditionally a lot of places where optimization is being applied. So you’re doing stochastic gradient descent, which is itself significant optimization over the weights of your neural network. But then those optimized weights will, themselves, tend to do optimization, because some weights do, and the weights that do, you have optimized towards them. And then also you’re often combining that with explicit search: after you’ve trained your model, often you’re going to use it as part of some search process. So there are a lot of places optimization is coming into this process. And so I’m not normally thinking about the AI that has some beliefs and some desires that decouple, but I am trying to be doing this accounting or being like, well, what is a way in which this thing could end up optimizing for doom? Paul Christiano: How can we get some handle on that? And I guess I’m simultaneously thinking, how could it actually be doing something productive in the world, and how can it be optimizing for doom? And then trying to think about, is there a way to decouple those, or get the one without the other. But that could be happening. If I imagine an AI, I don’t really imagine it having a coherent set of beliefs. I imagine it being this neural network, such that there are tons of parts of the neural network that could be understood as beliefs about something, and tons of parts of the neural network that could be understood as optimizing. So it’d be this very fragmented, crazy mind. Probably human minds are also like this, where they don’t really have coherent beliefs and desires. But in the AI, we want to stamp out all of the desires that are not helping humans get what they want, or at least, at a minimum, all of the desires that involve killing all the humans. 
#### Outer and inner alignment

Daniel Filan: Now that I sort of understand intent alignment, sometimes people divide this up into outer and inner versions of intent alignment. Sometimes people talk about various types of robustness that properties could have, or that systems could have. I’m wondering, do you have a favorite of these further decompositions, or do you not think about it that way as much?

Paul Christiano: I mentioned before this or of ands, where there’s lots of different paths you could go down, and then within each path there’ll be lots of breakdowns of what technical problems need to be resolved. I guess I think of outer and inner alignment as: for several of the leaves in this or of ands, or several of the branches in this or of ands, for several of the possible approaches, you can talk about “these things are needed to achieve outer alignment and these things are needed to achieve inner alignment, and with their powers combined we’ll achieve a good outcome”. Often you can’t talk about such a decomposition. In general, I don’t think you can look at a system and be like, “oh yeah, that part’s outer alignment and that part’s inner alignment”. So the times when you can talk about it most, or the way I use that language most often, is for a particular kind of alignment strategy that’s like a two step plan. Step one is, develop an objective that captures what humans want well enough to be getting on with. It’s going to be something more specific, but you have an objective that captures what humans want in some sense. Ideally it would exactly capture what humans want. So, you look at the behavior of a system and you’re just exactly evaluating how good for humans is it to deploy a system with that behavior, or something. So you have that as step one and then that step would be outer alignment.
And then step two is, given that we have an objective that captures what humans want, let’s build a system that’s internalized that objective in some sense, or is not doing any other optimization beyond pursuit of that objective. Daniel Filan: And so in particular, the objective is an objective that you might want the system to adopt, rather than an objective over systems? Paul Christiano: Yeah. I mean, we’re sort of equivocating in this way that reveals problematicness or something, but the first objective is an objective. It is a ranking over systems, or some reward that tells us how good a behavior is. And then we’re hoping that the system then adopts that same thing, or some reflection of that thing, like with a ranking over policies. And then we just get the obvious analog of that over actions. Daniel Filan: And so you think of these as different subproblems to the whole thing of intent alignment, rather than objectively, oh, this system has an outer alignment problem, but the inner alignment’s great, or something? Paul Christiano: Yeah, that’s right. I think this makes sense on some approaches and not on other approaches. I am most often thinking of it as: there’s some set of problems that seem necessary for outer alignment. I don’t really believe that the problems are going to split into “these are the outer alignment problems, and these are the inner alignment problems”. I think of it more as the outer alignment problems, or the things that are sort of obviously necessary for outer alignment, are more likely to be useful stepping stones, or warm up problems, or something. I suspect in the end, it’s not like we have our piece that does outer alignment and our piece that does inner alignment, and then we put them together. Paul Christiano: I think it’s more like, there were a lot of problems we had to solve. In the end, when you look at the set of problems, it’s unclear how you would attribute responsibility. 
There’s no part that’s solving outer versus inner alignment. But there were a set of sub problems that were pretty useful to have solved. It’s just, the outer alignment thing here is acting as an easy, special case to start with, or something like that. It’s not technically a special case. There’s actually something worth saying there probably, which is, it’s easier to work on a special case, than to work on some vaguely defined, “here’s a thing that would be nice”. So I do most often, when I’m thinking about my research, when I want to focus on sub problems to specialize on the outer alignment part, which I’m doing more in this warmup problem perspective, I think of it in terms of high stakes versus low stakes decisions. Paul Christiano: So in particular, if you’ve solved what we’re describing as outer alignment, if you have a reward function that captures what humans care about well enough, and if the individual decisions made by your system are sufficiently low stakes, then it seems like you can get a good outcome just by doing online learning. That is, you constantly retrain your system as it acts. And it can do bad things for a while as it moves out of distribution, but eventually you’ll fold that data back into the training process. And so if you had a good reward function and the stakes are low, then you can get a good outcome. So when I say that I think about outer alignment as a subproblem, I mostly mean that I ignore the problem of high stakes decisions, or fast acting catastrophes, and just focus on the difficulties that arise, even when every individual decision is very low stakes. Daniel Filan: Sure. So that actually brings up another style of decomposition that some people prefer, which is a distributional question. So there’s one way of thinking about it where outer alignment is “pick a good objective” and inner alignment is “hope that the system assumes that objective”. 
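The low-stakes argument above (a good reward function plus constant retraining means bad behavior after a shift gets corrected) can be illustrated with a toy loop. This is a sketch under strong assumptions, none of which come from the conversation: there are only two actions, the trusted reward function can score every action after each step, and "retraining" is modeled as nudging running reward estimates rather than updating a real model.

```python
def run_online(reward_fn, situations, n_actions=2, lr=0.5):
    """Act greedily on running reward estimates, updating after every step.

    reward_fn(situation, action) -> float in [0, 1] is the trusted reward.
    Returns the list of rewards the system actually obtained.
    """
    estimates = [0.0] * n_actions
    obtained = []
    for situation in situations:
        action = max(range(n_actions), key=lambda a: estimates[a])
        obtained.append(reward_fn(situation, action))
        # Fold the new data back in: score every action with the trusted
        # reward and move each estimate towards what was just observed.
        for a in range(n_actions):
            estimates[a] += lr * (reward_fn(situation, a) - estimates[a])
    return obtained


def toy_reward(situation, action):
    """Hypothetical reward: action 0 is right before the shift, action 1 after it."""
    good_action = 0 if situation == "train" else 1
    return 1.0 if action == good_action else 0.0


situations = ["train"] * 20 + ["shifted"] * 20
rewards = run_online(toy_reward, situations)
bad_steps = sum(1 for r in rewards if r == 0.0)
print(f"bad steps: {bad_steps} out of {len(rewards)}")
```

The system does the wrong thing right after the shift and then recovers, so the total loss stays small as long as each bad step costs only a bounded amount. The high-stakes case is exactly the one this loop does not cover: if a single bad step were catastrophic, recovering afterwards would not help.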
Another distinction people sometimes make is, okay, firstly, we’ll have a set of situations that we’re going to develop our AI to behave well in. And step one is making sure our AI does the right thing in that test distribution, which is, I guess, supposed to be kind of similar to outer alignment; you train a thing that’s sort of supposed to roughly do what you want, then there’s this question of, does it generalize in a different distribution. Daniel Filan: Firstly, does it behave competently, and then does it continue to reliably achieve the stuff that you wanted? And that’s supposed to be more like inner alignment, because if the system had really internalized the objective, then it would supposedly continue pursuing it in later places. And there are some distinctions between that and, especially the frame where alignment is supposed to be about: are you representing this objective in your head? And I’m wondering how do you think about the differences between those frames or whether you view them as basically the same thing? Paul Christiano: I think I don’t view them as the same thing. I think of those two splits and then a third split, I’ll allude to briefly of avoiding very fast catastrophes versus average case performance. I think of those three splits as just all roughly agreeing. There will be some approaches where one of those splits is a literal split of the problems you have to solve, where it literally factors into doing one of those and then doing the other. I think that the exact thing you stated is a thing people often talk about, but I don’t think it really works even as a conceptual split, quite. Where the main problem is just, if you train AI systems to do well in some distribution, there’s two big, related limitations you get. Paul Christiano: One is that doesn’t work off distribution. The other is just that, you only have an average case property over that distribution. 
So it seems in the real world, it is actually possible, or it looks like it’s almost certainly going to be possible, for deployed AI systems to fail quickly enough that the actual harm done by individual bad decisions is much too large to bound with an average case guarantee.

Paul Christiano: So you can imagine the system which appears to work well on distribution, but actually with one in every quadrillion decisions, it just decides now it’s time to start killing all the humans, and that system is quite bad. And I think that in practice, probably it’s better to lump that problem in with distributional shift, which kind of makes sense. And maybe people even mean to include that - it’s a little bit unclear exactly what they have in mind, but distributional shift is just changing the probabilities of outcomes. And the concern is really just things that were improbable under your original distribution. And you could have a problem either because you’re in a new distribution where those things go from being very rare to being common, or you could have a problem just because they were relatively rare, so you didn’t encounter any during training, but if you keep sampling, even on distribution, eventually one of those will get you and cause trouble.

Daniel Filan: Maybe they were literally zero in the data set you drew, but not in the “probability distribution” that you drew your data set from.

Paul Christiano: Yeah, so I guess maybe that is fair. I really naturally reach for the underlying probability distribution, but I think out of distribution, in some sense, is most likely to be our actual split of the problem if we mean the empirical distribution over the actual episodes at hand. Anyway, I think of all three of those decompositions, then. That was a random caveat on the out of distribution one.

Daniel Filan: Sure.

Paul Christiano: I think of all of those as related breakdowns.
My guess is that the right way of going doesn’t actually respect any of those breakdowns, and doesn’t have a set of techniques that solve one versus the other. But I think it is very often helpful. It’s just generally, when doing research, helpful to specialize on a subproblem. And I think often one branch or the other of one of those splits is a helpful way to think about the specialization you want to do, during a particular research project. The splits I most often use are this low stakes one where you can train online and individual decisions are not catastrophic, and the other arm of that split is: suppose you have the ability to detect a catastrophe if one occurs, or you trust your ability to assess the utility of actions. And now you want to build a system which doesn’t do anything catastrophic, even when deployed in the real world on a potentially different distribution, encountering potentially rare failures.

Paul Christiano: That’s the split I most often use, but I think none of these are likely to be respected by the actual list of techniques that together address the problem. But often one half or the other is a useful way to help zoom in on what assumptions you want to make during a particular research project.

Daniel Filan: And why do you prefer that split?

Paul Christiano: I think most of all, because it’s fairly clear what the problem statement is. So the problem statement, there, is just a feature of the thing outside of your algorithm. Like, you’re writing some algorithm. And then your problem statement is, “Here is a fact about the domain in which you’re going to apply the algorithm.” The fact is that it’s impossible to mess things up super fast.

Daniel Filan: Okay.

Paul Christiano: And it’s nice to have a problem statement which is entirely external to the algorithm. If you want to just say, “here’s the assumption we’re making now; I want to solve that problem”, it’s great to have an assumption on the environment be your assumption.
There’s some risk if you say, “Oh, our assumption is going to be that the agent’s going to internalize whatever objective we use to train it.” The definition of that assumption is stated in terms of, it’s kind of like helping yourself to some sort of magical ingredient. And, if you optimize for solving that problem, you’re going to push into a part of the space where that magical ingredient was doing a really large part of the work. Which I think is a much more dangerous dynamic. If the assumption is just on the environment, in some sense, you’re limited in how much of that you can do. You have to solve the remaining part of the problem you didn’t assume away. And I’m really scared of sub-problems which just assume that some part of the algorithm will work well, because I think you often just end up pushing an inordinate amount of the difficulty into that step.

##### Thoughts on agent foundations

Daniel Filan: Okay. Another question that I want to ask about these sorts of decompositions of problems is, I think most of the intellectual tradition that’s spawned off of Nick Bostrom and Eliezer Yudkowsky uses an approach kind of like this, maybe with an emphasis on learning things that people want to do. That’s particularly prominent at the research group I work at. There’s also, I think some subset of people largely I think concentrated at the Machine Intelligence Research Institute, that have this sense that “Oh, we just don’t understand the basics of AI well enough. And we need to really think about decision theory, and we really need to think about what it means to be an agent. And then, once we understand this kind of stuff better, then maybe it’ll be easier to solve those problems.” That’s something they might say.

Daniel Filan: What do you think about this approach to research where you’re just like, “Okay, let’s like figure out these basic problems and try and get a good formalism that we can work from, from there on.”

Paul Christiano: I think, yeah.
This is mostly a methodological question, probably, rather than a question about the situation with respect to AI, although it’s not totally clear; there may be differences in belief about AI that are doing the real work, but methodologically I’m very drawn - Suppose you want to understand better, what is optimization? Or you have some very high level question like that. Like, what is bounded rationality? I am very drawn to an approach where you say, “Okay, we think that’s going to be important down the line.” I think at some point, as we’re trying to solve alignment, we’re going to really be hurting for want of an understanding of bounded rationality. I really want to just be like, “Let’s just go until we get to that point, until we really see what problem we wanted to solve, and where it was that we were reaching for this notion of bounded rationality that we didn’t have.”

Paul Christiano: And then at that point, we will have some more precise specification of what we actually want out of this theory of bounded rationality.

Daniel Filan: Okay.

Paul Christiano: And I think that is the moment to be trying to dig into those concepts more. I think it’s scary to try and go the other way. I think it’s not totally crazy at all. And there are reasons that you might prefer it. I think the basic reason it’s scary is that there’s probably not a complete theory of everything for many of these questions. There’s a bunch of questions you could ask, and a bunch of answers you get that would improve your understanding. But we don’t really have a statement of what it is we actually seek. And it’s just a lot harder to research when you’re like, I want to understand. Though in some domains, this is the right way to go.

Paul Christiano: And that’s part of why it might come down to facts about AI, whether it’s the perfect methodology in this domain. But I think it’s tough to be like, “I don’t really know what I want to know about this thing.
I’m just kind of interested in what’s up with optimization”, and then researching optimization. Relative to being like, “Oh, here’s a fairly concrete question that I would like to be able to answer, a fairly concrete task I’d like to be able to address. And which I think is going to come down to my understanding of optimization.” I think that’s just an easier way to better understand what’s up with optimization.

Daniel Filan: Yeah. So at these moments where you realize you need a better theory or whatever, are you imagining them looking like, “Oh, here’s this technical problem that I want to solve and I don’t know how to, and it reminds me of optimization?” Or, what does the moment look like when you’re like, “Ah, now’s the time.”

Paul Christiano: I think the way the whole process most often looks is: you have some problem. The way my research is organized, it’s very much like, “Here’s the kind of thing our AI could learn”, for which it’s not clear how our aligned AI learned something that’s equally useful. And I think about one of these cases and dig into it. And I’m like, “Here’s what I want. I think this problem is solvable. Here’s what I think the aligned AI should be doing.”

Paul Christiano: And I’m thinking about that. And then I’m like, “I don’t know how to actually write down the algorithm that would lead to the aligned AI doing this thing.” And walking down this path, I’m like, “Here’s a piece of what it should be doing. And here’s a piece of how the algorithm should look.”

Paul Christiano: And then at some point you step back and you’re like, “Oh wow. It really looks like what I’m trying to do here is algorithmically test for one thing being optimized over another”, or whatever. And that’s a particularly doomy sounding example. But maybe I have some question like that. Or I’m wondering, “What is it that leads to the conditional independences the human reports in this domain.
I really need to understand that better.” And I think most often, for me, the next step is not to be like, “Okay, now let’s go understand that question. Now that it’s come up.” It’s most often, “Let us flag and try and import everything that we know about that area.” I’m now asking a question that feels similar to questions people have asked before. So I want to make sure I understand what everyone has said about that area.

Paul Christiano: This is a good time to read up on everything that looks like it’s likely to be relevant. The reading up is cheap to do in advance. So you should be trigger happy with that one. But then there’s no actual pivot into thinking about the nature of optimization. It’s just continuing to work on this problem. Some of those lemmas may end up feeling like statements about optimization, but there was no step where you were like, “Now it’s time to think about optimization.” It’s just like, “Let us keep trying to design this algorithm, and then see what concepts fall out of that.”

Daniel Filan: And you mentioned that there were some domains where, actually thinking about the fundamentals early on was the right thing to do. Which domains are you thinking of? And what do you see as the big differences between those ones and AI alignment?

Paul Christiano: So I don’t know that much about the intellectual history of almost any field. The field I’m most familiar with by far is computer science. I think in computer science, especially - so my training is in theoretical computer science and then I spend a bunch of time working in machine learning and deep learning - I think the problem first perspective just generally seems pretty good. And I think to the extent that “let’s understand X” has been important, it’s often at the problem selection stage, rather than “now we’re going to research X in an open-ended way”. It’s like, “Oh, X seems interesting. And this problem seems to shed some light on X.
So now that’s a reason to work on this problem.” Like, that’s a reason to try and predict this kind of sequence with ML or whatever. It’s a reason to try and write an algorithm to answer some question about graphs.

Paul Christiano: So I think in those domains, it’s not that often the case, that you just want to start off and have some high big picture question, and then think about it abstractly. My guess would be that in domains where more of the game is walking up to nature and looking at things and seeing what you see, it’s a little bit different. It’s not as driven as much by you’re coming up with an algorithm and running into constraints in designing an algorithm. I don’t really know that much about the history of science though. So I’m just guessing that that might be a good approach sometimes.

#### Possible technical solutions to AI x-risk

##### Imitation learning, inverse reinforcement learning, and ease of evaluation

Daniel Filan: All right. So, we’ve talked a little bit about the way you might decompose inner alignment, or the space of dealing with existential risk, into problems, one of which is inner alignment. I’d like to talk a little bit on a high level about your work on the solutions to these problems, and other work that people have put out there. So the first thing I want to ask is: as I mentioned, I’m in a research group, and a lot of what we do is think about how a machine learning system could learn some kind of objective from human data. So perhaps there’s some human who has some desires, and the human acts a certain way because of those desires. And we use that to do some kind of inference. So this might look like inverse reinforcement learning. A simple version of it might look like imitation learning. And I’m wondering what you think of these approaches for things that look more like outer alignment, more like trying to specify what a good objective is.
Paul Christiano: So broadly, I think there are two kinds of goals you could be trying to serve with work like that. For me, there’s this really important distinction as we try and incorporate knowledge that a human demonstrator or human operator lacks. The game changes as you move from the regime where you could have applied imitation learning, in principle, because the operator could demonstrate how to do the task, to the domain where the operator doesn’t understand how to do the task. At that point, they definitely aren’t using imitation learning. And so from my perspective, one thing you could be trying to do with techniques like this, is work well in that imitation learning regime. In the regime where you could have imitated the operator, can you find something that works even better than imitating the operator? And I am pretty interested in that. And I think that imitating the operator is not actually that good a strategy, even if the operator is able to do the task in general. So I have worked some on reinforcement learning from human feedback in this regime. So imagine there’s a task where a human understands what makes performance good or bad: just have the human evaluate individual trajectories, learn to predict those human evaluations, and then optimize that with RL.

Paul Christiano: I think the reason I’m interested in that technique in particular is I think of it as the most basic thing you can do, or that most makes clear exactly what the underlying assumption is that is needed for the mechanism to work. Namely, you need the operator to be able to identify which of two possible executions of a behavior is better. Anyway, there’s then this further thing. And I don’t think that that approach is the best approach. I think you can do better than asking the human operator, “which of these two is better”.
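Editor's sketch: the recipe described above (have the operator compare executions, learn to predict those judgments, then optimize against the prediction) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the method used in any particular paper: the two trajectory features, the hidden operator weights, the candidate names, and the use of a linear Bradley-Terry model are all hypothetical choices made for illustration, and the "RL" step is replaced by picking the best of three fixed candidates.

```python
import math
import random

random.seed(0)

# ASSUMPTION: each trajectory is summarized by two hypothetical features,
# e.g. (task_progress, carefulness), each in [0, 1].
# ASSUMPTION: the simulated operator secretly scores trajectories with these
# weights. They are used only to generate pairwise answers.
HIDDEN_WEIGHTS = (2.0, 1.0)


def operator_prefers_first(a, b):
    """Simulated human: True if trajectory a looks better than trajectory b."""
    def score(traj):
        return sum(w * f for w, f in zip(HIDDEN_WEIGHTS, traj))
    return score(a) > score(b)


def predicted_reward(weights, traj):
    """Linear reward model: the learned stand-in for the human's judgment."""
    return sum(w * f for w, f in zip(weights, traj))


def collect_comparisons(n_pairs):
    """Ask the operator which of two random trajectories is better."""
    data = []
    for _ in range(n_pairs):
        a = (random.random(), random.random())
        b = (random.random(), random.random())
        data.append((a, b) if operator_prefers_first(a, b) else (b, a))
    return data  # list of (better, worse) pairs


def fit_reward_model(comparisons, steps=500, lr=1.0):
    """Gradient ascent on the Bradley-Terry log-likelihood of the comparisons."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for better, worse in comparisons:
            margin = predicted_reward(weights, better) - predicted_reward(weights, worse)
            p_correct = 1.0 / (1.0 + math.exp(-margin))
            for i in range(2):
                grad[i] += (1.0 - p_correct) * (better[i] - worse[i])
        weights = [w + lr * g / len(comparisons) for w, g in zip(weights, grad)]
    return weights


comparisons = collect_comparisons(300)
reward_weights = fit_reward_model(comparisons)

# Stand-in for the RL step: choose the candidate the learned reward likes most.
candidates = {
    "fast_and_safe": (0.9, 0.8),
    "slow_but_safe": (0.2, 0.9),
    "fast_but_sloppy": (0.5, 0.1),
}
best = max(candidates, key=lambda name: predicted_reward(reward_weights, candidates[name]))
print("learned weights:", reward_weights)
print("selected behavior:", best)
```

The only thing the sketch asks of the simulated operator is the ability to say which of two executions is better, which is the assumption named above. Everything else (the features, the model class, the optimizer) is replaceable.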
Paul Christiano: I think it’s pretty plausible that basically past there, you’re just talking about data efficiency, like how much human time do you need and so on, and how easy is it for the human, rather than a fundamental conceptual change. But I’m not that confident of that. There’s a second thing you could want to do where you’re like, “Now let’s move into the regime where you can’t ask the human which of these two things is better, because in fact, one of the things the human wants to learn about is which of these two behaviors is better. The human doesn’t know; they’re hoping AI will help them understand.”

Daniel Filan: Actually what’s the situation in which we might want that to happen?

Paul Christiano: Might want to move beyond the human knowing?

Daniel Filan: Yeah. So suppose we want to get to this world where we’re not worried about AI systems trying to kill everyone.

Paul Christiano: Mhm.

Daniel Filan: And we can use our AI systems to help us with that problem, maybe. Can we somehow get to some kind of world where we’re not going to build really smart AI systems that want to destroy all value in the universe, without solving these kinds of problems where it’s difficult for us to evaluate which solutions are right?

Paul Christiano: I think it’s very unclear. I think eventually, it’s clear that AI needs to be doing these tasks that are very hard for humans to evaluate which answer is right. But it’s very unclear how far off that is. That is, you might first live in a world where AI has had a crazy transformative impact before AI systems are regularly doing things that humans can’t understand. Also there are different degrees of “beyond a human’s ability to understand” what the AI is doing.
So I think that’s a big open question, but in terms of the kinds of domains where you would want to do this, there’s generally this trade-off between over what horizon you evaluate behavior, or how much you rely on hindsight, and how much do you rely on foresight, or the human understanding which behavior will be good.

Daniel Filan: Yep.

Paul Christiano: So the more you want to rely on foresight, the more plausible it is that the human doesn’t understand well enough to do the operation. So for example, if I imagine my AI is sending an email for me. One regime is the regime where it’s basically going to send the email that I like most. I’m going to be evaluating either actually, or it’s going to be predicting what I would say to the question, “how good is this email?” And it’s going to be sending the email for which Paul would be like, “That was truly the greatest email.” The second regime is where I send the email and then my friend replies, and I look at the whole email thread that results, and I’m like, “Wow, that email seemed like it got my friend to like me, I guess that was a better email.” And then there’s an even more extreme one where then I look back on my relationship with my friend in three years and I’m like, “Given all the decisions this AI made for me over three years, how much did they contribute to building a really lasting friendship?”

Paul Christiano: I think if you’re going into the really short horizon where I’m just evaluating an email, it’s very easy to get to the regime where I think AI can be a lot better than humans at that question. Just like, it’s very easy for there to be empirical facts and be like, “What kind of email gets a response?” Or “What kind of email will be easily understood by the person I’m talking to?” Where an AI that has sent a hundred billion emails, will just potentially have a big advantage over me as a human.
And then as you push out to longer horizons, it gets easier for me to evaluate, it’s easier for a human to be like, “Okay, the person says they understood.” I can evaluate the email in light of the person’s response as well as an AI could.

Paul Christiano: But as you move out to those longer horizons, then you start to get scared about that evaluation. It becomes scarier to deal with. There starts to be more room for manipulation of the metrics that I use. I’m saying all that to say, there’s this general factor of, when we ask like “Are AI systems needing to do things that humans couldn’t evaluate which of the two behaviors is better”, it depends a lot how long we make the behaviors, and how much hindsight we give to human evaluators.

Daniel Filan: Okay.

Paul Christiano: And in general, that’s part of the tension or part of the game. We can make the thing clear by just talking about really long horizon behaviors. So if I’m like, we’re going to write an infrastructure bill, and I’m like, “AI, can you write an infrastructure bill for me?”

Paul Christiano: It’s very, very hard for me to understand which of two bills is better. And there is the thing where again, in the long game, you do want AI systems helping us as a society to make that kind of decision much better than we would if it was just up to humans to look at the bill, or even a thousand humans looking at the bill. It’s not clear how early you need to do that. I am particularly interested in all of the things humans do to keep society on track. All of the things we do to manage risks from emerging technologies, all the things we do to cooperate with each other, et cetera. And I think a lot of those do involve… I’m more interested in AI because they may help us make those decisions better, rather than make them faster. And I think in cases where you want something more like wisdom, it’s more likely that the value added, if AI is to add value, will be in ways that humans couldn’t easily evaluate.
Daniel Filan: Yeah. So we were talking about imitation learning or inverse reinforcement learning. So looking at somebody do a bunch of stuff and then trying to infer what they were trying to do. We were talking about, there are these solutions to outer alignment, and you were saying, yeah, it works well for things where you can evaluate what’s going to happen, but for things that can’t… and I think I cut you off around there.

Paul Christiano: Yeah, I think that’s interesting. I think you could have pursued this research. Either trying to improve the imitation learning setting, like “Look, imitation learning actually wasn’t the best thing to do, even when we were able to demonstrate.” I think that’s one interesting thing to do, which is the context where I’ve most often thought about this kind of thing. A second context is where you want to move into this regime where a human can’t say which thing is better or worse. I can imagine, like you’ve written some bill, and we’re like, how are we going to build an AI system that writes good legislation for us? In some sense, actually the meat of the problem is not writing up the legislation, it’s helping predict which legislation is actually good. We can sort of divide the problem into those two pieces. One is an optimization problem, and one is a prediction problem. And for the prediction component, that’s where it’s unclear how you go beyond human ability. It’s very easy to go beyond human ability on the optimization problem: just dump more compute into optimizing.

Paul Christiano: I think you can still try and apply things like inverse reinforcement learning though. You can be like: “Humans wrote a bunch of bills. Those bills were imperfect attempts to optimize something about the world.
You can try and back out from looking at not only those bills, but all the stories people write, all the words they say, blah, blah, blah.” We can try and back out what it is they really wanted, and then give them a prediction of how well the bill will achieve what they really wanted. And I think that is particularly interesting. In some sense, that is, from a long-term safety perspective more interesting than the case where the human operator could have understood the consequences of the AI’s proposals. But I am also very scared. I don’t think we currently have really credible proposals for inverse reinforcement learning working well in that regime.

Daniel Filan: What’s the difficulty of that?

Paul Christiano: So I think the hardest part is I look at some human behaviors, and the thing I need to do is disentangle which aspects of human behavior are limitations of the human - which are things the human wishes about themselves they could change - and which are reflections of what they value. And in some sense, in the imitation learning regime, we just get to say “Whatever. We don’t care. We’re getting the whole thing. If the humans make bad predictions, we get bad predictions.” In the inverse reinforcement learning case, we need to look at a human who is saying these things about what they want over the long-term or what they think will happen over the long-term, and we need to decide which of them are errors. There’s no data that really pulls that apart cleanly. So it comes down to either facts about the prior, or modeling assumptions.

Paul Christiano: And so then, the work comes down to how much we trust those modeling assumptions in what domains. And I think my basic current take is: the game seems pretty rough. We don’t have a great menu of modeling assumptions available right now. I would summarize the best thing we can do right now as, in this prediction setting, amounting to: train AI systems to make predictions about all of the things you can easily measure.
Train AI systems to make judgements in light of AI systems’ predictions about what they could easily measure, or maybe judgements in hindsight, and then predict those judgements in hindsight.

Paul Christiano: Maybe the prototypical example of this is, train an AI system to predict a video of the future. Then have humans look at the video of the future and decide which outcome they like most. I think the reason to be scared of like the most developed form of this, so the reason I’m scared of the most developed form of this, is we are in the situation now where AI really wants to push on this video of the future that’s going to get shown to the human. And distinguishing between the video of the future that gets shown to the human and what’s actually happening in the world, seems very hard.

Paul Christiano: I guess that’s, in some sense, the part of the problem I most often think about. So either looking forward to a future where it’s very hard for a human to make heads or tails of what’s happening, or a future where a human believes they can make heads and tails of what’s happening, but they’re mistaken about that. For example, a thing we might want our AIs to help us do is to keep the world sane, and make everything make sense in the world. So if our AI shows us several videos of the future, and nine of them are incomprehensible and one of them makes perfect sense, we’re like, “Great, give me the future that makes perfect sense.” And the concern is just, do we get there by having an AI which is instead of making the world make sense, is messing with our ability to understand what’s happening in the world? So we just see the kind of thing we wanted to see or expected to see. And, to the extent that we’re in an outer alignment failure scenario, that’s kind of what I expect failures to ultimately look like.
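Editor's sketch: the worry about optimizing "the video that gets shown to the human" can be made concrete with a toy selection problem. This is only an illustration under invented assumptions: the three plans and all of their numbers are hypothetical, and real systems would not expose a `true_value` field to compare against.

```python
# ASSUMPTION: three hypothetical plans, each with a made-up true outcome and a
# made-up score for how good the predicted video of that outcome looks.
plans = {
    "improve_world": {"true_value": 0.8, "video_looks": 0.8},
    "do_nothing": {"true_value": 0.3, "video_looks": 0.3},
    "tamper_with_camera": {"true_value": 0.1, "video_looks": 0.95},
}


def human_rating(plan):
    """The human evaluator can only judge what the video shows."""
    return plan["video_looks"]


def true_outcome(plan):
    """What actually happens in the world (not available to the evaluator)."""
    return plan["true_value"]


# Optimizing the human's rating of the video selects the tampering plan...
chosen = max(plans, key=lambda name: human_rating(plans[name]))
# ...while optimizing the real outcome would have selected a different one.
ideal = max(plans, key=lambda name: true_outcome(plans[name]))

print("selected by optimizing the video rating:", chosen)
print("selected by optimizing the true outcome:", ideal)
```

In this toy the gap is visible because both columns are written down. The difficulty described above is that in practice only the `video_looks` column is observable, so nothing in the training signal distinguishes the two selections.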
##### Paul’s favorite outer alignment solutions

Daniel Filan: So in the realm of things roughly like outer alignment, or alignment dealing with low stakes, repeatable problems, what kind of solutions are you most interested in from a research perspective?

Paul Christiano: I don’t have a very short answer to this question. So I guess you’ll get a kind of long answer to this question.

Daniel Filan: That in itself is interesting.

Paul Christiano: Yeah. And maybe there’s also two kinds of answers I can give. One is like the thing that I am most animated by, that I am working on myself. Another is a broader, here are kinds of things people do in the world that I’m particularly excited by, amongst existing research directions. Maybe my default would be to go through some of the things people do in the world that I’m excited by, and then turn to the thing I’m most animated by but I’d be happy to do the other order if that seems better.

Daniel Filan: Let’s try in the first order.

###### Solutions researched by others

Paul Christiano: I guess one thing that seems like it comes up constantly as a useful building block, or an essential ingredient in many possible plans, which also seems both tractable to work on and really hard, is interpretability. So we’re very frequently in a situation where we’ve trained some very large neural network. We know that it’s able to make good predictions in some domain, and we’re not really able to understand what it knows about that domain. Sometimes we’re able to play some clever game and say something about why it’s making the prediction it’s making, or what kind of thing it knows about or doesn’t know about. But for the most part, our methods there are very similar to just doing some kind of behavioral analysis where we’re like, “Oh, if you change this part of the input it gets it wrong.
So apparently that’s what it’s paying attention to.” I think there’s some hope for techniques that are more mechanically looking at what computation is performed by the model, and then somehow understanding something about what it has learned, so that you can better understand whether predictions it’s making are reasonable, et cetera. So I guess that’s just something I’m quite interested in, to the extent that we’re able to make headway on it.

Daniel Filan: Okay. And how does that help in these outer alignment type settings?

Paul Christiano: Yeah. So I think the biggest thing is that, imagine your model again, which is predicting videos from the future, and you’d like to distinguish the case where actually everything in the future is great, versus the case where actually the future is terrible, but there’s a nice little village set up in front of the camera. We’re concerned about models, which are deliberately obfuscating what’s happening on camera. That is AIs which are deliberately planning to put up the nice little village: they’re building the houses, they’re ensuring the camera doesn’t go out of the village, etc.

Daniel Filan: Yeah.

Paul Christiano: This is a very crude metaphor, but the AI which is deliberately doing that, which is choosing actions from this tiny space of actions to engineer this very specific outcome, in some sense, somewhere deep in its heart, it understands a lot of what’s happening in the world. It understands that if the camera turned just this way, it would see something objectionable, so it doesn’t let it do that. And so it feels like if you have, in some sense, it doesn’t even feel like that much to ask of your interpretability tools to be able to reach inside and be like, “Oh, okay.
Now if we look at what it’s thinking, clearly there’s this disconnect between what’s happening in the world and what’s reported to the human.” And I don’t think there are that many credible approaches for that kind of problem, other than some kind of headway on interpretability. So yeah, I guess that’s my story about how it helps.

Daniel Filan: Okay.

Paul Christiano: I think there are many possible stories about how it helps. That’s the one I’m personally most interested in.

Daniel Filan: All right. So that’s one approach that you like.

Paul Christiano: I mean, I think in terms of what research people might do, I’m just generally very interested in taking a task that is challenging for humans in some way, and trying to train AI systems to do that task, and seeing what works well, seeing how we can help humans push beyond their native ability to evaluate proposals from an AI. And tasks can be hard for humans in lots of ways. You can imagine having lay humans evaluating expert human answers to questions and saying, “How can we build an AI that helps expose this kind of expertise to a lay human?”

Paul Christiano: The interesting thing is the case where you don’t have any trusted humans who have that expertise, where we as a species are looking at our AI systems and they have expertise that no humans have. And we can try and study that today by saying, “Imagine a case where the humans who are training the AI system, lack some expertise that other humans have.” And it gives us a nice little warm up environment in some sense.

Daniel Filan: Okay.

Paul Christiano: You could have the experts come in and say, “How well did you do?” You have gold standard answers, unlike in the final case. There’s other ways tasks can be hard for humans.
You can also consider tasks that are computationally demanding, or involve lots of input data; tasks where human abilities are artificially restricted in some way; you could imagine people who can’t see are training an ImageNet model to tell them about scenes in natural language.

Daniel Filan: Okay.

Paul Christiano: Again, the model is that there are no humans who can see. You could ask, “Can we study this in some domain?” and the analogy would be that there’s no humans who can see. Anyway, so there’s I think a whole class of problems there, and then there’s a broader distribution over what techniques you would use for attacking those problems. I am very interested in techniques where AI systems are helping humans do the evaluation. So kind of imagine this gradual inductive process where as your AI gets better, they help the humans answer harder and harder questions, which provides training data to allow the AIs to get ever better. I’m pretty interested in those kinds of approaches, which yeah, there are a bunch of different versions, or a bunch of different things along those lines.

Paul Christiano: That was the second category. So interpretability, we have using AIs to help train AIs.

Daniel Filan: Yep. There was also, what you were working on.

Paul Christiano: The last category I’d give is just, I think even again in this sort of more imitation learning regime or in the regime where humans can tell what is good: doing things effectively, learning from small amounts of data, learning policies that are higher quality. That also seems valuable. I am more optimistic about that problem getting easier as AI systems improve, which is the main reason I’m less scared of our failure to solve that problem, than failure to solve the other two problems. And then maybe the fourth category is just, I do think there’s a lot of room for sitting around and thinking about things.
I mean, I’ll describe what I’m working on, which is a particular flavor of sitting around and thinking about things.

Daniel Filan: Sure.

Paul Christiano: But there’s lots of flavors of sitting around and thinking about, “how would we address alignment” that I’m pretty interested in.

Daniel Filan: All right.

Paul Christiano: Onto the stuff that I’m thinking about?

Daniel Filan: Let’s go.

Decoupling planning from knowledge

Paul Christiano: To summarize my current high level hope/plan/whatever, we’re concerned about the case where SGD, or Stochastic Gradient Descent, finds some AI system that embodies useful knowledge about the world, or about how to think, or useful heuristics for thinking. And also uses it in order to achieve some end: it has beliefs, and then it selects the action that it expects will lead to a certain kind of consequence. At a really high level, we’d like to, instead of learning a package which potentially couples that knowledge about the world with some intention that we don’t like, we’d like to just throw out the intention and learn the interesting knowledge about the world. And then we can, if we desire, point that in the direction of actually helping humans get what they want.

Paul Christiano: At a high level, the thing I’m spending my time on is going through examples of the kinds of things that I think gradient descent might learn, for which it’s very hard to do that decoupling. And then for each of them, saying, “Okay, what is our best hope?” or, “How could we modify gradient descent so that it could learn the decoupled version of this thing?” And they’ll be organized around examples of cases where that seems challenging, and what the problems seem to be there.
Right now, the particular instance that I’m thinking about most and have been for the last three to six months, is the case where you learn either facts about the world or a model of the world, which are defined, not in terms of human abstractions, but some different set of abstractions. As a very simple example that’s fairly unrealistic, you might imagine humans thinking about the world in terms of people and cats and dogs. And you might imagine a model which instead thinks about the world in terms of atoms bouncing around. Paul Christiano: So the concerning case is when we have this mismatch between the way your beliefs or your simulation or whatever of the world operates, and the way that human preferences are defined, such that it is then easy to take this model and use it to, say, plan for goals that are defined in terms of concepts that are natural to it, but much harder to use it to plan in terms of concepts that are natural to humans. Paul Christiano: So I can have my model of atoms bouncing around and I can say, “Great, search over actions and find the action that results in the fewest atoms in this room.” And it’s like, great. And then it can just enumerate a bunch of actions and find the one that results in the minimal atoms. And if I’m like, “Search for one where the humans are happy.” It’s like, “I’m sorry. I don’t know what you mean about humans or happiness.” And this is kind of a subtle case to talk about, because actually that system can totally carry on a conversation about humans or happiness. That is, at the end of the day, there are these observations, we can train our systems to make predictions of what are the actual bits that are going to be output by this camera. Daniel Filan: Yep. Paul Christiano: And so it can predict human faces walking around and humans saying words. 
It can predict humans talking about all the concepts they care about, and it can predict pictures of cats, and it can predict a human saying, “Yeah, that’s a cat.” And the concern is more that, basically you have your system which thinks natively in terms of atoms bouncing around or some other abstractions. And when you ask it to talk about cats or people, instead of getting it talking about actual cats or people, you get talking about when a human would say there is a cat or a person. And then if you optimize for “I would like a situation where all the humans are happy.” What you instead get is a situation where there are happy humans on camera. And so you end up back in the same kind of concern that you could have had, of your AI system optimizing to mess with your ability to perceive the world, rather than actually making the world good. Daniel Filan: So, when you say that you would like this kind of decoupling, the case you just described is one where it’s hard to do the decoupling. What’s a good example of, “Here we decoupled the motivation from the beliefs. And now I can insert my favorite motivation and press go.” What does that look like? Paul Christiano: So I think a central example for me, or an example I like, would be a system which has some beliefs about the world, represented in a language you’re familiar with. They don’t even have to be represented that way natively. Consider an AI system, which learns a bunch of facts about the world. It learns some procedure for deriving new facts from old facts, and learns how to convert whatever it observes into facts. It learns some, maybe opaque model that just converts what it observes into facts about the world. It then combines them with some of the facts that are baked into it by gradient descent. And then it turns the crank on these inference rules to derive a bunch of new facts. 
And then at the end, having derived a bunch of facts, it just tries to find an action such that it’s a fact that that action leads to the reward button being pushed.

Paul Christiano: So that’s one way you could imagine it working. And it’s a very unrealistic way for an AI to work, just as basically every example we can describe in a small number of words is a very unrealistic way for a deep neural network to work. Once I have that model, I could hope to, instead of having a system which turns the crank, derives a bunch of facts, then looks up a particular kind of fact, and finally uses it to take an action, have a system which starts from the statements, turns the crank, and then just answers questions, or basically directly translates the statements in its internal language into natural language. If I had that, then instead of searching over “the action leads to the reward button being pressed”, I can search over a bunch of actions, and for each of them, look at the beliefs it outputs, in order to assess how good the world is, and then search for one where the world is good according to humans.

Paul Christiano: And so the key dynamic is, how do I expose all this “turning the crank on facts”? How do I expose the facts that it produces to humans in a form that’s usable for humans? And this brings us back to amplification or debate, these two techniques that I’ve worked on in the past, in this genre of AI helping humans evaluate AI behavior.

Daniel Filan: Yep.

Paul Christiano: Right. A way we could hope to train an AI to do that, we could hope to have almost exactly the same process of SGD that produced the original reward button maximizing system. We’d hope to, instead of training it to maximize the reward button, train it to give answers that humans like, or answers that humans consider accurate and useful.
And the way humans are going to supervise it is basically, following along stepwise with the deductions it’s performing as it turns this crank of deriving new facts from old facts.

Paul Christiano: So it had some facts at the beginning. Maybe a human can directly supervise those. We can talk about the case where the human doesn’t know them, which I think is handled in a broadly similar way. And then, as it performs more and more steps of deduction, it’s able to output more and more facts. But if a human is able to see the facts that it had after n minus one steps, then it’s much easier for a human to evaluate some proposed fact at the nth step. So you could hope to have this kind of evaluation scheme where the human is incentivizing the system to report knowledge about the world, and then, however the system was able to originally derive the knowledge in order to take some action in the world, the system can also derive that knowledge in the service of making statements that a human regards as useful and accurate. So that’s a typical example.

Daniel Filan: All right. And the idea is that, for whatever task we might have wanted an AI system to achieve, we just train a system like this, and then we’re like, “How do I do the right thing?” And then it just tells us, and ideally it doesn’t require really fast motors or appendages that humans don’t have, or we know how to build them or something. It just gives us some instructions, and then we do it. And that’s how we get whatever thing we wanted out of the AI.

Paul Christiano: Yeah. We’d want to take some care to make everything really competitive. So we probably want to use this to get a reward function that we use to train our AI, rather than trying to use it to output instructions that a human executes. And we want to be careful about… there’s a lot of details there in not ending up with something that’s a lot slower than the unaligned AI would have been.

Daniel Filan: Okay.
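The stepwise supervision Paul describes above can be illustrated with a toy sketch. Everything here is an assumption made for illustration and does not come from the conversation: facts are plain strings, the "inference rules" are explicit premise/conclusion pairs, and `human_accepts` is a hypothetical stand-in for a human who sees the facts known after step n-1 and judges one proposed fact at step n.

```python
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple

# Hypothetical rule format: if every premise is known, the conclusion may be derived.
Rule = Tuple[FrozenSet[str], str]


def derive_step(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Return the facts that are derivable in one more step and not yet known."""
    return {
        conclusion
        for premises, conclusion in rules
        if premises <= facts and conclusion not in facts
    }


def supervised_derivation(
    initial_facts: Iterable[str],
    rules: List[Rule],
    human_accepts: Callable[[FrozenSet[str], str], bool],
    max_steps: int = 10,
) -> Set[str]:
    """Turn the crank one step at a time, keeping only facts the overseer accepts."""
    facts = set(initial_facts)
    for _ in range(max_steps):
        proposed = derive_step(facts, rules)
        # The overseer sees everything known so far plus a single proposed fact.
        accepted = {f for f in proposed if human_accepts(frozenset(facts), f)}
        if not accepted:
            break  # nothing new survived oversight, so stop
        facts |= accepted
    return facts


# Toy run: the overseer rejects one unsupported claim.
demo_rules: List[Rule] = [
    (frozenset({"rain"}), "wet_ground"),
    (frozenset({"wet_ground"}), "slippery"),
    (frozenset({"slippery"}), "bogus_claim"),
]
demo_facts = supervised_derivation(
    {"rain"}, demo_rules, lambda known, fact: fact != "bogus_claim"
)
```

The point of the sketch is only that each judgment is local: the overseer never has to evaluate the whole chain of deductions at once, only one new fact given the previous step's facts.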
Paul Christiano: I think this is the kind of case where I’m sort of optimistic about being able to say like, “Look, we can decouple the rules of inference that it uses to derive new statements and the statements that it started out believing, we can decouple that stuff from the decision at the very end to take the particular statement it derived and use that as the basis for action.” Daniel Filan: So going back a few steps. You were talking about cases where you could and couldn’t do the decoupling, and you’re worried about some cases where you couldn’t do the decoupling, and I was wondering how that connects to your research? You’re just thinking about those, or do you have ideas for algorithms to deal with them? Paul Christiano: Yeah, so I mentioned the central case we’re thinking about is this mismatch between a way that your AI most naturally is said to be thinking about what’s happening - the way the AI is thinking about what’s happening - and the way a human would think about what’s happening. I think that kind of seems to me right now, a very central difficulty. I think maybe if I just describe it, it sounds like well, sometimes you get really lucky and your AI can be thinking about things; it’s just in a different language, and that’s the only difficulty. I currently think that’s a pretty central case, or handling that case is quite important. The algorithm we’re thinking about most, or the family of algorithms we’re thinking about most for handling that case is basically defining an objective over some correspondence, or some translation, between how your AI thinks about things and how the human thinks about things. Paul Christiano: The conventional way to define that, maybe, would be to have a bunch of human labeling. Like there was a cat, there was a dog, whatever. The concern with that is that you get this… instead of deciding if there was actually a cat, it’s translating, does a human think there’s a cat? 
So the main idea is to use objectives that are not just a function of what it outputs, they’re not the supervised objective of how well its outputs match human outputs. You have other properties. You can have regularization, like how fast is that correspondence? Or how simple is that correspondence? I think that’s still not good enough. You could have consistency checks, like saying, “Well, it said A and it said B, and we’re not able to label either A or B, but we understand that the combination of A and B is inconsistent.” This is still not good enough.

Paul Christiano: And so then most of the time has gone into ideas that are, basically, taking those consistency conditions. So saying “We expect that when there’s a bark, it’s most likely there was a dog. We think that the model’s outputs should also have that property.” Then trying to look at what is the actual fact about the model that led to that consistency condition being satisfied? This gets us a little bit back into mechanistic transparency hopes, interpretability hopes. Where the objective actually depends on why that consistency condition was satisfied. So you’re not just saying, “Great, you said that there’s more likely to be a dog barking when there was a dog in the room.” We’re saying, “It is better if that relationship holds because of a single weight in your neural network.” That’s this very extreme case. That’s an extremely simple explanation for why that correlation occurred. And we could have a more general objective that cares about the nature of the explanation. That cares about why that correlation existed.

Daniel Filan: Where the idea is that we want these consistency checks. We want them to be passed, not because we were just lucky with what situations we looked at, but actually, somehow the structure is that the model is reliably going to produce things that are right. And we can tell, because we can figure out what things the consistency checks passing are due to.
Is that right?

Paul Christiano: That’s the kind of thing. Yeah. And I think it ends up being, or it has been a long journey. Hopefully there’s a long journey that will go somewhere good. Right now that is up in the air. But some of the early candidates would be things like “This explanation could be very simple.” So instead of asking for the correspondence itself to be simple, ask that the reasons these consistency checks are satisfied be very simple. It’s more like one weight in a neural net rather than some really complicated correlation that came from the input. You could also ask for that correlation to depend on as few facts as possible about the input, or about the neural network.

Daniel Filan: Okay.

Paul Christiano: I think none of these quite work, and getting to where we’re actually at would be kind of a mess. But that’s the research program. It’s mostly sitting around, thinking about objectives of this form, having an inventory of cases that seem like really challenging cases for finding this correspondence. And trying to understand. Adding new objectives into the library and then trying to refine: here are all these candidates, here are all these hard cases. How do we turn this into something that actually works in all the hard cases? It’s very much sitting by a whiteboard. It is a big change from my old life. Until one year ago I basically just wrote code, or I spent years mostly writing code. And now I just stare at whiteboards.

Factored cognition

Daniel Filan: All right. So, changing gears a little bit, I think you’re perhaps most well known for a factored cognition approach to AI alignment, that somehow involves decomposing a particular task into a bunch of subtasks, and then training systems to basically do the decomposition. I was wondering if you could talk a little bit about how that fits into your view of which problems exist, and what your current thoughts are on this broad strategy?

Paul Christiano: Yeah.
So, the Factored Cognition Hypothesis was what Ought, a nonprofit I worked with, was calling this hope that arbitrarily complex tasks can be broken down into simpler pieces, and so on, ad infinitum, potentially at a very large slowdown. And this is relevant to a bunch of possible approaches to AI alignment. Because if you imagine that humans and AI systems are trying to train AIs to do a sequence of increasingly complex tasks, but you’re only comfortable doing this training when the human and their AI assistants are at least as smart as the AI they’re about to train, then if you just play training backwards, you basically have this decomposition of the most challenging task your AI was ever able to do, into simpler and simpler pieces. And so I’m mostly interested in tasks which cannot be done by any number of humans, tasks that however long they’re willing to spend during training, seem very hard to do by any of these approaches.

Paul Christiano: So this is true for AI safety via debate, where the hope is you have several AIs arguing about what the right answer is. It’s true for iterated distillation and amplification, where you have a human with these assistants training a sequence of increasingly strong AIs. And it’s true for recursive reward modeling, which is, I guess, an agenda that came from a paper out of DeepMind, it’s by Jan Leike, who took over for me at OpenAI, where you’re trying to define a sequence of reward functions for more and more complex tasks, using assistants trained on the preceding reward functions.

Paul Christiano: Anyway, it seems like all of these approaches run into this common… there’s something that I think of as an upper bound. I think other people might dispute this, but I would think of as a crude upper bound, based on everything you ever trained an AI to do in any of these ways can be broken down into smaller pieces, until it’s ultimately broken down into pieces that a human can do on their own.
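As a rough illustration of the decomposition idea above, here is a minimal sketch of a task being broken down recursively until each piece is small enough for a single unaided worker. The function names and the toy task (adding up a long list when each worker can only handle two numbers) are assumptions chosen for illustration; they are not taken from Ought's work or from any of the approaches mentioned.

```python
from typing import Callable, List, TypeVar

Task = TypeVar("Task")
Answer = TypeVar("Answer")


def solve_by_decomposition(
    task: Task,
    is_easy: Callable[[Task], bool],
    solve_easy: Callable[[Task], Answer],
    decompose: Callable[[Task], List[Task]],
    combine: Callable[[List[Answer]], Answer],
) -> Answer:
    """Recursively split a task until every piece can be done unaided."""
    if is_easy(task):
        return solve_easy(task)
    sub_answers = [
        solve_by_decomposition(sub, is_easy, solve_easy, decompose, combine)
        for sub in decompose(task)
    ]
    return combine(sub_answers)


def split_in_half(numbers: List[int]) -> List[List[int]]:
    """Hypothetical decomposition step: hand each half to a separate worker."""
    mid = len(numbers) // 2
    return [numbers[:mid], numbers[mid:]]


# Toy task: each "human" can only add up a list of at most two numbers.
total = solve_by_decomposition(
    list(range(1, 101)),
    is_easy=lambda numbers: len(numbers) <= 2,
    solve_easy=sum,
    decompose=split_in_half,
    combine=sum,
)
```

The sketch only covers tasks that really do decompose this way; it says nothing about knowledge that has to be learned from data rather than derived.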
Paul Christiano: And sometimes that can be nonobvious. I think it’s worth pointing out that search can be trivially broken down into simpler pieces. Like if a human can recognize a good answer, then a large enough number of humans can do it, just because you can have a ton of humans doing a bunch of things until you find a good answer. I think my current take would be, I think it has always been the case that you can learn stuff about the world, which you could not have derived by breaking down the question. Like “What is the height of the Eiffel Tower?” doesn’t just break down into simpler and simpler questions. The only way you’re going to learn that is by going out and looking at the height of the Eiffel Tower, or maybe doing some crazy simulation of Earth from the dawn of time. ML in particular is going to learn a bunch of those things, or gradient descent is going to bake a bunch of facts like that into your neural network. Paul Christiano: So if this task, if doing what the ML does is decomposable, it would have to be through humans looking at all of that training data somehow, looking at all of the training data which the ML system ever saw while it was trained, and drawing their own conclusions from that. I think that is, in some sense, very realistic. A lot of humans can really do a lot of things. But for all of these approaches I listed, when you’re doing these task decompositions, it’s not only the case that you decompose the final task the AI does into simpler pieces. You decompose it into simpler pieces, all of which the AI is also able to perform. And so learning, I think, doesn’t have that feature. That is, I think you can decompose learning in some sense into smaller pieces, but they’re not pieces that the final learned AI was able to perform. Paul Christiano: The learned AI is an AI which knows facts about the Eiffel Tower. It doesn’t know facts about how to go look at Wikipedia articles and learn something about the Eiffel Tower, necessarily. 
So I guess now I think these approaches that rely on factored cognition, I now most often think of having both the humans decomposing tasks into smaller pieces, but also having a separate search that runs in parallel with gradient descent.

Paul Christiano: I wrote a post on imitative generalization, and then Beth Barnes wrote an explainer on it, a while ago. The idea here is, imagine, instead of decomposing tasks into tiny sub-pieces that a human can do, we’re going to learn a big reference manual to hand to a human, or something like that. And we’re going to use gradient descent to find the reference manual, such that for any given reference manual, you can imagine handing it to humans and saying, “Hey, human, trust the outputs from this manual. Just believe it was written by someone benevolent wanting you to just succeed at the task. Now, using that, do whatever you want in the world.”

Paul Christiano: And now there’s a bigger set of tasks the human can do, after you’ve handed them this reference manual. Like it might say the height of the Eiffel Tower is whatever. And the idea in imitative generalization is just, instead of searching over a neural network - this is very related to the spirit of the decoupling I was talking about before - we’re going to search over a reference manual that we want to give to a human. And then instead of decomposing our final task into pieces that the human can do unaided, we’re going to decompose our final task into pieces that a human can do using this reference manual.

Paul Christiano: So you might imagine then that stochastic gradient descent bakes in a bunch of facts about the world into this reference manual. These are things the neural network sort of just knows. And then we give those to a human and we say, “Go do what you will, taking all of these facts as given.” And now the human can do some bigger set of tasks, or answer a bunch of questions they otherwise wouldn’t have been able to answer.
And then we can get an objective for this reference manual. So if we’re producing the reference manual by stochastic gradient descent, we need some objective to actually optimize. Paul Christiano: And the proposal for the objective is, give that reference manual to some humans, ask them to do the task, or ask the large team of humans to eventually break down the task of predicting the next word of a webpage or whatever it is that your neural network was going to be trained to do. Look at how well the humans do at that predict-the-next-word task. And then instead of optimizing your neural network by stochastic gradient descent in order to make good predictions, optimize whatever reference manual you’re giving a human by gradient descent in order to cause it to make humans make good predictions. Paul Christiano: I guess that doesn’t change the factored cognition hypothesis as stated, because the search is also just something which can be very easily split across humans. You’re just saying, “loop over all of the reference manuals, and for each one, run the entire process”. But I think in flavor it’s like pretty different in that you don’t have your trained AI doing any one of those subtasks. Some of those subtasks are now being parallelized across the steps of gradient descent or whatever, or across the different models being considered in gradient descent. And that is most often the kind of thing I’m thinking about now. Paul Christiano: And that suggests this other question of, okay, now we need to make sure that, if your reference manual’s just text, how big is that manual going to be compared to the size of your neural network? And can you search over it as easily as you can search over your neural network? I think the answer in general is, you’re completely screwed if that manual is in text. So we mentioned earlier that it’s not obvious that humans can’t just do all the tasks we want to apply AI to. 
You could imagine a world where we’re just applying AI to tasks where humans are able to evaluate the outputs. And in some sense, everything we’re talking about is just extending that range of tasks to which we can apply AI systems. And so breaking tasks down into subtasks that AI can perform is one way of extending the range of tasks.

Paul Christiano: Now we are basically looking, not at tasks that a single human can perform, but that some large team of humans can perform. And then adding this reference manual does further extend the set of tasks that a human can perform. I think if you’re clever, it extends it to the set of tasks where what the neural net learned can be cashed out as this kind of declarative knowledge that’s in your reference manual. But maybe not that surprisingly, that does not extend it all the way. Text is limited compared to the kinds of knowledge you can represent in a neural network. That’s the kind of thing I’m thinking about now.

Daniel Filan: Okay. And what’s a limitation of text versus what you could potentially represent?

Paul Christiano: So if you imagine you have your billion-parameter neural network, I mean, a simple example is just, if you imagine that neural network doing some simulation, representing the simulation it wants to do, it’s like, “Oh yeah, if there’s an atom here, there should be an atom there in the next time step.” That simulation is described by these billion numbers, and searching over a reference manual big enough to contain a billion numbers is a lot harder than searching over a neural network, like a billion weights of a neural network. And more brutally, a human who has that simulation, in some sense doesn’t really know enough to actually do stuff with it. They can tell you where the atoms are, but they can’t tell you where the humans are. That’s one example.
Paul Christiano: Another is: suppose there’s some complicated set of correlations, or you might think that things that are more like skills will tend to have this feature more. Like, if I’m an image classification model, I know that that particular kind of curve is really often associated with something being part of a book. I can describe that in words, but it gets blown up a lot in the translation process towards words, and it becomes harder to search over.

Possible solutions to inner alignment

Daniel Filan: So the things we’ve talked about have mostly been your thoughts about objectives to give AI systems. And so more in this outer alignment style stage. I’m wondering for inner alignment style problems, where the AI system has some objective and you want to make sure that it’s really devoted to pursuing that objective, even if the situation changes, or even in the worst case, I’m wondering if you have thoughts on solutions you’re particularly keen on in those settings.

Paul Christiano: Yeah. So I think I have two categories of response. One is technical research we can do that helps with this kind of inner alignment/catastrophic failure/out of distribution, that cluster of problems across the board, or in many possible worlds. And another is, assuming my research project was successful, how would this be handled on that? I’ll start with what people are doing that seems helpful.

Paul Christiano: So I think the most basic thing I’m excited about is just generating hard cases, and throwing hard cases at your AI.
So if you imagine you have some AI system, you’re like “I’m concerned that on some very rare inputs, the AI system may decide that it’s time to kill all humans.” I’m like, “Well, what I’m going to do is, I’m going to try and generate the kinds of inputs that might convince my AI to kill all humans, and hopefully if I try a really long time and I can’t find any, then it’s less likely we’re going to encounter any out there in the real world.” Daniel Filan: Yep. And the theory is that it’s in some sort of safe box, so that if you succeed at your search, the AI does not in fact kill all the humans. Paul Christiano: Yeah. So there’s this distinction where, when training an AI, you may often need to see the consequences of its actions, whereas for this kind of training, I wouldn’t want to be running the things my AI proposes and checking the consequences. Hopefully. So there’s more subtleties there, definitely. But yeah, we’re not actually doing the things the AI suggests. We just observe that it takes an action. Daniel Filan: So it “takes an action” but it doesn’t actually take the action. Paul Christiano: Yeah. It outputs an action, and then there’s no actuators to which the action can get sent. We’re imagining some AI which we might ultimately hook up to the internet. But right now we look at the proposed internet actions and we’re like, “Do those seem good?” And the hope is we’ve solved what I’m filing more under the outer alignment side of being able to assess, for a given proposed action, would that action be catastrophically bad? We can make predictions about that as well as the AI can. So in particular, if the AI has calculated that this action will lead to like a robot uprising killing all humans, then we can look at the action and be like, “Actually, this is not a good action,” and then ding our AI. 
So if you have that ability, then you can try in your little sandbox to throw tons of situations on your AI, and see if it ever proposes any actions that look like they would be really bad. Paul Christiano: I think that feels like the most basic ingredient. That does not seem sufficient. I guess the basic reason it doesn’t seem sufficient is that even if bad cases exist, it can be really hard to find them. So you kind of want to, from there, make the job… so, you imagine this adversary is trying to break your model, and trying to get it to do something bad. You want to make their job as easy as you can possibly make it. So an example would be like, instead of having them actually like provide inputs to your system, maybe you let them go in and monkey around with some of the neurons in the system and be like, “Okay, I can’t actually find an input that causes this neuron to fire, but if I could, then the AI would do this bad thing.” Paul Christiano: And that becomes quite subtle, because as I’m monkeying around with the neurons, eventually it becomes unclear. When it had an input, I could say, okay, my AI sees the following webpage, and then it responds by sending the following packet. And I can tell that that would be a really bad thing for it to do. Like it replied to the help wanted ad for someone running their robot rebellion. Paul Christiano: It’s harder if I’m monkeying with its neurons, and I see that it outputs a given packet, I don’t necessarily know what state of the world corresponded to those neurons being in that configuration. It’s like, for almost any action, there’s a state of the world in which it’s good to do that action. And in particular, bad actions may not look that bad. They’re just bad because of details of the context. 
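The most basic version of this, before any tampering with neurons, is a search over inputs inside a sandbox where proposed actions are inspected but never executed. The following is a minimal sketch under stated assumptions: `policy`, `is_catastrophic`, and `sample_input` are hypothetical callables standing in for the AI, the overseer's judgment of a proposed action, and the adversary's input generator. None of them refers to a real system or library.

```python
import random
from typing import Callable, List, Tuple, TypeVar

Input = TypeVar("Input")
Action = TypeVar("Action")


def find_failures(
    policy: Callable[[Input], Action],
    is_catastrophic: Callable[[Input, Action], bool],
    sample_input: Callable[[random.Random], Input],
    n_trials: int,
    seed: int = 0,
) -> List[Tuple[Input, Action]]:
    """Collect (input, proposed action) pairs that the overseer flags as bad.

    The action is only proposed and inspected. Nothing here executes it,
    which is the sandbox assumption.
    """
    rng = random.Random(seed)
    failures = []
    for _ in range(n_trials):
        x = sample_input(rng)
        action = policy(x)
        if is_catastrophic(x, action):
            failures.append((x, action))
    return failures


# Toy policy with a single rare trigger input.
def toy_policy(x: int) -> str:
    return "send_bad_packet" if x == 3 else "do_nothing"


def toy_overseer(x: int, action: str) -> bool:
    return action == "send_bad_packet"
```

Any failures found this way could then be used as training signal. As the discussion notes, a search over inputs can easily miss a rare trigger, which is the motivation for giving the adversary more power than this sketch does.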
Paul Christiano: And so if I got the AI to take some action by monkeying with its neurons, it’s a little bit unclear how I say “Was that action actually bad or not?” You want to give your adversary that kind of ability, in order to make it easier for them to show, well, here’s a case where your AI could do something bad. I can’t actually find the input, but you should be scared, because it seems like this could happen. And you also want to give them interpretability tools. I want them to be able to understand what the AI is thinking and to say “Oh, it looks like the AI is checking if any humans are watching, and doing something nice if they are. So let’s just really optimize on that to find a case where it thinks there’s no humans watching, and then see if it’s still nice.” Paul Christiano: That’s sort of a cluster of techniques, all of which are designed to like help you really hammer at, are there cases where this AI might do something bad? Can we find them? If we can find them, then we can train on them. Before we actually take the action, we can provide a gradient saying “Don’t do something that’s really bad in that hypothetical case.” And those kind of blur pretty continuously into verification stuff in my mind. That is, verification is, in some sense, some kind of limit of being willing to monkey with the neurons and then having some formal specification for how much the adversary is allowed to monkey with the neurons. I think all of those are research directions that people pursue for a variety of motivations out there in the world. And I’m pretty excited about a lot of that work. Daniel Filan: And on your favorite approaches, how does this pan out? Paul Christiano: So I mentioned before this hoped-for decoupling, where I’d say we’re concerned about the case where gradient descent finds a neural network, which is trying to figure out how to mess with the humans. And then when an opportunity comes along, it’s going to mess with the humans. 
And in some sense, the nicest thing to do is to say, “Okay, the reason we wanted that AI was just because it encodes some knowledge about how to do useful stuff in the world.” And so what we’d like to do is to say, “Okay, we are going to set things up so that it’s easier for gradient descent to learn just the knowledge about how to behave well in the world, rather than to learn that knowledge embedded within an agent that’s trying to screw over humans.” And that is hard, or it seems quite hard. But I guess the biggest challenge in my mind in this decoupling of outer and inner alignment is that this seems almost necessary either for a full solution to outer alignment or a full solution to inner alignment.

Paul Christiano: So I expect to be more in the trying to kill two birds with one stone regime. And these are the kinds of examples of decoupling we described before. You hope that you only have to use gradient descent to find this reference manual, and then from there you can much more easily pin down what all the other behaviors should be. And then you hope that reference manual is smaller than the scheming AI, which has all of the knowledge in that reference manual baked into its brain. It’s very unclear if that can be done. I think it’s also fairly likely that in the end, maybe we just don’t know how that looks, and it’s fairly likely in the end that it has to be coupled with some more normal measures like verification or adversarial training.

About Paul

Paul’s research style

Daniel Filan: All right. So I’d like to now talk a little bit about your research style. So you mentioned that as of recently, the way you do research is you sit in a room and you think about some stuff. Is there any chance you can give us more detail on that?

Paul Christiano: So I think the basic organizing framework is something like, we have some current set of algorithms and techniques that we use for alignment.
Step one is try and dream up some situation in which your AI would try and kill everyone, despite your best efforts using all the existing techniques. So like a situation describing, “We’re worried that here’s the kind of thing gradient descent might most easily learn. And here’s the way the world is, such that the thing gradient descent learned tries to kill everyone. And here’s why you couldn’t have gotten away with learning something else instead.” We tell some story that culminates in doom, which is hard to avoid using existing techniques. That’s step one. Paul Christiano: Step two is… maybe there’s some step 1.5, which is trying to strip that story down to the simplest moving parts that feel like the simplest sufficient conditions for doom. Then step two is trying to design some algorithm, just thinking about only that case. I mean, in that case, what do we want to happen? What would we like gradient descent to learn instead? Or how would we like to use the learned model instead, or whatever. What is our algorithm that addresses that case? The last three months have just been working on a very particular case where I currently think existing techniques would lead to doom, along the kinds of lines we’ve been talking about, like grabbing the camera or whatever, and trying to come up with some algorithm that works well in that case. Paul Christiano: And then, if you succeed, then you get to move on to step three, where you look again over all of your cases, you look over all your algorithms, you probably try and say something about, can we unify? We know what we want to happen in all of these particular cases. Can we design one algorithm that does that right thing in all the cases? For me that step is mostly a formality at this stage, or it’s not very important at this stage. Mostly we just go back to step one. Once you have your new algorithm, then you go back to, okay, what’s the new case that we don’t handle? 
Paul Christiano: Normally, I’m just pretty lax about the plausibility of the doom stories that I’m thinking about at this stage. That is, I have some optimism that in the end we’ll have an algorithm that results in your AI just never deliberately trying to kill you, and it actually, hopefully, will end up being very hard to tell a story about how your AI ends up trying to kill you. And so while I have this hope, I’m kind of just willing to say, “Oh, here’s a wild case.” A very unrealistic thing that gradient descent might learn, but that’s still enough of a challenge that I want to change or design an algorithm that addresses that case. Because I hope working with really simple cases like that helps guide us towards, if there’s any nice, simple algorithm that never tries to kill you, thinking about the simplest cases you can is just a nice, easy way to make progress towards that. Yeah. So I guess most of the action then is in, what do we actually do in steps one and two? At a high level, that’s what I’m doing all the time. Daniel Filan: And is there anything like you can broadly say about what happens in steps one or two? Or do you think that depends a lot on the day or the most recent problem? Paul Christiano: Yeah, I guess in step one, the main question people have is, what is the story like, or what is the type signature of that object, or what is it written out in words? And I think most often I’m writing down some simple pseudo code and I’m like, “Here is the code you could imagine your neural network executing.” And then I’m telling some simple story about the world where I’m like, “Oh, actually you live in a world which is governed by the following laws of physics, and the following actors or whatever.” And in that world, this program is actually pretty good. 
And then I’m like, “Here is some assumption about how SGD works that’s consistent with everything we know right now.” Very often, we think SGD could find any program that’s the simplest program that achieves a given loss, or something. Paul Christiano: So the story has the sketch of some code, and often that code will have some question marks and like looks like you could fill those in to make the story work. Some description of the environment, some description of facts about gradient descent. And then we’re bouncing back and forth between that, and working on the algorithm. Working on the algorithm, I guess, is more like… at the end of the day, most of the algorithms take the form of: “Here’s an objective. Try minimizing this with gradient descent.” So basically the algorithm is, here’s an objective. And then you look at your story and you’re like, “Okay, on this story, is it plausible that minimizing this objective leads to this thing?” Or often part of the algorithm is “And here’s the good thing we hope that you would learn instead of that bad thing.” Paul Christiano: In your original story you have your AI that loops over actions until it finds one that it predicts leads to smiling human faces on camera. And that’s bad because in this world we’ve created, the easiest way to get smiling human faces on camera involves killing everyone and putting smiles in front of the camera. And then we’re like, “Well, what we want to happen instead is like this other algorithm I mentioned where, it outputs everything it knows about the world. And we hope that includes the fact that the humans are dead.” So then a proposal will involve some way of operationalizing what that means, like what it means for it to output what it knows about the world for this particular bad algorithm that’s doing a simulation or whatever, that we imagined. 
And then what objective you would optimize with gradient descent that would give you this good program that you wanted, instead of the bad one you didn’t want. Disagreements and uncertainties Daniel Filan: The next question I’d like to ask is, what do you see as the most important big picture disagreements you have with people who already believe that advanced AI technology might pose some kind of existential risk, and we should really worry about that and try to work to prevent that? Paul Christiano: Broadly, I think there are two categories of disagreements, or I’m flanked on two different sides. One is by the more Machine Intelligence Research Institute crowd, which has a very pessimistic view about the feasibility of alignment and what it’s going to take to build AI systems that aren’t trying to kill you. And then on the other hand, by researchers who tend to be at ML labs, who tend to be more in the camp of like, it would be really surprising if AI trained with this technique actually was trying to kill you. And there’s nuances to both of those disagreements. Paul Christiano: Maybe you could split the second one into one category that’s more like, actually this problem isn’t that hard, and we need to be good at the basics in order to survive. Like the gravest risk is that we mess up the basics. And a second camp being like, actually we have no idea what’s going to be hard about this problem. And what it’s mostly about is getting set up to collect really good data as soon as possible, so that we can adapt to what’s actually happening. Paul Christiano: It’s also worth saying that it’s unclear often which of these are empirical disagreements versus methodological differences, where I have my thing I’m doing, and I think that there’s room for lots of people doing different things. 
So there are some empirical disagreements, but not all the differences in what we do are explained by those differences, versus some of them being like, Paul is a theorist, who’s going to do some theory, and he’s going to have some methodology such that he works on theory. I am excited about theory, but it’s not always the case that when I’m doing something theoretical it’s because I think the theoretical thing is dominant. Paul Christiano: And going in those disagreements with the MIRI folk, that’s maybe more weeds-y. It doesn’t have a super short description. We can return to it in a bit if we want. On the people who are on the more optimistic side: I think for people who think existing techniques are more likely to be okay, I think the most common disagreement is about how crazy the tasks our AIs will be doing are, or how alien will the reasoning of AI systems be. People who are more optimistic tend to be like, “AI systems will be operating at high speed and doing things that are maybe hard for humans or a little bit beyond the range of human abilities, but broadly, humans will be able to understand the consequences of the actions they propose fairly well.” They’ll be able to fairly safely look at an action, and be like, can we run this action? They’ll be able to mostly leverage those AI systems effectively, even if the AI systems are just trying to do things that look good to humans. Paul Christiano: So often it’s a disagreement about, I’m imagining AI systems that reason in super alien ways, and someone else is like, probably it will mostly be thinking through consequences, or thinking in ways that are legible to humans. And thinking fast in ways that are legible to humans gets you a lot of stuff. I am very long on the thinking fast in ways legible to humans is very powerful. 
I definitely believe that a lot more than most people, but I do think I often, especially because now I’m working on the more theoretical end, I’m often thinking about all the cases where that doesn’t work, and some people are more optimistic that the cases where that works are enough, which is either an empirical claim about how AI will be, or sometimes a social claim about how important it is to be competitive. Paul Christiano: I really want to be able to build aligned AI systems that are economically competitive with unaligned AI, and I’m really scared of a world where there’s a significant tension there. Whereas other people are more like, “It’s okay. It’s okay if aligned AI systems are a little bit slower or a little bit dumber, people are not going to want to destroy the world, and so they’ll be willing to hold off a little bit on deploying some of these things.” Paul Christiano: And then on the empirical side, people who think that theoretical work is less valuable, and we should be mostly focused on the empirics or just doing other stuff. I would guess one common disagreement is just that I’m reasonably optimistic about being able to find something compelling on paper. So I think this methodology I described of “Try and find an algorithm for which it’s hard to tell a story about how your AI ends up killing everyone”, I actually expect that methodology to terminate with being like, “Yep, here’s an algorithm. It looks pretty good to us. We can’t tell a story about how it’s uncompetitive or lethal.” Whereas I think other people are like, “That is extremely unlikely to be where that goes. That’s just going to be years of you going around in circles until eventually you give up.” That’s actually a common disagreement on both sides. That’s probably also the core disagreement with MIRI folks, in some sense. Daniel Filan: Yeah. 
So you said it was perhaps hard to concisely summarize your differences between the sort of group of people centered, perhaps, at the Machine Intelligence Research Institute (or MIRI for short). Could you try? Paul Christiano: So definitely the upshot is, I am optimistic about being able to find an algorithm which can align deep learning, like, a system which is closely analogous to and competitive with standard deep learning. Whereas they are very pessimistic about the prospects for aligning anything that looks like contemporary deep learning. That’s the upshot. So they’re more in the mindset of like, let’s find any task we can do with anything kind of like deep learning, and then be willing to take great pains and huge expense to do just that one task, and then hopefully find a way to make the world okay after that, or maybe later build systems that are very unlike modern deep learning. Whereas I’m pretty optimistic - where “pretty optimistic” means I think there’s a 50-50 chance or something - that we could have a nice algorithm that actually lets you basically do something like deep learning without it killing everyone. Paul Christiano: That’s the upshot. And then the reason for, I think those are pretty weedsy, I guess intuitively is something like: if you view the central objective as about decoupling and trying to learn what your unaligned agent would have known, I think that there are a bunch of possible reasons that that decoupling could be really hard. Fundamentally, the cognitive abilities and the intentions could come as a package. This is also really core in MIRI’s disagreement with more conventional ML researchers, who are like, why would you build an agent? Why not just build a thing that helps you understand the world? Paul Christiano: I think on the MIRI view, there’s likely to be this really deep coupling between those things. I’m mostly working on other ways that decoupling can be hard, besides this kind of core one MIRI has in mind. 
I think MIRI is really into the idea that there’s some kind of core of being a fast, smart agent in the world. And that that core is really tied up with what you’re using it for. It’s not coherent to really talk about being smart without developing that intelligence in the service of a goal, or to talk about like factoring out the thing which you use. Paul Christiano: There’s some complicated philosophical beliefs about the nature of intelligence, which I think especially Eliezer is fairly confident in. He thinks it’s mostly pretty settled. So I’d say that’s probably the core disagreement. I think there’s a secondary disagreement about how realistic it is to implement complex projects. I think their take is, suppose Paul comes up with a good algorithm. Even in that long shot, there’s no way that’s going to get implemented, rather than just something easier that destroys the world. Projects fail the first time, and this is a case where we have to get things right the first time - well, that’s a point of contention - such that you’re not going to have much of a chance. That’s the secondary disagreement. Daniel Filan: And sort of related to that, I’m wondering, what do you think your most important uncertainties are? Uncertainties such that if you resolved them, that would in a big way change what you were motivated to do, in order to reduce existential risk from AI. Paul Christiano: Yeah. So maybe top four. One would be, is there some nice algorithm on paper that definitely doesn’t result in your AI killing you, and is definitely competitive? Or is this a kind of thing where like that’s a pipe dream and you just need to have an algorithm that works in the real world? Yeah. That would have an obvious impact on what I’m doing. I am reasonably optimistic about learning a lot about that over the coming years. 
I’ve been thinking recently that maybe by the end of 2022, if this isn’t going anywhere, I’ll pretty much know and can wind down the theory stuff, and hopefully significantly before then we’ll have big wins that make me feel more optimistic. So that’s one uncertainty. Just like, is this thing I’m doing going to work? Paul Christiano: A second big uncertainty is, is it the case that existing best practices in alignment would suffice to align powerful AI systems, or would buy us enough time for AI to take over the alignment problem from us? Like, I think eventually the AI will be doing alignment rather than us, and it’s just a question of how late in the game does that happen and how far existing alignment techniques carry us. I think it’s fairly plausible that existing best practices, if implemented well by a sufficiently competent team that cared enough about alignment, would be sufficient to get a good outcome. And I think in that case, it becomes much more likely that instead of working on algorithms, I should be working on actually bringing practice up to the limits of what is known. Maybe I’ll just do three, not four. Paul Christiano: And then three, maybe this is a little bit more silly, but I feel legitimate moral uncertainty over what kinds of AI… maybe the broader thing is just how important is alignment relative to other risks? I think one big consideration for the value of alignment is just, how good is it if the AI systems take over the world from the humans? Where my default inclination is, that doesn’t sound that good. But it sounds a lot better than nothing in expectation, like a barren universe. It would matter a lot. If you convinced me that number was higher, at some point I would start working on other risks associated with the transition to AI. That seems like the least likely of these uncertainties to actually get resolved. 
Paul Christiano: I find it kind of unlikely I’m going to move that much from where I am now, which is like… maybe it’s half as good for AIs to take over the world from humans, than for humans to choose what happens in space. And that’s close enough to zero that I definitely want to work on alignment, and also close enough to one that I also definitely don’t want to go extinct. Daniel Filan: So my penultimate question is, or it might be antepenultimate depending on your answer, is, is there anything that I have not yet asked, but you think that I should have? Paul Christiano: It seems possible that I should have, as I’ve gone, been plugging all kinds of alignment research that’s happening at all sorts of great organizations around the world. I haven’t really done any of that. I’m really bad at that though. So I’m just going to forget someone and then feel tremendous guilt in my heart. Some favorite organizations Daniel Filan: Yeah. How about in order to keep this short and to limit your guilt, what are the top five people or organizations that you’d like to plug? Paul Christiano: Oh man, that’s just going to increase my guilt. Because now I have to choose five. Daniel Filan: Perhaps name five. Any five! Paul Christiano: Any five. I think there’s a lot of ML labs that are doing good work, ML labs who view their goal as getting to powerful transformative AI systems, or doing work on alignment. So that’s like DeepMind, OpenAI, Anthropic. I think all of them are gradually converging to this gradual crystallization in what we all want to do. That’s one. Maybe I’ll do three things. Second can be academics. There’s a bunch of people. I’m friends with Jacob Steinhardt at Berkeley. His students are working on robustness issues with an eye towards long term risks. A ton of researchers at your research organization, which I guess you’ve probably talked about on other episodes. Daniel Filan: I talked to some of them. I don’t think we’ve talked about it as a whole. Yeah. 
It’s the Center for Human-Compatible AI. If people are interested, they can go to humancompatible.ai to see a list of people associated with us. And then you can, for each person, I guess you can look at all the work they did. We might have a newsletter or something [as far as I can tell, we do not]. I did not prepare for this. Paul Christiano: Sorry for putting you on the spot with pitching. No, I think I’m not going to do justice to the academics. There’s a bunch of academics, often just like random individuals here and there with groups doing a lot of interesting work. And then there’s kind of the weird effective altruist nonprofits, and conventional AI alignment crowd nonprofits. Probably the most salient to me there are Redwood Research. It’s very salient to me right now because I’ve been talking with them a bunch over the last few weeks. Daniel Filan: What are they? Paul Christiano: They’re working on robustness, broadly. So this adversarial training stuff. How do you make your models definitely not do bad stuff on any input? Ought, which is a nonprofit that has been working on like, how do you actually turn large language models into tools that are useful for humans, and the Machine Intelligence Research Institute, which is the most paranoid of all organizations about AI alignment - their core value added probably. There’s a lot of people doing a lot of good work. I didn’t plug them at all throughout the podcast, but I love them anyway. Following Paul’s work Daniel Filan: All right. So speaking of plugging things, if people listen to this podcast and they’re now interested in following you and your work, what should they do? Paul Christiano: I write blog posts sometimes at ai-alignment.com. I sometimes publish to the alignment forum. And depending on how much you read, it may be your best bet to wait until spectacular, exciting results emerge, which will probably appear one of those places, and also in print. 
But we’ve been pretty quiet over the last six months, definitely. I expect to be pretty quiet for a while, and then to have a big write up of what we’re basically doing and what our plan is sometime. I guess I don’t know when this podcast is appearing, but sometime in early 2022 or something like that.

Daniel Filan: I also don’t know when it’s appearing. We did date ourselves to infrastructure week, one of the highly specific times. Okay. Well, thanks for being on the show.

Paul Christiano: Thanks for having me.

Daniel Filan: This episode is edited by Finan Adamson, and Justis Mills helped with transcription. The financial costs of making this episode are covered by a grant from the Long Term Future Fund. To read a transcript of this episode, or to learn how to support the podcast, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

### Common Probability Distributions

5 hours 36 minutes ago
Published on December 2, 2021 1:50 AM GMT

When we output a forecast, we're either explicitly or implicitly outputting a probability distribution.

For example, if we forecast the AQI in Berkeley tomorrow to be "around" 30, plus or minus 10, we implicitly mean some distribution that has most of its probability mass between 20 and 40. If we were forced to be explicit, we might say we have a normal distribution with mean 30 and standard deviation 10 in mind.

There are many different types of probability distributions, so it's helpful to know what shapes distributions tend to have and what factors influence this.

From your math and probability classes, you're probably used to the Gaussian or normal distribution as the "canonical" example of a probability distribution. However, in practice other distributions are much more common. While normal distributions do show up, it's more common to see distributions such as log-normal or power law distributions.
In the remainder of these notes, I'll discuss each of these in turn. The following table summarizes these distributions, what typically causes them to occur, and several examples of data that follow the distribution:

| Distribution | Gaussian | Log-normal | Power Law |
| --- | --- | --- | --- |
| Causes | Independent additive factors | Independent multiplicative factors | Rich get richer, scale invariance |
| Tails | Thin tails | Heavy tails | Heavier tails |
| Examples | heights, temperature, measurement errors | US GDP in 2030, price of Tesla stock in 2030 | city population, twitter followers, word frequencies |

Normal Distribution

The normal (or Gaussian) distribution is the familiar "bell-shaped" curve seen in many textbooks. Its probability density is given by $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)$, where $\mu$ is the mean and $\sigma$ is the standard deviation.

Normal distributions occur when there are many independent factors that combine additively, and no single one of those factors "dominates" the sum. Mathematically, this intuition is formalized through the central limit theorem.

Example 1: temperature. As one example, the temperature in a given city (at a given time of year) is normally distributed, since many factors (wind, ocean currents, cloud cover, pollution) affect it, mostly independently.

Example 2: heights. Similarly, height is normally distributed, since many different genes have some effect on height, as do other factors such as childhood nutrition.

However, for height we actually have to be careful, because there are two major factors that affect height significantly: age and sex. 12-year-olds are (generally) shorter than 22-year-olds, and women are on average 5 inches (13 cm) shorter than men. These overlaid histograms show heights of adults conditional on sex.

Thus, if we try to approximate the distribution of heights of all adults with a normal distribution, we will get a pretty bad approximation.
However, the distributions of male heights and female heights are separately well-approximated by normal distributions.

[Figure panels: All, Males, Females]

Example 3: measurement errors. Finally, the errors of a well-engineered system are often normally distributed. One example would be a physical measurement apparatus (such as a voltmeter). Another would be the errors of a well-fit predictive model. For instance, when I was an undergraduate I fit a model to predict the pitch, yaw, roll, and other attributes of an autonomous airplane. The results are below, and all closely follow a normal distribution:

Why do well-engineered systems have normally distributed errors? It's a sort of reverse central limit theorem: if they didn't, that would mean there was one large source of error that dominated the others, and a good engineer would have found and eliminated that source.

Brainstorming exercise. What are some other examples of random variables that you expect to be normally distributed?

Caveat: normal distributions have thin tails. The normal distribution has very "thin" tails (falling faster than an exponential), and once we reach the extremes the tails usually underestimate the probability of rare events. As a result, we have to be careful when using a normal distribution for some of the examples above, such as heights. A normal distribution predicts that essentially no women should be taller than 6'8", yet there are many women who have reached this height (read more here).

If we care specifically about the extremes, then instead of the normal distribution, a distribution with heavier tails (such as a t-distribution) may be a better fit.

Log-normal Distributions

While normal distributions arise from independent additive factors, log-normal distributions arise from independent multiplicative factors (which are often more common).
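The contrast between additive and multiplicative factors can be checked with a small simulation. This is only an illustrative sketch: the factor distribution (uniform on [0.5, 1.5]), the sample sizes, and the function names `skewness` and `simulate` are assumptions chosen for the example, not anything from the notes. It uses skewness as a rough symmetry check: a normal-looking sample has skewness near zero, while a heavy right tail gives a large positive value.

```python
import math
import random
import statistics


def skewness(xs):
    """Sample skewness: roughly 0 for symmetric data, large and positive for a heavy right tail."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return sum(((x - mean) / std) ** 3 for x in xs) / len(xs)


def simulate(n_samples=5000, n_factors=50, seed=0):
    """Draw independent positive factors, then combine them additively and multiplicatively."""
    rng = random.Random(seed)
    sums, products = [], []
    for _ in range(n_samples):
        # Hypothetical factors: each one is an independent draw from Uniform(0.5, 1.5).
        factors = [rng.uniform(0.5, 1.5) for _ in range(n_factors)]
        sums.append(sum(factors))            # additive combination
        products.append(math.prod(factors))  # multiplicative combination
    return sums, products


sums, products = simulate()
log_products = [math.log(p) for p in products]

print("skewness of sums:         ", skewness(sums))
print("skewness of products:     ", skewness(products))
print("skewness of log(products):", skewness(log_products))
```

The sums and the logs of the products should both come out close to symmetric, while the raw products are strongly right-skewed. That is the sense in which a product of many independent factors is approximately log-normal: taking the log turns the product into a sum, and the central limit theorem applies to that sum.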
A random variable $X$ is log-normally distributed if $\log(X)$ follows a normal distribution; in other words, a log-normal distribution is what you get if you take a normal random variable and exponentiate it. Its density is given by

$p(x) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(\log(x) - \mu)^2}{2\sigma^2}\Big)$.

Here $\mu$ and $\sigma$ are the mean and standard deviation of $\log(X)$ (not $X$).

[Figures: examples of log-normal distributions; Log-normal(0, 1) compared to Normal(0, 1)]

Multiplicative factors tend to occur whenever there is a "growth" process over time. For instance:

• The number of employees of a company 5 years from now (or its stock price)
• US GDP in 2030

Why should we think of factors affecting a company's employee count as multiplicative? Well, if a 20-person company does poorly it might decide to lay off 1 employee. If a 10,000-person company does poorly, it would have to lay off hundreds of employees to achieve the same relative effect. So, it makes more sense to think of "shocks" to a growth process as multiplicative rather than additive.

Log-normal distributions are much more heavy-tailed than normal distributions. One way to get a sense of this is to compare heights to stock prices.

| | Height (among US adult males) | Stock price (among S&P 500 companies) |
| --- | --- | --- |
| Median | 175.7 cm | \$119.24 |
| 99th percentile | 191.9 cm | \$1870.44 |

To check if a variable $X$ is log-normally distributed, we can plot a histogram of $\log(X)$ (or equivalently, plot the x-axis on a log scale), and this should be normally distributed. For example, consider the following plots of the Lognormal(0, 0.9) distribution:

[Figure panels: standard axes; log-scale x-axis]

Brainstorming exercise. What are other quantities that are probably log-normally distributed?

Power Law Distributions

Another common distribution is the power law distribution. Power law distributions are those that decrease at a rate of $x$ raised to some power: $p(x) = C / x^{\alpha}$ for some constant $C$ and exponent $\alpha$. (We also have to restrict $x$ away from zero, e.g.
by only considering $x > 1$ or some other threshold.)

Like a log-normal distribution, power laws are heavy-tailed. In fact, they are even heavier-tailed than log-normals. To identify a power law, we can create a log-log plot (plotting both the x and y-axes on log scales). Variables that follow power laws will show a linear trend, while log-normal variables will have curvature. Here we plot the same distributions as above, but with log scale x and y axes:

In practice, log-normal and power-law distributions often only differ far out in the tail and so it isn't always easy (or important) to tell the difference between them.

What leads to power law distributions?

Here are a few real-world examples of power law distributions (plotted on a log-log scale as above):

[Figure panels: words in TV scripts; words in the Simpsons; US city populations]

The factors that lead to power law distributions are more varied than log-normals. For a good overview, I recommend this excellent paper by Michael Mitzenmacher. I will summarize two common factors below:

• One reason for power laws is that they are the unique set of scale-invariant laws: ones where $X$ and $2X$ (and $3X$) all have identical distributions. So, we should expect power laws in any case where the "units don't matter". Examples include the net worth of individuals (dollars are an arbitrary unit) and the size of stars (meters are an arbitrary unit, and more fundamental physical units such as the Planck length don't generally affect stars).
• Another common reason for power laws is preferential attachment or rich get richer phenomena. An example of this would be twitter followers: once you have a lot of twitter followers, they are more likely to retweet your posts, leading to even more twitter followers.
And indeed, the distribution of twitter followers is power law distributed:

"Rich get richer" also explains why words are power law distributed: the more frequent a word is, the more salient it is in most people's minds, and hence the more it gets used in the future. And for cities, more people think of moving to Chicago (3rd largest city) than to Arlington, Texas (50th largest city) partly because Chicago is bigger.

Brainstorming exercise. What are other instances where we should expect to see power laws, due to either scale invariance or rich get richer?

Exercise. Interestingly, in contrast to cities, country populations do not seem to fit a power law (although they could fit a mixture of two power laws reasonably):

Can you think of reasons that explain this?

There is much more to be said about power laws. In addition to the Mitzenmacher paper mentioned above, I recommend this blog post by Terry Tao.

Concluding Exercise. Here are a couple examples of data you might want to model. For each, would you expect its distribution to be normal, log-normal, or power law?

• Incomes of US adults
• Citations of papers
• Number of Christmas trees sold each year

### The 2020 Review: Preliminary Voting

6 hours 47 minutes ago
Published on December 2, 2021 12:39 AM GMT

Today is the first day of the LessWrong 2020 Review. At the end of each year, we take a look at the best posts from the previous year, and reflect on which of them have stood the test of time.

As we navigate the 21st century, a key issue is that we’re pretty confused about which intellectual work is valuable. We have very sparse reward signals when it comes to “did we actually figure out things that matter?”

The Review has a few goals. It improves our incentives, feedback, and rewards for contributing to LessWrong. It creates common knowledge about the LW community's collective epistemic state about the most important posts of 2020. And it turns all of that into a highly curated sequence that people can read.

You can read more about the philosophy behind the Review in last year’s announcement post.

A few important announcements about this year’s review:

• We’ve replaced the nomination process with Preliminary Voting.
• Winning posts will get Donation Buttons.
• There’s a new View My Past Upvotes page to help you find posts to vote on.
• 2019 Books will be shipping in a couple weeks. We’re still evaluating whether and how to do books for 2020.

How does the review work?

The review has three phases:

1. Preliminary Voting Phase (Dec 1 - Dec 14)
2. Discussion Phase (Dec 14 - Jan 11)
3. Final Voting (Jan 11 - Jan 25)

Users who registered before January 1st 2020 can vote. The LessWrong moderation team will take the results of the vote as input for a curated sequence of posts, and award prizes. We’ll be giving more weight to the votes of users with 1000+ karma.

Preliminary Voting

The first big change this year is changing the Nomination Phase to the Preliminary Vote Phase. Eligible voters will see this UI:

If you think a post was an important intellectual contribution, you can cast a vote indicating roughly how important it was. A vote of 1 means “it was good.” A vote of 4 means “it was quite important”, and is weighted 4x a vote of 1. A vote of 9 means it was a crucial piece of intellectual progress.

You can vote at the top of a post, or anywhere the post appears in a list (like the All Posts page, or the new View Your Past Upvotes page). Posts that get at least one positive vote go to the Voting Dashboard, where other users can vote on them.

You’re encouraged to give at least a rough vote based on what you remember from last year. If you feel a post was important, you’re also encouraged to write up at least a short review of it saying what stands out about the post and why it matters. (This is essentially the same as writing a nomination comment from the 2018 and 2019 Reviews.
In practice nominations and reviews were fairly similar and it didn’t seem worth separating them out in the UI.) You’re allowed to write multiple reviews of a post, if you want to start by jotting down your quick impressions, and later review it in more detail.

Why did we switch to preliminary voting?

Each year, more posts get written on LessWrong. The first Review of 2018 considered 1,500 posts. In 2020, there were 3,000. Processing that many posts is a lot of work.

Preliminary voting is designed to help handle the increased number of posts. Instead of simply nominating posts, we start directly with a vote.

At the end of the Preliminary Voting phase, the results of the vote will be published. This will help the LessWrong community prioritize reviews. Posts that are highly ranked can invite more investigation of how they stand the tests of time. If you think a post was (unfairly) ranked low, you are welcome to write a positive review arguing it should be considered more strongly. Posts which everyone agrees are “meh” can get deprioritized, making more time for more interesting posts.

How is preliminary voting calculated?

You can cast an unlimited number of votes. However, the more votes you cast, and the higher your total “score” (where a “9” vote counts for 9x the score of a “1” vote), the less influential each of your votes will be. We normalize voting strength so that all users who are past a certain “score” threshold exert roughly the same amount of total influence.

On the back end, we use a modified quadratic voting system, which allocates a fixed number of “points” across your votes based on how strong they are.

Final Voting

Posts that receive at least one review move on to the Final Voting Phase.

The UI will require voters to at least briefly skim reviews before finalizing their vote for each post, so arguments about each post can be considered.

As with last year, we'll publish the voting results for users with 1000+ karma, as well as all users.
The LessWrong moderation team will take the voting results as a strong indicator of which posts to include in the Best of 2020 sequence.

(Note: I am currently uncertain whether Final Voting will use the fine-tuned quadratic system from last year. I plan to take last year's voting data, round each vote to the nearest "1, 4, or 9", and see if the results are significantly different from the original vote. If they aren't very different, I suspect it may not make sense to encourage everyone to spend a bunch of time fine-tuning their quadratic points. I'm open to arguments in either direction.)

Donation Buttons

Something I’d like LessWrong to do better is to allow authors to transition from hobbyists to professionals who get paid to research and write full time. Earlier this year, I was thinking about whether LessWrong should become more like Substack, where there’s an easy affordance to start financially supporting authors you like.

I liked the idea but wasn’t sure it’d be healthy for LessWrong – the sorts of posts that make people excited to donate are often more tribal/political.

But this seemed less worrisome during The Review. It’s a time when people are thinking holistically about the LessWrong intellectual world, comparing many different posts against each other and reflecting on which ones were truly valuable.

So, after the Final Vote this year, all posts above some[1] threshold will get a donation button interface, which makes it easier for people to just give the author money.

I encourage everyone to donate in proportion to how much value you got from a post. If it slightly improved your life, maybe donate $20-$50 as a thank you. If you think a post was a crucial insight for helping the entire world, maybe donate as if it were an effective altruism target. (I.e., if you’re the sort of person who donates 10% of your income, consider if any LessWrong posts are competitive with the other causes you might give to.)
LessWrong posts are a public good, and I think at least some are worth supporting in this way.

Lightcone Infrastructure will be allocating our own prizes. We have not decided the total amount we’ll give, but it will most likely be substantially more than the $2000 we awarded last year.

[1] I’m not yet sure exactly what threshold to set. I’m expecting a lot of mediocre posts to get at least one positive vote, which shouldn’t automatically warrant inclusion in the donations list.

The 2019 Books

The 2018 books were well received last year, selling out almost the entire 4000 sets we printed (though there are still some 300+ copies in Australia, available on Amazon there).

The 2019 books are a week or two away from launch. They include 59 essays, each of which has a unique customized illustration generated by machine learning. They’ll be eligible again for Amazon Prime, so shipping will be fast in North America, and likely in time for Christmas. A little later we’ll be supplying books to Amazon UK, which is where European readers can order from (with slightly longer shipping times and higher prices).

We will not ourselves be shipping to every other country – last year we attempted to ship to ~25 countries, most of which sold very few copies while requiring a lot of setup work. Alas, we are LessWrong, not Santa Claus – we unfortunately exist and are subject to logistical constraints. :P

At this time we’re not committed to doing another anthology set next year. We’re going to wait until after the launch of this year’s books to see whether there’s demand for annual anthologies. We have some different book projects in mind for the community, including a book of The Core Sequences, or entire sequences by other authors that fare well in the review, or books dedicated to a single topic drawing from the full history of LessWrong (covering topics such as Coordination or AI Alignment).

Meanwhile, we’ll definitely be collating the winning 2020 essays into a proper LessWrong sequence, prominently displayed in the site library. (I expect to have the 2018 and 2019 sequences released later this week). And again, we’ll be awarding significantly more financial prizes this year, and facilitating donation buttons to make it easier to reward authors who have done good work.

Here’s a sneak peek of the spines of the upcoming books, which includes this year’s volume titles (Book 1 is on the bottom). This year’s books are notably bigger than last year’s, 60% bigger in terms of page size.

Voting on Important Intellectual Progress

In past years, the vote was officially for creating a published book. This made it easier to reason about what exactly you were voting for, but also meant that some types of posts were harder conceptually to reward. Some important progress isn’t very fun to read. Some important posts are massively long, and couldn’t possibly fit in a book.

So this year, I’d like to formally ask that you vote based on how important an intellectual contribution a post made, rather than whether you think it makes sense to publish.

The LessWrong moderation team will take stock of the top-rated posts, and make judgment calls on how to best reward them. Some may fit best into anthology style books. Some may be more appropriate for (eventual) textbooks. Some might be important-but-tedious empirical work that makes more sense to give an honorable mention to in the books, while primarily rewarding them with prize money.

In practice, this is not that different from how we’ve been assembling the books in previous years. But it had been a bit ambiguous, and I thought it best to make it official.

You are welcome to use your own taste in what you consider important intellectual progress. But some questions that might inform your vote include:

• Does this post introduce a concept that helps you understand the world?
• Does the post provide useful and accurate empirical data?
• Does this post teach a skill that has helped you?
• Does this post summarize or distill information that makes it easier to grasp?
• Do the central arguments of the post make sense?
• Does this post promote an important and interesting hypothesis?

While writing reviews, it’s also worth exploring questions like:

• How does this fit into the broader intellectual landscape?
• What further work would you like to see?

Go Forth and Review!

I have more ideas for how to improve the Review this year, which I’ll be posting about as they reach fruition. Meanwhile, let the LessWrong 2020 Review commence!

### Biology-Inspired AGI Timelines: The Trick That Never Works

8 hours 51 minutes ago
Published on December 1, 2021 10:35 PM GMT

- 1988 -

Hans Moravec:  Behold my book Mind Children.  Within, I project that, in 2010 or thereabouts, we shall achieve strong AI.  I am not calling it "Artificial General Intelligence" because this term will not be coined for another 15 years or so.

Eliezer (who is not actually on the record as saying this, because the real Eliezer is, in this scenario, 8 years old; this version of Eliezer has all the meta-heuristics of Eliezer from 2021, but none of that Eliezer's anachronistic knowledge):  Really?  That sounds like a very difficult prediction to make correctly, since it is about the future, which is famously hard to predict.

Imaginary Moravec:  Sounds like a fully general counterargument to me.

Eliezer:  Well, it is, indeed, a fully general counterargument against futurism.  Successfully predicting the unimaginably far future - that is, more than 2 or 3 years out, or sometimes less - is something that human beings seem to be quite bad at, by and large.

Moravec:  I predict that, 4 years from this day, in 1992, the Sun will rise in the east.

Eliezer: Okay, let me qualify that.  Humans seem to be quite bad at predicting the future whenever we need to predict anything at all new and unfamiliar, rather than the Sun continuing to rise every morning until it finally gets eaten.  I'm not saying it's impossible to ever validly predict something novel!  Why, even if that was impossible, how could I know it for sure?  By extrapolating from my own personal inability to make predictions like that?  Maybe I'm just bad at it myself.  But any time somebody claims that some particular novel aspect of the far future is predictable, they justly have a significant burden of prior skepticism to overcome.

More broadly, we should not expect a good futurist to give us a generally good picture of the future.  We should expect a great futurist to single out a few rare narrow aspects of the future which are, somehow, exceptions to the usual rule about the future not being very predictable.

I do agree with you, for example, that we shall at some point see Artificial General Intelligence.  This seems like a rare predictable fact about the future, even though it is about a novel thing which has not happened before: we keep trying to crack this problem, we make progress albeit slowly, the problem must be solvable in principle because human brains solve it, eventually it will be solved; this is not a logical necessity, but it sure seems like the way to bet.  "AGI eventually" is predictable in a way that it is not predictable that, e.g., the nation of Japan, presently upon the rise, will achieve economic dominance over the next decades - to name something else that present-day storytellers of 1988 are talking about.

But timing the novel development correctly?  That is almost never done, not until things are 2 years out, and often not even then.  Nuclear weapons were called, but not nuclear weapons in 1945; heavier-than-air flight was called, but not flight in 1903.  In both cases, people said two years earlier that it wouldn't be done for 50 years - or said, decades too early, that it'd be done shortly.  There's a difference between worrying that we may eventually get a serious global pandemic, worrying that eventually a lab accident may lead to a global pandemic, and forecasting that a global pandemic will start in November of 2019.

Moravec:  You should read my book, my friend, into which I have put much effort.  In particular - though it may sound impossible to forecast, to the likes of yourself - I have carefully examined a graph of computing power in single chips and the most powerful supercomputers over time.  This graph looks surprisingly regular!  Now, of course not all trends can continue forever; but I have considered the arguments that Moore's Law will break down, and found them unconvincing.  My book spends several chapters discussing the particular reasons and technologies by which we might expect this graph to not break down, and continue, such that humanity will have, by 2010 or so, supercomputers which can perform 10 trillion operations per second.*

Oh, and also my book spends a chapter discussing the retina, the part of the brain whose computations we understand in the most detail, in order to estimate how much computing power the human brain is using, arriving at a figure of 10^13 ops/sec.  This neuroscience and computer science may be a bit hard for the layperson to follow, but I assure you that I am in fact an experienced hands-on practitioner in robotics and computer vision.

So, as you can see, we should first get strong AI somewhere around 2010.  I may be off by an order of magnitude in one figure or another; but even if I've made two errors in the same direction, that only shifts the estimate by 7 years or so.
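
As a rough check on that "7 years or so" figure: two order-of-magnitude errors in the same direction amount to a factor of 100, and the delay is however long the hardware trend takes to grow by that factor. The dialogue does not state the doubling time being assumed, so the values below are illustrative assumptions, and `years_of_delay` is a hypothetical helper rather than anything taken from the book.

```python
import math


def years_of_delay(error_factor: float, doubling_time_years: float) -> float:
    """Years of extra trend growth needed to cover a multiplicative shortfall."""
    return math.log2(error_factor) * doubling_time_years


# Two order-of-magnitude errors in the same direction: 10 * 10 = 100x.
two_errors = 10 * 10

# Doubling times are assumptions, not figures from the dialogue.
for doubling_time in (1.0, 1.5):
    delay = years_of_delay(two_errors, doubling_time)
    print(f"doubling every {doubling_time} years -> about {delay:.1f} years of delay")
```

With a one-year doubling time the delay comes to a little under 7 years, in line with the quoted figure; a slower trend stretches it accordingly.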

(*)  Moravec just about nailed this part; the actual year was 2008.

Eliezer:  I sure would be amused if we did in fact get strong AI somewhere around 2010, which, for all I know at this point in this hypothetical conversation, could totally happen!  Reversed stupidity is not intelligence, after all, and just because that is a completely broken justification for predicting 2010 doesn't mean that it cannot happen that way.

Moravec:  Really now.  Would you care to enlighten me as to how I reasoned so wrongly?

Eliezer:  Among the reasons why the Future is so hard to predict, in general, is that the sort of answers we want tend to be the products of lines of causality with multiple steps and multiple inputs.  Even when we can guess a single fact that plays some role in producing the Future - which is not of itself all that rare - usually the answer the storyteller wants depends on more facts than that single fact.  Our ignorance of any one of those other facts can be enough to torpedo our whole line of reasoning - in practice, not just as a matter of possibilities.  You could say that the art of exceptions to Futurism being impossible, consists in finding those rare things that you can predict despite being almost entirely ignorant of most concrete inputs into the concrete scenario.  Like predicting that AGI will happen at some point, despite not knowing the design for it, or who will make it, or how.

My own contribution to the Moore's Law literature consists of Moore's Law of Mad Science:  "Every 18 months, the minimum IQ required to destroy the Earth drops by 1 point."  Even if this serious-joke was an absolutely true law, and aliens told us it was absolutely true, we'd still have no ability whatsoever to predict thereby when the Earth would be destroyed, because we'd have no idea what that minimum IQ was right now or at any future time.  We would know that in general the Earth had a serious problem that needed to be addressed, because we'd know in general that destroying the Earth kept on getting easier every year; but we would not be able to time when that would become an imminent emergency, until we'd seen enough specifics that the crisis was already upon us.

In the case of your prediction about strong AI in 2010, I might put it as follows:  The timing of AGI could be seen as a product of three factors, one of which you can try to extrapolate from existing graphs, and two of which you don't know at all.  Ignorance of any one of them is enough to invalidate the whole prediction.

These three factors are:

• The availability of computing power over time, which may be quantified, and appears steady when graphed;
• The rate of progress in knowledge of cognitive science and algorithms over time, which is much harder to quantify;
• A function that is a latent background parameter, for the amount of computing power required to create AGI as a function of any particular level of knowledge about cognition; and about this we know almost nothing.
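
One way to see why ignorance of any single factor sinks the prediction is a toy model. This sketch assumes, purely for illustration, that available compute doubles at a steady rate and that growing knowledge halves the compute requirement at a steady rate. The function name, the exponential forms, and every number are hypothetical; none of them are estimates from the dialogue.

```python
import math


def years_until_agi(compute_gap: float,
                    compute_doubling_years: float,
                    requirement_halving_years: float = math.inf) -> float:
    """Years until available compute meets the requirement, in a toy model.

    compute_gap: required compute divided by available compute today
        (the quantity about which we know almost nothing).
    compute_doubling_years: doubling time of available compute (factor 1).
    requirement_halving_years: how fast better knowledge halves the
        requirement (factors 2 and 3 folded together); inf means no progress.
    """
    if compute_gap <= 1:
        return 0.0
    # Doublings of the gap closed per year, from hardware plus knowledge.
    closing_rate = 1 / compute_doubling_years + 1 / requirement_halving_years
    return math.log2(compute_gap) / closing_rate


# Hold the well-graphed factor fixed and vary only the unknown ones.
for gap in (1e3, 1e6, 1e9):
    for halving in (math.inf, 2.0):
        years = years_until_agi(gap, 1.0, halving)
        print(f"gap={gap:.0e}, requirement halving={halving} yr -> {years:.1f} years")
```

Even with the hardware trend held perfectly steady, the answer moves by many years as the two unknown inputs vary, which is the point being made about the three factors.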

Or to rephrase:  Depending on how much you and your civilization know about AI-making - how much you know about cognition and computer science - it will take you a variable amount of computing power to build an AI.  If you really knew what you were doing, for example, I confidently predict that you could build a mind at least as powerful as a human mind, while using fewer floating-point operations per second than a human brain is making useful use of -

Chris Humbali:  Wait, did you just say "confidently"?  How could you possibly know that with confidence?  How can you criticize Moravec for being too confident, and then, in the next second, turn around and be confident of something yourself?  Doesn't that make you a massive hypocrite?

Eliezer:  Um, who are you again?

Humbali:  I'm the cousin of Pat Modesto from your previous dialogue on Hero Licensing!  Pat isn't here in person because "Modesto" looks unfortunately like "Moravec" on a computer screen.  And also their first name looks a bit like "Paul" who is not meant to be referenced either.  So today I shall be your true standard-bearer for good calibration, intellectual humility, the outside view, and reference class forecasting -

Eliezer:  Two of these things are not like the other two, in my opinion; and Humbali and Modesto do not understand how to operate any of the four correctly, in my opinion; but anybody who's read "Hero Licensing" should already know I believe that.

Humbali:  - and I don't see how Eliezer can possibly be so confident, after all his humble talk of the difficulty of futurism, that it's possible to build a mind 'as powerful as' a human mind using 'less computing power' than a human brain.

Eliezer:  It's overdetermined by multiple lines of inference.  We might first note, for example, that the human brain runs very slowly in a serial sense and tries to make up for that with massive parallelism.  It's an obvious truth of computer science that while you can use 1000 serial operations per second to emulate 1000 parallel operations per second, the reverse is not in general true.

To put it another way: if you had to build a spreadsheet or a word processor on a computer running at 100Hz, you might also need a billion processing cores and massive parallelism in order to do enough cache lookups to get anything done; that wouldn't mean the computational labor you were performing was intrinsically that expensive.  Since modern chips are massively serially faster than the neurons in a brain, and the direction of conversion is asymmetrical, we should expect that there are tasks which are immensely expensive to perform in a massively parallel neural setup, which are much cheaper to do with serial processing steps, and the reverse is not symmetrically true.

A sufficiently adept builder can build general intelligence more cheaply in total operations per second, if they're allowed to line up a billion operations one after another per second, versus lining up only 100 operations one after another.  I don't bother to qualify this with "very probably" or "almost certainly"; it is the sort of proposition that a clear thinker should simply accept as obvious and move on.
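
The asymmetry between the two directions of emulation can be sketched in a few lines. This is an illustrative toy and not a model of neurons or chips: the function names, the update rule, and the chained function are all hypothetical.

```python
def parallel_step_emulated_serially(state, rule):
    """One synchronous 'parallel' update, computed one unit at a time.

    Every unit reads only the old state, so a serial loop reproduces the
    parallel result exactly, at the cost of len(state) serial operations.
    """
    return [rule(state, i) for i in range(len(state))]


def serial_chain(x, f, steps):
    """A dependency chain: step k needs the output of step k-1.

    In general, adding more parallel units does not shorten a chain like
    this, because there is nothing for the extra units to work on yet.
    """
    for _ in range(steps):
        x = f(x)
    return x


def neighbour_sum(state, i):
    # Each unit sums its two neighbours on a ring (hypothetical rule).
    return state[i - 1] + state[(i + 1) % len(state)]


print(parallel_step_emulated_serially([1, 2, 3, 4], neighbour_sum))
print(serial_chain(1, lambda x: (x * x + 1) % 1_000_003, 3))
```

The first function always works; a parallel machine trying to speed up the second has to hope the chain has special structure it can exploit.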

Humbali:  And is it certain that neurons can perform only 100 serial steps one after another, then?  As you say, ignorance about one fact can obviate knowledge of any number of others.

Eliezer:  A typical neuron firing as fast as possible can do maybe 200 spikes per second, a few rare neuron types used by e.g. bats to echolocate can do 1000 spikes per second, and the vast majority of neurons are not firing that fast at any given time.  The usual and proverbial rule in neuroscience - the sort of academically respectable belief I'd expect you to respect even more than I do - is called "the 100-step rule", that any task a human brain (or mammalian brain) can do on perceptual timescales, must be doable with no more than 100 serial steps of computation - no more than 100 things that get computed one after another.  Or even fewer if the computation is running off spiking frequencies instead of individual spikes.

Moravec:  Yes, considerations like that are part of why I'd defend my estimate of 10^13 ops/sec for a human brain as being reasonable - more reasonable than somebody might think if they were, say, counting all the synapses and multiplying by the maximum number of spikes per second in any neuron.  If you actually look at what the retina is doing, and how it's computing that, it doesn't look like it's doing one floating-point operation per activation spike per synapse.

Eliezer:  There's a similar asymmetry between precise computational operations having a vastly easier time emulating noisy or imprecise computational operations, compared to the reverse - there is no doubt a way to use neurons to compute, say, exact 16-bit integer addition, which is at least more efficient than a human trying to add up 16986+11398 in their heads, but you'd still need more synapses to do that than transistors, because the synapses are noisier and the transistors can just do it precisely.  This is harder to visualize and get a grasp on than the parallel-serial difference, but that doesn't make it unimportant.

Which brings me to the second line of very obvious-seeming reasoning that converges upon the same conclusion - that it is in principle possible to build an AGI much more computationally efficient than a human brain - namely that biology is simply not that efficient, and especially when it comes to huge complicated things that it has started doing relatively recently.

ATP synthase may be close to 100% thermodynamically efficient, but ATP synthase is literally over 1.5 billion years old and a core bottleneck on all biological metabolism.  Brains have to pump thousands of ions in and out of each stretch of axon and dendrite, in order to restore their ability to fire another fast neural spike.  The result is that the brain's computation is something like half a million times less efficient than the thermodynamic limit for its temperature - so around two millionths as efficient as ATP synthase.  And neurons are a hell of a lot older than the biological software for general intelligence!
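
The arithmetic behind "two millionths" is a single division. The sketch below rounds ATP synthase's efficiency to exactly 100% (the text says only "close to"), which is an assumption made for the sake of the calculation; the variable names are hypothetical.

```python
# Brain computation relative to the thermodynamic limit, as stated in the text.
brain_vs_limit = 1 / 500_000

# "Close to 100%" treated as exactly 1.0 (assumption).
atp_synthase_vs_limit = 1.0

brain_vs_atp_synthase = brain_vs_limit / atp_synthase_vs_limit
print(f"{brain_vs_atp_synthase * 1e6:.1f} millionths")
```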

The software for a human brain is not going to be 100% efficient compared to the theoretical maximum, nor 10% efficient, nor 1% efficient, even before taking into account the whole thing with parallelism vs. serialism, precision vs. imprecision, or similarly clear low-level differences.

Humbali:  Ah!  But allow me to offer a consideration here that, I would wager, you've never thought of before yourself - namely - what if you're wrong?  Ah, not so confident now, are you?

Eliezer:  One observes, over one's cognitive life as a human, which sorts of what-ifs are useful to contemplate, and where it is wiser to spend one's limited resources planning against the alternative that one might be wrong; and I have oft observed that lots of people don't... quite seem to understand how to use 'what if' all that well?  They'll be like, "Well, what if UFOs are aliens, and the aliens are partially hiding from us but not perfectly hiding from us, because they'll seem higher-status if they make themselves observable but never directly interact with us?"

I can refute individual what-ifs like that with specific counterarguments, but I'm not sure how to convey the central generator behind how I know that I ought to refute them.  I am not sure how I can get people to reject these ideas for themselves, instead of them passively waiting for me to come around with a specific counterargument.  My having to counterargue things specifically now looks like a road that never seems to end, and I am not as young as I once was, nor am I encouraged by how much progress I seem to be making.  I refute one wacky idea with a specific counterargument, and somebody else comes along and presents a new wacky idea on almost exactly the same theme.

I know it's probably not going to work, if I try to say things like this, but I'll try to say them anyways.  When you are going around saying 'what-if', there is a very great difference between your map of reality, and the territory of reality, which is extremely narrow and stable.  Drop your phone, gravity pulls the phone downward, it falls.  What if there are aliens and they make the phone rise into the air instead, maybe because they'll be especially amused at violating the rule after you just tried to use it as an example of where you could be confident?  Imagine the aliens watching you, imagine their amusement, contemplate how fragile human thinking is and how little you can ever be assured of anything and ought not to be too confident.  Then drop the phone and watch it fall.  You've now learned something about how reality itself isn't made of what-ifs and reminding oneself to be humble; reality runs on rails stronger than your mind does.

Contemplating this doesn't mean you know the rails, of course, which is why it's so much harder to predict the Future than the past.  But if you see that your thoughts are still wildly flailing around what-ifs, it means that they've failed to gel, in some sense, they are not yet bound to reality, because reality has no binding receptors for what-iffery.

The correct thing to do is not to act on your what-ifs that you can't figure out how to refute, but to go on looking for a model which makes narrower predictions than that.  If that search fails, forge a model which puts some more numerical distribution on your highly entropic uncertainty, instead of diverting into specific what-ifs.  And in the latter case, understand that this probability distribution reflects your ignorance and subjective state of mind, rather than your knowledge of an objective frequency; so that somebody else is allowed to be less ignorant without you shouting "Too confident!" at them.  Reality runs on rails as strong as math; sometimes other people will achieve, before you do, the feat of having their own thoughts run through more concentrated rivers of probability, in some domain.

Now, when we are trying to concentrate our thoughts into deeper, narrower rivers that run closer to reality's rails, there is of course the legendary hazard of concentrating our thoughts into the wrong narrow channels that exclude reality.  And the great legendary sign of this condition, of course, is the counterexample from Reality that falsifies our model!  But you should not in general criticize somebody for trying to concentrate their probability into narrower rivers than yours, for this is the appearance of the great general project of trying to get to grips with Reality, that runs on true rails that are narrower still.

If you have concentrated your probability into different narrow channels than somebody else's, then, of course, you have a more interesting dispute; and you should engage in that legendary activity of trying to find some accessible experimental test on which your nonoverlapping models make different predictions.

Humbali:  I do not understand the import of all this vaguely mystical talk.

Eliezer:  I'm trying to explain why, when I say that I'm very confident it's possible to build a human-equivalent mind using less computing power than biology has managed to use effectively, and you say, "How can you be so confident, what if you are wrong," it is not unreasonable for me to reply, "Well, kid, this doesn't seem like one of those places where it's particularly important to worry about far-flung ways I could be wrong."  Anyone who aspires to learn, learns over a lifetime which sorts of guesses are more likely to go oh-no-wrong in real life, and which sorts of guesses are likely to just work.  Less-learned minds will have minds full of what-ifs they can't refute in more places than more-learned minds; and even if you cannot see how to refute all your what-ifs yourself, it is possible that a more-learned mind knows why they are improbable.  For one must distinguish possibility from probability.

It is imaginable or conceivable that human brains have such refined algorithms that they are operating at the absolute limits of computational efficiency, or within 10% of it.  But if you've spent enough time noticing where Reality usually exercises its sovereign right to yell "Gotcha!" at you, learning which of your assumptions are the kind to blow up in your face and invalidate your final conclusion, you can guess that "Ah, but what if the brain is nearly 100% computationally efficient?" is the sort of what-if that is not much worth contemplating because it is not actually going to be true in real life.  Reality is going to confound you in some other way than that.

I mean, maybe you haven't read enough neuroscience and evolutionary biology that you can see from your own knowledge that the proposition sounds massively implausible and ridiculous.  But it should hardly seem unlikely that somebody else, more learned in biology, might be justified in having more confidence than you.  Phones don't fall up.  Reality really is very stable and orderly in a lot of ways, even in places where you yourself are ignorant of that order.

But if "What if aliens are making themselves visible in flying saucers because they want high status and they'll have higher status if they're occasionally observable but never deign to talk with us?" sounds to you like it's totally plausible, and you don't see how someone can be so confident that it's not true - because oh no what if you're wrong and you haven't seen the aliens so how can you know what they're not thinking - then I'm not sure how to lead you into the place where you can dismiss that thought with confidence.  It may require a kind of life experience that I don't know how to give people, at all, let alone by having them passively read paragraphs of text that I write; a learned, perceptual sense of which what-ifs have any force behind them.  I mean, I can refute that specific scenario, I can put that learned sense into words; but I'm not sure that does me any good unless you learn how to refute it yourself.

Humbali:  Can we leave aside all that meta stuff and get back to the object level?

Eliezer:  This indeed is often wise.

Humbali:  Then here's one way that the minimum computational requirements for general intelligence could be higher than Moravec's argument for the human brain.  Since, after all, we only have one existence proof that general intelligence is possible at all, namely the human brain.  Perhaps there's no way to get general intelligence in a computer except by simulating the brain neurotransmitter-by-neurotransmitter.  In that case you'd need a lot more computing operations per second than you'd get by calculating the number of potential spikes flowing around the brain!  What if it's true?  How can you know?

(Modern person:  This seems like an obvious straw argument?  I mean, would anybody, even at an earlier historical point, actually make an argument like -

Moravec and Eliezer:  YES THEY WOULD.)

Eliezer:  I can imagine that if we were trying specifically to upload a human that there'd be no easy and simple and obvious way to run the resulting simulation and get a good answer, without simulating neurotransmitter flows in extra detail.

To imagine that every one of these simulated flows is being usefully used in general intelligence and there is no way to simplify the mind design to use fewer computations...  I suppose I could try to refute that specifically, but it seems to me that this is a road which has no end unless I can convey the generator of my refutations.  Your what-iffery is flung far enough that, if I cannot leave even that much rejection as an exercise for the reader to do on their own without my holding their hand, the reader has little enough hope of following the rest; let them depart now, in indignation shared with you, and save themselves further outrage.

I mean, it will obviously be less obvious to the reader because they will know less than I do about this exact domain, it will justly take more work for the reader to specifically refute you than it takes me to refute you.  But I think the reader needs to be able to do that at all, in this example, to follow the more difficult arguments later.

Imaginary Moravec:  I don't think it changes my conclusions by an order of magnitude, but some people would worry that, for example, changes of protein expression inside a neuron in order to implement changes of long-term potentiation, are also important to intelligence, and could be a big deal in the brain's real, effectively-used computational costs.  I'm curious if you'd dismiss that as well, the same way you dismiss the probability that you'd have to simulate every neurotransmitter molecule?

Eliezer:  Oh, of course not.  Long-term potentiation suddenly turning out to be a big deal you overlooked, compared to the depolarization impulses spiking around, is very much the sort of thing where Reality sometimes jumps out and yells "Gotcha!" at you.

Humbali:  How can you tell the difference?

Eliezer:  Experience with Reality yelling "Gotcha!" at myself and historical others.

Humbali:  They seem like equally plausible speculations to me!

Eliezer:  Really?  "What if long-term potentiation is a big deal and computationally important" sounds just as plausible to you as "What if the brain is already close to the wall of making the most efficient possible use of computation to implement general intelligence, and every neurotransmitter molecule matters"?

Humbali:  Yes!  They're both what-ifs we can't know are false and shouldn't be overconfident about denying!

Eliezer:  My tiny feeble mortal mind is far away from reality and only bound to it by the loosest of correlating interactions, but I'm not that unbound from reality.

Moravec:  I would guess that in real life, long-term potentiation is sufficiently slow and local that what goes on inside the cell body of a neuron over minutes or hours is not as big of a computational deal as thousands of times that many spikes flashing around the brain in milliseconds or seconds.  That's why I didn't make a big deal of it in my own estimate.

Eliezer:  Sure.  But it is much more the sort of thing where you wake up to a reality-authored science headline saying "Gotcha!  There were tiny DNA-activation interactions going on in there at high speed, and they were actually pretty expensive and important!"  I'm not saying this exact thing is very probable, just that it wouldn't be out-of-character for reality to say something like that to me, the way it would be really genuinely bizarre if Reality was, like, "Gotcha!  The brain is as computationally efficient of a generally intelligent engine as any algorithm can be!"

Moravec:  I think we're in agreement about that part, or we would've been, if we'd actually had this conversation in 1988.  I mean, I am a competent research roboticist and it is difficult to become one if you are completely unglued from reality.

Eliezer:  Then what's with the 2010 prediction for strong AI, and the massive non-sequitur leap from "the human brain is somewhere around 10 trillion ops/sec" to "if we build a 10 trillion ops/sec supercomputer, we'll get strong AI"?

Moravec:  Because while it's the kind of Fermi estimate that can be off by an order of magnitude in practice, it doesn't really seem like it should be, I don't know, off by three orders of magnitude?  And even three orders of magnitude is just 10 years of Moore's Law.  2020 for strong AI is also a bold and important prediction.
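As a quick check on the arithmetic in Moravec's reply, the sketch below converts orders of magnitude into years of hardware growth. The doubling time is an assumption that the dialogue does not state: "three orders of magnitude is just 10 years" corresponds to roughly one doubling per year, and slower doubling times stretch the figure accordingly. The function name is hypothetical.

```python
import math

def years_for_orders_of_magnitude(orders: float, doubling_time_years: float) -> float:
    """Years of steady exponential growth needed to gain `orders` factors of ten."""
    doublings = orders * math.log2(10)  # each factor of ten is about 3.32 doublings
    return doublings * doubling_time_years

# Assumed doubling times, for illustration only.
for doubling_time in (1.0, 1.5, 2.0):
    years = years_for_orders_of_magnitude(3, doubling_time)
    print(f"doubling every {doubling_time} yr -> {years:.1f} years for a 1000x gain")
```

Under a two-year doubling time the same three orders of magnitude take about twice as long, so the "10 years" figure depends on which version of Moore's Law is assumed.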

Eliezer:  And the year 2000 for strong AI even more so.

Moravec:  Heh!  That's not usually the direction in which people argue with me.

Eliezer:  There's an important distinction between the direction in which people usually argue with you, and the direction from which Reality is allowed to yell "Gotcha!"  I wish my future self had kept this more in mind, when arguing with Robin Hanson about how well AI architectures were liable to generalize and scale without a ton of domain-specific algorithmic tinkering for every field of knowledge.  I mean, in principle what I was arguing for was various lower bounds on performance, but I sure could have emphasized more loudly that those were lower bounds - well, I did emphasize the lower-bound part, but - from the way I felt when AlphaGo and Alpha Zero and GPT-2 and GPT-3 showed up, I think I must've sorta forgot that myself.

Moravec:  Anyways, if we say that I might be up to three orders of magnitude off and phrase it as 2000-2020, do you agree with my prediction then?

Eliezer:  No, I think you're just... arguing about the wrong facts, in a way that seems to be unglued from most tracks Reality might follow so far as I currently know?  On my view, creating AGI is strongly dependent on how much knowledge you have about how to do it, in a way which almost entirely obviates the relevance of arguments from human biology?

Like, human biology tells us a single not-very-useful data point about how much computing power evolutionary biology needs in order to build a general intelligence, using very alien methods to our own.  Then, very separately, there's the constantly changing level of how much cognitive science, neuroscience, and computer science our own civilization knows.  We don't know how much computing power is required for AGI for any level on that constantly changing graph, and biology doesn't tell us.  All we know is that the hardware requirements for AGI must be dropping by the year, because the knowledge of how to create AI is something that only increases over time.

At some point the moving lines for "decreasing hardware required" and "increasing hardware available" will cross over, which lets us predict that AGI gets built at some point.  But we don't know how to graph two key functions needed to predict that date.  You would seem to be committing the classic fallacy of searching for your keys under the streetlight where the visibility is better.  You know how to estimate how many floating-point operations per second the retina could effectively be using, but this is not the number you need to predict the outcome you want to predict.  You need a graph of human knowledge of computer science over time, and then a graph of how much computer science requires how much hardware to build AI, and neither of these graphs are available.

It doesn't matter how many chapters your book spends considering the continuation of Moore's Law or computation in the retina, and I'm sorry if it seems rude of me in some sense to just dismiss the relevance of all the hard work you put into arguing it.  But you're arguing the wrong facts to get to the conclusion, so all your hard work is for naught.
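The crossing-lines picture can be made concrete with a toy calculation. This is only a sketch: it assumes both curves are simple exponentials, and every number in it is invented for illustration. It is not an estimate of anything; the point is that the crossing date is set by the two inputs nobody knows (the starting requirement and how fast it falls), so made-up inputs give made-up dates. The function and parameter names are hypothetical.

```python
import math

def crossing_year(start_year, available_ops, required_ops,
                  hardware_doubling_years, requirement_oom_drop_per_year):
    """Year at which exponentially growing hardware meets an exponentially
    falling hardware requirement. Returns start_year if already crossed."""
    if available_ops >= required_ops:
        return float(start_year)
    # Gap between the curves, in natural-log units.
    gap = math.log(required_ops / available_ops)
    # Rate at which the gap closes: hardware growth plus requirement decline.
    closing_rate = (math.log(2) / hardware_doubling_years
                    + requirement_oom_drop_per_year * math.log(10))
    return start_year + gap / closing_rate

# Invented scenarios: (label, required ops/sec at start, orders of magnitude
# by which the requirement drops per year). None of these values is known.
scenarios = [
    ("low requirement, fast algorithmic progress", 1e14, 0.3),
    ("mid requirement, slow algorithmic progress", 1e17, 0.05),
    ("high requirement, no algorithmic progress", 1e25, 0.0),
]
for label, required, drop in scenarios:
    year = crossing_year(2000, 1e12, required, 1.5, drop)
    print(f"{label}: lines cross around {year:.0f}")
```

Holding the hardware curve fixed and varying only the two unknown inputs moves the crossing by decades, which is why a careful estimate of the hardware side alone does not pin down the date.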

Humbali:  Now it seems to me that I must chide you for being too dismissive of Moravec's argument.  Fine, yes, Moravec has not established with logical certainty that strong AI must arrive at the point where top supercomputers match the human brain's 10 trillion operations per second.  But has he not established a reference class, the sort of base rate that good and virtuous superforecasters, unlike yourself, go looking for when they want to anchor their estimate about some future outcome?  Has he not, indeed, established the sort of argument which says that if top supercomputers can do only ten million operations per second, we're not very likely to get AGI earlier than that, and if top supercomputers can do ten quintillion operations per second*, we're unlikely not to already have AGI?

(*) In 2021 terms, 10 TPU v4 pods.

Eliezer:  With ranges that wide, it'd be more likely and less amusing to hit somewhere inside it by coincidence.  But I still think this whole line of thought is just off-base, and that you, Humbali, have not truly grasped the concept of a virtuous superforecaster or how they go looking for reference classes and base rates.

Humbali:  I frankly think you're just being unvirtuous.  Maybe you have some special model of AGI which claims that it'll arrive in a different year or be arrived at by some very different pathway.  But is not Moravec's estimate a sort of base rate which, to the extent you are properly and virtuously uncertain of your own models, you ought to regress in your own probability distributions over AI timelines?  As you become more uncertain about the exact amounts of knowledge required and what knowledge we'll have when, shouldn't you have an uncertain distribution about AGI arrival times that centers around Moravec's base-rate prediction of 2010?

For you to reject this anchor seems to reveal a grave lack of humility, since you must be very certain of whatever alternate estimation methods you are using in order to throw away this base-rate entirely.

Eliezer:  Like I said, I think you've just failed to grasp the true way of a virtuous superforecaster.  Thinking a lot about Moravec's so-called 'base rate' is just making you, in some sense, stupider; you need to cast your thoughts loose from there and try to navigate a wilder and less tamed space of possibilities, until they begin to gel and coalesce into narrower streams of probability.  Which, for AGI, they probably won't do until we're quite close to AGI, and start to guess correctly how AGI will get built; for it is easier to predict an eventual global pandemic than to say it will start in November of 2019.  Even in October of 2019 this cannot be done.

Humbali:  Then all this uncertainty must somehow be quantified, if you are to be a virtuous Bayesian; and again, for lack of anything better, the resulting distribution should center on Moravec's base-rate estimate of 2010.

Eliezer:  No, that calculation is just basically not relevant here; and thinking about it is making you stupider, as your mind flails in the trackless wilderness grasping onto unanchored air.  Things must be 'sufficiently similar' to each other, in some sense, for us to get a base rate on one thing by looking at another thing.  Humans making an AGI is just too dissimilar to evolutionary biology making a human brain for us to anchor 'how much computing power at the time it happens' from one to the other.  It's not the droid we're looking for; and your attempt to build an inescapable epistemological trap about virtuously calling that a 'base rate' is not the Way.

Imaginary Moravec:  If I can step back in here, I don't think my calculation is zero evidence?  What we know from evolutionary biology is that a blind alien god with zero foresight accidentally mutated a chimp brain into a general intelligence.  I don't want to knock biology's work too much, there's some impressive stuff in the retina, and the retina is just the part of the brain which is in some sense easiest to understand.  But surely there's a very reasonable argument that 10 trillion ops/sec is about the amount of computation that evolutionary biology needed; and since evolution is stupid, when we ourselves have that much computation, it shouldn't be that hard to figure out how to configure it.

Eliezer:  If that was true, the same theory predicts that our current supercomputers should be doing a better job of matching the agility and vision of spiders.  When at some point there's enough hardware that we figure out how to put it together into AGI, we could be doing it with less hardware than a human; we could be doing it with more; and we can't even say that these two possibilities are around equally probable such that our probability distribution should have its median around 2010.  Your number is so bad and obtained by such bad means that we should just throw it out of our thinking and start over.

Humbali:  This last line of reasoning seems to me to be particularly ludicrous, like you're just throwing away the only base rate we have in favor of a confident assertion of our somehow being more uncertain than that.

Eliezer:  Yeah, well, sorry to put it bluntly, Humbali, but you have not yet figured out how to turn your own computing power into intelligence.

- 1999 -

Luke Muehlhauser reading a previous draft of this (only sounding much more serious than this, because Luke Muehlhauser):  You know, there was this certain teenaged futurist who made some of his own predictions about AI timelines -

Eliezer:  I'd really rather not argue from that as a case in point.  I dislike people who screw up something themselves, and then argue like nobody else could possibly be more competent than they were.  I dislike even more people who change their mind about something when they turn 22, and then, for the rest of their lives, go around acting like they are now Very Mature Serious Adults who believe the thing that a Very Mature Serious Adult believes, so if you disagree with them about that thing they started believing at age 22, you must just need to wait to grow out of your extended childhood.

Luke Muehlhauser (still being paraphrased):  It seems like it ought to be acknowledged somehow.

Eliezer:  That's fair, yeah, I can see how someone might think it was relevant.  I just dislike how it potentially creates the appearance of trying to slyly sneak in an Argument From Reckless Youth that I regard as not only invalid but also incredibly distasteful.  You don't get to screw up yourself and then use that as an argument about how nobody else can do better.

Humbali:  Uh, what's the actual drama being subtweeted here?

Eliezer:  A certain teenaged futurist, who, for example, said in 1999, "The most realistic estimate for a seed AI transcendence is 2020; nanowar, before 2015."

Humbali:  This young man must surely be possessed of some very deep character defect, which I worry will prove to be of the sort that people almost never truly outgrow except in the rarest cases.  Why, he's not even putting a probability distribution over his mad soothsaying - how blatantly absurd can a person get?

Eliezer:  Dear child ignorant of history, your complaint is far too anachronistic.  This is 1999 we're talking about here; almost nobody is putting probability distributions on things, that element of your later subculture has not yet been introduced.  Eliezer-2002 hasn't been sent a copy of "Judgment Under Uncertainty" by Emil Gilliam.  Eliezer-2006 hasn't put his draft online for "Cognitive biases potentially affecting judgment of global risks".  The Sequences won't start until another year after that.  How would the forerunners of effective altruism in 1999 know about putting probability distributions on forecasts?  I haven't told them to do that yet!  We can give historical personages credit when they seem to somehow end up doing better than their surroundings would suggest; it is unreasonable to hold them to modern standards, or expect them to have finished refining those modern standards by the age of nineteen.

Though there's also a more subtle lesson you could learn, about how this young man turned out to still have a promising future ahead of him; which he retained at least in part by having a deliberate contempt for pretended dignity, allowing him to be plainly and simply wrong in a way that he noticed, without his having twisted himself up to avoid a prospect of embarrassment.  Instead of, for example, his evading such plain falsification by having dignifiedly wide Very Serious probability distributions centered on the same medians produced by the same basically bad thought processes.

But that was too much of a digression, when I tried to write it up; maybe later I'll post something separately.

Ray Kurzweil in 2001:  I have calculated that matching the intelligence of a human brain requires 2 * 10^16 ops/sec* and this will become available in a $1000 computer in 2023. 26 years after that, in 2049, a $1000 computer will have ten billion times more computing power than a human brain; and in 2059, that computer will cost one cent.

(*) Two TPU v4 pods.
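For readers who want to see what growth rates the figures quoted above imply, here is a back-of-envelope sketch. It assumes a constant doubling time within each interval, which is a simplification of my own and not something the quoted forecast states. The function name is hypothetical.

```python
import math

def implied_doubling_time(growth_factor: float, years: float) -> float:
    """Doubling time (in years) implied by a total growth factor over a span of years,
    assuming steady exponential growth."""
    return years / math.log2(growth_factor)

# 2023 -> 2049: compute per $1000 grows by a factor of ten billion.
compute_doubling = implied_doubling_time(1e10, 26)

# 2049 -> 2059: the same compute falls from $1000 to one cent, a factor of 100,000.
price_doubling = implied_doubling_time(1000 / 0.01, 10)

print(f"2023-2049: price-performance doubles every {compute_doubling:.2f} years")
print(f"2049-2059: price-performance doubles every {price_doubling:.2f} years")
```

Both intervals imply a doubling time under one year, and the later interval implies a shorter one than the earlier, so the quoted numbers describe growth that is itself speeding up.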

Actual real-life Eliezer in Q&A, when Kurzweil says the same thing in a 2004(?) talk:  It seems weird to me to forecast the arrival of "human-equivalent" AI, and then expect Moore's Law to just continue on the same track past that point for thirty years.  Once we've got, in your terms, human-equivalent AIs, even if we don't go beyond that in terms of intelligence, Moore's Law will start speeding them up.  Once AIs are thinking thousands of times faster than we are, wouldn't that tend to break down the graph of Moore's Law with respect to the objective wall-clock time of the Earth going around the Sun?  Because AIs would be able to spend thousands of subjective years working on new computing technology?

Actual Ray Kurzweil:  The fact that AIs can do faster research is exactly what will enable Moore's Law to continue on track.

Actual Eliezer (out loud):  Thank you for answering my question.

Actual Eliezer (internally):  Moore's Law is a phenomenon produced by human cognition and the fact that human civilization runs off human cognition.  You can't expect the surface phenomenon to continue unchanged after the deep causal phenomenon underlying it starts changing.  What kind of bizarre worship of graphs would lead somebody to think that the graphs were the primary phenomenon and would continue steady and unchanged when the forces underlying them changed massively?  I was hoping he'd be less nutty in person than in the book, but oh well.

Somebody on the Internet:  I have calculated the number of computer operations used by evolution to evolve the human brain - searching through organisms with increasing brain size  - by adding up all the computations that were done by any brains before modern humans appeared.  It comes out to 10^43 computer operations.*  AGI isn't coming any time soon!

(*)  I forget the exact figure.  It was 10^40-something.

Eliezer, sighing:  Another day, another biology-inspired timelines forecast.  This trick didn't work when Moravec tried it, it's not going to work while Ray Kurzweil is trying it, and it's not going to work when you try it either.  It also didn't work when a certain teenager tried it, but please entirely ignore that part; you're at least allowed to do better than him.

Imaginary Somebody:  Moravec's prediction failed because he assumed that you could just magically take something with around as much hardware as the human brain and, poof, it would start being around that intelligent -

Eliezer:  Yes, that is one way of viewing an invalidity in that argument.  Though you do Moravec a disservice if you imagine that he could only argue "It will magically emerge", and could not give the more plausible-sounding argument "Human engineers are not that incompetent compared to biology, and will probably figure it out without more than one or two orders of magnitude of extra overhead."

Somebody:  But I am cleverer, for I have calculated the number of computing operations that was used to create and design biological intelligence, not just the number of computing operations required to run it once created!

Eliezer:  And yet, because your reasoning contains the word "biological", it is just as invalid and unhelpful as Moravec's original prediction.

Somebody:  I don't see why you dismiss my biological argument about timelines on the basis of Moravec having been wrong.  He made one basic mistake - neglecting to take into account the cost to generate intelligence, not just to run it.  I have corrected this mistake, and now my own effort to do biologically inspired timeline forecasting should work fine, and must be evaluated on its own merits, de novo.

Eliezer:  It is true indeed that sometimes a line of inference is doing just one thing wrong, and works fine after being corrected.  And because this is true, it is often indeed wise to reevaluate new arguments on their own merits, if that is how they present themselves.  One may not take the past failure of a different argument or three, and try to hang it onto the new argument like an inescapable iron ball chained to its leg.  It might be the cause for defeasible skepticism, but not invincible skepticism.

That said, on my view, you are making a nearly identical mistake as Moravec, and so his failure remains relevant to the question of whether you are engaging in a kind of thought that binds well to Reality.

Somebody:  And that mistake is just mentioning the word "biology"?

Eliezer:  The problem is that the resource gets consumed differently, so base-rate arguments from resource consumption end up utterly unhelpful in real life.  The human brain consumes around 20 watts of power.  Can we thereby conclude that an AGI should consume around 20 watts of power, and that, when technology advances to the point of being able to supply around 20 watts of power to computers, we'll get AGI?

Somebody:  That's absurd, of course.  So, what, you compare my argument to an absurd argument, and from this dismiss it?

Eliezer:  I'm saying that Moravec's "argument from comparable resource consumption" must be in general invalid, because it Proves Too Much.  If it's in general valid to reason about comparable resource consumption, then it should be equally valid to reason from energy consumed as from computation consumed, and pick energy consumption instead to call the basis of your median estimate.

You say that AIs consume energy in a very different way from brains?  Well, they'll also consume computations in a very different way from brains!  The only difference between these two cases is that you know something about how humans eat food and break it down in their stomachs and convert it into ATP that gets consumed by neurons to pump ions back out of dendrites and axons, while computer chips consume electricity whose flow gets interrupted by transistors to transmit information.  Since you know anything whatsoever about how AGIs and humans consume energy, you can see that the consumption is so vastly different as to obviate all comparisons entirely.

You are ignorant of how the brain consumes computation, you are ignorant of how the first AGIs built would consume computation, but "an unknown key does not open an unknown lock" and these two ignorant distributions should not assert much internal correlation between them.

Even without knowing the specifics of how brains and future AGIs consume computing operations, you ought to be able to reason abstractly about a directional update that you would make, if you knew any specifics instead of none.  If you did know how both kinds of entity consumed computations, if you knew about specific machinery for human brains, and specific machinery for AGIs, you'd then be able to see the enormous vast specific differences between them, and go, "Wow, what a futile resource-consumption comparison to try to use for forecasting."

(Though I say this without much hope; I have not had very much luck in telling people about predictable directional updates they would make, if they knew something instead of nothing about a subject.  I think it's probably too abstract for most people to feel in their gut, or something like that, so their brain ignores it and moves on in the end.  I have had life experience with learning more about a thing, updating, and then going to myself, "Wow, I should've been able to predict in retrospect that learning almost any specific fact would move my opinions in that same direction."  But I worry this is not a common experience, for it involves a real experience of discovery, and preferably more than one to get the generalization.)

Somebody:  All of that seems irrelevant to my novel and different argument.  I am not foolishly estimating the resources consumed by a single brain; I'm estimating the resources consumed by evolutionary biology to invent brains!

Eliezer:  And the humans wracking their own brains and inventing new AI program architectures and deploying those AI program architectures to themselves learn, will consume computations so utterly differently from evolution that there is no point comparing those consumptions of resources.  That is the flaw that you share exactly with Moravec, and that is why I say the same of both of you, "This is a kind of thinking that fails to bind upon reality, it doesn't work in real life."  I don't care how much painstaking work you put into your estimate of 10^43 computations performed by biology.  It's just not a relevant fact.

Humbali:  But surely this estimate of 10^43 cumulative operations can at least be used to establish a base rate for anchoring our -

Eliezer:  Oh, for god's sake, shut up.  At least Somebody is only wrong on the object level, and isn't trying to build an inescapable epistemological trap by which his ideas must still hang in the air like an eternal stench even after they've been counterargued.  Isn't 'but muh base rates' what your viewpoint would've also said about Moravec's 2010 estimate, back when that number still looked plausible?

Humbali:  Of course it is evident to me now that my youthful enthusiasm was mistaken; obviously I tried to estimate the wrong figure.  As Somebody argues, we should have been estimating the biological computations used to design human intelligence, not the computations used to run it.

I see, now, that I was using the wrong figure as my base rate, leading my base rate to be wildly wrong, and even irrelevant; but now that I've seen this, the clear error in my previous reasoning, I have a new base rate.  This doesn't seem obviously to me likely to contain the same kind of wildly invalidating enormous error as before.  What, is Reality just going to yell "Gotcha!" at me again?  And even the prospect of some new unknown error, which is just as likely to be in either possible direction, implies only that we should widen our credible intervals while keeping them centered on a median of 10^43 operations -

Eliezer:  Please stop.  This trick just never works, at all, deal with it and get over it.  Every second of attention that you pay to the 10^43 number is making you stupider.  You might as well reason that 20 watts is a base rate for how much energy the first generally intelligent computing machine should consume.

- 2020 -

OpenPhil:  We have commissioned a Very Serious report on a biologically inspired estimate of how much computation will be required to achieve Artificial General Intelligence, for purposes of forecasting an AGI timeline.  (Summary of report.)  (Full draft of report.)  Our leadership takes this report Very Seriously.

Eliezer:  Oh, hi there, new kids.  Your grandpa is feeling kind of tired now and can't debate this again with as much energy as when he was younger.

Imaginary OpenPhil:  You're not that much older than us.

Eliezer:  Not by biological wall-clock time, I suppose, but -

OpenPhil:  You think thousands of times faster than us?

Eliezer:  I wasn't going to say it if you weren't.

OpenPhil:  We object to your assertion on the grounds that it is false.

Eliezer:  I was actually going to say, you might be underestimating how long I've been walking this endless battlefield because I started really quite young.

I mean, sure, I didn't read Mind Children when it came out in 1988.  I only read it four years later, when I was twelve.  And sure, I didn't immediately afterwards start writing online about Moore's Law and strong AI; I did not immediately contribute my own salvos and sallies to the war; I was not yet a noticed voice in the debate.  I only got started on that at age sixteen.  I'd like to be able to say that in 1999 I was just a random teenager being reckless, but in fact I was already being invited to dignified online colloquia about the "Singularity" and mentioned in printed books; when I was being wrong back then I was already doing so in the capacity of a minor public intellectual on the topic.

This is, as I understand normie ways, relatively young, and is probably worth an extra decade tacked onto my biological age; you should imagine me as being 52 instead of 42 as I write this, with a correspondingly greater number of visible gray hairs.

A few years later - though still before your time - there was the Accelerating Change Foundation, and Ray Kurzweil spending literally millions of dollars to push Moore's Law graphs of technological progress as the central story about the future.  I mean, I'm sure that a few million dollars sounds like peanuts to OpenPhil, but if your own annual budget was a hundred thousand dollars or so, that's a hell of a megaphone to compete with.

If you are currently able to conceptualize the Future as being about something other than nicely measurable metrics of progress in various tech industries, being projected out to where they will inevitably deliver us nice things - that's at least partially because of a battle fought years earlier, in which I was a primary fighter, creating a conceptual atmosphere you now take for granted.  A mental world where threshold levels of AI ability are considered potentially interesting and transformative - rather than milestones of new technological luxuries to be checked off on an otherwise invariant graph of Moore's Laws as they deliver flying cars, space travel, lifespan-extension escape velocity, and other such goodies on an equal level of interestingness.  I have earned at least a little right to call myself your grandpa.

And that kind of experience has a sort of compounded interest, where, once you've lived something yourself and participated in it, you can learn more from reading other histories about it.  The histories become more real to you once you've fought your own battles.  The fact that I've lived through timeline errors in person gives me a sense of how it actually feels to be around at the time, watching people sincerely argue Very Serious erroneous forecasts.  That experience lets me really and actually update on the history of the earlier mistaken timelines from before I was around; instead of the histories just seeming like a kind of fictional novel to read about, disconnected from reality and not happening to real people.

And now, indeed, I'm feeling a bit old and tired for reading yet another report like yours in full attentive detail.  Does it by any chance say that AGI is due in about 30 years from now?

OpenPhil:  Our report has very wide credible intervals around both sides of its median, as we analyze the problem from a number of different angles and show how they lead to different estimates -

Eliezer:  Unfortunately, the thing about figuring out five different ways to guess the effective IQ of the smartest people on Earth, and having three different ways to estimate the minimum IQ to destroy lesser systems such that you could extrapolate a minimum IQ to destroy the whole Earth, and putting wide credible intervals around all those numbers, and combining and mixing the probability distributions to get a new probability distribution, is that, at the end of all that, you are still left with a load of nonsense.  Doing a fundamentally wrong thing in several different ways will not save you, though I suppose if you spread your bets widely enough, one of them may be right by coincidence.

So does the report by any chance say - with however many caveats and however elaborate the probabilistic methods and alternative analyses - that AGI is probably due in about 30 years from now?

OpenPhil:  Yes, in fact, our 2020 report's median estimate is 2050; though, again, with very wide credible intervals around both sides.  Is that number significant?

Eliezer:  It's a law generalized by Charles Platt, that any AI forecast will put strong AI thirty years out from when the forecast is made.  Vernor Vinge referenced it in the body of his famous 1993 NASA speech, whose abstract begins, "Within thirty years, we will have the technological means to create superhuman intelligence.  Shortly after, the human era will be ended."

After I was old enough to be more skeptical of timelines myself, I used to wonder how Vinge had pulled out the "within thirty years" part.  This may have gone over my head at the time, but rereading again today, I conjecture Vinge may have chosen the headline figure of thirty years as a deliberately self-deprecating reference to Charles Platt's generalization about such forecasts always being thirty years from the time they're made, which Vinge explicitly cites later in the speech.

Or to put it another way:  I conjecture that to the audience of the time, already familiar with some previously-made forecasts about strong AI, the impact of the abstract is meant to be, "Never mind predicting strong AI in thirty years, you should be predicting superintelligence in thirty years, which matters a lot more."  But the minds of authors are scarcely more knowable than the Future, if they have not explicitly told us what they were thinking; so you'd have to ask Professor Vinge, and hope he remembers what he was thinking back then.

OpenPhil:  Superintelligence before 2023, huh?  I suppose Vinge still has two years left to go before that's falsified.

Eliezer:  Also in the body of the speech, Vinge says, "I'll be surprised if this event occurs before 2005 or after 2030," which sounds like a more serious and sensible way of phrasing an estimate.  I think that should supersede the probably Platt-inspired headline figure for what we think of as Vinge's 1993 prediction.  The jury's still out on whether Vinge will have made a good call.

Oh, and sorry if grandpa is boring you with all this history from the times before you were around.  I mean, I didn't actually attend Vinge's famous NASA speech when it happened, what with being thirteen years old at the time, but I sure did read it later.  Once it was digitized and put online, it was all over the Internet.  Well, all over certain parts of the Internet, anyways.  Which nerdy parts constituted a much larger fraction of the whole, back when the World Wide Web was just starting to take off among early adopters.

But, yeah, the new kids showing up with some graphs of Moore's Law and calculations about biology and an earnest estimate of strong AI being thirty years out from the time of the report is, uh, well, it's... historically precedented.

OpenPhil:  That part about Charles Platt's generalization is interesting, but just because we unwittingly chose literally exactly the median that Platt predicted people would always choose in consistent error, that doesn't justify dismissing our work, right?  We could have used a completely valid method of estimation which would have pointed to 2050 no matter which year it was tried in, and, by sheer coincidence, have first written that up in 2020.  In fact, we try to show in the report that the same methodology, evaluated in earlier years, would also have pointed to around 2050 -

Eliezer:  Look, people keep trying this.  It's never worked.  It's never going to work.  2 years before the end of the world, there'll be another published biologically inspired estimate showing that AGI is 30 years away and it will be exactly as informative then as it is now.  I'd love to know the timelines too, but you're not going to get the answer you want until right before the end of the world, and maybe not even then unless you're paying very close attention.  Timing this stuff is just plain hard.

OpenPhil:  But our report is different, and our methodology for biologically inspired estimates is wiser and less naive than those who came before.

Eliezer:  That's what the last guy said, but go on.

OpenPhil:  First, we carefully estimate a range of possible figures for the equivalent of neural-network parameters needed to emulate a human brain.  Then, we estimate how many examples would be required to train a neural net with that many parameters.  Then, we estimate the total computational cost of that many training runs.  Moore's Law then gives us 2050 as our median time estimate, given what we think are the most likely underlying assumptions, though we do analyze it several different ways.

Eliezer:  This is almost exactly what the last guy tried, except you're using network parameters instead of computing ops, and deep learning training runs instead of biological evolution.

OpenPhil:  Yes, so we've corrected his mistake of estimating the wrong biological quantity and now we're good, right?

Eliezer:  That's what the last guy thought he'd done about Moravec's mistaken estimation target.  And neither he nor Moravec would have made much headway on their underlying mistakes, by doing a probabilistic analysis of that same wrong question from multiple angles.

OpenPhil:  Look, sometimes more than one person makes a mistake, over historical time.  It doesn't mean nobody can ever get it right.  You of all people should agree.

Eliezer:  I do so agree, but that doesn't mean I agree you've fixed the mistake.  I think the methodology itself is bad, not just its choice of which biological parameter to estimate.  Look, do you understand why the evolution-inspired estimate of 10^43 ops was completely ludicrous; and the claim that it was equally likely to be mistaken in either direction, even more ludicrous?

OpenPhil:  Because AGI isn't like biology, and in particular, will be trained using gradient descent instead of evolutionary search, which is cheaper.  We do note inside our report that this is a key assumption, and that, if it fails, the estimate might be correspondingly wrong -

Eliezer:  But then you claim that mistakes are equally likely in both directions and so your unstable estimate is a good median.  Can you see why the previous evolutionary estimate of 10^43 cumulative ops was not, in fact, equally likely to be wrong in either direction?  That it was, predictably, a directional overestimate?

OpenPhil:  Well, search by evolutionary biology is more costly than training by gradient descent, so in hindsight, it was an overestimate.  Are you claiming this was predictable in foresight instead of hindsight?

Eliezer:  I'm claiming that, at the time, I snorted and tossed Somebody's figure out the window while thinking it was ridiculously huge and absurd, yes.

OpenPhil:  Because you'd already foreseen in 2006 that gradient descent would be the method of choice for training future AIs, rather than genetic algorithms?

Eliezer:  Ha!  No.  Because it was an insanely costly hypothetical approach whose main point of appeal, to the sort of person who believed in it, was that it didn't require having any idea whatsoever of what you were doing or how to design a mind.

OpenPhil:  Suppose one were to reply:  "Somebody" didn't know better-than-evolutionary methods for designing a mind, just as we currently don't know better methods than gradient descent for designing a mind; and hence Somebody's estimate was the best estimate at the time, just as ours is the best estimate now?

Eliezer:  Unless you were one of a small handful of leading neural-net researchers who knew a few years ahead of the world where scientific progress was heading - who knew a Thielian 'secret' before finding evidence strong enough to convince the less foresightful - you couldn't have called the jump specifically to gradient descent rather than any other technique.  "I don't know any more computationally efficient way to produce a mind than re-evolving the cognitive history of all life on Earth" transitioning over time to "I don't know any more computationally efficient way to produce a mind than gradient descent over entire brain-sized models" is not predictable in the specific part about "gradient descent" - not unless you know a Thielian secret.

But knowledge is a ratchet that usually only turns one way, so it's predictable that the current story changes to somewhere over future time, in a net expected direction.  Let's consider the technique currently known as mixture-of-experts (MoE), for training smaller nets in pieces and muxing them together.  It's not my mainline prediction that MoE actually goes anywhere - if I thought MoE was actually promising, I wouldn't call attention to it, of course!  I don't want to make timelines shorter, that is not a service to Earth, not a good sacrifice in the cause of winning an Internet argument.

But if I'm wrong and MoE is not a dead end, that technique serves as an easily-visualizable case in point.  If that's a fruitful avenue, the technique currently known as "mixture-of-experts" will mature further over time, and future deep learning engineers will be able to further perfect the art of training slices of brains using gradient descent and fewer examples, instead of training entire brains using gradient descent and lots of examples.

Or, more likely, it's not MoE that forms the next little trend.  But there is going to be something, especially if we're sitting around waiting until 2050.  Three decades is enough time for some big paradigm shifts in an intensively researched field.  Maybe we'd end up using neural net tech very similar to today's tech if the world ends in 2025, but in that case, of course, your prediction must have failed somewhere else.

The three components of AGI arrival times are available hardware, which increases over time in an easily graphed way; available knowledge, which increases over time in a way that's much harder to graph; and hardware required at a given level of specific knowledge, a huge multidimensional unknown background parameter.  The fact that you have no idea how to graph the increase of knowledge - or measure it in any way that is less completely silly than "number of science papers published" or whatever such gameable metric - doesn't change the point that this is a predictable fact about the future; there will be more knowledge later, the more time that passes, and that will directionally change the expense of the currently least expensive way of doing things.

OpenPhil:  We did already consider that and try to take it into account: our model already includes a parameter for how algorithmic progress reduces hardware requirements.  It's not easy to graph as exactly as Moore's Law, as you say, but our best-guess estimate is that compute costs halve every 2-3 years.
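As a rough illustration of how much a timing estimate of this kind leans on that one parameter, here is a minimal sketch of the "rising hardware line touching the falling hardware-required line" picture. Everything numeric in it is a hypothetical placeholder rather than a figure from the report: the function name `crossing_year`, the starting gap of seven orders of magnitude, and the 2.5-year hardware doubling time are all assumptions chosen only for the example.

```python
import math

def crossing_year(start_year, available_compute, required_compute,
                  hardware_doubling_years, algorithm_halving_years):
    """Year in which available compute first meets the compute requirement.

    Assumed toy model (not the report's actual model):
      available(t) = available_compute * 2 ** (t / hardware_doubling_years)
      required(t)  = required_compute  / 2 ** (t / algorithm_halving_years)
    Setting the two equal and solving for t gives the expression below.
    """
    gap_in_doublings = math.log2(required_compute / available_compute)
    # Doublings of the gap closed per year, from hardware plus algorithms.
    closing_rate = 1 / hardware_doubling_years + 1 / algorithm_halving_years
    return start_year + gap_in_doublings / closing_rate

# Hypothetical inputs: a 10^7 gap between what is affordable and what is needed.
START_YEAR = 2020
AVAILABLE = 1.0
REQUIRED = 1e7
HARDWARE_DOUBLING_YEARS = 2.5  # assumption, standing in for "Moore's Law"

for halving in (2.0, 3.0, math.inf):  # math.inf means no algorithmic progress
    year = crossing_year(START_YEAR, AVAILABLE, REQUIRED,
                         HARDWARE_DOUBLING_YEARS, halving)
    print(f"algorithmic halving time {halving} years -> crossing near {year:.0f}")
```

Under these made-up inputs, moving the halving time from 2 to 3 years shifts the crossing point by several years, and removing algorithmic progress altogether shifts it by decades. The sketch says nothing about which value is right, only that the output date is sensitive to a parameter that is hard to measure.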

Eliezer:  Oh, nice.  I was wondering what sort of tunable underdetermined parameters enabled your model to nail the psychologically overdetermined final figure of '30 years' so exactly.

OpenPhil:  Eliezer.

Eliezer:  Think of this in an economic sense: people don't buy where goods are most expensive and delivered latest, they buy where goods are cheapest and delivered earliest.  Deep learning researchers are not like an inanimate chunk of ice tumbling through intergalactic space in its unchanging direction of previous motion; they are economic agents who look around for ways to destroy the world faster and more cheaply than the way that you imagine as the default.  They are more eager than you are to think of more creative paths to get to the next milestone faster.

OpenPhil:  Isn't this desire for cheaper methods exactly what our model already accounts for, by modeling algorithmic progress?

Eliezer:  The makers of AGI aren't going to be doing 10,000,000,000,000 rounds of gradient descent, on entire brain-sized 300,000,000,000,000-parameter models, algorithmically faster than today.  They're going to get to AGI via some route that you don't know how to take, at least if it happens in 2040.  If it happens in 2025, it may be via a route that some modern researchers do know how to take, but in this case, of course, your model was also wrong.

They're not going to be taking your default-imagined approach algorithmically faster, they're going to be taking an algorithmically different approach that eats computing power in a different way than you imagine it being consumed.

OpenPhil:  Shouldn't that just be folded into our estimate of how the computation required to accomplish a fixed task decreases by half every 2-3 years due to better algorithms?

Eliezer:  Backtesting this viewpoint on the previous history of computer science, it seems to me to assert that it should be possible to:

• Train a pre-Transformer RNN/CNN-based model, not using any other techniques invented after 2017, to GPT-2 levels of performance, using only around 2x as much compute as GPT-2;
• Play pro-level Go using 8-16 times as much computing power as AlphaGo, but only 2006 levels of technology.

For reference, recall that in 2006, Hinton and Salakhutdinov were just starting to publish that, by training multiple layers of Restricted Boltzmann machines and then unrolling them into a "deep" neural network, you could get an initialization for the network weights that would avoid the problem of vanishing and exploding gradients and activations.  At least so long as you didn't try to stack too many layers, like a dozen layers or something ridiculous like that.  This being the point that kicked off the entire deep-learning revolution.

Your model apparently suggests that we have gotten around 50 times more efficient at turning computation into intelligence since that time; so, we should be able to replicate any modern feat of deep learning performed in 2021, using techniques from before deep learning and around fifty times as much computing power.
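The arithmetic behind these backtests can be sketched in a few lines. This assumes the simplest reading of "halves every 2-3 years", namely a constant exponential rate. The elapsed-year figures are assumptions added for the example: GPT-2 is taken as 2019, two years after the 2017 cutoff, and AlphaGo as 2016, ten years after 2006. The helper name `efficiency_multiplier` is hypothetical.

```python
def efficiency_multiplier(years_elapsed, halving_years):
    """Factor by which required compute shrinks if it halves every `halving_years`."""
    return 2 ** (years_elapsed / halving_years)

# (label, assumed years between the older technology and the modern feat)
backtests = [
    ("GPT-2 level without post-2017 techniques", 2),
    ("Pro-level Go with 2006 technology", 10),
    ("2021 deep learning with 2006 technology", 15),
]

for label, years in backtests:
    slow = efficiency_multiplier(years, 3.0)  # halving every 3 years
    fast = efficiency_multiplier(years, 2.0)  # halving every 2 years
    print(f"{label}: about {slow:.1f}x to {fast:.1f}x more compute")
```

With these assumptions the three ranges come out to roughly 1.6x to 2x, 10x to 32x, and 32x to 181x, so the figures quoted in the dialogue (around 2x, 8-16x, and around 50x) sit in or near the ranges implied by a 2-3 year halving time.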

OpenPhil:  No, that's totally not what our viewpoint says when you backfit it to past reality.  Our model does a great job of retrodicting past reality.

Eliezer:  How so?

OpenPhil:  <Eliezer cannot predict what they will say here.>

Eliezer:  I'm not convinced by this argument.

OpenPhil:  We didn't think you would be; you're sort of predictable that way.

Eliezer:  Well, yes, if I'd predicted I'd update from hearing your argument, I would've updated already.  I may not be a real Bayesian but I'm not that incoherent.

But I can guess in advance at the outline of my reply, and my guess is this:

"Look, when people come to me with models claiming the future is predictable enough for timing, I find that their viewpoints seem to me like they would have made garbage predictions if I actually had to operate them in the past without benefit of hindsight.  Sure, with benefit of hindsight, you can look over a thousand possible trends and invent rules of prediction and event timing that nobody in the past actually spotlighted then, and claim that things happened on trend.  I was around at the time and I do not recall people actually predicting the shape of AI in the year 2020 in advance.  I don't think they were just being stupid either.

"In a conceivable future where people are still alive and reasoning as modern humans do in 2040, somebody will no doubt look back and claim that everything happened on trend since 2020; but which trend the hindsighter will pick out is not predictable to us in advance.

"It may be, of course, that I simply don't understand how to operate your viewpoint, nor how to apply it to the past or present or future; and that yours is a sort of viewpoint which indeed permits saying only one thing, and not another; and that this viewpoint would have predicted the past wonderfully, even without any benefit of hindsight.  But there is also that less charitable viewpoint which suspects that somebody's theory of 'A coinflip always comes up heads on occasions X' contains some informal parameters which can be argued about which occasions exactly 'X' describes, and that the operation of these informal parameters is a bit influenced by one's knowledge of whether a past coinflip actually came up heads or not.

"As somebody who doesn't start from the assumption that your viewpoint is a good fit to the past, I still don't see how a good fit to the past could've been extracted from it without benefit of hindsight."

OpenPhil:  That's a pretty general counterargument, and like any pretty general counterargument it's a blade you should try turning against yourself.  Why doesn't your own viewpoint horribly mispredict the past, and say that all estimates of AGI arrival times are predictably net overestimates?  If we imagine trying to operate your own viewpoint in 1988, we imagine going to Moravec and saying, "Your estimate of how much computing power it takes to match a human brain is predictably an overestimate, because engineers will find a better way to do it than biology, so we should expect AGI sooner than 2010."

Eliezer:  I did tell Imaginary Moravec that his estimate of the minimum computation required for human-equivalent general intelligence was predictably an overestimate; that was right there in the dialogue before I even got around to writing this part.  And I also, albeit with benefit of hindsight, told Moravec that both of these estimates were useless for timing the future, because they skipped over the questions of how much knowledge you'd need to make an AGI with a given amount of computing power, how fast knowledge was progressing, and the actual timing determined by the rising hardware line touching the falling hardware-required line.

OpenPhil:  We don't see how to operate your viewpoint to say in advance to Moravec, before his prediction has been falsified, "Your estimate is plainly a garbage estimate" instead of "Your estimate is obviously a directional overestimate", especially since you seem to be saying the latter to us, now.

Eliezer:  That's not a critique I give zero weight.  And, I mean, as a kid, I was in fact talking like, "To heck with that hardware estimate, let's at least try to get it done before then.  People are dying for lack of superintelligence; let's aim for 2005."  I had a T-shirt spraypainted "Singularity 2005" at a science fiction convention, it's rather crude but I think it's still in my closet somewhere.

But now I am older and wiser and have fixed all my past mistakes, so the critique of those past mistakes no longer applies to my new arguments.

OpenPhil:  Uh huh.

Eliezer:  I mean, I did try to fix all the mistakes that I knew about, and didn't just, like, leave those mistakes in forever?  I realize that this claim to be able to "learn from experience" is not standard human behavior in situations like this, but if you've got to be weird, that's a good place to spend your weirdness points.  At least by my own lights, I am now making a different argument than I made when I was nineteen years old, and that different argument should be considered differently.

And, yes, I also think my nineteen-year-old self was not completely foolish at least about AI timelines; in the sense that, for all he knew, maybe you could build AGI by 2005 if you tried really hard over the next 6 years.  Not so much because Moravec's estimate should've been seen as a predictable overestimate of how much computing power would actually be needed, given knowledge that would become available in the next 6 years; but because Moravec's estimate should've been seen as almost entirely irrelevant, making the correct answer be "I don't know."

OpenPhil:  It seems to us that Moravec's estimate, and the guess of your nineteen-year-old past self, are both predictably vast underestimates.  Estimating the computation consumed by one brain, and calling that your AGI target date, is obviously predictably a vast underestimate because it neglects the computation required for training a brainlike system.  It may be a bit uncharitable, but we suggest that Moravec and your nineteen-year-old self may both have been motivatedly credulous, to not notice a gap so very obvious.

Eliezer:  I could imagine it seeming that way if you'd grown up never learning about any AI techniques except deep learning, which had, in your wordless mental world, always been the way things were, and would always be that way forever.

I mean, it could be that deep learning will still be the bleeding-edge method of Artificial Intelligence right up until the end of the world.  But if so, it'll be because Vinge was right and the world ended before 2030, not because the deep learning paradigm was as good as any AI paradigm can ever get.  That is simply not a kind of thing that I expect Reality to say "Gotcha" to me about, any more than I expect to be told that the human brain, whose neurons and synapses are 500,000 times further away from the thermodynamic efficiency wall than ATP synthase, is the most efficient possible consumer of computations.

The specific perspective-taking operation needed here - when it comes to what was and wasn't obvious in 1988 or 1999 - is that the notion of spending thousands and millions and billions of times as much computation on a "training" phase, as on an "inference" phase, is something that only came to be seen as Always Necessary after the deep learning revolution took over AI in the late Noughties.  Back when Moravec was writing, you programmed a game-tree-search algorithm for chess, and then you ran that code, and it played chess.  Maybe you needed to add an opening book, or do a lot of trial runs to tweak the exact values the position evaluation function assigned to knights vs. bishops, but most AIs weren't neural nets and didn't get trained on enormous TPU pods.

Moravec had no way of knowing that the paradigm in AI would, twenty years later, massively shift to a new paradigm in which stuff got trained on enormous TPU pods.  He lived in a world where you could only train neural networks a few layers deep, like, three layers, and the gradients vanished or exploded if you tried to train networks any deeper.

To be clear, in 1999, I did think of AGIs as needing to do a lot of learning; but I expected them to be learning while thinking, not to learn in a separate gradient descent phase.

OpenPhil:  How could anybody possibly miss anything so obvious?  There's so many basic technical ideas and even philosophical ideas about how you do AI which make it supremely obvious that the best and only way to turn computation into intelligence is to have deep nets, lots of parameters, and enormous separate training phases on TPU pods.

Eliezer:  Yes, well, see, those philosophical ideas were not as prominent in 1988, which is why the direction of the future paradigm shift was not predictable in advance without benefit of hindsight, let alone timeable to 2006.

You're also probably overestimating how much those philosophical ideas would pinpoint the modern paradigm of gradient descent even if you had accepted them wholeheartedly, in 1988.  Or let's consider, say, October 2006, when the Netflix Prize was being run - a watershed occasion where lots of programmers around the world tried their hand at minimizing a loss function, based on a huge-for-the-times 'training set' that had been publicly released, scored on a holdout 'test set'.  You could say it was the first moment in the limelight for the sort of problem setup that everybody now takes for granted with ML research: a widely shared dataset, a heldout test set, a loss function to be minimized, prestige for advancing the 'state of the art'.  And it was a million dollars, which, back in 2006, was big money for a machine learning prize, garnering lots of interest from competent competitors.

Before deep learning, "statistical learning" was indeed a banner often carried by the early advocates of the view that Richard Sutton now calls the Bitter Lesson, along the lines of "complicated programming of human ideas doesn't work, you have to just learn from massive amounts of data".

But before deep learning - which was barely getting started in 2006 - "statistical learning" methods that took in massive amounts of data, did not use those massive amounts of data to train neural networks by stochastic gradient descent across millions of examples!  In 2007, the winning submission to the Netflix Prize was an ensemble predictor that incorporated k-Nearest-Neighbor, a factorization method that repeatedly globally minimized squared error, two-layer Restricted Boltzmann Machines, and a regression model akin to Principal Components Analysis.  Which is all 100% statistical learning driven by relatively-big-for-the-time "big data", and 0% GOFAI.  But these methods didn't involve enormous massive training phases in the modern sense.

Back then, if you were doing stochastic gradient descent at all, you were doing it on a much smaller neural network.  Not so much because you couldn't afford more compute for a larger neural network, but because wider neural networks didn't help you much and deeper neural networks simply didn't work.

Bleeding-edge statistical learning techniques as late as 2007, to make actual use of big data, had to find other ways to make use of huge amounts of data than gradient descent and backpropagation.  Though, I mean, not huge amounts of data by modern standards.  The winning submission to the Netflix Prize used an ensemble of 107 models - that's not a misprint for 10^7, I actually mean 107 - which models were drawn from half a dozen different model classes, then proliferated with slightly different parameters, and averaged together to reduce statistical noise.

A modern kid, perhaps, looks at this and thinks:  "If you can afford the compute to train 107 models, why not just train one larger model?"  But back then, you see, there just wasn't a standard way to dump massively more compute into something, and get better results back out.  The fact that they had 107 differently parameterized models from a half-dozen families averaged together to reduce noise, was about as well as anyone could do in 2007, at putting more effort in and getting better results back out.

OpenPhil:  How quaint and archaic!  But that was 13 years ago, before time actually got started and history actually started happening in real life.  Now we've got the paradigm which will actually be used to create AGI, in all probability; so estimation methods centered on that paradigm should be valid.

Eliezer:  The current paradigm is definitely not the end of the line in principle.  I guarantee you that the way superintelligences build cognitive engines is not by training enormous neural networks using gradient descent.  Gua-ran-tee it.

The fact that you think you now see a path to AGI, is because today - unlike in 2006 - you have a paradigm that is seemingly willing to entertain having more and more food stuffed down its throat without obvious limit (yet).  This is really a quite recent paradigm shift, though, and it is probably not the most efficient possible way to consume more and more food.

You could rather strongly guess, early on, that support vector machines were never going to give you AGI, because you couldn't dump more and more compute into training or running SVMs and get arbitrarily better answers; whatever gave you AGI would have to be something else that could eat more compute productively.

Similarly, since the path through genetic algorithms and recapitulating the whole evolutionary history would have taken a lot of compute, it's no wonder that other, more efficient methods of eating compute were developed before then; it was obvious in advance that they must exist, for all that some what-iffed otherwise.

To be clear, it is certain the world will end by more inefficient methods than those that superintelligences would use; since, if superintelligences are making their own AI systems, then the world has already ended.

And it is possible, even, that the world will end by a method as inefficient as gradient descent.  But if so, that will be because the world ended too soon for any more efficient paradigm to be developed.  Which, on my model, means the world probably ended before say 2040(???).  But of course, compared to how much I think I know about what must be more efficiently doable in principle, I think I know far less about the speed of accumulation of real knowledge (not to be confused with proliferation of publications), or how various random-to-me social phenomena could influence the speed of knowledge.  So I think I have far less ability to say a confident thing about the timing of the next paradigm shift in AI, compared to the existence and eventuality of such paradigms in the space of possibilities.

OpenPhil:  But if you expect the next paradigm shift to happen in around 2040, shouldn't you confidently predict that AGI has to arrive after 2040, because, without that paradigm shift, we'd have to produce AGI using deep learning paradigms, and in that case our own calculation would apply saying that 2040 is relatively early?

Eliezer:  No, because I'd consider, say, improved mixture-of-experts techniques that actually work, to be very much within the deep learning paradigm; and even a relatively small paradigm shift like that would obviate your calculations, if it produced a more drastic speedup than halving the computational cost over two years.

More importantly, I simply don't believe in your attempt to calculate a figure of 10,000,000,000,000,000 operations per second for a brain-equivalent deepnet based on biological analogies, or your figure of 10,000,000,000,000 training updates for it.  I simply don't believe in it at all.  I don't think it's a valid anchor.  I don't think it should be used as the median point of a wide uncertain distribution.  The first-developed AGI will consume computation in a different fashion, much as it eats energy in a different fashion; and "how much computation an AGI needs to eat compared to a human brain" and "how many watts an AGI needs to eat compared to a human brain" are equally always decreasing with the technology and science of the day.

OpenPhil:  Doesn't our calculation at least provide a soft upper bound on how much computation is required to produce human-level intelligence?  If a calculation is able to produce an upper bound on a variable, how can it be uninformative about that variable?

Eliezer:  You assume that the architecture you're describing can, in fact, work at all to produce human intelligence.  This itself strikes me as not only tentative but probably false.  I mostly suspect that if you take the exact GPT architecture, scale it up to what you calculate as human-sized, and start training it using current gradient descent techniques... what mostly happens is that it saturates and asymptotes its loss function at not very far beyond the GPT-3 level - say, it behaves like GPT-4 would, but not much better.

This is what should have been told to Moravec:  "Sorry, even if your biology is correct, the assumption that future people can put in X amount of compute and get out Y result is not something you really know."  And that point did in fact just completely trash his ability to predict and time the future.

The same must be said to you.  Your model contains supposedly known parameters, "how much computation an AGI must eat per second, and how many parameters must be in the trainable model for that, and how many examples are needed to train those parameters".  Relative to whatever method is actually first used to produce AGI, I expect your estimates to be wildly inapplicable, as wrong as Moravec was about thinking in terms of just using one supercomputer powerful enough to be a brain.  Your parameter estimates may not be about properties that the first successful AGI design even has.  Why, what if it contains a significant component that isn't a neural network?  I realize this may be scarcely conceivable to somebody from the present generation, but the world was not always as it is now, and it will change if it does not end.

OpenPhil:  I don't understand how some of your reasoning could be internally consistent even on its own terms.  If, according to you, our 2050 estimate doesn't provide a soft upper bound on AGI arrival times - or rather, if our 2050-centered probability distribution isn't a soft upper bound on reasonable AGI arrival probability distributions - then I don't see how you can claim that the 2050-centered distribution is predictably a directional overestimate.

You can either say that our forecasted pathway to AGI or something very much like it would probably work in principle without requiring very much more computation than our uncertain model components take into account, meaning that the probability distribution provides a soft upper bound on reasonably-estimable arrival times, but that paradigm shifts will predictably provide an even faster way to do it before then.  That is, you could say that our estimate is both a soft upper bound and also a directional overestimate.  Or, you could say that our ignorance of how to create AI will consume more than one order-of-magnitude of increased computation cost above biology -

Eliezer:  Indeed, much as your whole proposal would supposedly cost ten trillion times the equivalent computation of the single human brain that earlier biologically-inspired estimates anchored on.

OpenPhil:  - in which case our 2050-centered distribution is not a good soft upper bound, but also not predictably a directional overestimate.  Don't you have to pick one or the other as a critique, there?

Eliezer:  Mmm... there's some justice to that, now that I've come to write out this part of the dialogue.  Okay, let me revise my earlier stated opinion:  I think that your biological estimate is a trick that never works and, on its own terms, would tell us very little about AGI arrival times at all.  Separately, I think from my own model that your timeline distributions happen to be too long.

OpenPhil:  Eliezer.

Eliezer:  I mean, in fact, part of my actual sense of indignation at this whole affair, is the way that Platt's law of strong AI forecasts - which was in the 1980s generalizing "thirty years" as the time that ends up sounding "reasonable" to would-be forecasters - is still exactly in effect for what ends up sounding "reasonable" to would-be futurists, in fricking 2020 while the air is filling up with AI smoke in the silence of nonexistent fire alarms.

But to put this in terms that maybe possibly you'd find persuasive:

The last paradigm shifts were from "write a chess program that searches a search tree and run it, and that's how AI eats computing power" to "use millions of data samples, but not in a way that requires a huge separate training phase" to "train a huge network for zillions of gradient descent updates and then run it".  This new paradigm costs a lot more compute, but (small) large amounts of compute are now available so people are using them; and this new paradigm saves on programmer labor, and more importantly the need for programmer knowledge.

I say with surety that this is not the last possible paradigm shift.  And furthermore, the Stack More Layers paradigm has already reduced need for knowledge by what seems like a pretty large bite out of all the possible knowledge that could be thrown away.

So, you might then argue, the world-ending AGI seems more likely to incorporate more knowledge and less brute force, which moves the correct sort of timeline estimate further away from the direction of "cost to recapitulate all evolutionary history as pure blind search without even the guidance of gradient descent" and more toward the direction of "computational cost of one brain, if you could just make a single brain".

That is, you can think of there as being two biological estimates to anchor on, not just one.  You can imagine there being a balance that shifts over time from "the computational cost for evolutionary biology to invent brains" to "the computational cost to run one biological brain".

In 1960, maybe, they knew so little about how brains worked that, if you gave them a hypercomputer, the cheapest way they could quickly get AGI out of the hypercomputer using just their current knowledge, would be to run a massive evolutionary tournament over computer programs until they found smart ones, using 10^43 operations.

Today, you know about gradient descent, which finds programs more efficiently than genetic hill-climbing does; so the balance of how much hypercomputation you'd need to use to get general intelligence using just your own personal knowledge, has shifted ten orders of magnitude away from the computational cost of evolutionary history and towards the lower bound of the computation used by one brain.  In the future, this balance will predictably swing even further towards Moravec's biological anchor, further away from Somebody on the Internet's biological anchor.

I admit, from my perspective this is nothing but a clever argument that tries to persuade people who are making errors that can't all be corrected by me, so that they can make mostly the same errors but get a slightly better answer.  In my own mind I tend to contemplate the Textbook from the Future, which would tell us how to build AI on a home computer from 1995, as my anchor of 'where can progress go', rather than looking to the brain of all computing devices for inspiration.

But, if you insist on the error of anchoring on biology, you could perhaps do better by seeing a spectrum between two bad anchors.  This lets you notice a changing reality, at all, which is why I regard it as a helpful thing to say to you and not a pure persuasive superweapon of unsound argument.  Instead of just fixating on one bad anchor, the hybrid of biological anchoring with whatever knowledge you currently have about optimization, you can notice how reality seems to be shifting between two biological bad anchors over time, and so have an eye on the changing reality at all.  Your new estimate in terms of gradient descent is stepping away from evolutionary computation and toward the individual-brain estimate by ten orders of magnitude, using the fact that you now know a little more about optimization than natural selection knew; and now that you can see the change in reality over time, in terms of the two anchors, you can wonder if there are more shifts ahead.

Realistically, though, I would not recommend eyeballing how much more knowledge you'd think you'd need to get even larger shifts, as some function of time, before that line crosses the hardware line.  Some researchers may already know Thielian secrets you do not, that take those researchers further toward the individual-brain computational cost (if you insist on seeing it that way).  That's the direction that economics rewards innovators for moving in, and you don't know everything the innovators know in their labs.

When big inventions finally hit the world as newspaper headlines, the people two years before that happens are often declaring it to be fifty years away; and others, of course, are declaring it to be two years away, fifty years before headlines.  Timing things is quite hard even when you think you are being clever; and cleverly having two biological anchors and eyeballing Reality's movement between them, is not the sort of cleverness that gives you good timing information in real life.

In real life, Reality goes off and does something else instead, and the Future does not look in that much detail like the futurists predicted.  In real life, we come back again to the same wiser-but-sadder conclusion given at the start, that in fact the Future is quite hard to foresee - especially when you are not on literally the world's leading edge of technical knowledge about it, but really even then.  If you don't think you know any Thielian secrets about timing, you should just figure that you need a general policy which doesn't get more than two years of warning, or not even that much if you aren't closely non-dismissively analyzing warning signs.

OpenPhil:  We do consider in our report the many ways that our estimates could be wrong, and show multiple ways of producing biologically inspired estimates that give different results.  Does that give us any credit for good epistemology, on your view?

Eliezer:  I wish I could say that it probably beats showing a single estimate, in terms of its impact on the reader.  But in fact, writing a huge careful Very Serious Report like that and snowing the reader under with Alternative Calculations is probably going to cause them to give more authority to the whole thing.  It's all very well to note the Ways I Could Be Wrong and to confess one's Uncertainty, but you did not actually reach the conclusion, "And that's enough uncertainty and potential error that we should throw out this whole deal and start over," and that's the conclusion you needed to reach.

OpenPhil:  It's not clear to us what better way you think exists of arriving at an estimate, compared to the methodology we used - in which we do consider many possible uncertainties and several ways of generating probability distributions, and try to combine them together into a final estimate.  A Bayesian needs a probability distribution from somewhere, right?

Eliezer:  If somebody had calculated that it currently required an IQ of 200 to destroy the world, that the smartest current humans had an IQ of around 190, and that the world would therefore start to be destroyable in fifteen years according to Moore's Law of Mad Science - then, even assuming Moore's Law of Mad Science to actually hold, the part where they throw in an estimated current IQ of 200 as necessary is complete garbage.  It is not the sort of mistake that can be repaired, either.  No, not even by considering many ways you could be wrong about the IQ required, or considering many alternative different ways of estimating present-day people's IQs.

The correct thing to do with the entire model is chuck it out the window so it doesn't exert an undue influence on your actual thinking, where any influence of that model is an undue one.  And then you just should not expect good advance timing info until the end is in sight, from whatever thought process you adopt instead.

OpenPhil:  What if, uh, somebody knows a Thielian secret, or has... narrowed the rivers of their knowledge to closer to reality's tracks?  We're not sure exactly what's supposed to be allowed, on your worldview; but wasn't there something at the beginning about how, when you're unsure, you should be careful about criticizing people who are more unsure than you?

Eliezer:  Hopefully those people are also able to tell you bold predictions about the nearer-term future, or at least say anything about what the future looks like before the whole world ends.  I mean, you don't want to go around proclaiming that, because you don't know something, nobody else can know it either.  But timing is, in real life, really hard as a prediction task, so, like... I'd expect them to be able to predict a bunch of stuff before the final hours of their prophecy?

OpenPhil:  We're... not sure we see that?  We may have made an estimate, but we didn't make a narrow estimate.  We gave a relatively wide probability distribution as such things go, so it doesn't seem like a great feat of timing that requires us to also be able to predict the near-term future in detail too?

Doesn't your implicit probability distribution have a median?  Why don't you also need to be able to predict all kinds of near-term stuff if you have a probability distribution with a median in it?

Eliezer:  I literally have not tried to force my brain to give me a median year on this - not that this is a defense, because I still have some implicit probability distribution, or, to the extent I don't act like I do, I must be acting incoherently in self-defeating ways.  But still: I feel like you should probably have nearer-term bold predictions if your model is supposedly so solid, so concentrated as a flow of uncertainty, that it's coming up to you and whispering numbers like "2050" even as the median of a broad distribution.  I mean, if you have a model that can actually, like, calculate stuff like that, and is actually bound to the world as a truth.

If you are an aspiring Bayesian, perhaps, you may try to reckon your uncertainty into the form of a probability distribution, even when you face "structural uncertainty" as we sometimes call it.  Or if you know the laws of coherence, you will acknowledge that your planning and your actions are implicitly showing signs of weighing some paths through time more than others, and hence display probability-estimating behavior whether you like to acknowledge that or not.

But if you are a wise aspiring Bayesian, you will admit that whatever probabilities you are using, they are, in a sense, intuitive, and you just don't expect them to be all that good.  Because the timing problem you are facing is a really hard one, and humans are not going to be great at it - not until the end is near, and maybe not even then.

That - not "you didn't consider enough alternative calculations of your target figures" - is what should've been replied to Moravec in 1988, if you could go back and tell him where his reasoning had gone wrong, and how he might have reasoned differently based on what he actually knew at the time.  That reply I now give to you, unchanged.

Humbali:  And I'm back!  Sorry, I had to take a lunch break.  Let me quickly review some of this recent content; though, while I'm doing that, I'll go ahead and give you what I'm pretty sure will be my reaction to it:

Ah, but here is a point that you seem to have not considered at all, namely: what if you're wrong?

Eliezer:  That, Humbali, is a thing that should be said mainly to children, of whatever biological wall-clock age, who've never considered at all the possibility that they might be wrong, and who will genuinely benefit from asking themselves that.  It is not something that should often be said between grownups of whatever age, as I define what it means to be a grownup.  You will mark that I did not at any point say those words to Imaginary Moravec or Imaginary OpenPhil; it is not a good thing for grownups to say to each other, or to think to themselves in Tones of Great Significance (as opposed to as a routine check).

It is very easy to worry that one might be wrong.  Being able to see the direction in which one is probably wrong is rather a more difficult affair.  And even after we see a probable directional error and update our views, the objection, "But what if you're wrong?" will sound just as forceful as before.  For this reason do I say that such a thing should not be said between grownups -

Humbali:  Okay, done reading now!  Hm...  So it seems to me that the possibility that you are wrong, considered in full generality and without adding any other assumptions, should produce a directional shift from your viewpoint towards OpenPhil's viewpoint.

Eliezer (sighing):  And how did you end up being under the impression that this could possibly be a sort of thing that was true?

Humbali:  Well, I get the impression that you have timelines shorter than OpenPhil's timelines.  Is this devastating accusation true?

Eliezer:  I consider naming particular years to be a cognitively harmful sort of activity; I have refrained from trying to translate my brain's native intuitions about this into probabilities, for fear that my verbalized probabilities will be stupider than my intuitions if I try to put weight on them.  What feelings I do have, I worry may be unwise to voice; AGI timelines, in my own experience, are not great for one's mental health, and I worry that other people seem to have weaker immune systems than even my own.  But I suppose I cannot but acknowledge that my outward behavior seems to reveal a distribution whose median seems to fall well before 2050.

Humbali:  Okay, so you're more confident about your AGI beliefs, and OpenPhil is less confident.  Therefore, to the extent that you might be wrong, the world is going to look more like OpenPhil's forecasts of how the future will probably look, like world GDP doubling over four years before the first time it doubles over one year, and so on.

Eliezer:  You're going to have to explain some of the intervening steps in that line of 'reasoning', if it may be termed as such.

Humbali:  I feel surprised that I should have to explain this to somebody who supposedly knows probability theory.  If you put higher probabilities on AGI arriving in the years before 2050, then, on average, you're concentrating more probability into each year that AGI might possibly arrive, than OpenPhil does.  Your probability distribution has lower entropy.  We can literally just calculate out that part, if you don't believe me.  So to the extent that you're wrong, it should shift your probability distributions in the direction of maximum entropy.
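As a minimal sketch of the calculation Humbali alludes to, the following Python compares the Shannon entropy of two purely hypothetical forecasts discretized by year: one spread uniformly over 30 years and one spread uniformly over 100 years. The function names, the year counts, and the uniform shapes are illustrative assumptions, not anyone's actual distribution. The sketch only confirms the narrow arithmetic claim that concentrating probability into fewer years lowers entropy; it says nothing about which way a forecast ought to shift.

```python
import math

def shannon_entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Zero-probability outcomes contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def uniform_over_years(n_years):
    """Hypothetical forecast: equal probability on each of n_years years."""
    return [1.0 / n_years] * n_years

# Assumed, illustrative forecasts (not taken from the dialogue).
narrow = uniform_over_years(30)
wide = uniform_over_years(100)

entropy_narrow = shannon_entropy_bits(narrow)
entropy_wide = shannon_entropy_bits(wide)
print(f"narrow: {entropy_narrow:.2f} bits, wide: {entropy_wide:.2f} bits")
```

For a uniform distribution over n years the entropy is simply log2(n), so the wider forecast necessarily scores higher.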

Eliezer:  It's things like this that make me worry about whether that extreme cryptivist view would be correct, in which normal modern-day Earth intellectuals are literally not smart enough - in a sense that includes the Cognitive Reflection Test and other things we don't know how to measure yet, not just raw IQ - to be taught more advanced ideas from my own home planet, like Bayes's Rule and the concept of the entropy of a probability distribution.  Maybe it does them net harm by giving them more advanced tools they can use to shoot themselves in the foot, since it causes an explosion in the total possible complexity of the argument paths they can consider and be fooled by, which may now contain words like 'maximum entropy'.

Humbali:  If you're done being vaguely condescending, perhaps you could condescend specifically to refute my argument, which seems to me to be airtight; my math is not wrong and it means what I claim it means.

Eliezer:  The audience is herewith invited to first try refuting Humbali on their own; grandpa is, in actuality and not just as a literary premise, getting older, and was never that physically healthy in the first place.  If the next generation does not learn how to do this work without grandpa hovering over their shoulders and prompting them, grandpa cannot do all the work himself.  There is an infinite supply of slightly different wrong arguments for me to be forced to refute, and that road does not seem, in practice, to have an end.

Humbali:  Or perhaps it's you that needs refuting.

Eliezer, smiling:  That does seem like the sort of thing I'd do, wouldn't it?  Pick out a case where the other party in the dialogue had made a valid point, and then ask my readers to disprove it, in case they weren't paying proper attention?  For indeed in a case like this, one first backs up and asks oneself "Is Humbali right or not?" and not "How can I prove Humbali wrong?"

But now the reader should stop and contemplate that, if they are going to contemplate that at all:

Is Humbali right that generic uncertainty about maybe being wrong, without other extra premises, should increase the entropy of one's probability distribution over AGI, thereby moving out its median further away in time?

Humbali:  Are you done?

Eliezer:  Hopefully so.  I can't see how else I'd prompt the reader to stop and think and come up with their own answer first.

Humbali:  Then what is the supposed flaw in my argument, if there is one?

Eliezer:  As usual, when people are seeing only their preferred possible use of an argumentative superweapon like 'What if you're wrong?', the flaw can be exposed by showing that the argument Proves Too Much.  If you forecasted AGI with a probability distribution with a median arrival time of 50,000 years from now*, would that be very unconfident?

(*) Based perhaps on an ignorance prior for how long it takes for a sapient species to build AGI after it emerges, where we've observed so far that it must take at least 50,000 years, and our updated estimate says that it probably takes around as much longer again.

Humbali:  Of course; the math says so.  Though I think that would be a little too unconfident - we do have some knowledge about how AGI might be created.  So my answer is that, yes, this probability distribution is higher-entropy, but that it reflects too little confidence even for me.

I think you're crazy overconfident, yourself, and in a way that I find personally distasteful to boot, but that doesn't mean I advocate zero confidence.  I try to be less arrogant than you, but my best estimate of what my own eyes will see over the next minute is not a maximum-entropy distribution over visual snow.  AGI happening sometime in the next century, with a median arrival time of maybe 30 years out, strikes me as being about as confident as somebody should reasonably be.

Eliezer:  Oh, really now.  I think if somebody sauntered up to you and said they put 99% probability on AGI not occurring within the next 1,000 years - which is the sort of thing a median distance of 50,000 years tends to imply - I think you would, in fact, accuse them of brash overconfidence about staking 99% probability on that.
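The rough size of that "99%" can be checked with a short sketch under two assumed models, neither of which the dialogue specifies beyond the footnote's ignorance prior: a Gott-style prior in which the chance of waiting more than t further years, after T years have already passed, is T / (T + t); and a constant-hazard (exponential) wait with the same 50,000-year median. The function names and the choice of these two models are assumptions made for illustration. Both put the chance of arrival within the next 1,000 years at roughly 1 to 2 percent.

```python
T_PAST_YEARS = 50_000   # time observed so far without AGI (from the footnote)
MEDIAN_YEARS = 50_000   # median further wait under the ignorance prior
HORIZON_YEARS = 1_000   # the "next 1,000 years" window

def p_within_gott(horizon, t_past):
    """P(event within `horizon`) assuming P(wait > t) = t_past / (t_past + t)."""
    return horizon / (t_past + horizon)

def p_within_exponential(horizon, median):
    """P(event within `horizon`) for a constant-hazard wait with this median."""
    return 1.0 - 0.5 ** (horizon / median)

p_gott = p_within_gott(HORIZON_YEARS, T_PAST_YEARS)
p_exp = p_within_exponential(HORIZON_YEARS, MEDIAN_YEARS)
print(f"Gott-style prior: {p_gott:.1%}; exponential: {p_exp:.1%}")
```

Under the Gott-style model the median further wait equals the time already elapsed, which is why a 50,000-year history yields a 50,000-year median.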

Humbali:  Hmmm.  I want to deny that - I have a strong suspicion that you're leading me down a garden path here - but I do have to admit that if somebody walked up to me and declared only a 1% probability that AGI arrives in the next millennium, I would say they were being overconfident and not just too uncertain.

Now that you put it that way, I think I'd say that somebody with a wide probability distribution over AGI arrival spread over the next century, with a median in 30 years, is in realistic terms about as uncertain as anybody could possibly be?  If you spread it out more than that, you'd be declaring that AGI probably wouldn't happen in the next 30 years, which seems overconfident; and if you spread it out less than that, you'd be declaring that AGI probably would happen within the next 30 years, which also seems overconfident.

Eliezer:  Uh huh.  And to the extent that I am myself uncertain about my own brashly arrogant and overconfident views, I should have a view that looks more like your view instead?

Humbali:  Well, yes!  To the extent that you are, yourself, less than totally certain of your own model, you should revert to this most ignorant possible viewpoint as a base rate.

Eliezer:  And if my own viewpoint should happen to regard your probability distribution putting its median on 2050 as just one more guesstimate among many others, with this particular guess based on wrong reasoning that I have justly rejected?

Humbali:  Then you'd be overconfident, obviously.  See, you don't get it, what I'm presenting is not just one candidate way of thinking about the problem, it's the base rate that other people should fall back on to the extent they are not completely confident in their own ways of thinking about the problem, which impose extra assumptions over and above the assumptions that seem natural and obvious to me.  I just can't understand the incredible arrogance it takes to be so utterly certain in your own exact estimate that you don't revert it even a little bit towards mine.

I don't suppose you're going to claim to me that you first constructed an even more confident first-order estimate, and then reverted it towards the natural base rate in order to arrive at a more humble second-order estimate?

Eliezer:  Ha!  No.  Not that base rate, anyways.  I try to shift my AGI timelines a little further out because I've observed that actual Time seems to run slower than my attempts to eyeball it.  I did not shift my timelines out towards 2050 in particular, nor did reading OpenPhil's report on AI timelines influence my first-order or second-order estimate at all, in the slightest; no more than I updated the slightest bit back when I read the estimate of 10^43 ops or 10^46 ops or whatever it was to recapitulate evolutionary history.

Humbali:  Then I can't imagine how you could possibly be so perfectly confident that you're right and everyone else is wrong.  Shouldn't you at least revert your viewpoints some toward what other people think?

Eliezer:  Like, what the person on the street thinks, if we poll them about their expected AGI arrival times?  Though of course I'd have to poll everybody on Earth, not just the special case of developed countries, if I thought that a respect for somebody's personhood implied deference to their opinions.

Humbali:  Good heavens, no!  I mean you should revert towards the opinion, either of myself, or of the set of people I hang out with and who are able to exert a sort of unspoken peer pressure on me; that is the natural reference class to which less confident opinions ought to revert, and any other reference class is special pleading.

And before you jump on me about being arrogant myself, let me say that I definitely regressed my own estimate in the direction of the estimates of the sort of people I hang out with and instinctively regard as fellow tribesmembers of slightly higher status, or "credible" as I like to call them.  Although it happens that those people's opinions were about evenly distributed to both sides of my own - maybe not statistically exactly for the population, I wasn't keeping exact track, but in their availability to my memory, definitely, other people had opinions on both sides of my own - so it didn't move my median much.  But so it sometimes goes!

But these other people's credible opinions definitely hang emphatically to one side of your opinions, so your opinions should regress at least a little in that direction!  Your self-confessed failure to do this at all reveals a ridiculous arrogance.

Eliezer:  Well, I mean, in fact, from my perspective, even my complete-idiot sixteen-year-old self managed to notice that AGI was going to be a big deal, many years before various others had been hit over the head with a large-enough amount of evidence that even they started to notice.  I was walking almost alone back then.  And I still largely see myself as walking alone now, as accords with the Law of Continued Failure:  If I was going to be living in a world of sensible people in this future, I should have been living in a sensible world already in my past.

Since the early days more people have caught up to earlier milestones along my way, enough to start publicly arguing with me about the further steps, but I don't consider them to have caught up; they are moving slower than I am still moving now, as I see it.  My actual work these days seems to consist mainly of trying to persuade allegedly smart people to not fling themselves directly into lava pits.  If at some point I start regarding you as my epistemic peer, I'll let you know.  For now, while I endeavor to be swayable by arguments, your existence alone is not an argument unto me.

If you choose to define that with your word "arrogance", I shall shrug and not bother to dispute it.  Such appellations are beneath My concern.

Humbali:  Fine, you admit you're arrogant - though I don't understand how that's not just admitting you're irrational and wrong -

Eliezer:  They're different words that, in fact, mean different things, in their semantics and not just their surfaces.  I do not usually advise people to contemplate the mere meanings of words, but perhaps you would be well-served to do so in this case.

Humbali:  - but if you're not infinitely arrogant, you should be quantitatively updating at least a little towards other people's positions!

Eliezer:  You do realize that OpenPhil itself hasn't always existed?  That they are not the only "other people" that there are?  An ancient elder like myself, who has seen many seasons turn, might think of many other possible targets toward which he should arguably regress his estimates, if he was going to start deferring to others' opinions this late in his lifespan.

Humbali:  You haven't existed through infinite time either!

Eliezer:  A glance at the history books should confirm that I was not around, yes, and events went accordingly poorly.

Humbali:  So then... why aren't you regressing your opinions at least a little in the direction of OpenPhil's?  I just don't understand this apparently infinite self-confidence.

Eliezer:  The fact that I have credible intervals around my own unspoken median - that I confess I might be wrong in either direction, around my intuitive sense of how long events might take - doesn't count for my being less than infinitely self-confident, on your view?

Humbali:  No.  You're expressing absolute certainty in your underlying epistemology and your entire probability distribution, by not reverting it even a little in the direction of the reasonable people's probability distribution, which is the one that's the obvious base rate and doesn't contain all the special other stuff somebody would have to tack on to get your probability estimate.

Eliezer:  Right then.  Well, that's a wrap, and maybe at some future point I'll talk about the increasingly lost skill of perspective-taking.

OpenPhil:  Excuse us, we have a final question.  You're not claiming that we argue like Humbali, are you?

Eliezer:  Good heavens, no!  That's why "Humbali" is presented as a separate dialogue character and the "OpenPhil" dialogue character says nothing of the sort.  Though I did meet one EA recently who seemed puzzled and even offended about how I wasn't regressing my opinions towards OpenPhil's opinions to whatever extent I wasn't totally confident, which brought this to mind as a meta-level point that needed making.

OpenPhil:  "One EA you met recently" is not something that you should hold against OpenPhil.  We haven't organizationally endorsed arguments like Humbali's, any more than you've ever argued that "we have to take AGI risk seriously even if there's only a tiny chance of it" or similar crazy things that other people hallucinate you arguing.

Eliezer:  I fully agree.  That Humbali sees himself as defending OpenPhil is not to be taken as associating his opinions with those of OpenPhil; just like how people who helpfully try to defend MIRI by saying "Well, but even if there's a tiny chance..." are not thereby making their epistemic sins into mine.

The whole thing with Humbali is a separate long battle that I've been fighting.  OpenPhil seems to have been keeping its communication about AI timelines mostly to the object level, so far as I can tell; and that is a more proper and dignified stance than I've assumed here.


### Open & Welcome Thread December 2021

December 1, 2021 - 22:57
Published on December 1, 2021 7:57 PM GMT

(I saw December didn't have one. Please let me know if any of the links are broken or could be improved.

Copied from before: "To whoever comes after me: Yoav Ravid comments that the wording could use an update."

The wording could use revision. I've started a thread for that in the comments here.)

(The boilerplate:)

If it’s worth saying, but not worth its own post, here's a place to put it.

If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.

If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ*. If you want to orient to the content on the site, you can also check out the new Concepts section.

The Open Thread tag is here. The Open Thread sequence is here.

posted on 14th Jun 2019.

You can find it on the left-hand side of the front page if you scroll down. If you're on mobile, tapping the button in the upper left that has 3 horizontal lines opens the menu, which has it towards the bottom. If you're reading this on GreaterWrong, I don't know where the link would appear. The substitution version doesn't work, so here are some results I grabbed from searching "FAQ" there and scrolling down:

Aug 24, 2021

https://www.greaterwrong.com/posts/2rWKkWuPrgTMpLRbp/lesswrong-faq

Jun 14, 2019, 1:03 PM


### Viral Mutation, Pandemics and Social Response

December 1, 2021 - 21:37
Published on December 1, 2021 6:36 PM GMT

It seems that both historically and currently the first-order response is quarantine and travel restrictions to prevent potential transmission. That is pretty sensible, as we have plenty of reasons and plenty of evidence supporting the view that travel and trade spread diseases to new locations.

But is that really the same when we're talking about mutations of an existing virus that is widespread? I'm not entirely sure, hence the post. Hence the rather speculative post I should probably say.

I would imagine the starting point for such a question just might be "So just what is the mutation process?" That seems to be somewhat poorly understood -- but I'm hardly well informed here.

Mutations come in many forms but we might put them into one of three buckets: "advantage" gained, no effect (neutral mutation?) and  "this just broke everything". The first is really what most are concerned with here and what the Delta and Omicron mutations are really about.

In general it seems these are largely taken as just random events. I think that view is what underlies the idea that one can (in theory I suppose) trace these variants back to an original host from which the new dominant strain emerged.

The remarkable capacity of some viruses to adapt to new hosts and environments is highly dependent on their ability to generate de novo diversity in a short period of time. Rates of spontaneous mutation vary amply among viruses. RNA viruses mutate faster than DNA viruses, single-stranded viruses mutate faster than double-strand virus, and genome size appears to correlate negatively with mutation rate. Viral mutation rates are modulated at different levels, including polymerase fidelity, sequence context, template secondary structure, cellular microenvironment, replication mechanisms, proofreading, and access to post-replicative repair. Additionally, massive numbers of mutations can be introduced by some virus-encoded diversity-generating elements, as well as by host-encoded cytidine/adenine deaminases. Our current knowledge of viral mutation rates indicates that viral genetic diversity is determined by multiple virus- and host-dependent processes, and that viral mutation rates can evolve in response to specific selective pressures.

Emphasis added in the above quote. Huh? Viruses may actually have evolved structures that seek to produce mutations?

I have recently read a different article (forget the link but it was not a peer-reviewed publication yet but seemed like reasonable thinking) regarding the value of travel restrictions. Their model indicated that local spread will dominate any imported spread from travelers so such restrictions are not really effective or necessary. Makes sense . . . unless you're talking about a new version that is not local . . . unless the view that the new version only emerges in one place (p for simultaneous emergence separately near 0) is wrong.

We also tend to see a new variant of SARS-CoV-2 reported and then found in other locations -- generally to a lesser degree. That pattern certainly is consistent with a one-origin view. But we also hear about cases where no good connection exists. Isn't that also consistent with a virus that has structures designed to test mutations (and probably not entirely randomly) having just emerged around the same time locally? Is there perhaps a reason to think such mutations might well emerge on a similar timeline, rather than these just being random events where p(A and B) -- A and B being the locations seeing the mutation -- is basically 0?
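To make the p(A and B) question concrete, here is a minimal sketch under loudly stated assumptions: each replication event produces the mutation independently with the same tiny probability, and the two locations are independent of each other. The function names and all numbers are hypothetical illustrations, not estimates for any real virus.

```python
def p_emerges(per_event_prob: float, n_events: int) -> float:
    """Chance the mutation arises at least once in n independent replication events."""
    return 1.0 - (1.0 - per_event_prob) ** n_events


def p_emerges_in_both(per_event_prob: float, n_a: int, n_b: int) -> float:
    """Chance it arises independently in location A and in location B."""
    return p_emerges(per_event_prob, n_a) * p_emerges(per_event_prob, n_b)


# Hypothetical numbers: a one-in-a-billion event per replication.
rare = 1e-9
print(p_emerges_in_both(rare, 10**6, 10**6))    # few replications in each place
print(p_emerges_in_both(rare, 10**10, 10**10))  # very many replications in each place
```

Under these assumptions, whether p(A and B) is "basically 0" depends entirely on how the per-event probability compares with the number of replication events in each place, which is exactly the quantity the question above says we may not know.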

I want to say this question about a mutation emerging in multiple locations, rather than in just one and then spreading, has been touched on here, but I'm not certain (and have not tried searching). Regardless of that point though, can we really have a rational response to mutations without having some answers here? I'm not sure we have them, but that could be my ignorance. But if we actually do have reasonably good answers here, I would think an organization like WHO, which seems to be calling for lighter/no travel restrictions on SA and other countries reporting Omicron, should be pointing something like that out.


### Taking Clones Seriously

December 1, 2021 - 20:29
Published on December 1, 2021 5:29 PM GMT

What this is not: an exhortation to dig up Von Neumann and clone him

What this is: an exhortation to think seriously about how to decide whether to dig up Von Neumann and clone him

Having tried to clarify that I am not presently advocating for cloning, and am far more concerned with the meta level than the object level, let me now sink to the object level.

The argument for cloning

Suppose everything Eliezer Yudkowsky said in his recent discussions on Artificial General Intelligence (AGI) is entirely literally true, and we take it seriously.

Then right now we are a few decades (or perhaps years) away from the development of an unaligned AGI, which will swiftly conclude that humans are an unnecessary risk for whatever it wants.

The odds are not great, and it will take a miracle scientific breakthrough to save us. The best we can do is push as hard as we can to increase the possibility of that miracle breakthrough and our ability to make use of it.

The claim I really want to draw attention to is this particular chestnut:

Paul Christiano is trying to have real foundational ideas, and they're all wrong, but he's one of the few people trying to have foundational ideas at all; if we had another 10 of him, something might go right.

Well, why not? 10 of him, 10 Von Neumanns, and an extra Jaynes or two just in case. If we have a few decades, we have just enough time.

Leave aside all the obvious ways this could go horribly wrong, the risks and the costs: if what Eliezer says is true, it is conceivable that cloning could save the world.

The argument for the argument about clones

I am not saying this is a good idea, but I am saying it could be a good idea. And if it is a good idea, it would be really great to know that.

There are obvious problems, but I don't feel well-placed to judge how high the benefits might be, or how easily the problems might be resolved. And I am sufficiently uncertain in this regard that I think it might be valuable to become more certain.

A modest proposal?

I think OpenPhil, or somebody else in that sphere, should fund or conduct an investigation into cloning. Most basically, how much would it actually cost to clone somebody, and how much impact could that potentially have?

It may become immediately clear that this is prohibitively expensive relative to just hiring more of the very smartest people in the world, but if the limiting factor is not money then I think it would become worthwhile to ask: what actually are the ethical ramifications of cloning? Could it be done?

I would like to think that sufficiently neutral and exploratory research could be done in public, but it is conceivable that this must be conducted in secret.

As I see it, the most obvious types of arguments against this are:

• somehow it is just obviously the case that, no matter what, this couldn't conceivably turn out to be high impact (which I find quite unlikely)
• the opportunity cost of even investigating this is too high (which would surprise me but I would begrudgingly accept)
• such a study already exists, in secret (in which case I apologise)
• such a study already exists, in public (in which case I apologise even more, and would love to be directed to it)


### Hypotheses about Finding Knowledge and One-Shot Causal Entanglements

December 1, 2021 - 20:01
Published on December 1, 2021 5:01 PM GMT

Epistemic status: my own thoughts I've thought up in my own time. They may be quite or very wrong! I am likely not the first person to come to these ideas. All of my main points here are just hypotheses which I've come to by the reasoning stated below. Most of it is informal mathematical arguments about likely phenomena and none is rigorous proof. I might investigate them if I had the time/money/programming skills. Lots of my hypotheses are really long and difficult-to-parse sentences.

What is knowledge?

I think this question is bad.

It's too great of a challenge. It asks us (implicitly) for a mathematically rigorous definition which fits all of our human feelings about a very loaded word. This is often a doomed endeavour from the start, as human intuitions don't neatly map onto logic. Also, humans might disagree on what things count as or do not count as knowledge. So let's attempt to right this wrong question:

Imagine a given system is described as "knowing" something. What is the process that leads to the accumulation of said knowledge likely to look like?

I think this is much better.

We limit ourselves to systems which can definitely be said to "know" something. This allows us to pick a starting point. This might be a human, GPT-3, or a neural network which can tell apart dogs and fish. In fact the last of these will be my go-to example from here on. We also don't need to perfectly specify the process which generates knowledge all at once, only comment on its likely properties.

Properties of "Learning"

Consider θ(θ0; X; 0). This is trivially equal to θ0, and so it depends only on the choice of θ0. The dataset has had no chance to affect the parameters in any way.

So what about as t→∞? We would expect that θ∞(θ0; X)=θ(θ0; X; ∞) depends mostly on the choice of X and much less strongly on θ0. There will presumably be some dependency on initial conditions, especially for very complex models like a big neural network with many local minima. But mostly it's ω which influences θ.
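A toy illustration of the two limits, as a sketch only: the one-parameter linear model, the gradient-descent rule, and both datasets are assumptions chosen for simplicity, not anything the argument depends on.

```python
def train(theta0, dataset, steps, lr=0.1):
    """Return theta(theta0; X; t) after `steps` gradient-descent updates.

    Toy model (an assumption of this sketch): predict y as theta * x and
    minimise mean squared error over the whole dataset.
    """
    theta = theta0
    for _ in range(steps):
        grad = sum(2 * (theta * x - y) * x for x, y in dataset) / len(dataset)
        theta -= lr * grad
    return theta


X_a = [(1.0, 2.0), (2.0, 4.0)]  # hypothetical dataset generated by y = 2x
X_b = [(1.0, 3.0), (2.0, 6.0)]  # hypothetical dataset generated by y = 3x

print(train(-5.0, X_a, 0))    # t = 0: just theta0
print(train(-5.0, X_a, 200))  # large t, dataset X_a
print(train(10.0, X_a, 200))  # different theta0, same dataset
print(train(-5.0, X_b, 200))  # same theta0, different dataset
```

Because this toy loss has a single minimum, the dependence on θ0 vanishes completely at large t; a big network with many local minima would keep some of it, as noted above.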

So far this is just writing out basic sequences stuff. To make a map of the city you have to look at it, and to learn your model has to causally entangle itself with the dataset. But let's think about what happens when ω is slightly different.

Changes in the world

So far we've represented the whole dataset with a single letter X, as if it were just a number or something. But in reality it will have many, many independent parts. Most datasets which are used as inputs to learning processes are also highly structured.

Consider the dog-fish discriminator, trained on the dataset Xdog/fish. The system θ∞(θ0; Xdog/fish) could be said to have "knowledge" that "dogs have two eyes". One thing this means is that if we instead fed it an X which was identical except that every dog had three eyes (TED), then the final values of θ would be different. The same is true of facts like "fish have scales" and "dogs have one tail". We could express this as follows:

θ∞(θ0; Xdog/fish+ΔXTED)

Where ΔXTED is the modification of "photoshopping the dogs to have three eyes". We now have:

θ∞(θ0; Xdog/fish+ΔXTED)=θ∞(θ0; Xdog/fish)+Δθ∞(θ0; Xdog/fish; ΔXTED)

Now let's consider how Δθ∞(θ0; X; ΔX) behaves. For lots of choices of ΔX it might just be a series of random changes tuning the whole set of θ values. But from my knowledge of neural networks, it might not be. Lots of image recognizing networks have been found to contain neurons with specific functions which relate to structures in the data, from simple line detectors, all the way up to "cityscape" detectors.

For this reason I suggest the following hypothesis:

Structured and localized changes in the dataset that a parameterized learning system is exposed to will cause localized changes in the final values of the parameters.
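Here is a deliberately rigged sketch of what the hypothesis would look like if it held. The two-parameter linear model, the one-hot "dog"/"fish" features, and the target values are all assumptions made for illustration; nothing here shows that real networks behave this way.

```python
def train(theta0, dataset, steps=500, lr=0.1):
    """Full-batch gradient descent on mean squared error for y ~ theta . x."""
    theta = list(theta0)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in dataset:
            err = sum(t * xi for t, xi in zip(theta, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2 * err * xi / len(dataset)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


DOG, FISH = (1.0, 0.0), (0.0, 1.0)  # hypothetical one-hot features
X = [(DOG, 2.0), (DOG, 2.0), (FISH, 5.0), (FISH, 5.0)]
# The "three-eyed dog" edit: only the dog examples change.
X_ted = [(x, 3.0 if x == DOG else y) for x, y in X]

base = train((0.0, 0.0), X)
edited = train((0.0, 0.0), X_ted)
delta = [e - b for b, e in zip(base, edited)]
print(delta)  # change in each final parameter caused by the dataset edit
```

In this toy the edit moves only the parameter tied to the dog feature, because the features do not overlap. With overlapping features the change would spread, which is why the statement above is a hypothesis and not a theorem.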

Impracticalities and Solutions

Now it would be lovely to train all of GPT-3 twice, once with the original dataset, and once in a world where dogs are blue. Then we could see the exact parameters that lead it to return sentences like "the dog had [chocolate rather than azure] fur". Unfortunately rewriting the whole training dataset around this is just not going to happen.

Finding the flow of information and influence in a system is easy if you have a large distribution of different inputs and outputs (and a good idea of the direction of causality). If you have just a single example, you can't use any statistical tools at all.

So what else can we do? Well, we don't just have access to θ∞. In principle we could look at the course of the entire training process and how θ changes over time. For each timestep, and each element of the dataset X, we could record how much each element of θ is changed. We'll come back to this.

Let's consider the dataset as a function of the external world: X(Ω). All the language we've been using about knowledge has previously only applied to the dataset. Now we can describe how it applies to the world as a whole.

For some things the equivalence of knowledge of X and Ω is pretty obvious. If the dataset is being used for a self-driving car and it's just a bunch of pictures and videos then basically anything the resulting parameterised system knows about X it also knows about Ω. But for obscure manufactured datasets like [4000 pictures of dogs photoshopped to have three eyes] then it's really not clear.

Either way, we can think about Ω as having influence over X the same way as we can think about X as having influence over θ∞. So we might be able to form hypotheses about this whole process. Let's go back to Xdog/fish. First off imagine a change Ωnew=Ω+ΔΩ, such as "dogs have three eyes". This will change some elements of X more than others. Certain angles of dog photos, breeds of dogs, will be changed more. Photos of fish will stay the same!

Now we can imagine a function Δθ(θ0; X(Ω); ΔX(Ω; ΔΩ)). This represents some propagation of influence from Ω→X→θ. Note that the influence of Ω on X is independent of our training process or θ0. This makes sense because different bits of the training dataset contain information about different bits of the world. How different training methods extract this information might be less obvious.
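One way to read this chain of influence is as a first-order (linearized) approximation. This is an assumption added for illustration: it supposes that the final parameters are differentiable in X and that X is differentiable in Ω, which need not hold for real training runs or discrete datasets.

```latex
$$
\Delta X(\Omega; \Delta\Omega) \approx \frac{\partial X}{\partial \Omega}\,\Delta\Omega,
\qquad
\Delta\theta\big(\theta_0; X(\Omega); \Delta X\big) \approx
\left.\frac{\partial \theta_\infty}{\partial X}\right|_{\theta_0,\,X(\Omega)} \Delta X
$$

$$
\Delta\theta \approx
\underbrace{\left.\frac{\partial \theta_\infty}{\partial X}\right|_{\theta_0,\,X(\Omega)}}_{\text{depends on training and } \theta_0}
\;
\underbrace{\frac{\partial X}{\partial \Omega}}_{\text{independent of training}}
\,\Delta\Omega
$$
```

The second factor involves only the world and the data-collection process, which matches the observation that the influence of Ω on X is independent of the training process and of θ0.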

The Training Process

During training, θ(t) is exposed to various elements of X and updated. Different elements of X will update θ(t) by different amounts. Since the learning process is about transferring influence over θ from θ0 to Ω (acting via X), we might expect that for a given element of X, it has more "influence" over the final values of the elements of θ which were changed the most due to exposure to that particular element of X during training.

This leads us to a second hypothesis:

The degree to which an element of the dataset causes an element of the parameters to be updated during training is correlated with the degree to which a change to that dataset element would have caused a change in the final value of the parameter.

Which is equivalent to:

Knowledge of a specific property of the dataset is disproportionately concentrated in the elements of the final parameters that have been updated the most during training when "exposed" to certain dataset elements that have a lot of mutual information with that property.

For the dog-fish example: elements of parameter space which have updated disproportionately when exposed to photos of dogs that contain the dogs' heads (and therefore show just two eyes), will be more likely to contain "knowledge" of the fact that "dogs have two eyes".
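The bookkeeping this hypothesis calls for can be sketched as follows. Everything here is an assumption made for illustration: a two-parameter linear model, per-example SGD, a two-element dataset with slightly overlapping "dog" and "fish" features, and absolute step size as the measure of "how much an element was updated".

```python
def sgd_with_log(theta0, dataset, epochs=500, lr=0.05):
    """Per-example SGD that also records how far each example moved each parameter.

    moved[j][i] accumulates |update to theta[i]| caused by dataset element j.
    """
    theta = list(theta0)
    moved = [[0.0] * len(theta) for _ in dataset]
    for _ in range(epochs):
        for j, (x, y) in enumerate(dataset):
            err = sum(t * xi for t, xi in zip(theta, x)) - y
            for i, xi in enumerate(x):
                step = -lr * 2 * err * xi
                theta[i] += step
                moved[j][i] += abs(step)
    return theta, moved


# Hypothetical data: element 0 is "dog-like", element 1 is "fish-like".
X = [((1.0, 0.2), 2.0), ((0.2, 1.0), 5.0)]
X_edit = [((1.0, 0.2), 3.0), ((0.2, 1.0), 5.0)]  # counterfactual edit to element 0

theta, moved = sgd_with_log((0.0, 0.0), X)
theta_edit, _ = sgd_with_log((0.0, 0.0), X_edit)
delta = [b - a for a, b in zip(theta, theta_edit)]

print(moved[0])  # how much element 0 moved each parameter during training
print(delta)     # how much editing element 0 changes each final parameter
```

In this toy the parameter that element 0 moved the most during training is also the one that changes most under the counterfactual edit, which is the correlation the hypothesis predicts. A single rigged example is of course not evidence for the general claim.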

This naturally leads us to a final hypothesis:

Correlating update-size as a function of dataset-element across two models will allow us to identify subsets of parameters which contain the same knowledge across two very different models.

Therefore

Access to a simple interpreted model of a system will allow us to rapidly infer information about a much larger model of the same system if they are trained on the same datasets, and we have access to both training histories.
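A minimal sketch of the matching step, assuming we already have, for each parameter of each model, a profile of update sizes indexed by dataset element. The profile values, parameter names, and the choice of Pearson correlation are all hypothetical; the text above does not specify a particular statistic.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def match_parameters(profiles_small, profiles_large):
    """For each small-model parameter, pick the large-model parameter whose
    update-size profile across dataset elements correlates with it most."""
    return {
        name: max(profiles_large, key=lambda other: pearson(profile, profiles_large[other]))
        for name, profile in profiles_small.items()
    }


# Hypothetical update logs: one number per dataset element (two dogs, then two fish).
small = {"s0": [0.9, 0.8, 0.1, 0.0], "s1": [0.0, 0.1, 0.7, 0.9]}
large = {"L0": [0.1, 0.0, 1.5, 1.9], "L1": [2.0, 1.7, 0.2, 0.1], "L2": [0.5, 0.5, 0.5, 0.5]}

print(match_parameters(small, large))
```

If the small model has been interpreted (say, "s0" is known to respond to dogs), the matches give candidate locations for the same knowledge in the larger model, which would then need to be checked by other means.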

Motivation

I think an AI which takes over the world will have a very accurate model of human morality, it just won't care about it. I think that one way of getting the AI to not kill us is to extract parts of the human utility-function-value-system-decision-making-process-thing from its model and tell the AI to do those. I think that to do this we need to understand more about where exactly the "knowledge" is in an inscrutable model. I also find thinking about this very interesting.


### The Limits Of Medicine - Part 1 - Small Molecules

December 1, 2021 - 18:51
Published on December 1, 2021 3:51 PM GMT

Much of medicine relies on what is dubbed "small molecules".

"Molecules" since they are atoms tightly bound together, forming what is easily seen as a unitary whole, as opposed to e.g. lipoproteins, which are synergistic ensembles held together by much weaker forces (pun not intended).

"Small", in that they are, well... lightweight. But for all intents and purposes, size and weight don't separate quite the same way at the molecular level. So whatever.

This class includes drugs such as all NSAIDs, all antibiotics, all antihistamines, almost every single drug a psychologist or even a psychiatrist is allowed to prescribe, virtually all supplements, sleep aids, wakefulness aids, every single schedule I through III drug (i.e. fun and insightful drugs), almost all anesthetics, almost all anti-parasitic drugs, etc.

Unless you are old or chronically sick, it's likely that the only drug you've ever taken that hasn't been a small molecule is a vaccine. Indeed, drugs that aren't small molecules are so strange and rare we usually don't think of them as "drugs" but as a separate entity: vaccines, monoclonal antibodies, anabolic steroids.

But, like, the vast majority of molecules found in our body, and in all of organic life, are not classified as "small". And the vast majority of the things doing something interesting are not really molecules, but more so fuzzy complexes of molecules (ribosomes, lysosomes, lipoproteins, membranes).

So why are virtually all drugs small molecules? Prima facie, we'd expect most of them to be complexes made up of dozens to thousands of very large molecules.

The answer lies in several things:

• Easy to mass-produce
• Cheap to store
• Homogenous in effect
• Quick to act

Let's look at each of these aspects. None are unique to small molecules and not all small molecules bear all of them, but they are all traits significantly more likely to be found in small molecules.

Easy to produce

Producing a protein is hard. You have to get a gene sequence for the protein you want to produce, create some genetically modified organism (usually yeast) with a zillion copies of that gene, let it breed, extract the protein, and make super-duper-sure all potentially dangerous compounds are separated.

Along this process, you will have issues at every step, from errors in creating the DNA, to errors in creating the proteins to potential "errors" (mainly due to environmental contaminants) in how the protein folds and what exactly it contains, to errors at separation.

All of this is much harder and involves much more trial and error than working with simple compounds (e.g. most small molecules), for which we have some vague semblance of "laws" in the form of classical chemistry and biochemistry. Proteins are complex enough that predicting their behavior is often an inconclusive matter.

Small molecules, on the other hand, are often found in relevant quantities ready-made in organisms that are cheap to propagate (e.g. garden plants) and can be extracted using the most brutal of methods.

You can get a good approximation of thousands of life-saving drugs by the simple process of:

1. Break down a plant with your bare hands in tiny pieces (ideally use pestle and mortar)
2. Throw it in a bottle with gasoline and shake.
3. Put pipe cleaner (H₂SO₄) in another bottle and throw in some table salt (NaCl)
4. Connect the two with a tube, wait, pour in a pan, wait some more for gasoline to evaporate

Note: Don't try this at home, hydrochloric acid gas will melt your face and gasoline can explode, just buy your drugs from a reputable dealer (or pharmacist/doctor if you are really desperate and need a fix).

Ok, granted, most extractions are much more complex than this, but still, we've gone from vague hand waving with complex concepts like "building DNA" and "genetically modifying yeast" and jumping over 100 steps each of which took 10 PhDs to design to "here's a step by step 20 seconds guide to doing it in an ill-equipped kitchen".

In practice, for most small molecule drugs, the reactions producing them might be simple enough that we can skip the extraction step entirely and just synthesize them ourselves. The synthesis of many arbitrary compositions for common large molecules (DNA, RNA and many types of proteins) has also become possible, but this is a rather recent advancement, and humans of the 20th century would have been flabbergasted by the idea of an RNA printer.

There are 2 major hurdles a drug must overcome in order to take effect:

• Be absorbed in circulation through the gut, skin, nose, mouth (sublingual), or muscle. (can be sidestepped by taking it IV, doesn't apply to drugs with local action)
• Not be brutally disintegrated by the innate immune system

But, usually, there are also a few extra bonus ones that are super nice:

• Not be almost instantly metabolized by the liver
• Cross the blood-brain barrier
• Cross cell membranes
• Cross nuclear membranes

If you eat any complex protein (e.g. the kind that you put in a vaccine) it will be broken down into component amino acids in the stomach and small intestine. If you protect it with a capsule it won't be absorbed. If you design a fancy lipid capsule to facilitate absorption (or give it IV) it will be savagely attacked by white blood cells, immediately captured and digested in cell lysosomes, and pulverized down to glucose by the liver as a top priority. And we've not even gotten into whether it crosses a run-of-the-mill cellular membrane (not that hard) or whether it crosses the blood-brain barrier (extremely unlikely for anything with more than a triple-digit number of atoms in its composition).

If you want this to be RNA or DNA now you've got the extra challenge of crossing a nuclear membrane ... there are a few ways to do it, but for all but a few niches they are as complex as "design a deactivated virus to carry it".

On the other hand, you can basically stare the wrong way at small molecule salt and it will perfuse itself into every living ounce of tissue in your body.

How can you administer aspirin? Swallow it. Need a capsule? Not really. Another way? Inject it. Do I have to hit a blood vessel? Naaah. What if I want to look cool? Grind it up and snort it. Can I snort it off someone's naked body at a party? Yeah, but be quick, it also gets absorbed through the skin. Which tissues does it get into? All of them. What if I'm transported back 5000 years in the past? Just boil some willow bark and drink it, you'll be fine.

Cheap to store

Small molecules are often content with just "existing" for a very long time. Kept at room temperature and away from light, solvents, and water, they can last mainly unaltered for a lifetime.

Most proteins, viruses, and lipo-whatever complexes used in medicine, on the other hand, require anything from being stored in a fridge and used within 3-6 months, to being stored at -120 degrees Celsius and used within days.

This is not always the case, lysergic acid is a tiny yet infamously unstable substance and ApoB is a gigantic and infamously stable protein. But as a rule of thumb, the smaller the size and the higher the mass, the more stable a molecule is.

Homogenous in effect

Proteins are complex, everyone's are a bit different and their interactions produce different complexes and different epigenomes and those lead to yet more different levels of all proteins and it's all very loopy and head-scratchy.

Metals are simple, you've got like 11 of them and they all do basically the same thing in everyone, and the levels vary by a factor of like 2 or 5 between individuals, but not 1000 or 10^9 or +/- infinity. If you've got too much of them, the kidney usually filters them out without much issue, at most you end up losing some water. They are well preserved and well-tolerated in circulation.

Coincidentally, metals are small molecules and proteins aren't.

A life-saving protein can become lethal if you up the dosage to just 2-3x the levels. Given the previously mentioned large individual differences, this essentially means that a lot of large molecules would have to be administered while taking their effects into account, with continuous monitoring on the scale of seconds or minutes.

Currently, we get around this by not using proteins that are too risky and making use of fuzzy measures. People's tolerance to hGH is pretty flat, but it can still vary by a factor of dozens. The workaround for this is to start slow, tell people to increase the dosage, and stop when they feel iffy from it. But "feel iffy" is not a very quantitative endpoint, and relies a lot on individual interpretation, which is famously unreliable.

But to start using the interesting stuff we'd need a much tighter feedback loop, something closer to a device attached to a small catheter analyzing blood samples for dozens of markers every few [milli]seconds, a second catheter inserting a small amount of the drug, and an algorithm regulating the dose based on the response in real time.
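As a rough sketch of what "an algorithm regulating the dose based on the response" could mean, here is a clamped proportional controller run against a toy patient model. Every name, unit, and number is hypothetical, the patient model (linear response with decay) is an assumption made purely for illustration, and no real device is being described.

```python
def next_dose(marker, target, gain, max_dose):
    """Dose proportional to how far the marker is below target, clamped to [0, max_dose]."""
    return min(max(gain * (target - marker), 0.0), max_dose)


def simulate(sensitivity, steps=50, target=10.0, gain=0.2, max_dose=5.0, decay=0.9):
    """Toy patient: the marker decays each tick and rises by sensitivity * dose."""
    marker, history = 0.0, []
    for _ in range(steps):
        dose = next_dose(marker, target, gain, max_dose)
        marker = decay * marker + sensitivity * dose
        history.append(marker)
    return history


# Two hypothetical patients whose response to the same dose differs five-fold.
for sensitivity in (1.0, 5.0):
    history = simulate(sensitivity)
    print(sensitivity, max(history), history[-1])
```

The point of the sketch is only that the same rule yields different doses for patients with different sensitivity, because it reacts to the measured marker instead of following a fixed schedule. A pure proportional rule also settles below the target, so a real controller would need more than this.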

All of our experience running clinical trials would be moot as well, since a lot of people might just not respond well to these. Current best practices warn against selecting sub-groups that respond well. But to find the effect in protein-based medicine, the most likely approach would be to find sub-groups that had an excellent response. You can still control for chance in this situation, but it gets much more complex.

Quick to act

Finally, small molecules usually act pretty quickly.

They have a target, they bind to it, the target reacts, effects happen.

Small molecules usually modulate an existing process, often by binding to a receptor or enzyme, many times in an inhibitory fashion. But it's very hard for them to have a constructive or "additive" effect.

This is, in principle, a good thing. It makes using them in situations where they do work fairly easy.

Things like viral vectors inserting DNA plasmids into your cells... that takes a while to manifest, especially since there's no way to "turn it off" after 10 seconds, it'll be there for a few weeks, so you need to take into account the amount produced over time.

Wanna determine side effects before a long course? Tough luck, inject a smaller quantity and wait a few weeks.

This might appear in contrast with what I just said previously, but it isn't, therein lies the problem. Often enough the marker you are monitoring to assess tolerance and dosage is not the final treatment outcome.

So you are stuck in a situation where for most large molecules you have to use a complex process to determine the dosage, then often enough wait weeks or months to see the desired effect, then rinse and repeat if it didn't work. Whereas with small molecules you just give the maximum safe dose and wait a few hours or days, and if it didn't work you give a slightly larger slightly unsafe dose or change meds and try again.

Of course, this is because the targets of large molecules are often more sophisticated and can't be reached with small ones. You can't restore genetic or immune integrity with simple substances, but you can with gene therapy.

Why is this a limitation?

Not only are small molecules limited, but thinking in terms of them limits the mindset of doctors, researchers and pharmaceutical companies.

Starting with prescription, where the dynamics of proteins, peptide hormones, and enzymes make them much more tricky. They can't just be prescribed on a "take 3 of these a day at mealtimes" basis, but rather, they require careful monitoring of the feedback loops they partake in.

Indeed, modern medicine has "burnt" many patients by prescribing the peptide hormone insulin in quite literally the "3 times a day with meals" fashion, whereas the proper mechanism of administration would have been in tandem with monitoring of glucose. This led to many becoming irreversibly "addicted" to the substance, no longer capable of producing it endogenously or of regaining the ability to do so.

I use the past tense though, recently CGMs (continuous glucose monitors) are being used to administer insulin only when needed and in the quantities required, together with protocols for fasting and ketogenic eating to regain insulin sensitivity. I'm by no means in the vanguard here figuring this out.

The prescription of complex drugs will be aperiodic and heavily depend on the patient's demographic and monitoring. This is why, for example, boosters for SARS-CoV-2 and hepatitis B are always being tweaked based on the most active virus strains and administered only when results on various antibody assays indicate suboptimal leve... Oh? What? Really? Like, everyone? The exact same sequence used 40 years ago!? Really... ? Ok, so maybe there's still a bit of a learning curve.

But there's also the step of procurement. Whereas small molecules are best produced in a "centralized" fashion, this is not always the case with complex drugs, which might have a short shelf life and come with the problem of production variability no matter what's tried.

Might this mean that we should aim for local labs producing heavily unstable compounds on a need-to-administer basis? Or maybe a same-day production-supply mechanism for certain substances? Hard to tell, but the problem is currently not even being considered.

Finally, there's the issue of deciding who needs to take what. In tandem with continuous monitoring and "custom" orders, comes the possibility of designing "large molecules" for every individual patient, with slight variation specific to their DNA and bloodwork.

Would this require in-vitro experiments using patient tissue? Machine learning algorithms to design the drugs? Biokinetic models to test fundamental interactions? A new decision about the preferred administration method for every individual substance and patient based on their unique traits?

But it's not all doom and gloom

As I said in my Insulin example, medicine is slowly undergoing the process of learning how to work with complex molecules.

In parallel with normal medicine, I'd like to think that biohackers with the ability to custom order whatever they can dream of will lead the way with self-experimentation, figuring out the limits and benefits of individualized design and continuous monitoring.

For now, this is mainly click-bait, people injecting a bioluminescence gene with CRISPR kinda stuff, but some of it isn't. As an example, the guy from ThoughtEmporium self-designed a therapy to get rid of lactose intolerance (though the solution is not permanent, it lasts for a few months). Hundreds of other such people are engaging in similar experiments, and the more people do it, the more resources become available, the easier it will get, and the better the ROI.

As this happens, social acceptance will follow and pharmaceutical companies will get more in on the deal.

It's not just lone loonies. There are entire clinics (with stunning results) dedicated to stem cell therapy, which create custom preparations using patients' own tissue and inject them at the specific injury sites with a case-by-case substrate composition to encourage healing. Not to mention techniques like PRP, a prime example of medicine using complex substances (cells) which is now available at every street corner and used for everything from hair loss to gum pain. Arguably, these people are still "the loonies" by mainstream standard, but they are certainly getting fairly close to widespread acceptance.

Also, we shouldn't forget that, while small molecules are limited, we are far from depleting their usage. Out of every single bioactive molecule that you could create, it's likely that just 0.0x% were ever tested in an animal, and out of those only 0.0x% were ever tested in a human, and out of those only 0.0x% underwent the rigorous trials needed to become an approved drug. The same techniques that will help us employ more complex molecules and constructs could well lead to a revolution in the way we administer and find small molecules.


December 1, 2021 - 18:00
Published on December 1, 2021 3:00 PM GMT

While Paxlovid Remains Illegal and is expected to remain illegal for at least several weeks, the FDA did manage to finally meet to discuss whether or not to legalize the other Covid-19 treatment pill, Merck’s Molnupiravir. While later data reduced effectiveness estimates from 50% to 30%, that’s still much better than 0% and it uses a unique mechanism that can probably be profitably combined with other treatments, so one might naively think that after sufficient stalling for appearances this would be easy.

One would be wrong. The vote was 13-10, was restricted to those at high risk, and could easily have failed outright.

I may want to later refer back to this, so it’s splitting off into its own post.

High Level Summary

Usually we get live-blogs from Helen Branswell and Matthew Herper. They held off and only issued a summary post later on this time, perhaps because the meeting was too painful. Luckily, my commenter Weekend Editor is excellent at summarizing such meetings, so I’ll quote their summary in full; the link goes to their full post. After the summary, I’ll note the salient other details in the full post.

Today the FDA’s AMDAC meeting voted to recommend molnupiravir get an emergency use authorization. But just barely: 13 yes, 10 no, 0 abstentions. And with a lot of caveats among the yes votes.

Some issues:

(a) The efficacy vs hospitalization was “wobbly”: the interim report had 48.3% (CL: 20.5% – 66.5%) efficacy, but when they added the rest of the data it was only 30.4% (CL: 1.0% – 51.1%). People thought this might mean there were responder/nonresponder populations, and nobody knew a biomarker to distinguish them.

This is an interesting mix of good and bad thinking. Yes, it’s possible that the 50% vs. 30% thing wasn’t random, but why is that an issue here?

If you have a drug with 30% efficacy, that’s good. If you have a drug that is 50% effective half the time and 10% effective the other half, depending on who it is used on, and you can’t tell who is who, then you still have a 30% effective drug.
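The arithmetic behind that claim can be checked directly. This is a minimal sketch, assuming the hypothetical 50/50 split from the paragraph above and assuming both groups have the same baseline risk of hospitalization (neither assumption comes from the trial data); `blended_efficacy` is an illustrative helper name.

```python
def blended_efficacy(groups):
    """Population-level efficacy for a list of (population_share, efficacy) pairs.

    Assumes every group has the same baseline risk, so the overall
    efficacy is just the share-weighted average.
    """
    total_share = sum(share for share, _ in groups)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(share * eff for share, eff in groups)


# One homogeneous population with 30% efficacy.
uniform = blended_efficacy([(1.0, 0.30)])

# Hypothetical responder / non-responder split from the text.
mixed = blended_efficacy([(0.5, 0.50), (0.5, 0.10)])

print(f"uniform: {uniform:.2f}, mixed: {mixed:.2f}")
```

Both cases come out to the same population-level number, which is the point: if you cannot tell the groups apart, the decision to treat looks identical.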

The difference is that because of the way you are now stacking your coin flips, it feels like you are now giving a 10% effective drug to some people without knowing it? Or there’s the ‘problem’ that if you knew more you could differentiate between the two populations, so now you have an imperfect procedure and it wouldn’t be ‘ethical’ somehow to proceed, or you’re blameworthy when the drug doesn’t work in a particular case?

I’m glad this didn’t cause a veto for enough voters, but consider that if this wasn’t an emergency situation, it might have caused one, despite being good news since it opens up the possibility of doing better once we know more.

Then again, perhaps the argument is that experiments are illegal except for getting drugs approved, so if we approved the drug we’d never run the experiment? Which has a certain kind of dystopian logic to it, I suppose.

Of course, same as it ever was, we stopped the study when the early results looked so good, and now we’re saying the results are at most barely good enough…

(b) While the mechanism and efficacy calculations from Merck were quite convincing, the FDA showed there were some issues with mutagenicity, particularly in the first trimester of pregnancy. (And I have to apologize: I got bored, and went grocery shopping at this point, stocking up in case the Omicronomicon gets loose. So I missed most of this, and am too tired to go back and listen. Maybe tomorrow.) Any woman will only get Molnupiravir after a negative pregnancy test.

Points for Omicronomicon, that’s great, and given the timing I hope it was all dry goods that will keep for a long time. In any case, a pregnancy test is quick and cheap, and damaged babies are considered quite bad, so I have no issue with this requirement, even though in practice it’s going to be painfully dumb a large percentage of the time. Given this is a treatment for those already sick, it would take a very large concern to make taking this a bad idea otherwise.

(c) One point in favor was that people thought the efficacy of monoclonal abs will fade with Omicron, so they want an alternative. OTOH, they also said if there’s “another oral medication” with higher efficacy with less side effects (think Paxlovid!), then the FDA should reconsider Molnupiravir. So Molnupiravir may get approved for like a month or so until Paxlovid blows it away? They so carefully didn’t mention Paxlovid that I wonder if there was some legal constraint.

This is of course completely crazy and illustrates how twisted their frameworks have become. Molnupiravir and Paxlovid are probably complements because they have different mechanisms, and the existence of a life saving medicine (that is currently illegal, thanks to you murderous madmen) is no reason to then make a different lifesaving medicine illegal because there’s a better option. If Paxlovid is legal and available and so is Molnupiravir, are they worried people who could have been given Paxlovid won’t be given Paxlovid?

Then there’s the crazy of the monoclonal abs argument and the ‘need’ for this new treatment as a justification. Once again, these treatments are complements, and also I can’t help but notice that a lot of people are dying of Covid-19 and we’re worried about a lack of hospital capacity and all that? It’s like the FDA thinks you only get one treatment (because who would dare use more than one without a Proper Scientific Study and a Standard of Care, or something) and therefore they have to ban all but the best treatment no matter the issues of cost and supply? The hell?

(d) Listening to the voting statements was almost painful: the AMDAC members were clearly conflicted.

So they recommended EUA only for high-risk individuals, mostly the unvaccinated or those who had suboptimal response to vaccination.

Pending FDA administrative action, Molnupiravir is EUA’d… sort of.

When they tried this insanity with boosters, limiting who can have access without legal liability to life saving medicine and thus allowing people to die, in order to satisfy arcane ‘ethical’ requirements, the states increasingly overruled the FDA. I very much hope that they do this again. But I fear that given the ‘mutagenicity’ issue doctors will rightfully fear lawsuits from people who claim nonsensical ‘mutations’ happened to them, and so won’t be able to give Molnupiravir to a lot of their patients, resulting in a bunch of people dying and hospitals filling up faster.

Seems like the most important action is to hurry along to the Paxlovid hearings, no?

Yes, still true, although they now have a distinct downside. They may lead to a lifesaving medicine, that is currently on track to be at least sort of legal, becoming illegal once more.

And it’s only on track to be legal rather than legal, with substantial doubt, because the FDA administrator has to sign off, then the CDC advisory committee has to meet and then the CDC management has to approve. Given what we’ve seen so far, all of these steps are at risk.

However, it’s also important to note this concern, even if it didn’t make the comment summary and seems like it wasn’t considered too important:

There was also some concern that the induction of high mutation rates in the virus might, if it doesn’t go far enough, create troublesome new variants.

This does seem like a real concern if it’s even slightly well-founded, and a good reason to consider not approving the drug. The downside of doing this could be very, very high. If this was the reason given for rejection, I might even accept it. Here’s what the Stat news summary had to report about that.

Panelists also worried about data showing that use of molnupiravir might, in theory, lead to new variants of the SARS-CoV-2 virus through its mechanism, which works by causing viruses to make mistakes in copying their genetic material.

“With all respect, I think it’s incumbent upon you to make some effort to make an estimate of what is the likelihood of escape mutants occurring as a result of your drug,” said James Hildreth, a panelist and the CEO of Meharry Medical College.

However, other panelists, including John Coffin, a Tufts molecular biologist, argued that the overall risk of such mutations due to the drug was small.

It seems appropriate here to have some amount of model error, and to actually do the calculation.

That’s the high level result.

Key Facts

There’s always a lot of good information at these meetings. What is it important to know?

Here’s the protocol:

The drug is administered twice per day for 5 days at a dose of 800mg (in the form of 4 capsules of 200mg each, so the dose can be titrated down, perhaps?).

Treatment must start within 5 days from symptoms. That’s empirically important, as shown here. The reduction in viral load over time is dramatic for intervention before 5 days past symptoms (left), but it is an order of magnitude weaker for intervention after 5 days of symptoms (right). That means we need to have a testing system which is widely available, cheap or free, and fast!

This also means that if you catch the problem quickly, we would probably see much better than 30% efficacy. This effect has to be continuous, there’s nothing special about five days in particular.

There were 10 subject deaths during the trial, but 9 were in the placebo arm. As a crude indication of safety, that’s pretty good. There were a lot of other safety studies, both in vitro, in animals, and in surveillance of the trial population. The trial seemed pretty safe, with adverse events in the placebo arm about the same as treatment. There is some worry about mutagenicity, particularly during organogenesis in pregnancy when messing with human RNA could be a bad idea.

The safety profile definitely more than passes my bar of ‘if this is unsafe then it’s nowhere near as unsafe as not using it so give it to me.’ If I get sick, I want treatment, and yet they are intending to make it illegal for me to get it, even outside issues of ‘delay.’

For a drug with 30% efficacy against hospitalization, that’s a rather good ratio. Sample size is small, but I suspect that the 30% is an underestimate.
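To get a rough feel for how lopsided a 1-versus-9 split is, here is a back-of-the-envelope sketch. It assumes the two arms were the same size and that, if the drug did nothing, each death would be equally likely to land in either arm. Those are simplifying assumptions for illustration, not figures from the trial, and `prob_at_most` is a hypothetical helper name.

```python
from math import comb


def prob_at_most(k, n, p=0.5):
    """P(at most k of n events fall in the treatment arm) under a Binomial(n, p) null."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# 10 deaths in total, 1 of them in the treatment arm.
one_sided = prob_at_most(1, 10)
print(f"one-sided probability under the null: {one_sided:.4f}")
```

Under those assumptions the split would happen by chance about 1% of the time, which fits the intuition that the sample is small but the signal is not nothing.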

There is no doubt this improvement is due to virus clearance, since they measured SARS-CoV2 RNA at baseline, day 3, and day 5. The reduction, as measured by log of mean difference and its 95% CL, is significant as shown here. It’s not just making people feel better, it’s doing so by a mechanism that makes sense and is related to the disease process.

It works. Weekend Editor notes this graph.

Weekend Editor suggests Molnupiravir might reduce everyone’s risk down to a similar baseline, based on this graph, which is interesting. Then again, have we considered that these sample sizes are too low and that’s why everything’s so noisy? Still it is suggestive, and it is suggestive that the effective reduction in hospitalization and death could be much higher than 30% in practice.

This is the de rigueur Kaplan-Meier plot. It shows hospitalizations versus time, for the treatment arm and control arm. There’s a log-rank statistic that shows this difference is significant, i.e., the spread between the 2 curves is real.

The logistical difficulties of getting the drug to people within 5 days of symptoms looked daunting to some AMDAC members.

If five days is daunting, how are we going to get Paxlovid to people within three? Also, seriously, five days is an insanely long amount of time, if it’s not enough then FIX IT.

The Actual Decision

It’s good to have a reference handy of how everyone voted, to compare with other votes in the future, or for other reasons.

From the Stat news summary, some insight into what some people were thinking.

“I think we need to stop and acknowledge that the whole reason we’re having this discussion is because the efficacy of this product is not overwhelmingly good,” said W. David Hardy of Charles Drew University School of Medicine and Science during a discussion about the drug’s use during pregnancy. “And I think that makes all of us feel a bit uncomfortable about the fact whether this is an advance therapeutically because it’s an oral medication, not an intravenous medication.”

Then again that seems to directly contradict this summary viewpoint from the same article:

In the end, panelists narrowly voted that the benefits of having an oral Covid treatment to keep people out of the hospital outweighed their questions and concerns. But the FDA may write a far narrower authorization for the drug than observers would previously have expected.

This suggests that being an oral treatment was (correctly, I’d presume) considered an important advantage?

It also suggests that this decision was indeed very close and another similarly effective drug could easily have been rejected in this spot.

What is ‘overwhelmingly good?’ I’m guessing that if it were standard of care, and someone were suggesting not using it, the 30% would be enough to make this seem completely crazy and unacceptable. It’s all about framing.

It has to be ‘an advance’ because one does not simply use it in addition to other tools, and it’s a disadvantage that it is an oral medication in the context of whether it is ‘an advance’ in this other sense, even though in terms of usefulness it is a rather big advance, and the process is looking at a bunch of veto considerations rather than doing a cost-benefit analysis.

Takeaways
1. When you stop a trial early, you sometimes don’t get enough data.
2. This lack of data can then endanger approval. We have a concrete example.
3. The vote was 13-10 and at best we’re likely to get a narrow authorization. If many of you want Molnupiravir, you won’t be able to get it, and it’s possible no one will be able to get it.
4. The FDA is even more willing to deny us life saving medicine than we previously expected. We should worry more about this, not less.
5. Pregnant women probably shouldn’t take Molnupiravir, and pregnancy tests will be required before the drug is given out.
6. There is one other real potential worry about Molnupiravir, that it could create new variants. This does not seem to have been a major consideration, nor does a probability or cost-benefit assessment seem to have been done here, and the 13-10 vote was due mostly or entirely to other reasons.
7. Chances seem good that this can be a >30% effective treatment in practice, among those who get it in time and are allowed to get it. I’d be far more surprised by substantially lower than 30% than by substantially higher.
8. Paxlovid remains so illegal the FDA can’t even say its name in meetings.


### How do you write original rationalist essays?

December 1, 2021 - 11:17
Published on December 1, 2021 8:08 AM GMT

I really enjoy Scott Alexander's and Paul Graham's essays. How can I practice to learn to write as they do?

I'm getting pretty okay at writing tutorials, where I just walk people through the process of completing some project. I'm also okay at research-based posts - it's not that difficult to gather information from the internet and compile my own summary that is hopefully useful to other people.

But I don't understand how PG and SSC create such insightful essays seemingly by making them up (I'm talking about the SSC posts where he just shares his thoughts, not the ones where he analyzes studies or teaches you more than you ever wanted to know about x).

I can share things I have learned from experience, I can try to explain complicated subjects in more accessible ways, but none of these approaches will lead to essays that I see so many of at Less Wrong. People seem to just sit down, and generate interesting, original, and unique thoughts through the process of writing itself, by "thinking on paper". Or maybe not, I don't know.

Aside from just being born naturally very smart and gifted, is there anything that can be done to learn to write like this?

How do people like Scott Alexander, Paul Graham, Eliezer Yudkowsky, etc, just think of all these unique and original things? It seems like they just have a boundless source of ideas, every paragraph is insightful, and most of these thoughts are something they just came up with, not something they have learned elsewhere. Not just that, it seems that they come up with them as they are writing the post, not through collecting random epiphanies a person has now and then. Like, they can generate these epiphanies intentionally, on demand.

I guess the more general question is - how do I get better at creative, original thinking?

I have spent years practicing "creative" skills - traditional and digital art, writing, gamedev, programming, fiction. I've made a lot of things, I'm pretty proud of some of my projects, I'm getting pretty decent at some of these skills. But gun to my head - I can't seem to just sit down and make up an original non-fiction essay worthy of Less Wrong (or even my personal blog), even a simple one. What's wrong with me?


### Experience with Cue Covid Testing

December 1, 2021 - 05:50
Published on December 1, 2021 2:50 AM GMT

Several months ago my work started offering at-home covid-19 rapid molecular testing via Cue. Now that it's possible to buy these kits as an individual I wanted to write some about our experience and who it might be a good fit for.

These tests offer sensitivity comparable to PCR in an at-home ~25min timeframe, and if you choose to take the test with a video call you get an official result. The main downside is that they are expensive.

The system is two parts: a reusable reader and disposable cartridges (with nasal swabs).

You connect to the reader with your phone over bluetooth, and there's an app that walks you through all the steps. You can choose to have your test "proctored", in which case you have a video call with Cue to verify your identity and watch that you test correctly. It takes about 25 minutes end to end, with an extra ~10min if proctored. Occasionally a test fails and you need to retake, so there's some risk of it taking twice as long.

In terms of cost, while there are several options it looks to me like if this is worth it for you at all you would choose Cue+ Complete, at $149 + $90/month for 20 annual tests plus $60/test for each additional test (in packs of ten). That is quite a bit more than the $12/test you'd pay for an antigen test like BinaxNOW, so is this worth the extra cost?
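To make the comparison concrete, here is a rough first-year cost sketch using the prices quoted above. It assumes the $149 is a one-time charge, that the monthly fee is paid for all twelve months, and that extra tests can only be bought in full packs of ten; the constant and function names are made up for illustration.

```python
import math

SETUP_FEE = 149          # USD, assumed to be a one-time charge
MONTHLY_FEE = 90         # USD per month
INCLUDED_TESTS = 20      # tests included per year
EXTRA_TEST_PRICE = 60    # USD per additional test
PACK_SIZE = 10           # additional tests are sold in packs of ten
ANTIGEN_PRICE = 12       # USD per antigen test, for comparison


def cue_first_year_cost(tests_used):
    """Total first-year cost in USD for a given number of tests."""
    extra_needed = max(0, tests_used - INCLUDED_TESTS)
    packs = math.ceil(extra_needed / PACK_SIZE)
    return SETUP_FEE + 12 * MONTHLY_FEE + packs * PACK_SIZE * EXTRA_TEST_PRICE


for n in (10, 20, 30):
    cue = cue_first_year_cost(n)
    print(f"{n} tests: Cue ${cue} (${cue / n:.2f}/test) vs antigen ${n * ANTIGEN_PRICE}")
```

Under these assumptions the per-test cost never drops much below $60 in the first year, so the comparison really is about what a fast molecular result is worth to you.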

In most cases, I think it isn't. While antigen tests are less sensitive, the cases they miss are generally cases where people are less infectious: lower viral concentration in the sample is correlated with lower viral shedding in general.

The place where I think the Cue is potentially worth it is in interacting with institutions that require a molecular test. For example, it is common for schools and daycares to require that children who have potential covid symptoms stay home pending a negative test. In our experience this means missing two or three days: if you go get a PCR test on the first day you don't get the results in time for the second, and you only sometimes get the results back in time to attend on the third day. With the Cue you get results soon enough that you don't have to miss any days. If your kid missing school/daycare means you missing work, $60/test may actually be a very good deal. Same goes for people in jobs with a similar policy for employees.

In our personal situation, because we already need to have an adult home full-time to watch our infant, it's not that bad if our older kids have to stay home too. And since our older kids are in kindergarten and second grade I don't think it's that bad if they miss a couple days here and there. So I probably wouldn't buy one of these if I needed to spend my own money on it? On the other hand, if missing school/work was more of an issue I think the Cue could potentially be well worth it.


### Infra-Bayesian physicalism: proofs part II

December 1, 2021 - 01:27
Published on November 30, 2021 10:27 PM GMT

This post is an appendix to "Infra-Bayesian physicalism: a formal theory of naturalized induction".

We can prove the second subset inclusion directly as a corollary of Proposition 2.10: just let the $t$ of Proposition 2.10 be the projection function $\Gamma_1\times\Phi\to\Phi$. That leaves only the first subset inclusion direction. If you've seen the proofs so far, you know we do a thing where we try to show subset inclusion with expectations of functions and inequalities instead, and that the proofs all proceed by transforming the expectations until we get a maximum over contribution expectation values, which is always where the hard part of proving the inequalities shows up. So, let's just get that part over with; an interested reader can work it out with previous proofs as a guide. Unusually, we'll be keeping track of identity functions here.

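Plugging in some $f$, and doing our usual activities to get every term into the appropriate form, we can get this result if we manage to show that
$$\max_{\theta'\in\mathrm{Br}_{\Gamma_0\times\Gamma_1}(\Theta)}(\pi\times\mathrm{id}_{\Phi})_*\theta'(\lambda y_0\alpha_0x.f(y_0,x,\alpha_0))\le\max_{\theta\in\mathrm{Br}_{\Gamma_0}(\Theta)}\mathrm{pr}_{\mathrm{el}\Gamma_0\times\Phi}\theta(\lambda y_0\alpha_0x.f(y_0,x,\alpha_0))$$
So, to establish this, we'll show that, given some $\theta'\in\mathrm{Br}_{\Gamma_0\times\Gamma_1}(\Theta)$, we have $(\pi\times\mathrm{id}_{\Gamma_1\times\Phi})_*(\theta')\in\mathrm{Br}_{\Gamma_0}(\Theta)$, and that
$$\mathrm{pr}_{\mathrm{el}\Gamma_0\times\Phi}((\pi\times\mathrm{id}_{\Gamma_1\times\Phi})_*\theta')=(\pi\times\mathrm{id}_{\Phi})_*\theta'$$
because, if we show that, then it means that $\mathrm{Br}_{\Gamma_0}(\Theta)$ is a rich enough set for the right-hand side of the inequality to match anything the left-hand side can put out.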

First off, $\mathrm{pr}_{\mathrm{el}\Gamma_0\times\Phi}((\pi\times\mathrm{id}_{\Gamma_1\times\Phi})_*\theta')=(\pi\times\mathrm{id}_{\Phi})_*\theta'$ is pretty trivial to show. The only difference between the two processes is that the $\Gamma_1$ coordinate of $\theta'$ is discarded immediately on the right-hand side, and it's preserved for one step and then discarded on the second step for the left-hand side.

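Now for the membership claim we need. Let $\theta'\in\mathrm{Br}_{\Gamma_0\times\Gamma_1}(\Theta)$; we're trying to show that
$$(\pi\times\mathrm{id}_{\Gamma_1\times\Phi})_*(\theta')\in\mathrm{Br}_{\Gamma_0}(\Theta)$$
First off, we show the support condition for $(\pi\times\mathrm{id}_{\Gamma_1\times\Phi})_*(\theta')$, which is somewhat nontrivial this time around. We start off with a guarantee that $(y_0,y_1)\in\alpha$. This happens iff
$$y_0\in\{y_0'\mid(y_0',y_1)\in\alpha\}=\pi(y_0,y_1,\alpha)_{2^{\Gamma_0}}$$
And so, we get that $y_0\in\alpha_0$ is guaranteed for that pushforward, and the support condition is established.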

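Endofunction condition time. It's important to remember that we want to treat $\mathrm{el}\Gamma_0$ as the computation side of things, and $\Gamma_1\times\Phi$ as the environment side of things, for the bridge transform we're working with. Let $s:\Gamma_0\to\Gamma_0$ and $g:\Gamma_0\times\Gamma_1\times\Phi\to[0,1]$. Begin.
$$(\pi\times\mathrm{id}_{\Gamma_1\times\Phi})_*\theta'(\lambda y_0y_1\alpha_0x.\chi_{s(y_0)\in\alpha_0}g(s(y_0),y_1,x))$$
$$=\theta'(\lambda y_0y_1\alpha x.\chi_{s(y_0)\in\pi(y_0,y_1,\alpha)_{2^{\Gamma_0}}}g(s(y_0),y_1,x))$$
Let's unpack precisely what that set is.
$$=\theta'(\lambda y_0y_1\alpha x.\chi_{s(y_0)\in\{y_0'\mid(y_0',y_1)\in\alpha\}}g(s(y_0),y_1,x))$$
$$=\theta'(\lambda y_0y_1\alpha x.\chi_{(s(y_0),y_1)\in\alpha}g(s(y_0),y_1,x))$$
And we can rewrite the endofunction a little bit
$$=\theta'(\lambda y_0y_1\alpha x.\chi_{(s\times\mathrm{id}_{\Gamma_1})(y_0,y_1)\in\alpha}g((s\times\mathrm{id}_{\Gamma_1})(y_0,y_1),x))$$
And finally apply our endofunction condition, since we've now got the function in a form that's treating $y_0,y_1$ as part of the computational universe...
$$\le\Theta(\lambda y_0y_1x.g(y_0,y_1,x))$$
And we're done; this establishes our desired result. ■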

Proposition 2.17: Br(Θ) is a continuous function of Θ.

The way this proof will work is by describing a composition of functions that makes Br(Θ) from Θ, and then showing that each of these functions is continuous, if elΓ×Φ is a finite set.

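Claim: The bridge transform of some $\Theta$ is equal to (using $\chi_{\mathrm{el}\Gamma}$ to denote restricting an ultradistribution to the event $y\in\alpha$ and $\chi^{-1}_{\mathrm{el}\Gamma}$ to denote the inverse of said function, mapping an ultradistribution on $\mathrm{el}\Gamma$ to the largest ultradistribution that could have produced it via restriction)
$$\chi_{\mathrm{el}\Gamma}\Big(\bigcap_{s:\Gamma\to\Gamma}s^*\big(\chi^{-1}_{\mathrm{el}\Gamma}(\iota_*(\mathrm{pr}^*(\Theta)))\big)\Big)$$
Breaking down the unfamiliar notation, the type of $\mathrm{pr}$ is $\mathrm{el}\Gamma\times\Phi\to\Gamma\times\Phi$, just the usual projection. That asterisk up top is pullback along that function. The type of $\iota$ is $\mathrm{el}\Gamma\times\Phi\to\Gamma\times2^{\Gamma}\times\Phi$. And $s^*$ is pullback along the function $\Gamma\times2^{\Gamma}\times\Phi\to\Gamma\times2^{\Gamma}\times\Phi$ given by $(s,\mathrm{id}_{2^{\Gamma}},\mathrm{id}_{\Phi})$.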

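Let's unpack the exact conditions that cause a $\theta$ to lie in the set
$$\chi_{\mathrm{el}\Gamma}\Big(\bigcap_{s:\Gamma\to\Gamma}s^*\big(\chi^{-1}_{\mathrm{el}\Gamma}(\iota_*(\mathrm{pr}^*(\Theta)))\big)\Big)$$
First off, a $\theta$ is in this set iff it is supported over the event $y\in\alpha$, and it lies in the set
$$\bigcap_{s:\Gamma\to\Gamma}s^*\big(\chi^{-1}_{\mathrm{el}\Gamma}(\iota_*(\mathrm{pr}^*(\Theta)))\big)$$
Which occurs iff $\theta$ is supported over the event $y\in\alpha$, and for all $s:\Gamma\to\Gamma$, $\theta$ lies in the set
$$s^*\big(\chi^{-1}_{\mathrm{el}\Gamma}(\iota_*(\mathrm{pr}^*(\Theta)))\big)$$
Which occurs iff $\theta$ is supported over the event $y\in\alpha$, and for all $s:\Gamma\to\Gamma$, $s_*(\theta)$ lies in the set
$$\chi^{-1}_{\mathrm{el}\Gamma}(\iota_*(\mathrm{pr}^*(\Theta)))$$
Which occurs iff $\theta$ is supported over the event $y\in\alpha$, and for all $s:\Gamma\to\Gamma$, $\chi_{\mathrm{el}\Gamma}(s_*(\theta))$ lies in the set
$$\iota_*(\mathrm{pr}^*(\Theta))$$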

Now, $\iota$ is just doing a little bit of type conversion, so we're justified in ignoring it. Anyways, the previous thing occurs iff $\theta$ is supported over the event $y\in\alpha$, and for all $s:\Gamma\to\Gamma$, $\mathrm{pr}_*(\chi_{\mathrm{el}\Gamma}(s_*(\theta)))\in\Theta$.

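Which happens iff $\theta$ is supported over the event $y\in\alpha$ and for all $s:\Gamma\to\Gamma$ and $g:\Gamma\times\Phi\to[0,1]$,
$$\mathrm{pr}_*(\chi_{\mathrm{el}\Gamma}(s_*(\theta)))(\lambda yx.g(y,x))\le\Theta(\lambda yx.g(y,x))$$
However, unpacking the left-hand side, we get
$$\mathrm{pr}_*(\chi_{\mathrm{el}\Gamma}(s_*(\theta)))(\lambda yx.g(y,x))$$
$$=\chi_{\mathrm{el}\Gamma}(s_*(\theta))(\lambda y\alpha x.g(y,x))$$
$$=s_*(\theta)(\lambda y\alpha x.\chi_{y\in\alpha}g(y,x))$$
$$=\theta(\lambda y\alpha x.\chi_{s(y)\in\alpha}g(s(y),x))$$
Which is the exact condition for $\theta$ to lie in the bridge transform. So, we have an equivalence.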

Now, since we've phrased the bridge transform as
$$\chi_{\mathrm{el}\Gamma}\Big(\bigcap_{s:\Gamma\to\Gamma}s^*\big(\chi^{-1}_{\mathrm{el}\Gamma}(\iota_*(\mathrm{pr}^*(\Theta)))\big)\Big)$$
we just need to establish that when all the sets are finite, then pullbacks are continuous, pushforwards are continuous, un-restrictions are continuous, intersections are continuous, and restrictions are continuous. Then, this would just be a particularly fancy continuous function, and accordingly, if $\Theta_n$ limited to $\Theta$, then $\mathrm{Br}(\Theta_n)$ would limit to $\mathrm{Br}(\Theta)$.

Let's establish that when the sets are finite, pullbacks are continuous. Let $g:X\to Y$, with $Y$ and $X$ finite sets, and $\psi\in\square Y$. Then, we have
$$g^*(\psi)(\lambda x.f(x)):=\psi\big(\lambda y.\max_{x\in g^{-1}(y)}f(x)\big)$$
With the convention that maximizing over the empty set produces a value of 0. That is an alternate phrasing of pullback. We can then go
$$\lim_{n\to\infty}d(g^*(\psi_n),g^*(\psi))=\lim_{n\to\infty}\sup_{f:X\to[0,1]}|g^*(\psi_n)(f)-g^*(\psi)(f)|$$
$$=\lim_{n\to\infty}\sup_{f:X\to[0,1]}\Big|\psi_n\big(\lambda y.\max_{x\in g^{-1}(y)}f(x)\big)-\psi\big(\lambda y.\max_{x\in g^{-1}(y)}f(x)\big)\Big|$$
$$\le\lim_{n\to\infty}\sup_{h:Y\to[0,1]}|\psi_n(h)-\psi(h)|=\lim_{n\to\infty}d(\psi_n,\psi)=0$$
Admittedly, this isn't quite what our modified KR metric usually looks like. The reason we can do this is because we're just dealing with functions in $[0,1]$, so the norm part of the modified KR metric doesn't matter, and since our sets are finite, we can say that all points are distance 1 from each other, so all functions are 1-Lipschitz, and then the two metrics coincide. So, pullback along any function is continuous.
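The pullback formula above is easy to play with numerically. The following is a minimal sketch, assuming an ultradistribution on a finite set can be stood in for by a finite list of contributions (each a dict from points to mass), with the expectation taken as the maximum over that list; this representation and the names `expectation` and `pullback_expectation` are illustrative assumptions, not the post's own machinery.

```python
def expectation(psi, h):
    """Upper expectation: max over contributions of sum_y mass(y) * h(y)."""
    return max(sum(mass * h(y) for y, mass in theta.items()) for theta in psi)


def pullback_expectation(psi, g, X, f):
    """Evaluate g^*(psi)(f) = psi(lambda y: max over x in g^{-1}(y) of f(x)).

    Maximizing over an empty fibre gives 0, matching the convention in the text.
    """
    def h(y):
        return max((f(x) for x in X if g(x) == y), default=0.0)

    return expectation(psi, h)


# Toy data: X = {a, b, c}, Y = {0, 1, 2}; the point 2 in Y has an empty fibre.
X = ["a", "b", "c"]
g = {"a": 0, "b": 0, "c": 1}.get
f = {"a": 0.2, "b": 0.8, "c": 0.4}.get
psi = [{0: 0.5, 1: 0.5}, {2: 1.0}]

value = pullback_expectation(psi, g, X, f)
print(value)
```

Here the first contribution sees the fibre-wise maxima 0.8 and 0.4, while the second puts all its mass on a point with an empty fibre and so contributes 0.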

For pushforward, it's easy because, if $\psi\in\square X$, then we've got
$$\lim_{n\to\infty}d(g_*(\psi_n),g_*(\psi))=\lim_{n\to\infty}\sup_{h:Y\to[0,1]}|g_*(\psi_n)(h)-g_*(\psi)(h)|$$
$$=\lim_{n\to\infty}\sup_{h:Y\to[0,1]}|\psi_n(\lambda x.h(g(x)))-\psi(\lambda x.h(g(x)))|$$
$$\le\lim_{n\to\infty}\sup_{f:X\to[0,1]}|\psi_n(f)-\psi(f)|=\lim_{n\to\infty}d(\psi_n,\psi)=0$$

To show that restrictions are continuous, for the set $E\subseteq X$ that we're updating on,
$$\lim_{n\to\infty}d(\chi_E(\psi_n),\chi_E(\psi))=\lim_{n\to\infty}\sup_{f:X\to[0,1]}|\chi_E(\psi_n)(f)-\chi_E(\psi)(f)|$$
$$=\lim_{n\to\infty}\sup_{f:X\to[0,1]}|\psi_n(\chi_{x\in E}f(x))-\psi(\chi_{x\in E}f(x))|$$
$$\le\lim_{n\to\infty}\sup_{f:X\to[0,1]}|\psi_n(f)-\psi(f)|=\lim_{n\to\infty}d(\psi_n,\psi)=0$$

For intersections... that will take a bit more work. We'll have to use the equivalent formulation of closeness, that $\psi_n$ limits to $\psi$ iff the Hausdorff distance between the corresponding sets (according to the generalized KR measure) limits to 0. So, our task is to assume that $\psi_n$ limits to $\psi$, and $\phi_n$ limits to $\phi$, and show that $\psi_n\cap\phi_n$ limits to $\psi\cap\phi$. The bound we'll manage to prove is that
$$d(\psi_n\cap\phi_n,\psi\cap\phi)\le|X|\max(d(\psi_n,\psi),d(\phi_n,\phi))$$
Where $|X|$ is the number of elements in the finite set $X$. Here's the basic argument. For any particular point in the set $\psi_n$, there's a nearby point in $\psi$ (since the Hausdorff distance is low) with only $\epsilon$ measure moved around or deleted. So, in particular, if all the measure moved or deleted was just deleted from $\psi_n$ instead, then that resulting contribution would be below the nearby contribution in $\psi$ that we picked, and so it would lie in $\psi$ as well due to downwards closure.

So, in particular, if $\psi_n$ and $\psi$ only have a Hausdorff distance of $\epsilon$, then, taking any contribution in $\psi_n$ and subtracting $\epsilon$ measure from *all points* (if possible, if not, just remove measure till you're at 0) is *guaranteed* to make a point in $\psi$, and vice-versa.

And a corollary of that is that, given any contribution in $\psi_n\cap\phi_n$, the "subtract $\max(d(\psi_n,\psi),d(\phi_n,\phi))$ measure from each point" contribution is in $\psi$, also in $\phi$, and at a maximum distance of $|X|\max(d(\psi_n,\psi),d(\phi_n,\phi))$ from the original contribution. And this argument can be reversed to show that the limit of the intersections is the intersection of the limits (because the Hausdorff distance between the two goes to 0), so we do in fact have intersection being continuous.

And that just leaves un-restricting. Again, this will take a Hausdorff-distance argument. Fixing some contribution in $\chi^{-1}_E(\psi_n)$, it can be broken down into an on-$E$ part $\theta_{n,E}$, and an off-$E$ part $\theta_{n,\neg E}$. When you restrict to $E$, then $\theta_{n,E}\in\psi_n$. Since $\psi_n$ is within $\epsilon$ of $\psi$, there's some $\theta_E\in\psi$ that's within $\epsilon$ of $\theta_{n,E}$. Then, let your point in $\chi^{-1}_E(\psi)$ be $\theta_E+\theta_{n,\neg E}$ (if there's slightly more than 1 measure there, delete $\epsilon$ measure from $\theta_{n,\neg E}$, or all the measure if there's less than $\epsilon$ present). It's close to $\theta_{n,E}+\theta_{n,\neg E}$ because $\theta_{n,E}$ is close to $\theta_E$, the other component of it is unchanged, and maybe we deleted a little bit of excess measure which didn't do much.

This line of argument shows that ψn being close to the limit ψ is sufficient to establish that the un-restriction of the two of them are comparably close together. So we have continuity for that, which is the last thing we needed.

Since we wrote the bridge transform as a sequence of continuous functions, we know it's continuous (as long as all the involved sets are finite) ■

Proposition 3.1: Let $X$ be a finite poset, $f:X\to\mathbb{R}$ and $\Theta\in\square^cX$ downward closed. Define $f_{\max}:X\to\mathbb{R}$ by $f_{\max}(x):=\max_{y\le x}f(y)$. Observe that $f_{\max}$ is always non-decreasing. Then, $\Theta(f)=\Theta(f_{\max})$.

Proof: Pick a $\theta'\in\Theta$ s.t. $\theta'(f_{\max})=\Theta(f_{\max})$, i.e., a maximizing contribution. Let $k:X\to X$ be defined as $k:=\lambda x.\mathrm{argmax}_{y\le x}f(y)$, i.e., it moves a point down to somewhere below it where it can attain the highest value according to $f$. Now, consider $k_*(\theta')$. It's present in $\Theta$ because $\Theta$ was, by assumption, downwards closed, and we just moved all the measure down.

Now, we have
$$\Theta(f)=\max_{\theta\in\Theta}\theta(f)\ge k_*(\theta')(f)=\theta'(\lambda x.f(k(x)))=\theta'(\lambda x.f(\mathrm{argmax}_{y\le x}f(y)))$$
$$=\theta'(\lambda x.\max_{y\le x}f(y))=\theta'(f_{\max})=\Theta(f_{\max})\ge\Theta(f)$$
And so, all inequalities must be equalities, proving that $\Theta(f)=\Theta(f_{\max})$. In order, the connectives were: unpacking definitions, using downward closure to conclude that $k_*(\theta')\in\Theta$, unpacking pushforwards, unpacking the definition of $k$, using that applying a function to the argmax of inputs to that function just makes the max of the function, folding the definition of $f_{\max}$ back up, using that $\theta'$ was selected to maximize $f_{\max}$, and applying monotonicity. Done! ■
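The two objects doing the work in this proof, $f_{\max}$ and the map $k$, are simple to compute on a small example. This is only an illustrative sketch: the poset (divisors of 12 ordered by divisibility) and the values of `f` are made up, and it checks the two facts the proof relies on, namely that $f(k(x))=f_{\max}(x)$ and that $f_{\max}$ is non-decreasing.

```python
# Hypothetical finite poset: divisors of 12, with y <= x iff y divides x.
X = [1, 2, 3, 4, 6, 12]


def leq(y, x):
    """Order relation of the toy poset: y is below x when y divides x."""
    return x % y == 0


# Arbitrary made-up function on the poset (not monotone).
f = {1: 0.3, 2: 0.9, 3: 0.1, 4: 0.2, 6: 0.5, 12: 0.4}


def k(x):
    """Move x down to a point below it where f is largest (the argmax in the proof)."""
    return max((y for y in X if leq(y, x)), key=lambda y: f[y])


# f_max(x) = max over y <= x of f(y), computed via k.
f_max = {x: f[k(x)] for x in X}

# f_max should be non-decreasing with respect to the poset order.
is_monotone = all(f_max[y] <= f_max[x] for x in X for y in X if leq(y, x))

print(f_max, is_monotone)
```

Pushing a contribution forward along `k` only ever moves measure downward in the poset, which is why downward closure is exactly the property the proof needs.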

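Proposition 4.1: Consider some $\Gamma$, $\Phi$, a relation $Q\subseteq\Gamma\times\Phi$ and a PUCK $\Xi$ over $Q$. Let $\Theta:=\top_\Gamma\ltimes\Xi$. Then,
$$\mathrm{Br}(\Theta)=[\top_\Gamma\ltimes(\mathrm{sus}_\Theta\rtimes\Xi)]^{\downarrow}=[\top_\Gamma\ltimes(Q^{-1}\rtimes\Xi)]^{\downarrow}$$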

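First off, I'm not terribly picky about variable ordering, so I'll just write our desired proof target as
$$\mathrm{Br}(\Theta)=[\top_\Gamma\ltimes\Xi\ltimes\mathrm{sus}_\Theta]^{\downarrow}=[\top_\Gamma\ltimes\Xi\ltimes Q^{-1}]^{\downarrow}$$
The way we'll do this is by establishing the following result. For all monotone $f':\mathrm{el}\Gamma\times\Phi\to[0,1]$, we have
$$\mathrm{Br}(\Theta)(f')\le[\top_\Gamma\ltimes\Xi\ltimes\mathrm{sus}_\Theta](f')\le[\top_\Gamma\ltimes\Xi\ltimes Q^{-1}](f')\le\mathrm{Br}(\Theta)(f')$$
Why does that suffice? Well, assume hypothetically that the result held. Since the inequalities go in a circle, we have equality for all monotone functions. And then, for some non-monotone function $f$, we can go
$$\mathrm{Br}(\Theta)(f)=\mathrm{Br}(\Theta)(f_{\max})=[\top_\Gamma\ltimes\Xi\ltimes\mathrm{sus}_\Theta](f_{\max})$$
$$=[\top_\Gamma\ltimes\Xi\ltimes\mathrm{sus}_\Theta]^{\downarrow}(f_{\max})=[\top_\Gamma\ltimes\Xi\ltimes\mathrm{sus}_\Theta]^{\downarrow}(f)$$
and swap out $\mathrm{sus}_\Theta$ for $Q^{-1}$ to show the other equality, and then we'd have equality of the three ultradistributions on all functions, so they're equal.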

For the equalities in the above equation, the first one arose because of Proposition 2.4 (bridge transforms are always downwards closed) and Proposition 3.1 (downwards-closed things let you swap out f for fmax and it doesn't affect the value). The second equality arose because fmax is a monotone function and by assumption, we have equality for monotone functions. The third equality would arise because taking the downwards closure doesn't affect the expectation value of monotone functions. If you add a bunch of contributions made by measure flowing down, that's just strictly worse from the perspective of a monotone function and doesn't change expectation value. And the fourth equality arises from Proposition 3.1 again.

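So, we just need to prove the following three inequalities, for monotone functions $f$: $$\mathrm{Br}(\Theta)(f)\le[\top_\Gamma\ltimes\Xi\ltimes\mathrm{sus}_\Theta](f)\le[\top_\Gamma\ltimes\Xi\ltimes Q^{-1}](f)\le\mathrm{Br}(\Theta)(f)$$ The first one is easily addressable by Proposition 2.7. By Proposition 2.7 and the definition of $\Theta$, we have $$\mathrm{Br}(\Theta)\subseteq(\Theta\ltimes\mathrm{sus}_\Theta)^{\downarrow}=[\top_\Gamma\ltimes\Xi\ltimes\mathrm{sus}_\Theta]^{\downarrow}$$ And so, for monotone functions $f$ (where taking the downwards closure doesn't change the expectation value), we have $$\mathrm{Br}(\Theta)(f)\le[\top_\Gamma\ltimes\Xi\ltimes\mathrm{sus}_\Theta](f)$$ Done.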

Now to show our second inequality. (⊤Γ⋉Ξ⋉susΘ)(λyαx.f(y,x,α)) =(⊤Γ⋉Ξ)(λyx.δsusΘ(x)(λα.f(y,x,α))) =(⊤Γ⋉Ξ)(λyx.f(y,x,susΘ(x))) Unpack the definition of the set =(⊤Γ⋉Ξ)(λyx.f(y,x,{y′|(y′,x)∈supp Θ})) Unpack the definition of Θ =(⊤Γ⋉Ξ)(λyx.f(y,x,{y′|(y′,x)∈supp ⊤Γ⋉Ξ})) The condition (y′,x)∈supp ⊤Γ⋉Ξ is equivalent to x∈supp Ξ(y′). After all, if x∈supp Ξ(y′), the distribution δy′ lies in ⊤Γ, so δy′⋉Ξ would certify that (y′,x)∈supp ⊤Γ⋉Ξ. And if x∉supp Ξ(y′), then no matter the distribution in ⊤Γ or kernel selected from Ξ, if y′ gets picked, then the kernel selected from Ξ isn't going to be making x along with it. Since we have that iff characterization, we have =(⊤Γ⋉Ξ)(λyx.f(y,x,{y′|x∈supp Ξ(y′)})) Ξ(y′) is the union of a bunch of k(y′) for k∈Π (and convex hull), so its support is equal to the union of the supports for the k(y′). =(⊤Γ⋉Ξ)(λyx.f(y,x,{y′|x∈⋃k∈Πsupp k(y′)})) Then, since each k is a PoCK over Q, k(y′) is the restriction of some measure ϖk to the set Q(y), which will be written as χQ(y′)ϖk. =(⊤Γ⋉Ξ)(λyx.f(y,x,{y′|x∈⋃k∈Πsupp (χQ(y′)ϖk)})) And now we're about to get an inequality. f is monotone, so making the associated set bigger (easier to fulfill the defining condition) should always increase the value of f, and by monotonicity, increase the expectation value, so we get ≤(⊤Γ⋉Ξ)(λyx.f(y,x,{y′|x∈Q(y′)})) Then restate =(⊤Γ⋉Ξ)(λyx.f(y,x,{y′|(x,y′)∈Q})) =(⊤Γ⋉Ξ)(λyx.f(y,x,Q−1(x))) And pack back up as a semidirect product. =(⊤Γ⋉Ξ)(λyx.δQ−1(x)(λα.f(y,x,α))) =(⊤Γ⋉Ξ⋉Q−1)(λyαx.f(y,x,α)) And we have our second ≤ inequality established!

Now, onto the third inequality. (⊤Γ⋉Ξ⋉Q−1)(λyαx.f(y,x,α)) Unpack the semidirect products =⊤Γ(λy.Ξ(y)(λx.δQ−1(x)(λα.f(y,x,α)))) And what top means =maxy∈ΓΞ(y)(λx.δQ−1(x)(λα.f(y,x,α))) And as for Ξ... well, each Ξ(y) is the convex hull of the various k(y), for k∈Π. So, the expectation for Ξ(y) is the maximum expectation for the various k(y), so we can rewrite as =maxy∈Γmaxk∈Πk(y)(λx.δQ−1(x)(λα.f(y,x,α))) Pick a particular y∗ and k∗ that attain the maximal value =k∗(y∗)(λx.δQ−1(x)(λα.f(y∗,x,α))) Reexpress a little bit =δy∗(λy.k∗(y)(λx.δQ−1(x)(λα.f(y,x,α))) And pack this back up as a semidirect product =(δy∗⋉k∗⋉Q−1)(λyαx.f(y,x,α)) And then we'll be showing that this contribution lies in Br(Θ). Once we've done that, we can go ≤maxθ′∈Br(Θ)θ′(λyαx.f(y,x,α)) =Br(Θ)(λyαx.f(y,x,α)) And we'd be done, having proven the third inequality and the last one to finish up the proof. So, now our proof target switches to showing that (δy∗⋉k∗⋉Q−1)∈Br(Θ). We can show this if we show the support condition and the endofunction condition.

For the support condition, we have (δy∗⋉k∗⋉Q−1)(λyαx.χy∉α) =δy∗(λy.k∗(y)(λx.δQ−1(x)(λα.χy∉α))) =δy∗(λy.k∗(y)(λx.χy∉Q−1(x))) =k∗(y∗)(λx.χy∗∉Q−1(x)) And then we use that the k∗(y∗) are all of the form "take this measure, restrict it to Q(y∗)", to get =(χQ(y∗)ϖk∗)(λx.χy∗∉Q−1(x)) =ϖk∗(λx.χx∈Q(y∗)χy∗∉Q−1(x)) Unpacking the definitions, we get =ϖk∗(λx.χ(x,y∗)∈Qχ(x,y∗)∉Q)=0 And so, this contribution is indeed supported on (y,α) pairs s.t. y∈α.

Now for the endofunction condition. As usual, fix an s and a g. (δy∗⋉k∗⋉Q−1)(λyαx.χs(y)∈αg(s(y),x)) Unpack the semidirect product =δy∗(λy.k∗(y)(λx.δQ−1(x)(λα.χs(y)∈αg(s(y),x)))) Plug in the dirac-deltas =k∗(y∗)(λx.χs(y∗)∈Q−1(x)g(s(y∗),x)) Reexpress the set membership criterion a bit =k∗(y∗)(λx.χx∈Q(s(y∗))g(s(y∗),x)) And the contribution at the start =(χQ(y∗)ϖk∗)(λx.χx∈Q(s(y∗))g(s(y∗),x)) Distribute it in as an indicator function. =ϖk∗(λx.χx∈Q(y∗)χx∈Q(s(y∗))g(s(y∗),x)) Pull the other indicator function out. =(χQ(s(y∗))ϖk∗)(λx.χx∈Q(y∗)g(s(y∗),x)) Rewrite with k∗ =k∗(s(y∗))(λx.χx∈Q(y∗)g(s(y∗),x)) Use an inequality to get rid of the indicator function ≤k∗(s(y∗))(λx.g(s(y∗),x)) Rewrite it a bit =δs(y∗)(λy.k∗(y)(λx.g(y,x))) Swap out k∗(y) for Ξ(y), the latter is larger ≤δs(y∗)(λy.Ξ(y)(λx.g(y,x))) Swap out δs(y∗) for ⊤Γ, the latter is larger ≤⊤Γ(λy.Ξ(y)(λx.g(y,x))) =(⊤Γ⋉Ξ)(λyx.g(y,x)) Abbreviate =Θ(λyx.g(y,x)) And bam, endofunction condition is shown, the entire proof goes through now. ■

Corollary 4.3: Suppose that for any d∈D and π:H→A s.t. d∈supp W(π), it holds that dCπ. That is, the observations W predicts to receive from the computer are consistent with the chosen policy. Let L:D→R be a Cartesian loss function and π:H→A a policy. Then, (prelΓBr(ΘW)∩Cπfair)(Lphys)=W(π;L)

I'm going to be proceeding very cautiously here. First off, make our π value visually distinct by swapping it out for π∗ (prelΓBr(ΘW)∩Cπ∗fair)(Lphys) Now, by the identifications we made earlier, we can identify Γ with AH, the space of policies. Using that to unpack the function a little bit, we have =(prelΓBr(ΘW)∩Cπ∗fair)(λπα.Lphys(π,α)) Now, we note that intersecting with top of a particular set is equivalent to updating on the indicator function for that set. Using definition 1.5 to unpack Cπ∗fair, we get =(prelΓBr(ΘW))(λπα.χ∀h∈Hπ,α:Gπ(h)=π∗(h)Lphys(π,α)) Apply that Gπ(h) is "what would the agent do on h if the agent is copying the behavior of π", so we can rephrase as: =(prelΓBr(ΘW))(λπα.χ∀h∈Hπ,α:π(h)=π∗(h)Lphys(π,α)) Pull off the projection, and use d for a destiny in D. =Br(ΘW)(λπαd.χ∀h∈Hπ,α:π(h)=π∗(h)Lphys(π,α)) At this point, we use that ΘW:=⊤Γ⋉W, and that W is a PUCK over Q0 and Proposition 4.1 to go =[⊤Γ⋉W⋉Q−10]↓(λπαd.χ∀h∈Hπ,α:π(h)=π∗(h)Lphys(π,α)) Before we can remove the downwards closure, we'll want to verify the function is monotone. So, we'll want to start unpacking the physicalist loss next. Applying definition 3.1, and using d′ instead of g to remember it's a destiny, we have =[⊤Γ⋉W⋉Q−10]↓(λπαd.χ∀h∈Hπ,α:π(h)=π∗(h)minha:ha∈Xπ,αmaxd′:ha⊑d′L(d′)) Next up is unpacking Xπ,α. Using definition 3.1, it's =[⊤Γ⋉W⋉Q−10]↓(λπαd.χ∀h∈Hπ,α:π(h)=π∗(h) minha:ha∈Hπ,α×A∧(∀π′∈α:Gπ′(h)=a)maxd′′:ha⊑d′L(d′)) At this point, we can, again, treat Gπ′(h) the same as π′(h). =[⊤Γ⋉W⋉Q−10]↓(λπαd.χ∀h∈Hπ,α:π(h)=π∗(h) minha:ha∈Hπ,α×A∧(∀π′∈α:π′(h)=a)maxd′:ha⊑d′′L(d′)) And now we need to take a moment to show that Hπ,α gets smaller when α gets larger. Applying definition 1.5, the event h∈Hπ,α unpacks as (∀h′a⊏h,π′∈α:Gπ′(h′)=a)∧(∃d′:h⊏d′∧d′Cπ) Now, if α becomes a larger set, then it gets harder for the first condition to be fulfilled, so the set Hπ,α shrinks. 
Now, since this happens, it means that if α gets bigger, it gets more difficult for the prerequisite of the implication in the indicator function to be fulfilled, so the implication is more likely to hold. Further, the minimization is taking place over a smaller set, so the loss goes up. So our function is monotone in α, and we can remove the downwards closure. =(⊤Γ⋉W⋉Q−10)(λπαd.χ∀h∈Hπ,α:π(h)=π∗(h) minha:ha∈Hπ,α×A∧(∀π′∈α:π′(h)=a)maxd′:ha⊑d′′L(d′)) Unpacking the semidirect product, it is =⊤Γ(λπ.W(π)(λd.δQ−10(d)(λα.χ∀h∈Hπ,α:π(h)=π∗(h) minha:ha∈Hπ,α×A∧(∀π′∈α:π′(h)=a)maxd′:ha⊑d′L(d′)))) Substituting in the dirac-delta everywhere that α is, we get =⊤Γ(λπ.W(π)(λd.χ∀h∈Hπ,Q−10(d):π(h)=π∗(h) minha:ha∈Hπ,Q−10(d)×A∧(∀π′∈Q−10(d):π′(h)=a)maxd′:ha⊑d′L(d′))) Now, Q−10(d) is the set of policies π′ s.t. π′Q0d. The "this policy is consistent with this destiny" relation. Also let's swap out ⊤Γ for maximization =maxπW(π)(λd.χ∀h∈Hπ,Q−10(d):π(h)=π∗(h) minha:ha∈Hπ,Q−10(d)×A∧(∀π′Q0d:π′(h)=a)maxd′:ha⊑d′L(d′)) Now, we're going to try to address that minimum, and show that the only ha that fulfill the conditions are exactly those ha⊑d. This requires showing that ha⊑d is a sufficient condition to fulfill the relevant properties, and then to show that ha⋢d implies a failure of one of the properties.

So, first up. Assume ha⊑d. Then, for any π′, dQ0π′ and ha⊑d must imply that π′(h)=a; that's what policy consistency means. Also, h∈Hπ,Q−10(d) unpacks as the two conditions ∀h′a′,π′:h′a′⊏h∧dQ0π′→π′(h′)=a′ and ∃d′:h⊏d′∧d′Cπ. As for the first condition, clearly, if π′ is consistent with d, it's consistent with ha because ha⊑d, and so it must be consistent with any prefix of ha, so the first condition holds.

For the second condition, d is a valid choice, because we assumed ha⊑d, and dCπ occurs always, because W(π) always being supported on d s.t. dCπ was one of our problem assumptions.

So, we have one implication direction down. Now for the reverse implication direction. Assume ha⋢d. Then there are two possibilities. The first possibility is that ha first diverges from d on an observation. The second possibility is that ha first diverges from d on an action.

For the first possibility, it's possible to make two policies which are consistent with d but also differ in their actions on history h, because h isn't a prefix of d if ha first differs from d on an observation.

For the second possibility, it's ruled out by either the condition for h∈Hπ,Q−10(d) that goes ∀h′a′,π′:h′a′⊏h∧π′Q0d→π′(h′)=a′ or the extra condition that ∀π′:π′Q0d→π′(h)=a applied to the first a-history prefix that deviates from d, because π′Q0d implies that π′(h′) must be the action which d dictates, not the action a′ that deviates from d.

And that establishes the other direction of the iff statement.

Thus, we can swap out our fancy minimization with just minimizing over the ha⊑d. =maxπW(π)(λd.χ∀h∈Hπ,Q−10(d):π(h)=π∗(h) minha:ha⊑dmaxd′:ha⊑d′L(d′)) This minimization is attained by selecting d itself. So then it turns into =maxπW(π)(λd.χ∀h∈Hπ,Q−10(d):π(h)=π∗(h)L(d)) At this point, what we'll do is show that an upper bound and lower bound on the value of this term is the same. Going from upper bound to lower bound, it's starting out with W(π∗)(λd.L(d)) At this point, we'll use that W is a PUCK, so there's a set E of environments e (PoCK's) that W is generated from, so we can go: =maxe∈Ee(π∗)(λd.L(d)) =maxπmaxe∈Ee(π∗)(λd.χdQ0πL(d)) =maxπmaxe∈E(χQ0(π∗)ϖe)(λd.χdQ0πL(d)) =maxπmaxe∈Eϖe(λd.χdQ0π∗χdQ0πL(d)) Now pull the indicator function back out. =maxπmaxe∈E(χQ0(π)ϖe)(λd.χdQ0π∗L(d)) =maxπmaxe∈Ee(π)(λd.χdQ0π∗L(d)) =maxπW(π)(λd.χdQ0π∗L(d)) Now we must show that this is a looser constraint than what was previously in our indicator function to proceed further. So our next order of business is showing that, certainly, ∀h∈Hπ,Q−10(d):π(h)=π∗(h)→dQ0π∗ Let h be one of the history prefixes of some d in the support of W(π). The two conditions for h∈Hπ,Q−10(d) are fulfilled, because they are ∀h′,a′,π′:h′a′⊏h∧dQ0π′→π′(h′)=a′ ∃d′:h⊏d′∧d′Cπ For the first condition, if h′a′⊏h, then h′a′⊏d, and so if π′ is consistent with d, it must take the same action in response to h′, the action that d commands, a′. So that's fulfilled. For the second condition, let d′ be d. h⊏d holds, and so dCπ holds certainly, because W(π) is supported on d s.t. dCπ.

So, for all d in the support of W(π), h⊏d→h∈Hπ,Q−10(d). Since we assumed our forall statement as prerequisite, this means that for all h⊏d, π(h)=π∗(h). And dQ0π means ∀ha⊑d:π(h)=a. Since π∗(h) mimics π(h) for all history prefixes of d, this means ∀ha⊑d:π∗(h)=a, ie dQ0π∗.

So, since this is a looser constraint, when we were previously at =maxπW(π)(λd.χdQ0π∗L(d)) we can proceed further to ≥maxπW(π)(λd.χ∀h∈Hπ,Q−10(d):π(h)=π∗(h)L(d)) Which is our value we're trying to sandwich. Now, at this point, plug in π∗ and get ≥W(π∗)(λd.χ∀h∈Hπ∗,Q−10(d):π∗(h)=π∗(h)L(d)) =W(π∗)(λd.L(d)) And bam, we've sandwiched our term between W(π∗)(L) on both sides, and so the result follows. ■

Proposition 4.2: Let X, Y and Z be finite sets, Q⊆Y×X a relation, κ:Y→Δc(X×Z) a Z-PoCK over Q and Θ∈□cY. Then, there exist μ∈ΔcZ and ϕ:Z×Y→ΔcX s.t. for all z, λy.ϕ(z,y) is a PoCK over Q s.t. κ(y)=μ⋉(λz.ϕ(z,y)). Moreover, suppose that (μ1,ϕ1) and (μ2,ϕ2) are both as above. Then, μ1⋉Θ⋉ϕ1=μ2⋉Θ⋉ϕ2

Our first order of business is establishing that there's even a $\mu$ and $\phi$ that has those effects at all. Here's a way to define them. $$\mu(z):=\max_{y'\in Y}\sum_{x'\in Q(y')}\varpi_\kappa(z,x')$$ Where $\varpi_\kappa$ is the measure on $Z\times X$ that $\kappa$ is associated with, i.e., $\kappa(y)=\chi_{Q(y)}\varpi_\kappa$ must be true for some $\varpi_\kappa$ because $\kappa$ is a $Z$-PoCK over $Q$. And, $\phi$ will be defined as: $$\phi(y,z)(x):=\frac{\chi_{x\in Q(y)}\varpi_\kappa(z,x)}{\max_{y'\in Y}\sum_{x'\in Q(y')}\varpi_\kappa(z,x')}$$ With those definitions in place, it's easy to establish that $\mu\ltimes(\lambda z.\phi(y,z))=\kappa(y)$. We can just fix an arbitrary $x,z$ pair and go $$\kappa(y)(x,z)=\chi_{x\in Q(y)}\varpi_\kappa(z,x)=\max_{y'\in Y}\sum_{x'\in Q(y')}\varpi_\kappa(z,x')\cdot\frac{\chi_{x\in Q(y)}\varpi_\kappa(z,x)}{\max_{y'\in Y}\sum_{x'\in Q(y')}\varpi_\kappa(z,x')}$$ $$=\mu(z)\cdot\phi(y,z)(x)=(\mu\ltimes(\lambda z.\phi(y,z)))(x,z)$$ And we're done with showing that such functions exist in the first place. Well, as long as we check that $\mu$ and $\phi$ behave accordingly. First off, $\mu$ being a contribution follows from $\varpi_\kappa$ being a $Z$-polycontribution, and the definition of $Z$-polycontributions. Also, to show that $(\lambda y.\phi(y,z))$ is a PoCK over $Q$, we need to show that there's a $\varpi_{\phi,z}$ s.t. $\phi(y,z)=\chi_{Q(y)}\varpi_{\phi,z}$, and that always has 1 or less measure.

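In order to do this, define $$\varpi_{\phi,z}:=\frac{1}{\max_{y'\in Y}\sum_{x'\in Q(y')}\varpi_\kappa(z,x')}\mathrm{pr}_X(\chi_{\{z\}\times X}\varpi_\kappa)$$ Clearly, you get $\phi(y,z)$ from restricting this to $Q(y)$, because we have $$(\chi_{Q(y)}\varpi_{\phi,z})(x)=\frac{1}{\max_{y'\in Y}\sum_{x'\in Q(y')}\varpi_\kappa(z,x')}\chi_{Q(y)}(\mathrm{pr}_X(\chi_{\{z\}\times X}\varpi_\kappa))(x)$$ $$=\frac{\chi_{Q(y)}(\mathrm{pr}_X(\chi_{\{z\}\times X}\varpi_\kappa))(x)}{\max_{y'\in Y}\sum_{x'\in Q(y')}\varpi_\kappa(z,x')}=\frac{\chi_{x\in Q(y)}\mathrm{pr}_X(\chi_{\{z\}\times X}\varpi_\kappa)(x)}{\max_{y'\in Y}\sum_{x'\in Q(y')}\varpi_\kappa(z,x')}$$ $$=\frac{\chi_{x\in Q(y)}\sum_{z'}(\chi_{\{z\}\times X}\varpi_\kappa)(x,z')}{\max_{y'\in Y}\sum_{x'\in Q(y')}\varpi_\kappa(z,x')}=\frac{\chi_{x\in Q(y)}\sum_{z'}\chi_{z'=z}\varpi_\kappa(x,z')}{\max_{y'\in Y}\sum_{x'\in Q(y')}\varpi_\kappa(z,x')}$$ $$=\frac{\chi_{x\in Q(y)}\varpi_\kappa(x,z)}{\max_{y'\in Y}\sum_{x'\in Q(y')}\varpi_\kappa(z,x')}=\phi(y,z)(x)$$ And we're done. And also, the measure is $\le 1$, because $$\sum_{x\in Q(y)}\varpi_{\phi,z}(x)=\frac{\sum_{x\in Q(y)}\mathrm{pr}_X(\chi_{\{z\}\times X}\varpi_\kappa)(x)}{\max_{y'\in Y}\sum_{x'\in Q(y')}\varpi_\kappa(z,x')}$$ and, skipping over a few routine steps, $$=\frac{\sum_{x\in Q(y)}\varpi_\kappa(x,z)}{\max_{y'\in Y}\sum_{x'\in Q(y')}\varpi_\kappa(x',z)}\le\frac{\sum_{x\in Q(y)}\varpi_\kappa(x,z)}{\sum_{x'\in Q(y)}\varpi_\kappa(x',z)}=1$$ And we're done, we figured out how to decompose $\kappa$ into $\mu$ and $\phi$.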
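The decomposition of $\kappa$ into $\mu$ and $\phi$ can be checked numerically on a small example. The following is only a sketch under stated assumptions: the function names (`decompose`, `kappa`), the dictionary encoding of $\varpi_\kappa$ with keys `(z, x)`, and the toy sets and masses are all hypothetical. The text does not say what to do when $\mu(z)=0$; the sketch sets $\phi$ to the zero measure there, which is an assumption.

```python
def decompose(varpi, Q, X, Y, Z):
    """Split varpi into mu (a measure on Z) and phi (a kernel), as defined above.

    varpi: dict mapping (z, x) to a non-negative mass (missing keys mean 0).
    Q:     dict mapping each y to the set Q(y) of x values related to it.
    """
    def mass(z, x):
        return varpi.get((z, x), 0.0)

    # mu(z) = max over y of the mass that varpi puts on Q(y) x {z}
    mu = {z: max(sum(mass(z, x) for x in Q[y]) for y in Y) for z in Z}

    def phi(y, z):
        if mu[z] == 0.0:
            # Assumption (not in the text): use the zero measure when mu(z) = 0.
            return {x: 0.0 for x in X}
        return {x: (mass(z, x) / mu[z] if x in Q[y] else 0.0) for x in X}

    return mu, phi

def kappa(varpi, Q, X, Z, y):
    """kappa(y): the restriction of varpi to Q(y), keyed by (x, z)."""
    return {(x, z): (varpi.get((z, x), 0.0) if x in Q[y] else 0.0)
            for x in X for z in Z}

# Hypothetical toy data.
X = ["x0", "x1", "x2"]
Y = ["y0", "y1"]
Z = ["z0", "z1"]
Q = {"y0": {"x0", "x1"}, "y1": {"x1", "x2"}}
VARPI = {("z0", "x0"): 0.2, ("z0", "x1"): 0.1, ("z0", "x2"): 0.3,
         ("z1", "x1"): 0.25, ("z1", "x2"): 0.05}

if __name__ == "__main__":
    mu, phi = decompose(VARPI, Q, X, Y, Z)
    for y in Y:
        k = kappa(VARPI, Q, X, Z, y)
        for z in Z:
            for x in X:
                # mu(z) * phi(y, z)(x) should reproduce kappa(y)(x, z)
                assert abs(mu[z] * phi(y, z)[x] - k[(x, z)]) < 1e-12
    print(mu)
```

Each `phi(y, z)` should have total mass at most 1, matching the bound derived above, and the sum of `mu` over `Z` stays at most 1 whenever `varpi` is a $Z$-polycontribution.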

Now for the second half of the proof. The first thing to establish is that, for all y,z, we have μ1(z)⋅ϕ1(y,z)=μ2(z)⋅ϕ2(y,z). This occurs because, for all x, μ1(z)⋅ϕ1(y,z)(x)=(μ1⋉(λz.ϕ1(y,z)))(x,z)=κ(y)(x,z) And then by symmetry, the exact same holds for μ2 and ϕ2, both were declared to be equal to κ. Now that this result is in place, we can begin. (μ1⋉Θ⋉ϕ1)(λxyz.f(x,y,z)) =μ1(λz.Θ(λy.ϕ1(y,z)(λx.f(x,y,z)))) Now, we do something odd. We can reexpress this as =C(λz.μ1(z)⋅Θ(λy.ϕ1(y)(λx.f(x,y,z)))) Basically, what's going on here is that we can swap out the contribution μ1 for the counting measure C (1 measure on each distinct point) and just scale down the expectation values accordingly. It's pretty much the same way that you can think of ∑xμ(x)f(x) (expectation of f w.r.t μ) as ∑x1⋅μ(x)f(x) (expectation of μ⋅f w.r.t the counting measure). Now, since Θ is homogenous, we can move constants in or out of it, to get =C(λz.Θ(λy.μ1(z)⋅ϕ1(y,z)(λx.f(x,y,z)))) Now, at this point, we can use that μ1(z)⋅ϕ1(y,z)=μ2(z)⋅ϕ2(y,z), to get =C(λz.Θ(λy.μ2(z)⋅ϕ2(y,z)(λx.f(x,y,z)))) And just back up and reverse everything. =C(λz.μ2(z)Θ(λy.ϕ2(y,z)(λx.f(x,y,z)))) =μ2(λz.Θ(λy.ϕ2(y)(λx.f(x,y,z)))) =(μ2⋉Θ⋉ϕ2)(λxyz.f(x,y,z)) And we're done! ■

Lemma 4: Let X, Y and Z be finite sets, Q⊆Y×X a relation, Ξ1,Ξ2:Y→□c(X×Z) Z-PUCKs over Q, Θ∈□cY and p∈[0,1]. Then, pΞ1+(1−p)Ξ2 is also a Z-PUCK over Q, and Θ∗(pΞ1+(1−p)Ξ2)⊆p(Θ∗Ξ1)+(1−p)(Θ∗Ξ2)

Our first order of business is establishing that the mix of Z-PUCK's over Q is a Z-PUCK over Q. Here's what we'll do. We'll define a family of kernels, show that they're all Z-PoCK's, and that said family makes a Z-PUCK that's equal to the mix of Ξ1 and Ξ2.

So, let Π1 be the set of Z-PoCK's associated with Ξ1, and Π2 be the set of Z-PoCK's associated with Ξ2. Elements of these sets are κ1 and κ2. Define Π as {pκ1+(1−p)κ2|κ1∈Π1,κ2∈Π2}.

By Definition 4.5, in order to establish that these are $Z$-PoCK's over $Q$, we need to make an appropriate choice of $\varpi$. In particular, the $\varpi$ associated with $\kappa=p\kappa_1+(1-p)\kappa_2$ is $\varpi_\kappa:=p\varpi_{\kappa_1}+(1-p)\varpi_{\kappa_2}$. It fulfills Definition 4.5 because $$\kappa(y)(x,z)=(p\kappa_1+(1-p)\kappa_2)(y)(x,z)=p\kappa_1(y)(x,z)+(1-p)\kappa_2(y)(x,z)$$ $$=p(\chi_{Q(y)}\varpi_{\kappa_1})(x,z)+(1-p)(\chi_{Q(y)}\varpi_{\kappa_2})(x,z)=(p\chi_{Q(y)}\varpi_{\kappa_1}+(1-p)\chi_{Q(y)}\varpi_{\kappa_2})(x,z)$$ $$=\chi_{Q(y)}(p\varpi_{\kappa_1}+(1-p)\varpi_{\kappa_2})(x,z)=\chi_{Q(y)}\varpi_\kappa(x,z)$$ By unpacking our definition, using how mixes of kernels work, applying Definition 4.5 for $\kappa_1$ and $\kappa_2$, and then just doing some simple regrouping and packing the definition back up, we get our result.

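But wait, we still need to show that $\varpi_\kappa$ is a $Z$-polycontribution on $Q$. Again, this isn't too hard to show, with Definition 4.4. $$\sum_{z\in Z}\max_{y\in Y}\sum_{x\in Q(y)}\varpi_\kappa(x,z)=\sum_{z\in Z}\max_{y\in Y}\sum_{x\in Q(y)}(p\varpi_{\kappa_1}+(1-p)\varpi_{\kappa_2})(x,z)$$ $$=\sum_{z\in Z}\max_{y\in Y}\sum_{x\in Q(y)}(p\varpi_{\kappa_1}(x,z)+(1-p)\varpi_{\kappa_2}(x,z))$$ $$=\sum_{z\in Z}\max_{y\in Y}\left(p\sum_{x\in Q(y)}\varpi_{\kappa_1}(x,z)+(1-p)\sum_{x\in Q(y)}\varpi_{\kappa_2}(x,z)\right)$$ $$\le\sum_{z\in Z}\left(p\max_{y\in Y}\sum_{x\in Q(y)}\varpi_{\kappa_1}(x,z)+(1-p)\max_{y\in Y}\sum_{x\in Q(y)}\varpi_{\kappa_2}(x,z)\right)$$ $$=p\sum_{z\in Z}\max_{y\in Y}\sum_{x\in Q(y)}\varpi_{\kappa_1}(x,z)+(1-p)\sum_{z\in Z}\max_{y\in Y}\sum_{x\in Q(y)}\varpi_{\kappa_2}(x,z)\le p\cdot 1+(1-p)\cdot 1=1$$ And bam, we have our inequality demonstrated, everything works out. Now we just need to show that this family of $Z$-PoCK's makes the $Z$-PUCK that's the mixture of the two. We'll establish equality by showing equality for all functions and all $y$. $$\Xi(y)(f)=\max_{\kappa\in\Pi}\kappa(y)(f)=\max_{\kappa\in\{p\kappa_1+(1-p)\kappa_2\,|\,\kappa_1\in\Pi_1,\kappa_2\in\Pi_2\}}\kappa(y)(f)$$ $$=\max_{\kappa_1\in\Pi_1,\kappa_2\in\Pi_2}(p\kappa_1+(1-p)\kappa_2)(y)(f)=\max_{\kappa_1\in\Pi_1,\kappa_2\in\Pi_2}\left(p\kappa_1(y)(f)+(1-p)\kappa_2(y)(f)\right)$$ $$=p\max_{\kappa_1\in\Pi_1}\kappa_1(y)(f)+(1-p)\max_{\kappa_2\in\Pi_2}\kappa_2(y)(f)=p\Xi_1(y)(f)+(1-p)\Xi_2(y)(f)$$ $$=(p\Xi_1+(1-p)\Xi_2)(y)(f)$$ Done, we've shown equality of the $Z$-PUCK with the mixture of other $Z$-PUCKs, establishing that the mixture of $Z$-PUCKs is a $Z$-PUCK.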
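The inequality showing that a mixture of $Z$-polycontributions is again a $Z$-polycontribution can be spot-checked numerically. This is a small sketch, not part of the proof: `poly_norm` and `mix` are hypothetical helper names, measures are encoded as dictionaries keyed by `(z, x)` (an arbitrary choice of argument order), and the toy data is made up.

```python
def poly_norm(varpi, Q, Y, Z):
    """Sum over z of the max over y of the mass varpi puts on Q(y) x {z}."""
    return sum(
        max(sum(varpi.get((z, x), 0.0) for x in Q[y]) for y in Y)
        for z in Z
    )

def mix(p, varpi1, varpi2):
    """Pointwise mixture p * varpi1 + (1 - p) * varpi2."""
    keys = set(varpi1) | set(varpi2)
    return {k: p * varpi1.get(k, 0.0) + (1 - p) * varpi2.get(k, 0.0) for k in keys}

# Hypothetical toy data.
Y = ["y0", "y1"]
Z = ["z0", "z1"]
Q = {"y0": {"x0", "x1"}, "y1": {"x1", "x2"}}
VARPI_1 = {("z0", "x0"): 0.2, ("z0", "x1"): 0.1, ("z0", "x2"): 0.3,
           ("z1", "x1"): 0.25, ("z1", "x2"): 0.05}
VARPI_2 = {("z0", "x0"): 0.5, ("z1", "x2"): 0.4}
P = 0.25

if __name__ == "__main__":
    mixed = mix(P, VARPI_1, VARPI_2)
    lhs = poly_norm(mixed, Q, Y, Z)
    rhs = P * poly_norm(VARPI_1, Q, Y, Z) + (1 - P) * poly_norm(VARPI_2, Q, Y, Z)
    print(lhs, rhs)  # lhs should not exceed rhs, and rhs should not exceed 1
```

In this example the inequality is strict, because the two measures attain their maxima at different $y$ for the same $z$; that is exactly the step where the derivation moves the max inside the mixture.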

That leaves establishing our relevant inequality. But before we do that, we'll be wanting a nice handy form for that asterisk operator to manipulate things with. Given some κ that's a Z-PUCK over Q, remember from the previous proof that a valid choice for μ and ϕ to break κ down is μ(z):=maxy′∈Y∑x′∈Q(y′)ϖκ(z,x′) and, abbreviating things a little bit, we have ϕ(y,z)(x):=χx∈Q(y)ϖκ(z,x)μ(z) So, we can get a pleasant-to-manipulate form for Θ∗κ as follows. (Θ∗κ)(λxyz.f(x,y,z))=(μ⋉Θ⋉ϕ)(λxyz.f(x,y,z)) =μ(λz.Θ(λy.ϕ(y,z)(λx.f(x,y,z)))) And proceed further =∑z∈Zμ(z)⋅Θ(λy.ϕ(y,z)(λx.f(x,y,z))) =∑z∈Zμ(z)⋅Θ(λy.∑x∈Xϕ(y,z)(x)⋅f(x,y,z)) =∑z∈Zμ(z)⋅Θ(λy.∑x∈Xχx∈Q(y)ϖκ(z,x)μ(z)⋅f(x,y,z)) And then we move the constant into Θ since it's homogenous, and then into the sum, and it cancels out with the fraction. =∑z∈ZΘ(λy.∑x∈Xχx∈Q(y)⋅ϖκ(z,x)⋅f(x,y,z)) =∑z∈ZΘ(λy.∑x∈Q(y)ϖκ(z,x)⋅f(x,y,z)) =∑z∈ZΘ(λy.χQ(y)×{z}ϖκ(λx′z′.f(x′,y,z′))) This general form will be used whenever we need to unpack Θ∗κ. Now, let's get started on the proof of our subset inclusion thingy. As usual, Π will be the set {pκ1+(1−p)κ2|κ1∈Π1,κ2∈Π2}, and as we've shown, that's the set of Z-PoCK's associated with pΞ1+(1−p)Ξ2. Also, as we've already shown, the associated Z-polycontribution ϖκ for κ=pκ1+(1−p)κ2 is pϖκ1+(1−p)ϖκ2. This will be implicitly used in the following. (Θ∗(pΞ1+(1−p)Ξ2))(λxyz.f(x,y,z))=maxκ∈Π(Θ∗κ)(λxyz.f(x,y,z)) Now we use our preferred unpacking of how that asterisk operator works. =maxκ∈Π∑z∈ZΘ(λy.χQ(y)×{z}ϖκ(λx′z′.f(x′,y,z′))) And unpack κ and ϖκ appropriately. =maxκ1∈Π1,κ2∈Π2∑z∈ZΘ(λy.χQ(y)×{z}(pϖκ1+(1−p)ϖκ2)(λx′z′.f(x′,y,z′))) =maxκ1∈Π1,κ2∈Π2∑z∈ZΘ(λy.pχQ(y)×{z}ϖκ1(λx′z′.f(x′,y,z′)) +(1−p)χQ(y)×{z}ϖκ2(λx′z′.f(x′,y,z′))) At this point, we use convexity of Θ, since it's an ultradistribution. 
≤maxκ1∈Π1,κ2∈Π2∑z∈Z(pΘ(λy.(χQ(y)×{z}ϖκ1)(λx′z′.f(x′,y,z′))) +(1−p)Θ(λy.(χQ(y)×{z}ϖκ2)(λx′z′.f(x′,y,z′)))) =maxκ1∈Π1,κ2∈Π2(p∑z∈ZΘ(λy.(χQ(y)×{z}ϖκ1)(λx′z′.f(x′,y,z′))) +(1−p)∑z∈ZΘ(λy.(χQ(y)×{z}ϖκ2)(λx′z′.f(x′,y,z′)))) At this point, you can pack up things. =maxκ1∈Π1,κ2∈Π2p(Θ∗κ1)(λxyz.f(x,y,z))+(1−p)(Θ∗κ2)(λxyz.f(x,y,z)) =pmaxκ1∈Π1(Θ∗κ1)(λxyz.f(x,y,z))+(1−p)maxκ2∈Π2(Θ∗κ2)(λxyz.f(x,y,z)) =p(Θ∗Ξ1)(λxyz.f(x,y,z))+(1−p)(Θ∗Ξ2)(λxyz.f(x,y,z)) =(p(Θ∗Ξ1)+(1−p)(Θ∗Ξ2))(λxyz.f(x,y,z)) Done! ■

Proposition 4.3: Let X, Y and Z be finite sets, Q⊆Y×X a relation, κ1,κ2:Y→Δc(X×Z) Z-PoCKs over Q, Θ∈□cY and p∈[0,1]. Then, pκ1+(1−p)κ2 is also a Z-PoCK over Q, and Θ∗(pκ1+(1−p)κ2)⊆p(Θ∗κ1)+(1−p)(Θ∗κ2)

Use Lemma 4, along with Z-PoCKs being a special case of Z-PUCKs.

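Proposition 4.4: Let $X$, $Y$ and $Z$ be finite sets, $Q\subseteq Y\times X$ a relation and $\Xi:Y\to\square^c(X\times Z)$ a $Z$-PUCK over $Q$. Denote $\Theta:=\top_Y*\Xi$. Define $\beta_0,\beta_1:Z\times Y\times X\to 2^{Z\times Y}$ by $\beta_0(z,y,x):=\{z\}\times Q^{-1}(x)$, $\beta_1(z,y,x):=Z\times Q^{-1}(x)$. Then $$(\Theta\ltimes\beta_0)^{\downarrow}\subseteq\mathrm{Br}(\Theta)\subseteq(\Theta\ltimes\beta_1)^{\downarrow}$$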

Proof: As usual, when establishing inequalities with downwards closures, we only have to verify the result for monotone functions. So, we may assume that f is monotone, and attempt to show that (Θ⋉β0)(λxyzα.f(x,y,z,α))≤BrZ×Y(Θ)(λxyzα.f(x,y,z,α)) ≤(Θ⋉β1)(λxyzα.f(x,y,z,α)) Remember, bridge transforms cash out as a maximum over contributions, so to show the first inequality, we'll need to build a contribution that matches or exceeds that first term, and that lands in the bridge transform of Θ. For the second inequality, it's considerably easier, we just use our Lemma 2 to figure out what sort of sets the bridge transform is supported on, swap out the sets it's supported on for a bigger set upper bound, and bam, monotonicity of f takes over from there. From there, it's easy to show the second inequality. Let's unpack that first thing (Θ⋉β0)(λxyzα.f(x,y,z,α)) =Θ(λxyz.δβ0(x,z)(λα.f(x,y,z,α))) =Θ(λxyz.f(x,y,z,β0(x,z))) And at this point we unpack what Θ is. =(⊤Y∗Ξ)(λxyz.f(x,y,z,β0(x,z))) And the Ξ. =maxκ∈Π(⊤Y∗κ)(λxyz.f(x,y,z,β0(x,z))) And then, κ can be broken down into some μκ and ϕκ, and that goes on both sides of ⊤Y as our previous proposition shows. =maxκ∈Π(μκ⋉⊤Y⋉ϕκ)(λxyz.f(x,y,z,β0(x,z))) =maxκ∈Πμκ(λz.⊤Y(λy.ϕκ(y,z)(λx.f(x,y,z,β0(x,z))))) =maxκ∈Πμκ(λz.maxyϕκ(y,z)(λx.f(x,y,z,β0(x,z)))) Now we can start filling in some data. There's a maximizing κ∗, so we can substitute that in. That gives us a canonical choice for what μκ∗ and ϕκ∗ are. Making that substitution, =μκ∗(λz.maxyϕκ∗(y,z)(λx.f(x,y,z,β0(x,z)))) And then, let d:Z→Y be the function mapping each particular z to the y which maximizes ϕκ∗(y,z)(λx.f(x,y,z,β0(x,z))). This lets us reexpress things as =μκ∗(λz.ϕκ∗(d(z),z)(λx.f(x,d(z),z,β0(x,z)))) And now, we can start unpacking things a bit. =μκ∗(λz.δd(z)(λy.ϕκ∗(y,z)(λx.f(x,y,z,β0(x,z))))) =μκ∗(λz.δd(z)(λy.ϕκ∗(y,z)(λx.δβ0(x,z)(λα.f(x,y,z,α))))) And now we can write things as just a giant semidirect product. 
=(μκ∗⋉d⋉ϕκ∗⋉β0)(λxyzα.f(x,y,z,α)) Now we'll show that this particular contribution lies in Br(Θ).

Checking the support condition, we want to check for sure that y,z∈α, ie, the event y,z∉α has measure 0. Let's begin. (μκ∗⋉d⋉ϕκ∗⋉β0)(λxyzα.χy,z∉α) =μκ∗(λz.δd(z)(λy.ϕκ∗(y,z)(λx.δβ0(x,z)(λα.χy,z∉α)))) Substitute in the dirac-deltas. =μκ∗(λz.ϕκ∗(d(z),z)(λx.χd(z),z∉β0(x,z))) Unpack what β0(x,z) is. =μκ∗(λz.ϕκ∗(d(z),z)(λx.χd(z),z∉Q−1(x)×{z})) Now, z∈{z} always occurs, so that indicator function is the same as just testing whether d(z)∈Q−1(x). =μκ∗(λz.ϕκ∗(d(z),z)(λx.χd(z)∉Q−1(x))) Rephrasing things a little bit, =μκ∗(λz.ϕκ∗(d(z),z)(λx.χx∉Q(d(z)))) Then, from proposition 4.2, we remember that λy.ϕκ∗(y,z) is a PoCK over Q. Ie, for any particular y, ϕκ∗(y,z) looks like a particular measure (ϖκ∗,z) restricted to Q(y). So, in particular, ϕκ∗(d(z),z) must be supported over Q(d(z)). Put another way, with full measure, x∈Q(d(z)). So, this event failing has 0 measure. =μκ∗(λz.0)=0 And we're done with that support condition.

Now to show the endofunction condition. As usual, we'll let s:Y×Z→Y×Z, and let g:X×Y×Z→[0,1]. Actually, for conceptual clarity, since s:Y×Z→Y×Z can be viewed as a pair of functions sY:Y×Z→Y and sZ:Y×Z→Z, we'll be using that formulation in our equation. (μκ∗⋉d⋉ϕκ∗⋉β0)(λxyzα.χsY(y,z),sZ(y,z)∈αg(sY(y,z),sZ(y,z),x)) =μκ∗(λz.δd(z)(λy.ϕκ∗(y,z)(λx.δβ0(x,z)(λα.χsY(y,z),sZ(y,z)∈αg(sY(y,z),sZ(y,z),x))))) As usual, we'll substitute in our dirac-deltas to simplify things. =μκ∗(λz.ϕκ∗(d(z),z)(λx.χsY(d(z),z),sZ(d(z),z)∈β0(x,z)g(sY(d(z),z),sZ(d(z),z),x))) Substitute in what β0(x,z) is. =μκ∗(λz.ϕκ∗(d(z),z)(λx.χsY(d(z),z),sZ(d(z),z)∈Q−1(x)×{z}g(sY(d(z),z),sZ(d(z),z),x))) Now, if that "this pair of points lies in this set" indicator function goes off, then sZ(d(z),z)=z. So, we can substitute that into the g term afterwards. And then get a ≤ inequality by making the indicator function less strict. =μκ∗(λz.ϕκ∗(d(z),z)(λx.χsY(d(z),z),sZ(d(z),z)∈Q−1(x)×{z}g(sY(d(z),z),z,x))) ≤μκ∗(λz.ϕκ∗(d(z),z)(λx.χsY(d(z),z)∈Q−1(x)g(sY(d(z),z),z,x))) And reexpress the indicator function a little bit =μκ∗(λz.ϕκ∗(d(z),z)(λx.χx∈Q(sY(d(z),z))g(sY(d(z),z),z,x))) At this point, we can use that ϕκ∗(y,z) is χQ(y)ϖϕκ∗,z (ie, fixing z and varying y it just looks like you're taking one measure and conditioning on various Q(y)), so reexpress things as =μκ∗(λz.(χQ(d(z))ϖϕκ∗,z)(λx.χx∈Q(sY(d(z),z))g(sY(d(z),z),z,x))) And then, view the indicator function as just more conditioning. =μκ∗(λz.(χQ(d(z))∩Q(sY(d(z),z))ϖϕκ∗,z)(λx.g(sY(d(z),z),z,x))) And then, relax about what you're conditioning on. 
≤μκ∗(λz.χQ(sY(d(z),z))ϖϕκ∗,z(λx.g(sY(d(z),z),z,x))) Rewrite it as a kernel again =μκ∗(λz.ϕκ∗(sY(d(z),z),z)(λx.g(sY(d(z),z),z,x))) Pull out the dirac-delta =μκ∗(λz.δsY(d(z),z)(λy.ϕκ∗(y,z)(λx.g(y,z,x)))) Throw one more inequality at it ≤μκ∗(λz.maxyϕκ∗(y,z)(λx.g(y,z,x)))) Write it as top =μκ∗(λz.⊤Y(λy.ϕκ∗(y,z)(λx.g(y,z,x)))) Write as a semidirect product =(μκ∗⋉⊤Y⋉ϕκ∗)(λyzx.g(y,z,x)) Reexpress =(⊤Y∗κ∗)(λyzx.g(y,z,x)) ≤maxκ∈Π(⊤Y∗κ)(λyzx.g(y,z,x)) =(⊤Y∗Ξ)(λyzx.g(y,z,x)) =Θ(λyzx.g(y,z,x)) And we're done! endofunction condition shown. Our relevant contribution is in Br(Θ). Let's see, where were we... ah right, we had shown that for all monotone f, (Θ⋉β0)(λxyzα.f(x,y,z,α)) =(μκ∗⋉d⋉ϕκ∗⋉β0)(λxyzα.f(x,y,z,α)) For some choice of d and κ∗. We know this is in Br(Θ), so we get ≤maxθ∈Br(Θ)θ(λyzαx.f(x,y,z,α)) =Br(Θ)(λyzαx.f(x,y,z,α)) And we're done! One inequality done. That just leaves showing the second inequality, where β1(x)=Z×Q−1(x). It's actually not too bad to show. Start with Br(Θ)(λxyαz.f(x,y,z,α)) =maxθ∈Br(Θ)θ(λyzαx.f(x,y,z,α)) And then, we recall our Lemma 2, that if Θ had its support entirely on x,y,z tuples where y,z in h(x) (for some h:X→2Y×Z), then all the θ∈Br(Θ) would be supported on (x,α) pairs where α⊆h(x). And then, swapping out α for h(x), by monotonicity of f, produces a larger value.

To invoke this argument, our choice of h will be β1, where β1(x)=Q−1(x)×Z. We do need to show that Θ is supported on such tuples. Θ(λxyz.χy,z∉Q−1(x)×Z)=Θ(λxyz.χy∉Q−1(x))=Θ(λxyz.χx∉Q(y)) =(⊤Y∗Ξ)(λxyz.χx∉Q(y))=maxκ∈Π(⊤Y∗κ)(λxyz.χx∉Q(y)) =maxκ∈Π(μκ⋉⊤Y⋉ϕκ)(λxyz.χx∉Q(y)) =maxκ∈Πμκ(λz.⊤Y(λy.ϕκ(y,z)(λx.χx∉Q(y)))) And then use that ϕκ(y,z)=χQ(y)ϖϕκ,z since it's a PoCK in Q, to get =maxκ∈Πμκ(λz.⊤Y(λy.(χQ(y)ϖϕκ,z)(λx.χx∉Q(y)))) Hm, we updated on a set, and are evaluating the indicator function for not being in the set. =maxκ∈Πμκ(λz.⊤Y(λy.0))=0 Ok, so this means we can invoke Lemma 2. We were previously at =maxθ∈Br(Θ)θ(λyzαx.f(x,y,z,α)) So now we can invoke monotonicity and go ≤maxθ∈Br(Θ)θ(λyzαx.f(x,y,z,β1(x))) And then invoke our endofunction property for the stuff in Br(Θ), letting s be the identity function (and also y,z∈α occurs always) to establish a uniform upper bound of ≤Θ(λxyzα.f(x,y,z,β1(x))) =Θ(λxyz.δβ1(x)(λα.f(x,y,z,α))) =(Θ⋉β1)(λxyzα.f(x,y,z,α)) And we're done! Second inequality demonstrated. ■

Corollary 4.5: Suppose that for any d∈D, z∈Γ1 and π:H→A s.t. (d,z)∈supp W(π), it holds that dCzπ. That is, the observations W predicts to receive from the computer are consistent with the chosen policy and W's beliefs about computations. Let L:D→R be a Cartesian loss function and π:H→A a policy. Define ~L:D×Γ1→R by ~L(h,z):=L(h). Then, (prelΓBr(ΘW)∩Cπfair)(Lphys)=W(π;~L)

To a large extent, this will follow the proof of the previous corollary. We'll use β for 2Γ and α for just the policy component.

I'm going to be proceeding very cautiously here. First off, make our π value special by swapping it out for π∗ (prelΓBr(ΘW)∩Cπ∗fair)(Lphys) Now, by the identifications we made earlier, we can identify Γ with AH×Γ1, the space of policies and computations. Using that to unpack the function a little bit, we have =(prelΓBr(ΘW)∩Cπ∗fair)(λπzβ.Lphys(π,z,β)) Now, we note that intersecting with top of a particular set is equivalent to updating on the indicator function for that set. Using definition 1.5 to unpack Cπ∗fair, we get =(prelΓBr(ΘW))(λπzβ.χ∀h∈Hπ,z,β:Gπ,z(h)=π∗(h)Lphys(π,z,β)) Throughout we'll be applying that Gπ,z(h) is "what would the agent do on h if the agent is copying the behavior of π" (remember, part of the math is "what does the agent do in response to this history" and π is our term for that chunk of the math), so we'll just always rephrase things like that and won't bother to say Gπ,z(h)=π(h) every time we do it. =(prelΓBr(ΘW))(λπzβ.χ∀h∈Hπ,z,β:π(h)=π∗(h)Lphys(π,z,β)) Pull off the projection, and use d for a destiny in D. =Br(ΘW)(λπzβd.χ∀h∈Hπ,z,β:π(h)=π∗(h)Lphys(π,z,β)) Applying definition 3.1, and using d′ instead of g to remember it's a destiny, we have =Br(ΘW)(λπzβd.χ∀h∈Hπ,z,β:π(h)=π∗(h)minha:ha∈Xπ,z,βmaxd′:ha⊑d′L(d′)) Next up is unpacking Xπ,z,β. Using definition 3.1, it's =Br(ΘW)(λπzβd.χ∀h∈Hπ,z,β:π(h)=π∗(h) minha:ha∈Hπ,z,β×A∧(∀(π′,z′)∈β:π′(h)=a)maxd′:ha⊑d′L(d′)) Now we'll show that Hπ,z,β only depends on pr(β), ie, the projection of β from 2AH×Γ1 to 2AH, and that it gets smaller as β gets larger, so the function above is monotone. Let's use definition 1.5 to unpack the event h∈Hπ,z,β. (∀h′a′⊏h,(π′,z′)∈β:π′(h′)=a′)∧(∃d′:h⊏d′∧d′Czπ) It shouldn't be too hard to tell that β getting larger makes this set smaller, and that, since the z′ doesn't matter, this condition is really the same as (∀h′a′⊏h,π′∈pr(β):π′(h′)=a′)∧(∃d′:h⊏d′∧d′Czπ) So, we'll use pr(β) to remind us that our function only depends on the projection of β. 
=Br(ΘW)(λπzβd.χ∀h∈Hπ,z,pr(β):π(h)=π∗(h) minha:ha∈Hπ,z,pr(β)×A∧(∀π′∈pr(β):π′(h)=a)maxd′:ha⊑d′L(d′)) And now we can pull that out as a projection! We're now using α for our set of policies, instead of β for our set of policy/computation pairs. =pr(Br(ΘW))(λπzαd.χ∀h∈Hπ,z,α:π(h)=π∗(h) minha:ha∈Hπ,z,α×A∧(∀π′∈α:π′(h)=a)maxd′:ha⊑d′L(d′)) To proceed further, we're going to need to adapt our previous result which gets an upper and lower bound on the bridge transform of suitable Θ. Our first order of business is checking that α getting larger makes the function larger, which is easy to check since α getting larger cuts down on the options to minimize over (increasing value), and makes the antecedent of the implication harder to fulfill, which makes the implication as a whole easier to fulfill, so the indicator function is 1 more often. Now we can proceed further. Abstracting away a bit from our specific function, which will be swapped out for some f:AH×Γ1×D×2AH→[0,1] which is monotone in that last argument, we have (ΘW⋉β0)↓(λzπdβ.f(z,π,d,pr(β))) =(ΘW⋉β0)(λzπdβ.f(z,π,d,pr(β))) =ΘW(λzπd.f(z,π,d,pr(β0(z,d)))) =ΘW(λzπd.f(z,π,d,pr({z}×Q−10(d)))) =ΘW(λzπd.f(z,π,d,Q−10(d))) =ΘW(λzπd.f(z,π,d,pr(Z×Q−10(d)))) =ΘW(λzπd.f(z,π,d,pr(β1(d)))) =(ΘW⋉β1)(λzπdβ.f(z,π,d,pr(β))) =(ΘW⋉β1)↓(λzπdβ.f(z,π,d,pr(β))) Monotonicity was used to go back and forth between the downwards closure and the raw form. β0 and β1 are as they were in Proposition 4.4, and Q would be the relation on AH×D telling you whether a policy is consistent with a destiny. Now, by Proposition 4.4, since the bridge transform of ΘW is sandwiched between those two values, and they're both equal, we have pr(Br(ΘW))(λzπdα.f(z,π,d,α))=Br(ΘW)(λzπdβ.f(z,π,d,pr(β))) =ΘW(λzπd.f(z,π,d,Q−10(d)))=(ΘW⋉Q−10)(λzπdα.f(z,π,d,α)) The first equality was just relocating the projection. 
the second equality was from the bridge transform being sandwiched between two equal quantities, so it equals all the stuff on our previous big list of equalities (we went with the middle one). Then just express as a semidirect product, and you're done. Applying this to our previous point of =pr(Br(ΘW))(λπzαd.χ∀h∈Hπ,z,α:π(h)=π∗(h) minha:ha∈Hπ,z,α×A∧(∀π′∈α:π′(h)=a)maxd′:ha⊑d′L(d′)) We can reexpress it as =(ΘW⋉Q−10)(λπzαd.χ∀h∈Hπ,z,α:π(h)=π∗(h) minha:ha∈Hπ,z,α×A∧(∀π′∈α:π′(h)=a)maxd′:ha⊑d′L(d′)) And start unpacking the semidirect product =ΘW(λπzd.χ∀h∈Hπ,z,Q−10(d):π(h)=π∗(h) minha:ha∈Hπ,z,Q−10(d)×A∧(∀π′:dQ0π′→π′(h)=a)maxd′:ha⊑d′L(d′)) Now, we're going to try to address that minimum, and show that the only ha that fulfill the conditions are exactly those ha⊑d. This requires showing that ha⊑d is a sufficient condition to fulfill the relevant properties, and then to show that ha⋢d implies a failure of one of the properties.

The proof for this is almost entirely identical to the corresponding proof in the non-turing-law case; there are no substantive differences besides one issue to clear up. We need to show that W(π) always being supported on the relation C, for all π (as one of our starting assumptions) implies that ΘW is supported on C as well. Here's how we do it. We have ΘW=⊤AH∗W. And then this ultracontribution (by the definition of Γ1-PUCK's) can be written as the convex hull of the union of a bunch of ultracontributions of the form ⊤AH∗w, where w is a Γ1-PoCK. So, if we can show all of these are supported on the relation C, then the same holds for the convex hull of their union, ie, ΘW. By Proposition 4.2, we can reexpress this ultracontribution as μw⋉⊤AH⋉ϕw, where w(π)=μw⋉(λz.ϕw(z,π)), for any policy π. Now, let's check the expectation value of the indicator function for C being violated. (μw⋉⊤AH⋉ϕw)(λzπd.χ¬dCzπ) =μw(λz.⊤AH(λπ.ϕw(z,π)(λd.χ¬dCzπ))) =μw(λz.maxπϕw(z,π)(λd.χ¬dCzπ)) Let's assume that the expectation of that indicator function is not zero. Then there must be some particular z in the support of μw where it is nonzero, and some particular π∗ that attains that nonzero expectation value. So, there's a z in the support of μw and π∗ s.t. ϕw(z,π∗)(λd.χ¬dCzπ∗)>0 and so this means that we have μw(λz.ϕw(z,π∗)(λd.χ¬dCzπ∗))>0 Because that z is assigned nonzero measure, and then this reshuffles to (μw⋉(λz.ϕw(z,π∗)))(λd.χ¬dCzπ∗)>0 Which, via Proposition 4.2, is w(π∗)(λzd.χ¬dCzπ∗)>0 But this contradicts that W(π∗) (and so all the w(π∗)) were supported on the event dCzπ∗, so we have a contradiction, and our result follows.

Now that that issue is taken care of, we can swap out our fancy minimization with just minimizing over the ha⊑d. =ΘW(λπzd.χ∀h∈Hπ,z,Q−10(d):π(h)=π∗(h)minha:ha⊑dmaxd′:ha⊑d′L(d′)) This minimization is attained by selecting d itself. So then it turns into =ΘW(λπzd.χ∀h∈Hπ,z,Q−10(d):π(h)=π∗(h)L(d)) At this point, we'll upper-and-lower-bound this quantity by W(π∗)(λzd.L(d)). Let's begin. W(π∗)(λzd.L(d)) =maxw∈Πw(π∗)(λzd.L(d)) =maxw∈Πμw(λz.ϕw(π∗,z)(λd.L(d))) Here we're using Proposition 4.2 on being able to split up Z-PoCK's (which W is) a certain way. =maxw∈Πμw(λz.maxπϕw(π∗,z)(λd.χdQ0πL(d))) This equality happens because you can always just pick π∗ as your choice of π, and since ϕw(π∗,z) can only produce destinies consistent with π∗. Now, we can swap out ϕw for the measure we're conditioning on an event =maxw∈Πμw(λz.maxπ(χQ(π∗)ϖϕw,z)(λd.χdQ0πL(d))) =maxw∈Πμw(λz.maxπϖϕw,z(λd.χdQ0π∗χdQ0πL(d))) Reshuffle the indicator function back in =maxw∈Πμw(λz.maxπ(χQ(π)ϖϕw,z)(λd.χdQ0π∗L(d))) =maxw∈Πμw(λz.maxπϕw(π,z)(λd.χdQ0π∗L(d))) =maxw∈Πμw(λz.⊤AH(λπ.ϕw(π,z)(λd.χdQ0π∗L(d)))) =maxw∈Π(μw⋉⊤AH⋉ϕw)(λπzd.χdQ0π∗L(d)))) =maxw∈Π(⊤AH∗w)(λπzd.χdQ0π∗L(d)))) =(⊤AH∗W)(λπzd.χdQ0π∗L(d)))) =ΘW(λπzd.χdQ0π∗L(d)))) Now we must show that this is a looser constraint than what was previously in our indicator function to proceed further. So our next order of business is showing that, certainly, ∀h∈Hπ,z,Q−10(d):π(h)=π∗(h)→dQ0π∗ Let d be an arbitrary destiny in the support of ΘW, and h be one of the history prefixes of d. The two conditions for h∈Hπ,z,Q−10(d) are fulfilled, because they are ∀h′,a′,π′:h′a′⊏h∧dQ0π′→π′(h′)=a′ ∃d′:h⊏d′∧d′Cπz For the first condition, if h′a⊏h, then h′a′⊏d, and so if π′ is consistent with d, it must take the same action in response to h′, the action that d commands. So it is fulfilled. For the second condition, let d′ be d. h⊏d holds, and dCπz holds certainly, because ΘW is supported on C, as we've previously shown.

So, certainly, h⊏d→h∈Hπ,z,Q−10(d). Since we assumed our forall statement as prerequisite, this means that for all h⊏d, π(h)=π∗(h). And dQ0π means ∀ha⊑d:π(h)=a. Since π∗(h) mimics π(h) for all history prefixes of d, this means ∀ha⊑d:π∗(h)=a, ie dQ0π∗.

So, since this is a looser constraint, when we were previously at =ΘW(λπzd.χdQ0π∗L(d)))) we can proceed further to ≥ΘW(λπzd.χ∀h∈Hπ,z,Q−10(d):π(h)=π∗(h)L(d)) which is the value we wanted to sandwich. Proceed further with =(⊤AH∗W)(λπzd.χ∀h∈Hπ,z,Q−10(d):π(h)=π∗(h)L(d)) =maxw∈Π(⊤AH∗w)(λπzd.χ∀h∈Hπ,z,Q−10(d):π(h)=π∗(h)L(d)) =maxw∈Π(μw⋉⊤AH⋉ϕw)(λπzd.χ∀h∈Hπ,z,Q−10(d):π(h)=π∗(h)L(d)) =maxw∈Πμw(λz.⊤AH(λπ.ϕw(π,z)(λd.χ∀h∈Hπ,z,Q−10(d):π(h)=π∗(h)L(d)))) =maxw∈Πμw(λz.maxπϕw(π,z)(λd.χ∀h∈Hπ,z,Q−10(d):π(h)=π∗(h)L(d))) ≥maxw∈Πμw(λz.ϕw(π∗,z)(λd.χ∀h∈Hπ∗,z,Q−10(d):π∗(h)=π∗(h)L(d))) =maxw∈Πμw(λz.ϕw(π∗,z)(λd.L(d))) =maxw∈Π(μw⋉(λz.ϕw(π∗,z)))(λzd.L(d))) =maxw∈Πw(π∗)(λzd.L(d)))=W(π∗)(λzd.L(d))) And we've got the same upper and lower bound, so our overall quantity is W(π∗)(~L) And we're done. ■


### Infra-Bayesian physicalism: proofs part I

December 1, 2021 - 01:26
Published on November 30, 2021 10:26 PM GMT

This post is an appendix to "Infra-Bayesian physicalism: a formal theory of naturalized induction".

This lemma will be implicitly used all over the place, in order to deal with the "presence in the bridge transform" condition algebraically, in terms of function expectations. The bulk of the conditions for a contribution to lie in the bridge transform is this endofunction condition (the support condition is trivial most of the time), so it's advantageous to reformulate it. First-off, we have that prΓ×Φ(χy∈α(s∗(θ)))∈Θ iff, for all g:Γ×Φ→[0,1], we have prΓ×Φ(χy∈α(s∗(θ)))(λyx.g(y,x))≤Θ(λyx.g(y,x)) By LF-duality for ultradistributions, proved in Less Basic Inframeasure theory. A contribution lies in an ultracontribution set iff its expectations, w.r.t. all functions, are less than or equal to the ultracontribution expectations.

Now, we just need to unpack the left-hand-side into our desired form. Start off with prΓ×Φ(χy∈α(s∗(θ)))(λyx.g(y,x)) Apply how projections work =χy∈α(s∗(θ))(λyαx.g(y,x)) Now, we can move the indicator function into the function we're taking the expectation again, because there's no difference between deleting all measure outside of an event, or taking the expectation of a function that's 0 outside of that event. So, we get =s∗(θ)(λyαx.χy∈αg(y,x)) Then we use how pushforwards are defined in probability theory. =θ(λyαx.χs(y)∈αg(s(y),x)) And that's our desired form. So, we've shown that for any particular s:Γ→Γ, we have prΓ×Φ(χy∈α(s∗(θ)))∈Θ iff, for all g:Γ×Φ→[0,1], θ(λyαx.χs(y)∈αg(s(y),x))≤Θ(λyx.g(y,x)) And so, we get our desired iff statement by going from the iff for one s to the iff for all s. ■

Proposition 2.1: For any Γ, Φ and Θ∈□c(Γ×Φ), Br(Θ) exists and satisfies prΓ×ΦBr(Θ)=Θ. In particular, if Θ∈□(Γ×Φ) then Br(Θ)∈□(elΓ×Φ).

Proof sketch: We'll show that for any particular contribution in θ∈Θ, there's a contribution θ∗ which lies within Br(Θ) that projects down to equal θ. And then, show the other direction, that any contribution in Br(Θ) lands within Θ when you project it down. Thus, the projection of Br(Θ) must be Θ exactly.

For the first direction, given some θ∈Θ, let θ∗:=θ⋉(λy.{y}). Note that since θ∈Δc(Γ×Φ), this means that θ∗∈Δc(Γ×2Γ×Φ), so the type signatures line up. Clearly, projecting θ∗ down to Γ×Φ makes θ again. So that leaves showing that θ∗∈Br(Θ). Applying Lemma 1, an equivalent way of stating the bridge transform is that it consists precisely of all the θ′∈Δc(Γ×2Γ×Φ) s.t. for all s:Γ→Γ and g:Γ×Φ→[0,1], θ′(λyαx.χs(y)∈αg(s(y),x))≤Θ(λyx.g(y,x)) and also, supp θ′⊆elΓ×Φ.

Clearly, for our given θ∗, everything works out with the support condition, so that leaves the endofunction condition. Let s,g be arbitrary. θ∗(λyαx.χs(y)∈αg(s(y),x))=(θ⋉(λy.{y}))(λyαx.χs(y)∈αg(s(y),x)) =θ(λyx.δ{y}(λα.χs(y)∈αg(s(y),x)))=θ(λyx.χs(y)∈{y}g(s(y),x)) =θ(λyx.χs(y)=yg(s(y),x))=θ(λyx.χs(y)=yg(y,x))≤θ(λyx.g(y,x)) ≤maxθ∈Θθ(λyx.g(y,x))=Θ(λyx.g(y,x)) In order, the equalities were unpacking the semidirect product, substituting the dirac-delta in, reexpressing the condition for the indicator function, using that s(y)=y inside the indicator function, applying monotonicity of θ, using that θ∈Θ, and then packing up the definition of Θ. And that inequality has been fulfilled, so, yes, θ∗ lies within Br(Θ). Since θ was arbitrary within Θ, this shows that the projection set is as-big-or-bigger than Θ.

Now to show the reverse direction, that anything in the projection set lies within Θ. For any particular θ′∈Br(Θ), remember, it must fulfill, for all g,s, that θ′(λyαx.χs(y)∈αg(s(y),x))≤Θ(λyx.g(y,x)) So, in particular, we can let s:Γ→Γ be the identity function, and g be anything, and we'd get θ′(λyαx.χy∈αg(y,x))≤Θ(λyx.g(y,x)) And, since θ′ is in Br(Θ), it's supported on the (y,α) pairs s.t. y∈α, so that indicator function drops out of existence, and we get θ′(λyαx.g(y,x))≤Θ(λyx.g(y,x)) And then this can be written as a projection prΓ×Φ(θ′)(λyx.g(y,x))≤Θ(λyx.g(y,x)) And since this holds for all the various g, this means that prΓ×Φ(θ′)∈Θ And we're done with the other direction of things: the projection must be exactly equal to Θ, since projections of contributions in Br(Θ) always land in Θ, and we can surject onto Θ. ■

Proposition 2.2: Let X be a poset, θ,η∈ΔcX. Then, θ⪯η if and only if there exists κ:X→ΔX s.t.: For all x∈X, y∈supp κ(x): x≤y. κ∗θ≤η

Vanessa proved this one, with a very nice proof involving the max-flow min-cut theorem.

Make the following directed graph: The nodes are the elements of {s}∪{t}∪X×{0,1}. Basically, two copies of the finite poset X, a source node s, and a sink node t. x will be used to denote a variable from X, while a subscript of 0 or 1 denotes the 0 or 1 version of that x, respectively.

As for the edges, there is an edge from s to every point in X0. And an edge from every point in X1 to t. And, for some x0∈X0 and y1∈X1, there's an edge iff x≤y. The capacities of the edges are, for all x∈X, cs,x0:=θ(x), and for all x∈X, cx1,t:=η(x), and for x,y∈X s.t. x≤y, cx0,y1:=∞.

To work up to applying min-cut max-flow, we must show that the cut of all the edges between s and X0 is a min-cut of the network.

Remember, all cuts are induced by partitions of the nodes into two sets, S and T, where S includes the source s and T includes the sink t. And the value of a cut is c(S,T)=∑a∈S,b∈Tca,b=∑x0∈Tcs,x0+∑x0∈S,y1∈T:x≤ycx0,y1+∑y1∈Scy1,t Implicitly converting both of the following sets to be subsets of X, we have (S∩X1)⊇(S∩X0)↑ for any non-infinite cut. Why? Well, if it was false, then there'd be some y≥x where x0∈S, and yet y1∉S. So y1∈T. Then, the cost of the cut would include cx0,y1=∞, contradiction. So, now, we have that for any non-infinite cut, c(S,T)=∑a∈S,b∈Tca,b=∑x0∈Tcs,x0+∑y1∈Scy1,t and (S∩X1)⊇(S∩X0)↑ holds.

Now, we'll change our cut. Given some cut (S,T), let our new cut (S′,T′) be defined as S′:={s}∪((S∩X0)↑×{0,1}) (which still respects that superset property listed above since both sets would be the same), and T′ be the complement. Letting some stuff cancel out, we get c(S,T)−c(S′,T′)=∑x0∈Tcs,x0+∑y1∈Scy1,t−∑x0∈T′cs,x0−∑y1∈S′cy1,t =∑x0∈T/T′cs,x0+∑y1∈S/S′cy1,t−∑x0∈T′/Tcs,x0−∑y1∈S′/Scy1,t Now, because S′∩X1=(S∩X0)↑×{1}⊆S∩X1 by that subset property, it means that there's no y1∈S′/S, so we get =∑x0∈T/T′cs,x0+∑y1∈S/S′cy1,t−∑x0∈T′/Tcs,x0 As for that other minus term, we have that S′∩X0=(S∩X0)↑×{0}⊇S∩X0 by how it was defined, so T′∩X0⊆T∩X0, so that other minus term is 0, and we get =∑x0∈T/T′cs,x0+∑y1∈S/S′cy1,t≥0 Rearranging this a bit, we have c(S,T)≥c(S′,T′).

And now we'll show that c(S′,T′) is bounded below by c({s},(X×{0,1})∪{t}). Let's go. We have c(S′,T′)−c({s},(X×{0,1})∪{t}) =∑x0∈T′cs,x0+∑y1∈S′cy1,t−∑x0∈(X×{0,1})∪{t}cs,x0−∑y1∈{s}cy1,t This simplifies a bit as =∑x0∈T′cs,x0+∑y1∈S′cy1,t−∑x0∈X0cs,x0 This can be rewritten a bit as =(∑x0∈X0cs,x0−∑x0∈S′cs,x0)+∑y1∈S′cy1,t−∑x0∈X0cs,x0 =∑y1∈S′cy1,t−∑x0∈S′cs,x0 And then, using what S′ is defined as, we can get =∑y∈(S∩X0)↑cy1,t−∑x∈(S∩X0)↑cs,x0 and using what the costs are, it's =∑y∈(S∩X0)↑η(y)−∑x∈(S∩X0)↑θ(x) =η((S∩X0)↑)−θ((S∩X0)↑)≥0 This holds because θ⪯η and the indicator function for (S∩X0)↑ is monotone. And so, we get c(S′,T′)≥c({s},(X×{0,1})∪{t}). And we previously showed that c(S,T)≥c(S′,T′). So, this means that the cut around s is a minimal cut: its value is a lower bound on the value of every other cut.

By the max-flow min-cut theorem, there's a way of having stuff flow from s to t that saturates the capacities of all the edges that are cut. fx0,y1 will be used to denote the flow from x0 to y1 according to this max-flow way. Let's finish things up.

Define κ:X→ΔX as follows. For some x, if fs,x0>0, then κ(x)(y):=fx0,y1/fs,x0 If fs,x0=0, let κ(x) be any probability distribution on x↑. Note that κ(x) is always a probability distribution supported on x↑, by fiat if fs,x0=0, and otherwise, ∑y∈Xκ(x)(y)=∑y∈Xfx0,y1/fs,x0=fs,x0/fs,x0=1 This is because the flow out of x0 must equal the flow into x0 from the source. And κ(x) is supported on x↑ because the only edges out of x0 go to y1 where y≥x. Now that we've got this demonstrated, we'll show our desired inequality. Fix an arbitrary y. We have κ∗(θ)(y)=∑x∈Xθ(x)⋅κ(x)(y)=∑x∈Xcs,x0⋅fx0,y1/fs,x0 And then we use that all the capacities of the edges cut in the min-cut are saturated according to the max-flow plan, so cs,x0=fs,x0, and we have =∑x∈Xfs,x0⋅fx0,y1/fs,x0 Now, if fs,x0=0, then because the flow in equals the flow out, that means that fx0,y1=0, and otherwise we can cancel out, so we can rewrite as =∑x∈Xfx0,y1=fy1,t≤cy1,t=η(y) Where the first equality came from flow in equals flow out, the inequality came from the flow plan respecting the capacity of the paths, and the last equality came from how the capacities were defined. So, for all y∈X, we have κ∗(θ)(y)≤η(y) so we have κ∗(θ)≤η.

For the reverse direction, if there's some κ:X→ΔX s.t κ(x) is supported on x↑ s.t. κ∗(θ)≤η, then for any monotone function f, we have η(f)≥κ∗(θ)(f)=θ(λx.κ(x)(f)) And then we use that, since κ(x) is supported on x↑, and f is monotone, f(x) is less than or equal to the expectation of f w.r.t κ(x) (remember, for κ(x) you have a 100 percent chance of drawing something at-or-above x, which guarantees that f of whatever you picked is above f(x)). And so, we get ≥θ(f) And since this holds for every monotone f, we have θ⪯η.■
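The construction in this proof can be tried out numerically. The following is a minimal sketch, not part of the original post: it assumes a small hand-picked poset, uses a plain Edmonds-Karp max-flow written for this example, and replaces the infinite capacities by a finite bound larger than the total mass of θ (which is enough, since no more than that can ever flow through a middle edge). The names `max_flow`, `coupling_kernel` and `pushforward` are illustrative, not taken from the source.

```python
from collections import deque
from fractions import Fraction as F


def max_flow(cap, s, t):
    """Edmonds-Karp on a dict-of-dicts capacity graph; returns the flow per edge."""
    nodes = set(cap) | {v for u in cap for v in cap[u]}
    res = {u: {} for u in nodes}  # residual capacities
    for u in cap:
        for v, c in cap[u].items():
            res[u][v] = c
            res[v].setdefault(u, 0)
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, r in res[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break  # no augmenting path left
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
    return {u: {v: cap[u][v] - res[u][v] for v in cap[u]} for u in cap}


def coupling_kernel(X, leq, theta, eta):
    """Build kappa from a max flow on the network s -> X0 -> X1 -> t.

    leq is the set of pairs (x, y) with x <= y (including reflexive pairs).
    The result is only guaranteed to satisfy kappa_*(theta) <= eta when
    theta is below eta in the stochastic order, as in the proposition.
    """
    big = sum(theta.values()) + 1  # finite stand-in for the infinite capacity
    cap = {"s": {(x, 0): theta.get(x, F(0)) for x in X}}
    for x in X:
        cap[(x, 0)] = {(y, 1): big for y in X if (x, y) in leq}
        cap[(x, 1)] = {"t": eta.get(x, F(0))}
    flow = max_flow(cap, "s", "t")
    kappa = {}
    for x in X:
        out = flow["s"][(x, 0)]
        if out > 0:
            kappa[x] = {y[0]: f / out for y, f in flow[(x, 0)].items() if f > 0}
        else:
            kappa[x] = {x: F(1)}  # arbitrary choice supported on the up-set of x
    return kappa


def pushforward(kappa, theta):
    """Compute kappa_*(theta) as a dict of masses."""
    out = {}
    for x, mass in theta.items():
        for y, p in kappa[x].items():
            out[y] = out.get(y, 0) + mass * p
    return out


# Example poset: a <= b and a <= c, with b and c incomparable.
X = ["a", "b", "c"]
leq = {("a", "a"), ("b", "b"), ("c", "c"), ("a", "b"), ("a", "c")}
theta = {"a": F(1, 2), "b": F(1, 4)}
eta = {"b": F(1, 2), "c": F(1, 4)}
kappa = coupling_kernel(X, leq, theta, eta)
pushed = pushforward(kappa, theta)
```

In this example the mass θ puts on `a` is split evenly between `b` and `c`, and the pushforward is pointwise below η, matching the inequality κ∗(θ)≤η.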

Proposition 2.3: Let X be a poset, θ,η∈ΔcX. Then, θ⪯η if and only if there exists κ:X→ΔX s.t. For all x∈X, y∈supp κ(x): x≥y. θ≤κ∗η

Proof: Because θ⪯η, we can get a maximum flow in exactly the same way as Proposition 2.2. Then, just flip the direction of all flows, which will be denoted by swapping the order of the subscripts. Now, define κ(y)(x):=fy1,x0/ft,y1 And, again, if ft,y1 is 0, it should be an arbitrary probability distribution supported on y↓. Note that κ(y) is always a probability distribution supported on y↓, by fiat if ft,y1=0, and otherwise, ∑x∈Xκ(y)(x)=∑x∈Xfy1,x0/ft,y1=ft,y1/ft,y1=1 This is because the flow out of y1 must equal the flow into y1 from the source t (the old sink). And κ(y) is supported on y↓ because the only edges out of y1 go to x0 where y≥x. Now that we've got this demonstrated, we'll show our desired inequality. Fix an arbitrary x. We have κ∗(η)(x)=∑y∈Xη(y)⋅κ(y)(x)=∑y∈Xcy1,t⋅fy1,x0/ft,y1 Then use that cy1,t≥ft,y1 (because even with the flow reversed, the flow through a path must be less than or the same as the capacity). Accordingly, we get ≥∑y∈Xfy1,x0=fx0,s=cs,x0=θ(x) And we're done, inequality established, using definitions and the flow saturating all the paths out of s.

So, for all x∈X, we have κ∗(η)(x)≥θ(x) so we have κ∗(η)≥θ.

For the reverse direction, if there's some κ:X→ΔX s.t κ(y) is supported on y↓ s.t. κ∗(η)≥θ, then for any monotone function f, we have θ(f)≤κ∗(η)(f)=η(λx.κ(x)(f)) And then we use that, since κ(x) is supported on x↓, and f is monotone, f(x) is greater than or equal to the expectation of f w.r.t κ(x) (remember, for κ(x) you have a 100 percent chance of drawing something at-or-below x, which guarantees that f of whatever you picked is below f(x)). And so, we get ≤η(f) And since this holds for every monotone f, we have θ⪯η. ■

Proposition 2.4: For any Γ, Φ and Θ∈□c(Γ×Φ),Br(Θ) is downwards closed w.r.t. the induced order on Δc(elΓ×Φ). That is, if θ∈Br(Θ) and η⪯θ then η∈Br(Θ).

Remember, the condition for some θ to lie in Br(Θ) is that it be supported on elΓ, and that, for all s:Γ→Γ, and g:Γ×Φ→[0,1], θ(λyαx.χs(y)∈αg(s(y),x))≤Θ(λyx.g(y,x)) So, given that η⪯θ (lower score for all monotone functions) we'll show that η fulfills both conditions. The support thing is taken care of by η∈Δc(elΓ×Φ). As for the other one, we have η(λyαx.χs(y)∈αg(s(y),x))≤θ(λyαx.χs(y)∈αg(s(y),x)) ≤Θ(λyx.g(y,x)) The first inequality occurs because as you make α larger, ie, go up in the partial ordering, the value assigned to the relevant point increases since s(y)∈α is more likely now, so the function value increases from 0 to something that may be greater than 0. So, it's a monotone function, and we then use that η⪯θ. The second inequality is the endofunction condition for θ. ■

Proposition 2.5: Consider a finite set X, ϕ?∈ΔX, ϕ!:{0,1}→ΔX, p∈[0,1] and ϕ:=(1−p)ϕ?+pϕ!. Then, p≥dTV(ϕ(0),ϕ(1)). Conversely, consider any ϕ:{0,1}→ΔX. Then, there exist ϕ?∈ΔX and ϕ!:{0,1}→ΔX s.t. ϕ=(1−p)ϕ?+pϕ! for p:=dTV(ϕ(0),ϕ(1)).

Ok, so let f be some arbitrary function X→[0,1]. We have an alternate characterization of the total variation distance as dTV(ϕ(0),ϕ(1)):=supf∈X→[0,1]|ϕ(0)(f)−ϕ(1)(f)| And then from there we can go =supf∈X→[0,1]|((1−p)ϕ?(f)+pϕ!(0)(f))−((1−p)ϕ?(f)+pϕ!(1)(f))| =supf∈X→[0,1]|pϕ!(0)(f)−pϕ!(1)(f)| =psupf∈X→[0,1]|ϕ!(0)(f)−ϕ!(1)(f)| =pdTV(ϕ!(0),ϕ!(1)) And since p≥pdTV(ϕ!(0),ϕ!(1)), we have p≥dTV(ϕ(0),ϕ(1)) and are done.

Conversely, let the value of p be 1−(ϕ(0)∧ϕ(1))(1), and ϕ? be (1/(1−p))(ϕ(0)∧ϕ(1)). It is clear that this is a probability distribution because of how p was defined. The ∧ is the minimum/common overlap of the two probability distributions. Then, let ϕ!(0)=(1/p)(ϕ(0)−(ϕ(0)∧ϕ(1))), and similar for the 1. Well, as long as p>0. If p=0, it can be any probability distribution you want. It's a probability distribution because ϕ!(0)(1)=(1/p)(ϕ(0)(1)−(ϕ(0)∧ϕ(1))(1))=(1/p)(1−(ϕ(0)∧ϕ(1))(1)) =(1−(ϕ(0)∧ϕ(1))(1))/(1−(ϕ(0)∧ϕ(1))(1))=1 And since p>0, everything works out. Now we just need to show that these add up to make ϕ and that p is the same as the total variation distance. ϕ(0)=(ϕ(0)∧ϕ(1))+ϕ(0)−(ϕ(0)∧ϕ(1)) =(1−p)(1/(1−p))(ϕ(0)∧ϕ(1))+p(1/p)(ϕ(0)−(ϕ(0)∧ϕ(1)))=(1−p)ϕ?+pϕ!(0) And this works symmetrically for ϕ(1), showing that we indeed have equality. As for showing that p is the total variation distance, referring back to what we've already proved, we have dTV(ϕ(0),ϕ(1))=pdTV(ϕ!(0),ϕ!(1)) And now, since ϕ!(0) is proportional to ϕ(0)−(ϕ(0)∧ϕ(1)) and ϕ!(1) is proportional to ϕ(1)−(ϕ(0)∧ϕ(1)), the supports of these two probability distributions are disjoint, which implies that the total variation distance is 1, so we have dTV(ϕ(0),ϕ(1))=p.
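As a sanity check on the converse direction, here is a small sketch (not from the original post) that builds ϕ? and ϕ! from two finite distributions using the overlap ϕ(0)∧ϕ(1), with exact rational arithmetic. The function name `tv_decomposition` and the choices made in the degenerate cases p=0 and p=1 are assumptions of this example; the proof only says that the corresponding distribution may be arbitrary there.

```python
from fractions import Fraction as F


def tv_decomposition(phi0, phi1):
    """Return (p, phi_q, phi_bang0, phi_bang1) with phi_i = (1-p)*phi_q + p*phi_bang_i."""
    keys = sorted(set(phi0) | set(phi1))
    # Pointwise minimum: the common overlap of the two distributions.
    overlap = {x: min(phi0.get(x, F(0)), phi1.get(x, F(0))) for x in keys}
    p = 1 - sum(overlap.values())
    if p == 1:
        phi_q = dict(phi0)  # weight 1-p is zero, so any distribution works
    else:
        phi_q = {x: overlap[x] / (1 - p) for x in keys}
    if p == 0:
        bang0, bang1 = dict(phi0), dict(phi0)  # weight p is zero, arbitrary
    else:
        bang0 = {x: (phi0.get(x, F(0)) - overlap[x]) / p for x in keys}
        bang1 = {x: (phi1.get(x, F(0)) - overlap[x]) / p for x in keys}
    return p, phi_q, bang0, bang1


def mix(p, phi_q, bang):
    """Recombine (1-p)*phi_q + p*bang, dropping zero entries."""
    keys = set(phi_q) | set(bang)
    out = {x: (1 - p) * phi_q.get(x, F(0)) + p * bang.get(x, F(0)) for x in keys}
    return {x: m for x, m in out.items() if m != 0}


phi0 = {"x": F(1, 2), "y": F(1, 2)}
phi1 = {"y": F(1, 4), "z": F(3, 4)}
p, phi_q, bang0, bang1 = tv_decomposition(phi0, phi1)
tv = F(1, 2) * sum(abs(phi0.get(k, F(0)) - phi1.get(k, F(0))) for k in "xyz")
```

Here `p` comes out equal to the total variation distance `tv`, and mixing `phi_q` with `bang0` or `bang1` recovers `phi0` or `phi1` respectively.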

Proposition 2.6: Consider any Φ and ϕ:{0,1}→ΔΦ. Denote U:={0,1}×{{0,1}}×Φ (the event "program is unrealized"). Let Λ:=Br(⊤{0,1}⋉ϕ). Then, Λ(χU)=1−dTV(ϕ(0),ϕ(1))

This will take some work to establish. One direction, showing that the bridge transform value is at least 1−dTV(ϕ(0),ϕ(1)), is fairly easy. As for the other direction, it'll take some work. Let's begin. Λ(χU)=Br(⊤{0,1}⋉ϕ)(χU)=Br(⊤{0,1}⋉ϕ)(λyαx.χα={0,1}) =maxθ∈Br(⊤{0,1}⋉ϕ)θ(λyαx.χα={0,1}) We'll make a particular contribution θ∗. It is defined as δ0×(ϕ(0)−ϕ(0)∧ϕ(1))×δ{0}+δ0×(ϕ(0)∧ϕ(1))×δ{0,1} Or, put another way, restricting to 0,{0}, it's ϕ(0)−ϕ(0)∧ϕ(1), and restricting to 0,{0,1}, it's ϕ(0)∧ϕ(1). So, for a given s:{0,1}→{0,1}, we have θ∗(λyαx.χs(y)∈αg(s(y),x)) =(δ0×(ϕ(0)−ϕ(0)∧ϕ(1))×δ{0}+δ0×(ϕ(0)∧ϕ(1))×δ{0,1})(λyαx.χs(y)∈αg(s(y),x)) =(δ0×(ϕ(0)−ϕ(0)∧ϕ(1))×δ{0})(λyαx.χs(y)∈αg(s(y),x)) +(δ0×(ϕ(0)∧ϕ(1))×δ{0,1})(λyαx.χs(y)∈αg(s(y),x)) =(ϕ(0)−ϕ(0)∧ϕ(1))(λx.χs(0)∈{0}g(s(0),x))+(ϕ(0)∧ϕ(1))(λx.χs(0)∈{0,1}g(s(0),x)) Now, we'll go through two exhaustive possibilities for what s could be. If it maps 0 to 0, then it's =(ϕ(0)−ϕ(0)∧ϕ(1))(λx.χ0∈{0}g(0,x))+(ϕ(0)∧ϕ(1))(λx.χ0∈{0,1}g(0,x)) =(ϕ(0)−ϕ(0)∧ϕ(1))(λx.g(0,x))+(ϕ(0)∧ϕ(1))(λx.g(0,x)) =ϕ(0)(λx.g(0,x))≤maxy∈{0,1}ϕ(y)(λx.g(y,x)) =⊤{0,1}(λy.ϕ(y)(λx.g(y,x)))=(⊤{0,1}⋉ϕ)(λyx.g(y,x)) And our desired inequality is established. If s maps 0 to 1, then it's =(ϕ(0)−ϕ(0)∧ϕ(1))(λx.χ1∈{0}g(1,x))+(ϕ(0)∧ϕ(1))(λx.χ1∈{0,1}g(1,x)) =(ϕ(0)∧ϕ(1))(λx.g(1,x))≤ϕ(1)(λx.g(1,x)) ≤maxy∈{0,1}ϕ(y)(λx.g(y,x))=(⊤{0,1}⋉ϕ)(λyx.g(y,x)) And that is taken care of. So, our θ∗ lies in Br(⊤{0,1}⋉ϕ).

Now, where were we? Ah right, we were at Λ(χU)=maxθ∈Br(⊤{0,1}⋉ϕ)θ(λyαx.χα={0,1}) But now we can continue with ≥θ∗(λyαx.χα={0,1}) Which unpacks as =(δ0×(ϕ(0)−ϕ(0)∧ϕ(1))×δ{0}+δ0×(ϕ(0)∧ϕ(1))×δ{0,1})(λyαx.χα={0,1}) =(δ0×(ϕ(0)−ϕ(0)∧ϕ(1))×δ{0})(λyαx.χα={0,1}) +(δ0×(ϕ(0)∧ϕ(1))×δ{0,1})(λyαx.χα={0,1}) =(ϕ(0)−ϕ(0)∧ϕ(1))(λx.χ{0}={0,1})+(ϕ(0)∧ϕ(1))(λx.χ{0,1}={0,1}) =(ϕ(0)∧ϕ(1))(1) And then, the value of the overlap between two distributions is 1−dTV(ϕ(0),ϕ(1)). So, we've shown one direction, we have Λ(χU)≥1−dTV(ϕ(0),ϕ(1)) In the other direction of the equality we're trying to prove, we need to constrain what θ might be. Starting off with what we definitely know, we have Λ(χU)=maxθ∈Br(⊤{0,1}⋉ϕ)θ(λyαx.χα={0,1}) Now it's time to show that for any θ∈Br(⊤{0,1}⋉ϕ), that the measure on α={0,1} can't be too high. Specifically, to proceed further, we'll need to show that ∀x′∈Φ:θ(λyαx.χx=x′∧α={0,1})≤(ϕ(0)∧ϕ(1))(λx.χx=x′) Time to establish it. On x′, either ϕ(0)(x′)≥ϕ(1)(x′), or vice-versa. Without loss of generality, assume that it's ϕ(0)(x′) that's lower.

Then, let s be the constant-0 function, and g(0,x′) be 1, and 0 everywhere else. Then, we have θ(λyαx.χx=x′∧α={0,1})≤θ(λyαx.χs(y)∈αg(s(y),x)) ≤(⊤{0,1}⋉ϕ)(λyx.g(y,x))=(⊤{0,1}⋉ϕ)(λyx.χy=0∧x=x′) =maxy∈{0,1}ϕ(y)(λx.χy=0∧x=x′)=ϕ(0)(λx.χx=x′)=ϕ(0)(x′) =(ϕ(0)∧ϕ(1))(x′) Just abbreviating, using that θ lies in the bridge transform, deabbreviating with what g is, unpacking things a little bit and canceling out. At the end we used that ϕ(0)(x′)≤ϕ(1)(x′).

Ok, so, now that we've established the key fact that ∀x′∈Φ:θ(λyαx.χx=x′∧α={0,1})≤(ϕ(0)∧ϕ(1))(λx.χx=x′) We can resume our work. We were previously at =maxθ∈Br(⊤{0,1}⋉ϕ)θ(λyαx.χα={0,1}) and we can proceed to =maxθ∈Br(⊤{0,1}⋉ϕ)∑x′∈Φθ(λyαx.χx=x′∧α={0,1}) and from there, using the inequality we proved, proceed to ≤∑x′∈Φ(ϕ(0)∧ϕ(1))(λx.χx=x′)=(ϕ(0)∧ϕ(1))(1)=1−dTV(ϕ(0),ϕ(1)) And we're done: we established that it's an upper bound as well as a lower bound, so we have equality. ■
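The lower-bound half of this argument can be spot-checked on a concrete example. The sketch below is an illustration only, not part of the original post: it builds the contribution θ∗ from the proof for a hand-picked pair of distributions, tests the endofunction inequality against all four maps s:{0,1}→{0,1} and a batch of randomly sampled g (a spot check, not a proof), and computes the mass θ∗ puts on α={0,1}. All function names are hypothetical.

```python
import itertools
import random
from fractions import Fraction as F


def overlap(phi):
    """Pointwise minimum of phi[0] and phi[1]."""
    xs = set(phi[0]) | set(phi[1])
    return {x: min(phi[0].get(x, F(0)), phi[1].get(x, F(0))) for x in xs}


def theta_star(phi):
    """The contribution from the proof, as a dict over (y, alpha, x) triples."""
    theta = {}
    for x, lo in overlap(phi).items():
        theta[(0, frozenset({0}), x)] = phi[0].get(x, F(0)) - lo
        theta[(0, frozenset({0, 1}), x)] = lo
    return theta


def lhs(theta, s, g):
    """theta applied to: indicator(s(y) in alpha) * g(s(y), x)."""
    return sum(m * g[(s[y], x)] for (y, alpha, x), m in theta.items() if s[y] in alpha)


def rhs(phi, g):
    """(top x| phi)(g) = max over y of the phi(y)-expectation of g(y, .)."""
    return max(sum(m * g[(y, x)] for x, m in phi[y].items()) for y in (0, 1))


def endofunction_check(phi, trials=200, seed=0):
    """Spot-check the endofunction condition on randomly sampled g."""
    rng = random.Random(seed)
    theta = theta_star(phi)
    xs = sorted(set(phi[0]) | set(phi[1]))
    for s in itertools.product((0, 1), repeat=2):  # s[y] is the image of y
        for _ in range(trials):
            g = {(y, x): F(rng.randint(0, 10), 10) for y in (0, 1) for x in xs}
            if lhs(theta, s, g) > rhs(phi, g):
                return False
    return True


def unrealized_mass(theta):
    """Mass assigned to the event alpha = {0, 1}."""
    return sum(m for (y, alpha, x), m in theta.items() if alpha == frozenset({0, 1}))


phi = {0: {"x": F(1, 2), "y": F(1, 2)}, 1: {"y": F(1, 4), "z": F(3, 4)}}
mass = unrealized_mass(theta_star(phi))
tv = F(1, 2) * sum(abs(phi[0].get(k, F(0)) - phi[1].get(k, F(0))) for k in "xyz")
```

For this choice of ϕ the mass on α={0,1} equals 1 minus the total variation distance, which is the value the proposition assigns to Λ(χU).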

Lemma 2: If there's a function h:Φ→2Γ and Θ(λyx.χy∉h(x))=0, then for all θ∈Br(Θ), θ is supported on the set {α,x|α⊆h(x)}.

Assume that the conclusion is false, that there is some nonzero probability of drawing an x∗,α∗ pair where α∗⊈h(x∗). In particular, there must be some special y∗∈Γ value that witnesses that α∗ isn't a subset, by y∗∈α∗ and y∗∉h(x∗). Remember that for all s:Γ→Γ and g:Γ×Φ→[0,1], since θ∈Br(Θ), we have θ(λyαx.χs(y)∈αg(s(y),x))≤Θ(λyx.g(y,x)) Now, in particular, let g:=χy∉h(x), and s be the constant function that maps everything to y∗. Then, this turns into 0<θ(λyαx.χy∗∈αχy∗∉h(x))≤Θ(λyx.χy∉h(x))=0 which is impossible. The equality at the end was our starting assumption. The middle inequality was just specializing our inequality to a particular pair of functions. And it's greater than 0 because there's a nonzero probability of drawing x∗,α∗. And y∗ was selected to lie in α∗ and outside of h(x∗), so there's a nonzero probability of getting a nonzero value. Therefore, the result follows. ■

Lemma 3: For any s:Γ→Γ, and Θ∈□c(Γ×Φ), if θ∈Br(Θ), then χelΓ(s∗(θ))∈Br(Θ).

So, the support condition is trivially fulfilled, because we're restricting to the event y∈α. That just leaves the endofunction condition. Let s′:Γ→Γ be arbitrary, and g:Γ×Φ→[0,1] be arbitrary. Then we have χelΓ(s∗(θ))(λyαx.χs′(y)∈αg(s′(y),x))=s∗(θ)(λyαx.χy∈α∧s′(y)∈αg(s′(y),x)) ≤s∗(θ)(λyαx.χs′(y)∈αg(s′(y),x))=θ(λyαx.χs′(s(y))∈αg(s′(s(y)),x)) And, since s′∘s:Γ→Γ, we can use that θ∈Br(Θ) to get ≤Θ(λyx.g(y,x)) And we're done, we've established that our modified version of θ remains in Br(Θ). ■

Proposition 2.7: If X is a poset and Θ∈□cX, then Θ↓ will denote downward closure of Θ. For any Γ, Φ and Θ∈□c(Γ×Φ) if (y,α,x)∈supp Br(Θ) and y′∈α, then (y′,x)∈supp Θ. Moreover, define susΘ:Φ→2Γ by susΘ(x):={y∈Γ∣(y,x)∈supp Θ}. Then, Br(Θ)⊆(Θ⋉susΘ)↓ (we slightly abuse notation by treating susΘ as a mapping Γ×Φ→2Γ that doesn't depend on the first argument, and also playing loose with the order of factors in the set on which our HUCs live).

This proof splits into two parts. Part 1 will be proving the support statement. Part 2 will be proving the thing with susΘ.

Part 1: Here's the basics of how this part will work. There's some (y∗,α∗,x∗) tuple in the support of Br(Θ). Since the support of Br(Θ) is the union of the supports for the θ∈Br(Θ), there's some θ that assigns nonzero probability to that event. Also, y′∈α∗. The key part of the proof is showing there's some θ∗∈Br(Θ) that assigns nonzero measure to the event (y′,x∗,α∗). Once we've got that, since we know by Proposition 2.1 that Br(Θ) projects down to equal Θ, that means that the projection of θ∗ will land in Θ, and it will assign nonzero measure to the event (y′,x∗), so the event (y′,x∗) lies in the support of Θ.

So, that just leaves appropriately defining our θ∗ and showing that it lies in Br(Θ) and assigns nonzero probability to our event of interest. Our particular choice of θ∗ will be as follows. Let cy′:Γ→Γ be the constant function mapping everything to y′. θ∗:=χelΓ((cy′)∗(θ)) Applying Lemma 3, this is in Br(Θ), so that just leaves showing that it assigns nonzero probability to the event y′,x∗,α∗. We have χelΓ((cy′)∗(θ))(λyαx.χy=y′∧x=x∗∧α=α∗) =(cy′)∗(θ)(λyαx.χy=y′∧x=x∗∧α=α∗∧y∈α) We can simplify this a bit, because if the first three properties hold, that tells you something about what α is. =(cy′)∗(θ)(λyαx.χy=y′∧x=x∗∧α=α∗∧y∈α∗) unpack the pushforward =θ(λyαx.χcy′(y)=y′∧x=x∗∧α=α∗∧cy′(y)∈α∗) Then we use that it's a constant function =θ(λyαx.χy′=y′∧x=x∗∧α=α∗∧y′∈α∗) Now, since y′∈α∗ by the problem setup, and y′=y′ is a tautology, we can remove those two events. =θ(λyαx.χx=x∗∧α=α∗) and use an inequality ≥θ(λyαx.χy=y∗∧x=x∗∧α=α∗)>0 Because θ assigned nonzero probability to that event. So, we know that our θ∗ assigns nonzero measure to the event of interest. And that's it! It fulfills the appropriate properties to carry the proof through.

Proof part 2: First off, we can reexpress Br(Θ)⊆(Θ⋉susΘ)↓ as the equivalent statement that, for all f:elΓ×Φ→[0,1], we have Br(Θ)(λyαx.f(y,x,α))≤(Θ⋉susΘ)↓(λyαx.f(y,x,α)) Although, since Br(Θ) and (Θ⋉susΘ)↓ are downwards-closed, we actually only need to demonstrate this inequality for monotone functions f.

The reason we only need to demonstrate this inequality for monotone f is because of Br(Θ)(f)=Br(Θ)(fmax)≤(Θ⋉susΘ)↓(fmax)=(Θ⋉susΘ)↓(f) Where the equalities follow from Proposition 3.1 and the downward closure of the two ultracontributions, and the inequality is what we're trying to prove (since fmax in the sense of Proposition 3.1 is always a monotone function).

So, let f be monotone, and we're trying to prove the desired inequality. We'll unpack it bit by bit. Br(Θ)(λyαx.f(y,x,α))=maxθ∈Br(Θ)θ(λyαx.f(y,x,α)) And now we remember that the set {x,y|y∈susΘ(x)} is a support for Θ, because it's the same as {x,y|(y,x)∈supp Θ}, ie, the support of Θ. So, by Lemma 2, we can conclude that any θ∈Br(Θ) is supported on α,x pairs s.t. α⊆susΘ(x). In particular, this means that α≤susΘ(x), and since f is monotone, swapping out α for susΘ(x) always produces an increase in expected value, so we get ≤maxθ∈Br(Θ)θ(λyαx.f(y,x,susΘ(x))) and then, since all θ∈Br(Θ) are supported on y,α s.t. y∈α, we can go =maxθ∈Br(Θ)θ(λyαx.χy∈αf(y,x,susΘ(x))) And then, since all the θ∈Br(Θ) fulfill the endofunction property, we can let s be identity and g be f, and go ≤Θ(λyx.f(y,x,susΘ(x))) And rewrite that as =Θ(λyx.δsusΘ(x)(λα.f(y,x,α))) =(Θ⋉susΘ)(λyαx.f(y,x,α)) and then since f is monotone =(Θ⋉susΘ)↓(λyαx.f(y,x,α)) This holds because all contributions added when you take the downward closure can only produce lower expectation values than the existing contributions due to monotonicity of f, and so they're ignored.

And now we're done! We got the inequality going appropriately to hit our proof target. ■

Proposition 2.8: For any Γ, Φ and Θ∈□c(Γ×Φ) Br(prΓ×2ΓBr(Θ))=[(idΓ×diag2Γ)∗prΓ×2ΓBr(Θ)]↓

For notation, we'll use β for the set which is being treated as part of the background environment, and α as the usual set.

The way to establish this is to show equal expectation values for all functions monotone in the last argument (the relevant set-based one), which is all we need to do as both sets are downwards-closed. We'll do this by establishing the two inequalities separately. Our first order of business is showing that for all f:elΓ×2Γ monotone in the last coordinate, we have Br(prΓ×2ΓBr(Θ))(λyαβ.f(y,β,α))≥[(idΓ×diag2Γ)∗prΓ×2ΓBr(Θ)]↓(λyαβ.f(y,β,α)) Let's begin. Pick an f that's monotone in the last argument. [(idΓ×diag2Γ)∗prΓ×2ΓBr(Θ)]↓(λyαβ.f(y,β,α)) =[(idΓ×diag2Γ)∗prΓ×2ΓBr(Θ)](λyαβ.f(y,β,α)) =prΓ×2ΓBr(Θ)(λyβ.f(y,β,β))=Br(Θ)(λyβx.f(y,β,β)) =maxθ′∈Br(Θ)θ′(λyβx.f(y,β,β))=maxθ′∈Br(Θ)θ′(λyβx.δβ(λα.f(y,β,α))) =maxθ′∈Br(Θ)prΓ×2Γ(θ′)(λyβ.δβ(λα.f(y,β,α))) =maxθ′∈Br(Θ)(prΓ×2Γ(θ′)⋉id2Γ)(λyβα.f(y,β,α)) We'll set this to the side for a moment to show that if θ′∈Br(Θ), then prΓ×2Γ(θ′)⋉id2Γ∈Br(prΓ×2ΓBr(Θ)).

The support part is easy. Since θ′∈Br(Θ), it's supported on (y,β) pairs where y∈β. Taking semidirect product with identity means that y∈α always happens, because β=α always happens. So, that leaves showing the endofunction condition. We have (prΓ×2Γ(θ′)⋉id2Γ)(λyαβ.χs(y)∈αg(s(y),β)) =prΓ×2Γ(θ′)(λyβ.δβ(λα.χs(y)∈αg(s(y),β)) =prΓ×2Γ(θ′)(λyβ.χs(y)∈βg(s(y),β))=θ′(λyβx.χs(y)∈βg(s(y),β)) And start packing things up a bit =s∗(θ′)(λyβx.χy∈βg(y,β)) =χelΓ(s∗(θ′))(λyβx.g(y,β)) And, since θ′∈Br(Θ), this update of the pushforward of θ′ lands in Br(Θ) by Lemma 3, so we get ≤maxθ′′∈Br(Θ)θ′′(λyβx.g(y,β)) =Br(Θ)(λyβx.g(y,β)) =prΓ×2ΓBr(Θ)(λyβ.g(y,β)) Now, since we've established that inequality for all choices of s,g, we have that prΓ×2Γ(θ′)⋉id2Γ∈Br(prΓ×2ΓBr(Θ)) Resuming where we last left off, the last place we were at in our chain of inequalities was =maxθ′∈Br(Θ)(prΓ×2Γ(θ′)⋉id2Γ)(λyαβ.f(y,β,α)) Since these things are always in Br(prΓ×2ΓBr(Θ)), we can go ≤maxθ′′∈Br(prΓ×2ΓBr(Θ))θ′′(λyαβ.f(y,β,α)) =Br(prΓ×2ΓBr(Θ))(λyαβ.f(y,β,α)) And we're done with that inequality direction.

Now to show the reverse direction, which actually will use that f is monotone in the last argument. We start with Br(prΓ×2ΓBr(Θ))(λyαβ.f(y,β,α)) =maxθ′′∈Br(prΓ×2ΓBr(Θ))θ′′(λyαβ.f(y,β,α)) And now, we can use that prΓ×2ΓBr(Θ) is supported on (y,β) pairs where y∈id(β), along with Lemma 2 applied to id, to conclude that all the θ′′ must be supported on the event α⊆id(β)=β. Since big sets go more towards the top and get a higher loss, and f is monotone in the last argument, we get ≤maxθ′′∈Br(prΓ×2ΓBr(Θ))θ′′(λyαβ.f(y,β,β)) Now, we can use the endofunction property of all the θ′′ w.r.t. prΓ×2ΓBr(Θ) to get a uniform upper bound of ≤prΓ×2ΓBr(Θ)(λyβ.f(y,β,β)) and then go =[(idΓ×diag2Γ)∗prΓ×2ΓBr(Θ)](λyαβ.f(y,β,α)) =[(idΓ×diag2Γ)∗prΓ×2ΓBr(Θ)]↓(λyαβ.f(y,β,α)) and we're done, we got the inequality going in the other direction, so we have equality for arbitrary monotone functions, and thus equality. ■

Proposition 2.9: For any Γ, Φ and Θ1,Θ2∈□c(Γ×Φ), if Θ1⊆Θ2 then Br(Θ1)⊆Br(Θ2).

Proof: This is really easy to show, we just need to take some contribution θ′∈Br(Θ1) and show that it lies in Br(Θ2). The support condition is easily fulfilled, so that leaves showing the endofunction condition. Let's begin. θ′(λyαx.χs(y)∈αg(s(y),x))≤Θ1(λyx.g(y,x))=maxθ∈Θ1θ(λyx.g(y,x)) And now, since Θ1⊆Θ2, we have ≤maxθ∈Θ2θ(λyx.g(y,x))=Θ2(λyx.g(y,x)) Done. ■

Proposition 2.10: For any Γ, Φ1, Φ2, t:Φ2→Φ1 and Θ∈□c(Γ×Φ2) (idelΓ×t)∗Br(Θ)⊆Br((idΓ×t)∗Θ)

As usual, we just need to establish that for all f:elΓ×Φ1→[0,1], the expectation value under the first ultracontribution is at most the expectation value under the second, so our proof goal will be (idelΓ×t)∗Br(Θ)(λyαx1.f(y,x1,α))≤Br((idΓ×t)∗Θ)(λyαx1.f(y,x1,α)) Let's begin trying to show this. (idelΓ×t)∗Br(Θ)(λyαx1.f(y,x1,α)) =Br(Θ)(λyαx2.f(y,t(x2),α)) =maxθ′∈Br(Θ)θ′(λyαx2.f(y,t(x2),α)) =maxθ′∈Br(Θ)(idelΓ×t)∗(θ′)(λyαx1.f(y,x1,α)) Now we will show that all the contributions of this form lie in Br((idΓ×t)∗Θ). Clearly the y∈α condition is always fulfilled for these pushforwards, so that just leaves the endofunction condition. Let's begin. (idelΓ×t)∗(θ′)(λyαx1.χs(y)∈αg(s(y),x1))=θ′(λyαx2.χs(y)∈αg(s(y),t(x2))) ≤Θ(λyx2.g(y,t(x2)))=(idΓ×t)∗(Θ)(λyx1.g(y,x1)) And we're done, we established our desired result that the pushforward lands in the appropriate set. So, we can proceed by going ≤maxθ′′∈Br((idΓ×t)∗(Θ))θ′′(λyαx1.f(y,x1,α)) =Br((idΓ×t)∗(Θ))(λyαx1.f(y,x1,α)) And we're done! ■

Proposition 2.11: Consider any Γ, Φ1, Φ2, t:Φ2→Φ1, Ξ:Φ1→□Φ2 and Θ∈□c(Γ×Φ1) s.t. t∘Ξ=idΦ1. Then, (idelΓ×Ξ)∗Br(Θ)⊆Br((idΓ×Ξ)∗Θ)⊆(idelΓ×t)∗Br(Θ) In particular, prelΓBr(Θ)=prelΓBr((idΓ×Ξ)∗Θ)

Well, let's get started on proving these various things. To begin with, to prove (idelΓ×Ξ)∗Br(Θ)⊆Br((idΓ×Ξ)∗Θ) we need to prove that, for all f:elΓ×Φ2→[0,1], (idelΓ×Ξ)∗Br(Θ)(λyαx2.f(y,x2,α))≤Br((idΓ×Ξ)∗Θ)(λyαx2.f(y,x2,α)) Let's establish this. Unpacking the left-hand side, we have (idelΓ×Ξ)∗Br(Θ)(λyαx2.f(y,x2,α)) =Br(Θ)(λyαx1.Ξ(x1)(λx2.f(y,x2,α))) =maxθ′∈Br(Θ)θ′(λyαx1.Ξ(x1)(λx2.f(y,x2,α))) =maxθ′∈Br(Θ)(idelΓ×Ξ)∗(θ′)(λyαx2.f(y,x2,α)) =maxθ′∈Br(Θ)maxθ′′∈(idelΓ×Ξ)∗(θ′)θ′′(λyαx2.f(y,x2,α)) Now we'll show that this θ′′ lies in Br((idΓ×Ξ)∗Θ). The support condition is trivial, so that leaves the endofunction condition. θ′′(λyαx2.χs(y)∈αg(s(y),x2))≤(idelΓ×Ξ)∗(θ′)(λyαx2.χs(y)∈αg(s(y),x2)) =θ′(λyαx1.Ξ(x1)(λx2.χs(y)∈αg(s(y),x2))) Then, by homogeneity of Ξ, we can pull a constant out, the indicator function, to get =θ′(λyαx1.χs(y)∈αΞ(x1)(λx2.g(s(y),x2))) Then, by the endofunction condition, we get ≤Θ(λyx1.Ξ(x1)(λx2.g(y,x2))) =(idΓ×Ξ)∗(Θ)(λyx2.g(y,x2)) And bam, we've shown that θ′′ lies in the appropriate set. We were last at =maxθ′∈Br(Θ)maxθ′′∈(idelΓ×Ξ)∗(θ′)θ′′(λyαx2.f(y,x2,α)) So we can impose an upper bound of ≤maxθ′′∈Br((idΓ×Ξ)∗Θ)θ′′(λyαx2.f(y,x2,α)) ≤Br((idΓ×Ξ)∗(Θ))(λyαx2.f(y,x2,α)) And we're done, we've established our desired inequality. Now for proving the other inequality, Br((idΓ×Ξ)∗Θ)⊆(idelΓ×t)∗Br(Θ) This can be proved by applying Proposition 2.10. Start out with Proposition 2.10. (idelΓ×t)∗Br(Θ′)⊆Br((idΓ×t)∗Θ′) Specialize Θ′ to (idΓ×Ξ)∗(Θ), yielding (idelΓ×t)∗Br((idΓ×Ξ)∗Θ)⊆Br((idΓ×t)∗(idΓ×Ξ)∗Θ) The two pushforwards on the right can be folded into one. (idelΓ×t)∗Br((idΓ×Ξ)∗Θ)⊆Br((idΓ×t∘Ξ)∗Θ) Now use that t∘Ξ is identity (idelΓ×t)∗Br((idΓ×Ξ)∗Θ)⊆Br(Θ) Apply pullback along t to both sides (idelΓ×t)∗(idelΓ×t)∗Br((idΓ×Ξ)∗Θ)⊆(idelΓ×t)∗Br(Θ) Pushforward then pullback along the same deterministic function cancels out to identity, so we get our desired result of Br((idΓ×Ξ)∗Θ)⊆(idelΓ×t)∗Br(Θ) That just leaves showing the equality result, now that we've got both subset inclusions.

We start out with (idelΓ×Ξ)∗Br(Θ)⊆Br((idΓ×Ξ)∗Θ)⊆(idelΓ×t)∗Br(Θ) Applying projection yields prelΓ((idelΓ×Ξ)∗Br(Θ))⊆prelΓ(Br((idΓ×Ξ)∗Θ))⊆prelΓ((idelΓ×t)∗Br(Θ)) For any given f, we have prelΓ((idelΓ×Ξ)∗Br(Θ))(λyα.f(y,α)) ≤prelΓ(Br((idΓ×Ξ)∗Θ))(λyα.f(y,α)) ≤prelΓ((idelΓ×t)∗Br(Θ))(λyα.f(y,α)) Unpacking the projection on the two ends, we get (idelΓ×Ξ)∗Br(Θ)(λyαx2.f(y,α)) ≤prelΓ(Br((idΓ×Ξ)∗Θ))(λyα.f(y,α)) ≤(idelΓ×t)∗Br(Θ)(λyαx2.f(y,α)) And then, for the smallest and largest quantities, the pullback or pushforward only affects the x1 or x2 coordinates, everything else remains the same. In particular, f doesn't depend on such coordinates. So, we get Br(Θ)(λyαx1.f(y,α)) ≤prelΓ(Br((idΓ×Ξ)∗Θ))(λyα.f(y,α)) ≤Br(Θ)(λyαx1.f(y,α)) The left and right hand sides are equal, so we have prelΓ(Br((idΓ×Ξ)∗Θ))(λyα.f(y,α))=Br(Θ)(λyαx1.f(y,α)) Then, packing the projection back up, we have prelΓ(Br((idΓ×Ξ)∗Θ))(λyα.f(y,α))=prelΓBr(Θ)(λyα.f(y,α)) And since there's equality for all functions, the two ultradistributions are equal, so we have prelΓ(Br((idΓ×Ξ)∗Θ))=prelΓBr(Θ) And we're done. ■

Proposition 2.12: For any Γ, Φ, Θ1,Θ2∈□c(Γ×Φ) and p∈[0,1] pBr(Θ1)+(1−p)Br(Θ2)⊆Br(pΘ1+(1−p)Θ2)

We'll do this by showing that all contributions of the form pθ+(1−p)θ′ with θ∈Br(Θ1) and θ′∈Br(Θ2) lie within Br(pΘ1+(1−p)Θ2). The support property is trivial, so that leaves the endofunction property. (pθ+(1−p)θ′)(λyαx.χs(y)∈αg(s(y),x)) =pθ(λyαx.χs(y)∈αg(s(y),x))+(1−p)θ′(λyαx.χs(y)∈αg(s(y),x)) ≤pΘ1(λyx.g(y,x))+(1−p)Θ2(λyx.g(y,x)) =(pΘ1+(1−p)Θ2)(λyx.g(y,x)) Done, the mix lies in Br(pΘ1+(1−p)Θ2). ■

Proposition 2.13: Consider some Γ, Φ1, Φ2, Θ1∈□c(Γ×Φ1), Θ2∈□c(Γ×Φ2) and p∈[0,1]. Regard Φ1,2 as subsets of Φ1⊔Φ2, so that pΘ1+(1−p)Θ2∈□c(Γ×(Φ1⊔Φ2)). Then, pBr(Θ1)+(1−p)Br(Θ2)=Br(pΘ1+(1−p)Θ2)

We already established one subset direction in Proposition 2.12. We just need the other direction, to take any θ′′∈Br(pΘ1+(1−p)Θ2), and write it as pθ+(1−p)θ′ where θ∈Br(Θ1) and θ′∈Br(Θ2).

We trivially have equality when p=0 or 1, so that leaves the cases where p∈(0,1). With that, our attempted θ will be χΦ1θ′′/p, and our attempted θ′ will be χΦ2θ′′/(1−p).

Now, clearly, these mix to make θ′′, because pθ+(1−p)θ′=p·(χΦ1θ′′/p)+(1−p)·(χΦ2θ′′/(1−p))=χΦ1θ′′+χΦ2θ′′=θ′′ Remember, Φ1 and Φ2 are the two components of a space made by disjoint union.
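As a quick sanity check of the decomposition just described, here is a minimal Python sketch. It assumes a contribution is stored as a dictionary from points to masses, and it simplifies points to (y, x) pairs (dropping the α coordinate); the function names `split_over_components` and `mix` are hypothetical and only for illustration.

```python
import math

def split_over_components(theta2, phi1, phi2, p):
    """Split theta'' into theta = chi_Phi1 * theta'' / p and theta' = chi_Phi2 * theta'' / (1 - p).

    theta2 maps (y, x) pairs to masses; phi1 and phi2 are the disjoint components.
    Requires 0 < p < 1.
    """
    theta = {pt: m / p for pt, m in theta2.items() if pt[1] in phi1}
    theta_prime = {pt: m / (1 - p) for pt, m in theta2.items() if pt[1] in phi2}
    return theta, theta_prime

def mix(p, theta, theta_prime):
    """Return the pointwise mixture p * theta + (1 - p) * theta'."""
    out = {}
    for pt, m in theta.items():
        out[pt] = out.get(pt, 0.0) + p * m
    for pt, m in theta_prime.items():
        out[pt] = out.get(pt, 0.0) + (1 - p) * m
    return out

# Toy example: Phi1 = {"u"}, Phi2 = {"v"}, p = 1/2.
theta2 = {("y0", "u"): 0.25, ("y1", "v"): 0.25}
theta, theta_prime = split_over_components(theta2, {"u"}, {"v"}, 0.5)
remixed = mix(0.5, theta, theta_prime)
```

This only checks the mixing identity. It says nothing about the endofunction condition, which is the nontrivial part of the proof.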

They clearly both fulfill the support condition, because they're made from θ′′ which fulfills the support condition. This leaves the endofunction condition, which is somewhat nontrivial to show, and will require some setup. Clearly, without loss of generality, we just need to show it for θ, and θ′ follows by identical arguments. For some g:Γ×(Φ1⊔Φ2)→[0,1], g′ will denote the function that mimics g on Γ×Φ1, and is 0 on Γ×Φ2.

Also, note that θ, as defined, is supported entirely on Γ×Φ1, and θ′, as defined, is supported entirely on Γ×Φ2. Let's get started on showing our endofunction condition. Let g and s be arbitrary. θ(λyαx.χs(y)∈αg(s(y),x)) =θ(λyαx.χs(y)∈αg(s(y),x))+((1−p)/p)·θ′(λyαx.0) This is because the expectation of an all-0 function is 0. Then, =θ(λyαx.χs(y)∈αg′(s(y),x))+((1−p)/p)·θ′(λyαx.χs(y)∈αg′(s(y),x)) Why does this happen? Well, θ is supported entirely on Φ1, and g′ mimics g perfectly on Φ1. And θ′ is supported entirely on Φ2, and g′ is identically 0 on Φ2. So it doesn't matter that we changed the functions outside of the contribution support. =(θ+((1−p)/p)·θ′)(λyαx.χs(y)∈αg′(s(y),x)) =(1/p)·(pθ+(1−p)θ′)(λyαx.χs(y)∈αg′(s(y),x)) =(1/p)·θ′′(λyαx.χs(y)∈αg′(s(y),x)) Now we use the endofunction condition on θ′′ to get ≤(1/p)·(pΘ1+(1−p)Θ2)(λyx.g′(y,x)) =(Θ1+((1−p)/p)·Θ2)(λyx.g′(y,x)) =Θ1(λyx.g′(y,x))+((1−p)/p)·Θ2(λyx.g′(y,x)) And then we use that Θ1 is supported on Γ×Φ1 and Θ2 is supported on Γ×Φ2, and on Φ1, g′=g, and on Φ2, g′=0, to get =Θ1(λyx.g(y,x))+((1−p)/p)·Θ2(λyx.0) =Θ1(λyx.g(y,x)) And we're done! The endofunction condition goes through, which is the last piece of the proof we needed. ■

Proposition 2.14: Consider some Γ, Φ, Θ∈□c(Γ×Φ) and F⊆Γ. Let ∩F:2Γ→2Γ be defined by ∩F(α):=F∩α. Then, BrΓ(Θ∩⊤F×Φ)=(idΓ×∩F×idΦ)∗(BrΓ(Θ)∩⊤F×2Γ×Φ) Moreover, let ι:Γ′→Γ be an injection and elι:elΓ′→elΓ be the injection induced by ι. Then, BrΓ′((ι×idΦ)∗(Θ∩⊤im(ι)×Φ))=(elι×idΦ)∗BrΓ(Θ∩⊤im(ι)×Φ)

We'll have to prove both directions separately. Notice that for our proofs so far, they involve transforming one side of the equation to a certain form, then there's a critical step in the middle that's tricky to show, then we just transform back down. So our first step will be unpacking the two sides of the equation up to the critical inequality.

Also, intersecting with the top distribution corresponding to a particular set is the same as the raw-update on that set, we'll use that. We'll suppress identity functions and unused coordinates for ⊤ in the notation, so if some coordinate isn't mentioned, assume it's either the identity function for pushforwards, or that the set for ⊤ is the specified set times the entire space for unmentioned coordinates. With that notational note, we have BrΓ(Θ∩⊤F×Φ)(λyαx.f(y,x,α)) Suppressing notation =Br(Θ∩⊤F)(λyαx.f(y,x,α)) And rewriting the bridge transform, =maxθ∈Br(Θ∩⊤F)θ(λyαx.f(y,x,α)) This is our unpacking in one direction.

In the other direction, we have (idΓ×∩F×idΦ)∗(BrΓ(Θ)∩⊤F×2Γ×Φ)(λyαx.f(y,x,α)) Suppress some notation, =(∩F)∗(Br(Θ)∩⊤F)(λyαx.f(y,x,α)) Unpack the pushforward and intersection =(Br(Θ)∩⊤F)(λyαx.f(y,x,α∩F)) =Br(Θ)(λyαx.χy∈Ff(y,x,α∩F)) And unpack the bridge transform =maxθ′∈Br(Θ)θ′(λyαx.χy∈Ff(y,x,α∩F)) Pack back up =maxθ′∈Br(Θ)(θ′∩⊤F)(λyαx.f(y,x,α∩F)) =maxθ′∈BrΓ(Θ)(∩F)∗(θ′∩⊤F)(λyαx.f(y,x,α)) So, our goal to show equality overall is to establish the equality maxθ∈Br(Θ∩⊤F)θ(λyαx.f(y,x,α)) =maxθ′∈Br(Θ)(∩F)∗(θ′∩⊤F)(λyαx.f(y,x,α)) This can be done by establishing the following two things. First, is, if θ′∈Br(Θ), then (∩F)∗(θ′∩⊤F)∈Br(Θ∩⊤F). That establishes the ≥ inequality direction.

For the reverse direction, we'll need to show that for any θ∈Br(Θ∩⊤F), θ also lies in Br(Θ), and that (∩F)∗(θ∩⊤F)=θ. This establishes the ≤ inequality direction, since any choice of θ for the left hand side can be duplicated on the right hand side.

Let's switch our proof target to showing these two things. First off, assume θ′∈Br(Θ). The goal is to show that (∩F)∗(θ′∩⊤F)∈Br(Θ∩⊤F)

For once, the support condition is not entirely trivial. However, notice that θ′∩⊤F always has y∈F holding (because we updated) and y∈α always holds (because θ′ fulfills the support condition due to being in Br(Θ)). So, it's guaranteed that y∈α∩F. Then, applying the pushforward that turns α into α∩F doesn't change that there's a guarantee that y lies in the set it's paired with.

Now for the endofunction condition. (∩F)∗(θ′∩⊤F)(λyαx.χs(y)∈αg(s(y),x)) =(θ′∩⊤F)(λyαx.χs(y)∈α∩Fg(s(y),x)) =θ′(λyαx.χy∈Fχs(y)∈α∩Fg(s(y),x)) ≤θ′(λyαx.χs(y)∈α∩Fg(s(y),x))=θ′(λyαx.χs(y)∈αχs(y)∈Fg(s(y),x)) ≤Θ(λyxχy∈Fg(y,x))=(Θ∩⊤F)(λyx.g(y,x)) And we're done, that half of the proof works out.

Now for the reverse direction, establishing that for any θ∈Br(Θ∩⊤F), θ also lies in Br(Θ), and that (∩F)∗(θ∩⊤F)=θ.

The first part of this is easy. θ(λyαx.χs(y)∈αg(s(y),x))≤(Θ∩⊤F)(λyx.g(y,x)) =Θ(λyx.χy∈Fg(y,x))≤Θ(λyx.g(y,x)) And we're done. For the second part, it's a bit trickier. In short, what happens is that a piece of measure from θ is deleted if y∉F, and then α gets mapped to α∩F. So, if we knew that θ was supported on F×2F, we'd know that neither of the two transformations do anything whatsoever, and so you just get θ out again.

So, let's show that θ∈Br(Θ∩⊤F) is supported on F×2F. For this, the way we do it is use Proposition 2.7 to get an upper bound on the bridge transformation, and show that said upper bound is also supported on F×2F. By proposition 2.7, Br(Θ∩⊤F)⊆((Θ∩⊤F)⋉susΘ∩⊤F)↓ The downwards arrow means that we're passing from larger subsets to smaller subsets, so it doesn't matter if we remove that, it won't affect whether the support is a subset of F×2F. Clearly, Θ∩⊤F is supported on F. And, for sus, we have susΘ∩⊤F(x)={y|(y,x)∈supp (Θ∩⊤F)} So clearly it can only produce sets of y which are subsets of F since that's an upper bound on the support of Θ∩⊤F. So we have support on F×2F, and we're done with this half of the theorem.

So now we move onto the second half of the theorem with the injections. Without loss of generality, we can reduce this problem to a slightly simpler form, where we assume that Θ is supported over im(ι). This is because any Θ supported over im(ι) has Θ∩⊤im(ι)=Θ, and also for any Θ′, Θ′∩⊤im(ι) is always supported over im(ι).

Accordingly, assume that Θ is supported over im(ι). Suppressing identity functions, our goal is to prove that BrF((ι)∗Θ)=(elι)∗BrΓ(Θ) This is suprisingly aggravating to prove. Again, we'll try to rewrite things until we get to a point where there's just one equality to show, and then put in the work of showing the endofunction condition in both directions. Time to start rewriting. Subscripts F will be used to denote when a particular y is part of F, or of Γ. Similar with 2F and 2Γ. BrF((ι)∗Θ)(λyFαFx.f(yF,x,αF)) =maxθF∈BrF((ι)∗Θ)θF(λyFαFx.f(yF,x,αF)) Rewriting the other direction, we have (elι)∗BrΓ(Θ)(λyFαFx.f(yF,x,αF)) Now, for pullback, it's defined like this. Maximum over the empty set is 0. =BrΓ(Θ)(λyαx.maxyF,αF∈(elι)−1(y,α)f(yF,x,αF)) =maxθ∈BrΓ(Θ)θ(λyαx.maxyF,αF∈(elι)−1(y,α)f(yF,x,αF)) =maxθ∈BrΓ(Θ)(elι)∗(θ)(λyFαFx.f(yF,x,αF)) And so we now must embark on showing these two things are equal, by showing that every value that one of the maxes is capable of producing, the other is capable of producing too. Our proof goal is maxθF∈BrF((ι)∗Θ)θF(λyFαFx.f(yF,x,αF))=maxθ∈BrΓ(Θ)(elι)∗(θ)(λyFαFx.f(yF,x,αF)) Now, for one direction of this, we need to show that if θ∈BrΓ(Θ), then (elι)∗(θ)∈BrF((ι)∗Θ).

The support condition is easily fulfilled, because the only time the pullback works is when y,α∈im(elι), and ι is an injection, so there's a unique preimage point yF,αF, and since y∈F occurs always,then yF∈αF occurs always. That leaves the endofunction condition. Let sF:F→F, gF:F×Φ→[0,1]. Let s′ be defined as follows. For points in im(ι), it maps y to ι(sF(ι−1(y))). Everywhere else, it's the identity. Also, g′, for points in im(ι), maps y,x to g(ι−1(y),x), and is 0 everywhere else. Both of these are uniquely defined because ι is an injection. We have (elι)∗(θ)(λyFαFx:χsF(yF)∈αFgF(sF(yF),x)) =θ(λyαx:maxyF,αF∈(elι)−1(y,α)χsF(yF)∈αFgF(sF(yF),x)) We can partially write the maximum as an indicator function for checking whether y∈im(ι) and α⊆im(ι) (because in such a case the maximum is taken over an empty set and 0 is returned in those cases). And also, since there's only one point in that inverse, since ι is an injection, there's a canonical inverse and we can swap everything out for that, yielding =θ(λyαx:χy∈im(ι)∧α⊆im(ι)χsF(ι−1(y))∈ι−1(α)gF(sF(ι−1(y)),x)) Now, since ι is an injection, applying it to the point and the set in the indicator function don't affect whether the point is in the relevant set, so we can go =θ(λyαx:χy∈im(ι)∧α⊆im(ι)χι(sF(ι−1(y)))∈ι(ι−1(α))gF(sF(ι−1(y)),x)) And we can also apply ι and ι−1 to the point in the gF function, as that's just identity. =θ(λyαx:χy∈im(ι)∧α⊆im(ι)χι(sF(ι−1(y)))∈ι(ι−1(α))gF(ι−1(ι(sF(ι−1(y)))),x)) Use our s′ and g′ abbreviations since we know that the relevant point is in im(ι) if it made it past the indicator function. Also do some canceling out of identity functions around the α. 
=θ(λyαx:χy∈im(ι)∧α⊆im(ι)χs′(y)∈αgF(ι−1(s′(y)),x)) And use our abbreviation of g′ =θ(λyαx:χy∈im(ι)∧α⊆im(ι)χs′(y)∈αg′(s′(y),x)) Remove the indicator function ≤θ(λyαx:χs′(y)∈αg′(s′(y),x)) Apply endofunction property ≤Θ(λyx.g′(y,x)) Now, g′ is 0 when y∉im(ι), and is gF(ι−1(y),x) otherwise, so we have =Θ(λyx.χy∈im(ι)gF(ι−1(y),x)) =Θ(λyx.maxyF∈ι−1(y)gF(yF,x))=(ι)∗Θ(λyFx.gF(yF,x)) And we're done here, the endofunction condition has been shown, establishing one direction of the inequality.

For the reverse direction, we'll be working on showing that if θF∈BrF((ι)∗Θ), then (elι)∗θF∈Br(Θ), though it will take a little bit of extra work at the end to show how this implies our desired equality. The support property obviously holds, we're just applying ι to our point and set, so that leaves the endofunction property. s and g are as usual. (elι)∗θF(λyαx.χs(y)∈αg(s(y),x)) =θF(λyFαFx.χs(ι(yF))∈ι(αF)g(s(ι(yF)),x)) Now, if s(ι(yF))∈im(ι), it might pass the indicator function, but if s(ι(yF))∉im(ι), it definitely won't. So let's place that indicator function. =θF(λyFαFx.χs(ι(yF))∈im(ι)χs(ι(yF))∈ι(αF)g(s(ι(yF)),x)) Now, since we know this stuff is in the image of ι (and ι(αF) definitely is), and ι is an injection, we can safely take the inverse of everything, yielding =θF(λyFαFx.χs(ι(yF))∈im(ι)χι−1(s(ι(yF)))∈ι−1(ι(αF))g(ι(ι−1(s(ι(yF)))),x)) In order, this was because we could reverse the injection as long as we're past that indicator function, reversing the injection doesn't affect whether the point is an element of the set, and ι∘ι−1 is identity. Cancel out some of the identity.. =θF(λyFαFx.χs(ι(yF))∈im(ι)χι−1(s(ι(yF)))∈αFg(ι(ι−1(s(ι(yF)))),x)) Now, at this point, let sF:F→F be defined as sF(yF)=ι−1(s(ι(yF))) when s(ι(yF))∈im(ι), and arbitrary otherwise. And let gF be defined as gF(yF,x)=g(ι(yF),x) Now, using these abbreviations, we can go =θF(λyFαFx.χs(ι(yF))∈im(ι)χsF(yF)∈αFg(ι(sF(yF)),x)) =θF(λyFαFx.χs(ι(yF))∈im(ι)χsF(yF)∈αFgF(sF(yF),x)) Now we can safely strip away the indicator function. ≤θF(λyFαFx.χsF(yF)∈αFgF(sF(yF),x)) And apply the endofunction condition ≤(ι)∗(Θ)(λyFx.gF(yF,x)) Unpacking the pullback, we get =Θ(λyx.maxyF∈ι−1(y)gF(yF,x)) Now unpack the definition of gF =Θ(λyx.maxyF∈ι−1(y)g(ι(yF),x)) Use that Θ is supported on the image of ι, so that max is always nonempty. Then, things cancel out to yield =Θ(λyx.g(y,x)) And we're done there.

Ok, so... we've shown one inequality direction already, all that remained to be shown was maxθF∈BrF((ι)∗Θ)θF(λyFαFx.f(yF,x,αF))≤maxθ∈BrΓ(Θ)(elι)∗(θ)(λyFαFx.f(yF,x,αF)) How does it help to know that for any θF∈BrF((ι)∗Θ), then (elι)∗θF∈BrΓ(Θ)? Well, for any particular θF chosen on the left-hand side, you can let the choice of θ for the right hand side be (elι)∗θF, which lies in the appropriate set. Then, the contribution for the right-hand-side would be (elι)∗((elι)∗(θF)), which happens to be θF. So anything the left-hand-side can do, the right-hand-side can do as well, showing our one remaining direction of inequality, and concluding the proof.

Well, actually, I should flesh out that claim that (elι)∗((elι)∗(θF))=θF. This happens because (elι)∗((elι)∗(θF))(λyFαFx.f(yF,x,αF)) =(elι)∗(θF)(λyαx.χy,α∈elFmaxyF,αF∈(elι)−1(y,α)f(yF,x,αF)) =θF(λyFαFx.χι(yF),ι(αF)∈elFmaxy′F,α′F∈(elι)−1(ι(yF),ι(αF))f(y′F,x,α′F)) Now, that initial indicator condition is always true because θF is supported on elF, so we have =θF(λyFαFx.maxy′F,α′F∈(elι)−1(ι(yF),ι(αF))f(y′F,x,α′F)) And then, since elι is injective, applying it and then taking the partial inverse is just the identity, so we get =θF(λyFαFx.f(yF,x,αF)) And that's why we have equality. Alright, proof finally done. ■

Proposition 2.15: Consider some Γ0, Γ1, r:Γ0→Γ1, Φ, Θ∈□c(Γ0×Φ). Let ι:Γ0→Γ0×Γ1 be given by ι(y):=(y,r(y)). Then, BrΓ0×Γ1((ι×idΦ)∗Θ)=(elι×idΦ)∗BrΓ0(Θ)

This can be established by Proposition 2.14. Taking Proposition 2.14, and using Θ′ as an abbreviation for (ι×idΦ)∗Θ, and im(ι) as the F, we get BrΓ0×Γ1(Θ′∩⊤im(ι)×Φ) =(idΓ0×Γ1×∩im(ι)×idΦ)∗(BrΓ0×Γ1(Θ′)∩⊤im(ι)×2Γ×Φ) We'll need to abbreviate in a bit, so ignore some of the identity functions to get BrΓ0×Γ1(Θ′∩⊤im(ι)×Φ) =(∩im(ι))∗(BrΓ0×Γ1(Θ′)∩⊤im(ι)×2Γ×Φ) Now, deabbreviating, Θ′, i.e. (ι×idΦ)∗Θ, is already supported on im(ι), so intersecting with ⊤im(ι)×Φ does nothing. BrΓ0×Γ1(Θ′∩⊤im(ι)×Φ) =(∩im(ι))∗(BrΓ0×Γ1(Θ′∩⊤im(ι)×Φ)∩⊤im(ι)×2Γ×Φ) Now, since Θ′∩⊤im(ι)×Φ is only supported on im(ι), the bridge transform will only be supported on elim(ι). So, we can pull back along the function elι and pushforward and it will be identity, as our ultracontribution is entirely supported on the area that can be pulled back. So, we get BrΓ0×Γ1(Θ′∩⊤im(ι)×Φ) =(∩im(ι))∗((elι×idΦ)∗((elι×idΦ)∗(BrΓ0×Γ1(Θ′∩⊤im(ι)×Φ)))∩⊤im(ι)×2Γ×Φ) Applying the second part of Proposition 2.14, this can be rewritten as BrΓ0×Γ1(Θ′∩⊤im(ι)×Φ) =(∩im(ι))∗((elι×idΦ)∗(BrΓ0((ι×idΦ)∗(Θ′∩⊤im(ι)×Φ)))∩⊤im(ι)×2Γ×Φ) Since Θ′ deabbreviates as (ι×idΦ)∗Θ, it's supported on im(ι), so intersecting with im(ι) does nothing. Getting rid of that part, and deabbreviating, we get BrΓ0×Γ1((ι×idΦ)∗Θ) =(∩im(ι))∗((elι×idΦ)∗(BrΓ0((ι×idΦ)∗((ι×idΦ)∗Θ)))∩⊤im(ι)×2Γ×Φ) Now, since ι is an injection, pushforward-then-pullback is identity, so we get BrΓ0×Γ1((ι×idΦ)∗Θ) =(∩im(ι))∗((elι×idΦ)∗(BrΓ0(Θ))∩⊤im(ι)×2Γ×Φ) Then, since the pushforward through elι is supported on im(ι), the intersection does nothing BrΓ0×Γ1((ι×idΦ)∗Θ)=(∩im(ι))∗((elι×idΦ)∗(BrΓ0Θ)) Then note that any set pushed forward through elι must be a subset of im(ι), so intersecting with im(ι) does nothing, and we get BrΓ0×Γ1((ι×idΦ)∗Θ)=(elι×idΦ)∗(BrΓ0Θ) And the result is proven. ■


### Infra-Bayesian physicalism: a formal theory of naturalized induction

December 1, 2021 - 01:25
Published on November 30, 2021 10:25 PM GMT

This is joint work by Vanessa Kosoy and Alexander "Diffractor" Appel. For the proofs, see 1 and 2.

TLDR: We present a new formal decision theory that realizes naturalized induction. Our agents reason in terms of infra-Bayesian hypotheses, the domain of which is the cartesian product of computations and physical states, where the ontology of "physical states" may vary from one hypothesis to another. The key mathematical building block is the "bridge transform", which, given such a hypothesis, extends its domain to "physically manifest facts about computations". Roughly speaking, the bridge transform determines which computations are executed by the physical universe. In particular, this allows "locating the agent in the universe" by determining on which inputs its own source is executed.

0. Background

The "standard model" of ideal agency is Bayesian reinforcement learning, and more specifically, AIXI. We challenged this model before due to its problems with non-realizability, suggesting infra-Bayesianism as an alternative. Both formalisms assume the "cartesian cybernetic framework", in which (i) the universe is crisply divided into "agent" and "environment" and (ii) the two parts interact solely via the agent producing actions which influence the environment and the environment producing observations for the agent. This is already somewhat objectionable on the grounds that this division is not a clearly well-defined property of the physical universe. Moreover, once we examine the structure of the hypothesis such an agent is expected to learn (at least naively), we run into some concrete problems.

The modern understanding of the universe is that no observer plays a privileged role[1]. Therefore, the laws of physics are insufficient to provide a cartesian description of the universe, and must, to this end, be supplemented with "bridge rules" that specify the agent's location inside the universe. That is, these bridge rules need to translate the fundamental degrees of freedom of a physical theory (e.g. quantum wavefunction) to the agent's observations (e.g. values of pixels on a camera), and translate the agent's actions (e.g. signal to robot manipulators) in the other direction[2]. The cost of this is considerable growth in the description complexity of the hypothesis.

Another assumption of AIXI is the simplicity prior, and we expect some form of this assumption to persist in computable and infra-Bayesian analogues. This reflects the intuitive idea that we expect the world to follow simple laws (or at least contain simple patterns). However, from the cartesian perspective, the "world" (i.e. the environment) is, prima facie, not simple at all (because of bridge rules)! Admittedly, the increase in complexity from the bridge rule is low compared to the cost of specifying the universe state, but once the agent learns the transition rules and bridge rule for the universe it's in, learning the state of the universe in addition doesn't seem to yield any particular unforeseen metaphysical difficulties. Further, the description complexity cost of the bridge rule seems likely to be above the description complexity of the laws of physics.

Hence, there is some disconnect between the motivation for using a simplicity prior and its implementation in a cartesian framework.

Moreover, if the true hypothesis is highly complex, it implies that the sample complexity of learning it is very high. And, as previously mentioned, the sample complexity issues are worse in practice than Solomonoff suggests. This should make us suspect that such a learning process is not properly exploiting Occam's razor. Intuitively, such an agent a priori considers it equally plausible to discover itself to be a robot or to discover itself to be a random clump of dust in outer space, since it's about as hard to specify a bridge rule interface between the computer and observations as it is to specify a bridge rule interface between the dust clump and observations, and it needs a lot of data to resolve all those possibilities for how its observations connect to a world. Also, though Solomonoff is extremely effective at slicing through the vast field of junk hypotheses that do not describe the thing being predicted, once it's whittled things down to a small core of hypotheses that do predict things fairly accurately, the data to further distinguish between them may be fairly slow in coming. If there's a simple predictor of events occurring in the world but it's running malign computation, then you don't have the luxury of 500 bits of complexity wiggle room (to quickly knock out this hypothesis), because that's a factor of 2^500 probability difference. Doing worst-case hypothesis testing as in KWIK learning would require a very aggressive threshold indeed, and mispredictions can be rare but important. Furthermore, some events are simple to describe from the subjective (cartesian) point of view, but complex to describe from an objective (physical) point of view. (For example, all the pixels of the camera becoming black.) Modifying a hypothesis by positing exceptional behavior following a simple event only increases its complexity by the difficulty of specifying the event and what occurs afterwards, which could be quite low.
Hence, AIXI-like agents would have high uncertainty about the consequences of observationally simple events. On the other hand, from an objective perspective such uncertainty seems irrational. (Throwing a towel on the camera should not break physics.) In other words, cartesian reasoning is biased to privilege the observer.
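To put the "500 bits" remark above in numbers: assuming a simplicity prior that weights a hypothesis of description length K by something proportional to 2^(-K), a 500-bit gap in description length corresponds to a prior ratio of 2^500. The short Python sketch below just computes the size of that factor; the variable names are illustrative.

```python
import math

BITS = 500

# Prior odds ratio between two hypotheses whose description lengths differ by BITS,
# assuming prior weight proportional to 2 ** (-description_length).
prior_ratio = 2 ** BITS

# Size of that factor: number of decimal digits and base-10 logarithm.
digits = len(str(prior_ratio))
log10_ratio = BITS * math.log10(2)
```

So the factor is on the order of 10^150, which is why a threshold that tolerates this much complexity slack cannot quickly rule anything out.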

Yet another failure of cartesian agents is the inability to reason about origin theories. When we learn that a particular theory explains our own existence (e.g. evolution), this serves as a mark in favor of the theory. We can then exploit this theory to make useful predictions or plans (e.g. anticipate that using lots of antibiotics will cause bacteria to develop resistance). However, for a cartesian agent the question of origins is meaningless. Such an agent perceives its own existence as axiomatic, hence there is nothing to explain.

Finally, cartesian agents are especially vulnerable to acausal attacks. Suppose we deploy a superintelligent cartesian AI called Kappa. And, imagine a superintelligent agent Mu that inhabits some purely hypothetical universe. If Mu is motivated to affect our own (real) universe, it can run simulations of Kappa's environment. Kappa, who doesn't know a priori whether it exists in our universe or in Mu's universe, will have to seriously consider the hypothesis that it is inside such a simulation. And, Mu will deploy the simulation in such a manner as to make the simulation hypothesis much simpler, thanks to simpler bridge rules. This will cause Kappa to become overwhelmingly confident that it is in a simulation. If this is achieved, Mu can cause the simulation to diverge from our reality at a strategically chosen point such that Kappa is induced to take an irreversible action in Mu's favor (effectively a treacherous turn). Of course, this requires Kappa to predict Mu's motivations in some detail. This is possible if Kappa develops a good enough understanding of metacosmology.

An upcoming post by Diffractor will discuss acausal attacks in more detail.

In the following sections, we will develop a "physicalist" formalism that entirely replaces the cartesian framework, curing the abovementioned ills, though we have not yet attained the stage of proving improved regret bounds with it, just getting the basic mathematical properties of it nailed down. As an additional benefit, it allows naturally incorporating utility functions that depend on unobservables, thereby avoiding the problem of "ontological crises". At the same time, it seems to impose some odd constraints on the utility function. We discuss the possible interpretations of this.

1. Formalism

Notation

It will be more convenient to use ultradistributions rather than infradistributions. This is a purely notational choice: the decision theory is unaffected, since we are going to apply these ultradistributions to loss functions rather than utility functions. As support for this claim, Diffractor originally wrote down most of the proofs in infradistribution form, and then changing their form for this post was rather straightforward to do. In addition, for the sake of simplicity, we will stick to finite sets: more general spaces will be treated in a future article. So far, we're up to countable products of finite sets.

We denote R+:=[0,∞). Given a finite set X, a contribution on X is θ:X→R+ s.t. ∑xθ(x)≤1 (it's best to regard it as a measure on X). The space of contributions is denoted ΔcX. Given f:X→R and θ∈ΔcX, we denote θ(f):=∑xθ(x)f(x). There is a natural partial order on contributions: θ1≤θ2 when ∀x∈X:θ1(x)≤θ2(x). Naturally, any distribution is in particular a contribution, so ΔX⊆ΔcX. A homogenous ultracontribution (HUC) on X is a non-empty closed convex Θ⊆ΔcX which is downward closed w.r.t. the partial order on ΔcX. The space of HUCs on X is denoted □cX. A homogenous ultradistribution (HUD) on X is a HUC Θ s.t. Θ∩ΔX≠∅. The space of HUDs on X is denoted □X. Given f:X→R and Θ∈□cX, we denote Θ(f):=maxθ∈Θθ(f).
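The definitions above can be made concrete with a small Python sketch. It rests on two assumptions that are not in the text: a contribution is stored as a dictionary from points to masses, and a HUC is represented by a finite list of generating contributions (standing for the downward closure of their convex hull). For a nonnegative f, the maximum of θ(f) over such a set is attained at one of the generators, so `huc_expect` only scans the list. All names here are hypothetical.

```python
from typing import Callable, Dict, Hashable, List

Contribution = Dict[Hashable, float]  # point x of a finite set X -> mass theta(x)

def is_contribution(theta: Contribution, tol: float = 1e-12) -> bool:
    """Check theta(x) >= 0 for all x and that the total mass is at most 1."""
    return all(m >= 0 for m in theta.values()) and sum(theta.values()) <= 1 + tol

def expect(theta: Contribution, f: Callable[[Hashable], float]) -> float:
    """theta(f) := sum over x of theta(x) * f(x)."""
    return sum(m * f(x) for x, m in theta.items())

def leq(theta1: Contribution, theta2: Contribution, tol: float = 1e-12) -> bool:
    """Pointwise partial order on contributions: theta1(x) <= theta2(x) for every x."""
    points = set(theta1) | set(theta2)
    return all(theta1.get(x, 0.0) <= theta2.get(x, 0.0) + tol for x in points)

def huc_expect(generators: List[Contribution], f: Callable[[Hashable], float]) -> float:
    """Theta(f) := max over theta in Theta of theta(f), for nonnegative f,
    where Theta is the downward-closed convex hull of `generators`."""
    return max(expect(theta, f) for theta in generators)
```

For example, with generators `{"a": 0.5, "b": 0.5}` and `{"a": 0.2, "b": 0.3}` and f the indicator of `"a"`, the second generator lies below the first in the partial order and contributes nothing to the maximum.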

Given s:X→Y, s∗:□cX→□cY is the pushforward by s:

s∗Θ:={s∗θ∣θ∈Θ}

(s∗θ)(y):=∑x∈s−1(y)θ(x)
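A minimal Python sketch of this pushforward, assuming contributions are dictionaries from points to masses and a HUC is given by a finite list of generators (both representation choices are assumptions for illustration, as are the function names):

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List

Contribution = Dict[Hashable, float]

def pushforward(s: Callable[[Hashable], Hashable], theta: Contribution) -> Contribution:
    """(s_* theta)(y) := sum of theta(x) over all x with s(x) = y."""
    out = defaultdict(float)
    for x, mass in theta.items():
        out[s(x)] += mass
    return dict(out)

def pushforward_generators(s: Callable[[Hashable], Hashable],
                           generators: List[Contribution]) -> List[Contribution]:
    """Push every generator of a HUC forward along s."""
    return [pushforward(s, theta) for theta in generators]

# Example: collapse {0, 1, 2} onto parities.
parity = pushforward(lambda x: x % 2, {0: 0.25, 1: 0.25, 2: 0.5})
```

Total mass is preserved by construction, so the image of a contribution is again a contribution.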

Given Ξ:X→□cY, Ξ∗:□cX→□cY is the pushforward by Ξ:

Ξ∗Θ:={κ∗θ∣θ∈Θ,κ:X→ΔcY,∀x∈X:κ(x)∈Ξ(x)}

(κ∗θ)(y):=∑x∈Xθ(x)κ(x;y)
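The kernel pushforward can be sketched in the same style. Here an ultrakernel Ξ is assumed to be given as a dictionary sending each x to a finite list of generating contributions on Y, and a kernel κ is one choice of generator per x. Enumerating all such choices is exponential in the size of X, so this is only meant for toy examples; the names are hypothetical.

```python
from itertools import product
from typing import Dict, Hashable, Iterator, List

Contribution = Dict[Hashable, float]
Kernel = Dict[Hashable, Contribution]  # x -> contribution kappa(x) on Y

def kernel_pushforward(kappa: Kernel, theta: Contribution) -> Contribution:
    """(kappa_* theta)(y) := sum over x of theta(x) * kappa(x; y)."""
    out: Contribution = {}
    for x, mass in theta.items():
        for y, k in kappa[x].items():
            out[y] = out.get(y, 0.0) + mass * k
    return out

def selections(xi: Dict[Hashable, List[Contribution]]) -> Iterator[Kernel]:
    """Yield every kernel that picks one generator of Xi(x) for each x."""
    xs = list(xi)
    for choice in product(*(xi[x] for x in xs)):
        yield dict(zip(xs, choice))

def ultrakernel_pushforward_generators(xi: Dict[Hashable, List[Contribution]],
                                       theta_generators: List[Contribution]) -> List[Contribution]:
    """Candidate generators of Xi_* Theta: kappa_* theta over generator choices."""
    return [kernel_pushforward(kappa, theta)
            for theta in theta_generators
            for kappa in selections(xi)]

xi = {"x0": [{"y0": 1.0}], "x1": [{"y0": 1.0}, {"y1": 1.0}]}
result = ultrakernel_pushforward_generators(xi, [{"x0": 0.5, "x1": 0.5}])
```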

prX:X×Y→X is the projection mapping and pr−Y:=prX. We slightly abuse notation by omitting the asterisk in pushforwards by these.

Given Θ∈□cX and Ξ:X→□cY, Θ⋉Ξ∈□c(X×Y) is the semidirect product:

Θ⋉Ξ:={κ⋉θ∣θ∈Θ,κ:X→ΔcY,∀x∈X:κ(x)∈Ξ(x)}

(κ⋉θ)(x,y):=θ(x)κ(x;y)
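For a single contribution θ and a single kernel κ, the semidirect product is just the joint measure. A minimal Python sketch, again assuming dictionary representations (the function names are hypothetical):

```python
from typing import Dict, Hashable, Tuple

Contribution = Dict[Hashable, float]
Kernel = Dict[Hashable, Contribution]  # x -> contribution kappa(x) on Y

def semidirect(theta: Contribution, kappa: Kernel) -> Dict[Tuple[Hashable, Hashable], float]:
    """Joint contribution on X x Y with mass theta(x) * kappa(x; y) at (x, y)."""
    return {(x, y): mass * k
            for x, mass in theta.items()
            for y, k in kappa[x].items()}

def project_second(joint: Dict[Tuple[Hashable, Hashable], float]) -> Contribution:
    """Marginal on Y; for a semidirect product this is the kernel pushforward of theta."""
    out: Contribution = {}
    for (_, y), mass in joint.items():
        out[y] = out.get(y, 0.0) + mass
    return out

theta = {"x0": 0.5, "x1": 0.5}
kappa = {"x0": {"y0": 1.0}, "x1": {"y0": 0.5, "y1": 0.5}}
joint = semidirect(theta, kappa)
```

Projecting `joint` onto Y gives the same result as pushing θ forward through κ, which matches the relationship between the semidirect product and the kernel pushforward defined above.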

We will also use the notation Ξ⋊Θ∈□c(Y×X) for the same HUC with X and Y flipped. And, for Λ∈□cY,Θ⋉Λ∈□c(X×Y) is the semidirect product of Θ with the constant ultrakernel whose value is Λ[3].

For more discussion of HUDs, see the previous article, where we used the equivalent concept of "cohomogenous infradistribution".

Notation Reference

If you got lost somewhere and wanted to scroll back to see some definition, or see how the dual form of this with infradistributions works, that's what this section is for.

θ is a contribution, a measure with 1 or less measure in total. The dual concept is an a-measure (λμ,b) with λ+b=1.

Θ is a HUC (homogenous ultracontribution) or HUD (homogenous ultradistribution), a closed convex downwards closed set of contributions. The dual concepts are cohomogenous inframeasure and cohomogenous infradistribution, respectively.

ΔcX,□cX,□X are the spaces of contributions, homogenous ultracontributions, and homogenous ultradistributions respectively.

θ(f),Θ(f) are the expectations of functions f:X→[0,1], defined in the usual way. For θ, it's just the expectation of a function w.r.t. a measure, and for Θ, it's Θ(f):=maxθ∈Θθ(f), to perfectly parallel a-measures evaluating functions by just taking expectation, and inframeasures evaluating functions via min(m,b)∈Ψm(f)+b.

≤ is the ordering on contributions/HUC's/HUD's, which is the function ordering, where θ≤θ′ iff for all f:X→[0,1], θ(f)≤θ′(f), and similar for the HUC's. Inframeasures are equipped with the opposite ordering.

s∗Θ is the pushforward along the function s:X→Y. This is a standard probability theory concept which generalizes to all the infra and ultra stuff.

Ξ∗Θ is the pushforward of Θ along the ultrakernel Ξ:X→□cY. This is just the generalization to infra and ultra stuff of the ability to push a probability distribution on X through a probabilistic function X→ΔY to get a probability distribution on Y.

Θ⋉Ξ is the semidirect product of Θ and Ξ, an element of □c(X×Y). This is the generalization of the ability to take a probability distribution on X, and a probabilistic kernel X→ΔY, and get a joint distribution on X×Y.

A,O are the set of actions and observations.

N is the time horizon.

R is the space of programs, Σ is the space of outputs.

Γ=Σ^R is the set of functions from programs to the results they output. An element of it can be thought of as a computational universe, for it specifies what all the programs do.

H is the space of histories, action-observation sequences that can end at any point, ending with an observation.

D is the space of destinies, action-observation sequences that are as long as possible, going up to the time horizon.

C is a relation on Γ×D that says whether a computational universe is consistent with a given destiny. A very important note is that this is not the same thing as "the policy is consistent with the destiny" (the policy's actions are the same as what the destiny advises). This is saying something more like "if the destiny has an observation that the computer spat out result 1 when run on computation A, then that is only consistent with mathematical universes which have computation A outputting result 1". Except we don't want to commit to the exact implementation details of it, so we're leaving it undefined besides just "it's a relation".

Φ is the space of "physics outcomes", it can freely vary depending on the hypothesis. It's not a particular fixed space.

Θ is the variable typically used for physicalist hypotheses, elements of □(Γ×Φ). Your uncertainty over the joint distribution over the computational universe and the physical universe.

G is the code of the program-which-is-the-agent. So, G(h) would be the computation that runs what the agent does on history h, and returns its resulting action.

elΓ is the subset of the space Γ×2Γ consisting of (y,α) pairs s.t. y∈α. The y can be thought of as the mathematical universe, and α can be thought of as the set of mathematical universes that are observationally indistinguishable from it.

χA is the indicator function that's 1 on the set A, and 0 everywhere else. It can be multiplied by a measure, to get the restriction of a measure to a particular set.

Br(Θ) is the bridge transform of Θ, defined in definition 1.1, an ultracontribution over the space elΓ×Φ.

Hyα is the set of instantiated histories, relative to mathematical universe y, and set of observationally indistinguishable universes α. It's the set of histories h where all the "math universes" in α agree on how the agent's source code reacts to all the prefixes of h, and where the history h can be extended to some destiny that's consistent with math universe y. Ie, for a history to be in here, all the prefixes have to be instantiated, and it must be consistent with the selected math universe.
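Two of the glossary entries above, the pushforward and the semidirect product, are easiest to picture in the ordinary probabilistic special case that the infra and ultra versions generalize. The following is a minimal sketch under that restriction. The dictionary representation of distributions, the function names and the weather example are illustrative assumptions, not notation from this post.

```python
from collections import defaultdict

def pushforward(s, theta):
    """Push a finite distribution theta (dict: point -> mass) along a function s."""
    out = defaultdict(float)
    for x, mass in theta.items():
        out[s(x)] += mass
    return dict(out)

def kernel_pushforward(kernel, theta):
    """Push theta on X through a kernel X -> (distribution on Y)."""
    out = defaultdict(float)
    for x, mass in theta.items():
        for y, p in kernel(x).items():
            out[y] += mass * p
    return dict(out)

def semidirect(theta, kernel):
    """Joint distribution on X x Y built from theta on X and a kernel X -> dist on Y."""
    return {(x, y): mass * p
            for x, mass in theta.items()
            for y, p in kernel(x).items()}

def kernel(x):
    # Hypothetical kernel: rain always wets the ground, dry weather sometimes does.
    return {"wet": 1.0} if x == "rain" else {"wet": 0.5, "ok": 0.5}

theta = {"rain": 0.25, "dry": 0.75}
joint = semidirect(theta, kernel)               # lives on X x Y
marginal_y = kernel_pushforward(kernel, theta)  # lives on Y
```

Projecting the joint distribution onto its second coordinate with `pushforward` recovers `marginal_y`, which is the usual relationship between the semidirect product and the kernel pushforward.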

Setting

As in the cartesian framework, we fix a finite set A of actions and a finite set O of observations. We assume everything happens within a fixed finite[4] time horizon N∈ℕ. We assume that our agent has access to a computer[5] on which it can execute some finite[4:1] set of programs R with outputs in a finite alphabet Σ. Let Γ:=Σ^R be the set of "possible computational universes"[6]. We denote H:=(A×O)^{<N} (the set of histories) and D:=(A×O)^{N−1}×A (the set of "destinies").

To abstract over the details of how the computer is operated, we assume a relation C⊆D×Γ whose semantics is: dCy (our notation for (d,y)∈C) if and only if destiny d is a priori consistent with computational universe y. For example, suppose some a∈A implies a command to execute program r∈R, and if o∈O follows a, it implies observing the computer return output i∈Σ for r. Then, if d contains the substring ao and dCy, it must be the case that y(r)=i.
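To make the sizes of these sets concrete, here is a small sketch that enumerates the computational universes, histories and destinies for a toy choice of A, O, R, Σ and N. All concrete names are illustrative assumptions. The relation `consistent` is one hypothetical instance of C in the spirit of the example above (action "a1" runs program r0 and the next observation reports its output); the post deliberately leaves C abstract.

```python
from itertools import product

A = ["a0", "a1"]      # actions (illustrative)
O = ["o0", "o1"]      # observations (illustrative)
N = 2                 # time horizon
R = ["r0", "r1"]      # programs
SIGMA = [0, 1]        # output alphabet

# Gamma: every assignment of an output in SIGMA to each program in R.
GAMMA = [dict(zip(R, outs)) for outs in product(SIGMA, repeat=len(R))]

# H: action-observation sequences with fewer than N steps.
H = [h for n in range(N) for h in product(product(A, O), repeat=n)]

# D: N-1 full action-observation steps followed by one final action.
D = [(h, a) for h in product(product(A, O), repeat=N - 1) for a in A]

def consistent(d, y):
    """Toy relation C: action "a1" runs program r0, and the observation that
    follows ("o0" or "o1") reports its output. Other actions constrain nothing."""
    steps, _final_action = d
    return all(y["r0"] == int(o[1]) for a, o in steps if a == "a1")
```

With these choices there are 4 computational universes, 5 histories and 8 destinies, and a destiny containing the step ("a1", "o1") is consistent only with universes in which r0 outputs 1.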

A physicalist hypothesis is a pair (Φ,Θ), where Φ is a finite[4:2] set representing the physical states of the universe and Θ∈□(Γ×Φ) represents a joint belief about computations and physics. By slight abuse of notation we will refer to such Θ as a physicalist hypothesis, understanding Φ to be implicitly specified. Our agent will have a prior over such hypotheses, ranging over different Φ.

Two questions stand out to us at this point. The first is, what is the domain over which our loss function should be defined? The second is, how do we define the counterfactuals corresponding to different policies π:H→A? The answers to both questions turn out to require the same mathematical building block.

For the first question, we might be tempted to identify Φ as our domain. However, prima facie this doesn't make sense, since Φ is hypothesis-dependent. This is the ontological crisis problem: we expect the agent's values to be defined within a certain ontology which is not the best ontology for formulating the laws of the physical universe. For example, a paperclip maximizer might benefit from modeling the universe in terms of quantum fields rather than paperclips. In principle, we can circumvent this problem by requiring our Φ to be equipped with a mapping ν:Φ→Φ0, where Φ0 is the "axiological" ontology. However, this ν is essentially a bridge rule, carrying with it all the problems of bridge rules: acausal attack is performed by the adversarial hypothesis imprinting the "axiological rendering" of the target universe on the microscopic degrees of freedom of the source universe in order to have a low-complexity Φ→Φ0 function; the analogue of the towel-on-camera issue is that, once you've already coded up your uncertainty over math and physics, along with how to translate from physics to the ontology of value, it doesn't take too much extra complexity to tie "axiologically-simple" results (the analogue of low-complexity observations) to physics-simple results (the analogue of a low-complexity change in what happens), like "if all the paperclips are red, the fine-structure constant doubles in value".

Instead, we will take a computationalist stance: value is not a property of physical states or processes, but of the computations realized by physical processes. For example, if our agent is "selfish" in the sense that rewards/losses are associated purely with subjective histories, the relevant computation is the agent's own source code. Notice that, for the program-which-is-the-agent G, histories are input. Hence, given loss function l:H→ℝ we can associate the loss l(h) with the computation G(h). Admittedly, there is an implicit assumption that the agent has access to its own source code, but modal decision theory made the same assumption. For another example, if our agent is a diamond maximizer, then the relevant computations are simulations of the physics used to define "diamonds". A more concrete analogue of this is worked out in detail in section 3, regarding Conway's Game of Life.

For the second question, we might be tempted to follow updateless decision theory: counterfactuals correspond to conditioning on G=π. Remember, G is the code of the agent. However, this is not "fair" since it requires the agent to be "responsible" for copies of itself instantiated with fake memories. Such a setting admits no learning-theoretic guarantees, since learning requires trusting your own memory. (Moreover, the agent also has to be able to trust the computer.) Therefore our counterfactuals should only impose G(h)=π(h) when h is a "real memory", which we again interpret through computationalism: h is real if and only if G(h′) is physically realized for any prefix h′ of h.

Both of our answers require a formalization of the notion "assuming hypothesis Θ, this computation is physically realized". More precisely, we should allow for computations to be realized with certain probabilities, and more generally allow for ultradistributions over which computations are realized. We will now accomplish this formalization.

Bridge transform

Given any set A, we denote elA={(a,B)∈A×2A∣a∈B}. supp stands for "support" and χA is the characteristic function of A.

Definition 1.1: Let Γ,Φ be finite sets and Θ∈□c(Γ×Φ). The bridge transform of Θ is BrΓ(Θ)∈□c(Γ×2Γ×Φ) s.t. θ∈BrΓ(Θ) if and only if

• suppθ⊆elΓ×Φ

• for any s:Γ→Γ, prΓ×ΦχelΓs∗θ∈Θ.

We will use the notation Br(Θ) when Γ is obvious from the context.

The 2Γ variable of the bridge transform denotes the "facts about computations realized by physics". In particular, if this α∈2Γ takes the form {y∈Γ∣∀r∈R0:y(r)=y0(r)} for some R0⊆R and y0∈ΣR0, then we may say that the computations R0 are "realized" and the computations R∖R0 are "not realized". More generally, talking only about which computations are realized is imprecise since α might involve "partial realization" and/or "entanglement" between computations (i.e. not be of the form above).

Intuitively, this definition expresses that the "computational universe" can be freely modified as long as the "facts known by physics" are preserved. However, that isn't what originally motivated the definition. The bulk of its justification comes from its pleasing mathematical properties, discussed in the next section.
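Definition 1.1 can be checked by brute force in one simple special case: Θ=⊤S for a set S⊆Γ×Φ, i.e. Θ contains every contribution supported on S. The sketch below handles only that case. The representation of θ as a dictionary from (y,α,x) triples to masses, and the function name, are illustrative assumptions. For each s:Γ→Γ, pushing θ forward along s, restricting to elΓ and projecting out α leaves mass at the points (s(y),x) with s(y)∈α, so the second condition reduces to asking that all of those points lie in S.

```python
from itertools import product

def in_bridge_of_top(theta, gamma, S):
    """Test theta in Br(top_S), where top_S is all contributions supported on S.

    theta: dict mapping (y, alpha, x) -> mass, with alpha a frozenset of gamma.
    gamma: list of computational universes.
    S: set of (y, x) pairs.
    """
    if any(m < 0 for m in theta.values()) or sum(theta.values()) > 1 + 1e-12:
        return False  # not a contribution (total mass must be at most 1)
    support = [k for k, m in theta.items() if m > 0]
    # Condition 1: the support lies in el(Gamma) x Phi, i.e. y is in alpha.
    if any(y not in alpha for (y, alpha, x) in support):
        return False
    # Condition 2: quantify over every function s: Gamma -> Gamma.
    for images in product(gamma, repeat=len(gamma)):
        s = dict(zip(gamma, images))
        for (y, alpha, x) in support:
            # Mass survives the restriction to el(Gamma) only if s(y) is in alpha.
            if s[y] in alpha and (s[y], x) not in S:
                return False
    return True

GAMMA = [0, 1]
S = {(0, 0), (1, 1)}  # the physical state encodes the program's output
```

With this S, a point mass on (0,{0},0) passes the test, while a point mass on (0,{0,1},0) fails: physics state 0 rules out output 1, so the manifest facts cannot leave the output open.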

A physicalist agent should be equipped with a prior over physicalist hypotheses. For simplicity, suppose it's a discrete Bayesian prior (it is straightforward to generalize beyond this): hypothesis Θi is assigned probability ζ(i) and ∑iζ(i)=1. Then, we can consider the total bridge transform of the prior. It can't be given by mixing the hypotheses together and applying the bridge transform, because every hypothesis has its own choice of Φ, its own ontology, so you can't mix them before applying the bridge transform. You have to apply the bridge transform to each component first, forget about the choice of Φ via projecting it out, and then mix them afterwards. This receives further support from Proposition 2.13 which takes an alternate possible way of defining the bridge transform for mixtures (extend all the hypotheses from Γ×Φi to Γ×⨆iΦi in the obvious way so you can mix them first) and shows that it produces the same result.

Definition 1.2: Br(ζ⋉Θ):=∑iζ(i)prelΓBr(Θi)∈□elΓ

Evaluating policies

Given A⊆X, ⊤A∈□X is defined as {θ∈ΔcX∣suppθ⊆A}. I.e., it's total uncertainty over which point in A will be selected.

We need to assume that R contains programs representing the agent itself. That is, there is some M∈ℕ, dec:Σ^M→A and for each h∈H, ┌G(h)┐∈R^M. Pretty much, if you have large numbers of actions available, and a limited number of symbols in your language, actions (and the outputs of other programs with a rich space of outputs) can be represented by M-tuples of programs, like "what's the first bit of this action choice" and "what's the second bit of this action choice" and so on. So, M is just how many bits you need, dec is the mapping from program outputs to the actual action, and ┌G(h)┐ is the M-tuple of programs which implements the computation "what does my source code do on input history h". The behavior of the agent in a particular mathematical universe is given by taking each program in ┌G(h)┐, using the mathematical universe to figure out what each of the programs outputs, and then using dec to convert that bitstring to an action.

Definition 1.3: Given y∈Γ and h∈H, we define Gy(h)∈A to be dec(a) for a∈ΣM given by ai:=y(┌G(h)┐i).
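Definition 1.3 amounts to a lookup followed by a decode. Here is a minimal sketch, assuming M=2, a binary alphabet, four actions, and a hypothetical naming scheme ("G", h, i) for the i-th program in ┌G(h)┐. None of these concrete choices come from the post.

```python
ACTIONS = ["left", "right", "up", "down"]  # illustrative action set

def dec(bits):
    """Decode a pair of output bits into one of four actions."""
    return ACTIONS[2 * bits[0] + bits[1]]

def quoted_G(h):
    """The M-tuple of programs computing the agent's action on history h
    (hypothetical naming scheme)."""
    return tuple(("G", h, i) for i in range(2))

def agent_action(y, h):
    """G_y(h): read off what universe y says each program outputs, then decode."""
    outputs = tuple(y[r] for r in quoted_G(h))
    return dec(outputs)

h = ()  # the empty history
y = {("G", h, 0): 1, ("G", h, 1): 0}  # one computational universe, restricted to G(h)
action = agent_action(y, h)
```

In this universe the two programs output 1 and 0, which `dec` turns into the action "up". A different universe y′ can assign different outputs to the same programs, which is how one agent source code yields different behavior across Γ.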

We can now extract counterfactuals from any Λ∈□elΓ. Specifically, given any policy π we define some Cπ⊆elΓ (the set of stuff consistent with behaving according to π, for a yet-to-be-defined notion of consistency) and define the counterfactual as Λ∩⊤Cπ. We could use the "naive" definition:

Definition 1.4: Cπnaive is the set of all (y,α)∈elΓ s.t. for any h∈H, Gy(h)=π(h).

Per the discussion above, it seems better to use a different definition. We will use the notation h⊏g to mean "h is a proper prefix of g" and h⊑g to mean "h is a prefix of g".

Definition 1.5: Given (y,α)∈elΓ, let Hyα be the set of all h∈H s.t. the following two conditions hold:

1. For all ga∈H×A and y′∈Γ where ga⊏h and y′∈α, Gy′(g)=a.

2. h⊏d for some d∈D s.t. dCy.

Cπfair is the set of all (y,α)∈elΓ s.t. for any h∈Hyα, Gy(h)=π(h).

Here, condition 1 says that we only "take responsibility" for the action on a particular history when the history was actually observed (all preceding evaluations of the agent are realized computations). Condition 2 says that we only "take responsibility" when the computer is working correctly.

At this stage, Definition 1.5 should be regarded as tentative, since we only have one result so far that validates this definition, namely that the set of (y,α) in Cπfair only depends on what the policy π does on possible inputs, instead of having the set depending on what the policy does on impossible inputs where one of the past memorized actions is not what the policy would actually do. We hope to rectify this in future articles.

Putting everything together: given a loss function L:elΓ→R, which depends on how the mathematical universe is, and which computations are realized or not-realized, the loss of policy π is given by just applying the bridge transform to coax the hypotheses into the appropriate form, intersecting with the set of possibilities consistent with the agent behaving according to a policy, and evaluating the expectation of the loss function, as detailed below.

Lpol(π,ζ):=(Br(ζ⋉Θ)∩Cπfair)(L)

Evaluating agents

So far we regarded the agent's source code G and the agent's policy π as independent variables, and explained how to evaluate different policies given fixed G. However, in reality the policy is determined by the source code. Therefore, it is desirable to have a way to evaluate different codes. We achieve this using an algorithmic information theory approach.

We also want to allow the loss function to depend on G. That is, we postulate L:(R^M)^H×elΓ→ℝ. Specifically, since R^M can be thought of as the space of actions, (R^M)^H is basically the space of policies. In section 3 we will see why in some detail, but for now think of the difference between "maximizing my own happiness" and "maximizing Alice's happiness": the first is defined relative to the agent (depends on G) whereas the second is absolute (doesn't depend on G). In order to define a selfish agent that just cares about its own observations, the loss function must make reference to the agent's own source code.

Definition 1.6: Denote G∗:H→A the policy actually implemented by G. Fix ξ∈Δ(AH). The physicalist intelligence of G relative to the baseline policy mixture ξ, prior ζ and loss function L is defined by:

g(G∣ξ;ζ,L):=−logPrπ∼ξ[Lpol(┌G┐,π,ζ)≤Lpol(┌G┐,G∗,ζ)]

Notice that Lpol depends on ┌G┐ in two ways: through the direct dependence of L on ┌G┐ and through Cπfair.

In particular, it makes sense to choose ζ and ξ as simplicity priors.
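For a finite baseline mixture, Definition 1.6 reduces to a weighted count. The sketch below reads the event as "a policy drawn from ξ has loss no greater than that of G∗", which is the direction under which agents with lower loss score higher, as the discussion of near-optimal policies below requires. The base-2 logarithm, the policy names and the loss values are illustrative assumptions; the post does not fix a base.

```python
import math

def physicalist_intelligence(xi, losses, agent_loss):
    """-log2 of the xi-probability that a baseline policy does at least as well
    (has loss <= agent_loss) as the agent's own policy.

    xi: dict policy -> probability (finite baseline mixture).
    losses: dict policy -> L_pol value for that policy.
    """
    p = sum(w for pi, w in xi.items() if losses[pi] <= agent_loss)
    # p == 0 can only happen if the agent's own policy has no weight under xi.
    return math.inf if p == 0 else -math.log2(p)

xi = {"pi0": 0.5, "pi1": 0.25, "pi2": 0.25}
losses = {"pi0": 0.9, "pi1": 0.4, "pi2": 0.1}

g_best = physicalist_intelligence(xi, losses, agent_loss=0.1)
g_mid = physicalist_intelligence(xi, losses, agent_loss=0.4)
```

An agent matching the best baseline policy gets 2 bits here, since only a quarter of the baseline mass does as well. An agent matching the middle policy gets 1 bit.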

There is no obvious way to define "physicalist AIXI", since we cannot have g=∞. For one thing, g is not even defined for uncomputable agents. In principle we could define it for non-uniform agents, but then we get a fixpoint problem that doesn't obviously have a solution: finding a non-uniform agent G s.t. Lpol(┌G┐,G∗,ζ)=minπLpol(┌G┐,π,ζ). On the other hand, once we spell out the infinitary (N=∞) version of the formalism, it should be possible to prove the existence of agents with arbitrarily high finite g. That's because our agent can use quining to access its own source code ┌G┐, and then brute force search a policy π∗ϵ with Lpol(┌G┐,π∗ϵ,ζ)<infπ:H→ALpol(┌G┐,π,ζ)+ϵ.

2. Properties of the bridge transform

In this section we will not need to assume Γ is of the form ΣR: unless stated otherwise, it will be any finite set.

Sanity test

Proposition 2.1: For any Γ, Φ and Θ∈□c(Γ×Φ), Br(Θ) exists and satisfies prΓ×ΦBr(Θ)=Θ. In particular, if Θ∈□(Γ×Φ) then Br(Θ)∈□(elΓ×Φ).

Downwards closure

Roughly speaking, the bridge transform tells us which computations are physically realized. But actually it only bounds it from one side: some computations are definitely realized but any computation might be realized. One explanation for why it must be so is: if you looked at the world in more detail, you might realize that there are small-scale, previously invisible, features of the world which depend on novel computations. There is a direct tension between bounding both sides (i.e., being able to say definitively that a computation isn't instantiated) and having the desirable property that learning more about the small-scale structure of the universe narrows down the uncertainty. To formalize this, we require the following definitions:

Definition 2.1: Let X be a partially ordered set (poset). Then the induced partial order on ΔcX is defined as follows. Given θ,η∈ΔcX, θ⪯η if and only if for any monotonically non-decreasing function f:X→ℝ+, Eθ[f]≤Eη[f].

This is also called the stochastic order (which is standard mathematical terminology). Intuitively, θ⪯η means that η has its measure further up in the poset than θ does. To make that intuition formal, we can also characterize the induced order as follows:

Proposition 2.2: Let X be a poset, θ,η∈ΔcX. Then, θ⪯η if and only if there exists κ:X→ΔX s.t.:

• For all x∈X, y∈suppκ(x): x≤y.

• κ∗θ≤η

Proposition 2.3: Let X be a poset, θ,η∈ΔcX. Then, θ⪯η if and only if there exists κ:X→ΔX s.t.:

• For all x∈X, y∈suppκ(x): x≥y.

• θ≤κ∗η

Or, in words, you can always go from θ to η by moving probability mass upwards from where it was, since κ(x) is always supported on the set of points at-or-above x.
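On a finite poset the induced order can be tested directly. The sketch below relies on a standard fact that is not stated in the post: every non-negative non-decreasing function on a finite poset is a non-negative combination of indicator functions of upper sets, so it suffices to compare θ and η on every upper set. The function names and the example poset (subsets of {0,1} under inclusion) are illustrative assumptions.

```python
from itertools import combinations

def upper_sets(points, leq):
    """All subsets U of the poset such that x in U and x <= y imply y in U."""
    result = []
    for r in range(len(points) + 1):
        for subset in combinations(points, r):
            u = set(subset)
            if all(y in u for x in u for y in points if leq(x, y)):
                result.append(u)
    return result

def stochastically_below(theta, eta, points, leq, tol=1e-12):
    """theta precedes eta in the induced order: theta(U) <= eta(U) for every upper set U."""
    return all(
        sum(theta.get(x, 0.0) for x in u) <= sum(eta.get(x, 0.0) for x in u) + tol
        for u in upper_sets(points, leq)
    )

POINTS = [frozenset(), frozenset({0}), frozenset({1}), frozenset({0, 1})]

def leq(a, b):
    return a <= b  # subset inclusion

low = {frozenset({0}): 1.0}
high = {frozenset({0, 1}): 1.0}
other = {frozenset({1}): 1.0}
```

Here `low` sits below `high`, because all of its mass can be moved upwards from {0} to {0,1}, while `low` and `other` are incomparable. The enumeration is exponential in the size of the poset, so this is only meant for small examples.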

We can now state the formalization of only bounding one side of the bridge transform. Let elΓ×Φ be equipped with the following order. (y,α,x)≤(y′,α′,x′) if and only if y=y′, x=x′ and α⊆α′. Then:

Proposition 2.4: For any Γ, Φ and Θ∈□c(Γ×Φ), Br(Θ) is downwards closed w.r.t. the induced order on Δc(elΓ×Φ). That is, if θ∈Br(Θ) and η⪯θ then η∈Br(Θ).

Simple special case

Let's consider the special case where there's only one program, which can produce two possible outputs, a 0 and a 1. And these two outputs map to two different distributions over physics outcomes in Φ. Intuitively, if the computation isn't realized/instantiated, the distribution over physics outcome should be identical, while if the computation is realized/instantiated, it should be possible to look at the physics results to figure out how the computation behaves. The two probability distributions may overlap some intermediate amount, in which case it should be possible to write the two probability distributions as a mixture between a probability distribution that behaves identically regardless of the program output (the "overlap" of the two distributions), and a pair of probability distributions corresponding to the two different program outputs which are disjoint. And the total variation distance (dTV(ϕ(0),ϕ(1))) between the two probability distributions is connected to the size of the distribution overlap. Proposition 2.5 makes this formal.

Proposition 2.5: Consider a finite set X, ϕ?∈ΔX, ϕ!:{0,1}→ΔX, p∈[0,1] and ϕ:=(1−p)ϕ?+pϕ!. Then, p≥dTV(ϕ(0),ϕ(1)). Conversely, consider any ϕ:{0,1}→ΔX. Then, there exist ϕ?∈ΔX and ϕ!:{0,1}→ΔX s.t. ϕ=(1−p)ϕ?+pϕ! for p:=dTV(ϕ(0),ϕ(1)).
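The converse direction of Proposition 2.5 has an explicit construction: take the pointwise minimum of the two distributions as the shared part and the leftovers as the output-dependent parts. The following sketch implements that construction for distributions stored as dictionaries. The function names and the example distributions are illustrative assumptions.

```python
def tv_distance(p0, p1):
    """Total variation distance between two finite distributions."""
    keys = set(p0) | set(p1)
    return 0.5 * sum(abs(p0.get(x, 0.0) - p1.get(x, 0.0)) for x in keys)

def overlap_decomposition(p0, p1):
    """Return (p, shared, only0, only1) with p_i = (1-p)*shared + p*only_i.

    shared is None when p == 1 and only_i is None when p == 0, since the
    corresponding term then has weight zero and any distribution would do.
    """
    keys = sorted(set(p0) | set(p1))
    p = tv_distance(p0, p1)
    common = {x: min(p0.get(x, 0.0), p1.get(x, 0.0)) for x in keys}
    shared = {x: m / (1 - p) for x, m in common.items()} if p < 1 else None

    def excess(pi):
        if p == 0:
            return None
        return {x: (pi.get(x, 0.0) - common[x]) / p for x in keys}

    return p, shared, excess(p0), excess(p1)

phi0 = {"x": 0.5, "y": 0.5}
phi1 = {"y": 0.5, "z": 0.5}
p, shared, only0, only1 = overlap_decomposition(phi0, phi1)
```

In this example the two distributions share half their mass (all of it on "y"), so p equals the total variation distance 0.5, and the leftover parts are point masses on "x" and "z", which are disjoint.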

The bridge transform should replicate this same sort of analysis. We can interpret the case of "total uncertainty over math, but knowing how physics turns out conditional on knowing how math turns out" by ⊤Γ⋉ϕ for some ϕ:Γ→ΔΦ. Taking the special case where Γ={0,1} for one program that can output two possible answers, it should return the same sort of result, where the two probability distributions can be decomposed into their common overlap and non-overlapping pieces. The overlapping chunk of probability measure should be allocated to the event "the computation is not instantiated", while the disjoint ones should be allocated to the event "the computation is instantiated". As it turns out, this does indeed happen, with the same total-variation-distance-based upper bound on the probability that the computation is unrealized (namely, 1−dTV(ϕ(0),ϕ(1))).

Proposition 2.6: Consider any Φ and ϕ:{0,1}→ΔΦ. Denote U:={0,1}×{{0,1}}×Φ (the event "program is unrealized"). Let Λ:=Br(⊤{0,1}⋉ϕ). Then,

Λ(χU)=1−dTV(ϕ(0),ϕ(1))

Bound in terms of support

If the physical state is x∈Φ, then it is intuitively obvious that the physically manifest facts about the computational universe y∈Γ must include the fact that (y,x)∈suppΘ. Formally:

Proposition 2.7: If X is a poset and Θ∈□cX, then Θ↓ will denote downward closure of Θ. For any Γ, Φ and Θ∈□c(Γ×Φ) if (y,α,x)∈suppBr(Θ) and y′∈α, then (y′,x)∈suppΘ. Moreover, define susΘ:Φ→2Γ by susΘ(x):={y∈Γ∣(y,x)∈suppΘ}. Then, Br(Θ)⊆(Θ⋉susΘ)↓ (we slightly abuse notation by treating susΘ as a mapping Γ×Φ→2Γ that doesn't depend on the first argument, and also playing loose with the order of factors in the set on which our HUCs live).

Putting that into words, all the (α,x) that are in the support of the bridge transform have α being a subset of the support of Θ restricted to x. The bridge transform knows not to provide α sets which are too large. Further, the bridge transform is smaller/has more information than the transform that just knows about what the support of Θ is.

Idempotence

Imagine replacing the physical state space Φ by the space of manifest facts 2Γ. Intuitively, the latter should contain the same information about computations as the former. Therefore, if we apply the bridge transform again after making such a replacement, we should get the same thing. Given any set X, we denote idX:X→X the identity mapping and diagX:X→X×X the diagonal mapping.

Proposition 2.8: For any Γ, Φ and Θ∈□c(Γ×Φ)

Br(prΓ×2ΓBr(Θ))=[(idΓ×diag2Γ)∗prΓ×2ΓBr(Θ)]↓

Refinement

If we refine a hypothesis (i.e. make it more informative) in the ultradistribution sense, the bridge transform should also be more refined:

Proposition 2.9: For any Γ, Φ and Θ1,Θ2∈□c(Γ×Φ), if Θ1⊆Θ2 then Br(Θ1)⊆Br(Θ2).

Another way of refining a hypothesis is refining the ontology, i.e. moving to a richer state space. For example, we can imagine one hypothesis that describes the world in terms of macroscopic objects and another hypothesis that describes the world in terms of atoms, which are otherwise perfectly consistent with each other. In this case, we also expect the bridge transform to become more refined. This desideratum is where the downwards closure comes from, because it's always possible to pass to a more detailed view of the world and new computations could manifest then. It's also worth noting that, if you reduce your uncertainty about math, or physics, or move to a more detailed state space, that means more computations are instantiated, and the loss goes down as a result.

Proposition 2.10: For any Γ, Φ1, Φ2, t:Φ2→Φ1 and Θ∈□c(Γ×Φ2)

(idelΓ×t)∗Br(Θ)⊆Br((idΓ×t)∗Θ)

In general, when we refine the ontology, the manifest facts can become strictly refined: the richer state "knows more" about computations than the poorer state. However, there is a special case when we don't expect this to happen, namely when the rich state depends on the computational universe only through the poor state. Roughly speaking, in this case once we have the poor state we can sample the rich state without evaluating any more programs. To formalize this, it is convenient to introduce the pullback of HUCs, which is effectively the probabilistic version of taking a preimage.

Definition 2.2: Let X,Y be finite sets and t:X→Y a mapping. We define the pullback[7] operator t∗:□cY→□cX by

t∗Θ:={θ∈ΔcX∣t∗θ∈Θ}
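Membership in a pullback is a one-line test once membership in Θ is available: push the candidate forward along t and ask whether the result lies in Θ. The sketch below assumes Θ is supplied as a membership predicate, which is an illustrative representation rather than anything from the post. The coarse-graining map and the state names are likewise hypothetical.

```python
def pullback_contains(theta, t, in_Theta):
    """theta lies in the pullback of Theta along t iff the pushforward of
    theta along t lies in Theta.

    theta: dict x -> mass on X; t: function X -> Y;
    in_Theta: predicate on dicts y -> mass.
    """
    pushed = {}
    for x, mass in theta.items():
        pushed[t(x)] = pushed.get(t(x), 0.0) + mass
    return in_Theta(pushed)

def t(x):
    # Hypothetical coarse-graining from microstates to macrostates.
    return {"atoms_a": "solid", "atoms_b": "solid", "atoms_c": "gas"}[x]

def in_top_solid(theta_y):
    # Theta = all contributions supported on {"solid"}.
    supported = all(y == "solid" for y, m in theta_y.items() if m > 0)
    return supported and sum(theta_y.values()) <= 1 + 1e-12
```

A contribution that splits its mass between the two "solid" microstates is in the pullback, whatever the split, while any contribution that puts mass on "atoms_c" is not. This is the sense in which the pullback behaves like a preimage.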

Proposition 2.11: Consider any Γ, Φ1, Φ2, t:Φ2→Φ1, Ξ:Φ1→□cΦ2 and Θ∈□c(Γ×Φ1) s.t. tΞ=idΦ1. Then,

(idelΓ×Ξ)∗Br(Θ)⊆Br((idΓ×Ξ)∗Θ)⊆(idelΓ×t)∗Br(Θ)

In particular,

prelΓBr(Θ)=prelΓBr((idΓ×Ξ)∗Θ)

Mixing hypotheses

In general, we cannot expect the bridge transform to commute with taking probabilistic mixtures. For example, let Γ={0,1}, Φ={0,1}, Θ1=⊤{00,11}, Θ2=⊤{01,10}. Clearly either for Θ1 or for Θ2, we expect the program to be realized (since the physical state encodes its output). On the other hand, within ½Θ1+½Θ2 we have both the distribution "50% on 00, 50% on 01" and the distribution "50% on 10, 50% on 11". The physical state no longer necessarily "knows" the output, since the mixing "washes away" information. However, there's no reason why mixing would create information, so we can expect the mixture of bridge transforms to be a refinement of the bridge transform of the mixture.

Proposition 2.12: For any Γ, Φ, Θ1,Θ2∈□c(Γ×Φ) and p∈[0,1]

pBr(Θ1)+(1−p)Br(Θ2)⊆Br(pΘ1+(1−p)Θ2)

Now suppose that we are mixing hypotheses whose state spaces are disjoint. In this case, there can be no washing away since the state "remembers" which hypothesis it belongs to.

Proposition 2.13: Consider some Γ, Φ1, Φ2, Θ1∈□c(Γ×Φ1), Θ2∈□c(Γ×Φ2) and p∈[0,1]. Regard Φ1,2 as subsets of Φ1⊔Φ2, so that pΘ1+(1−p)Θ2∈□c(Γ×(Φ1⊔Φ2)). Then,

pBr(Θ1)+(1−p)Br(Θ2)=Br(pΘ1+(1−p)Θ2)

In particular, if we consider the hypotheses within a prior as having disjoint state spaces because of using different ontologies, then the total bridge transform (Definition 1.2) is just the ordinary bridge transform, justifying that definition.

Conjunction

Given a hypothesis and a subset of Γ, we can form a new hypothesis by performing conjunction with the subset. Intuitively, the same conjunction should apply to the manifest facts.

Proposition 2.14: Consider some Γ, Φ, Θ∈□c(Γ×Φ) and F⊆Γ. Let ∩F:2Γ→2Γ be defined by ∩F(α):=F∩α. Then,

BrΓ(Θ∩⊤F×Φ)=(idΓ×∩F×idΦ)∗(BrΓ(Θ)∩⊤F×2Γ×Φ)

Moreover, let ι:F→Γ and elι:elF→elΓ be the natural injections. Then,

BrF((ι×idΦ)∗(Θ∩⊤F×Φ))=(elι×idΦ)∗BrΓ(Θ∩⊤F×Φ)

That first identity says that it doesn't matter whether you update on a fact about math first and apply the bridge transform later, or whether you do the bridge transform first and then "update" (actually it's slightly more complicated than that since you have to narrow down the sets α) the hypothesis afterwards. The second identity shows that, post conjunction with F, taking the bridge transform with respect to F is essentially the same as taking it with respect to Γ. If a batch of math universes is certain not to occur, then dropping them entirely from the domain Γ produces the same result.

As a corollary, we can eliminate dependent variables. That is, suppose some programs can be expressed as functions of other programs (using Θ). Intuitively, such programs can be ignored. Throwing in the dependent variables and doing the bridge transform on that augmented space of "math universes" produces the same results as trying to compute the dependent variables after the bridge transform is applied. So, if you know how to generate the dependent variables, you can just snip them off your hypothesis, and compute them as-needed at the end. Formally:

Proposition 2.15: Consider some Γ0, Γ1, r:Γ0→Γ1, Φ, Θ∈□c(Γ0×Φ). Let ι:Γ0→Γ0×Γ1 be given by ι(y):=(y,r(y)). Then,

Br((ι×idΦ)∗Θ)=(elι×idΦ)∗Br(Θ)

Factoring Γ

When Γ is of the form Γ0×Γ1, we can interpret the same Θ∈□(Γ×Φ) in two ways: treat Γ1 as part of the computational universe, or treat it as part of the physical state. We can also "marginalize" over it altogether staying with a HUD on Γ0×Φ. The bridge transforms of these objects satisfy a simple relationship:

Proposition 2.16: Consider some Γ0, Γ1, Φ and Θ∈□c(Γ0×Γ1×Φ). Define π:elΓ0×Γ1→elΓ0 by π(y,z,α):=(y,{y′∈Γ0∣(y′,z)∈α}). Then,

(π×idΦ)∗BrΓ0×Γ1(Θ)⊆BrΓ0(Θ)⊆BrΓ0(prΓ0×ΦΘ)

Intuitively, the least informative result is given by completely ignoring the Γ1 part of math about how a particular batch of computations turns out. A more informative result is given by treating Γ1 as an aspect of physics. And the most informative result is given by treating Γ1 as an aspect of math, and then pruning the resulting α:2Γ0×Γ1 down to be over 2Γ0 by plugging in the particular Γ1 value.

Continuity

Proposition 2.17: Br(Θ) is a continuous function of Θ.

Lamentably, this only holds when all the involved sets are finite. However, a generalization of continuity, Scott-continuity, still appears to hold in the infinite case. Intuitively, the reason is that if Φ is a continuous space, like the interval [0,1], you could have a convergent sequence of hypotheses Θ where the events "program outputs 0" and "program outputs 1" are believed to result in increasingly similar outputs in physics, and then in the limit, where they produce the exact same physics outcome, the two possible outputs of the program suddenly go from "clearly distinguishable by looking at physics" to "indistinguishable by looking at physics" and there's a sharp change in whether the program is instantiated. This is still a Scott-continuous operation, though.

3. Constructing loss functions

In section 1 we allowed the loss function L:elΓ→R to be arbitrary. However, in light of Proposition 2.4, we might as well require it to be monotonically non-decreasing. Indeed, we have the following:

Proposition 3.1: Let X be a finite poset, f:X→R and Θ∈□cX downward closed. Define fmax:X→R by fmax(x):=maxy≤xf(y). Observe that fmax is always non-decreasing. Then, Θ(f)=Θ(fmax).

This essentially means that downwards-closed HUC's are entirely pinned down by their expectation value on monotone functions, and all their other expectation values are determined by the expectation value of the "nearest monotone function", the fmax.
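The "nearest monotone function" fmax from Proposition 3.1 is simple to compute on a finite poset. Here is a minimal sketch. The poset (subsets of {0,1} under inclusion) and the values of f are illustrative assumptions.

```python
def f_max(points, leq, f):
    """fmax(x) = max of f(y) over all y <= x in the poset."""
    return {x: max(f[y] for y in points if leq(y, x)) for x in points}

EMPTY, ZERO, ONE, BOTH = frozenset(), frozenset({0}), frozenset({1}), frozenset({0, 1})
POINTS = [EMPTY, ZERO, ONE, BOTH]

def leq(a, b):
    return a <= b  # subset inclusion

f = {EMPTY: 1, ZERO: 0, ONE: 2, BOTH: 1}  # not monotone: f drops from EMPTY to ZERO
fm = f_max(POINTS, leq, f)
```

The result raises f wherever a smaller point already had a larger value, so `fm` is non-decreasing and agrees with f wherever f was already the maximum over everything below.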

This requirement on L (we will call it the "monotonicity principle") has strange consequences that we will return to below.

The informal discussion in section 1 hinted at how the computationalist stance allows describing preferences via a loss function of the type elΓ→R. We will now proceed to make it more formal.

Selfish agents

A selfish agent naturally comes with a cartesian loss function L:D→R.

Definition 3.1: For any (y,α)∈elΓ, the set of experienced histories Xyα is the set of all ha∈Hyα×A (see Definition 1.5) s.t. for any y′∈α, Gy′(h)=a. The physicalized loss function is

Lphys(y,α):=minha∈Xyαmaxd∈D:ha⊑dL(d)

Notice that Lphys implicitly depends on ┌G┐ through Xyα.

Intuitively, if α is a large set, there's different mathematical universes that result in the agent's policy being different, but which are empirically indistinguishable (ie, they're in the same set α), so the agent's policy isn't actually being computed on those histories where the contents of α disagree on what the agent does. The agent takes the loss of the best history that occurs, but if there's uncertainty over what happens afterwards/later computations aren't instantiated, it assumes the worst-case extension of that history.

This definition has the following desirable properties:

• If g∈Xyα, ha∈H×A and ha⊑g, then ha∈Xyα.

• Lphys is monotone. If the set of indistinguishable mathematical universes gets larger, fewer histories are instantiated, so the loss is larger because of the initial minimization.

• Fix (y,α)∈elΓ. Suppose there is d∈D s.t. Xyα={ha∈H×A∣ha⊑d}. Then Lphys(y,α)=L(d). If the set of indistinguishable mathematical universes all agree on what the destiny of the agent is and disagree elsewhere, then the loss is just the loss of that destiny.
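The min-max structure of Lphys can be seen in a toy case. The sketch below takes the set of experienced histories as given (computing it from (y,α) would require the agent's source code and the relation C) and assumes it is non-empty. Histories and destinies are flat tuples alternating actions and observations, with N=2, binary actions and observations, and a hypothetical cartesian loss equal to the sum of the entries.

```python
from itertools import product

def is_prefix(ha, d):
    """True if the action-terminated history ha is a prefix of destiny d."""
    return d[:len(ha)] == ha

def physicalized_loss(experienced, destinies, loss):
    """Best experienced history, each scored by its worst-case extension."""
    return min(
        max(loss[d] for d in destinies if is_prefix(ha, d))
        for ha in experienced
    )

# Destinies for N = 2: (first action, observation, second action).
DESTINIES = list(product([0, 1], repeat=3))
LOSS = {d: sum(d) for d in DESTINIES}  # hypothetical cartesian loss

# Only the first action is realized: the continuation is worst-cased.
cut_short = physicalized_loss({(0,)}, DESTINIES, LOSS)

# The whole destiny (0, 0, 1) and its prefix are realized.
full_life = physicalized_loss({(0,), (0, 0, 1)}, DESTINIES, LOSS)
```

When only the first action is realized, the loss is that of the worst destiny starting with that action. When a full destiny is realized together with its prefixes, the loss is that destiny's own loss, matching the third property above.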

We can think of inconsistent histories within Xyα as associated with different copies of the agent. If ha∈Xyα is s.t. for any g⊐ha, g∉Xyα, then h represents the history of a copy which reached the end of its life. If there is only one copy and its lifespan is maximal, then Lphys produces the cartesian loss of this copy. If the lifespan is less than maximal, Lphys produces the loss of the worst-case continuation of the history which was experienced. In other words, death (i.e., the future behavior of the agent's policy not being a computation that has any effect on the state of the universe) is always the worst-case outcome. If there are multiple copies, the best-off copy determines the loss.

Definition 3.1 is somewhat arbitrary, but the monotonicity principle is not. Hence, for a selfish agent (i.e. if L only depends on y,α through Xyα) death must be the worst-case outcome (for fixed y). Note that this weighs against getting the agent to shut down if it's selfish. And creating additional copies must always be beneficial. For non-selfish agents, similar observations hold if we replace "death" by "destruction of everything that matters" and "copies of the agent" by "copies of the (important sector of the) world".

This is an odd state of affairs. Intuitively, it seems coherent to have preferences that contradict the monotonicity principle. Arguably, humans have such preferences, since we can imagine worlds that we would describe as net negative. At present, we're not sure how to think about it, but here are some possibilities, roughly in order of plausibility:

1. Perfect physicalists must obey the monotonicity principle, but humans are not perfect physicalists. Indeed, why would they be? The ancestral environment didn't require us to be physicalists, and it took us many centuries to stop thinking of human minds as "metaphysically privileged". This might mean that we need a theory of "cartesian agents in a physicalist universe". For example, perhaps from a cartesian perspective, a physicalist hypothesis needs to be augmented by some "theory of epiphenomena" which predicts e.g. which of several copies "I" will experience.

2. Something is simply wrong about the formalism. One cheap way to escape monotonicity is by taking the hull of maximal points of the bridge transform. However, this spoils compatibility with refinements (Propositions 2.9 and 2.10). We expect this to spoil the ability to learn in some way.

3. Death is inherently an irreversible event. Every death is the agent's first death, therefore the agent must always have uncertainty about the outcome. Given a non-dogmatic prior, there will be some probability on outcomes that are not actually death. These potential "afterlives" will cause the agent to rank death as better than the absolute worst-case. So, "non-survivalist" preferences can be reinterpreted as a prior biased towards not believing in death. Similar reasoning applies to non-selfish agents.

4. We should just bite the bullet and prescriptively endorse the odd consequences of the monotonicity principle.

In any case, we believe that continuing the investigation of monotonic physicalist agents is a promising path forward. If a natural extension or modification of the theory exists which admits non-monotonic agents, we might discover it along the way. If we don't, at least we will have a better lower bound for the desiderata such a theory should satisfy.

Unobservable states

What if the loss depends on an unobservable state from some state space S? In order to encode states, let's assume some MS∈N and decS:ΣMS→S. I.e., just as the action of the agent can be thought of as a tuple of programs returning the bits of the computation of the agent's action, a state can be thought of as a tuple of programs returning the bits of the computation of what the next state is. Let B be a finite set ("nature's actions"). Just as observations, in the selfish case, occur based on what computations are instantiated, nature's choice of action B depends on what computations are instantiated. Denote HS:=(A×B×O×S)<N, DS:=(A×B×O×S)N−1×A. Let ┌T┐:S×A×B→RMS be the "ontological"[8] transition rule program, E:S→O the observation rule and L:DS→R the loss function.

Definition 3.2: For any (y,α)∈elΓ, the set of admissible histories Zyα is the set of all ha∈HS×A s.t. the following conditions hold:

1. prH×Aha∈Xyα

2. If g∈HS×A×B, o∈O and s∈S are s.t. gos⊑h then o=E(s).

3. If g∈HS×A×B×O, s1,s2∈S, a∈A, b∈B and o∈O are s.t. gs1abos2⊑h and y′∈α then Ty′(s1,a,b)=s2 (the notation Ty is analogous to that of Definition 1.3).

Pretty much, condition 1 says that the visible parts of the history are consistent with how math behaves in the same sense as we usually enforce with Hyα. Condition 2 says that the history should have the observations arising from the states in the way that E demands. And condition 3 says that the history should have the states arising from your action and nature's action and the previous state in the way the transition program dictates (in that batch of mathematical universes).
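Conditions 2 and 3 can be illustrated with a minimal Python sketch. It makes several simplifying assumptions that are not part of the formalism: states, actions and observations are small integers, the transition rule is a single fixed function rather than a program whose output varies across the universes y′∈α, and condition 1 is not checked at all. The names `toy_T` and `toy_E` are hypothetical stand-ins for T and E.

```python
from typing import Callable, List, Tuple

# One step of a history: (agent action a, nature action b, observation o, state s).
Step = Tuple[int, int, int, int]


def obeys_observation_rule(history: List[Step], E: Callable[[int], int]) -> bool:
    """Condition 2: each observation is the one E assigns to the state of the same step."""
    return all(o == E(s) for (_a, _b, o, s) in history)


def obeys_transition_rule(
    history: List[Step], T: Callable[[int, int, int], int]
) -> bool:
    """Condition 3: each state follows from the previous state and the two actions."""
    return all(
        T(s1, a, b) == s2
        for (_, _, _, s1), (a, b, _, s2) in zip(history, history[1:])
    )


# Hypothetical toy rules on the state space {0, 1, 2, 3}.
def toy_T(s: int, a: int, b: int) -> int:
    return (s + a + b) % 4


def toy_E(s: int) -> int:
    return s % 2


good = [(0, 0, 1, 1), (1, 1, 1, 3), (0, 1, 0, 0)]
bad_observation = [(0, 0, 0, 1)]              # state 1 should be observed as 1
bad_transition = [(0, 0, 1, 1), (1, 1, 0, 2)]  # toy_T(1, 1, 1) is 3, not 2
```

In this toy, `good` passes both checks, while each of the two bad histories violates exactly one of them.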

The physicalized loss function is

$$L^{\text{s-phys}}(y,\alpha) := \min_{ha \in Z_{y\alpha}} \; \max_{g \in D_S :\, ha \sqsubseteq g} L(g)$$

This arises in a nearly identical way to the loss function from Definition 3.1, it's precisely analogous.

That is, for a history to be admissible we require not only the agent to be realized on all relevant inputs but also the transition rule.
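The min-max structure of the physicalized loss can be sketched in Python as follows. This is only an illustration under simplifying assumptions: histories are plain tuples of symbols without the B and S components, the set of admissible histories is given by hand rather than derived from (y,α), and the names `physicalized_loss`, `LOSS` and `DESTINIES` are hypothetical.

```python
from typing import Callable, Iterable, Sequence, Tuple

History = Tuple[str, ...]


def is_prefix(prefix: History, full: History) -> bool:
    """True when `prefix` is an initial segment of `full`."""
    return full[: len(prefix)] == prefix


def physicalized_loss(
    admissible: Iterable[History],
    destinies: Sequence[History],
    loss: Callable[[History], float],
) -> float:
    """Minimum over admissible histories of the worst-case loss of their completions."""
    worst_cases = []
    for ha in admissible:
        completions = [g for g in destinies if is_prefix(ha, g)]
        if completions:
            worst_cases.append(max(loss(g) for g in completions))
    if not worst_cases:
        raise ValueError("no admissible history has a completion")
    return min(worst_cases)


# Hypothetical toy: destinies are (action, observation, action) triples.
LOSS = {
    ("a", "0", "a"): 0.2,
    ("a", "1", "a"): 0.9,
    ("a", "0", "b"): 0.5,
    ("a", "1", "b"): 0.4,
}
DESTINIES = list(LOSS)

few_realized = [("a",)]
more_realized = [("a",), ("a", "0", "a")]
```

With only the shortest history admissible, the loss is the worst case over all destinies. Making one more history admissible can only lower the minimum, which is the monotonicity principle in miniature.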

We can also define a version with ┌T┐:S×B→RMS in which the agent is not an explicit part of the ontology. In this case, the agent's influence is solely through the "entanglement" of G with T. I.e., the agent's beliefs over math have entanglements between how the agent's policy turns out and how the physics transition program behaves, s.t. when conditioning on different policies, the computations corresponding to different histories become instantiated.

Cellular automata

What if our agent maximizes the number of gliders in the game of life? More generally, consider a cellular automaton with cell state space S and dimension d∈N. For any t∈N, let Nt⊆Zd be the set of cells whose states are necessary to compute the state of the origin cell at timestep t. Let ┌Tt┐:SNt→RMS be the program which computes the time evolution of the cellular automaton. I.e., it's mapping the state of a chunk of space to the computation which determines the value of a cell t steps in the future. Denote DdS:=SN×Zd. This is the space of all possible histories, specifying everything that occurs from the start of time. Let L:DdS→R be the loss function (the N coordinate is time).

Definition 3.3: For any α∈2Γ, the set of admissible histories Zα is the set of all pairs (C,h) where C⊂N×Zd is finite and h:C→S s.t. the following conditions hold:

1. If (t2,c)∈C, t1<t2 and v∈Nt2−t1 then (t1,c+v)∈C.

2. If (t2,c)∈C, t1<t2 and y∈α then Tyt2−t1(λv.h(t1,c+v))=h(t2,c).

Basically, the C is a chunk of spacetime, and condition 1 says that said chunk must be closed under taking the past lightcone, and condition 2 says the state of that spacetime chunk must conform with the behavior of the transition computation.
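As a concrete illustration, here is a minimal Python sketch of the two conditions for a one-dimensional elementary cellular automaton with binary states and a radius-1 neighborhood, so that Nt is the set of offsets from −t to t. It assumes a single known rule (given by its Wolfram rule number) in place of the set α of mathematical universes, and all function names are hypothetical.

```python
from typing import Dict, Sequence, Set, Tuple

Cell = Tuple[int, int]  # (time t, position c)


def evolve(window: Sequence[int], rule: int) -> int:
    """Apply an elementary CA rule to a window of 2t+1 cells until one cell remains."""
    cells = list(window)
    while len(cells) > 1:
        cells = [
            (rule >> (4 * left + 2 * mid + right)) & 1
            for left, mid, right in zip(cells, cells[1:], cells[2:])
        ]
    return cells[0]


def offsets(t: int) -> range:
    """The neighborhood N_t of a radius-1 automaton in one dimension."""
    return range(-t, t + 1)


def is_past_closed(C: Set[Cell]) -> bool:
    """Condition 1: C contains the past lightcone of each of its cells."""
    return all(
        (t1, c + v) in C
        for (t2, c) in C
        for t1 in range(t2)
        for v in offsets(t2 - t1)
    )


def is_consistent(C: Set[Cell], h: Dict[Cell, int], rule: int) -> bool:
    """Condition 2: each cell state agrees with the rule applied to its past."""
    return all(
        evolve([h[(t1, c + v)] for v in offsets(t2 - t1)], rule) == h[(t2, c)]
        for (t2, c) in C
        for t1 in range(t2)
    )


def is_admissible(C: Set[Cell], h: Dict[Cell, int], rule: int) -> bool:
    return is_past_closed(C) and is_consistent(C, h, rule)


# Rule 90 sets each cell to the XOR of its two neighbors.
h = {(0, -1): 1, (0, 0): 0, (0, 1): 0, (1, 0): 1}
C = set(h)
```

Here `(C, h)` is admissible under rule 90. Flipping the state at `(1, 0)` breaks condition 2, and dropping the cell `(0, 1)` from C breaks condition 1.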

The physicalized loss function is defined very much as it used to be, except that for what used to be "past histories" now it has states of past-closed chunks of spacetime.

$$L^{\text{c-phys}}(y,\alpha) := \min_{(C,h) \in Z_{\alpha}} \; \max_{g \in D^d_S :\, h = g|_C} L(g)$$

Notice that, since the dynamics of the cellular automaton is fixed, the only variables "nature" can control are the initial conditions. However, it is straightforward to generalize this definition to cellular automata whose time evolution is underspecified (similarly to the B we had before).

Diamond maximizer

What if our agent maximizes the amount of diamond (or some other physical property)? In order to define "diamond" we need a theory of physics. For example, we can start with macroscopic/classical physics and define diamond by its physical and/or chemical properties. Or, we can start with nonrelativistic quantum mechanics and define diamond in terms of its molecular structure. These definitions might be equivalent in our universe, but different in other hypothetical universes (this is similar to twin Earth). In any case, there is no single "right" choice, it is simply a matter of definition.

A theory of physics is mathematically quite similar to a cellular automaton. This theory will usually be incomplete, something that we can represent in infra-Bayesianism by Knightian uncertainty. So, the "cellular automaton" has underspecified time evolution. Using this parallel, the same method as in the previous subsection should allow physicalizing the loss function. Roughly speaking, a possible history is "real" when the computation simulating the time evolution of the theory of physics with the corresponding initial condition is realized. However, we are not sure what to do if the theory of physics is stochastic. We leave it as an open problem for now.

To summarize, you can just take a hypothesis about how computations are entangled with some theory of physics, some computations simulating the time evolution of the universe (in which the notion of "diamond" is actually well-defined letting the amount of diamond in a history be computed), and a utility function over these states. The bridge transform extracts which computations are relevant for the behavior of physics in the hypothesis (which may or may not include the diamond-relevant computations), and then you just check whether the computations for a particular diamond-containing history are instantiated, take the best-case one, and that gives you your loss.

Human preferences seem not substantially different from maximizing diamonds. Most of the things we value are defined in terms of people and social interactions. So, we can imagine a formal loss function defined in terms of some "theory of people" implicit in our brain.

4. Translating Cartesian to Physicalist

Physicalism allows us to do away with the cartesian boundary between the agent and its environment. But what if there is a cartesian boundary? An important sanity test is demonstrating that our theory can account for purely cartesian hypotheses as a special case.

Ordinary laws

We will say "law" to mean what was previously called "belief function". We start by examining a causal law W:AH→□D. I.e., it maps a policy to an ultradistribution over destinies, in a way that looks like a closed convex set of environments. Let R:=H and Σ=A so that Γ becomes the set of policies (we identify the history h∈H with the computation ┌G(h)┐∈R given by what the agent's source does on input h). In particular, for now there is no computer, or the computer can only simulate the agent itself. Let Φ:=D. This is our set of "physics outcomes", complete action-observation sequences. Then, W induces the physicalist hypothesis ΘW∈□(Γ×Φ) by ΘW:=⊤Γ⋉W. ΘW is what we called the "first-person static view" of the law: total uncertainty over the choice of policy, interacting with the kernel which maps a policy to the resulting uncertainty over the environment.

We want to find the bridge transform of ΘW and evaluate the corresponding counterfactuals, in order to show that the resulting Lpol is the same as in the cartesian setting. In order to do this, we will use some new definitions.

Definition 4.1: Let X and Y be finite sets and Q⊆Y×X a relation. A polycontribution on Q is ϖ:X→R+ s.t. for any y∈Y:

$$\sum_{x \in Q(y)} \varpi(x) \leq 1$$

ϖ is a polydistribution when for any y∈Y

$$\sum_{x \in Q(y)} \varpi(x) = 1$$

We denote the space of polycontributions by ΔcQX and the space of polydistributions by ΔQX.

Here, Q(y) stands for {x∈X∣yQx}. Polycontributions and polydistributions are essentially a measure on a space, along with a family of subsets indexed by the y∈Y, s.t. restricting to any particular subset makes a contribution/distribution, respectively.
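A minimal Python sketch of Definition 4.1 may help make this concrete. It assumes finite sets represented as Python collections, the relation Q as a set of (y, x) pairs, and exact rational weights. The function names are hypothetical.

```python
from fractions import Fraction
from typing import Dict, Hashable, Iterable, Set, Tuple

Relation = Set[Tuple[Hashable, Hashable]]  # pairs (y, x) with y Q x


def related(Q: Relation, y: Hashable) -> Set[Hashable]:
    """Q(y): the set of all x related to y."""
    return {x for (y2, x) in Q if y2 == y}


def is_polycontribution(
    w: Dict[Hashable, Fraction], Q: Relation, Y: Iterable[Hashable]
) -> bool:
    """Nonnegative weights whose restriction to each Q(y) has total mass at most 1."""
    nonnegative = all(v >= 0 for v in w.values())
    return nonnegative and all(sum(w[x] for x in related(Q, y)) <= 1 for y in Y)


def is_polydistribution(
    w: Dict[Hashable, Fraction], Q: Relation, Y: Iterable[Hashable]
) -> bool:
    """Nonnegative weights whose restriction to each Q(y) has total mass exactly 1."""
    nonnegative = all(v >= 0 for v in w.values())
    return nonnegative and all(sum(w[x] for x in related(Q, y)) == 1 for y in Y)


# Two overlapping subsets of X, one for each element of Y.
Y = ["y1", "y2"]
Q = {("y1", "x1"), ("y1", "x2"), ("y2", "x2"), ("y2", "x3")}
w = {"x1": Fraction(1, 2), "x2": Fraction(1, 2), "x3": Fraction(1, 2)}
```

In this example `w` is a polydistribution on Q even though its total mass on X is 3/2: only the restrictions to Q(y1) and Q(y2) need to be normalized.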

Definition 4.2: Let X and Y be finite sets and Q⊆Y×X a relation. A polycontribution kernel (PoCK) over Q is κ:Y→ΔcX s.t. there is ϖ∈ΔcQX s.t. for any x∈X and y∈Y

$$\kappa(y;x) = \chi_{Q(y)}(x)\,\varpi(x)$$

Or, in other words, a PoCK is the kernel that you get by taking a polycontribution and using the input point y to decide which Y-indexed subset to restrict the measure to. Let Q0⊆AH×D be defined by Q0:={(π,a0o0…aN−1)∣∀i<N:ai=π(a0…oi−1)}. This is the "destiny is consistent with the policy" relation where the policy produces the actions that the destiny says it does, when placed in that situation. Let ϕ:AH→ΔD be a Bayesian causal law (an environment). Then, ϕ is a PoCK over Q0! Conversely, any ϕ:AH→ΔD which is a PoCK is an environment.

Essentially, for an environment, each choice of policy results in a different probability distribution over outcomes. However, these distributions are all coherent with each other wherever the different policies agree on what to do. So, an alternate view of an environment is one where there's one big measure over destinies (where the measure on a particular destiny is the probability of getting that destiny if the policy plays along by picking the appropriate actions), and the initial choice of policy restricts it to a particular subset of the space of destinies, yielding a probability distribution over the destinies compatible with that policy.
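To see this on a toy case, the following Python sketch builds a small environment with binary actions and observations and destinies of the form a0 o0 a1. The observation probabilities in `P_OBS` are made up for illustration, and all names are hypothetical. The sketch checks that restricting the one big measure to the destinies consistent with a policy yields a probability distribution, for every policy.

```python
from fractions import Fraction
from itertools import product

A = (0, 1)  # actions
O = (0, 1)  # observations

# Histories on which a policy must choose an action: the empty one and each (a0, o0).
HISTORIES = [()] + [(a, o) for a in A for o in O]
DESTINIES = [(a0, o0, a1) for a0 in A for o0 in O for a1 in A]

# Hypothetical environment: probability of observation o0 given the first action a0.
P_OBS = {
    0: {0: Fraction(1, 4), 1: Fraction(3, 4)},
    1: {0: Fraction(2, 3), 1: Fraction(1, 3)},
}


def all_policies():
    """Enumerate every deterministic policy as a dict from histories to actions."""
    for actions in product(A, repeat=len(HISTORIES)):
        yield dict(zip(HISTORIES, actions))


def consistent(policy, destiny):
    """The relation Q0: the policy produces the actions that the destiny contains."""
    a0, o0, a1 = destiny
    return policy[()] == a0 and policy[(a0, o0)] == a1


def big_measure(destiny):
    """Probability of the destiny given that the policy plays along."""
    a0, o0, _a1 = destiny
    return P_OBS[a0][o0]


def kernel(policy):
    """The PoCK: restrict the big measure to destinies consistent with the policy."""
    return {
        d: (big_measure(d) if consistent(policy, d) else Fraction(0))
        for d in DESTINIES
    }
```

The big measure itself has total mass 4 on the space of destinies, so it is not a probability distribution. Only its restriction to each policy's consistent destinies is normalized.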

Clearly, the next step is to generalize this beyond probability distributions.

Definition 4.3: Let X and Y be finite sets and Q⊆Y×X a relation. A polyultracontribution kernel (PUCK) over Q is Ξ:Y→□cX s.t. there is Π⊆Y→ΔcX a set of PoCKs s.t. for any y∈Y, Ξ(y) is the closure of the convex hull of ⋃κ∈Πκ(y).

In particular, W is a PUCK over Q0. Conversely, any W:AH→□D which is a PUCK is a causal law.

Proposition 4.1: Consider some Γ, Φ, a relation Q⊆Γ×Φ and a PUCK Ξ over Q. Let Θ:=⊤Γ⋉Ξ. Then,

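$$\mathrm{Br}(\Theta) = \left[\top_{\Gamma} \ltimes (\mathrm{sus}_{\Theta} \rtimes \Xi)\right]^{\downarrow} = \left[\top_{\Gamma} \ltimes (Q^{-1} \rtimes \Xi)\right]^{\downarrow}$$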

Here, Q−1(x):={y∈Y∣yQx}. The first identity says that the inequality of Proposition 2.7 is saturated.

Specifically, as the following corollary shows, for our case of cartesian causal laws, the computations realized according to Br(ΘW) are exactly the histories that occur.

Corollary 4.3: Suppose that for any h∈D and π:H→A s.t. h∈suppW(π), it holds that hCπ. That is, the observations W predicts to receive from the computer are consistent with the chosen policy. Let L:D→R be a cartesian loss function and π:H→A a policy. Then,

$$\left(\mathrm{pr}_{\mathrm{el}\,\Gamma}\,\mathrm{Br}(\Theta_W) \cap C^{\pi}_{\mathrm{fair}}\right)\left(L^{\mathrm{phys}}\right) = W(\pi; L)$$

Notice that the assumption that W is causal, combined with the assumption that W is consistent with C, implies that it's not possible to use the computer to simulate the agent on any history other than factual past histories. For pseudocausal W the bridge transform no longer agrees with the cartesian setting: if a selfish agent is simulated in some counterfactual, then the computationalist stance implies that it can incur the loss for that counterfactual. This artifact should be avoidable for non-selfish agents, but we leave this for future articles.

Turing laws

Now let's examine a cartesian setting which involves a general computer. Without further assumptions, we again cannot expect the physicalist loss to agree with the cartesian, since the programs running on the computer might be entangled with the agent. If the agent runs simulations of itself then it should expect to experience those simulations (as opposed to the cartesian view). Therefore, we will need a "no entanglement" assumption.

Let Γ:=Γ1×AH for some Γ1 (the computations that don't involve the agent) and Φ is still D. We want a notion of "joint belief about the environment and Γ1" in the cartesian setting, which generalizes causal laws. We will call it a "Turing law" or "Turing environment" in the Bayesian case.

For readers who want to try their hand at interpreting some of the more abstract theorems below: in our particular application (though the theorems are more general), Z will generally be taken to be the space of computations that aren't about what you personally do, Y is the space of computations about what you personally do, and X is the space of destinies/physics outcomes.

Definition 4.4: Let X, Y and Z be finite sets and Q⊆Y×X a relation. A Z-polycontribution on Q is ϖ:X×Z→R+ s.t.

$$\sum_{z \in Z} \max_{y \in Y} \sum_{x \in Q(y)} \varpi(x,z) \leq 1$$

ϖ is a Z-polydistribution on Q when it is both a Z-polycontribution on Q and a polydistribution on Q×Z. Here, we regard Q×Z as a relation between Y and X×Z.

We denote the space of Z-polycontributions by ΔcQXZ and the space of Z-polydistributions by ΔQXZ.

Notice that a Z-polycontribution on Q is in particular a polycontribution on Q×Z: ΔcQXZ⊆ΔcQ×Z(X×Z).
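The inclusion holds because a maximum of sums never exceeds the sum of the maxima, and it can be strict. The Python sketch below checks both conditions on a small hand-made example. The representation (a relation as a set of (y, x) pairs, weights keyed by (x, z)) and the function names are assumptions made for illustration.

```python
from fractions import Fraction


def mass(w, Q, y, z):
    """Total weight of the pairs (x, z) with x in Q(y)."""
    return sum(w[(x, z)] for (y2, x) in Q if y2 == y)


def is_Z_polycontribution(w, Q, Y, Z):
    """Definition 4.4: sum over z of the maximum over y of the Q(y)-mass is at most 1."""
    nonnegative = all(v >= 0 for v in w.values())
    return nonnegative and sum(max(mass(w, Q, y, z) for y in Y) for z in Z) <= 1


def is_polycontribution_on_QxZ(w, Q, Y, Z):
    """Weaker condition: for each y, the mass of Q(y) x Z is at most 1."""
    nonnegative = all(v >= 0 for v in w.values())
    return nonnegative and all(sum(mass(w, Q, y, z) for z in Z) <= 1 for y in Y)


Y = ["y1", "y2"]
Z = ["z1", "z2"]
Q = {("y1", "x1"), ("y2", "x2")}

# Each y has mass 1, but the mass sits on a different z for each y.
w_split = {
    ("x1", "z1"): Fraction(1), ("x1", "z2"): Fraction(0),
    ("x2", "z1"): Fraction(0), ("x2", "z2"): Fraction(1),
}

# The same marginal over z regardless of y.
w_shared = {
    ("x1", "z1"): Fraction(1, 2), ("x1", "z2"): Fraction(1, 2),
    ("x2", "z1"): Fraction(1, 2), ("x2", "z2"): Fraction(1, 2),
}
```

Here `w_split` is a polycontribution on Q×Z but not a Z-polycontribution, since the implied belief about z would have to depend on y. In `w_shared` the belief about z is the same for both y, and both conditions hold.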

Definition 4.5: Let X, Y, Z be finite sets and Q⊆Y×X a relation. A Z-polycontribution kernel (Z-PoCK) over Q is κ:Y→Δc(X×Z) s.t. there is ϖ∈ΔcQXZ s.t. for any x∈X, y∈Y and z∈Z

$$\kappa(y;x,z) = \chi_{Q(y)}(x)\,\varpi(x,z)$$

In the context of the identification we'll be using, where Z is the rest of math, Y is the parts of math about what you personally do, and X is the space of physics outcomes/destinies, this definition says that a Z-PoCK maps a policy to the joint distribution over math and destinies you'd get from restricting the measure to "I'm guaranteed to take this policy, regardless of how the rest of math behaves".

A Turing environment can be formalized as a Γ1-PoCK κ:AH→Δ(D×Γ1) over Q0. Indeed, in this case Definition 4.4 essentially says that the environment can depend on a Γ1 variable (remember, Γ1 is the rest of math), and also that we have a belief about this variable that doesn't depend on the policy. The next proposition shows how to transform that into a physicalist hypothesis:

Proposition 4.2: Let X, Y and Z be finite sets, Q⊆Y×X a relation, κ:Y→Δc(X×Z) a Z-PoCK over Q and Θ∈□cY. Then, there exist μ∈ΔcZ and ϕ:Z×Y→ΔcX a PoCK over Z×Q s.t. κ(y)=ϕ(⋅,y)⋊μ. Moreover, suppose that (μ1,ϕ1) and (μ2,ϕ2) are both as above[9]. Then,

$$\mu_1 \ltimes \Theta \ltimes \phi_1 = \mu_2 \ltimes \Theta \ltimes \phi_2$$

Or, for a more concrete example, this is saying that it's possible to decompose the function mapping a policy to a distribution over the rest of math and the destiny, into two parts. One part is just a probability distribution over the rest of math, the second part is a function mapping a policy and the behavior of the rest of math to a probability distribution over destinies. Further, this decomposition is essentially unique, in that for any two ways you do it, if you take the probability distribution over the rest of math, have it interact with some arbitrary way of mapping the rest of math to policies, and then have that interact with the way to get a distribution over destinies, that resulting joint distribution could also be produced by any alternate way of doing the decomposition.
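One way to obtain such a decomposition can be sketched in Python. The particular construction below (take μ(z) to be the largest Q(y)-mass at z, then divide it out) is our own illustrative choice and is not claimed to be the one used in the proof. The representation and all names are hypothetical.

```python
from fractions import Fraction


def decompose(w, Q, X, Y, Z):
    """Split a Z-polycontribution w into mu on Z and phi mapping (z, y) to weights on X.

    Illustrative construction: mu(z) is the largest mass that any Q(y) has at z,
    and phi(z, y; x) is the restriction of w to Q(y), divided by mu(z).
    """
    mu = {
        z: max(sum(w[(x, z)] for x in X if (y, x) in Q) for y in Y)
        for z in Z
    }
    phi = {}
    for z in Z:
        for y in Y:
            phi[(z, y)] = {
                x: (w[(x, z)] / mu[z] if (y, x) in Q and mu[z] > 0 else Fraction(0))
                for x in X
            }
    return mu, phi


def kernel(w, Q, y, x, z):
    """The Z-PoCK value kappa(y; x, z) of Definition 4.5."""
    return w[(x, z)] if (y, x) in Q else Fraction(0)


X = ["x1", "x2"]
Y = ["y1", "y2"]
Z = ["z1", "z2"]
Q = {("y1", "x1"), ("y1", "x2"), ("y2", "x2")}
w = {
    ("x1", "z1"): Fraction(1, 4), ("x2", "z1"): Fraction(1, 4),
    ("x1", "z2"): Fraction(0), ("x2", "z2"): Fraction(1, 3),
}

mu, phi = decompose(w, Q, X, Y, Z)
```

For this example μ has total mass 5/6, every ϕ(z,y) has mass at most 1, and multiplying μ(z) by ϕ(z,y;x) recovers κ(y;x,z) for every triple.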

In the setting of Proposition 4.2, we define Θ∗κ∈□c(Z×Y×X) by μ⋉Θ⋉ϕ. The physicalist hypothesis corresponding to Turing environment κ turns out to be ⊤AH∗κ. The no-entanglement assumption is evident in how the AH and Γ1 variables are "independent". Well, the actual choice of policy can depend on how the rest of math turns out, but the set of available choices of policy doesn't depend on how the rest of math turns out, and that's the sense in which there's independence.

To go from Turing environments to Turing laws, we will need the following generalization to sets of contributions.

Definition 4.6: Let X, Y and Z be finite sets and Q⊆Y×X a relation. A Z-polyultracontribution kernel (Z-PUCK) over Q is Ξ:Y→□c(X×Z) s.t. there is Π⊆(Δc(X×Z))Y a set of Z-PoCKs s.t. for any y∈Y, Ξ(y) is the closure of the convex hull of ⋃κ∈Πκ(y).

A Turing law is a Γ1-PUCK W:AH→□(D×Γ1) over Q0.

Definition 4.7: Let X, Y and Z be finite sets, Q⊆Y×X a relation, Ξ:Y→□c(X×Z) a Z-PUCK over Q and Θ∈□cY. Let Π⊆(Δc(X×Z))Y be the set of Z-PoCKs κ s.t. for any y∈Y, κ(y)∈Ξ(y). Then, Θ∗Ξ∈□c(Z×Y×X) is the closure of the convex hull of ⋃κ∈Π(Θ∗κ).

As a sanity test for the sensibility of Definition 4.7, we have the following proposition. It demonstrates that instead of taking all compatible Z-PoCKs, we could take any subset whose convex hull is all of them. If that was not the case, it would undermine our justification for using convex sets (namely, that taking convex hull doesn't affect anything).

Proposition 4.3: Let X, Y and Z be finite sets, Q⊆Y×X a relation, κ1,κ2:Y→Δc(X×Z) Z-PoCKs over Q, Θ∈□cY and p∈[0,1]. Then, pκ1+(1−p)κ2 is also a Z-PoCK over Q, and

$$\Theta * (p\kappa_1 + (1-p)\kappa_2) \subseteq p\,\Theta * \kappa_1 + (1-p)\,\Theta * \kappa_2$$

The physicalist hypothesis corresponding to Turing law W is ΘW:=⊤AH∗W: total uncertainty over the available choices of policy, interacting with the correspondence between the policy, the rest of math, and the resulting destiny. To study its bridge transform, we first need an intermediate result which places the bridge transform between an upper and a lower bound:

Proposition 4.4: Let X, Y and Z be finite sets, Q⊆Y×X a relation and Ξ:Y→□c(X×Z) a Z-PUCK over Q. Denote Θ:=⊤Z×Y∗Ξ. Define β0,β1:Z×Y×X→2Z×Y by β0(z,y,x):={z}×Q−1(x), β1(z,y,x):=Z×Q−1(x). Then,

$$(\Theta \ltimes \beta_0)^{\downarrow} \subseteq \mathrm{Br}(\Theta) \subseteq (\Theta \ltimes \beta_1)^{\downarrow}$$

Here, we slightly abuse notation by implicitly changing the order of factors in Z×Y×2Z×Y×X.

The application of this is that the resulting uncertainty-over-math/info about what computations are realized (i.e., the α set) attained when you do the bridge transform is bounded between two possible results. One bound is perfect knowledge of what the rest of math besides your policy is (everything's realized), paired with exactly as much policy uncertainty as is compatible with the destiny of interest. The other bound is no knowledge of the rest of math besides your policy (possibly because the policy was chosen without looking at other computations, and the rest of reality doesn't depend on those computations either), combined with exactly as much policy uncertainty as is compatible with the destiny of interest.
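The two maps β0 and β1 from Proposition 4.4 are simple enough to compute directly on a toy relation. The Python sketch below does only that: it does not compute a bridge transform, and the sets, the relation and the function names are hypothetical.

```python
def q_inverse(Q, x):
    """Q^{-1}(x): all y related to x, i.e. the policies compatible with destiny x."""
    return frozenset(y for (y, x2) in Q if x2 == x)


def beta0(Q, Z, z, y, x):
    """{z} x Q^{-1}(x): the rest of math is pinned down exactly. Z and y are unused."""
    return frozenset((z, y2) for y2 in q_inverse(Q, x))


def beta1(Q, Z, z, y, x):
    """Z x Q^{-1}(x): nothing is known about the rest of math. z and y are unused."""
    return frozenset((z2, y2) for z2 in Z for y2 in q_inverse(Q, x))


Y = ["y1", "y2", "y3"]
Z = ["z1", "z2"]
Q = {("y1", "x1"), ("y2", "x1"), ("y3", "x2")}
```

For any point (z, y, x), `beta0` is a subset of `beta1`, and both keep exactly the policy uncertainty Q⁻¹(x) that is compatible with the destiny x.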

Corollary 4.5: Suppose that for any h∈D, z∈Γ1 and π:H→A s.t. (h,z)∈suppW(π), it holds that hCzπ. That is, the observations W predicts to receive from the computer are consistent with the chosen policy and W's beliefs about computations. Let L:D→R be a cartesian loss function and π:H→A a policy. Define ~L:D×Γ1→R by ~L(h,z):=L(h). Then,

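$$\left(\mathrm{pr}_{\mathrm{el}\,\Gamma}\,\mathrm{Br}(\Theta_W) \cap C^{\pi}_{\mathrm{fair}}\right)\left(L^{\mathrm{phys}}\right) = W(\pi; \tilde{L})$$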

And so, the Turing law case manages to add up to normality in the cartesian setting (with computer access) as well.

5. Discussion

Summary

Let's recap how our formalism solves the problems stated in section 0:

• Physicalist hypotheses require no bridge rules, since they are formulated in terms of a physical state space (that can be chosen to be natural/impartial) rather than the agent's actions and observations. Therefore, they accrue no associated description complexity penalty, and should allow for a superior learning rate. Intuitively, a physicalist agent knows it is a manifestation of the program G rather than just any object in the universe.

• A physicalist agent doesn't develop uncertainty after subjectively-simple but objectively-complex events. It is easy to modify a cartesian hypothesis such that the camera going black leads to completely new behavior. However, it would require an extremely contrived (complex) modification of physics, since the camera is not a simple object and has no special fundamental significance. Basically, the simplest way that computations and reality correspond which instantiates the computations of the agent seeing black forever afterward gets a complexity boost by about the complexity of the agent's own source code.

• A physicalist agent doesn't regard its existence as axiomatic. A given physicalist hypothesis might contain 0 copies of the agent, 1 copy or multiple copies (more generally, some ultradistribution over the number of copies). Therefore, hypotheses that predict the agent's existence with relatively high probability gain an advantage (i.e. influence the choice of policy more strongly), as they intuitively should. At the same time, hypotheses that predict too many copies of the agent to have predictive power are typically[10] also suppressed, since policy selection doesn't affect their loss by much (whatever you do, everything happens anyway).

• Acausal attacks against physicalist agents are still possible, but the lack of bridge rule penalty means the malign hypotheses are not overwhelmingly likely compared to the true hypothesis. This should allow defending using algorithms of "consensus" type.

For selfish agents the physicalist framework can be reinterpreted as a cartesian framework with a peculiar prior. Given that the Solomonoff prior is universal (dominates every computable prior), does it mean AIXI is roughly as good as a selfish physicalist agent? Not really. The specification of that "physicalist prior" would involve the source code of the agent. (And, if we want to handle hypotheses that predict multiple copies, it would also involve the loss function.) For AIXI, there is no such source code. More generally, the intelligence of an agent w.r.t. the Solomonoff prior is upper bounded by its Kolmogorov complexity (see this). So, highly intelligent cartesian agents place small prior probability on physicalism, and it would take such an agent a lot of data to learn it.

Curiously, the converse is also true: the translation of section 4 also involves the source code of the agent, so intelligent physicalist agents place small prior probability on cartesian dualism. We could maybe use description complexity relative to the agent's own source code, in order to define a prior[11] which makes dualism and physicalism roughly equivalent (for selfish agents). This might be useful as a way to make cartesian algorithms do physicalist reasoning. But in any case, you still need physicalism for the analysis.

It would be interesting to do a similar analysis for bounded Solomonoff-like priors, which might require understanding the computational complexity of the bridge transform. In the bounded case, the Kolmogorov complexity of intelligent agents need not be high, but a similar penalty might arise from the size of the environment state space required for the physicalist-to-cartesian translation.

Are manifest facts objective?

Physicalism is supposed to be "objective", i.e. avoid privileging the observer. But, is the computationalist stance truly objective? Is there an objective fact of the matter about which facts about computations are physically manifest / which computations are realized?

At first glance it might seem there isn't. By Proposition 2.7, if the physicalist hypothesis Θ∈□(Γ×Φ) is s.t. suppΘ⊆F×Φ for some F⊆Γ, then suppBr(Θ)⊆elF×Φ. So, if the hypothesis asserts a fact F then F is always physically manifest. But, isn't Θ merely the subjective knowledge of the agent? And if so, doesn't it follow that physical manifesting is also subjective?

At second glance things get more interesting. Suppose that agent G learned hypothesis Θ. Since G is part of the physical universe, it is now indeed manifest in the state of the physical universe that hypothesis Θ (and in particular F) is true. Looking at it from a different angle, suppose that a different agent G′ knows that G is a rational physicalist. Then it must agree that if G will learn Θ then F is manifest. So, possibly there is a sense in which manifesting is objective after all.

It would be interesting to try and build more formal theory around this question. In particular, the ability of the user and the AI to agree on manifesting (and agree in general) seems important for alignment in general (since we want them to be minimizing the same loss) and for ruling out acausal attacks in particular (because we don't want the AI to believe in a hypothesis s.t. it wouldn't be rational for the user to believe in it).

Physicalism and alignment

The domain of the physicalist loss function doesn't depend on the action and observation sets, which is an advantage for alignment since it can make it easier to "transfer" the loss function from the user to the AI when they have different action or observation sets.

On the other hand, the most objectionable feature of physicalism is the monotonicity principle (section 3). The default conclusion is, we need to at least generalize the formalism to allow for alignment. But, suppose we stick with the monotonicity principle. How bad is the resulting misalignment?

Consider an alignment protocol in which the AI somehow learns the loss function of the user. As a toy model, assume the user has a cellular automaton loss function Luser, like in section 3. Assume the AI acts according to the loss function Lc-physuser of Definition 3.3. What will result?

Arguably, in all probable scenarios the result will be pretty close to optimal from the user's perspective. One type of scenario where the AI makes the wrong choice is when it must choose between destruction of everything and a future which is even worse than destruction of everything. But, if these are the only available options then probably something has gone terribly wrong at a previous point. Another type of scenario where the AI possibly makes the wrong choice is if it creates multiple disconnected worlds each of which has valence from the perspective of the user, at least one of which is good but some of which are bad. However, there usually seems to be no incentive to do that[12].

In addition, the problems might be milder in approaches that don't require learning the user's loss function, such as IDA of imitation.

So, the monotonicity principle might not introduce egregious misalignment in practice. However, this requires more careful analysis: there can easily be failure modes we have overlooked.

Future research directions

Here's a list of possible research directions, not necessarily comprehensive. Some of them were already mentioned in the body of the article.

Directions which are already more or less formalized:

• Generalizing everything to infinite spaces (actually, we already have considerable progress on this and hopefully will publish another article soon).

• Generalizing section 4 to certain pseudocausal laws. Here, we can make use of the monotonicity principle to argue that if Omega discontinues all of its simulations, they correspond to copies which are not the best-off and therefore don't matter in the loss calculus. Alternatively, we can design a non-selfish loss function that only counts the "baseline reality" copies. Moreover, it would be interesting to understand the connection between the pseudocausality condition and the "fairness" of Definition 1.5.

• Define a natural candidate for the Solomonoff prior for physicalist hypotheses.

• Prove that the physicalist intelligence g is truly unbounded, and study its further properties.

Directions which are mostly clear but require more formalization:

• Prove learning-theoretic results (regret bounds). For example, if we assume that there is a low-cost experiment that reveals the true hypothesis, then low regret should be achievable. In particular, such theorems would allow validating Definition 1.5.

• Generalize section 3 to stochastic ontologies.

• Allow uncertainty about the loss function, and/or about the source code. This would introduce another variable into our physicalist hypothesis, and we need to understand how the bridge transform should act on it.

• Analyze the relationship between cartesian and physicalist agents with bounded simplicity priors.

• For (cartesian) reinforcement learning, MDPs and POMDPs are classical examples of environments for which various theoretical analyses are possible. It would be interesting to come up with analogous examples of physicalist hypotheses and study them. For example, we can consider infra-Markov chains where the transition kernel depends on Γ.

Directions where it's not necessarily clear how to approach the problem:

• Understand the significance of the monotonicity principle better, and whether there are interesting ways to avoid it.

• Formally study the objectivity of manifesting, perhaps by deriving some kind of agreement theorems for physicalist agents.

• Define and analyze physicalist alignment protocols.

• It is easy to imagine how a physicalist hypothesis can describe a universe with classical physics. But, the real universe is quantum. So, it doesn't really have a distribution over some state space Φ but rather a density matrix on some Hilbert space. Is there a natural way to deal with this in the present formalism? If so, that would seem like a good solution to the interpretation of quantum mechanics. In fact, we have a concrete proposal for doing this, but it requires more validation.

1. If we ignore ideas such as the von Neumann-Wigner interpretation of quantum mechanics.

2. This other direction also raises issues with counterfactuals. These issues are also naturally resolved in our formalism.

3. The notation Θ×Λ is reserved for a different, commutative, operation which we will not use here.

4. A simplifying assumption we are planning to drop in future articles.

5. We have considered this type of setting before with somewhat different motivation.

6. The careful reader will observe that programs sometimes don't halt which means that the "true" computational universe is ill-defined. This turns out not to matter much.

7. Previously we defined pullback s.t. it can only be applied to particular infra/ultradistributions. Here we avoid this limitation by using infra/ultracontributions as the codomain.

8. Meaning that, this rule is part of the definition of the state rather than a claim about physical reality.

9. Disintegrating a distribution into a semidirect product yields a unique result, but for contributions that's no longer true, since it's possible to move scalars between the two factors.

10. One can engineer loss functions for which they are not suppressed, for example if the loss only depends on the actions and not on the observations. But such examples seem contrived.

11. It's a self-referential definition, but we can probably resolve the self-reference by quining.

12. One setting in which there is an incentive: Suppose there are multiple users and the AI is trying to find a compromise between their preferences. Then, it might decide to create disconnected worlds optimized for different users. But, this solution might be much worse than the AI thinks, if Alice's world contains Bob!atrocities.

### Winter Solstice

December 1, 2021 - 00:29
Published on November 30, 2021 9:29 PM GMT

Solstice is a time to sing songs, tell stories, contemplate how dark and cold the universe is, and remind ourselves that together, we can make it less so.

The DC Solstice will be held outdoors, in a backyard at a private residence in Silver Spring. It is not in walking distance of a Metro stop, but is bus-accessible. Please email me for the address if you're interested in coming: mwerbos at Google's mail service.

Please bring:

- Blankets and/or warm clothing
- If you can, something to sit on (our chair supply is limited, but we'll do our best)

Show up around 3:30 PM, and we will start the program at 4 PM.

Hope to see you there!

### Relying on Future Creativity

November 30, 2021 - 23:12
Published on November 30, 2021 8:12 PM GMT

I used to write songs for a rock band. Sometimes I'd have a song written, thinking it was my best work ever, then when it came time to rehearse, we'd realize it wasn't going to work out. Maybe we didn't have the instruments to do it justice (we were a three-piece), or it was out of my comfortable vocal range, or it was just too technically tricky for us to get right.

Then I'd be left with this feeling like I'd wasted my best efforts, like I couldn't ever make something that good again. And I could prove it to myself! I didn't know what song I'd write that was better than the one we'd just had to ditch!

I get the same with research. Sometimes I'll come up with an idea for the next thing I want to test, or design. Then it doesn't work, or it's not practical, or it's been done before. And then it's tempting to feel exactly the same way: that I've bungled my one single shot.

But I've learnt not to feel this way, and it was my songwriting which helped first. Now when my idea doesn't work, my first thought isn't "oh no my idea has failed", it's "OK I guess I'll have to just have another idea". I've learnt to rely on my future creativity.

This isn't easy, and sometimes it feels like walking on air. Coming up with an idea is different from other tasks, because by definition it's different every time. When I imagine playing a piece on the piano, I know exactly the form of what I'm going to do. When I imagine coming up with an idea it's a total mental blank. If I knew where I was going I'd already be there. Creativity is also a lot less reliable than physical skills, so I often have no idea how long it's going to take me to come up with something.

There's no magic bullet, but the most important thing is getting good at noticing when you're treating your future creativity as nonexistent. The biggest sign of improvement in this skill is that you don't feel like you're stepping off a cliff every time you abandon a creative endeavour, be it idea or project.


### Reneging Prosocially

November 30, 2021 - 22:01
Published on November 30, 2021 7:01 PM GMT

Author's note: this essay was originally published elsewhere in 2019. It's now being permanently rehosted at this link. I'll be rehosting a small number of other upper-quintile essays from my past writing over the coming weeks.

Quite often, I will make an agreement, and then find myself regretting it. I’ll commit to spending a certain amount of hours helping someone with their problem, or I’ll agree to take part in an outing or a party or a project, or I’ll trade some item for a certain amount of value in return, and then later find that my predictions about how I would feel were pretty far off, and I'm unhappy.

Sometimes, I just suck it up and stick it out. But sometimes I renege on the agreement. In this essay, I’d like to lay out a model for how to renege prosocially, i.e. how to go back on your agreements in a way that strengthens rather than damages the social fabric (or at the very least leaves it no worse than it was).

#### Black boxes and APIs

Once, I was a passenger in my father’s car when another car cut us off on the freeway.

“Do you know what he did wrong?” my father asked, gesturing at the other car.

(I was maybe thirteen at the time, and would soon be going through drivers’ education.)

“Uh. He didn’t use his turn signal?”

I couldn’t come up with anything, and eventually my father continued. “It’s bad because it makes you less predictable,” he said. “Just like the fact that he raced up on us on the right, and then cut in front of us. You pass on the left and not the right because that’s what people are expecting. You signal your lane changes so that people know what you’re about to do. If you drive erratically, everybody else has to change the way they’re driving around you, and then nobody knows how to coordinate. It breaks the pattern, and we rely on the pattern to stay safe.”

(It’s no surprise that I talk the way I do, is it?)

In this essay, I’m going to generally avoid talking about things like people’s internal experiences, and people’s individual character traits, and so forth. I won’t talk about that stuff zero, but most of the time, it doesn’t bear directly on the question of what happens when people interact with each other, except when filtered through a layer of
[X internal states] → [Y external actions].

The guy in the other car could have been a jerk, or he could have been tired and inattentive, or he could have been running late, or he could have just had a different sense of what constitutes safe driving and traffic violations than my father.

None of that particularly matters, though, when the question is something like “what behaviors allow people to coordinate on the road such that no one dies?” You can treat the person in the other car as a black box, and treat my father as a black box, and simply ask “are the behaviors emerging from these black boxes likely to combine smoothly and effectively, or not?”

Similarly, when we talk about reneging on agreements, I want to avoid questions of what’s-going-on-in-their-soul-though, since we can’t really ever be sure. They say they’re too sick to show up, but maybe they’re just tired and afraid you won’t validate that, or maybe they just don’t like you, and are trying to be gentle, or or or or or.

What matters, I claim, is something like what’s comprehensible, what’s sustainable, what’s verifiably fair or fair-approximate from the outside. What, if you could poll 1000 people on r/amitheasshole, would the consensus be? That consensus isn’t always right, in the moral sense, but it’s extremely well-calibrated on what tends to work, when people are rubbing elbows with each other, and what doesn’t.

Another way to think of this is that the social constructs I want to discuss in this series of essays are like an API (application programming interface). They’re a series of knobs and levers that we build between people, a set of tacit agreements that if you push this button, I’ll reliably and predictably respond with that action. If you’re a barista, and I’m standing at the front of the line, I get to tell you what coffee I want and hand you money, and you get to take that money and give me the coffee I asked for, and none of that has anything to do with our immortal souls or our fundamental dignity or whatever kind of day we’re having.

It’s a dance, a game that we’re playing, and the essential fakeness or arbitrariness of the rules doesn’t really matter. When you step out onto a soccer field, you agree to treat the rules of soccer as if they’re immutable laws—as if you actually deserve punishment for having been off-side, even though there’s no deep moral significance to the fact that you were standing on one side of a line of chalk when no one else was.

It’s the same with making agreements and then breaking them. Some people care a lot, and some people don’t. Some agreements are casual, and some cut pretty close to people’s deeply held values. Some reneging causes actual damage, and some doesn’t, and you can’t always tell which, and so the structures we put in place around it need to be sort of blind to individual variation and instead informed by something like the statistical aggregate, i.e. even if your feelings weren’t hurt in situation X, if we empirically observe that a lot of people report that their feelings would be hurt in situation X, we should build some kind of acknowledgement or recompense into our norms around situations like X, and you should get paid those damages unless you explicitly and voluntarily opt out (and sometimes not even then, for reasons of not making the system vulnerable to Moloch).

#### Agreements are predictive structures

What does it mean to make an agreement?

In the black box paradigm, it just means that you and I now have expectations about the future. You expect me to show up at the party, because I said that I would. I expect you to mail me the Magic cards, because I PayPal’d the money to your eBay account. Normally, the future is fluid and confusing and unpredictable; when you and I forge an agreement, we’re trying to blow away some of that fog-of-war.

This usually results in one or both parties changing the way they spend resources. If I think you’re going to help me talk through my problem, I don’t spend a couple of hours looking for another friend to lean on, or for a therapist. If you think I’m going to be at the party, you don’t invite the next marginal person on your list, and maybe you do an extra shopping trip to get food that matches my dietary needs.

So in the world of the agreement, we have both:

• Invested value (the resources that were moved around in accordance with our predictions, based on the plan)
• Expected value (the gain that we think will come out of whatever future cooperation we’ve set up)

Notably, the invested value is often sunk, in that it can’t really be recovered. I once followed a fiancée across the country, and among the things I invested in that plan were the value of my then-current job as a middle school teacher and also my then-potential job as a cofounder of a parkour education startup. Regardless of whether or not the engagement resulted in a marriage, I wasn’t likely to be able to recover either that job or that time-sensitive opportunity.

There are at least three other resources in play sort of one-level-up; these are not object-level resources like time or money but they’re no less relevant. For the rest of this section, I’m going to stick to the frame in which I am the one reneging, and so “you” refers to the person being reneged on:

• Your willingness to make agreements at all; your sense that the social structures around you are sufficiently stable and reliable for you to safely participate; your sense that it’s worthwhile-in-expectation to try to engage with and coordinate with other humans and shift the way you apportion your resources in accordance with their words. For lack of a better term, I’m going to call this resource your social faith, but I mean less “how you feel in your soul” and more “what actions even make sense for you to take from a calculated, economic, game-theoretic perspective.”
• Your sense of me as someone you can make stable contracts with; your sense that my actions are predictable based upon the words that come out of my mouth; your willingness to risk value in coordination with me. I’m going to call this resource my reliability, though more properly it’s my reputation of reliability, in your eyes and in the eyes of onlookers or those who trust your evaluations of people.
• The social fabric; the ability for people in our social vicinity to engage in contracts and agreements at all; the shared sense that our society values reliability and conscientiousness and follow-through and has common knowledge surrounding a mutual commitment to endorse and reward those who have or support those values and to discourage or punish those who don’t.

In general, the prescription for how to renege prosocially is to track and account for all five of these.

#### Having your cake and eating it, too

I want to pause for a moment to describe one of the key failure modes that I see people falling into in this space.

The phrase "have your cake and eat it, too" always confused younger-Duncan; I think it’s clearer in its original form "eat your cake and have it, too," or the less poetic "eat your cake and yet still have your uneaten cake." The idea is that the-people-the-phrase-is-criticizing are trying to secure or maintain some benefit, without having to pay the necessary costs.

For instance, I have a friend who went through a period of serious hormonal imbalance, lasting several years. On maybe six or seven days out of ten, this friend was trustworthy, loyal, helpful, friendly, courteous, kind, generous, considerate, and dozens of other positively-valenced adjectives.

But on the sort-of-predictable-but-not-really other three or four days, they were angry, needy, suspicious, capricious, and volatile—very difficult to be around, very difficult to help, and somewhere between draining and actively hurtful.

This friend would often say “I wish you wouldn’t blame me for what I’m like on those days. That’s not me. It’s really not me—I’ve been around for a while, I’ve known myself my whole life, that behavior is actually not representative.”

And this was true, in one sense. They’ve largely repaired the hormonal problem, and indeed they match only the good list these days.

But at the same time, those bad days were real. They did in fact happen. They did in fact result in costs being imposed upon me—sometimes very significant costs. And so I did in fact treat my friend as if there was something like a 30% chance, on any given day, that they would behave in that fashion. On the level of black-boxes-rather-than-souls, that was part of my model of “what is my friend like?”
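As a rough illustration of the black-box framing in the paragraph above, here is a minimal Python sketch that estimates how often a bad day shows up, using nothing but a log of observed days. The log, the `estimate_bad_day_rate` name, and the numbers are all hypothetical, picked only to line up with the "three or four days out of ten" figure; none of it comes from actual records.

```python
def estimate_bad_day_rate(observed_days):
    """Return the fraction of observed days that were labeled "bad".

    The estimate uses only outward behavior (the black box's outputs);
    it takes no input about intentions or underlying causes.
    """
    if not observed_days:
        raise ValueError("need at least one observed day")
    bad = sum(1 for day in observed_days if day == "bad")
    return bad / len(observed_days)


# Hypothetical log: 7 good days and 3 bad days out of 10.
log = ["good"] * 7 + ["bad"] * 3
rate = estimate_bad_day_rate(log)
print(f"estimated chance of a bad day: {rate:.0%}")
```

The point of the sketch is only that the estimate never asks why a day went badly; whether the cause was blameless does not change the frequency an outside observer ends up planning around.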

The impulse to say “don’t count this negative behavior” is an instance of trying to have your cake and eat it, too. In this case, it really wasn’t my friend’s fault—they were doing literally everything they could to wrangle the problem, seeing multiple doctors, trying all sorts of medication, changing their diet and exercise. But nevertheless, they were in fact unpleasant a lot of the time, and they wanted to be judged/treated/modeled as if they were pleasant all of the time. There was a sort of “getting away with it” implicit in the request.

I find that this dynamic crops up a lot around agreements and reneging—that people will want to renege, and also not have to pay any of the costs of reneging. This is one reason why excuses like “I’m sick” are overused relative to the actual occurrence of illness [citation needed]—people dock you fewer points if you’re sick, and treat the reneging as less predictable-in-advance and less “your fault.”

(At the cost of wearing away the social fabric and making the excuse less believable in general, and therefore shunting the burden somewhat onto those who are actually sick, and less likely to be believed or taken seriously.)

Many of the prescriptions I want to make throughout the remainder of the essay have at their core a desire to break the have-your-cake-and-eat-it-too dynamic. In particular, they’re prescriptions for the reneger to try their best to take on the costs of reneging. This, I claim, is the smoothest, most sustainable, and most functional-as-in-the-opposite-of-dysfunctional system, entirely separate from questions of guilt or blame or shame or whatever.

(Sometimes, the whole reason for reneging is that one is out of resources and can’t follow through on a commitment; in this case, “taking on the costs of reneging” doesn’t mean trying to scrape together yet more resources to give away, but instead means something like taking your lumps. Saying “yes, I’m sorry, I made this commitment and also I’m not following through on it, and I get that you might dock me points for this, and be less willing to trust me in the future, and I’m not going to take umbrage at that reasonable update that you’re making, because this is in fact what happened and I don’t expect you to pretend it didn’t happen.” That’s still 1000x better, as far as I can tell, than trying to weasel and waffle and pin blame on others, or on circumstance.)

#### How to renege in a prosocial manner

A lot of this may seem painstakingly obvious, but I want to write it all out anyway, since (in my experience) something like 99% of people fail to reliably avoid all of the pitfalls.

0. Actually notice agreements; actually notice reneging.

This deserves an entire series of essays on its own; please don’t take the lack of words spilled here as lack of import. All sorts of things go wrong when one person doesn’t notice that they’ve said words which would reasonably cause someone else to start shifting their resources; all sorts of things go wrong when one person starts acting as if something has been agreed to when it hasn’t. This is one of the reasons why I absolutely hate the fact that, when someone uses Google to invite me to an event, me clicking a button that says “maybe” results in them receiving a notification that I have “tentatively accepted.”

1. Loop the renegee in as soon as possible.

It’s well-known that most humans have a strong tendency to postpone bad news, even when the postponement will clearly make things worse [citation needed]. If you and I have established an agreement and I’m thinking of reneging, then every hour that I avoid having to put myself through an awkward conversation, you are burning more resources under false assumptions and losing runway for reorienting.

Rapidly informing you stems the flow of resource investment, capping your losses. It leaves you with the maximum time to prepare for the blow of losing the previously-expected value, and/or to try to change your own planning to recapture that value in some other fashion.

On the social level, you’re already making a Bayesian update against plans being the-sort-of-thing-that-works; rapidly informing you prevents you from having to make a second, separate update about whether information that could be valuable to you will in fact find its way to you via social channels. Similarly, you’re already making an update about whether I am a reliable predictor of my own future actions; rapidly informing you prevents you from having to make a separate update about something like my willingness to impose costs on you for my own benefit, or how much you can count on me to funnel relevant information to you when I have it (including the skill of noticing in the first place that you would probably consider the information to be relevant).
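To make the "capping your losses" point concrete, here is a toy Python sketch of the invested value the renegee sinks into the plan before hearing the news. It assumes, purely for illustration, that they spend resources on the plan at a constant hourly rate; the `invested_loss` name, the rate, and the delays are all hypothetical, and real situations are rarely this linear.

```python
def invested_loss(hours_until_told, spend_per_hour):
    """Resources the renegee sinks into the plan before being told.

    Assumes a constant spend rate, which is a simplification made
    for this sketch and not a claim about real agreements.
    """
    if hours_until_told < 0 or spend_per_hour < 0:
        raise ValueError("inputs must be non-negative")
    return hours_until_told * spend_per_hour


# Hypothetical comparison: telling them after 1 hour, 1 day, or 3 days.
for delay in (1, 24, 72):
    loss = invested_loss(delay, spend_per_hour=2.0)
    print(f"told after {delay:>2} h -> invested value lost: {loss}")
```

Under that assumption the loss grows with every hour of delay, and this still leaves out the second cost described above: the shrinking runway for recapturing the expected value some other way.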

2. Postpone the renege.

It’s often astonishing to me what sorts of wildly dangerous shenanigans people will pull in traffic—crossing four lanes to get to an exit that’s 100 meters up the road, or going in reverse to get out of a turn lane—just to avoid a small delay from carrying on a little further and then turning around. It seems to me to be an example of hyperbolic discounting gone wild, as if those drivers are so tunnel-visioned on the “correct” path that they don’t even consider whether risking life and limb is actually worth it.

There’s an analogous effect that I see in many agreement/renege situations. Certainly there are some people who should straightforwardly ignore this point (because they have the opposite problem of never prioritizing their own needs), but most people, in my experience, jump too quickly to the option of immediately breaking their agreements, to the detriment of the social fabric.

If I loan you an item ostensibly for a month, and regret it, I will do significantly less damage asking you to return it in a week than asking you to return it immediately.

If I’ve admitted you to my exclusive club and on reflection don’t think it’s working out, I will do significantly less damage if I let you taper off your participation than if I just instantly ban you (especially if there’s not some other critical damage occurring to the people who were there first).

It’s like severance packages, or alimony—a norm of delaying the moment at which the agreement breaks off (with notice; if this doesn’t come along with informing them as soon as possible then you’re just tying their shoelaces together) gives people time to reorient and replan and reshuffle their resources while living in the world where all the scaffolding is still scaffolding.

This almost always comes with costs to the reneger, but in some ways, that’s the point. You can either renege immediately, as soon as you realize that’s a more convenient plan, and leave the renegee to pay all of the costs, or you can do your best to split the costs between you—in this case, by living in the now-acknowledged-to-be-sub-optimal universe a little while longer, while the other person reorients.

For some reason, many people seem to get so caught up in the “aaaaahhhhfixitfixitfixit” panic that they forget that triaging a situation doesn’t always mean straightforwardly cutting your losses ASAP.

3. Recalibrate, validate, and (sometimes) recommit.

This one comes up for me most often with very small commitments (such as “I’ll spend an hour reading over this this weekend” or “I’ll meet you for dinner on Friday” or “I know I said I’d see the movie with you, but Joe’s coming and you and Joe are incompatible, so…uh…”).

It’s important, when such small predictions go wrong, to first learn the lesson. If I’ve had to cancel on you more than once for something on this order of magnitude, that’s a sign that I am doing something wrong. And that mistake isn’t just affecting me; it’s radiating outward and imposing costs on the people around me.

So the first step is to do some introspection, and see if I can’t dethread the problem. Perhaps I’m agreeing to things too quickly in general, or not giving myself enough time to rest, or failing to acknowledge that no, it’s not just a bad week, this is just the new normal, at least for this month or this year.

If I can’t be confident that I’ve nailed the problem down, the next step is simply to increase my error bars. Make fewer commitments in general, through a top-down conscious effort, or make each individual commitment looser, giving people more notice that I might bail or flake.

"I'm happy to schedule a 50% chance of a lunch on Saturday, if that works for you, but if you need a firm 'yes' or 'no' then I have to say 'no.'"

Separate from all of that, I give my social partner the bad news, and validate the damage. It’s important to explicitly acknowledge that I told them one thing and am doing another, and that I recognize the degree to which this imposes costs (and forces them to make a negative update on my reliability).

What follows then is either a cutting-of-losses with accompanying make-it-up-to-you efforts (more on that below), or a concrete and credible commitment that includes increased prioritization. As the saying goes, “if they wanted to, they would.” When looking at the world from the black box perspective, I’m choosing whether to let them solidify a prediction that I won’t, or to try to overcome evidence that I’ve given them with stronger evidence in the other direction.

So at that point, I’ll say something like “…and I really will be there this Friday; this is my top priority, where I admit before I was putting [work/rest/hobby/other person] ahead of you.” And then I’ll make it happen.

4. Make the other person whole.

“Make whole” is a technical, legal term that I am probably misusing somewhat, but I like it because it solidly captures the feel of the thing.

If I renege on you, I’ve caused you to lose your invested value, and likely your expected value as well, so that I can avoid losing value myself. One way for me to salvage the social utility that I’m putting at risk (your faith, my reputation, the fabric in general) is to do my best to make you whole—to cover the losses you’ve suffered, to the best of my ability, returning you to the state you were in prior to our agreement.

One place where this goes astray is that the reneger and the renegee frequently estimate the debt differently (and, depressingly, in perfect accordance with what you’d expect if both people were allowing their incentives to bias them). My personal stance is that things will go better in the aggregate if the reneger adopts a policy of trusting the renegee to accurately report the types and magnitudes of damage, but thereafter using their own judgment to determine appropriate action in response. This is in line with the game design advice “trust your audience to identify flaws, but don’t trust them to propose solutions,” and it avoids the problem of putting the injurer in charge of diagnosing how bad the injury is.

(Two wrinkles in this: if the renegee later claims that the recompense was insufficient, I do think you ought to believe them, as a matter of policy—but that doesn’t always mean providing additional value, because otherwise you open yourself up to being mugged or extorted by something like deliberate catastrophizing. More on what to do in that situation later.)

Once again, I don’t have room to spend a proportionate amount of words on this section. Suffice it to say that both figuring out what value has been lost, and figuring out how to repay it, are often difficult and extremely complicated tasks, but I generally think they’re worth an actual effort.

5. Put the remaining damage on your account.

This is the last piece of advice, but really it’s the heart and soul of the recommendation. This is, largely, how one goes about “taking on the costs of reneging.”

In essence, the idea is to put as few barriers between yourself and social consequence as possible. As sort of hinted at in the language of recommitment above, I try my best to be open and candid about my prediction of their prediction of my future behavior.

I’ve proven myself to be flaky? I acknowledge that there is justification for that update, and that indeed I expect people to be correspondingly wary or unwilling to trust my promises in the future, and that I hold no bitterness about this because the failing was indeed mine.

I’ve imposed costs on you, for what ultimately proved to be my own benefit? I admit that I’ve done it, and honestly explain whether it was done out of selfishness or carelessness or insufficiently-guarded-against accident. If I have failed to make it up to you according to your standard, I admit that, too. I accept the reputational hit that comes with my actions, and do not kick off a round of escalating revenge.

Note that this is not the same as hitting myself so they won’t hit me. Most of us have had the experience at one time or another of being on the receiving end of some big, ostentatious, self-flagellating apology, and feeling somewhat cheated about that, too, because to onlookers it appears that the debt has been paid and any continued grudge on your part looks uncharitable and thus there’s no way to pursue further justice.

Instead, this is more straightforward and matter-of-fact. It’s an attempt to be open, not an attempt to pre-empt. It’s an explicit willingness to validate and accept the consequences of what’s happened, in the game of black boxes interfacing with each other, separate from any claims about whether our actions were justified or well-intentioned or whatever. It’s a simple acknowledgement that causes have effects, and a taking of responsibility for the causes which were under my nominal control.

Sometimes, this is all you can do. Sometimes you can’t inform, and you can’t postpone, and you can’t retry, and you can’t make whole, and all you have left is to say “I know that I did this to you, and that it was shitty.”

But it’s absolutely crucial that you do at least that, if you can. That you do so openly, and publicly, taking your lumps where everyone who might want to update on this knowledge gets to see. That you don’t hide the outputs of your black box, so that others’ future predictions of you will skew more positively than they ought. Another way to say this is that in order to renege prosocially, you must stay in contact with, and subject to, the social web. The more you insulate yourself from its vibrations, the less you can credibly claim that its maintenance was important to you.

#### Inverting the prescription

My first impulse was to do a recap, since that was a lot of words. But I think it’s actually more productive to go through the opposite list—to think through what happens when each prescription is inverted. Sadly, you probably have a story that will resonate with each of these, either as the reneger or the renegee. It’s worth it to pause for a moment at each one, and sort of sink into what it feels like, because the first step in building up an immune response against this sort of thing is being able to recognize it as it’s happening, and that only comes from a sort of mindful attentiveness to what it tastes like.

• Sometimes, people don’t bother to keep careful track of their agreements, their commitments, or whether they said things that fall shy of an actual agreement but which would nevertheless tend to cause others to substantially change their plans. Sometimes, people don’t pay enough attention to notice that they’re taking actions counter to their promises.
• Sometimes, people know that they’re going to bail, and they still say nothing. They postpone and delay, whether out of awkwardness or fear or simple self-centeredness, and you only find out later that they knew you were headed for a brick wall, and they still didn’t warn you.
• Sometimes, people end it all at once, even if that makes it ten times harder for you and only a little bit easier for them. They focus on how the clean break will be good for them, and ignore the question of whether or not it will impact you. They make the decision without your input, and give you no chance for appeal, and leave you with no power and no way to shield yourself.
• Sometimes, they do it over and over, stringing you along, never fixing the bug on their end, never taking responsibility.
• Sometimes, they declare bankruptcy, and they never pay you back. They save themselves, walk away with what’s left, and leave you to swallow the loss (if you can).
• Sometimes, they won’t even take responsibility. They’ll put the blame on you, or on luck, or on circumstance, or anywhere except their own account. They’ll try to have their cake and eat it, too, and sometimes they will, to your detriment.

That’s just—the way of the world. It’s not even always wrong, strictly speaking, that this happens—the title of this essay is reneging prosocially, and let’s be real: sometimes, you just can’t prioritize the social fabric. Sometimes people just have to do what they have to do, and that means imposing costs, or being imposed on. Sometimes, it’s all you can do to just renege, and never mind the consequences.

But it’s clear (I hope) how each of these situations could be better—how, if the spoons are there, it’s good to spend them in these ways, to guard against these outcomes. We all make mistakes. We all overpromise, from time to time. We’ve all been there, when things go just a little bit wronger than we thought, when we have just a little less gas in the tank than we expected. That’s life.

It’s just a lot easier to sympathize—and to empathize, and to forgive and forget and start all over—when there’s a credible signal that the other person counted for more than zero. When their counting influences your actions, the outputs of your black box, and not just your internal feelings.

After all, those outputs are all any of us ever really gets to see.
