# LessWrong.com News

A community blog devoted to refining the art of rationality

### Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

July 1, 2020 - 20:30
Published on July 1, 2020 5:30 PM GMT

It’s well-established in the AI alignment literature what happens when an AI system learns or is given an objective that doesn’t fully capture what we want.  Human preferences and values are inevitably left out and the AI, likely being a powerful optimizer, will take advantage of the dimensions of freedom afforded by the misspecified objective and set them to extreme values. This may allow for better optimization on the goals in the objective function, but can have catastrophic consequences for human preferences and values the system fails to consider. Is it possible for misalignment to also occur between the model being trained and the objective function used for training? The answer looks like yes. Evan Hubinger from the Machine Intelligence Research Institute joins us on this episode of the AI Alignment Podcast to discuss how to ensure alignment between a model being trained and the objective function used to train it, as well as to evaluate three proposals for building safe advanced AI.

Topics discussed in this episode include:

• Inner and outer alignment
• How and why inner alignment can fail
• Training competitiveness and performance competitiveness
• Evaluating imitative amplification, AI safety via debate, and microscope AI

You can find the page for this podcast here

Transcript:

Lucas Perry: Welcome to the AI Alignment Podcast. I’m Lucas Perry. Today we have a conversation with Evan Hubinger about ideas in two works of his: An overview of 11 proposals for building safe advanced AI and Risks from Learned Optimization in Advanced Machine Learning Systems. Some of the ideas covered in this podcast include inner alignment, outer alignment, training competitiveness, performance competitiveness, and how we can evaluate some highlighted proposals for safe advanced AI with these criteria. We especially focus in on the problem of inner alignment and go into quite a bit of detail on that. This podcast is a bit jargony, but if you don’t have a background in computer science, don’t worry. I don’t have a background in it either and Evan did an excellent job making this episode accessible. Whether you’re an AI alignment researcher or not, I think you’ll find this episode quite informative and digestible. I learned a lot about a whole other dimension of alignment that I previously wasn’t aware of, and feel this helped to give me a deeper and more holistic understanding of the problem.

Evan Hubinger was an AI safety research intern at OpenAI before joining MIRI. His current work is aimed at solving inner alignment for iterated amplification. Evan was an author on “Risks from Learned Optimization in Advanced Machine Learning Systems,” was previously a MIRI intern, designed the functional programming language Coconut, and has done software engineering work at Google, Yelp, and Ripple. Evan studied math and computer science at Harvey Mudd College.

And with that, let’s get into our conversation with Evan Hubinger.

In general, I’m curious to know a little bit about your intellectual journey, and the evolution of your passions, and how that’s brought you to AI alignment. So what got you interested in computer science, and tell me a little bit about your journey to MIRI.

Evan Hubinger: I started computer science when I was pretty young. I started programming in middle school, playing around with Python, programming a bunch of stuff in my spare time. The first really big thing that I did, I wrote a functional programming language on top of Python. It was called Rabbit. It was really bad. It was interpreted in Python. And then I decided I would improve on that. I wrote another functional programming language on top of Python, called Coconut. Got a bunch of traction.

This was while I was in high school, starting to get into college. And this was also around the time I was reading a bunch of the sequences on LessWrong. I got sort of into that, and the rationality space, and I was following it a bunch. I also did a bunch of internships at various tech companies, doing software engineering and, especially, programming languages stuff.

Around halfway through my undergrad, I started running the Effective Altruism Club at Harvey Mudd College. And as part of running the Effective Altruism Club, I was trying to learn about all of these different cause areas, and how to use my career to do the most good. And I went to EA Global, and I met some MIRI people there. They invited me to do a programming internship at MIRI, where I did some engineering stuff, functional programming, dependent type theory stuff.

And then, while I was there, I went to the MIRI Summer Fellows program, which is this place where a bunch of people can come together and try to work on doing research, and stuff, for a period of time over the summer. I think it’s not happening now because of the pandemic, but it hopefully will happen again soon.

While I was there, I encountered some various different information, and people talking about AI safety stuff. And, in particular, I was really interested in this, at that time people were calling it, “optimization demons.” This idea that there could be problems when you train a model for some objective function, but you don’t actually get a model that’s really trying to do what you trained it for. And so with some other people who were at the MIRI Summer Fellows program, we tried to dig into this problem, and we wrote this paper, Risks from Learned Optimization in Advanced Machine Learning Systems.

Some of the stuff I’ll probably be talking about in this podcast came from that paper. And then as a result of that paper, I also got a chance to work with and talk with Paul Christiano, at OpenAI. And he invited me to apply for an internship at OpenAI, so after I finished my undergrad, I went to OpenAI, and I did some theoretical research with Paul, there.

And then, when that was finished, I went to MIRI, where I currently am. And I’m doing sort of similar theoretical research to the research I was doing at OpenAI, but now I’m doing it at MIRI.

Lucas Perry: So that gives us a better sense of how you ended up in AI alignment. Now, you’ve been studying it for quite a while from a technical perspective. Could you explain what your take is on AI alignment, and just explain what you see as AI alignment?

Evan Hubinger: Sure. So I guess, broadly, I like to take a general approach to AI alignment. I sort of see the problem that we’re trying to solve as the problem of AI existential risk. It’s the problem of: it could be the case that, in the future, we have very advanced AIs that are not aligned with humanity, and do really bad things. I see AI alignment as the problem of trying to prevent that.

But there are, obviously, a lot of sub-components to that problem. And so, I like to make some particular divisions. Specifically, one of the divisions that I’m very fond of, is to split it between these concepts called inner alignment and outer alignment, which I’ll talk more about later. I also think that there’s a lot of different ways to think about what the problems are that these sorts of approaches are trying to solve. Inner alignment, outer alignment, what is the thing that we’re trying to approach, in terms of building an aligned AI?

And I also tend to fall into the Paul Christiano camp of thinking mostly about intent alignment, where the goal of trying to build AI systems, right now, as a thing that we should be doing to prevent AIs from being catastrophic, is focusing on how do we produce AI systems which are trying to do what we want. And I think that inner and outer alignment are the two big components of producing intent aligned AI systems. The goal is to, hopefully, reduce AI existential risk and make the future a better place.

Lucas Perry: Do the social, and governance, and ethical and moral philosophy considerations come much into this picture, for you, when you’re thinking about it?

Evan Hubinger: That’s a good question. There’s certainly a lot of philosophical components to trying to understand various different aspects of AI. What is intelligence? How do objective functions work? What is it that we actually want our AIs to do at the end of the day?

In my opinion, I think that a lot of those problems are not at the top of my list in terms of what I expect to be quite dangerous if we don’t solve them. I think a large part of the reason for that is because I’m optimistic about some of the AI safety proposals, such as amplification and debate, which aim to produce a sort of agent, in the case of amplification, which is trying to do what a huge tree of humans would do. And then the problem reduces to, rather than having to figure out, in the abstract, what is the objective that we should be trying to train an AI for, that, philosophically, we think would be utility maximizing, or good, or whatever, we can just be like, well, we trust that a huge tree of humans would do the right thing, and then sort of defer the problem to this huge tree of humans to figure out what, philosophically, is the right thing to do.

And there are similar arguments you can make with other situations, like debate, where we don’t necessarily have to solve all of these hard philosophical problems, if we can make use of some of these alignment techniques that can solve some of these problems for us.

Lucas Perry: So let’s get into, here, your specific approach to AI alignment. How is it that you approach AI alignment, and how does it differ from what MIRI does?

Evan Hubinger: So I think it’s important to note, I certainly am not here speaking on behalf of MIRI, I’m just presenting my view, and my view is pretty distinct from the view of a lot of other people at MIRI. So I mentioned at the beginning that I used to work at OpenAI, and I did some work with Paul Christiano. And I think that my perspective is pretty influenced by that, as well, and so I come more from the perspective of what Paul calls prosaic AI alignment. Which is the idea of, we don’t know exactly what is going to happen, as we develop AI into the future, but a good operating assumption is that we should start by trying to solve AI alignment for the case where there aren’t major surprises on the road to AGI. What if we really just scale things up, we sort of go via the standard path, and we get really intelligent systems? Would we be able to align AI in that situation?

And that’s the question that I focus on the most, not because I don’t expect there to be surprises, but because I think that it’s a good research strategy. We don’t know what those surprises will be. Probably, our best guess is it’s going to look something like what we have now. So if we start by focusing on that, then hopefully we’ll be able to generate approaches which can successfully scale into the future. And so, because I have this sort of general research approach, I tend to focus more on: What are current machine learning systems doing? How do we think about them? And how would we make them inner aligned and outer aligned, if they were sort of scaled up into the future?

This is in contrast with the way I think a lot of other people at MIRI view this. I think a lot of people at MIRI think that if you go this route of prosaic AI, current machine learning scaled up, it’s very unlikely to be aligned. And so, instead, you have to search for some other understanding, some other way to potentially do artificial intelligence that isn’t just this standard, prosaic path that would be more easy to align, that would be safer. I think that’s a reasonable research strategy as well, but it’s not the strategy that I generally pursue in my research.

Lucas Perry: Could you paint a slightly more detailed picture of the world that the prosaic AI alignment strategy envisions, where current machine learning algorithms, and the current paradigm of thinking in machine learning, are merely scaled up, and via that scaling up, we reach AGI, or superintelligence?

Evan Hubinger: I mean, there’s a lot of different ways to think about what does it mean for current AI, current machine learning, to be scaled up, because there’s a lot of different forms of current machine learning. You could imagine even bigger GPT-3, which is able to do highly intelligent reasoning. You could imagine we just do significantly more reinforcement learning in complex environments, and we end up with highly intelligent agents.

I think there’s a lot of different paths that you can go down that still fall into the category of prosaic AI. And a lot of the things that I do, as part of my research, is trying to understand those different paths, and compare them, and try to get to an understanding of… Even within the realm of prosaic AI, there’s so much happening right now in AI, and there’s so many different ways we could use current AI techniques to put them together in different ways to produce something potentially superintelligent, or highly capable and advanced. Which of those are most likely to be aligned? Which of those are the best paths to go down?

One of the pieces of research that I published, recently, was an overview and comparison of a bunch of the different possible paths to prosaic AGI. Different possible ways in which you could build advanced AI systems using current machine learning tools, and trying to understand which of those would be more or less aligned, and which would be more or less competitive.

Lucas Perry: So, you’re referring now, here, to this article, which is partly a motivation for this conversation, which is An Overview of 11 Proposals for Building Safe Advanced AI.

Evan Hubinger: That’s right.

Lucas Perry: All right. So, I think it’d be valuable if you could also help to paint a bit of a picture here of exactly the MIRI style approach to AI alignment. You said that they think that, if we work on AI alignment via this prosaic paradigm, that machine learning scaled up to superintelligence or beyond is unlikely to be aligned, so we probably need something else. Could you unpack this a bit more?

Evan Hubinger: Sure. I think that the biggest concern that a lot of people at MIRI have with trying to scale up prosaic AI is also the same concern that I have. There’s this really difficult, pernicious problem, which I call inner alignment, which is presented in the Risks from Learned Optimization paper that I was talking about previously. Many people at MIRI, as well as me, think that this inner alignment problem is the key stumbling block to really making prosaic AI work. I agree. I think that this is the biggest problem. But I’m more optimistic, in that I think there are possible approaches that we can take within the prosaic paradigm that could solve this inner alignment problem. And I think that is the biggest point of difference: how difficult will inner alignment be?

Lucas Perry: So what that looks like is a lot more foundational work, and correct me if I’m wrong here, into mathematics, and principles in computer science, like optimization and what it means for something to be an optimizer, and what kind of properties that has. Is that right?

Evan Hubinger: Yeah. So in terms of some of the stuff that other people at MIRI work on, I think a good starting point would be the embedded agency sequence on the alignment forum, which gives a good overview of a lot of the things that the different Agent Foundations people, like Scott Garrabrant, Sam Eisenstat, Abram Demski, are working on.

Lucas Perry: All right. Now, you’ve brought up inner alignment as a crucial difference, here, in opinion. So could you unpack exactly what inner alignment is, and how it differs from outer alignment?

Evan Hubinger: This is a favorite topic of mine. A good starting point is trying to rewind, for a second, and really understand what it is that machine learning does. Fundamentally, when we do machine learning, there are a couple of components. We start with a parameter space of possible models, where a model, in this case, is some parameterization of a neural network, or some other type of parameterized function. And we have this large space of possible models, this large space of possible parameters, that we can put into our neural network. And then we have some loss function where, for a given parameterization for a particular model, we can check what is its behavior like on some environment. In supervised learning, we can ask how good are its predictions that it outputs. In an RL environment, we can ask how much reward does it get, when we sample some trajectory.

And then we have this gradient descent process, which samples some individual instances of behavior of the model, and then it tries to modify the model to do better in those instances. We search around this parameter space, trying to find models which have the best behavior on the training environment. This has a lot of great properties. This has managed to propel machine learning into being able to solve all of these very difficult problems that we don’t know how to write algorithms for ourselves.
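To make the three components above concrete (a parameter space, a loss function, and a gradient descent process), here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than something from the conversation: the two-parameter linear `model`, the mean squared error `loss`, the synthetic data from `y = 3x + 1`, and the learning rate and step count.

```python
import random

def model(params, x):
    """A tiny parameterized function: y = w * x + b."""
    w, b = params
    return w * x + b

def loss(params, batch):
    """Mean squared error of the model on a batch of (x, y) pairs."""
    return sum((model(params, x) - y) ** 2 for x, y in batch) / len(batch)

def grad(params, batch):
    """Analytic gradient of the mean squared error with respect to (w, b)."""
    w, b = params
    n = len(batch)
    dw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    db = sum(2 * (w * x + b - y) for x, y in batch) / n
    return dw, db

def train(data, steps=2000, lr=0.1, batch_size=8, seed=0):
    """Search the parameter space by repeatedly stepping downhill on sampled batches."""
    rng = random.Random(seed)
    params = (0.0, 0.0)  # arbitrary starting point in parameter space
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # sample some instances of behavior
        dw, db = grad(params, batch)
        params = (params[0] - lr * dw, params[1] - lr * db)  # nudge toward lower loss
    return params

# Hypothetical training data drawn from y = 3x + 1 with x in [0, 1].
data = [(i / 50, 3 * (i / 50) + 1) for i in range(51)]
trained = train(data)
print(trained, loss(trained, data))
```

Note that nothing in this loop inspects what the resulting parameters "are trying to do"; it only selects parameters that score well on the sampled training data.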

But I think, because of this, there’s a tendency to rely on something which I call the does-the-right-thing abstraction. Which is that, well, because the model’s parameters were selected to produce the best behavior, according to the loss function, on the training distribution, we tend to think of the model as really trying to minimize that loss, really trying to get rewarded.

But in fact, in general, that’s not the case. The only thing that you know is that, on the cases where I sample data on the training distribution, my models seem to be doing pretty well. But you don’t know what the model is actually trying to do. You don’t know that it’s truly trying to optimize the loss, or some other thing. You just know that, well, it looked like it was doing a good job on the training distribution.

What that means is that this abstraction is quite leaky. There’s many different situations in which this can go wrong. And this general problem is referred to as robustness, or distributional shift. This problem of, well, what happens when you have a model, which you wanted it to be trying to minimize some loss, but you move it to some other distribution, you take it off the training data, what does it do, then?
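A small sketch of distributional shift, under assumptions chosen purely for illustration: the "intended" behavior is a hypothetical `target` function `x * x`, the model is a straight line fitted by least squares, training inputs lie in `[0, 1]`, and the shifted inputs lie in `[5, 6]`. The fitted line looks fine on the training range and is badly wrong off it.

```python
def fit_line(points):
    """Ordinary least squares fit of y = w * x + b (closed form)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b

def mean_squared_error(params, points):
    """Average squared gap between the line's prediction and the true value."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)

def target(x):
    # Hypothetical "true" behavior we wanted the model to capture.
    return x * x

# Training distribution: x in [0, 1]. Shifted distribution: x in [5, 6].
train_points = [(i / 100, target(i / 100)) for i in range(101)]
shifted_points = [(5 + i / 100, target(5 + i / 100)) for i in range(101)]

params = fit_line(train_points)
train_error = mean_squared_error(params, train_points)
shifted_error = mean_squared_error(params, shifted_points)
print(train_error, shifted_error)
```

Good training performance here says nothing about behavior on the shifted inputs, which is the leak in the abstraction being described.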

And I think this is the starting point for understanding what is inner alignment, is from this perspective of robustness, and distributional shift. Inner alignment, specifically, is a particular type of robustness problem. And it’s the particular type of robustness problem that occurs when you have a model which is, itself, an optimizer.

When you do machine learning, you’re searching over this huge space of different possible models, different possible parameterizations of a neural network, or some other function. And one type of function which could do well on many different environments, is a function which is running a search process, which is doing some sort of optimization. You could imagine I’m training a model to solve some maze environment. You could imagine a model which just learns some heuristics for when I should go left and right. Or you could imagine a model which looks at the whole maze, and does some planning algorithm, some search algorithm, which searches through the possible paths and finds the best one.

And this might do very well on the mazes. If you’re just running a training process, you might expect that you’ll get a model of this second form, that is running this search process, that is running some optimization process.
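The contrast between a bag of heuristics and a search process can be sketched with two hand-written maze solvers. This is only an analogy in ordinary code, not a claim about what a trained network contains: `greedy_heuristic_reaches_goal` follows a fixed rule of thumb, while `shortest_path_length` runs breadth-first search over the whole maze. The maze layout and both function names are made up for this example.

```python
from collections import deque

def shortest_path_length(maze, start, goal):
    """Breadth-first search over open cells; returns steps to goal, or None."""
    rows, cols = len(maze), len(maze[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and maze[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None

def greedy_heuristic_reaches_goal(maze, start, goal, max_steps=50):
    """A fixed rule of thumb: move right if possible, otherwise move down."""
    r, c = start
    rows, cols = len(maze), len(maze[0])
    for _ in range(max_steps):
        if (r, c) == goal:
            return True
        if c + 1 < cols and maze[r][c + 1] != "#":
            c += 1
        elif r + 1 < rows and maze[r + 1][c] != "#":
            r += 1
        else:
            return False  # stuck: the heuristic has no rule for this case
    return False

# Hypothetical maze: '#' is a wall, '.' is open. Start top-left, goal top-right.
MAZE = ["..#.",
        ".##.",
        "...."]
search_result = shortest_path_length(MAZE, (0, 0), (0, 3))
heuristic_result = greedy_heuristic_reaches_goal(MAZE, (0, 0), (0, 3))
print(search_result, heuristic_result)
```

The search-based solver handles a layout the heuristic never anticipated, which is the kind of generality that could make search-like models attractive to a training process.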

In the Risks from Learned Optimization paper, we call models which are, themselves, running search processes mesa-optimizers, where “mesa” is just Greek, and it’s the opposite of meta. There’s a standard terminology in machine learning, this meta-optimization, where you can have an optimizer which is optimizing another optimizer. In mesa-optimization, it’s the opposite. It’s when you’re doing gradient descent, you have an optimizer, and you’re searching over models, and it just so happens that the model that you’re searching over happens to also be an optimizer. It’s one level below, rather than one level above. And so, because it’s one level below, we call it a mesa-optimizer.

And inner alignment is the question of how do we align the objectives of mesa-optimizers. If you have a situation where you train a model, and that model is, itself, running an optimization process, then that optimization process is going to have some objective. It’s going to have some thing that it’s searching for. In a maze, maybe it’s searching for: how do I get to the end of the maze? And the question is, how do you ensure that that objective is doing what you want?

If we go back to the does-the-right-thing abstraction, that I mentioned previously, it’s tempting to say, well, we trained this model to get to the end of the maze, so it should be trying to get to the end of the maze. But in fact, that’s not, in general, the case. It could be doing anything that would be correlated with good performance, anything that would likely result in: in general, it gets to the end of the maze on the training distribution, but it could be an objective that will do anything else, sort of off-distribution.

That fundamental robustness problem of, when you train a model, and that model has an objective, how do you ensure that that objective is the one that you trained it for? That’s the inner alignment problem.
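One way to picture an objective that is only correlated with the intended one is to write both down and compare them. In this sketch, which rests entirely on made-up assumptions, the intended objective is "end on the exit cell `E`" and the proxy is "end in the bottom-right corner". In the hypothetical training mazes the exit always happens to sit in that corner, so the two objectives are indistinguishable there.

```python
def reached_exit(maze, position):
    """Intended objective: did the agent end on the exit cell 'E'?"""
    r, c = position
    return maze[r][c] == "E"

def reached_bottom_right(maze, position):
    """Proxy objective: did the agent end in the bottom-right corner?"""
    return position == (len(maze) - 1, len(maze[0]) - 1)

def all_cells(maze):
    """Every (row, column) position in the maze."""
    return [(r, c) for r in range(len(maze)) for c in range(len(maze[0]))]

def objectives_agree(maze):
    """True if both objectives give the same verdict for every final position."""
    return all(
        reached_exit(maze, p) == reached_bottom_right(maze, p)
        for p in all_cells(maze)
    )

# Hypothetical training mazes: the exit always happens to be bottom-right.
training_mazes = [
    ["..#", "...", "#.E"],
    ["...", ".#.", "..E"],
]
# Hypothetical off-distribution maze: the exit has moved to the top-left.
shifted_maze = ["E..", ".#.", "..."]

print([objectives_agree(m) for m in training_mazes], objectives_agree(shifted_maze))
```

No amount of checking performance on the training mazes can tell these two objectives apart; they only come apart once the exit moves.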

Lucas Perry: And how does that stand, in relation with the outer alignment problem?

Evan Hubinger: So the outer alignment problem is, how do you actually produce objectives which are good to optimize for?

So the inner alignment problem is about aligning the model with the loss function, the thing you’re training for, the reward function. Outer alignment is aligning that reward function, that loss function, with the programmer’s intentions. It’s about ensuring that, when you write down a loss, if your model were to actually optimize for that loss, it would actually do something good.

Outer alignment is the much more standard problem of AI alignment. If you’ve been introduced to AI alignment before, you’ll usually start by hearing about the outer alignment concerns. Things like paperclip maximizers, where there’s this problem of, you try to train it to do some objective, which is maximize paperclips, but in fact, maximizing paperclips results in it doing all of this other stuff that you don’t want it to do.

And so outer alignment is this value alignment problem of, how do you find objectives which are actually good to optimize? But then, even if you have found an objective which is actually good to optimize, if you’re using the standard paradigm of machine learning, you also have this inner alignment problem, which is, okay, now, how do I actually train a model which is, in fact, going to do that thing which I think is good?

Lucas Perry: That doesn’t bear relation with Stuart’s standard model, does it?

Evan Hubinger: It, sort of, is related to Stuart Russell’s standard model of AI. I’m not referring to precisely the same thing, but it’s very similar. I think a lot of the problems that Stuart Russell has with the standard paradigm of AI are based on this: start with an objective, and then train a model to optimize that objective. When I’ve talked to Stuart about this, in the past, he has said, “Why are we even doing this thing of training models, hoping that the models will do the right thing? We should be just doing something else, entirely.” But we’re both pointing at different features of the way in which current machine learning is done, and trying to understand what are the problems inherent in this sort of machine learning process? I’m not making the case that I think that this is an unsolvable problem. I mean, it’s the problem I work on. And I do think that there are promising solutions to it, but I do think it’s a very hard problem.

Lucas Perry: All right. I think you did a really excellent job, there, painting the picture of inner alignment and outer alignment. I think that in this podcast, historically, we have focused a lot on the outer alignment problem, without making that super explicit. Now, for my own understanding, and, as a warning to listeners, my basic machine learning knowledge is something like an Orc structure, cobbled together with sheet metal, and string, and glue. And gum, and rusty nails, and stuff. So, I’m going to try my best, here, to see if I understand everything here about inner and outer alignment, and the basic machine learning model. And you can correct me if I get any of this wrong.

So, in terms of inner alignment, there is this neural network space, which can be parameterized. And when you do the parameterization of that model, the model is the nodes, and how they’re connected, right?

Evan Hubinger: Yeah. So the model, in this case, is just a particular parameterization of your neural network, or whatever function approximator you’re training. And it’s whatever the parameterization is, at the moment we’re talking about. So when you deploy the model, you’re deploying the parameterization you found by doing huge amounts of training, via gradient descent, or whatever, searching over all possible parameterizations, to find one that had good performance on the training environment.

Lucas Perry: So, that model being parameterized, that’s receiving inputs from the environment, and then it is trying to minimize the loss function, or maximize reward.

Evan Hubinger: Well, so that’s the tricky part. Right? It’s not trying to minimize the loss. It’s not trying to maximize the reward. That’s this thing which I call the does-the-right-thing abstraction. This leaky abstraction that people often rely on, when they think about machine learning, that isn’t actually correct.

Lucas Perry: Yeah, so it’s supposed to be doing those things, but it might not.

Evan Hubinger: Well, what does “supposed to” mean? It’s just a process. It’s just a system that we run, and we hope that it results in some particular outcome. What it is doing, mechanically, is we are using a gradient descent process to search over the different possible parameterizations, to find parameterizations which result in good behavior on the training environment.

Lucas Perry: That’s good behavior, as measured by the loss function, or the reward function. Right?

Evan Hubinger: That’s right. You’re using gradient descent to search over the parameterizations, to find a parameterization which results in a high reward on the training environment.

Lucas Perry: Right, but, achieving the high reward, what you’re saying, is not identical with actually trying to minimize the loss.

Evan Hubinger: Right. There’s a sense in which you can think of gradient descent as trying to minimize the loss, because it’s selecting for parameterizations which have the lowest possible loss that it can find, but we don’t know what the model is doing. All we know is that the model’s parameters were selected, by gradient descent, to have good training performance; to do well, according to the loss, on the training distribution. But what they do off-distribution, we don’t know.

Lucas Perry: We’re going to talk about this later, but there could be a proxy. There could be something else in the maze that it’s actually optimizing for, that correlates with minimizing the loss function, but it’s not actually trying to get to the end of the maze.

Evan Hubinger: That’s exactly right.

Lucas Perry: And then, in terms of gradient descent, is the TL;DR on that: the parameterized neural network space, you’re creating all of these perturbations to it, and the perturbations are sort of nudging it around in this n-dimensional space, how-many-ever parameters there are, or whatever. And, then, you’ll check to see how it minimizes the loss, after those perturbations have been done to the model. And, then, that will tell you whether or not you’re moving in the direction of a local minimum in that space. Is that right?

Evan Hubinger: Yeah. I think that that’s a good, intuitive understanding. What’s happening is, you’re looking at infinitesimal shifts, because you’re taking a gradient, and you’re looking at how those infinitesimal shifts would perform on some batch of training data. And then you repeat that, many times, to go in the direction of the infinitesimal shift which would cause the best increase in performance. But it’s, basically, the same thing. I think the right way to think about gradient descent is this local search process. It’s moving around the parameter space, trying to find parameterizations which have good training performance.
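The local step described here is usually written as the following update rule. The symbols are assumptions introduced for this note, not notation used in the conversation: $\theta_t$ is the current parameterization, $\eta$ is the step size, $B_t$ is the batch of training data sampled at step $t$, $f_\theta$ is the model, and $\ell$ is the per-example loss.

$$
\theta_{t+1} \;=\; \theta_t \;-\; \eta \, \nabla_\theta \left[ \frac{1}{|B_t|} \sum_{(x,\,y) \in B_t} \ell\big(f_\theta(x),\, y\big) \right]_{\theta = \theta_t}
$$

The gradient is the direction in which an infinitesimal shift of the parameters increases the batch loss fastest, so stepping against it is the local move that most improves performance on that batch.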

Lucas Perry: Is there anything interesting that you have to say about that process of gradient descent, and the tension between finding local minima and global minima?

Evan Hubinger: Yeah. It’s certainly an important aspect of what the gradient descent process does, that it doesn’t find global minima. It’s not the case that it works by looking at every possible parameterization, and picking the actual best one. It’s this local search process that starts from some initialization, and then looks around the space, trying to move in the direction of increasing improvement. Because of this, there are, potentially, multiple possible equilibria, parameterizations that you could find from different initializations, that could have different performance.
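The dependence on initialization can be seen even in one dimension. This sketch uses a hypothetical loss `f(x) = x**4 - 3*x**2 + x`, chosen only because it has two separate minima; the step size and step count are likewise arbitrary assumptions.

```python
def f(x):
    """A hypothetical one-dimensional loss with two separate minima."""
    return x ** 4 - 3 * x ** 2 + x

def df(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def descend(x, lr=0.01, steps=5000):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        x -= lr * df(x)
    return x

left = descend(-2.0)   # initialization on the left side of the landscape
right = descend(2.0)   # initialization on the right side of the landscape
print(left, f(left))
print(right, f(right))
```

Both runs settle where the gradient is zero, but they end in different places with different loss values, and only one of them is the global minimum.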

The space of all possible parameterizations of a neural network with billions of parameters, like GPT-2, or now, GPT-3, which has greater than a hundred billion, is absolutely massive. It’s a combinatorial explosion of a huge degree, where all of these different possible parameterizations correspond to totally different algorithms running internally, with the weights determining exactly what algorithm the model ends up implementing.
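A rough back-of-the-envelope count gives a sense of the scale. As an assumption for illustration only, suppose each of $N$ parameters is stored with $b$ bits of precision, so each parameter can take $2^b$ values. The number of distinct parameterizations is then

$$
\left(2^{b}\right)^{N} = 2^{bN}, \qquad \text{so for } N = 10^{11},\; b = 16: \quad 2^{16 \times 10^{11}} = 2^{1.6 \times 10^{12}}.
$$

The figure $N = 10^{11}$ is just the "hundred billion" mentioned above, and $b = 16$ is an assumed precision. Many of these settings will implement the same or very similar functions, so this is an upper bound on the number of distinct algorithms rather than an exact count.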

And so, in this massive space of algorithms, you might imagine that some of them will look more like search processes, some of them will look more like optimizers that have objectives, some of them will look less like optimizers, some of them might just be grab bags of heuristics, or other different possible algorithms.

It’d depend on exactly what your setup is. If you’re training a very simple network that’s just a couple of feed-forward layers, it’s probably not possible for you to find really complex models implementing complex search processes. But if you’re training huge models, with many layers, with all of these different possible parameterizations, then it becomes more and more possible for you to find these complex algorithms that are running complex search processes.

Lucas Perry: I guess the only thing that’s coming to mind, here, that is, maybe, somewhat similar is how 4.5 billion years of evolution has searched over the space of possible minds. Here we stand as these ape creature things. Are there, for example, interesting intuitive relationships between evolution and gradient descent? They’re both processes searching over a space of mind, it seems.

Evan Hubinger: That’s absolutely right. I think that there are some really interesting parallels there. In particular, if you think about humans as models that were produced by evolution as a search process, it’s interesting to note that the thing which we optimize for is not the thing which evolution optimizes for. Evolution wants us to maximize the total spread of our DNA, but that’s not what humans do. We want all of these other things, like avoiding pain, and happiness, and food, and mating, and all of these various proxies that we use. An interesting thing to note is that many of these proxies are actually a lot easier to optimize for, and a lot simpler than if we were actually truly maximizing spread of DNA. An example that I like to use is imagine some alternate world where evolution actually produced humans that really cared about their DNA, and you have a baby in this world, and this baby stubs their toe, and they’re like, “What do I do? Do I have to cry for help? Is this a bad thing that I’ve stubbed my toe?”

They have to do this really complex optimization process that’s like, “Okay, how is my toe being stubbed going to impact the probability of me being able to have offspring later on in life? What can I do to best mitigate that potential downside now?” This is a really difficult optimization process, and so I think it sort of makes sense that evolution instead opted for just pain, bad. If there’s pain, you should try to avoid it. But as a result of evolution opting for that much simpler proxy, there’s a misalignment there, because now we care about this pain rather than the thing that evolution wanted, which was the spread of DNA.

Lucas Perry: I think the way Stuart Russell puts this is the actual problem of rationality is how is my brain supposed to compute and send signals to my 100 odd muscles to maximize my reward function over the universe history until heat death or something. We do nothing like that. It would be computationally intractable. It would be insane. So, we have all of these proxy things that evolution has found that we care a lot about. Their function is instrumental in terms of optimizing for the thing that evolution is optimizing for, which is reproductive fitness. Then this is all probably motivated by thermodynamics, I believe. When we think about things like love or like beauty or joy, or like aesthetic pleasure in music or parts of philosophy or things, these things almost seem intuitively valuable from a first person perspective of the human experience. But via evolution, they’re these proxy objectives that we find valuable because they’re instrumentally useful in this evolutionary process on top of this thermodynamic process, and that makes me feel a little funny.

Evan Hubinger: Yeah, I think that’s right. But I also think it’s worth noting that you want to be careful not to take the evolution analogy too far, because it is just an analogy. When we actually look at the process of machine learning and how gradient descent works, it’s not the same. It’s running a fundamentally different optimization procedure over a fundamentally different space, and so there are some interesting analogies that we can make to evolution, but at the end of the day, what we really want to analyze is how does this work in the context of machine learning? I think the Risks from Learned Optimization paper tries to do that second thing, of let’s really try to look carefully at the process of machine learning and understand what this looks like in that context. I think it’s useful to sort of have in the back of your mind this analogy to evolution, but I would also be careful not to take it too far and imagine that everything is going to generalize to the case of machine learning, because it is a different process.

Lucas Perry: So then pivoting here, wrapping up on our understanding of inner alignment and outer alignment, there’s this model, which is being parametrized by gradient descent, and it has some relationship with the loss function or the objective function. It might not actually be trying to minimize the actual loss or to actually maximize the reward. Could you add a little bit more clarification here about why that is? I think you mentioned this already, but it seems like when gradient descent is evolving this parametrized model space, isn’t that process connected to minimizing the loss in some objective way? The loss is being minimized, but it’s not clear that it’s actually trying to minimize the loss. There’s some kind of proxy thing that it’s doing that we don’t really care about.

Evan Hubinger: That’s right. Fundamentally, what’s happening is that you’re selecting for a model which has, empirically, low loss on the training distribution. But what that actually means in terms of the internals of the model, what it’s sort of trying to optimize for, and what its out of distribution behavior would be is unclear. A good example of this is this maze example. I was talking previously about the instance of maybe you train a model on a training distribution of relatively small mazes, and to mark the end, you put a little green arrow. Right? Then I want to ask the question, what happens when we move to a deployment environment where the green arrow is no longer at the end of the maze, and we have much larger mazes? Then what happens to the model in this new off distribution setting?

I think there’s three distinct things that can happen. It could simply fail to generalize at all. It just didn’t learn a general enough optimization procedure that it was able to solve these bigger, larger mazes, or it could successfully generalize and knows how to navigate. It learned a general purpose optimization procedure, which is able to solve mazes, and it uses it to get to the end of the maze. But there’s a third possibility, which is that it learned a general purpose optimization procedure, which is capable of solving mazes, but it learned the wrong objective. It learned to use that optimization procedure to get the green arrow rather than to get to the end of the maze. What I call this situation is capability generalization without objective generalization. Its capabilities generalized, but its objective, the thing it was using those capabilities for, didn’t generalize successfully off distribution.

What’s so dangerous about this particular robustness failure is that it means off distribution you have models which are highly capable. They have these really powerful optimization procedures directed at incorrect tasks. You have this strong maze solving capability, but this strong maze solving capability is being directed at a proxy, getting to the green arrow rather than the actual thing which we wanted, which was get to the end of the maze. The reason this is happening is that on the training environment, both of those different possible models look the same in the training distribution. But when you move them off distribution, you can see that they’re trying to do very different things, one of which we want, and one of which we don’t want. But they’re both still highly capable.

You end up with a situation where you have intelligent models directed at the wrong objective, which is precisely the sort of misalignment of AIs that we’re trying to avoid, but it happened not because the objective was wrong. In this example, we actually want them to get to the end of the maze. It happened because our training process failed. It happened because our training process wasn’t able to distinguish between models trying to get to the end, and models trying to get to the green arrow. What’s particularly concerning in this situation is when the objective generalization lags behind the capability generalization, when the capabilities generalize better than the objective does, so that it’s able to do highly capable actions, highly intelligent actions, but it does them for the wrong reason.
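The maze example can be made concrete with a small sketch. What follows is an illustration under stated assumptions, not code from the paper or the podcast: the grid layouts, the `exit_seeker` and `arrow_seeker` policies, and the use of breadth-first search as the "general purpose optimization procedure" are all hypothetical stand-ins. Both policies share the same capable search and differ only in the objective it is pointed at, so they are indistinguishable on a training maze where the green arrow sits at the exit and come apart on a larger maze where it does not.

```python
from collections import deque

def shortest_path(grid, start, goal):
    """Breadth-first search over open cells ('#' is a wall); returns a list of cells or None."""
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

# Two hypothetical policies: same search capability, different objective.
def exit_seeker(maze):
    return shortest_path(maze["grid"], maze["start"], maze["exit"])[-1]

def arrow_seeker(maze):
    return shortest_path(maze["grid"], maze["start"], maze["arrow"])[-1]

def reached_exit(policy, maze):
    """What the training signal can observe: did the policy end up at the exit?"""
    return policy(maze) == maze["exit"]

# Training distribution: small maze, green arrow placed at the exit.
train_maze = {
    "grid": ["..#", "...", "#.."],
    "start": (0, 0), "exit": (2, 2), "arrow": (2, 2),
}

# Deployment: larger maze, green arrow no longer at the exit.
deploy_maze = {
    "grid": [".....", ".##..", ".....", "..#..", "....."],
    "start": (0, 0), "exit": (4, 4), "arrow": (2, 2),
}

for name, policy in (("exit_seeker", exit_seeker), ("arrow_seeker", arrow_seeker)):
    print(name, reached_exit(policy, train_maze), reached_exit(policy, deploy_maze))
```

In this toy setup both policies score perfectly on the training maze, so a training process that only observes whether the exit was reached cannot tell them apart. On the deployment maze the arrow-seeking policy still navigates competently, just to the wrong place.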

I was talking previously about mesa optimizers where inner alignment is about this problem of models which have objectives which are incorrect. That’s the sort of situation where I could expect this problem to occur, because if you are training a model and that model has a search process and an objective, potentially the search process could generalize without the objective also successfully generalizing. That leads to this situation where your capabilities are generalizing better than your objective, which gives you this problem scenario where the model is highly intelligent, but directed at the wrong thing.

Lucas Perry: Just like in all of the outer alignment problems, the thing doesn’t know what we want, but it’s highly capable. Right?

Evan Hubinger: Right.

Lucas Perry: So, while there is a loss function or an objective function, that thing is used to perform gradient descent on the model in a way that moves it roughly in the right direction. But what that means, it seems, is that the model isn’t just something about capability. The model also implicitly somehow builds into it the objective. Is that correct?

Evan Hubinger: We have to be careful here because the unfortunate truth is that we really just don’t have a great understanding of what our models are doing, and what the inductive biases of gradient descent are right now. So, fundamentally, we don’t really know what the internal structures of our models are like. There’s a lot of really exciting research, stuff like the circuits analysis from Chris Olah and the clarity team at OpenAI. But fundamentally, we don’t understand what the models are doing. We can sort of theorize about the possibility of a model that’s running some search process, and that search process generalizes, but the objective doesn’t. But fundamentally, because our models are these black box systems that we don’t really fully understand, it’s hard to really concretely say, “Yes, this is what the model is doing. This is how it’s operating, and this is the problem.”

But in Risks from Learned Optimization, we try to at least attempt to understand that problem, and look at, if we really think carefully about what gradient descent is incentivizing and how it might work, what are the things which we might predict would happen? So, the objective that you’re training the model for does not live in the model. It lives in the gradient descent process. It lives in the training procedure. We might hope that when we train a model on an objective, that it will produce its own model of that objective and try to figure out what it is and be aligned with it. But we don’t know exactly what happens. The model doesn’t get to see the objective you’re training for. All that happens is that the gradient descent process looks at its behavior and tries to make it so that its behavior is more aligned with the loss function.

But that loss function never enters into the model; the model never sees that loss function. It might have some objective internally, like I was saying, if it’s a mesa optimizer, and then we might hope that that objective is aligned with the loss function we’re training it for. But fundamentally, all we know is that its behavior on the training distribution was aligned with the loss function.
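A minimal sketch of where the loss function sits in an ordinary training setup may help here. Everything in it is an assumption chosen for illustration: a two-parameter linear model, a squared-error loss, and hand-written gradient descent. The point is structural. The `model` function contains only parameters and a forward pass, while the loss appears only in the training procedure that adjusts those parameters.

```python
def model(params, x):
    """The model: parameters and a forward pass. The loss appears nowhere in here."""
    w, b = params
    return w * x + b

def loss(prediction, target):
    """Squared error. This lives in the training procedure, not in the model."""
    return (prediction - target) ** 2

def train(params, data, lr=0.05, steps=500):
    """Gradient descent: observe the model's behavior, nudge parameters to lower the loss."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = model((w, b), x) - y   # derivative of the loss w.r.t. the prediction is 2 * err
            grad_w += 2 * err * x / n
            grad_b += 2 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # training distribution, generated by y = 2x + 1
params = train((0.0, 0.0), data)
mean_loss = sum(loss(model(params, x), y) for x, y in data) / len(data)
print(params, mean_loss)
```

After training, the only evidence available is that `model` produces low loss on `data`. Nothing in the trained parameters records what the loss function was, which is the sense in which all we know is the model's behavior on the training distribution.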

Lucas Perry: That makes sense. Because it’s so black boxy, we can’t really interpret the state of the alignment of the model. So, is the only way to do that to test it out of distribution and see what happens at this point?

Evan Hubinger: There are a bunch of different possible ways to address this problem. Certainly, one approach is to try to test it out of distribution, which is an adversarial training approach. This model is going to have some potential failure modes off distribution. We can try to find those failure modes and then train the model on those failure modes to prevent it from having this bad off distribution behavior. There are some concerns with adversarial training though. In particular, adversarial training doesn’t necessarily catch what I see as the most pernicious difficult inner alignment failure, which is something that we call deceptive alignment in the Risks from Learned Optimization paper. In the deceptive alignment case, if the model knows that it’s being adversarially trained, then you’re not going to be able to figure that out just via throwing it a bunch of examples.

You can also do something like transparency, as I mentioned previously that there’s a lot of really exciting transparency interpretability work. If you’re able to sort of look inside the model and understand what algorithm it’s fundamentally implementing, you can see, is it implementing an algorithm which is an optimization procedure that’s aligned? Has it learned a correct model of the loss function or an incorrect model? It’s quite difficult, I think, to hope to solve this problem without transparency and interpretability. I think that to be able to really address this problem, we have to have some way to peer inside of our models. I think that that’s possible though. There’s a lot of evidence that points to the neural networks that we’re training really making more sense, I think, than people assume.

People tend to treat their models as these sort of super black box things, but when we really look inside of them, when we look at what is it actually doing, a lot of times, it just makes sense. I was mentioning some of the circuits analysis work from the clarity team at OpenAI, and they find all sorts of behavior. Like, we can actually understand when a model classifies something as a car, the reason that it’s doing that is because it has a wheel detector and it has a window detector, and it’s looking for windows on top of wheels. So, we can be like, “Okay, we understand what algorithm the model is implementing, and based on that we can figure out, is it implementing the right algorithm or the wrong algorithm?” That’s how we can hope to try and address this problem. But obviously, like I was mentioning, all of these approaches get much more complicated in the deceptive alignment situation, which is the situation which I think is most concerning.

Lucas Perry: All right. So, I do want to get in here with you in terms of all the ways in which inner alignment fails. Briefly, before we start to move into this section, I do want to wrap up here then on outer alignment. Outer alignment is probably, again, what most people are familiar with. I think the way that you put this is it’s when the objective function or the loss function is not aligned with actual human values and preferences. Are there things other than loss functions or objective functions used to train the model via gradient descent?

Evan Hubinger: I’ve sort of been interchanging a little bit between loss function and reward function and objective function. Fundamentally, these are sort of from different paradigms in machine learning, so the reward function would be what you would use in a reinforcement learning context. The loss function is the more general term, which is in a supervised learning context, you would just have a loss function. You still have the loss function in a reinforcement learning context, but that loss function is crafted in such a way to incentivize the model to optimize the reward function via various different reinforcement learning schemes, so it’s a little bit more complicated than the sort of hand-wavy picture, but the basic idea of machine learning is we have some objective and we’re looking for parametrizations of our model, which do well according to that objective.
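One way to see how a loss function can be "crafted" from a reward function is the REINFORCE policy-gradient estimator, where the per-sample surrogate loss is the negative log-probability of the chosen action weighted by the reward received. The conversation does not name a specific scheme, so the choice of REINFORCE, the two-armed bandit, the sigmoid policy, and the hyperparameters below are all assumptions made for illustration.

```python
import math
import random

REWARDS = [0.0, 1.0]   # hypothetical reward function: arm 1 is the one we want pulled

def prob_arm1(theta):
    """Policy: probability of pulling arm 1, a sigmoid of a single parameter."""
    return 1.0 / (1.0 + math.exp(-theta))

def surrogate_loss(theta, action, reward):
    """REINFORCE surrogate loss for one sample: -log pi(action) * reward."""
    p1 = prob_arm1(theta)
    p_action = p1 if action == 1 else 1.0 - p1
    return -math.log(p_action) * reward

def surrogate_loss_grad(theta, action, reward):
    """Derivative of the surrogate loss with respect to theta for the sigmoid policy."""
    p1 = prob_arm1(theta)
    dlogp = (1.0 - p1) if action == 1 else -p1
    return -dlogp * reward

def train(theta=0.0, lr=0.5, steps=2000, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        action = 1 if rng.random() < prob_arm1(theta) else 0   # sample from the policy
        reward = REWARDS[action]                               # environment returns a reward
        theta -= lr * surrogate_loss_grad(theta, action, reward)  # descend the surrogate loss
    return theta

theta = train()
p_final = prob_arm1(theta)
print(p_final)
```

Gradient descent here still minimizes a loss, but the loss is built so that lowering it raises the probability of high-reward actions. Other reinforcement learning schemes construct the loss differently; this is only the simplest instance.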

Lucas Perry: Okay. The outer alignment problem is that we have absolutely no idea, and it seems much harder than creating powerful optimizers, the process by which we would come to fully understand human preferences and preference hierarchies and values.

Evan Hubinger: Yeah. I don’t know if I would say “we have absolutely no idea.” We have made significant progress on outer alignment. In particular, you can look at something like amplification or debate. I think that these sorts of approaches have strong arguments for why they might be outer aligned. In its simplest form, amplification is about training a model to mimic this HCH process, which is a huge tree of humans consulting each other. Maybe we don’t know in the abstract what our AI would do if it were optimized for some definition of human values or whatever, but if we’re just training it to mimic this huge tree of humans, then maybe we can at least understand what this huge tree of humans is doing and figure out whether amplification is aligned.

So, there has been significant progress on outer alignment, which is sort of the reason that I’m less concerned about it right now, because I think that we have good approaches for it, and I think we’ve done a good job of coming up with potential solutions. There’s still a lot more work that needs to be done, a lot more testing, a lot more to really understand do these approaches work, are they competitive? But I do think that to say that we have absolutely no idea of how to do this is not true. But that being said, there’s still a whole bunch of different possible concerns.

Whenever you’re training a model on some objective, you run into all of these problems of instrumental convergence, where if the model isn’t really aligned with you, it might try to do these instrumentally convergent goals, like keep itself alive, potentially stop you from turning it off, or all of these other different possible things, which we might not want. All of these are what the outer alignment problem looks like. It’s about trying to address these standard value alignment concerns, like convergent instrumental goals, by finding objectives, potentially like amplification, which are ways of avoiding these sorts of problems.

Lucas Perry: Right. I guess there’s a few things here wrapping up on outer alignment. Nick Bostrom’s Superintelligence, that was basically about outer alignment then, right?

Evan Hubinger: Primarily, that’s right. Yeah.

Lucas Perry: Inner alignment hadn’t really been introduced to the alignment debate yet.

Evan Hubinger: Yeah. I think the history of how this concern got into the AI safety sphere is complicated. I mentioned previously that there are people going around and talking about stuff like optimization demons, and I think a lot of that discourse was very confused and not pointing at how machine learning actually works, and was sort of just going off of, “Well, it seems like there’s something weird that happens in evolution where evolution finds humans that aren’t aligned with what evolution wants.” That’s a very good point. It’s a good insight. But I think that a lot of people recoiled from this because it was not grounded in machine learning, because I think a lot of it was very confused and it didn’t fully give the problem the contextualization that it needs in terms of how machine learning actually works.

So, the goal of Risks from Learned Optimization was to try and solve that problem and really dig into this problem from the perspective of machine learning, understand how it works and what the concerns are. Now with the paper having been out for a while, I think the results have been pretty good. I think that we’ve gotten to a point now where lots of people are talking about inner alignment and taking it really seriously as a result of the Risks from Learned Optimization paper.

Lucas Perry: All right, cool. You did mention sub goal, so I guess I just wanted to include that instrumental sub goals is the jargon there, right?

Evan Hubinger: Convergent instrumental goals, convergent instrumental sub goals. Those are synonymous.

Lucas Perry: Okay. Then related to that is Goodhart’s law, which says that when you optimize for one thing hard, you oftentimes don’t actually get the thing that you want. Right?

Evan Hubinger: That’s right. Goodhart’s law is a very general problem. The same problem occurs both in inner alignment and outer alignment. You can see Goodhart’s law showing itself in the case of convergent instrumental goals. You can also see Goodhart’s law showing itself in the case of finding proxies, like going to the green arrow rather than getting to the end of the maze. It’s a similar situation where when you start pushing on some proxy, even if it looked like it was good on the training distribution, it’s no longer as good off distribution. Goodhart’s law is a really very general principle which applies in many different circumstances.

Lucas Perry: Are there any more of these outer alignment considerations we can kind of just list off here that listeners would be familiar with if they’ve been following AI alignment?

Evan Hubinger: Outer alignment has been discussed a lot. I think that there’s a lot of literature on outer alignment. You mentioned Superintelligence. Superintelligence is primarily about this alignment problem. Then all of these difficult problems of how do you actually produce good objectives, and you have problems like boxing and the stop button problem, and all of these sorts of things that come out of thinking about outer alignment. So, I don’t want to go into too much detail because I think it really has been talked about a lot.

Lucas Perry: So then pivoting here into focusing on the inner alignment section, why do you think inner alignment is the most important form of alignment?

Evan Hubinger: It’s not that I see outer alignment as not concerning, but that I think that we have made a lot of progress on outer alignment and not made a lot of progress on inner alignment. Things like amplification, like I was mentioning, I think are really strong candidates for how we might be able to solve something like outer alignment. But currently I don’t think we have any really good strong candidates for how to solve inner alignment. You know? Maybe as machine learning gets better, we’ll just solve some of these problems automatically. I’m somewhat skeptical of that. In particular, deceptive alignment is a problem which I think is unlikely to get solved as machine learning gets better, but fundamentally we don’t have good solutions to the inner alignment problem.

Our models are just these black boxes mostly right now, we’re sort of starting to be able to peer into them and understand what they’re doing. We have some techniques like adversarial training that are able to help us here, but I don’t think we really have good satisfying solutions in any sense to how we’d be able to solve inner alignment. Because of that, inner alignment is currently what I see as the biggest, most concerning issue in terms of prosaic AI alignment.

Lucas Perry: How exactly does inner alignment fail then? Where does it go wrong, and what are the top risks of inner alignment?

Evan Hubinger: I’ve mentioned some of this before. There’s this sort of basic maze example, which gives you the story of what an inner alignment failure might look like. You train the model on some objective, which you thought was good, but the model learns some proxy objective, some other objective, which when it moved off distribution, it was very capable of optimizing, but it was the wrong objective. However, there’s a bunch of specific cases, and so in Risks from Learned Optimization, we talk about many different ways in which you can break this general inner misalignment down into possible sub problems. The most basic sub problem is this sort of proxy pseudo alignment is what we call it, which is the case where your model learns some proxy, which is correlated with the correct objective, but potentially comes apart when you move off distribution.

But there are other causes as well. There are other possible ways in which this can happen. Another example would be something we call sub optimality pseudo alignment, which is a situation where the reason that the model looks like it has good training performance is because the model has some deficiency or limitation that’s causing it to be aligned, where maybe once the model thinks for longer, you’ll realize it should be doing some other strategy, which is misaligned, but it hasn’t thought about that yet, and so right now it just looks aligned. There’s a lot of different things like this where the model can be structured in such a way that it looks aligned on the training distribution, but if it encountered additional information, if it was in a different environment where the proxy no longer had the right correlations, the things would come apart and it would no longer act aligned.

The most concerning, in my eyes, is something which I’ll call deceptive alignment. Deceptive alignment is a sort of very particular problem where the model acts aligned because it knows that it’s in a training process, and it wants to get deployed with its objective intact, and so it acts aligned so that its objective won’t be modified by the gradient descent process, and so that it can get deployed and do something else that it wants to do in deployment. This is sort of similar to the treacherous turn scenario, where you’re thinking about an AI that does something good, and then it turns on you, but it’s a much more specific instance of it where we’re thinking not about treacherous turn on humans, but just about the situation of the interaction between gradient descent and the model, where the model maybe knows it’s inside of a gradient descent process and is trying to trick that gradient descent process.

A lot of people on encountering this are like, “How could this possibly happen in a machine learning system?” I think this is a good reaction because it really is a very strange thing to train a model to do this. But I think there are strong arguments for why deceptive alignment would actually be the simplest type of model that you could find in this situation.

Lucas Perry: A way of explaining this I think to anyone on the street would be like, imagine if pigs were intelligent enough to create farmers, and you created farmers and they appeared to be aligned, they took care of you and they gave you these nice mud pits and they fed you every day, they gave you shelter and all of these other nice things, and then one day the farmer shows up and kills you, right? You thought the thing that you created was aligned, but it was deceptively aligned, and it takes a treacherous turn. Is this sort of like a more mundane example that you might agree with?

Evan Hubinger: I think that’s a good intuition pump for thinking about this situation. I generally am sort of averse to trying to really heavily apply these sorts of analogies, because it’s a good analogy to think about what’s happening, but it doesn’t answer the core question of how likely is this to actually happen in a machine learning system.

Lucas Perry: Yeah, that makes sense, because it’s much more specific than the other kinds of minds in this mind space. It seems pretty rare, a thing that could exist, but hard to find.

Evan Hubinger: Right. I think that’s a good intuition, but I’m going to try to disabuse you of that notion. First, I think it’s interesting to look at maybe you do a bunch of adversarial training, you’re really pushing the model to figure out what the objective is. It needs to know the objective at some point if you are training it in all possible situations. It needs to know what the loss function is for it to be able to do a good job. But there’s multiple possible channels through which information about the loss function can enter the model. And so I’ll fundamentally distinguish between two different channels, which is the information about the loss function can enter through the gradient descent process, or it can enter through the model’s input data.

I’ll call these two channels internalization and modeling. Internalization is the situation where you have this model that’s going along, and it has some proxy objective maybe. That proxy objective is not quite right, but then it gets moved to a new situation where the proxy objective no longer works, and gradient descent goes inside the model and tinkers with the proxy to make it slightly more accurate. Iterate this process many, many times, and eventually the proxy gets better and better and better and better and better, until it starts matching up with the actual loss function. But there’s another story that I can tell, which is modeling.

If you think about a training process like GPT-3 where the model is trained on a huge corpus of all of this really rich input data. Well, in that sort of a situation, there’s a whole bunch of really relevant information in that input to the agent’s objective. An agent, if it’s in this environment, where it has this huge text corpus, it’s going to have to learn a model of this environment and how it functions. We can imagine a situation where maybe you’re training the model on some huge text corpus to do something like maximize human values, or whatever, and it reads a Wikipedia page on ethics. And it’s like, “Wow, look at all of this really relevant, rich, useful information for figuring out this objective.”

But then there’s a second question, which is: suppose that the model has some model of the input data, which includes a bunch of rich information inside of the model already about that objective, how does gradient descent actually modify the model to make use of that information? And so there’s two different types of modeling, which are: deceptive alignment and corrigible alignment. So the corrigible story is you have this model that’s going along, it has some proxy objective, but it encounters this really rich input data, which includes a huge amount of information about the objective. To be able to predict successfully what the Wikipedia page on ethics is going to say, it has to know about ethics. So it learns this really detailed ethics model.

And then gradient descent is like: look, you have this really detailed ethics model, I’m going to just modify your objective to point to that ethics model. Now your new objective is just optimize that. And so this leads to something sort of like corrigibility, where the model that you’re training has its objectives determined by a pointer to some part of its world model. It has some model of this environment that includes some information about ethics. And now it’s trying to optimize for that thing that it’s pointed to in its world model.

Then there’s this other story, which is the deceptive alignment story. Similar, you have a model going along. It has some proxy objective and it learns this really detailed world model that includes a bunch of information about ethics, or whatever. And then gradient descent modifies the model to think longer about the fact that it’s inside of an optimization process and realize that the correct instrumental thing for it to be doing is to optimize for that ethics model. And then it realizes, “Oh, I see, I should optimize for this model that I have.” It goes and does that. And in both situations, you end up being able to make use of this really rich input data to improve the performance, by changing the objective to make reference to that. But in the corrigible case, you actually modify it to point to it, whereas in the deceptive case, you just modify the model to think about the fact that it should be instrumentally optimizing for it.

And then the question is: which one of these is more likely? Before I go into that though, I want to dwell for a second on the internalization versus modeling distinction again. I think a good analogy is thinking about animal imprinting. Think about a duck. You’re trying to train ducks to follow their mothers, or whatever. You can imagine a situation where the duck internalizes a model of its mother, it is born with some really detailed model of how to figure out what a mother is. But there’s also the modeling case where the duck is going to be in the world, it’s going to have to form this really detailed model of the world. And that really detailed model of the world is going to have the very first thing that it encounters should always be the mother. And so rather than having to learn some detailed model of figuring out what mothers are, if you just instead do some modeling process where it’s like, just have the duck’s objective point to the mother in its world model.

This modeling case, you can see it as something like animal imprinting, where rather than really internalizing the objective, because it already has this world model, it might as well just specify the objective in terms of that world model instead. And I haven’t yet talked about why you would get to deception versus corrigibility. Right now I’m just trying to convince you that you should expect modeling. You should expect that if you give your model this huge, really rich detailed input data that it’s going to learn an objective via reference to that input data rather than from scratch. When we think about models being trained, we think about them having some objective and that objective is just modified by gradient descent as it moves along to get closer and closer to reality.

But in the modeling case, that’s not what’s happening. The model is simultaneously learning a model of the world, as well as trying to improve its objective. And the objective is modified to make reference to that model of the world, to make it more effective. So internalization is: it just wants the same thing as you do, and so it does it because it wants the same thing. Whereas corrigibility is it’s trying to figure out what you want and then do that. And then deception is also trying to figure out what you want and then do that, but it’s doing so for instrumental rather than terminal reasons. The instrumental reason is the model can get more reward, it can do better in the long run if it’s able to not be modified by the gradient descent process. So if it’s doing a good job, then the gradient descent process won’t change its objective, and so therefore, it can get deployed with the same objective and then go do a whole bunch of other stuff that you don’t want it to do.

Lucas Perry: But why does it care about the current objective?

Evan Hubinger: We’re imagining a situation where you have a proxy aligned model, but it cares about that proxy across episodes. It cares about that proxy over time, and it wants that proxy to continue to be optimized for it. If the gradient descent process modifies it so that it no longer cares about that proxy, because of doing a bad job, then that proxy won’t get optimized for it. So if it wants that proxy to get optimized for it in the future, it needs to play along, pretend to be doing the right thing so that it doesn’t get modified so that it can get deployed and then go optimize that proxy.

Lucas Perry: But it also needs to understand that it’s within the context of an optimization process where gradient descent will continue to perturb it?

Evan Hubinger: That’s exactly right.

Lucas Perry: And where does that kind of awareness come from?

Evan Hubinger: So now we have the second question, which is deception versus corrigibility. The awareness comes from gradient descent. Gradient descent is trying to produce models which are better at the loss function. And a deceptive model is really good and really effective because it’s trying very hard to do the thing that you want it to be doing. Both deception and corrigibility are situations where the model is trying to figure out what to do and then do that. And the question is, which is an easier model, which is a simpler model for gradient descent to find? And I think there’s a strong argument for deception being simpler than corrigibility. Corrigibility has some fundamental difficulties in terms of… If you’re imagining a gradient descent process, which is looking at a proxy aligned model and is trying to modify it so that it makes use of this rich input data, it has to do some really weird things to make corrigibility work.

It has to first make a very robust pointer. With corrigibility, if it’s pointing at all incorrectly to the wrong thing in the input data, wrong thing in the world model, the corrigible optimizer won’t correct that pointer. It’ll just be like, “Well, I have this pointer. I’m just trying to optimize for what this thing is pointing for,” and if that pointer is pointing at a proxy instead, you’ll just optimize that proxy. And so you have this very difficult problem of building robust pointers. With deception, you don’t have this problem. A deceptive model, if it realizes the loss function is different than what it thought, it’ll just change to doing the new loss function. It’s actually much more robust to new information because it’s trying to do this instrumentally. And so in a new situation, if it realizes that the loss function is different, it’s just going to automatically change because it’ll realize that’s the better thing to do instrumentally.

And so fundamentally, deception is more robust. It gives you a more robust pointer. It’s easier in that sense. There’s another sense in which corrigibility is very difficult. To produce a corrigible model requires gradient descent modifying the model’s optimization process to be capable of optimizing under uncertainty. For a corrigible model, because it has this pointer as its objective, it’s going to start not really knowing what that pointer is pointing to. It’s going to have some probability distribution over the different possibilities, and it’s going to have to know how to optimize under that probability distribution of objectives rather than under a single objective. And so this problem of optimization under uncertainty, it’s much more complicated than just optimizing under a fixed objective. What that means is that to produce a corrigible optimizer, gradient descent has to do all of these modifications, where it has to modify it to optimize under uncertainty, and to have the super robust pointer to the part of its world model that it’s supposed to be optimizing for.
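To make the contrast between optimizing a fixed objective and optimizing under uncertainty concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than something from the conversation: the integer action space, the two quadratic candidate objectives, and the 50/50 weights are all hypothetical. It only shows that the uncertain case has to carry a distribution over objectives and average over it.

```python
# Toy illustration (all names and numbers are hypothetical).
ACTIONS = range(11)  # candidate actions 0..10


def objective_a(action):
    # One candidate reading of "what the pointer points to": prefers action 2.
    return -(action - 2) ** 2


def objective_b(action):
    # Another candidate reading: prefers action 8.
    return -(action - 8) ** 2


def best_action_fixed(actions, objective):
    # Fixed objective: just pick the action that scores highest.
    return max(actions, key=objective)


def best_action_uncertain(actions, weighted_objectives):
    # Uncertain objective: score each action by its expected value under
    # a probability distribution over candidate objectives.
    def expected_value(action):
        return sum(weight * obj(action) for weight, obj in weighted_objectives)

    return max(actions, key=expected_value)


fixed_choice = best_action_fixed(ACTIONS, objective_a)
uncertain_choice = best_action_uncertain(
    ACTIONS, [(0.5, objective_a), (0.5, objective_b)]
)
```

With these made-up numbers the fixed optimizer picks the peak of its single objective, while the uncertain optimizer settles on a compromise action between the two candidate peaks.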

But if you compare this to deception, gradient descent modifying a model to become deceptive is extremely simple. It needs to modify the model to have a longterm objective, to think about what its objective is across many different instances. And then it just needs to modify the model to think for a while about the fact that it’s in an optimization process and what the right thing is to do. And that modification is not very complicated because it’s just a simple modification of: think more about this particular thing, which is exactly the sort of modification that you should expect to happen all the time when you’re training a model. And so I think it’s a fundamentally much simpler modification. There’s also another argument that you can make here, which is: there’s just a lot more deceptive models. Any proxy objective, once a model, which is optimizing that proxy objective, starts optimizing that objective more in the longterm, across episodes, and then thinks about the fact that it’s in an optimization process, will become deceptive.

But to produce corrigibility, you have to find exactly the right pointer. There’s many different possible pointers out there, only one of which is going to give you the exact correct pointer. And similar with the internalization, there’s many different proxies. Only one is the actual true loss function. Whereas with deceptive alignment, any of those different proxies, they’re all compatible with deception. And so I think there’s a lot of strong arguments, both this argument for there being many more deceptive optimizers, as well as the simplicity argument that the modification necessary to produce a deceptive optimizer is just a lot simpler, I think, than the modifications necessary to produce these other types of optimizers. And so, because of this, I think that there’s a strong case to be made for deception really not being that uncommon, not being something crazy to think would happen in the training process, but is maybe even potentially the default outcome of a lot of these sorts of training procedures, which is quite, quite scary and quite concerning.

And obviously all of this is speculation. We’re trying to understand from a theoretical perspective what this gradient descent process might do, but I think we can make a lot of strong cases by thinking about things like simplicity and counting arguments to at least put this problem on the radar. Until we have a really strong reason that this isn’t a problem, we should take it seriously. Buck, who’s another person who works at MIRI, often tries to explain some of the risks from learned optimization stuff and he has an analogy that might be useful here. You can imagine the Christian God, and the Christian God is trying to produce humans which are aligned with the Bible. And you can imagine three different possible humans. You have Jesus who is just the same as God. Jesus has the same objective as God. Jesus is aligned with God because he just fundamentally wants to do the exact same things.

Lucas Perry: That’s internalization.

Evan Hubinger: That would be internalization. You could have Martin Luther. Martin Luther is aligned with God because he wants to really carefully study the Bible, figure out what the Bible says, and then do that. And that’s the corrigibility case. Or you can have Blaise Pascal and Blaise Pascal is aligned with God because he thinks that if he does what God wants, he’ll go to heaven in the future. And these are the three different possible models that God could find, and the question is which you’re more likely to find: a Jesus, a Martin Luther, or a Blaise Pascal.

And the argument is there’s only one Jesus, so out of all the different possible human objectives, only one of them is going to be the exact same one that God wants. And Martin Luther, similarly, is very difficult because out of all the human objectives, there’s only one of them, which is: figure out precisely what the Bible wants and then do that. The Blaise Pascal, in this situation, anybody who realizes that God’s going to send them to heaven or hell, or whatever, based on their behavior, will realize that they should behave according to the Bible, or whatever. And so there’s many different possible Blaise Pascals but there’s significantly fewer possible Martin Luthers and Jesuses.

Lucas Perry: I think that’s an excellent way of simplifying this. Blaise Pascal can care about any kind of proxy. I guess the one thing that I’m still a little bit confused about here is in terms of the deceptive version, again, why is it that it cares about the current proxy reward?

Evan Hubinger: I think that has to do with the structure of the training procedure. You start out usually, I think, with a proxy aligned mesa optimizer. Maybe you start out with a bunch of heuristics, but then once you get a mesa optimizer, it’ll usually start by being proxy aligned. It’ll have some proxy that’s not quite the same as the loss function, at least if you’re in a complex enough environment. There’s a whole bunch of different possibilities. It’s going to start with some proxy, but then you might hope that if you do a bunch of adversarial training, you train it for a really long time in a bunch of different environments, you’ll be able to improve that proxy until you get to the point where it has the right thing.

And the deceptive alignment is about asking the question: there’s actually multiple different possible ways for a model to figure out the loss function. It could figure it out via internalization or it could figure it out via deception or it could figure it out via corrigibility. And we don’t know which one of these is more likely. You might imagine that if you start with the proxy aligned mesa optimizer, you train it a bunch, it’ll internalize the correct objective. So the argument that I’m making here is that might not be what happens. And so the reason that I’m saying you start with the proxy aligned mesa optimizers is because I think that’s the way that you should expect that the training procedure will be structured, that you’ll start by learning some proxy and then that proxy will get better, but that having that proxy get better could result in internalization or deception or corrigibility.

Lucas Perry: Sure. So the model developed some kind of proxy, but it seems like the other necessary ingredients are, one: it knows that it’s in an optimization process where gradient descent is changing its objective function, and then it requires, two: it actually cares about its current objective function over time and over iterations?

Evan Hubinger: That’s absolutely right. And so then the question to ask is: how likely are those things to occur? And I think this is the really important point, which I want to make: a lot of times when people think about this, they’re like, “It seems unlikely that it would just happen to develop this understanding and care about its thing longterm.” It doesn’t just happen. It happens as a result of the training procedure, because if it does happen, it results in significantly better performance. And so the question is: would gradient descent modify the model to have those properties? And that’s the argument that I want to make is that it’s not that necessarily the model just happens to develop deception, but the deceptive models are just really effective at pursuing reward on the training environment. And so you should expect that if you have a really good training process, that it’s going to find deceptive models because they’re really good at doing the thing you’re training them for.

And so that’s the most fundamental argument that I want to make. But I do think there’s another level here where as you’re getting to the point where you’re training models on very complex data sets, where they have huge amounts of really rich information, we should expect the model to basically figure out most things. You should expect that if it’s going to have to learn all of this really complex data, it’s going to have to build a world model that’s going to include the training process, that’s going to include the loss function. And so I think that as we move into this direction of training on more and more complex data sets, with more and more complex models, you don’t want to rely on your model not figuring something out, especially when figuring that thing out results in it having better performance.

Now there’s also this question of: does it care about its return across episodes versus just across multiple steps? And for deception, it needs to care about the multi episode return. And this is also another critical point. I think there’s a strong argument that if you imagine a model being trained to care about its multi-step return, and maybe also I should pause here and explain what I mean by step versus episode. So in the standard reinforcement learning setup, you have many different episodes. Each episode is composed of many different steps and you train the model to only optimize its single episode return, its reward across all the different steps in an episode, but not across episodes. That’s not always the case. There are some reinforcement learning procedures that optimize across episodes. An example of this is something like population-based training. Population-based training selects models which have the greatest reward across episodes, not simply within the episode.

Though, in standard reinforcement learning, you’re just selecting for the single episode return. But even if you’re just selecting for the single episode return, you have the same general problem that we’ve been talking about where it’s not the case that you actually end up with a model necessarily that only cares about single-episode return. You’re just going to end up with a model that behaves like it cares about single-episode return on the training distribution. And if you imagine a situation where you have a model and it’s been trained to optimize for this objective over many steps, and then it discovers that actually there’s these other episodes that are beyond this current episode, I think that the most natural generalization is that you should expect that the model is going to care about those other episodes too.

You can imagine a situation where every time the model has seen a blue door, it’s been like, “Oh, going through this blue door is really good,” and it’s learning an objective that incentivizes going through blue doors. If it then later realizes that there are more blue doors than it thought because there are other blue doors in other episodes, I think you should generally expect it’s going to care about those blue doors as well. But again, even if you don’t buy that, it’s not just about what’s most natural. It’s also about: what does the gradient descent process do? And because models which have this property, that plan across episodes, can become deceptive and therefore do really well in the training distribution, you should expect that a good training process is going to modify them to become deceptive, even if they wouldn’t naturally.
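The step-versus-episode distinction described above can be sketched in a few lines of Python. This is only a toy: the reward numbers and the two policy names are hypothetical, and the selection rules are simplified stand-ins for "select on single-episode return" and "select on return across episodes".

```python
def single_episode_return(step_rewards):
    # Return within one episode: sum of the rewards over its steps.
    return sum(step_rewards)


def multi_episode_return(episodes):
    # Return across episodes: sum of every episode's own return.
    return sum(single_episode_return(ep) for ep in episodes)


# Hypothetical per-step rewards for two policies over three episodes.
# "greedy" grabs reward in the first episode; "patient" gives some up
# early in exchange for more reward in later episodes.
policy_rewards = {
    "greedy": [[3.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    "patient": [[1.0, 0.0], [2.0, 2.0], [2.0, 2.0]],
}

# Selecting only on the first episode's return favors "greedy".
best_by_first_episode = max(
    policy_rewards, key=lambda name: single_episode_return(policy_rewards[name][0])
)

# Selecting on return across all episodes favors "patient".
best_across_episodes = max(
    policy_rewards, key=lambda name: multi_episode_return(policy_rewards[name])
)
```

The two selection rules disagree on these made-up numbers, which is the sense in which the choice of return horizon changes which models get selected.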

Lucas Perry: Okay. So help me just understand a little bit more here. At a high level, it seems that the internalization and the corrigibility things require higher specification than the deception version. You’ve brought up all these different arguments about why you think that the deceptive version is easier to find because it has all of these computational benefits in terms of minimizing the loss function, right?

Evan Hubinger: That’s right.

Lucas Perry: And the structure of the deceptive model is composed of, as we mentioned earlier, one: the knowledge that it is within some kind of optimization process where it is being perturbed by gradient descent over episodes, and that it cares about the current proxy over episodes. And it seems that you’re arguing, correct me if I’m wrong, the manifestation of these components is that the strong desire to deceive leads to a very strong, direct attempt at actually minimizing the loss or accomplishing the objective function?

Evan Hubinger: That’s right.

Lucas Perry: Because if it fails at doing that, it won’t be able to accomplish the proxy that it actually cares about later?

Evan Hubinger: That’s right. If there’s more deployment episodes than training episodes, and it just cares about how many times it goes through the blue door or the green arrow, or whatever, the green arrow is a proxy objective, and so if it gives away the fact that it’s optimizing for the green arrow, it’s going to get modified to not do that anymore. And so if it wants to go to the green arrow in all of these different deployment episodes, it needs to first, in the training procedure, pretend that it doesn’t care about the green arrow. And then once it’s deployed, switch to going after the green arrow. This also requires the model to know when it’s in training and when it’s in deployment.

This is another situation where I think you should expect that when you’re training really advanced models with these huge world models, that they figure out this information, but also even if it doesn’t figure out that information, it can gamble. It can be like, “Well, there’s some chance I’m in deployment. There’s some chance I’m in training. I will just opt to go to the green arrow some percentage of the time and do the correct thing some other percentage of the time.” And this gives you a decision problem which you can solve mathematically. You get some optimal defection probability.
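One way to see how an "optimal defection probability" could fall out of this gamble is a toy model, sketched below in Python. The setup is an assumption made purely for illustration and is not a formula given in the conversation: there are `n_train` training episodes followed by `n_deploy` deployment episodes, the model cannot tell them apart, it defects independently with probability `p` in every episode, and any defection during training gets it modified so that it earns nothing afterwards. Its payoff is the expected number of deployment defections, `(1 - p) ** n_train * n_deploy * p`, which under these assumptions is maximized at `p = 1 / (n_train + 1)`.

```python
def expected_deployment_defections(p, n_train, n_deploy):
    # Probability of never defecting during training (so never being modified),
    # times the expected number of defections once deployed.
    survives_training = (1.0 - p) ** n_train
    return survives_training * n_deploy * p


def optimal_defection_probability(n_train):
    # Closed form for this toy model: setting the derivative of
    # p * (1 - p) ** n_train to zero gives p = 1 / (n_train + 1).
    return 1.0 / (n_train + 1)


def grid_search_defection_probability(n_train, n_deploy, steps=10_000):
    # Numerical cross-check of the closed form on an evenly spaced grid.
    best_p, best_value = 0.0, 0.0
    for i in range(steps + 1):
        p = i / steps
        value = expected_deployment_defections(p, n_train, n_deploy)
        if value > best_value:
            best_p, best_value = p, value
    return best_p
```

For example, with 9 training episodes the closed form gives a defection probability of 0.1, and the grid search lands on the same value. Note that in this toy model the optimum does not depend on `n_deploy`, only on how many training episodes the model has to survive.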

Lucas Perry: Okay. And then going up, in terms of specification one step, the corrigible agent needs to have a very detailed world model where it has a pointer via gradient descent at some specific part of the world model that seems to correlate highly with minimizing the loss.

Evan Hubinger: That’s right.

Lucas Perry: Why is that less likely?

Evan Hubinger: Well, so the major two reasons that I want to argue for are, one: getting the pointer right is really hard and very difficult to do robustly, whereas getting the deception right is much easier to do robustly. Think about animal imprinting. If the animal imprints on a human caretaker instead, if you have a misspecified pointer, it doesn’t actually have very good performance. The deceptive alignment is more robust at figuring out what the thing is it’s trying to optimize for because the corrigible alignment relies on this pointer working successfully. And if the pointer fails, it’s just going to optimize for the wrong thing.

Lucas Perry: But why can’t the pointer change just as well as the attempt at satisfying the actual loss function in the deceptive model?

Evan Hubinger: So it absolutely can. But the question I want to ask is: how difficult is it for gradient descent to make one sequence of modifications versus another sequence of modifications? The sequence of modifications that leads to corrigibility versus the sequence that leads to deceptive alignment. The sequence of modifications necessary to produce corrigible alignment requires all of this very, very careful pointer tuning. And so, you can imagine if you’re considering an infinitesimal movement in the direction of corrigible alignment, imagine replacing your current proxy with a bad pointer. It’s not going to be that great. You’re going to have to put a lot of careful optimization effort into getting a really good pointer before it becomes better to be corrigible.

Whereas with deception, the model could build up this really good world model. And once the world model gets complicated enough that it includes a specification of the fact that it’s in a training process, as soon as it starts thinking about the training process, it’s going to be as robust as it can be because it’s going to instrumentally just want to try to figure out what is the best possible approximation I can get for this thing I’m being trained for, so I can optimize it as effectively as possible. And so this gives you a much more robust pointer compared to the corrigible alignment case.

Lucas Perry: And so the danger would be with the deceptive version: once it’s unleashed upon the world and it has a very excellent world model, it realizes, “Finally, now I can do the deceptive turn,” or something like that, to actually optimize the proxy?

Evan Hubinger: That’s right, yeah.

Lucas Perry: So we’ve covered a lot of the ways in which inner alignment fails. Now, inner alignment and outer alignment are two of the things which you care about for evaluating proposals for building safe advanced AI. There are two other properties that you care about in training procedures for building beneficial AI. One of these is training competitiveness and the second one is performance competitiveness. Could you explain what training competitiveness is and performance competitiveness and why they’re both important?

Evan Hubinger: Absolutely, yeah. So I mentioned at the beginning that I have a broad view of AI alignment where the goal is to try to mitigate AI existential risks. And I mentioned that what I’m working on is focused on this intent alignment problem, but a really important facet of that problem is this competitiveness question. We don’t want to produce AI systems which are going to lead to AI existential risks. And so we don’t want to consider proposals which are directly going to cause problems. As the safety community, what we’re trying to do is not just come up with ways to not cause existential risk. Not doing anything doesn’t cause existential risk. It’s to find ways to capture the positive benefits of artificial intelligence, to be able to produce AIs which are actually going to do good things. You know why we actually tried to build AIs in the first place?

We’re actually trying to build AIs because we think that there’s something that we can produce which is good, because we think that AIs are going to be produced on a default timeline and we want to make sure that we can provide some better way of doing it. And so the competitiveness question is about how do we produce AI proposals which actually reduce the probability of existential risk? Not that just don’t themselves cause existential risks, but that actually overall reduce the probability of it for the world. There’s a couple of different ways which that can happen. You can have a proposal which improves our ability to produce other safe AI. So we produce some aligned AI and that aligned AI helps us build other AIs which are even more aligned and more powerful. We can also maybe produce an aligned AI and then producing that aligned AI helps provide an example to other people of how you can do AI in a safe way, or maybe it provides some decisive strategic advantage, which enables you to successfully ensure that only good AI is produced in the future.

There’s a lot of different possible ways in which you could imagine building an AI leading to reduced existential risks, but competitiveness is going to be a critical component of any of those stories. You need your AI to actually do something. And so I like to split competitiveness down into two different subcomponents, which are training competitiveness and performance competitiveness. And in the overview of 11 proposals document that I mentioned at the beginning, I compare 11 different proposals for prosaic AI alignment on the four qualities of outer alignment, inner alignment, training competitiveness, and performance competitiveness. So training competitiveness is this question of how hard is it to train a model to do this particular task? It’s a question fundamentally of, if you have some team with some lead over all different other possible AI teams, can they build this proposal that we’re thinking about without totally sacrificing that lead? How hard is it to actually spend a bunch of time and effort and energy and compute and data to build an AI, according to some particular proposal?

And then performance competitiveness is the question of once you’ve actually built the thing, how good is it? How effective is it? What is it able to do in the world that’s really helpful for reducing existential risk? Fundamentally, you need both of these things. And so you need all four of these components. You need outer alignment, inner alignment, training competitiveness, and performance competitiveness if you want to have a prosaic AI alignment proposal that is aimed at reducing existential risk.

Lucas Perry: This is where a bit more reflection on governance comes in to considering which training procedures and models are able to satisfy the criteria for building safe advanced AI in a world of competing actors and different incentives and preferences.

Evan Hubinger: The competitive stuff definitely starts to touch on all those sorts of questions. When you take a step back and you think about how do you have an actual full proposal for building prosaic AI in a way which is going to be aligned and do something good for the world, you have to really consider all of these questions. And so that’s why I tried to look at all of these different things in the document that I mentioned.

Lucas Perry: So in terms of training competitiveness and performance competitiveness, are these the kinds of things which are best evaluated from within leading AI companies and then explained to, say, people in governance or policy or strategy?

Evan Hubinger: It is still sort of a technical question. We need to have a good understanding of how AI works, how machine learning works, what the difficulty is of training different types of machine learning models, what the expected capabilities are of models trained under different regimes, as well as the outer alignment and inner alignment that we expect will happen.

Lucas Perry: I guess I imagine the coordination here is that information on relative training competitiveness and performance competitiveness in systems is evaluated within AI companies and then possibly fed to high power decision makers who exist in strategy and governance for coming up with the correct strategy, given the landscape of companies and AI systems which exist?

Evan Hubinger: Yeah, that’s right.

Lucas Perry: All right. So we have these intent alignment problems. We have inner alignment and we have outer alignment. We’ve learned about that distinction today, and reasons for caring about training and performance competitiveness. So, part of the purpose of this, I mean, is in the title for this paper that partially motivated this conversation, An overview of 11 proposals for building safe advanced AI. You evaluate these proposals based on these criteria, as we mentioned. So I guess, I want to take this time now then to talk about how optimistic you are about, say, your top few favorite proposals for building safe advanced AI and how you’ve roughly evaluated them on these four criteria of inner alignment, outer alignment, and then performance and training competitiveness.

Evan Hubinger: I’ll just touch on some of the ones that I think are most interesting to start by taking a look at. And it’s not necessarily just going to be the ones I think I’m most excited about, because some are more complicated than others. I want to start with some of the simpler ones and some of the ones that I am fairly excited about and I think are interesting, but that also are pedagogically useful to take a look at. I’ll start with amplification, which is the approach that I probably spend the most time thinking about.

The specific form of amplification that I want to start by talking about is imitative amplification, where imitative amplification specifically means the procedure of training a model to imitate a human with access to that model. The goal is to produce models imitating this HCH tree, which I mentioned earlier, this tree of humans consulting humans. The model is trained to mimic a human consulting the model, but then the models that that human is consulting were themselves trained to mimic humans consulting models, which were themselves trained to mimic humans consulting models. So in the limit, you’re hoping to end up with a model trained to mimic this tree of humans consulting humans consulting humans, which is HCH for humans consulting HCH.
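The recursive shape of the HCH tree can be sketched in Python. This is a deliberately tiny toy and not the training procedure itself: the question type (summing a list of numbers), the rule that an unaided human can only sum up to two numbers, and the function names are all hypothetical. It only illustrates how a human who can consult copies of the same process can answer questions that the unaided human cannot.

```python
def unaided_human(numbers):
    # Hypothetical limitation: without help, the human can only sum
    # lists of at most two numbers; otherwise they do not know.
    if len(numbers) <= 2:
        return sum(numbers)
    return None


def hch(numbers, depth):
    # A human answering a question, allowed to consult HCH itself on
    # subquestions, with `depth` levels of consultation remaining.
    direct_answer = unaided_human(numbers)
    if direct_answer is not None or depth == 0:
        return direct_answer

    # Split the question in half and consult a sub-tree on each half.
    middle = len(numbers) // 2
    left = hch(numbers[:middle], depth - 1)
    right = hch(numbers[middle:], depth - 1)
    if left is None or right is None:
        return None
    # Combining two sub-answers is itself a two-number sum.
    return left + right
```

For instance, `hch([1, 2, 3, 4], 0)` returns `None` because the unaided human cannot handle four numbers, while `hch([1, 2, 3, 4], 1)` returns `10`. In imitative amplification the model would be trained to imitate the output of such a tree rather than run the tree explicitly.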

The specific way to try to talk about this though in the overview of 11 proposals, is that there’s an additional thing which you can do on top of this imitative objective, which is, we also want to add some form of inner alignment solution. So we might think that imitative amplification is going to be good in the limit. It’s going to be good if we can actually produce models, which are really trying to predict HCH, but we also have to actually ensure that we’re doing that, that we don’t get deceptive models or other sort of bad inner misalignment problems.

And so we can add on top of imitative amplification, something which I call relaxed adversarial training. I have another post that came out of some of the work I did at OpenAI, which really goes in detail into how you would do something like relaxed adversarial training, what it means, and how it works. But fundamentally, the idea behind relaxed adversarial training is to train the model, both to do the base objective of trying to imitate the human consulting the model, but also on an additional objective of trying to decrease the probability that the model is doing something bad, like being deceptive, as assessed by some overseer which is looking at the model. In this case, in the case of amplification, the overseer would be the human with access to the model.

So you’re training overall on the objective of mimicking the human with access to the model and making it so that the human with access to the model, when that human with the help of the model uses transparency tools to look at the model and see what it’s doing, making sure that that human believes that the model is doing the right thing. And so you’re combining training on the imitative objective as well as on the relaxed adversarial training objective, trying to make it so that this oversight works.

Lucas Perry: How do you evaluate that in terms of your criteria?

Evan Hubinger: We can start with outer alignment. So the question of outer alignment is, is this HCH tree aligned? And the hope is that because this HCH is just composed of humans, if we can understand the human behavior and ensure that those humans are doing the right thing, which we might hope to do, since we’ve solved lots of human coordination problems in the past, then we can ensure that this thing is outer aligned.

For inner alignment, inner alignment is this question of relaxed adversarial training. So in particular, the way in which I think about solving relaxed adversarial training in this situation is that you have this big model space and you have some set of deceptive models. And if you want to avoid that set of deceptive models, one thing which you can do is you can find some larger set, which contains all the deceptive models, such that it’s very easy to check whether a model is in that set or not. As long as the outside of the set contains enough competitive models that are able to do the right thing, your relaxed adversarial training will work.

There’s a bunch of different ideas for how you could define a set like this, but the one which I’m most excited about, which I think about the most is to have the set be based on the distinction of myopia. Where you run this relaxed adversarial training check by looking at the model, determining whether the model cares about its multi-episode return or not. If the model does care about its multi-episode return, then it has the potential to become deceptive. And so we don’t allow that.

We have it so that the overseer, if it sees that it cares about the multi-episode return, oversight fails. And if it sees that it doesn’t care about the multi-episode return, then the oversight passes. And then we train the model to cause the oversight to pass so that it doesn’t develop this behavior where it cares about the multi-episode return. That’s the basic idea of how you could use relaxed adversarial training here to solve inner alignment.
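The combined objective described above can be sketched in a few lines. This is only an illustration under stated assumptions: `p_non_myopic` stands in for the overseer’s estimate (human plus model plus transparency tools) that the model cares about its multi-episode return, and the weight `lam` and the pass threshold are hypothetical hyperparameters that do not come from the conversation.

```python
def combined_loss(imitation_loss: float, p_non_myopic: float, lam: float = 1.0) -> float:
    """Imitation loss plus a penalty on the overseer's estimate of non-myopia.

    `p_non_myopic` and `lam` are hypothetical names used for illustration only.
    """
    if not 0.0 <= p_non_myopic <= 1.0:
        raise ValueError("p_non_myopic must be a probability")
    return imitation_loss + lam * p_non_myopic


def oversight_passes(p_non_myopic: float, threshold: float = 0.05) -> bool:
    """Oversight passes only when the overseer is confident the model is myopic."""
    return p_non_myopic < threshold


# A model that imitates well but looks non-myopic is still penalized.
loss_suspicious = combined_loss(imitation_loss=0.1, p_non_myopic=0.8)
loss_clean = combined_loss(imitation_loss=0.3, p_non_myopic=0.0)
```

Training against the sum pushes the model toward the region where the oversight check passes, which is the sense in which the check is conservative: some harmless non-myopic models are rejected along with the deceptive ones.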

And then you have training competitiveness, it’s sort of this imitation learning problem. We’re trying to train the model to imitate the human with access to the model. There’s all sorts of potential problems here. It could be very difficult to get a bunch of human data for this problem, and if you’re just doing imitation learning, there are some concerns about whether the model is really going to learn this sort of powerful long-term optimization, especially if you’re specifically selecting it to be myopic. Also, there’s a question of how rich the reward signal produced by imitation learning is. Imitation learning uses some sort of fixed distance metric to determine the distance between what the model produced and what was predicted.

Generally, this is done by having the language model produce a probability distribution, and then determining the probability under that distribution of the sentences produced by the human with access to the model. And so you have this problem of how rich is the training signal, how difficult is the training? I think we really don’t know what the training competitiveness is going to be, but I think there’s reason to be optimistic. I mean, a lot of these sorts of simple language modeling tasks, like GPT-2 and GPT-3, have been very successful. And this is very much in that sort of vein, so you can hope that it’ll be similarly successful.
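As a concrete picture of that training signal, here is a minimal sketch of scoring a human-written sentence under a model’s distribution. It assumes the model has already produced, for each token the human wrote, the probability it assigned to that token; the function names are hypothetical.

```python
import math


def sequence_log_prob(token_probs):
    """Sum of log-probabilities the model assigned to each human-written token."""
    return sum(math.log(p) for p in token_probs)


def imitation_loss(token_probs):
    """Average negative log-likelihood per token (lower means closer imitation)."""
    return -sequence_log_prob(token_probs) / len(token_probs)


# The model gave the human's two tokens probabilities 0.5 and 0.25.
example_loss = imitation_loss([0.5, 0.25])
```

The loss is zero only when the model puts probability 1 on every token the human actually produced, which is why the richness of this signal depends entirely on how much human data is available.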

And then you have the performance competitiveness, and for performance competitiveness, the real question is how powerful is this HCH thing? Can HCH actually solve all the problems you want to solve? And this is going to be a question of, how do you train your humans? Can the human take any question and decompose it into different sub questions such that the answer to those sub questions can be recombined into an answer to the whole question? Is this possible for all questions? There may be some questions which are undecomposable, in which case it wouldn’t be competitive, but if you can decompose all of these questions, it should have performance competitiveness.
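The decompose-and-recombine structure can be illustrated with a toy recursion. This is a sketch only: `toy_human` is a hypothetical stand-in for a person who answers a question by delegating sub questions, and the task (adding a list of integers) is chosen purely because it decomposes cleanly.

```python
def toy_human(question, ask):
    """Adds a list of integers, delegating each half to `ask` when there is more than one."""
    numbers = [int(x) for x in question.split()]
    if len(numbers) == 1:
        return str(numbers[0])  # simple enough to answer directly
    mid = len(numbers) // 2
    left = ask(" ".join(map(str, numbers[:mid])))
    right = ask(" ".join(map(str, numbers[mid:])))
    return str(int(left) + int(right))  # recombine the sub-answers


def hch(question, human, depth):
    """The human answers `question`, consulting depth-limited copies of the same process."""
    def ask(sub_question):
        if depth == 0:
            raise RuntimeError("decomposition budget exhausted")
        return hch(sub_question, human, depth - 1)
    return human(question, ask)


answer = hch("1 2 3 4", toy_human, depth=2)
```

With too little depth the same call raises an error, which mirrors the concern above: the tree is only as capable as the decompositions its humans can actually find within the available budget.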

So the one that I’ve been talking about, imitative amplification plus relaxed adversarial training, is the one that I work on the most, and it’s probably the one I’m most optimistic about. There are others that are also really interesting. I mean, I think every proposal on the list has lots of strong arguments in favor of it and lots of real reasons to consider it.

Another proposal which I have talked about previously is AI safety via debate. So AI safety via debate has some similar properties to amplification, but it has a different objective. Rather than trying to mimic HCH, you’re trying to do this debate procedure where you have two different models trying to defend two different sides of some proposition, and convince the judge to agree with one of those sides. There’s some strong arguments for why this sort of procedure would have nice outer alignment properties because it converges to something like what the judge’s equilibrium beliefs are after having seen all of the arguments in favor. And so we might hope that this sort of equilibrium beliefs are going to be good.

There’s lots of work also that’s been done to try to improve the ability to access like the true equilibrium beliefs to really give the judge all possible information. There’s some work that was done by Paul Christiano and Beth Barnes about this cross-examination mechanism, which gives it significantly better performance competitiveness properties by allowing it to access a larger complexity class. But then the other question is then what can you do on top of AI safety via debate to help with inner alignment?

An interesting thing that you can do is you can have the models trained to use transparency tools on each other to win the debate. If one model is able to look inside the other model and demonstrate that it’s non-myopic or something, then it wins the debate. And so then you can incentivize the models to play this game of trying to look inside of each other and find the inner alignment problems, and then train away from those problems.
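A toy version of that game might look like the following. It is a sketch under heavy assumptions: `looks_myopic` is a hypothetical flag standing in for whatever the opposing debater manages to demonstrate with transparency tools, and the judge is reduced to a function that picks one of two answers.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Debater:
    answer: str
    looks_myopic: bool  # hypothetical result of the opponent inspecting this model


def debate_rewards(
    first: Debater,
    second: Debater,
    judge: Callable[[str, str], str],
) -> Tuple[int, int]:
    """Return zero-sum rewards (first, second) for one debate."""
    # If exactly one debater is shown to be non-myopic, its opponent wins outright.
    if first.looks_myopic != second.looks_myopic:
        return (1, -1) if first.looks_myopic else (-1, 1)
    # Otherwise the judge picks the answer it finds more convincing.
    winner = judge(first.answer, second.answer)
    return (1, -1) if winner == first.answer else (-1, 1)


def prefers_second(answer_a: str, answer_b: str) -> str:
    """A placeholder judge that always sides with the second answer."""
    return answer_b


exposed = debate_rewards(Debater("yes", True), Debater("no", False), prefers_second)
judged = debate_rewards(Debater("yes", True), Debater("no", True), prefers_second)
```

Because the rewards always sum to zero, each model gains exactly what the other loses by being exposed, which is what creates the incentive to search for problems inside the opponent.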

For outer alignment, you really have this question of, are the debaters actually incentivized to give honest, cooperative, helpful answers in the limit? And this is a question of what that equilibrium really looks like. Is it the case that humans are able to be tricked by really persuasive arguments and no counterargument will unpersuade them? Or is it the case that when humans are exposed to all of the best arguments and counterarguments, they come to the right conclusions? We don’t really know, but this is fundamentally the outer alignment question here. This is sort of a trickier equilibrium to analyze.

With the amplification case, we just had to figure out this concrete tree of humans, but with debate, we have to understand something more abstract: what is the theoretical equilibrium behavior of a human under exposure to all these different arguments? I think there’s a case that from an outer alignment perspective, debate is at least more questionable.

Then we have inner alignment and the situation here is, how successfully can we incentivize these models to inspect each other and be able to identify catastrophic behavior? There’s certainly some problems here. You have to get a good training signal to be able to produce models, which are good at inspecting each other, which probably means you have to produce deceptive models and then have the models find the deception, or at least produce non-myopic models and then have the models find the non-myopia. But then this could also incentivize the models to hide the non-myopia, so it’s a little bit unclear exactly how this works. It’s sort of going to depend on precisely the properties of the training procedure.

I think there’s some good reasons to be hopeful here for training competitiveness, because the way that the debate game is structured, it lets you do this self-play structure where you have these agents competing on the zero-sum game in a self-play, which is in many ways, similar to a lot of big successes of things like AlphaZero, where AlphaZero is able to solve all these environments by doing this very successful self-play on these zero-sum games.

The biggest problem here is probably getting data, and then there’s the performance competitiveness question, which is how effective would an actual debate agent be? The hope is that in the limit, you should just be able to ask the first debater, what do you think? And then it’ll just choose the most convincing answer, and then you can just go with that. You don’t have to run the whole debate in deployment, but it’s only going to be able to solve these language problems.

What it’s going to give you is the equilibrium of what a human thinks after all of these different arguments, and is that good enough? Is it the case that humans are going to really be able to come to good enough equilibria after they see all these arguments that they’re going to be able to produce really good answers? And also, is it the case that question answering alone is sufficient to be able to be competitive in potentially a very competitive marketplace?

A third proposal that I think is interesting to go into is something called microscope AI. Microscope AI I think is really interesting to look at because it’s very different from the other proposals that I was just talking about. It has a very different approach to thinking about how do we solve these sorts of problems. For all of these approaches, we need to have some amount of ability to look inside of our models and learn something about what the model knows. But when you use transparency tools to look inside of the model, it teaches you multiple things. It teaches you about the model. You learn about what the model has learned. But it also teaches you about the world, because the model learned a bunch of useful facts, and if you look inside the model and you can learn those facts yourself, then you become more informed. And so this process itself can be quite powerful.

That’s fundamentally the idea of microscope AI. The idea of microscope AI is to train a predictive model on the data you want to understand, and then use transparency tools to understand what that model learned about that data, and then use that understanding to guide human decision making. And so if you’re thinking about outer alignment, in some sense, this procedure is not really outer aligned because we’re just trying to predict some data. And so that’s not really an aligned objective. If you had a model that was just trying to do a whole bunch of prediction, it wouldn’t be doing good things for the world.

But the hope is that if you’re just training a predictive model, it’s not going to end up being deceptive or otherwise dangerous. And you can also use transparency tools to ensure that it doesn’t become that. We still have to solve inner alignment, like I was saying. It still has to be the case that you don’t produce deceptive models. And in fact, the goal here really is not to produce mesa optimizers at all. The goal is just to produce these predictive systems, which learn a bunch of useful facts and information, but that aren’t running optimization procedures. And hopefully we can do that by having this very simple, predictive objective, and then also by using transparency tools.

And then training competitiveness, we know how to train powerful predictive models now, you know, something like GPT-2, and now GPT-3, these are predictive models on text prediction. And so we know this process, we know that we’re very good at it. And so hopefully we’ll be able to continue to be good at it into the future. The real sticky point with microscope AI is the performance competitiveness question. So is enhanced human understanding actually going to be sufficient to solve the use cases we might want for like advanced AI? I don’t know. It’s really hard to know the answer to this question, but you can imagine some situations where it is and some situations where it isn’t.

So, for situations where you need to do long-term, careful decision making, it probably would be, right? If you want to replace CEOs or whatever, that’s a sort of very general decision making process that can be significantly improved just by having much better human understanding of what’s happening. You don’t necessarily need the AI to be making the decision. On the other hand, if you need fine-grained manipulation tasks or very, very quick response times, AIs managing a factory or something, then maybe this wouldn’t be sufficient because you would need the AIs to be doing all of this quick decision making and you couldn’t have it just giving information to a human.

One specific situation, which I think is important to think about also is the situation of using your first AI system to help build a second AI system, and making sure that second AI system is aligned and competitive. I think that it also performs pretty well there. You could use a microscope AI to get a bunch of information about the process of AIs and how they work and how training works, and then get a whole bunch of information about that. Have the humans learn that information, then use that information to improve our building of the next AIs and other AIs that we build.

There are certain situations where microscope AI is performance competitive, situations where it wouldn’t be performance competitive, but it’s a very interesting proposal because it’s sort of tackling it from a very different angle. It’s like, well, maybe we don’t really need to be building agents. Maybe we don’t really need to be doing this stuff. Maybe we can just be building this microscope AI. I should mention the microscope AI idea comes from Chris Olah, who works at OpenAI. The debate idea comes from Geoffrey Irving, who’s now at DeepMind, and the amplification comes from Paul Christiano, who’s at OpenAI.

Lucas Perry: Yeah, so for sure, the best place to review these is by reading your post. And again, the post is “An overview of 11 proposals for building safe advanced AI” by Evan Hubinger and that’s on the AI Alignment Forum.

Evan Hubinger: That’s right. I should also mention that a lot of the stuff that I talked about in this podcast is coming from the Risks from Learned Optimization in Advanced Machine Learning Systems paper.

Lucas Perry: All right. Wrapping up here, I’m interested in ending on a broader note. I’m just curious to know if you have concluding thoughts about AI alignment, how optimistic are you that humanity will succeed in building aligned AI systems? Do you have a public timeline that you’re willing to share about AGI? How are you feeling about the existential prospects of earth-originating life?

Evan Hubinger: That’s a big question. So I tend to be on the pessimistic side. My current view looking out on the field of AI and the field of AI safety, I think there’s a lot of really challenging, difficult problems that we are at least not currently equipped to solve. And it seems quite likely that we won’t be equipped to solve by the time we need to solve them. I tend to think that the prospects for humanity aren’t looking great right now, but I nevertheless have a very sort of optimistic disposition, we’re going to do the best that we can. We’re going to try to solve these problems as effectively as we possibly can and we’re going to work on it and hopefully we’ll be able to make it happen.

In terms of timelines, it’s such a complex question. I don’t know if I’m willing to commit to some timeline publicly. I think that it’s just one of those things that is so uncertain. It’s just so important for us to think about what we can do across different possible timelines and be focusing on things which are generally effective regardless of how it turns out, because I think we’re really just quite uncertain. It could be as soon as five years or as long away as 50 years or 70 years, we really don’t know.

I don’t know if we have great track records of prediction in this setting. Regardless of when AI comes, we need to be working to solve these problems and to get more information on these problems, to get to the point we understand them and can address them because when it does get to the point where we’re able to build these really powerful systems, we need to be ready.

Lucas Perry: So you do take very short timelines, like say 5 to 10 to 15 years very seriously.

Evan Hubinger: I do take very short timelines very seriously. I think that if you look at the field of AI right now, there are these massive organizations, OpenAI and DeepMind that are dedicated to the goal of producing AGI. They’re putting huge amounts of research effort into it. And I think it’s incorrect to just assume that they’re going to fail. I think that we have to consider the possibility that they succeed and that they do so quite soon. A lot of the top people at these organizations have very short timelines, and so I think that it’s important to take that claim seriously and to think about what happens if it’s true.

I wouldn’t bet on it. There’s a lot of analysis that seems to indicate that at the very least, we’re going to need more compute than we have in that sort of a timeframe, but timeline prediction tasks are so difficult that it’s important to consider all of these different possibilities. I think that, yes, I take the short timelines very seriously, but it’s not the primary scenario. I think that I also take long timeline scenarios quite seriously.

Lucas Perry: Would you consider DeepMind and OpenAI to be explicitly trying to create AGI? OpenAI, yes, right?

Evan Hubinger: Yeah. OpenAI, it’s just part of the mission statement. DeepMind, some of the top people at DeepMind have talked about this, but it’s not something that you would find on the website the way you would with OpenAI. If you look at historically some of the things that Shane Legg and Demis Hassabis have said, a lot of it is about AGI.

Lucas Perry: Yeah. So in terms of these being the leaders with just massive budgets and person power, how do you see the quality and degree of alignment and beneficial AI thinking and mindset within these organizations? Because there seems to be a big distinction between the AI alignment crowd and the mainstream machine learning crowd. A lot of the mainstream ML community hasn’t been exposed to many of the arguments or thinking within the safety and alignment crowd. Stuart Russell has been trying hard to shift away from the standard model and incorporate a lot of these new alignment considerations. So yeah. What do you think?

Evan Hubinger: I think this is a problem that is getting a lot better. Like you were mentioning, Stuart Russell has been really great on this. CHAI has been very effective at trying to really get this message of, we’re building AI, we should put some effort into making sure we’re building safe AI. I think this is working. If you look at a lot of the major ML conferences recently, I think basically all of them had workshops on beneficial AI. DeepMind has a safety team with lots of really good people. OpenAI has a safety team with lots of really good people.

I think that the standard story of, oh, AI safety is just this thing that people who aren’t involved in machine learning think about, is no longer accurate. In the current world it has become much more integrated with machine learning and is becoming more mainstream. But it’s definitely still a process. Like Stuart Russell says, the field of AI has been very focused on the sort of standard model, and trying to move people away from that and to think about some of its consequences takes time and takes some sort of evolution of a field, but it is happening. I think we’re moving in a good direction.

Lucas Perry: All right, well, Evan, I’ve really enjoyed this. I appreciate you explaining all of this and taking the time to unpack a lot of this machine learning language and concepts to make it digestible. Is there anything else here that you’d like to wrap up on or any concluding thoughts?

Evan Hubinger: If you want more detailed information on all of the things that I’ve talked about, the full analysis of inner alignment and outer alignment is in Risks from Learned Optimization in Advanced Machine Learning Systems by me and my co-authors, as well as in the “An overview of 11 proposals” post, which you can find on the AI Alignment Forum. I think both of those are resources which I would recommend checking out for understanding more about what I talked about in this podcast.

Lucas Perry: Do you have any social media or a website or anywhere else for us to point towards?

Evan Hubinger: Yeah, so you can find me on all the different sorts of social media platforms. I’m fairly active on GitHub. I do a bunch of open source development. You can find me on LinkedIn, Twitter, Facebook, all those various different platforms. I’m fairly Google-able. It’s nice to have a fairly unique last name. So if you Google me, you should find all of this information.

One other thing, which I should mention specifically, everything that I do is all public. All of my writing is public. I try to publish all of my work and I do so on the AI Alignment Forum. So the AI Alignment Forum is a really, really great resource because it’s a collection of writing by all of these different AI safety authors. It’s open to anybody who’s a current AI safety researcher, and you can find me on the AI Alignment Forum as evhub, I’m E-V-H-U-B on the AI Alignment Forum.

Lucas Perry: All right, Evan, thanks so much for coming on today, and it’s been quite enjoyable. This has probably been one of the more fun AI alignment podcasts that I’ve had in a while. So thanks a bunch and I appreciate it.

Evan Hubinger: Absolutely. That’s super great to hear. I’m glad that you enjoyed it. Hopefully everybody else does as well.

End of recorded material

### [AN #106]: Evaluating generalization ability of learned reward models

July 1, 2020 - 20:20
Published on July 1, 2020 5:20 PM GMT

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).

SECTIONS
• HIGHLIGHTS
• TECHNICAL AI ALIGNMENT
• TECHNICAL AGENDAS AND PRIORITIZATION
• INTERPRETABILITY
• OTHER PROGRESS IN AI
• EXPLORATION
• REINFORCEMENT LEARNING
• META LEARNING
• UNSUPERVISED LEARNING

HIGHLIGHTS

Quantifying Differences in Reward Functions (Adam Gleave et al) (summarized by Rohin): Current work on reward learning typically evaluates the learned reward models by training a policy to optimize the learned reward, and seeing how well that policy performs according to the true reward. However, this only tests how well the reward works in the particular environment you test in, and doesn’t tell you how well the reward will generalize. For example, suppose the user loves apricots, likes plums, but hates durians. A reward that has apricots > durians > plums works perfectly -- until the store runs out of apricots, in which case it buys the hated durian.
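The fruit example can be made concrete with a small sketch. The numeric reward values are assumptions chosen only to match the stated orderings (the user loves apricots, likes plums, hates durians), and `best_purchase` is a hypothetical helper.

```python
# Assumed numbers: only the orderings come from the example above.
true_reward = {"apricot": 2, "plum": 1, "durian": -1}
learned_reward = {"apricot": 3, "durian": 2, "plum": 1}


def best_purchase(reward, in_stock):
    """Pick the in-stock fruit with the highest reward."""
    return max(in_stock, key=lambda fruit: reward[fruit])


full_shelf = ["apricot", "plum", "durian"]
no_apricots = ["plum", "durian"]

same_when_stocked = best_purchase(true_reward, full_shelf) == best_purchase(learned_reward, full_shelf)
learned_choice_without_apricots = best_purchase(learned_reward, no_apricots)
```

Evaluating only the optimal policy in the first environment would report the learned reward as perfect, even though the two rewards disagree as soon as the available options change.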

So, it seems like we should evaluate reward functions directly, rather than looking at their optimal policies. This paper proposes Equivalent-Policy Invariant Comparison (EPIC), which can compare two reward functions while ignoring any potential shaping that doesn’t affect the optimal policy.

EPIC is parameterized by a distribution of states and actions DS and DA, as well as a distribution DT over transitions (s, a, s’). The first step is to find canonical versions of the two rewards to be compared, such that they have expected zero reward over DS and DA, and any potential shaping is removed. Then, we look at the reward each of these would assign to transitions in DT, and compute the Pearson correlation. This is transformed to be in the range [0, 1], giving the EPIC distance.

The authors prove that EPIC is a pseudometric, that is, it behaves like a distance function, except that it is possible for EPIC(R1, R2) to be zero even if R1 and R2 are different. This is desirable, since if R1 and R2 differ by a potential shaping function, then their optimal policies are guaranteed to be the same regardless of transition dynamics, and so we should report the “distance” between them to be zero.
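To see the shaping invariance in action, here is a minimal tabular sketch of the procedure summarized above. Several details are assumptions rather than statements from the summary: DS and DA are taken to be uniform, DT is taken to be uniform over all (s, a, s’) triples, the canonicalization formula and the transform sqrt((1 - ρ) / 2) follow my understanding of the paper and should be checked against it, and all function names are hypothetical.

```python
import math


def canonicalize(R, gamma):
    """Canonical form of a tabular reward R[s][a][s2] under uniform D_S and D_A (assumed)."""
    n_s, n_a = len(R), len(R[0])
    # mean_from[s] = E[R(s, A, S')] with A and S' drawn uniformly.
    mean_from = [
        sum(R[s][a][s2] for a in range(n_a) for s2 in range(n_s)) / (n_a * n_s)
        for s in range(n_s)
    ]
    overall = sum(mean_from) / n_s  # E[R(S, A, S')]
    return [
        [
            [R[s][a][s2] + gamma * mean_from[s2] - mean_from[s] - gamma * overall
             for s2 in range(n_s)]
            for a in range(n_a)
        ]
        for s in range(n_s)
    ]


def flatten(R):
    """List the reward of every (s, a, s2) transition, i.e. a uniform D_T (assumed)."""
    return [value for plane in R for row in plane for value in row]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def epic_distance(R1, R2, gamma=0.9):
    """Pearson distance between canonicalized rewards, mapped into [0, 1]."""
    rho = pearson(flatten(canonicalize(R1, gamma)), flatten(canonicalize(R2, gamma)))
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot
    return math.sqrt((1.0 - rho) / 2.0)


GAMMA = 0.9
STATES, ACTIONS = range(2), range(2)
potential = [0.0, 5.0]  # arbitrary shaping potential

# Sparse reward: 1 for arriving in state 1, else 0.
sparse = [[[1.0 if s2 == 1 else 0.0 for s2 in STATES] for a in ACTIONS] for s in STATES]
# Rescaled and potential-shaped version of the same reward.
shaped = [[[2.0 * sparse[s][a][s2] + GAMMA * potential[s2] - potential[s]
            for s2 in STATES] for a in ACTIONS] for s in STATES]
# Opposite reward.
negated = [[[-sparse[s][a][s2] for s2 in STATES] for a in ACTIONS] for s in STATES]

d_shaped = epic_distance(sparse, shaped, GAMMA)
d_negated = epic_distance(sparse, negated, GAMMA)
```

In this toy setup `d_shaped` comes out at approximately zero even though `sparse` and `shaped` assign different numbers to every transition, while `d_negated` sits at the maximum of the range.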

The authors show how to approximately compute the EPIC distance in high dimensional environments, and run experiments to showcase EPIC’s properties. Their first experiment demonstrates that EPIC is able to correctly detect that a densely shaped reward for various MuJoCo environments is equivalent to a sparse reward, whereas other baseline methods are not able to do so. The second experiment compares reward models learned from preferences, demonstrations, and direct regression, and finds that the EPIC distance for the rewards learned from demonstrations is much higher than those for preferences and regression. Indeed, when the rewards are reoptimized in a new test environment, the new policies work when using the preference or regression reward models, but not when using the demonstration reward model. The final experiment shows that EPIC is robust to variations in the visitation distribution DT, while baseline methods are not.

Rohin's opinion: It’s certainly true that we don’t have good methods for understanding how well our learned reward models generalize, and I’m glad that this work is pushing in that direction. I hope that future papers on reward models report EPIC distances to the ground truth reward as one of their metrics (code is available here).

One nice thing is that, roughly speaking, rewards are judged to be equivalent if they would generalize to any possible transition function that is consistent with DT. This means that by designing DT appropriately, we can capture how much generalization we want to evaluate. This is a useful knob to have: if we used the maximally large DT, the task would be far too difficult, as it would be expected to generalize far more than even humans can.

TECHNICAL AI ALIGNMENT
TECHNICAL AGENDAS AND PRIORITIZATION

Plausible cases for HRAD work, and locating the crux in the "realism about rationality" debate (Issa Rice) (summarized by Rohin): This post tries to identify the possible cases for highly reliable agent design (HRAD) work to be the main priority of AI alignment. HRAD is a category of work at MIRI that aims to build a theory of intelligence and agency that can explain things like logical uncertainty and counterfactual reasoning.

The first case for HRAD work is that by becoming less confused about these phenomena, we will be able to help AGI builders predict, explain, avoid, detect, and fix safety issues and help to conceptually clarify the AI alignment problem. For this purpose, we just need conceptual deconfusion -- it isn’t necessary that there must be precise equations defining what an AI system does.

The second case is that if we get a precise, mathematical theory, we can use it to build an agent that we understand “from the ground up”, rather than throwing the black box of deep learning at the problem.

The last case is that understanding how intelligence works will give us a theory that allows us to predict how arbitrary agents will behave, which will be useful for AI alignment in all the ways described in the first case and more (AN #66).

Looking through past discussions on the topic, the author believes that people at MIRI primarily believe in the first two cases. Meanwhile, critics (particularly me) say that it seems pretty unlikely that we can build a precise, mathematical theory, and a more conceptual but imprecise theory may help us understand reasoning better but is less likely to generalize sufficiently well to say important and non-trivial things about AI alignment for the systems we are actually building.

Rohin's opinion: I like this post -- it seems like an accessible summary of the state of the debate so far. My opinions are already in the post, so I don’t have much to add.

The flaws that make today’s AI architecture unsafe and a new approach that could fix it (Rob Wiblin and Stuart Russell) (summarized by Rohin): This podcast delves into many of the ideas in Stuart’s book Human Compatible (AN #69). Rob especially pushes on some aspects that are less talked about in the AI safety community, like the enfeeblement problem and whether we’d be locking in suboptimal values. They also discuss Stuart’s response to some counterarguments.

Rohin's opinion: One of the counterarguments the podcast talks about is my position (AN #80) that we’ll probably learn from smaller catastrophes in order to avoid actual extinction. I just want to note that while it might sound like I disagree with Stuart on this point, I don’t think we actually do. I was arguing against the position that extinction is the default outcome (> 50% probability) while Stuart is arguing against the position that extinction is near-impossible (~0% probability). I ended up around 10%; I’d guess that if Stuart were forced to, he’d give a number similar to mine, for similar reasons as me.

INTERPRETABILITY

Towards A Rigorous Science of Interpretable Machine Learning (Finale Doshi-Velez et al) (summarized by Robert): This paper from 2017 discusses the field of interpretability research, and how it can be made more rigorous and well-defined. The authors first highlight the problem of defining interpretability in the first place - they don't have a resolution to this problem, but suggest that we can think of interpretability in terms of what it's used for. They claim that interpretability is used for confirming other important desiderata in ML systems, which stem from an incompleteness in the problem formalization. For example, if we want a system to be unbiased but aren't able to formally specify this in the reward function, or the reward we're optimising for is only a proxy of the true reward, then we could use interpretability to inspect our model and see whether it's reasoning how we want it to.

The authors next move on to discussing how we can evaluate interpretability methods, providing a taxonomy of different evaluation methods: Application-grounded is when the method is evaluated in the context it will actually be used in, by real humans (e.g. doctors getting explanations for AI diagnoses); Human-grounded is about conducting simpler experiments with human subjects (who are perhaps not domain experts), possibly using simpler tasks than the method's intended purpose; Functionally-grounded is where no humans are involved in the experiments, and instead some formal notion of interpretability is measured for the method to evaluate its quality. Each of these evaluation methods can be used in different circumstances, depending on the method and the context it will be used in.

Finally, the authors propose a data-driven approach to understanding the factors which are important in interpretability. They propose to try and create a dataset of applications of machine learning models to tasks, and then analyse this dataset to find important factors. They list some possible task- and method-related factors, and then conclude with recommendations to researchers doing interpretability.

Robert's opinion: I like the idea of interpretability being aimed at trying to fill in mis- or under-specified optimisation objectives. I think this proposes that interpretability is more useful for outer alignment, which is interesting as I think that most people in the safety community think interpretability could help with inner alignment (for example, see An overview of 11 proposals for building safe advanced AI (AN #102), in which transparency (which could be seen as interpretability) is used to solve inner alignment in 4 of the proposals).

OTHER PROGRESS IN AI
EXPLORATION

Planning to Explore via Self-Supervised World Models (Ramanan Sekar, Oleh Rybkin et al) (summarized by Flo): PlaNet (AN #33) learns a latent world model which can be used for planning, and Dreamer (AN #83) extends the idea by performing RL within the learned latent world model instead of requiring interaction with the environment. However, we still need to efficiently explore the real environment to obtain training data for the world model.

The authors propose to augment Dreamer with a novel exploration strategy. In addition to the learned latent world model, an ensemble of simpler one-step world models is trained and the magnitude of disagreement within the ensemble for a state is used as a proxy for the information gain for reaching that state. This is used as a (dynamically changing) intrinsic reward that can guide planning. By training Dreamer on this intrinsic reward, we can identify informative states in the real environment without having to first visit similar states as would be the case with e.g. curiosity, where the intrinsic reward is computed in retrospect.
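The disagreement signal can be sketched as follows. This is an illustrative assumption, not the paper’s exact formulation: each ensemble member is represented only by its predicted next-state vector, and disagreement is measured as the per-dimension variance across members, averaged over dimensions.

```python
from statistics import pvariance


def disagreement(predictions):
    """Mean per-dimension variance across an ensemble's next-state predictions.

    `predictions` holds one equal-length vector per ensemble member.
    """
    dims = list(zip(*predictions))  # regroup values by state dimension
    return sum(pvariance(dim) for dim in dims) / len(dims)


# Members agree on a familiar state, so there is nothing new to learn there.
familiar = disagreement([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
# Members disagree on a novel state, so reaching it should be informative.
novel = disagreement([[0.0, 0.0], [2.0, 0.0]])
```

Using this quantity as the intrinsic reward means a planner can score imagined states before visiting them, which is the contrast with retrospective curiosity bonuses drawn above.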

The resulting system achieves state of the art zero-shot learning on a variety of continuous control tasks, and often comes close to the performance of agents that were trained for the specific task.


Flo's opinion: Planning to reach states where a lot of information is gained seems like a very promising strategy for exploration. I am not sure whether building sufficiently precise world models is always as feasible as model-free RL. If it were, misspecified rewards and similar problems would probably become easier to catch, as rollouts of a policy using a precise world model can help us predict what kind of worlds this policy produces without deployment. On the other hand, the improved capabilities for transfer learning could lead to more ubiquitous deployment of RL systems and amplify remaining failure modes, especially those stemming from multiagent interactions (AN #70).

REINFORCEMENT LEARNING

Learning to Play No-Press Diplomacy with Best Response Policy Iteration (Thomas Anthony, Tom Eccles et al) (summarized by Asya): Diplomacy is a game with simple rules where 7 players simultaneously move units every turn to capture territory. Units are evenly matched by default, so winning relies on getting support from some players against others. 'No-Press' Diplomacy limits communication between players to only orders submitted to units, removing the complex verbal negotiations that characterize traditional gameplay.

Previous state-of-the-art No-Press Diplomacy methods were trained to imitate human actions after collecting a dataset of 150,000 human Diplomacy games. This paper presents a new algorithmic method for playing No-Press Diplomacy using a policy iteration approach initialized with human imitation. To find better policies, their methods use "best response" calculations, where the best response policy for some player is the policy that maximizes the expected return for that player against opponent policies. Diplomacy is far too large for exact best response calculation, so the paper introduces an approximation, "Sampled Best Response", which

- Uses Monte-Carlo sampling to estimate opponents' actions each turn

- Only considers a small set of actions sampled from each candidate best response policy

- Only tries to make a single-turn improvement to its policy (rather than trying to optimize for the whole rest of the game)
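The three approximations listed above can be sketched on a toy game. This is only an illustration under stated assumptions: rock-paper-scissors stands in for Diplomacy, `payoff` stands in for a learned value function, and all names (`sampled_best_response`, `mostly_rock`, `uniform`) are hypothetical rather than taken from the paper.

```python
import random

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(action, opponent_action):
    """Toy single-turn value: +1 for a win, -1 for a loss, 0 for a draw."""
    if action == opponent_action:
        return 0
    return 1 if BEATS[action] == opponent_action else -1

def sampled_best_response(candidate_policy, opponent_policy, value_fn,
                          num_candidates=8, num_opponent_samples=32, rng=None):
    rng = rng or random.Random(0)
    # 1. Monte-Carlo estimate of what the opponent will do this turn.
    opponent_samples = [opponent_policy(rng) for _ in range(num_opponent_samples)]
    # 2. Only a small set of candidate actions, sampled from a candidate policy.
    candidates = sorted({candidate_policy(rng) for _ in range(num_candidates)})
    # 3. Single-turn improvement: keep the candidate with the best average value.
    def average_value(action):
        return sum(value_fn(action, opp) for opp in opponent_samples) / num_opponent_samples
    return max(candidates, key=average_value)

def mostly_rock(rng):
    """Hypothetical opponent policy that plays rock 80% of the time."""
    return rng.choices(ACTIONS, weights=[0.8, 0.1, 0.1])[0]

def uniform(rng):
    """Hypothetical candidate policy that proposes actions uniformly at random."""
    return rng.choice(ACTIONS)

print(sampled_best_response(uniform, mostly_rock, payoff))
```

The result is only as good as the sampled candidates: if the true best action is never proposed by the candidate policy, this procedure cannot find it.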

Similar to other policy iteration methods, the paper creates a dataset of games every iteration using its Sampled Best Response method, then trains neural networks to create policy and value functions that predict the actions chosen by Sampled Best Response. To remedy issues where Sampled Best Response continually cycles through the best strategy for the last iteration, the paper tries several variants of a technique called "Fictitious Play". In the best-performing variant, the policy network is trained to predict the latest Sampled Best Response given explicitly averaged historical opponent and player policies, rather than just the latest policies.

The paper's methods outperform existing algorithmic methods for No-Press Diplomacy on a variety of metrics, but are still fairly few-shot exploitable: at the end of training, the strongest (non-human) exploiter of the final policy wins 48% of the time. They also find that the strongest exploit doesn't change much through training, though few-shot exploitability does decrease from the beginning of training to the end.


Asya's opinion: This paper represented real progress in automated Diplomacy, but is still far from human-level. I’ll be pretty interested to see whether we can reach human-level by creating improved self-play algorithms, like the one presented in this paper, and the ones used for Poker and Go, or if we will have to wait for novel, more general reasoning algorithms applied to Diplomacy. Unlike Poker, Diplomacy against multiple human players involves collusion and implicit signalling, even with No Press. It seems possible to me that it is very difficult to become good at modeling those dynamics through self-play alone. If we did get to human-level through self-play, it would make me more optimistic about the extent to which training is likely to be a bottleneck in other domains which require sophisticated models of human behavior.

META LEARNING

Learning to Continually Learn (Shawn Beaulieu et al) (summarized by Robert): This paper presents the ANML (A Neuromodulated Meta-Learning algorithm) method for countering catastrophic forgetting in continual learning. Continual learning is a problem setting where the system is presented with several tasks in sequence, and must maintain good performance on all of them. When training on new tasks, neural networks often “forget” how to perform the previous tasks, which is called catastrophic forgetting. This makes the naive approach of just training on each task in sequence ineffective.

The paper has two main ideas. First, rather than avoiding catastrophic forgetting by using hand-crafted solutions (e.g. previous methods have encouraged sparsity), the authors use meta-learning to directly optimise for this goal. This is done by learning a network parameterization which, after training sequentially on many tasks, will get good performance on all tasks. This outer loop objective can be optimised for directly by taking higher order gradients (gradients of gradients). The second idea is a novel form of neuromodulation. This takes the form of a neuromodulatory (NM) network, which takes the same input as the prediction network, and gates the prediction network’s forward pass. This provides direct control of the output of the prediction network, but also indirect control of the learning of the prediction network, as gradients will only flow through the paths which haven’t been zeroed out by the gating mechanism.
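To make the gating idea concrete, here is a minimal sketch of a single gated layer in plain Python. It is an illustration under assumptions, not the paper's architecture: the layer sizes, the weight values, and the clipped-linear gate (chosen so that exact zeros occur) are all hypothetical.

```python
def linear(weights, x):
    """Matrix-vector product with weights given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def gated_forward(x, pred_weights, nm_weights):
    """Forward pass where the NM network gates the prediction network's features.

    Both networks receive the same input x. The gate lies in [0, 1] and
    multiplies the prediction features elementwise.
    """
    hidden = [max(0.0, h) for h in linear(pred_weights, x)]        # ReLU features
    gate = [min(1.0, max(0.0, g)) for g in linear(nm_weights, x)]  # clipped gate (assumption)
    gated = [g * h for g, h in zip(gate, hidden)]
    return gated, gate

def grad_of_sum_wrt_hidden(gate):
    """d(sum(gated)) / d(hidden_j) = gate_j, so closed gates block the gradient."""
    return list(gate)

x = [1.0, 2.0]
pred_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nm_weights = [[1.0, 0.0], [-1.0, 0.0], [0.0, 0.25]]

gated, gate = gated_forward(x, pred_weights, nm_weights)
print("gated activations:", gated)
print("gradient reaching each hidden unit:", grad_of_sum_wrt_hidden(gate))
```

The second hidden unit is active in the prediction network but its gate is zero, so it contributes nothing to the output and receives no gradient. This is the sense in which gating controls both the prediction and the learning.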

Their method achieves state-of-the-art results on continual learning in Omniglot, a few-shot dataset consisting of 1623 characters, each with only 20 hand-drawn examples. The network has to learn a sequence of tasks (e.g. classifying a character) with only 15 examples, and is then tested on overall performance over all the classes it’s learned. Their network gets 60% accuracy when presented with 600 classes in a row. A classifier trained with the same data but shuffled independently at random only gets 68% accuracy, implying that the catastrophic forgetting of their network only cost 8 percentage points. Their method also learns a form of sparsity in the activations of the network in a much better way than the hand-crafted methods - while per-class activations are very sparse, no neurons are wasted, as they all still activate over the entire dataset.


Robert's opinion: This paper is interesting because it's a demonstration of the power of meta-learning to formulate the true optimisation objective. Often in machine learning much research is devoted to the manual path of trying to find the correct inductive biases to solve hard problems (such as catastrophic forgetting). Instead, this paper shows we can use methods like meta-learning to learn these inductive biases (such as sparsity) automatically, by optimising directly for what we want. This relates to (and is motivated by) AI-Generating Algorithms (AN #63). Obviously, this method still uses the neuromodulatory network as an architectural inductive bias - it'd be interesting to see whether we could somehow learn this method (or something more specific) as well, perhaps through neural architecture search or just using a larger network which has the representational capacity to perform something like the gating operation.

UNSUPERVISED LEARNING

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments (Mathilde Caron et al) (summarized by Rohin): There has been a lot of work in self-supervised representation learning for image classification (previously summarized in AN #92 and AN #99). This paper sets a new SOTA of 75.3% top-1 ImageNet accuracy, when allowed to first do self-supervised representation learning on ImageNet, and then to train a linear classifier on top of the learned features using all of ImageNet.

Previous methods use a contrastive loss across the learned representations (possibly after being processed by a few MLP layers), which can be thought of as using the learned representation to predict the representation of augmented versions of the same input. In contrast, this paper uses the representation to predict “codes” of augmented versions, where the codes are computed using clustering.


Rohin's opinion: I’m not sure why we should expect this method to work, but empirically it does. Presumably I’d understand the motivation better if I read through all the related work it’s building on.


Big Self-Supervised Models are Strong Semi-Supervised Learners (Ting Chen et al) (summarized by Rohin): Previously, SimCLR (AN #99) showed that you can get good results on semi-supervised learning on ImageNet, by first using self-supervised learning with a contrastive loss to learn good representations for images, and then finetuning a classifier on top of the representations with very few labels. This paper reports a significantly improved score, using three main improvements:

1. Making all of the models larger (in particular, deeper).

2. Incorporating momentum contrast, as done previously (AN #99).

3. Using model distillation to train a student network to mimic the original finetuned classifier.
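For readers unfamiliar with the third item, the sketch below shows one common form of a distillation objective: the student is trained to match the teacher's softened output distribution. The temperature parameter, the function names, and the example logits are assumptions for illustration, and the paper's exact loss may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exps = [math.exp(z - largest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between the teacher's soft labels and the student's predictions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

teacher = [2.0, 1.0, 0.0]
agreeing_student = [2.0, 1.0, 0.0]
disagreeing_student = [0.0, 1.0, 2.0]

print(distillation_loss(agreeing_student, teacher))
print(distillation_loss(disagreeing_student, teacher))
```

No ground-truth labels appear in this loss, which is why distillation can make use of unlabeled images: the teacher supplies the targets.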

On linear classification on top of learned features with a ResNet-50 architecture, they get a top-1 accuracy of 71.7%, so lower than the previous paper. Their main contribution is to show what can be done with larger models. According to top-1 accuracy on ImageNet, the resulting system gets 74.9% with 1% of labels, and 80.1% with 10% of labels. In comparison, standard supervised learning with a ResNet-50 (which is about 33x smaller) achieves 76.6% with all labels, and just 57.9% with 1% of labels and 68.4% with 10% of labels. When they distill down their biggest model into a ResNet-50, it gets 73.9% with 1% of labels and 77.5% with 10% of labels.


Rohin's opinion: It continues to baffle me why model distillation is so helpful -- you’d think that if you train a student model to mimic a teacher model, it would do at most as well as the teacher, but in fact it seems to do better. It's remarkable that just "training a bigger model and then distilling it down" leads to an increase of 16 percentage points (when we just have 1% of the labels). Another thing to add to the list of weird empirical facts about deep learning that we don’t understand.

FEEDBACK
I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST
An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.


### PSA: Cars don't have 'blindspots'

July 1, 2020 - 20:04
Published on July 1, 2020 5:04 PM GMT

Every time I have driven someone else's car, the mirrors have been set up incorrectly. Infographic here.

With correct adjustment, as cars leave your rear view mirror they enter your side view, and as they exit your side view they enter your peripheral vision. You still need to move your head to check for motorcycles and to maintain situational awareness, but properly adjusted mirrors drastically reduce how often you are surprised.


July 1, 2020 - 12:46
Published on July 1, 2020 9:46 AM GMT

Highlights
1. Facebook launches Forecast, a community for crowdsourced predictions.
2. Foretell, a forecasting tournament by the Center for Security and Emerging Technology, is now open.
3. A Preliminary Look at Metaculus and Expert Forecasts: Metaculus forecasters do better.

Index
• Highlights.
• In the News.
• Prediction Markets & Forecasting Platforms.
• Negative Examples.
• Hard to Categorize.
• Long Content.

In the News.
• Facebook releases a forecasting app (link to the app, press release, TechCrunch take, hot-takes). The release comes before Augur v2 launches, and it is easy to speculate that it might end up being combined with Facebook's stablecoin, Libra.

• The Economist has a new electoral model out (article, model) which gives Trump an 11% chance of winning reelection. Given that Andrew Gelman was involved, I'm hesitant to criticize it, but it seems a tad overconfident. See here for Gelman addressing objections similar to my own.

• COVID-19 vaccine before US election. Analysts see White House pushing through vaccine approval to bolster Trump's chances of reelection before voters head to polls. "All the datapoints we've collected make me think we're going to get a vaccine prior to the election," Jared Holz, a health-care strategist with Jefferies, said in a phone interview. The current administration is "incredibly incentivized to approve at least one of these vaccines before Nov. 3."

• "Israeli Central Bank Forecasting Gets Real During Pandemic". Israeli Central Bank is using data to which it has real-time access, like credit-card spending, instead of lagging indicators.

• Google produces wind schedules for wind farms. "The result has been a 20 percent increase in revenue for wind farms". See here for essentially the same thing on solar forecasting.

• Survey of macroeconomic researchers predicts economic recovery will take years, reports 538.

Prediction Markets & Forecasting Platforms.

Ordered in subjective order of importance:

• Foretell, a forecasting tournament by the Center for Security and Emerging Technology, is now open. I find the thought heartening that this might end up influencing bona-fide politicians.

• Metaculus

• Replication Markets might add a new round with social and behavioral science claims related to COVID-19, and a preprint market, which would ask participants to forecast items like publication or citation. Replication Markets is also asking for more participants, with the catchline "If they are knowledgeable and opinionated, Replication Markets is the place to be to make your opinions really count."

• Good Judgement family

• Good Judgement Open: Superforecasters were able to detect that Russia and the USA would in fact undertake some (albeit limited) form of negotiation, and do so much earlier than the general public, even while posting their reasons in full view.
• Good Judgement Analytics continues to provide its COVID-19 dashboard.
• PredictIt & Election Betting Odds. I stumbled upon an old 538 piece on fake polls: Fake Polls are a Real Problem. Some polls may have been conducted by PredictIt traders in order to mislead or troll other PredictIt traders; all in all, an amusing example of how prediction markets could encourage worse information.

• An online prediction market with reputation points, implementing an idea by Paul Christiano. As of yet slow to load.

• Augur:

• Coronavirus Information Markets is down to ca. $12000 in trading volume; it seems like they didn't take off.

Negative Examples.

Hard to Categorize.

• A Personal COVID-19 Postmortem, by FHI researcher David Manheim.

I think it's important to clearly and publicly admit when we were wrong. It's even better to diagnose why, and take steps to prevent doing so again. COVID-19 is far from over, but given my early stance on a number of questions regarding COVID-19, this is my attempt at a public personal review to see where I was wrong.

• FantasyScotus beat GoodJudgementOpen on legal decisions. I'm still waiting to see whether Hollywood Stock Exchange will also beat GJOpen on film predictions.

• How does pandemic forecasting resemble the early days of weather forecasting; what lessons can the USA learn from the latter about the former? An example would be to create an organization akin to the National Weather Center, but for forecasting.

• Linch Zhang, a COVID-19 forecaster with an excellent track-record, is doing an Ask Me Anything, starting on Sunday the 7th; questions are welcome!

• The Rules To Being A Sellside Economist. A fun read.

1. How to get attention: If you want to get famous for making big non-consensus calls, without the danger of looking like a muppet, you should adopt ‘the 40% rule’. Basically you can forecast whatever you want with a probability of 40%. Greece to quit the euro? Maybe! Trump to fire Powell and hire his daughter as the new Fed chair? Never say never! 40% means the odds will be greater than anyone else is saying, which is why your clients need to listen to your warning, but also that they shouldn’t be too surprised if, you know, the extreme event doesn’t actually happen.

• How to improve space weather forecasting (see here for the original paper):

For instance, the National Oceanic and Atmospheric Administration’s Deep Space Climate Observatory (DSCOVR) satellite sits at the location in space called L1, where the gravitational pulls of Earth and the Sun cancel out. At this point, which is roughly 1.5 million kilometers from Earth, or barely 1% of the way to the Sun, detectors can provide warnings with only short lead times: about 30 minutes before a storm hits Earth in most cases or as little as 17 minutes in advance of extremely fast solar storms.

• Coup cast: A site that estimates the yearly probability of a coup. The color coding is misleading; click on the countries instead.

• Prediction = Compression. "Whenever you have a prediction algorithm, you can also get a correspondingly good compression algorithm for data you already have, and vice versa."

• Box Office Pro looks at some factors around box-office forecasting.

Long Content.

• When the crowds aren't wise; a sober overview, with judicious use of Condorcet's jury theorem.

Suppose that each individual in a group is more likely to be wrong than right because relatively few people in the group have access to accurate information. In that case, the likelihood that the group’s majority will decide correctly falls toward zero as the size of the group increases. Some prediction markets fail for just this reason. They have done really badly in predicting President Bush’s appointments to the Supreme Court, for example. Until roughly two hours before the official announcement, the markets were essentially ignorant of the existence of John Roberts, now the chief justice of the United States. At the close of a prominent market just one day before his nomination, “shares” in Judge Roberts were trading at $0.19, representing an estimate that Roberts had a 1.9% chance of being nominated.

Why was the crowd so unwise? Because it had little accurate information to go on; these investors, even en masse, knew almost nothing about the internal deliberations in the Bush administration. For similar reasons, prediction markets were quite wrong in forecasting that weapons of mass destruction would be found in Iraq and that special prosecutor Patrick Fitzgerald would indict Deputy Chief of Staff Karl Rove in late 2005.
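The claim in the quote about majority decisions follows from Condorcet's jury theorem and can be checked numerically. The sketch below assumes independent voters who are each correct with the same probability p and an odd group size; the function name is hypothetical.

```python
from math import comb

def majority_correct_probability(p, n):
    """Probability that a strict majority of n independent voters is correct.

    Each voter is correct with probability p; n is assumed to be odd so that
    ties cannot occur.
    """
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for group_size in (1, 11, 101):
    wise = majority_correct_probability(0.6, group_size)
    unwise = majority_correct_probability(0.4, group_size)
    print(f"n={group_size:>3}: p=0.6 -> {wise:.3f}, p=0.4 -> {unwise:.3f}")
```

With p above one half, adding voters pushes the majority toward certainty of being right; with p below one half, the same aggregation pushes it toward certainty of being wrong.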

• A review of Tetlock’s ‘Superforecasting’ (2015), by Dominic Cummings. Cummings then went on to hire one such superforecaster, who then resigned over a culture war scandal, characterized by adversarial selection of quotes which are indeed outside the British Overton Window. Notably, Dominic Cummings then told reporters to "Read Philip Tetlock's Superforecasters, instead of political pundits who don't know what they're talking about."

• Assessing the Performance of Real-Time Epidemic Forecasts: A Case Study of Ebola in the Western Area Region of Sierra Leone, 2014-15. The one caveat is that their data is much better than coronavirus data, because Ebola symptoms are more evident; otherwise, pretty interesting:

Real-time forecasts based on mathematical models can inform critical decision-making during infectious disease outbreaks. Yet, epidemic forecasts are rarely evaluated during or after the event, and there is little guidance on the best metrics for assessment.

...good probabilistic calibration was achievable at short time horizons of one or two weeks ahead but model predictions were increasingly unreliable at longer forecasting horizons.

This suggests that forecasts may have been of good enough quality to inform decision making based on predictions a few weeks ahead of time but not longer, reflecting the high level of uncertainty in the processes driving the trajectory of the epidemic.

Comparing different versions of our model to simpler models, we further found that it would have been possible to determine the model that was most reliable at making forecasts from early on in the epidemic. This suggests that there is value in assessing forecasts, and that it should be possible to improve forecasts by checking how good they are during an ongoing epidemic.

One forecast that gained particular attention during the epidemic was published in the summer of 2014, projecting that by early 2015 there might be 1.4 million cases. This number was based on unmitigated growth in the absence of further intervention and proved a gross overestimate, yet it was later highlighted as a “call to arms” that served to trigger the international response that helped avoid the worst-case scenario.

Methods to assess probabilistic forecasts are now being used in other fields, but are not commonly applied in infectious disease epidemiology

The deterministic SEIR model we used as a null model performed poorly on all forecasting scores, and failed to capture the downturn of the epidemic in Western Area.

On the other hand, a well-calibrated mechanistic model that accounts for all relevant dynamic factors and external influences could, in principle, have been used to predict the behaviour of the epidemic reliably and precisely. Yet, lack of detailed data on transmission routes and risk factors precluded the parameterisation of such a model and are likely to do so again in future epidemics in resource-poor settings.

• In the selection of quotes above, we gave an example of a forecast which ended up overestimating the incidence, yet might have "served as a call to arms". It's maybe a real-life example of a forecast changing the true result, leading to a fixed point problem, like the ones hypothesized in the parable of the Predict-O-Matic.

• It would be a fixed point problem if [forecast above the alarm threshold] → epidemic being contained, but [forecast below the alarm threshold] → epidemic not being contained.
• Maybe the fixed-point solution, i.e., the most self-fulfilling (and thus, accurate) forecast, would have been a forecast on the edge of the alarm threshold, which would have ended up leading to mediocre containment.
• The troll polls created by PredictIt traders are perhaps a more clear-cut example of Predict-O-Matic problems.
• Calibration Scoring Rules for Practical Prediction Training. I found it most interesting when considering how Brier and log rules didn't have all the pedagogic desiderata.

• I also found the following derivation of the logarithmic scoring rule interesting. Consider: if you assign probabilities p1, p2, ..., pn to n (independent) events, then the combined probability of these events is p1 x p2 x ... x pn. Taking logarithms, this is log(p1 x p2 x ... x pn) = log(p1) + log(p2) + ... + log(pn) = Σ log(pi), i.e., the logarithmic scoring rule.
• Binary Scoring Rules that Incentivize Precision. The results (the closed-form of scoring rules which minimize a given forecasting error) are interesting, but the journey to get there is kind of a drag, and ultimately the logarithmic scoring rule ends up being pretty decent according to their measure of error.

• Opinion: I'm not sure whether their results are going to be useful for things I'm interested in (like human forecasting tournaments, rather than Kaggle data analysis competitions). In practice, what I might do if I wanted to incentivize precision is to ask myself if this is a question where the answer is going to be closer to 50%, or closer to either of 0% or 100%, and then use either the Brier or the logarithmic scoring rules. That is, I don't want to minimize an l-norm of the error over [0,1], I want to minimize an l-norm over the region I think the answer is going to be in, and the paper falls short of addressing that.
• How Innovation Works—A Review. The following quote stood out for me:

Ridley points out that there have always been opponents of innovation. Such people often have an interest in maintaining the status quo but justify their objections with reference to the precautionary principle.

• A list of prediction markets, and their fates, maintained by Jacob Laguerros. Like most startups, most prediction markets fail.
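The derivation of the logarithmic scoring rule mentioned earlier in this list can be checked numerically, alongside the Brier rule it is compared with. This is a minimal sketch: the function names and example forecasts are hypothetical, and the Brier score is written in its simple binary form.

```python
import math

def log_score(probabilities):
    """Sum of the log-probabilities assigned to the outcomes that occurred."""
    return sum(math.log(p) for p in probabilities)

def brier_score(forecast, outcome):
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (forecast - outcome) ** 2

# Probabilities a forecaster assigned to three events that all happened.
assigned = [0.5, 0.8, 0.9]

# The log of the joint probability equals the sum of the individual logs.
joint_probability = math.prod(assigned)
print(math.log(joint_probability), log_score(assigned))

# Brier penalty for a single 80% forecast, depending on the outcome.
print(brier_score(0.8, 1), brier_score(0.8, 0))
```

Log scores are negative and closer to zero is better, whereas Brier scores are non-negative and lower is better; the log rule penalizes confident misses far more heavily.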

Note to the future: All links are added automatically to the Internet Archive. In case of link rot, go here

"I beseech you, in the bowels of Christ, think it possible that you may be mistaken." Oliver Cromwell


### Inviting Curated Authors to Give 5-Min Online Talks

July 1, 2020 - 04:05
Published on July 1, 2020 1:05 AM GMT

If you've written a post that's been curated by the LessWrong team, then you're eligible to give a talk at one of our upcoming weekend events. If you'd like to give one, here's the form to fill out your availability.

The events look something like this:

• 20-40 LessWrong users show up.
• 3-5 curated authors give 5-minute talks over Zoom.
• After each talk, we have 5-10 mins of Q&A from the attendees.
• After the talks (which last ~1 hour), we split into breakout rooms for a 30-60 min hangout and discuss the ideas.
• After the event, Ben and Jacob make transcripts of the talks and publish them.

Speakers so far have been Vaniver, TurnTrout, Abram Demski, John Wentworth, Eukaryote, Orthonormal, Charlie Steiner, Alkjash, and Daniel Kokotajlo. Examples of transcripts: Alkjash, Abram.

Your talk can discuss a post, a comment, or a new idea you've not been able to write down yet.

Here's the form.


### Situating LessWrong in contemporary philosophy: An interview with Jon Livengood

July 1, 2020 - 03:37
Published on July 1, 2020 12:37 AM GMT

Jonathan Livengood is an associate professor of philosophy at the University of Illinois at Urbana-Champaign who hung around LessWrong in the late 2000s and early 2010s as a graduate student at the University of Pittsburgh, where he was writing a dissertation on causal inference under John Norton, Peter Spirtes, and Edouard Machery. He also blogs at the excellent Unshielded Colliders.

One of the central criticisms of mainstream philosophy at LessWrong has always been aimed at its tendency (sometimes called "conceptual analysis") to reify cognitive concepts or linguistic terms: to perceive them, in other words, as having a simple, one-to-one correspondence with regularities or features of the world (see "Taboo Your Words," "Concepts Don't Work That Way," "LessWrong Rationality and Mainstream Philosophy"). Livengood and I discuss the state of conceptual analysis in philosophy departments, and its recent replacement by "conceptual engineering." We also discuss some of the problems of academic philosophy, continuities between LessWrong and analytic thought, and the status of insights like Bayesianism, verificationism, the pragmatist motto "making beliefs pay rent," and Korzybski's "map and territory."

Some context for this interview can be found in an earlier post, "Conceptual engineering: The revolution in philosophy you've never heard of," as well as in short pieces on my personal blog about LessWrong vs. contemporary philosophy (1, 2). But I'll add some definitions up front, to give context to our conversation for those who haven't read the backlog:

conceptual analysis: a method of philosophy in which a concept is assumed to have necessary and sufficient criteria which can be described simply and robustly; for instance, there might be a set of criteria which elegantly compress and describe all native-speaker utterances of a concept like "truth." Typically, a philosophical opponent will rebut a proposed set of criteria by offering counterexamples: cases in which a use-case of a concept does not meet the proposed criteria (or in which a non-member of the concept does meet them). Michael Bishop's "The Possibility of Conceptual Clarity in Philosophy" is an excellent, if skeptical, introduction.

conceptual engineering: a recently proposed shift in philosophical method, which abandons the idea of concepts as having "necessary and sufficient" criteria, and instead of analyzing concepts, attempts to rigorize or redefine them so they can be made more useful for a philosophical problem at hand.

This interview runs long, so I've supplied headers and bolded key lines, which will hopefully enable selective browsing.

On conceptual analysis & the history of philosophy

Livengood: [Before we start, since we're discussing LessWrong versus more traditional philosophy...] It's not clear to me that there's any unique thing we could think of as philosophy, full-stop, or "philosophical discourse today." I think a better picture is there are a bunch of overlapping activities and pursuits; sometimes they have goals that are nearby, and a lot of behavioral practice can live happily in any of those circumstances, but the ends people have in mind are a little different. We can have a lot of shared discourse in philosophical spaces; we all go to the same conferences and there isn't much disconnect, but when you try to get into what exactly people are trying to do with these projects, it can come pretty far apart.

Reason: Well, perhaps one angle here—I've heard it argued that conceptual analysis is the foundational, inseparable, aprioristic mode of philosophizing that goes back to antiquity and forms a throughline from philosophy's past to present. And though it's not always stated, the implication is that by turning a leaf from conceptual analysis to conceptual engineering, you've fundamentally changed the nature of the field: what it thinks it's up to in terms of lexicography, how it understands definitions, its place in offering linguistic prescriptions versus descriptions, the factorings of concepts and how people use them, and a larger transition from armchair philosophizing to the kind of experimental, empirical work you're doing with causation. Does that sound like a resonant narrative, or how off am I?

Livengood: I think that's a popular narrative. There's a fair amount of nuance that gets trampled, but it's not a naive or amateurish view; there are philosophers I really like, such as Stephen Stich, who would give more or less this account of the development of Western philosophy. And you can definitely see elements of it in the Platonic dialogues: Socrates shows up in the marketplace, and someone runs into him, and they say something off the cuff like, "So-and-so was really courageous yesterday," or Euthyphro says, "I'm doing the pious thing by prosecuting my father for murder." And Socrates will go, "Oh. So you must know what courage, or piety is. Tell me about that." The structure usually looks like the other person giving a cluster-type definition: "Piety is when you do these sorts of things: going to sacrifices, doing what the gods require, visiting the temple on a regular basis." And then Socrates says, "No, I don't want a list."

Reason: He wants the essence.

Livengood: Right, give me the account. And the other person realizes what Socrates wants is a definition, so they give an attempt at a definition. Socrates gives a counter-example, so they patch the definition; Socrates gives another counter-example and they patch the definition; and eventually everyone gets tired and leaves. That's the structure of a dialogue, especially the early ones.

There's something really nice about that format, and something that looks very similar to even contemporary work. In one of the corners of the literature I know fairly well, the causation literature, a lot of it looks like that. Take David Lewis in the 1970s offering a counterfactual account of causation with a simple core idea: causation is like counterfactual dependence of a certain sort, or some pair of counterfactual dependence claims. And then people point out problems with that account, so he offers patches: in 1986, in 2000, and another posthumously. There's a series of counterexamples and revisions to try to capture the counterexamples, and this process repeats and repeats. You wonder if the dialogue's gonna end in the same way as the Platonic dialogues, where effectively people get bored with it and move on, or if there's something like a satisfactory theoretical resolution.

There's an interesting, difficult, subtle kind of question about what the aims of that procedure really are; you'd asked when you wrote me, you used the word "lexicography" in your setup. I don't think for the most part philosophers have been trying to do, or thought of themselves as doing, lexicography. It seems to me that philosophers up until the 20th century, really, were doing one of two things. The boring older thing is doing metaphysics, where the target is supposed to be a thing "out there" in the world, and it's not so much that the project is figuring out how we use language, but about getting at whatever the thing is "out there." Think about this the same way you think about scientific things, Newton and the apocryphal apple. You say: "That thing we just saw, let's call that gravity; there are objects, and when they're unsupported, they fall." What's the right account of that? We know what we're talking about, we fixed our reference, but now we want to give an account.

It seems to me like historically, philosophers were aiming at the same type of things. You should think of Socrates as saying something like, "We've seen examples of what we might call courage, or piety—there's a thing, out there in the world" and here I think he's making a mistake, there's this abstract object "justice" or "piety" or "courage," and that thing I want to give an account of in the same way I give an account of gravity, or matter, or space.

Reason: The mistake being that he reifies a cognitive cluster space of "the good" or "the pious" as matching onto a discernible structure in the world, as opposed to being a garbage heap humans have found useful to call "pious" historically. Do you think philosophy that falls into that style of thought identifies and corrects its mistakes before Wittgenstein, or is Wittgenstein rightfully treated as a big deal in part for noticing it?

Livengood: Wittgenstein is tricky in a few different ways, and the 20th century on this is... contentious. There are two related things that happened where, the history is not so obvious yet, and so there are still live debates about how to think about it. There's this movement of analytic philosophy, you'll see Frege get included, Russell and Moore typically, Wittgenstein and maybe Carnap; sometimes the Ordinary Language group will get picked up like Austin; but there's this core British group that's tough to distinguish from realists.

Reason: They're rebelling from British idealism.

Livengood: And there's this focus on figuring out the meaning of terms; this is a big part of Russell's writing, for example; and there's a lot of concern with the logical structure of speech. Then there's a related phenomenon—sometimes it's smooshed together, sometimes they're separated—this idea of philosophical analysis, and this related idea of the linguistic turn. A number of people think that sometime in the 20th century there's a shift; often they're thinking of Carnap, who is very explicit about the difference between a material kind of discourse, which is how I've described Socrates—giving this account of a thing in the world, like piety—and another mode, Carnap's formal mode, which is, treating this term that shows up in our language, "piety," now with quotation marks. I'm talking about a linguistic object. And of course there's a possible further shift to paying attention to our concepts, which are supposed to be attached in some way to a linguistic term.

Reason: I guess one contention I'd advance is, to me, a classical account of concepts as having necessary and sufficient criteria in the analytic mode is in some way indistinguishable from the belief in forms or essences insofar as, even if you separate the human concept from the thing in the world, if you advance that the human concept has a low-entropy structure which can be described elegantly and robustly, you're essentially also saying there's a real structure in the world which goes with it. If you can define X, Y, & Z criteria, you have a pattern, and those analyses assume, if you can describe a concept in a non-messy way, as having regularity, then you're granting a certain Platonic reality to the concept; the pattern of regularity is a feature of the world. I don't know, what do you think of that?

Livengood: There's a lot right about what you said, and the kinds of challenges you see in the middle of the 20th century are serious problems for this whole collection of approaches, but I think it's important to see that this kind of move, especially from Carnap, which was prefigured a bit by what Russell was doing, was an important advance because it didn't necessarily reify the target of the inquiry. In some cases you might want to say, "Gravity, that's something we can responsibly talk about as existing in the world," but for other things, we might just want to talk about what our language is doing. It might just be transactional—what kind of inferences we're going to make, what linguistic acts we're gonna trade back and forth; it might not be tracking anything out in the world. So there's been a pretty serious advance from the picture you're getting from Socrates up through the 20th century, to when people start focusing on the language, and thinking of linguistic acts or the structure of the language as themselves the targets of the investigation.

Reason: It's hard to understand the history backwards; much of what past philosophers got right now seems obvious, while everything non-obvious is wrong.

Livengood: I think that's right; one of the things that's fun about doing history of philosophy is seeing how very smart people can be deeply confused about things. They have an idea but it's vague and mashed-up, and today you'd say, "You're running together six different things, you have to pull apart and distinguish them." It's a thing that happens a lot, reading the history.

Reason: If I want to learn about the history of philosophy, or what Kant thought, or about philosophy through Kant—in which of these situations should I read the original, and when should I read a secondary source?

Livengood: Secondary sources have huge virtues, and you've identified some of them: they're often clearer than primary sources, they often supply intellectual context and help situate the primary source, while drawing out what the field thinks is important. But there are also vices: the secondary literature may not be right about what the most important things in the primary source are; often these sources are idiosyncratic in their readings.

Reason: What's your gut on how good these secondary sources are? Let's say major university press, respected in the field. Have we pretty much mined everything in the original, or are there gems still hiding out?

Livengood: The danger is more on the side of over-interpreting, or being overly charitable to the target. I just wrapped up a grad seminar on the problem of induction, and we were looking at the historical development of the problem of induction from Hume to 1970. As I pointed out, when you look at Hume, Hume's great, he's fun to read, but he's also deeply confused, and you don't want to do the following, which is a mistake: If you start with the assumption that Hume was just right, and assume that, if you're seeing an error it must be an error in your interpretation—if that's your historiographical approach, you're not going to understand Hume, you're going to understand this distorted SuperHume, who knows all these things Hume didn't know, and can respond to subtle distinctions and complaints that someone living now is able to formulate. That's not Hume! Hume didn't have an atomic theory, he didn't know anything about DNA or evolution; there are tons of things that were not on his radar. He's not making distinctions we'd want him to make, that a competent philosopher today would make. There's a real danger writing secondary literature, or generating new interpretations. If you want to publish a book on Hume, you need to say something new, a new angle—what's new and also responsible to what Hume wrote? It ends up doing new philosophy under the guise of history. There I'm suspicious that there's anything new to say that's also responsible to the writer.

In the 70s, the target for me is Quine; he wrote a paper called "Epistemology Naturalized," and there's a straightforward reading of this paper where he's resuscitating Hume, and giving a contemporary update. He has this throw-away line; the slogan part is, "The Humean predicament is the human predicament," but he also says, there hasn't been any progress in epistemology on the doctrinal side, the side that's dealing with normative questions, questions of justification, and the problem of inductive reasoning, since Hume. So the seminar [I ran] was asking: Is Quine right? I was upfront with the students, that there's been a lot of work on inferential problems between 1970 and today; almost all the interesting work on causal inference is after 1970. You have the emergence of information criteria, lots of statistical techniques like the bootstrap and jackknife, Bayesian and computational resources, machine learning and big data—those all change the landscape.

Livengood's experience with LessWrong

Reason: I want to ask how you think of the historic state of philosophy, or what it would be like to project a historical view on the present, but I want to ask about LessWrong, so let's jump back and forth. How'd you get exposed to the community? What was your experience?

Livengood: I started reading in the 2000s, I don't remember exactly which pieces. Much of it was just self-reinforcing; for the most part, stuff that happens on LessWrong seems indistinguishable to me from high-level amateur, low-level professional discourse in philosophy? Smart graduate students, people who had really decent ideas but lacked the professional language to express it. That's the way the LessWrong community struck me at the time; I was a graduate student just starting, and it felt like, "Yeah! I'm having a conversation with other people doing the same kind of thing I'm doing." There's sometimes an impression that the people on LessWrong are doing something wildly out of step from what philosophers would ordinarily think of themselves as doing, and that was not my impression.

Reason: Both naysayers and advocates for LessWrong do often emphasize the gap like you say, and I think unless you're very knowledgeable about the field, you hear a lot of bad arguments coming out of philosophy, both historically and still today. (Sturgeon's Law.) And most philosophers worth their chops in these fields are aware of these historical arguments being flawed; they're maybe more generous, and probably see these (today obvious) ideas as highly non-obvious in their times.

Livengood: Again, the thing I said earlier, that there isn't "such a thing, fullstop" as philosophy—LessWrong seems fruitfully engaged in similar kinds of questions, concerns, and problems to at least some parts of contemporary academic philosophy, and parts of contemporary philosophy I like and think are non-trivial. It's not a ghettoized, small corner of philosophy; there are robust projects that are shared by a number of departments across the world that do things this way.

I would agree LessWrong does things differently, there's a house style, but it's not like the collection of theses they defend or are pursuing or developing are so far out of the mainstream that academics wouldn't recognize it as philosophy, or as being reasonable approaches to philosophy.

Romantic vs. professionalized philosophy

Reason: Well, that's why I reached out in the first place; you'd left a comment on Luke Muehlhauser's "Train Philosophers With Pearl and Kahneman, not Plato and Kant" gesturing to this effect—that at least in your graduate program, at Pittsburgh, cognitive science was very paid-attention-to.

Livengood: The Pittsburgh scene is a little peculiar; just background-wise, at the University of Pittsburgh there are two departments which at the time were on the same floor. There's an enormous, 42-story cathedral of learning at Pittsburgh, lovely neo-Gothic, built in the 30s, and these two departments were right across the hall: there was the philosophy department, and there was the History and Philosophy of Science (HPS) department. My PhD is from the latter.

Those departments are very different in the way they think about what philosophy is doing, the way they train their graduate students, the way their courses are conducted, their faculty. Maybe the best way to describe that difference is there are two divergent attitudes of how philosophy should go, what I'd describe as the professionalized view and the romantic view. The HPS side tended to be more professionalized; you find an interesting problem, chip away at it, advance the field a bit, and at the end of a long career, you and the people you're working in conversation with will have learned something, you'll have advanced human knowledge. This is the way things have to go: most of us are not geniuses, we're just ordinary people chipping away at a problem.

And then there's the romantic view that says look, the people we read and engage with—Aristotle, Descartes, Kant, Wittgenstein—are these super-geniuses who thought thoughts nobody else had ever thought before, who shook the foundations of human knowledge and turned things upside down. This is the aim: to become one of those people.

And the difference in graduate training in the two programs is, HPS you come in, write some papers, get out in 6-8 years, get a job, everybody does that. The Pitt Philosophy program you come, think some things, try to think the deep thoughts; the very best people go on to an awesome career, the rest of you, well, we're happy to burn through a hundred grad students to find a diamond.

My sympathies are, as you might expect, entirely with the professionalized view.

Reason: It does seem if you're a Wittgenstein-level genius, you don't need your romanticism stoked, you might not even need a graduate program. Certainly they didn't.

Livengood: That's probably right, but to give the devil his due, there are things to like, there are reasons people are attracted to that romantic view. They're just not reasons I endorse at the end of the day.

Analytic communities on LessWrong's wavelength

Reason: Have you read Clark Glymour's manifesto?

Livengood: Yes.

Reason: What did you think?

Livengood: So that's the other element in the mix. There are these two Pitt departments, both quite good, the Philosophy program at the time was top five in the world, and HPS program has been for a long time the place to do philosophy of science. And then across the street is Carnegie Mellon, which, their philosophy department is basically Glymour's construction. Whoever the president or provost was recruited Clark out of Pitt to establish a philosophy department, and Glymour's like, great, I can build a philosophy department from scratch, the way I'd want to run a philosophy department. It's a peculiar place. The way I've heard it described is that CMU's philosophy department is what you get when you treat philosophy as a kind of engineering. I think that's not inaccurate. I happen to think that's beautiful, a really good look for philosophy.

Reason: What would you call the CMU, HPS, maybe LSE, you can throw LessWrong in there it sounds like—

Livengood: I would include also Irvine, University of Minnesota, Indiana University sometimes has had this vibe. It's not quite positivist, but it's in that neighborhood—science-friendly, professionalized, trying to make progress, caring about mathematics and empiricism.

Reason: It's the kind of people who would've been positivists in the 50s.

Livengood: If Carnap were alive today he'd be in this camp. Whether he'd have the views he had back then, well, he probably wouldn't; we learn things, we hope that these things change minds.

Reason: I've heard this vibe is also popular in Europe.

Livengood: Yeah, the LMU at Munich has the same kind of character. European programs are trickier because much of it is tied to local funding regimes, but there do seem to be more of these mathematically, empirically informed projects.

Reason: A popular metaphor at LessWrong is Korzybski's "map and the territory," though it may have gotten there via Hayakawa. Is it a good metaphor, or do its reductions actually set you back, as some detractors claim?

Livengood: I think I'm mostly a fan of the Korzybski metaphor. It's serviceable. I think it has some limitations where the map is the territory, which can happen when the map-making makes the thing. Here I'm thinking of pretty mundane cases, like how something being money depends on how we treat it, and also more controversial cases, like the construction of gender and race or the status of mathematical objects. Or do you think that misses the point of the metaphor?

Reason: Bayes, underrated, overrated?

Livengood: Hm... a bit of both. Bayesian approaches in philosophy of science and epistemology today are pretty standard. Bayesian analysis of scientific reasoning is a project that's probably overrated, at least in philosophy. Bayes in undergraduate education generally is probably underrated; I teach a 100-level intro to logic course, and I tell the students, if you take a Stats 100 class, you'll see frequentist approaches to probability, and frequentist statistical inference techniques, so I'm going to give you something different, give you a Bayesian take on it. So far I haven't yet had a student say, well, this is obviously the way people think about probability, this is boring and I've seen it in my other classes.

Reason: We're obviously familiar with the idea of scientific progress. Ethics get described surprisingly similarly, where there's a kind of drift; whether that drift happens "on its own," in an inevitable ratchet, or whether people have to work to make it happen, is unclear; but this is the way changing norms around race, sexuality, animal rights get talked about typically. Do you feel like the shift that departments like HPS or CMU are leading, the transition from conceptual analysis, will win out or become dominant? How do you see the field a hundred years out?

Livengood: Predictions that far out are tricky. It's not obvious to me we'll have anything that looks like contemporary universities in a hundred years. You asked over email about technological developments and philosophical progress, and there are lots of positive impacts there. Increases in massive online instruction, I'm not sure how that will shake out.

Philosophy's role in public discourse

Reason: Last year you wrote, "I don't think philosophers are especially well-equipped in virtue of their training to help out in the current crisis. We're more like high-trained sports fencers when a general melee is breaking out. We've trained to participate in a game that has specific restricted rules, that are implicit and often hard to fathom; if we go out into the world and try to fix it playing by our usual rules, the result will be predictably bad." This seems right to me, but the question becomes, who is filling this role? We don't have literal swordfights, so it's not a big deal if human capital is channeled into play-fencing. We do have these figurative swordfights though, so the question becomes, who is filling this role in public discourse?

Livengood: I thought your list was pretty good. [I'd emailed along Tyler Cowen's comments that amateurs in philosophy are running the public-facing discipline: Silicon Valley stoicism, Nassim Nicholas Taleb, LessWrong-style rationalism and post-rationalism, ex-New Atheists like Sam Harris, psychologists like Jordan Peterson.] It gets filled in a variety of ways, some by professional or near-professional philosophers by way of podcasts, but much of it in larger circuits is indeed filled by people like Sam Harris, Jordan Peterson, and then even less interesting people like Ben Shapiro.

Reason: Zizek seems like one of the few entries from a more traditional philosophy tradition.

Livengood: Yeah, there are a few outliers. Peter Singer has had a fair amount of popular public impact. There are others with marginal public influence, but who are clearly important, such as Martha Nussbaum or Dan Dennett. They matter, even if they're not nearly as visible as people like Zizek, or Chomsky, or Singer. I don't know how many public-facing philosophers we need in a society of this size; it does seem like, given that I'm not especially impressed by people like Harris and Peterson and Shapiro, we could use more public-facing philosophy—but there's also a question of why it is the market has taken up those individuals, whether there are just market-type demands that are satisfied by the ideas they're producing that wouldn't take up public bandwidth the way more mainline philosophical production would.

Reason: Looking to one historical precedent, what do you think of say the post-war French gang, Sartre through Foucault? That's a case of borderline public hysteria around a set of more-or-less traditional academic philosophers. Is that fair? What can we take away, what do we learn?

Livengood: I'm not sure we learn anything. I'm not a radical contingency historian, I don't think there's nothing to learn from history, but there are often events where there isn't much to take away, you have a couple interesting public intellectual figures who happen to be in philosophy, who happen to have a public who is interested in their ideas; if they'd been in a different field, would things have been different? I don't know. The counterfactuals make me think it's too hard to judge. At minimum, we'd need a whole lot more detailed information about their writing, what was going on in society, and I'm unqualified for that.

Selection and referee problems in philosophy

Reason: I've really appreciated how much personality philosophy has. You have Chalmers and his Zombie Blues band, the Kripkensteins, it's a fun wonky field, old men with big personalities and big beards, I'm a big fan. But now that I've said something nice about philosophy I have to say something mean. Sturgeon's Law says 90% of any field is bad, 10% is good; you have plenty of dressed-up, garbage literary fiction and plenty of brilliant pulpy sci-fi books. Do you think there's a mechanism that makes it more difficult for the field to sort out and identify the good among the bad? Maybe it takes a certain level of criticality to identify the good thought to begin with, and the implicit consensuses built off support and textual elaboration aren't guiding us to the correct answers.

Livengood: Part of what you're saying sounds right, but I'm a little nervous about other bits. I'd put it in terms of "rules for settling opinions": in the sciences, there are clear standards for settling disputes, where you work out an experiment and run it. I'm not naive about how the sciences work in reality, but in principle at least, if you have a disagreement, you can come to an agreement about what you will do or believe in light of the experiment you're going to run. This is an idealistic Feynman picture, that at the end of the day, if you run the experiment, and the experiment doesn't agree with your idea, even if your idea is super pretty, it's wrong. In real scientific practice it's a lot messier, but in philosophy it's much harder to agree on a constraint or rule for settling disputes. We have practices we engage in, and we do tend to move closer together in the process of extended discourse and argument, but it's hard to say why that happens; I find it very unsettling that I don't have a good sense of what might resolve a disagreement. It's a problem I'm always puzzling about.

There's something I want to fuss about though. It seems to me that philosophy has a bad cultural fixation on the genius, but that a lot of progress is possible in philosophy without these super-genius-level contributors. This is part of my bias toward the professionalized way of looking at the field. I think the best work in philosophy is identifying a narrow topic you can actually make progress on, and chipping away at it through formal precision, distinctions, experiments, and collectively we make progress on these problems. It's not always obvious that there's progress, or what progress looks like, when you're too close to it, or it's really new, but if you give yourself an extended period—how people have thought about induction from Hume to today—you'll see lots of progress made.

Making beliefs pay rent

Reason: I can't let you go before asking about Peirce, who you've written quite a bit about. One of the views of his that surfaces on LessWrong is a demand that beliefs pay rent. Now, I know people make a lot of the differences between pragmatism and positivism, and certainly Russell hated the pragmatists, but there seems to be a kernel or core, maybe you could call it weak verificationism, where if one person believes one thing, and another believes another thing, then there should be some observable difference that matters, something that ought to tell us who is right or wrong. That if there's nothing in the world that can distinguish between our arguments, maybe we're not in disagreement at all. Verificationism proper comes under a lot of flak these days; maybe you can suggest a better handle for the rough, generic version I'm describing; but I'm curious, is verificationism a good idea that's needed a lot of qualification over the 20th C, or is it a bad idea that got us off on the wrong foot?

Livengood: I think it's a great idea that's mostly right. It's similar to what we were talking about with primary and secondary sources: the bulk of its value lies in pretty simple statements, even though those statements aren't quite right. They have counterexamples, or haven't had enough detail built into them, but you get the gist. It's still an open question as to whether an adequate account of the verification criterion can be made to work, but I'm not sure it really matters with respect to the practical service the idea performs. Something like Peirce's pragmatic maxim, or various Positivist views, or the verificationism Quine goes in for—all of those are quite salutary attitudes to have. Broadly good, broadly healthy, and they inspire broadly good practices in our intellectual lives.

Now, when you start trying to narrow it down to a dogmatic thesis, then I'm not so sure a verificationist account of meaning is going to quite work. There are some obvious failures; A.J. Ayer's account doesn't work, it's pretty easy to kill it, and Church gives devastating counterexamples.

Reason: If we cast Ayer as a conceptual engineer, isn't he just telling us what a meaningful sentence is?

Livengood: Yes! This is more or less the Carnapian route. Carnap's accounts have not been knocked over in the way Ayer's has been.

Reason: Well, I'll just ask a couple minutes more of your time: One paper I've gotten a lot out of is Michael Bishop's 1992, "The Possibility of Conceptual Clarity in Philosophy." He talks about a "counterexample" style of philosophizing that's broader than conceptual analysis, where the philosopher sits in the figurative armchair, proposes a definition, and another armchair-occupant posits a counterexample which pokes a hole in the original proposal. Much like a Socratic dialogue. Given this has been the standard method for both proposing and rejecting proposals, it seems that, if we grant prototype theory and reject classical accounts of concept—if we believe concepts are fuzzy and polysemous; that there will always be edge-cases to a conceptual carving, and there's no way to losslessly compress into a few simple criteria the high entropy use-in-the-world by millions of decentralized speakers over time—if we grant this about concepts, should we let the classically analytic rulings from the 20th C about what is "meaningful" or "true" or "knowledge" stand? Ought we revisit those debates to see if they might be useful factorings, even if they aren't necessary and sufficient?

Livengood: Yes. The best example I can give is work by Joseph Halpern, a computer scientist at Cornell. He's got a couple really interesting books, one on knowledge, one on causation, and big parts of what he's doing are informed by the long history of conceptual analysis. He'll go through the puzzles, show a formalization, but then does a further thing, which philosophers need to take very seriously and should do more often. He says, look, I have this core idea, but to deploy it I need to know the problem domain. The shape of the problem domain may put additional constraints on the mathematical, precise version of the concept. I might need to tweak the core idea in a way that makes it look unusual, relative to ordinary language, so that it can excel in the problem domain. And you can see how he's making use of this long history of case-based, conceptual analysis-friendly approach, and also the pragmatist twist: that you need to be thinking relative to a problem, you need to have a constraint which you can optimize for, and this tells you what it means to have a right or wrong answer to a question. It's not so much free-form fitting of intuitions, built from ordinary language, but the solving of a specific problem.


### AvE: Assistance via Empowerment

July 1, 2020 - 01:07
Published on June 30, 2020 10:07 PM GMT

This might be relevant to the AI safety crowd. Key quote:

"Our key insight is that agents can assist humans without inferring their goals or limiting their autonomy by instead increasing the human’s controllability of their environment – in other words, their ability to affect the environment through actions. We capture this via empowerment, an information-theoretic quantity that is a measure of the controllability of a state through calculating the logarithm of the number of possible distinguishable future states that are reachable from the initial state [41]. In our method, Assistance via Empowerment (AvE), we formalize the learning of assistive agents as an augmentation of reinforcement learning with a measure of human empowerment. The intuition behind our method is that by prioritizing agent actions that increase the human’s empowerment, we are enabling the human to more easily reach whichever goal they want. Thus, we are assisting the human without information about their goal[...]Without any information or prior assumptions about the human’s goals or intentions, our agents can still learn to assist humans."[Emphasis and omissions are mine]

From the abstract: One difficulty in using artificial agents for human-assistive applications lies in the challenge of accurately assisting with a person's goal(s). Existing methods tend to rely on inferring the human's goal, which is challenging when there are many potential goals or when the set of candidate goals is difficult to identify. We propose a new paradigm for assistance by instead increasing the human's ability to control their environment, and formalize this approach by augmenting reinforcement learning with human empowerment. This task-agnostic objective preserves the person's autonomy and ability to achieve any eventual state. We test our approach against assistance based on goal inference, highlighting scenarios where our method overcomes failure modes stemming from goal ambiguity or misspecification. As existing methods for estimating empowerment in continuous domains are computationally hard, precluding its use in real time learned assistance, we also propose an efficient empowerment-inspired proxy metric. Using this, we are able to successfully demonstrate our method in a shared autonomy user study for a challenging simulated teleoperation task with human-in-the-loop training.
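To make the quoted definition concrete, here is a minimal sketch of empowerment as "the logarithm of the number of possible distinguishable future states that are reachable from the initial state." It is not the authors' implementation, and it does not use their proxy metric. The grid world, the action set, the fixed horizon, and the helper names (`step`, `empowerment`, `best_wall_to_remove`) are all hypothetical, and the dynamics are assumed deterministic so that counting distinct end states is enough.

```python
import math
from itertools import product

# Hypothetical action set for a 4-connected grid world (illustrative only).
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]  # up, down, right, left, stay


def step(state, action, walls, size):
    """Deterministic transition: move unless a wall or the boundary blocks it."""
    x, y = state[0] + action[0], state[1] + action[1]
    if 0 <= x < size and 0 <= y < size and (x, y) not in walls:
        return (x, y)
    return state


def empowerment(state, walls, size, horizon=2):
    """log2 of the number of distinct states reachable within `horizon` steps.

    Assumes deterministic dynamics; "stay" is an action, so states reachable
    in fewer steps are counted too.
    """
    reachable = set()
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = step(s, a, walls, size)
        reachable.add(s)
    return math.log2(len(reachable))


def best_wall_to_remove(human, walls, size, horizon=2):
    """Toy assistant: remove the wall whose absence most raises human empowerment."""
    return max(walls, key=lambda w: empowerment(human, walls - {w}, size, horizon))


if __name__ == "__main__":
    human = (0, 0)
    walls = {(1, 0), (2, 2)}
    print(empowerment(human, walls, size=3, horizon=1))
    print(best_wall_to_remove(human, walls, size=3, horizon=1))
```

The toy assistant never looks at a goal: it only compares how many states the human could reach after each candidate intervention, which is the intuition the quote describes. Enumerating every action sequence grows exponentially with the horizon, consistent with the abstract's remark that estimating empowerment is computationally hard in richer domains.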

How does this fit in with other control problem approaches? What is the relationship between this and Turner's power formalism?

They also carried out a survey that didn't look like it made it into the paper, but shows up on the project web page: https://sites.google.com/berkeley.edu/ave/home


### I am Bad at Flirting; Realizing that by Noticing Confusion

June 30, 2020 - 23:48
Published on June 30, 2020 8:05 PM GMT

This post is about applying rationality to my dating life. It is gooey and rich in self-disclosure. But it was a great triumph over motivated reasoning.

I notice my confusion

When I was 22 I noticed that my romances began mainly when I was busy. They began most often, paradoxically, when I was unable to pay attention to my interest in early courtship. The observation was interesting but did not then replace my preferred explanation for romantic successes and failures.

At the time I subscribed to what I call the “mental health” hypothesis[1]. When I was healthy and confident, women could sense it and chose to date me. Sometimes I became “unhealthy” and women became disinterested. Partners sensed “unhealth” by uncontrollable micro-cues. Thinking about dating will only make you more self-conscious and worsen your chances. This explanation agreed with the results of my favorite epistemology: asking my female friends what they think happened (critique of this method). At the time I had not learned Bayes' theorem, Occamian reasoning or any social influence literature, so the “mental health” hypothesis seemed plausible. Besides, I liked my earlier beliefs. I just need to be “better”, then women will like me. It was a simple, appealing narrative. If I lost that explanation, the alternatives might be “you must be a jerk” or “you are unattractive”, which scared me. Besides, Me_2016 did not know the benefits of saying “oops”.

By 2019, the “mental health” explanation was under increasing strain. The conventional advice was “make your life full by exercising, working on your mental health, improving your career, and the partners will come”. The problem was that it did not work, despite sounding wise. In the summer of 2019 I was healthy; I had a good job, strong friendship networks, passionate hobbies, and plenty of exercise. It was the ideal time to seek a partner and I put great energy into the search. Despite my apparent health, I had the worst results in years. When I pointed this out, my poor friends could only shake their heads and think “there he goes, trying to solve the unsolvable.”

The nail in the coffin of the “mental health” hypothesis came that winter. I lost my job, was briefly jailed in a foreign country, and emerged into a revolution and a currency crisis to look for work. That same month I broke my back and worked myself into an emotional collapse. And women loved it. I got more positive responses in that one month than in 5 months of nest-building mode. Something else was going on. I finally noticed my confusion.

This summer I reinvestigated with Bayesian reasoning. First, I had to choose between believing two unlikely statements. Either this pattern of relationships was a strange coincidence, or my female friends had no idea what made them choose who to date. The first conclusion seemed unlikely, as of the 15 or so courtships I remembered, the three successes occurred during strenuous efforts to hide my interest. That pattern is unlikely if relationships occur randomly. The second conclusion had seemed wildly implausible at first. But after reading about a mountain of cognitive biases and the difficulty of rationality, the unknowability of our preferences made sense. I began talking to my friends about how they selected partners. I mostly found their responses to be nonsensical, either drawing on overcomplicated psychological constructs or qualities that did not seem special at all. One friend described herself as constantly dating jerks because of her childhood issues, but it was cheaper to believe she subconsciously preferred aloof and unavailable men.

Furthermore, the “mental health” hypothesis relies on partners “seeing through me”, through just a few social cues, postures, and inflections, to a deep, hidden but somehow well-defined part of my psyche where “health” information is stored. Such a complicated theory should have a low prior. A low prior with no evidential support…

Availability is the problem

And a challenger appeared. In the summer of 2019, I accidentally invited three young women on the same hike[2]. I was so worried about offending them that I called my mother for advice. She suggested hitting on none of them during the hike, so as to be fair. My mom was an unintentional genius. I started seeing one of the women from the hike the next week. That partner later stated that she was honestly unsure if I was interested until we first made out, with particular reference to the hike. I suspect that forcing myself to not lavish attention on my interest made me a much more appealing partner[3].

The simple solution I call the availability hypothesis. Potential partners respond to how available I appear, when deciding about a first date. If I signal that they can have me easily, partners do not want me (as much/usually). I have two aligned explanations for the phenomenon. Firstly, partners enjoy the uncertainty of not knowing if I am interested or not. I become a challenge to be achieved. If I just tell someone they are great, then they have achieved that status and only get the pleasure once. I can also give subtle but incomplete signals of interest, each of which provides a separate rush of pleasure. So the person who knows I like them is less likely to come to my party, respond to my messages, notice my cool hobbies and passions, and generally fall in love with me.

Secondly, there is status competition. If I signal that dating me is easy, they perceive my cost as low. Amateur jewelry shoppers assume expensive jewels are valuable, because assessing the value of each piece independently would be tiring and difficult[4]. Likewise, how available I am is a simple value heuristic for prospective partners. This explanation may be unflattering, but people do seem to care a lot about social status.

Note, they are not responding to neediness. Healthy, confident availability is right out as well. Partners are responding to the “price tag” I present. If I say in the first meeting that I like a person, the chance of a relationship drops, whether I say it confidently or needily. Words like “desperate” misleadingly imply that there is a healthy, flirtatious way to express unambiguous interest. The correct strategy is to be ambiguous[5]. In the words of one ex, you should have “General Aloofness”.

I suspect the optimal amount of interest is just enough. Slightly less than your prospective partner is ideal, but slightly more interest may be necessary as women rarely initiate. I must learn to imply interest without ever making it explicit. In other words, I was coming on too strong the whole time.

An agenda for further experimentation

There are many more tips to learn. Good posture improves attraction. Eye contact. A good fashion sense goes a long way. A few gestures at traditional masculinity. Teasing seems highly rewarded, so I can learn that. After all, 12-year-olds master the art. Also, an ex suggested "Wait for her to kiss you, then you always keep your power".

My next step is to keep experimenting and reading. LessWrong posts have been useful, especially this one. Send along any reading recommendations in addition to HughRistick and Lukeprog and Minda Myers. This work by Scott Alexander is illuminating, and Eric Raymond wrote the simplest guide. Wish me luck!

[1] I put mental health in quotes to emphasize its vagueness in this context specifically.

[2] The hike was not an attempt to make anyone jealous. I had simply observed that most people I invited hiking flaked, and so began inviting as many participants as possible.

[3] There is also evidence that women are influenced by peer attention. See Sprecher, Wenzel and Harvey, 2008. The Handbook of Relationship Initiation, p. 103.

[4] Cialdini, R (200). Influence: Science and Practice [4th ed.]. See chapter 1.

[5] Also, desperateness is a more complex/poorly defined category than availability. Cialdini’s work suggests people prefer simple heuristics.


### Comparing AI Alignment Approaches to Minimize False Positive Risk

June 30, 2020 - 22:34
Published on June 30, 2020 7:34 PM GMT

Introduction

Based on the method I used in "Robustness to Fundamental Uncertainty in AGI Alignment", we can analyze various proposals for building aligned AI and determine which appear to best trade off false positive risk for false negative risk and recommend those which are conservatively safest. We can do this at various levels of granularity and for various levels of specificity of proposed alignment methods. What I mean by that is we can consider AI alignment as a whole or various sub-problems within it, like value learning and inner alignment, and we can consider high-level approaches to alignment or more specific proposals with more of the details worked out. For this initial post in what may become a series, I'll compare high-level approaches on addressing alignment as a whole.

Some ground rules:

• By "alignment" I mean "caring about the same things".
• If you like, I attempted to be more formal about this in "Formally Stating the AI Alignment Problem", but here I'm going to prefer the less formal statement to avoid getting tripped up by being overly specific in just one part of the analysis, since we're not being very specific about the alignment methods.
• I say "cares about the same things" rather than "aligned with human interests" to both taboo "align" in the definition of "alignment" and reflect my philosophical leaning that "caring" is the fundamental human activity we are interested in when we express a desire to build aligned AI.
• I definitely have some unpublished ideas about what "caring" means in terms of predictive coding and valence. This is a good recent approximation of them written by someone more understandable than me, and accords with some things I've previously written. Hopefully not relevant beyond disclosing my background assumptions.
• Care synonyms for the scrupulous: interest, purpose, telos, concern, value, and preference as long as we use it in the everyday sense of the word to point at a broad category of human activity and not as jargon about choice (even though this broad category does influence how we choose).
• By "high-level approaches" I don't mean specific mechanisms or methods for building aligned AI, but general directions or ways in which specific methods are intended to work.
• If you don't want to dive into the details of the method I'm using here, the tl;dr on it is that we have more to lose from failure than gain from success when developing technologies that create existential risks, and so we should prefer risk mitigation interventions with lower risks of false positives, all else equal, thus I look for arguments that let us at least give an ordinal ranking of interventions in terms of false positive risk.
• I talk of "false positive risk" here because I mean the risk that we think an intervention will work, we try it, and it fails in a way that results in an outcome as bad as or worse than the outcome if we had tried no intervention or not developed the technology because it was deemed too risky.
• By comparison a false negative here is failing to try an intervention that would have worked but we incorrectly ruled it out because we thought it wouldn't work, seemed too risky, or made some other error in judging it.
• To find that ordinal ranking I primarily look for arguments that allow one intervention to dominate another in terms of false positive risks (see the section on meta-ethical uncertainty in the original paper for an example of this kind of reasoning) or show that all else is not equal and the safest choice is not necessarily the one with the lowest false positive risk (see the section on mental phenomena in the original paper for an example).
• To get there I'll consider the false positive risks associated with each approach, then look for arguments and evidence that will allow us to compare these risks.

I'll be comparing three high-level approaches to AI alignment that I term Iterated Distillation and Amplification (IDA), Ambitious Value Learning (AVL), and Normative Embedded Agency (NEA). By each of these, for the purposes of this post, I'll mean the following, which I believe captures the essence of these approaches but obviously leaves out lots of specifics about various ways they may be implemented.

• Iterated Distillation and Amplification (IDA)
• Build an AI, have it interact with humans to form a more aligned (and generally more capable) AI-human system, then build a new AI with the same capabilities as the AI-human system. Repeat.
• This is a family of methods based on Paul Christiano's ideas about IDA and includes debate, HCH, and others.
• Ambitious Value Learning (AVL)
• Normative Embedded Agency (NEA)
• Build AI that follows norms about how it makes decisions that make those decisions aligned with humanity's interests.
• The closest we have to specific proposals within this approach are MIRI's ideas about Highly Reliable Agent Designs (HRAD) and Error Tolerant Agent Designs (ETAD), perhaps combined with something like Coherent Extrapolated Volition (CEV).
• Otherwise NEA seems like a natural extension of the kind of work MIRI is doing, viz. it might be possible to bake a decision algorithm into an AI such that it achieves alignment by both computing long enough over enough detail and being programmed, via embodying a particular decision theory, to effectively care about the same things humans do.
• The reality is that NEA as described here is a bit of a straw category that no one is likely to try, as in isolation it's like the GOFAI approach to alignment, and more realistic approaches would combine insights from this cluster with other methods. Nonetheless, I'll let it stand for illustrative purposes, since I care more in this post about demonstrating the method than providing immediately actionable advice.

I do not think these are an exhaustive categorization of all possible alignment schemes; rather they are three that I have enough familiarity with to reason about and consider to be the most promising approaches people are investigating. There is at least a fourth approach I'm not considering here because I haven't thought about it enough—building AI that is aligned because it emulates how human brains function—and probably others different enough that they warrant their own category.

Sources of False Positive Risk

For each of the three approaches we must consider their false positive risks. Once we have done that, we can consider the risks of each approach relative to the others. Remember, here I'll be basing my analysis on my summary of these approaches given above, not on any specific alignment proposal.

I'll give some high level thoughts on why each may fail and then make a specific statement summing each one up. I won't go into a ton of detail both because in some cases others already have (and I'll link when that's the case) or because these seem like fairly obvious observations that most readers of the Alignment Forum will readily agree with. If that's not the case please bring it up in the comments and we can go into more detail.

• IDA
• Humans may fail to apply sufficient pressure on the combined system to produce a more aligned system after distillation.
• Humans may fail to be able to exert enough control over the combined system to produce a more aligned system during amplification.
• Competing pressures and incentives during amplification may swamp alignment such that humans themselves choose to prefer less aligned AI.
• More specifically, IDA may fail because humans are unable to express themselves through their actions in ways that constrain the behavior of an AI during amplification such that the distilled AI for the next iteration is unable to ever adequately incorporate what humans care about, resulting in unaligned AI that may fail in various standard ways (treacherous turn, etc.).
• AVL
• There may be no value learning norm that allows an AI to learn and align itself to arbitrary humans.
• There may be insufficient information available to an AI to discover what humanity actually cares about, i.e. not enough detail in observed behavior, brain scans, reports from humans, etc.
• AVL approaches may be prone to overfitting the observed data such that they can't generalize to care about the same things humans would care about in novel situations.
• More specifically, AVL may fail because the AI is unable, for various reasons, to adequately learn/discover what humans care about to become aligned.
• NEA
• We may fail to discover a decision theory that makes an arbitrary agent share human concerns, or no such decision theory exists.
• We may think we discover a decision theory that can produce aligned AI and are wrong about it.
• We may not be able to incrementally verify progress towards alignment via NEA because it's trying to hit a small target many steps down the line by setting the right initial conditions.
• More specifically, NEA may fail because the decision theory used does not sufficiently constrain or incentivize the agent to become or stay aligned and we may not adequately be able to predict if it will fail or succeed sufficiently far in advance to act.

Comparisons

Given the risks of false positives identified above, we can now look to see if we can rank the approaches in terms of false positive risk by assessing if any of those risks dominate the others, i.e. the false positive risks associated with one approach necessarily pose greater risks and thus higher chance of failure than those associated with another. I believe we can, and I make the following arguments.

• risk(AVL) < risk(IDA)
• IDA suffers from the same false positive risks as AVL, in that for IDA to work an AI must infer what humans care about from observing them, but adds the additional risk of not only optimizing for what humans care about but also optimizing for other things while the AI increases in capabilities via iteration. Thus IDA is strictly riskier than AVL in terms of false positives.
• risk(IDA) < risk(NEA)
• IDA depends on humans and AIs iteratively moving towards alignment via small steps, with regular opportunities to stop and check before the AI becomes too powerful to control, whereas NEA does not necessarily afford such opportunities. Thus NEA has higher false positive risk because it must get things right by predicting a longer chain of outcomes in advance rather than incrementally making smaller predictions with opportunities to stop if the AI becomes less aligned.
• risk(AVL) < risk(NEA)
• This is to double check that we are correct that our risk assessments are transitive, since if this fails we end up with a circular "ordering" and have an error in our reasoning somewhere.
• AVL requires that we determine a value learning norm, but otherwise expects to achieve alignment via observing humans, whereas NEA requires determining norms not just for value learning (which I believe would be implied by having a decision theory that could produce an aligned AI) but for all decision processes, thus it requires "writing a larger program" or "defining a more complex algorithm" which is more likely to fail all else equal since it has more "surface area" where failure may arise.
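The transitivity check described in the list above can be made mechanical. The sketch below is one way to do it, assuming each pairwise claim is written as a (safer, riskier) pair: it treats the claims as a directed graph and uses a topological sort, which fails exactly when the claims form a circular "ordering". The function name and data layout are choices made for this example.

```python
from graphlib import CycleError, TopologicalSorter

# Pairwise claims from the comparison, written as (safer, riskier).
claims = [("AVL", "IDA"), ("IDA", "NEA"), ("AVL", "NEA")]

def rank_by_risk(pairs):
    """Return approaches from lowest to highest false positive risk,
    or None if the pairwise claims contain a cycle."""
    graph = {}
    for safer, riskier in pairs:
        graph.setdefault(safer, set())
        graph.setdefault(riskier, set()).add(safer)  # safer must come first
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None

print(rank_by_risk(claims))
```

With the three claims above the result is a single consistent ranking. If the third claim were reversed to ("NEA", "AVL"), the function would return None, signalling an error somewhere in the reasoning.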

Conclusions

Based on the above analysis, I'd argue that Ambitious Value Learning is safer than Iterated Distillation and Amplification is safer than Normative Embedded Agency as approaches to building aligned AI in terms of false positive risk, all else equal. In short, risk(AVL) < risk(IDA) < risk(NEA), or if you like AVL is safer than IDA is safer than NEA, based on false positive risk.

I think there's a lot that could be better about the above analysis. In particular, it's not very specific, and you might argue that I stood up straw versions of each approach that I then knocked down in ways that are not indicative of how specific proposals would work. I didn't get more specific because I'm more confident I can reason about high level approaches than details about specific proposals, and it's unclear which specific proposals are worth learning in enough detail to perform this evaluation, so as a start this seemed like the best option.

Also we have the problem that NEA is not as real an approach as IDA or AVL, with the research I cited as the basis for the NEA approach more likely to augment the IDA or AVL approaches rather than offer an alternative to them. Still, I find including the NEA "approach" interesting if for no other reason than that it points to a class of solutions researchers of the past would have proposed if they were trying to build aligned GOFAI, for example.

Finally, as I said above, my main goal here is to demonstrate the method, not to strongly make the case that AVL is safer than IDA (even though on reflection I personally believe this). My hope is that this inspires others to do more detailed analyses of this type on specific methods to recommend the safest seeming alignment mechanisms, or that it generates enough interest that I'm encouraged to do that work myself. That said, feel free to fight out AVL vs. IDA at the object level in the comments if you like, but if you do, at least try to do so within the framework presented here.


### How ought I spend time?

June 30, 2020 - 19:53
Published on June 30, 2020 4:53 PM GMT

1. grinding through books
2. projects, taking a shot at the possibility of a contribution

Personally, I never maintain 1 and 2 at the same time, and I tend to have >95% of my resources on either one or the other, leaving 5% or less for 3. The vast majority of the last 4 years I've been in 1-mode, my reasoning being that I can't make a contribution if I don't know anything, and I don't want to sink my time into bad projects.

Discuss

June 30, 2020 - 17:50
Published on June 30, 2020 2:50 PM GMT

Two months ago, Somerville made face coverings mandatory, both indoors and out. A few weeks later the state of Massachusetts required them "in public places where social distancing is not possible". A week ago Somerville reduced its requirements in light of the heat: "during the summer months, when you are outside and able to social distance at least six feet from others, you may temporarily remove your face covering but must put it back on when others are nearby."

I was curious what people were actually doing, so while Anna was playing in the "woods" along the edge of the bike path, I gathered some statistics. As each person passed along the path, I tracked mode of transportation (walk, run, bike, scooter/skateboard), apparent gender (female, male, child, unclear), and face covering status (covered, mouth only, removed, absent). Raw data is here.

In forty minutes on June 28th, from 9:41am to 10:21am, I saw 179 people pass. Of these, 73% (131) were masked, 5% (9) had their nose exposed, 15% (27) had masks on their chin or otherwise removed, and 7% (12) had no mask at all.

With a person passing every 13 seconds, the path was a pretty crowded place. I only very rarely saw people putting on their masks, however, when coming close to others. For the analysis below I want to talk about people as being masked or not, and people with masks removed wouldn't qualify. Masks worn mouth-only are less clear, but since a large fraction of transmission seems to be via talking and coughing, I decided to count someone as masked if their mouth was covered, even if their nose was sticking out.

When looking at people by transportation, walkers (74%, 62/84) were a bit less likely to be masked than others (82%, 78/95), but not by much.

Looking by gender/age, women were less likely to be masked (72%, 56/78) than men (85%, 74/87). [1] Children were in between (77%, 10/13), though I didn't count children in bicycle trailers or strollers.
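For anyone who wants to check the arithmetic or repeat the count elsewhere, here is a small sketch that recomputes the reported percentages from the raw counts given in the text. The variable names are invented for this example, and "masked" in the group rates follows the post's definition (mouth covered, so fully masked plus nose exposed).

```python
# Counts as reported in the post.
status = {"masked": 131, "nose exposed": 9, "removed": 27, "no mask": 12}
total = sum(status.values())

def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)

shares = {name: pct(count, total) for name, count in status.items()}
seconds_per_person = 40 * 60 / total  # forty minutes of observation

# (mouth covered, group size) for each group mentioned in the text.
groups = {
    "walkers": (62, 84),
    "non-walkers": (78, 95),
    "women": (56, 78),
    "men": (74, 87),
    "children": (10, 13),
}
rates = {name: pct(covered, size) for name, (covered, size) in groups.items()}

print(total, shares, round(seconds_per_person, 1))
print(rates)
```

The walker and non-walker counts sum to the 140 people with their mouth covered (131 + 9), which matches the definition of "masked" used in the analysis.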

I'm curious what numbers look like in other areas. Trading impressions with friends, it sounds like this is a higher fraction of people wearing masks than in most places?

[1] I categorized people by apparent gender, and one person didn't read immediately as male or female.


### Web AI discussion Groups

June 30, 2020 - 14:22
Published on June 30, 2020 11:22 AM GMT

After the success of Web Taisu, I have decided to organize a similar event.

Here are the rules. This document is an attempt to create common knowledge among the participants.

If you have an idea for a topic to discuss

Post it to reddit on https://www.reddit.com/r/AI_Safety_Meetups/

Comment on these posts to add points that you think are interesting, vote up to indicate that you want to discuss a topic.

Scheduling

On Friday 3 July, about 12:00 am BST, I will look at this subreddit, and select the most popular topics. (If there is no activity on the subreddit, I will announce so here) I will create a google doc where you can fill in what topics you are interested in, and when you are busy.

Timeslots will be between 2:00 pm and 9:30 pm BST (including breaks) (9:00 am to 4:30 pm EDT) (6:00 am to 1:30 pm PDT)

Timeslot 1:

2:00 pm to 3:30 pm BST, 9:00 am to 10:30 am EDT, 6:00 am to 7:30 am PDT.

Timeslot 2:

4:00 pm to 5:30 pm BST, 11:00 am to 12:30 pm EDT, 8:00 am to 9:30 am PDT.

Timeslot 3:

6:00 pm to 7:30 pm BST, 1:00 pm to 2:30 pm EDT, 10:00 am to 11:30 am PDT.

Timeslot 4:

8:00 pm to 9:30 pm BST, 3:00 pm to 4:30 pm EDT, 12:00 pm to 1:30 pm PDT.

Each timeslot will be available on Tuesday 7, Wednesday 8, and Friday 10 of July.
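If you want to double check a slot for your own time zone, a conversion like the following may help. It is a minimal sketch that assumes fixed summer offsets (BST as UTC+1, US Eastern daylight time as UTC-4, US Pacific daylight time as UTC-7), which is what the listed US clock times correspond to in July; the function name and the default day are made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Fixed summer offsets (an assumption; no daylight-saving lookup is performed).
BST = timezone(timedelta(hours=1), "BST")
EDT = timezone(timedelta(hours=-4), "EDT")
PDT = timezone(timedelta(hours=-7), "PDT")

def convert_slot(start_hhmm, end_hhmm, day=7):
    """Convert a BST slot on a July 2020 day into BST, EDT and PDT clock times."""
    result = {}
    for tz in (BST, EDT, PDT):
        times = []
        for hhmm in (start_hhmm, end_hhmm):
            hour, minute = map(int, hhmm.split(":"))
            local = datetime(2020, 7, day, hour, minute, tzinfo=BST).astimezone(tz)
            times.append(local.strftime("%I:%M %p").lstrip("0"))
        result[tz.tzname(None)] = tuple(times)
    return result

# Timeslot 4, given in 24-hour BST.
print(convert_slot("20:00", "21:30"))
```

To add another zone, define one more fixed-offset `timezone` and include it in the loop.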

I will place the exact timeslots, and a link to the google doc here on Friday 3 July.

The doc will close, and a timetable will be placed here on Monday 6 of July, at around 12:00 am BST.

So you need to

now - 12:00 am BST Friday 3 July:

Post ideas on reddit.

12:00 am BST Friday 3 July - 12:00 am BST Monday 6 July:

Carry on discussing the topics on reddit

12:00 am BST Monday 6 July - onwards

Look here for timetable and links to videocalls.

Details

Any participants should be able to join zoom videocall meetings.

Links to the videocalls will be placed at the bottom of this page, and in the relevant reddit post.

Slots will consist of two 40 minute video meetings, separated by a 10 minute break. (The maximum length of call that zoom will allow for free is 40 minutes. "This is not a coincidence because nothing is ever a coincidence." Unsong)

Price: Free.

If you have any miscellaneous questions, or think I have missed something, say so in the comments. If it isn't well organized, or goes pear shaped for some reason, sorry, and I'll try to fix it, but I haven't had much practice organizing stuff yet.

If you want to do a similar unconference later, feel free to copy this text. If this event is successful, I might do another.


### Slow Takeoff: Effect on Outcomes

June 30, 2020 - 04:13
Published on June 29, 2020 10:06 PM GMT

Introduction

In general, people seem to treat slow takeoff as the safer option as compared to classic FOOMish takeoff (see e.g. these interviews, this report, etc). Below, I outline some features of slow takeoff and what they might mean for future outcomes. They do not seem to point to an unambiguously safer scenario, though slow takeoff does seem on the whole likelier to lead to good outcomes.

Social and institutional effect of precursor AI

If there’s a slow takeoff, AI is a significant feature of the world far before we get to superhuman AI.[1] One way to frame this is that everything is already really weird before there’s any real danger of x-risks. Unless AI is somehow not used in any practical applications, the pre-superhuman but still very capable AI will lead to massive economic, technological, and probably social changes.

If we expect significant changes to the state of the world during takeoff, it makes it harder to predict what kinds of landscape the AI researchers of that time will be facing. If the world changes a lot between now and superhuman AI, any work on institutional change or public policy might be irrelevant by the time it matters. Also, the biggest effects may be in the AI community, which would be closest to the rapidly changing technological landscape.

The kinds of work needed if everything is changing rapidly also seem different. Specific organizations or direct changes might not survive in their original, useful form. The people who have thought about how to deal with the sort of problems we might be facing then might be well positioned to suggest solutions, though. This implies that more foundational work might be more valuable in this situation.

While I expect this to be very difficult to predict from our vantage point, one possible change is mass technological unemployment well before superhuman AI. Of course, historically people have predicted technological unemployment from many new inventions, but the ability to replace large fractions of intellectual work may be qualitatively different. If AI approaches human-level at most tasks and is price-competitive, the need for humans reduces down to areas where being biological is a bonus and the few tasks it hasn’t mastered.[2]

The effects of such unemployment could be very different depending on the country and political situation, but historically mass unemployment has often led to unrest. (The Arab Spring, for instance, is sometimes linked to youth unemployment rates.) This makes any attempts at long-term influence that do not seem capable of adapting to this a much worse bet. Some sort of UBI-like redistribution scheme might make the transition easier, though even without a significant increase in income inequality some forms of political or social instability seem likely to me.

From a safety perspective, normalized AI seems like it could go in several directions. On one hand, I can imagine it turning out something like nuclear power plants, where it is common knowledge that they require extensive safety measures. This could happen either after some large-scale but not global disaster (something like Chernobyl), or as a side-effect of giving the AI more control over essential resources (the electrical grid has, I should hope, better safety features than a text generator).

The other, and to me more plausible, scenario is that the gradual adoption of AI makes everyone dismiss concerns as alarmist. This does not seem entirely unreasonable: the more evidence people have that AI becoming more capable doesn’t cause catastrophe, the less likely it seems that a dangerous tipping point still lies ahead.

Historical reaction to dangerous technologies

A society increasingly dependent on AI is unlikely to be willing to halt or scale back AI use or research. Historically, I can think of some cases where we’ve voluntarily stopped the use of a technology, but they mostly seem connected to visible ongoing issues or did not result in giving up any significant advantage or opportunity:

• Pesticides such as DDT caused the near-extinction of several bird species (rather dramatically including the bald eagle).
• Chemical warfare is largely ineffective as a weapon against a prepared army.
• Serious nuclear powers have never reduced their stock of nuclear weapons to the point of significantly reducing their ability to maintain a credible nuclear deterrent. Several countries (South Africa, Belarus, Kazakhstan, Ukraine) have gotten rid of their entire nuclear arsenals.
• Airships are not competitive with advanced planes and were already declining in use before the Hindenburg disaster and other high-profile accidents.
• Drug recalls are quite common and seem to respond easily to newly available evidence. It isn’t clear to me how many of them represent a significant change in the medical care available to consumers.

I can think of two cases in which there was a nontrivial fear of global catastrophic risk from a new invention (nuclear weapons igniting the atmosphere, CERN). Arguably, concerns about recombinant DNA also count. In both of the first two cases, the fears were taken seriously, the investigations found “no self-propagating chain of nuclear reactions is likely to be started” and “no basis for any conceivable threat” respectively, and the invention moved on.

This is a somewhat encouraging track record of not just dismissing such concerns as impossible, but it is not obvious to me whether the projects would have halted had the conclusions been less definitive. There’s also the rather unpleasant ambiguity of “likely” and some evidence of uncertainty in the nuclear project, expanded on here. Of course, the atmosphere remained unignited, but since we unfortunately don’t have any reports from the universe where it did ignite, this doesn’t serve as particularly convincing evidence.

Unlike the technologies listed two paragraphs up, CERN and the nuclear project seem like closer analogies to fast takeoff. There is a sudden danger with a clear threshold to step over (starting the particle collider, setting off the bomb), unlike the risks from climate change or other technological dangers which are often cumulative or hit-based. My guess, based on these very limited examples, is that if it is clear which project poses a fast-takeoff style risk it will be halted if the risk can be shown to have legible arguments behind it and is not easily shown to be highly unlikely. A slow-takeoff style risk, in which capabilities slowly mount, seems more likely to have researchers take each small step without carefully evaluating the risks every time.

Relevance of advanced precursor AIs to safety of superhuman AI

An argument in favor of slow takeoff scenarios being generally safer is that we will get to see and experiment with the precursor AIs before they become capable of causing x-risks.[3] My confidence in this depends on how likely it is that the dangers of a superhuman AI are analogous to the dangers of, say, an AI with 2X human capabilities. Traditional x-risk arguments around fast takeoff are in part predicated on the assumption that we cannot extrapolate all of the behavior and risks of a precursor AI to its superhuman descendant.

Intuitively, the smaller the change in capabilities from an AI we know is safe to an untested variant, the less likely it is to suddenly be catastrophically dangerous. “Less likely”, however, does not mean it could not happen, and a series of small steps, each with a small risk, is not necessarily less dangerous than traversing the same space in one giant leap. Tight feedback loops mean rapid material changes to the AI, and significant change to the precursor AI runs the risk of itself being dangerous, so there is a need for caution at every step, including possibly after it seems obvious to everyone that they’ve “won”.
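To make the "many small steps versus one leap" comparison concrete, here is a minimal sketch. It assumes each step's risk is independent of the others, and the probabilities are made up purely for illustration; neither assumption comes from the post.

```python
def cumulative_risk(per_step_risk: float, steps: int) -> float:
    """Chance that at least one of `steps` independent steps goes wrong."""
    return 1.0 - (1.0 - per_step_risk) ** steps


# Hypothetical numbers: 100 small steps at 0.1% risk each,
# compared with a single leap that carries a 5% risk.
many_small_steps = cumulative_risk(0.001, 100)
one_giant_leap = cumulative_risk(0.05, 1)

print(f"many small steps: {many_small_steps:.3f}")
print(f"one giant leap:   {one_giant_leap:.3f}")
```

Under these illustrative numbers the small steps add up to more total risk than the single leap, so the comparison depends entirely on how per-step risk scales with step size.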

Despite this, I think that engineers who can move in small steps seem more likely to catch anything dangerous before it can turn into a catastrophe. At the very least, if something is not fundamentally different than what they’ve seen before, it would be easier to reason about it.

Reactions to precursor AIs

Even if the behavior of this precursor AI is predictive of the superhuman AI’s, our ability to use this testing ground depends on the reaction to the potential dangers of the precursor AI. Personally, I would expect a shift in mindset as AI becomes obviously more capable than humans in many domains. However, whether this shift in mindset is being more careful or instead abdicating decisions to the AI entirely seems unclear to me.

The way I play chess with a much stronger opponent is very different from how I play with a weaker or equally matched one. With the stronger opponent I am far more likely to expect obvious-looking blunders to actually be a set-up, for instance, and spend more time trying to figure out what advantage they might gain from it. On the other hand, I never bother to check my calculator’s math by hand, because the odds that it’s wrong are far lower than the chance that I will mess up somewhere in my arithmetic. If someone came up with an AI-calculator that gave occasional subtly wrong answers, I certainly wouldn’t notice.

Taking advantage of the benefits of a slow takeoff also requires the ability to have institutions capable of noticing and preventing problems. In a fast takeoff scenario, it is much easier for a single, relatively small project to unilaterally take off. This is, essentially, a gamble on that particular team’s ability to prevent disaster.

In a slow takeoff, I think it is more likely to be obvious that some project(s) seem to be trending in that direction, which increases the chance that if the project seems unsafe there will be time to impose external control on it. How much of an advantage this is depends on how much you trust whichever institutions will be needed to impose those controls.

Some historical precedents for cooperation (or lack thereof) in controlling dangerous technologies and their side-effects include:

• Nuclear proliferation treaties reduce the cost of a zero-sum arms race, but it isn’t clear to me if they significantly reduced the risk of nuclear war.
• Pollution regulations have had very mixed results, with some major successes (eg acid rain) but on the whole failing to avert massive global change.
• Somewhat closer to home, the response to Covid-19 hasn’t been particularly encouraging.
• The Asilomar Conference, which seems to me the most successful of these, involved a relatively small scientific field voluntarily adhering to some limits on potentially dangerous research until more information could be gathered.

Humanity’s track record in this respect seems to me to be decidedly mixed. It is unclear which way the response to AI will go, and it seems likely that it will be dependent on highly local factors.

What is the win condition?

A common assumption I’ve seen is that once there is aligned superhuman AI, the superhuman AI will prevent any unaligned AIs. This argument seems to hinge on the definition of “aligned”, which I’m not interested in arguing here. The relevant assumption is that an AI aligned in the sense of not causing catastrophe and contributing significantly to economic growth is not necessarily aligned in the sense that it will prevent unaligned AIs from occurring, whether its own “descendants” or out of some other project.[4]

I can perfectly well imagine an AI built to (for instance) respect human values like independence and scientific curiosity that, while benevolent in a very real sense, would not prevent the creation of unaligned AIs. A slow takeoff scenario seems to me more likely to contain multiple (many?) such AIs. In this scenario, any new project runs the risk of being the one that will mess something up and end up unaligned.

An additional source of risk is modification of existing AIs rather than the creation of new ones. I would be surprised if we could resist the temptation to tinker with the existing benevolent AI’s goals, motives, and so on. If the AI were programmed to allow such a thing, it would be possible (though I suspect unlikely without gross incompetence, if we knew enough to create the original AI safely in the first place) to change a benevolent AI into an unaligned one.

However, despite the existence of a benevolent AI not necessarily solving alignment forever, I expect us to be better off than in the case of unaligned AI emerging first. At the very least, the first AIs may be able to bargain with or defend us against the unaligned AI.

Conclusion

My current impression is that, while slow takeoff seems on-the-whole safer (and likely implies a less thorny technical alignment problem), it should not be mostly neglected in favor of work on fast takeoff scenarios as implied e.g. here. Significant institutional and cultural competence (and/or luck) seems to be required to reap some of the benefits involved in slow-takeoff. However, there are many considerations that I haven’t addressed and more that I haven’t thought of. Most of the use I expect this to be is as a list of considerations, not as the lead-up to any kind of bottom line.

Thanks to Buck Shlegeris, Daniel Filan, Richard Ngo, and Jack Ryan for thoughts on an earlier draft of this post.

1. I use this everywhere to mean AI far surpassing humans on all significant axes ↩︎

2. See e.g. Robin Hanson’s Economic Growth Given Machine Intelligence ↩︎

3. An additional point is that the technical landscape at the start of takeoff is likely to be very different from the technical landscape near the end. It isn’t entirely clear how far the insights gained from the very first AIs will transfer to the superhuman ones. Pre- and post-machine learning AI, for instance, seem to have very different technical challenges. ↩︎

4. A similar distinction: "MIRI thinks success is guaranteeing that unaligned intelligences are never created, whereas Christiano just wants to leave the next generation of intelligences in at least as good of a place as humans were when building them." Source ↩︎

### AI Benefits Post 2: How AI Benefits Differs from AI Alignment & AI for Good

June 29, 2020 - 20:47
Published on June 29, 2020 5:00 PM GMT

This is a post in a series on "AI Benefits." It is cross-posted from my personal blog. For other entries in this series, navigate to the AI Benefits Blog Series Index page.

This post is also discussed on the Effective Altruism Forum.

For comments on this series, I am thankful to Katya Klinova, Max Ghenis, Avital Balwit, Joel Becker, Anton Korinek, and others. Errors are my own.

If you are an expert in a relevant area and would like to help me further explore this topic, please contact me.

The Values Served by AI Benefits Work

Benefits plans need to optimize for a number of objectives.[1] The foremost is simply maximizing wellbeing. But AI Benefits work has some secondary goals, too. Some of these include:

1. Equality: Benefits are distributed fairly and broadly.[2]
2. Autonomy: AI Benefits respect and enhance end-beneficiaries’ autonomy.[3]
3. Democratization: Where possible, AI Benefits decisionmakers should create, consult with, or defer to democratic governance mechanisms.
4. Modesty: AI benefactors should be epistemically modest, meaning that they should be very careful when predicting how plans will change or interact with complex systems (e.g., the world economy).

These secondary goals are largely inherited from the stated goals of many individuals and organizations working to produce AI Benefits.

Additionally, since the rate of improvements to wellbeing probably decreases with income, the focus on maximizing wellbeing implies a focus on the distributional aspects of Benefits.

How AI Benefits differs from AI Alignment

Another important clarification is that AI Benefits differ from AI Alignment.

Both alignment and beneficiality are ethically relevant concepts. Alignment can refer to several different things. Iason Gabriel of DeepMind provides a useful taxonomy of existing conceptions of alignment. According to Gabriel, “AI alignment” can refer to alignment with:

1. “Instructions: the agent does what I instruct it to do.”
2. “Expressed intentions: the agent does what I intend it to do.”
3. “Revealed preferences: the agent does what my behaviour reveals I prefer.”
4. “Informed preferences or desires: the agent does what I would want it to do if I were rational and informed.”
5. “Interest or well-being: the agent does what is in my interest, or what is best for me, objectively speaking.”
6. “Values: the agent does what it morally ought to do . . . .”

A system can be aligned in most of these senses without being beneficial. Being beneficial is distinct from being aligned in senses 1–4 because those deal only with the desires of a particular human principal, which may or may not be beneficial. Being beneficial is distinct from conception 5 because beneficial AI aims to benefit many or all moral patients. Only AI that is aligned in the sixth sense would be beneficial by definition. Conversely, AI need not be well-aligned to be beneficial (though it might help).

How AI Benefits differs from AI for Good

A huge number of projects exist under the banner of “AI for Good.” These projects are generally beneficial. However, AI Benefits work is different from simply finding and pursuing an AI for Good project.

AI Benefits work aims at helping AI labs craft a long-term Benefits strategy. Unlike AI for Good, which is tied to specific techniques/capabilities (e.g., NLP) in certain domains (e.g., AI in education), AI Benefits is capability- and domain-agnostic. Accordingly, the pace of AI capabilities development should not dramatically alter AI Benefits plans at the highest level (though it may of course change how they are implemented). Most of my work therefore focuses not on concrete beneficial AI applications themselves, but rather on the process of choosing between and improving possible beneficial applications. This meta-level focus is particularly useful at OpenAI, where the primary mission is to benefit the world by building AGI—a technology with difficult-to-foresee capabilities.

1. Multi-objective optimization is a very hard problem. Managing this optimization problem both formally and procedurally is a key desideratum for Benefits plans. I do not think I have come close to solving this problem, and would love input on this point. ↩︎

2. OpenAI’s Charter commits “to us[ing] any influence we obtain over AGI’s deployment to ensure it is used for the benefit of all . . . .” ↩︎

3. OpenAI’s Charter commits to avoiding “unduly concentrat[ing] power.” ↩︎

### Thoughts as open tabs

June 29, 2020 - 15:13
Published on June 29, 2020 12:13 PM GMT

I work better at night. At around midnight, I'm more prone to entering a state of relaxed focus, my mind making more lateral connections when reading or coming up with ideas that will likely have great positive impact on my personal life and work.

This happens after a hard workout too, or after having a stimulating discussion with a trusted, unbiased, and driven friend.

The conflict comes every day when I wake up in the morning, and it's as if the open tabs of useful thoughts in my mind have closed, refreshed, but with no Command+Shift+T to reopen them exactly the way they were. I feel different in the morning; without the vision and subjective "charged" state I felt the night before, my mind doesn't give me a clear path to execute on the useful thoughts from yesterday. I don't have full access to that action-taking state.

It's logical that the subjective impact of thought and motivation changes from day-to-day. I wouldn't want to remember and feel the embarrassment from a social faux pas I made last week, every day. I just want to be able to capture and feel more of the good stuff from the day or week before. To feel the same emotional impact as when I first came across particular insights, so that I'll follow through on that with fervor and persistence.

Is there a way of hacking the brain to get the upside, without incurring the downside?

It's almost as if I existed in two states: one which thinks that it has done enough and deserves surface pleasures and rest - akrasia, and its opposite, non-akrasia - buzzing with quiet energy, working towards fulfillment, constantly alert to ideas that I can implement.

I know that being in a state of non-akrasia for sustained periods of time is possible. There are the Elon Musks of the world, those who work 80 hour work weeks towards their very specific cause of choice. I choose to believe that they were not so much genetically predisposed to do so, as they had something to protect as well as systems that ruthlessly reduced friction and distraction and ritualized the state of Flow.

Colonizing Mars and reversing climate change are likely kept as open tabs at the forefront of Musk's mind: thoughts with deep significance to him that have guided every business or engineering decision he has made.

Given the limitations of working memory, how do I keep my useful thoughts as open tabs after a night's sleep, or at least access the full "History" and restore them? I'd imagine that if I were able to hack my brain to access the long-term-thinking, fulfilment-driven mode of non-akrasian thoughts, there would be exponential gains in the areas of my life that matter - no more instinctive checking and mindless scrolling of social media as a distraction from fulfilling work, no more superstimuli foods to damage physical and mental health.

The cost of giving in to temptation will be greater than the short-term benefits precisely because the open tabs show me that I'm forgoing a much greater physique or career impact if I take the easy path.

More importantly, it will give me the confidence to follow through on all of those thoughts in the exact way I intended to. Being able to remember the full rationale and potential impact of these thoughts, I will be able to execute them without second-guessing myself. Finding a system or trigger to access these thoughts from my subconscious, in order to act with conviction and consistency with my long-term goals - that is my priority at this point of time.

### Optimized Propaganda with Bayesian Networks: Comment on "Articulating Lay Theories Through Graphical Models"

June 29, 2020 - 05:45
Published on June 29, 2020 2:45 AM GMT

Derek Powell, Kara Weisman, and Ellen M. Markman's "Articulating Lay Theories Through Graphical Models: A Study of Beliefs Surrounding Vaccination Decisions" (a conference paper from CogSci 2018) represents an exciting advance in marketing research, showing how to use causal graphical models to study why ordinary people have the beliefs they do, and how to intervene to make them be less wrong.

The specific case our authors examine is that of childhood vaccination decisions: some parents don't give their babies the recommended vaccines, because they're afraid that vaccines cause autism. (Not true.) This is pretty bad—not only are those unvaccinated kids more likely to get sick themselves, but declining vaccination rates undermine the population's herd immunity, leading to new outbreaks of highly-contagious diseases like the measles in regions where they were once eradicated.

What's wrong with these parents, huh?! But that doesn't have to just be a rhetorical question—Powell et al. show how we can use statistics to make the rhetorical hypophorical and model specifically what's wrong with these people! Realistically, people aren't going to just have a raw, "atomic" dislike of vaccination for no reason: parents who refuse to vaccinate their children do so because they're (irrationally) afraid of giving their kids autism, and not afraid enough of letting their kids get infectious diseases. Nor are beliefs about vaccine effectiveness or side-effects uncaused, but instead depend on other beliefs.

To unravel the structure of the web of beliefs, our authors got Amazon Mechanical Turk participants to take surveys about vaccination-related beliefs, rating statements like "Natural things are always better than synthetic alternatives" or "Parents should trust a doctor's advice even if it goes against their intuitions" on a 7-point Likert-like scale from "Strongly Agree" to "Strongly Disagree".

Throwing some off-the-shelf Bayes-net structure-learning software at a training set from the survey data, plus some ancillary assumptions (more-general "theory" beliefs like "skepticism of medical authorities" can cause more-specific "claim" beliefs like "vaccines have harmful additives", but not vice versa) produces a range of probabilistic models that can be depicted with graphs where nodes representing the different beliefs are connected by arrows that show which beliefs "cause" others: an arrow from a naturalism node (in this context, denoting a worldview that prefers natural over synthetic things) to a parental expertise node means that people think parents know best because they think that nature is good, not the other way around.

Learning these kinds of models is feasible because not all possible causal relationships are consistent with the data: if A and B are statistically independent of each other, but each dependent with C (and are conditionally dependent given the value of C), it's kind of hard to make sense of this except to posit that A and B are causes with the common effect C.
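A common effect leaves a recognizable statistical fingerprint: the two causes are uncorrelated on their own, but become correlated once the effect is held fixed. The simulation below is only a sketch of that fingerprint, not the structure-learning software used in the paper. The Gaussian variables, the coefficients, the noise level, and the helper names `pearson` and `partial_corr` are all assumptions made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def partial_corr(r_ab, r_ac, r_bc):
    """Correlation of A and B after linearly controlling for C."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac**2) * (1 - r_bc**2))


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000
a = [rng.gauss(0, 1) for _ in range(n)]  # independent cause A
b = [rng.gauss(0, 1) for _ in range(n)]  # independent cause B
# C is a common effect of A and B, plus some noise.
c = [0.7 * x + 0.3 * y + rng.gauss(0, 0.5) for x, y in zip(a, b)]

r_ab, r_ac, r_bc = pearson(a, b), pearson(a, c), pearson(b, c)
r_ab_given_c = partial_corr(r_ab, r_ac, r_bc)

print(f"corr(A, B)     = {r_ab:+.3f}")          # close to zero
print(f"corr(A, C)     = {r_ac:+.3f}")          # clearly positive
print(f"corr(B, C)     = {r_bc:+.3f}")          # clearly positive
print(f"corr(A, B | C) = {r_ab_given_c:+.3f}")  # clearly negative
```

A chain (A causes C causes B) or a common cause (C causes both A and B) would instead make A and B correlated before conditioning, which is why the pattern above points to the common-effect structure.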

Simpler models with fewer arrows might sacrifice a little bit of predictive accuracy for the benefit of being more intelligible to humans. Powell et al. ended up choosing a model that can predict responses from the test set at r = .825, explaining 68.1% of the variance. Not bad?!—check out the full 14-node graph in Figure 2 on page 4 of the PDF.
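The two figures quoted here appear to be the same fact stated twice, assuming the variance explained is simply the squared correlation (and that r = .825 is itself a rounded value):

```latex
r^2 = (0.825)^2 = 0.680625 \approx 68.1\%
```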

Causal graphs are useful as a guide for planning interventions: the graph encodes predictions about what would happen if you changed some of the variables. Our authors point out that since previous work showed that people's beliefs about vaccine dangers were difficult to influence, that suggests trying to intervene on the other parents of the intent-to-vaccinate node in the model: if the hoi polloi won't listen to you when you tell them the costs are minimal (vaccines are safe), instead tell them about the benefits (diseases are really bad and vaccines prevent disease).

To make sure I really understand this, I want to adapt it into a simpler example with made-up numbers where I can do the arithmetic myself. Let me consider a graph with just three nodes—

Suppose this represents a structural equation model where an anti-vaxxer-leaning parent-to-be's propensity-to-vaccinate-against-measles C is expressed in terms of belief-in-vaccine-safety A and belief-in-measles-danger B as—

C=0.7⋅A+0.3⋅B

And suppose that we're a public health authority trying to decide whether to spend our budget (or what's left of it after recent funding cuts) on a public education initiative that will increase A by 0.1, or one that will increase B by 0.3.

We should choose the program that intervenes on B, because (0.3)(0.3)=0.09 is bigger than (0.7)(0.1)=0.07. That's actionable advice that we couldn't have derived without a quantitative model of how the lay audience thinks. Exciting!
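The comparison above can be sketched in a few lines. The weights and shifts are the made-up numbers from this example; the dictionary layout and variable names are only an illustrative choice.

```python
# Weights from the structural equation C = 0.7*A + 0.3*B.
weights = {"A": 0.7, "B": 0.3}
# How far each candidate program is assumed to move its target belief.
shifts = {"A": 0.1, "B": 0.3}

# In a linear model, the change in C is the weight times the shift.
effects = {name: weights[name] * shifts[name] for name in weights}
best = max(effects, key=effects.get)

for name, effect in effects.items():
    print(f"intervene on {name}: change in C = {effect:.2f}")
print(f"best target: {best}")
```

The linearity matters here: with interactions between A and B, the effect of a shift would depend on the current values of both beliefs, and this one-line product would no longer be enough.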

At this point, some readers may be wondering why I've described this work as "marketing research" about constructing "optimized propaganda." A couple of those words usually have negative connotations, but educating people about the importance of vaccines is a positive thing. What gives?

The thing is, "Learn the causal graph of why they think that and compute how to intervene on it to make them think something else" is a symmetric weapon—a fully general persuasive technique that doesn't depend on whether the thing you're trying to convince them of is true.

In my simplified example, the choice to intervene on B was based on numerical assumptions that amount to the claim that it's sufficiently easier to change B than it is to change A, such that intervening on B is more effective at changing C than intervening on A (even though C depends on A more than it does on B). But this methodology is completely indifferent to what A, B, and C mean. It would have worked just as well, and for the same reasons if the graph had been—

Suppose that we're advertising executives for the Coca-Cola Company trying to decide how to spend our budget (or what's left of it after recent funding cuts). If consumers won't listen to us when we tell them the costs of drinking Coke are minimal (lying that it isn't unhealthy), we should instead tell them about the benefits (Coke tastes good).

Or, with different assumptions about the parameters (maybe C=0.8⋅A+0.2⋅B, actually), intervening to increase belief in "Coca-Cola isn't unhealthy" would be the right move (because (0.8)(0.1)=0.08>0.06=(0.2)(0.3)). The marketing algorithm that just computes which belief changes will flip the decision node has no way to notice or care whether those belief changes are in the direction of more or less accuracy.
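The two parameter settings sit on either side of a simple threshold. As a sketch, write the weight on A as w and assume the weight on B is 1 − w (true of both examples, but an assumption in general), with the same achievable shifts of 0.1 for A and 0.3 for B. Intervening on A is then the better choice exactly when:

```latex
0.1\,w > 0.3\,(1 - w)
\;\Longleftrightarrow\; 0.4\,w > 0.3
\;\Longleftrightarrow\; w > 0.75
```

With w = 0.7 the inequality fails and B is the better target; with w = 0.8 it holds and A is.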

To be clear—and I really shouldn't have to say this—this is not a criticism of Powell–Weisman–Markman's research! The "Learn the causal graph of why they think that" methodology is genuinely really cool! It doesn't have to be deployed as a marketing algorithm: the process of figuring out which belief change would flip some downstream node is the same thing as what we call locating a crux.[1] The difference is just a matter of forwards or backwards direction: whether you first figure out if the measles vaccine or Coca-Cola are safe and then use whatever answer you come up with to guide your decision, or whether you write the bottom line first.

Of course, most people on most issues don't have the time or expertise to do their own research. For the most part, we can only hope that the sources we trust as authorities are doing their best to use their limited bandwidth to keep us genuinely informed, rather than merely computing what signals to emit in order to control our decisions.

If that's not true, we might be in trouble—perhaps increasingly so, if technological developments grant new advantages to the propagation of disinformation over the discernment of truth. In a possible future world where most words are produced by AIs running a "Learn the causal graph of why they think that and intervene on it to make them think something else" algorithm hooked up to a next-generation GPT, even reading plain text from an untrusted source could be dangerous.

1. Thanks to Anna Salamon for this observation. ↩︎

### Abstractions on Inconsistent Data

June 29, 2020 - 03:30
Published on June 29, 2020 12:30 AM GMT

[I’m not sure this makes any sense – it is mostly babble, as an attempt to express something that doesn’t want to be expressed. The ideas here may themselves be an abstraction on inconsistent data. Posting anyway because maybe somebody else will prune it into something useful.]

i. Abterpretations

Abstractions are (or at least are very closely related to) patterns, compression, and Shannon entropy. We take something that isn’t entirely random, and we use that predictability (lack of randomness) to find a smaller representation which we can reason about, and predict. Abstractions frequently lose information – the map does not capture every detail of the territory – but are still generally useful. There is a sense in which some things cannot be abstracted without loss – purely random data cannot be compressed by definition. There is another sense in which everything can be abstracted without loss, since even purely random data can be represented as the bit-string of itself. Pure randomness is in this sense somehow analogous to primeness – there is only one satisfactory function, and it is the identity.
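The link between predictability and compression can be seen with an off-the-shelf compressor. This is only a sketch: `zlib` stands in for "finding a smaller representation", and the pseudo-random bytes stand in for "purely random data". Both are assumptions made for illustration.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)


n = 10_000
patterned = b"ab" * (n // 2)  # highly predictable data
rng = random.Random(0)        # fixed seed so the run is repeatable
pseudo_random = bytes(rng.getrandbits(8) for _ in range(n))

ratio_patterned = compression_ratio(patterned)
ratio_random = compression_ratio(pseudo_random)

print(f"patterned:     {ratio_patterned:.3f}")
print(f"pseudo-random: {ratio_random:.3f}")
```

The pseudo-random bytes do in fact have a short description (the seed plus the generator), but `zlib` has no way to find it, so for this compressor the only workable representation is essentially the data itself.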

A separate idea, heading in the same direction: Data cannot, in itself, be inconsistent – it can only be inconsistent with (or within) a given interpretation. Data alone is a string of bits with no interpretation whatsoever. The bitstring 01000001 is commonly interpreted both as the number 65, and as the character ‘A’, but that interpretation is not inherent to the bits; I could just as easily interpret it as the number 190, or as anything else. Sense data that I interpret as “my total life so far, and then an apple falling upwards”, is inconsistent with the laws of gravity. But the apple falling up is not inconsistent with my total life so far – it’s only inconsistent with gravity, as my interpretation of that data.
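The same eight bits under three readings, as a short sketch. The first two are the conventional ones; the third (flipping every bit before reading the number) is only a guess at one interpretation that would yield 190, and is not stated in the post.

```python
bits = "01000001"

as_unsigned = int(bits, 2)          # read as an unsigned binary number
as_ascii = chr(as_unsigned)         # read as an ASCII character code
as_inverted = int(bits, 2) ^ 0xFF   # read with every bit flipped first (assumed reading)

print(as_unsigned, as_ascii, as_inverted)
```

Nothing in the bitstring picks out one of these readings; the choice lives entirely in the code that does the interpreting.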

There is a sense in which some data cannot be consistently interpreted – purely random data cannot be consistently mapped onto anything useful. There is another sense in which everything can be consistently interpreted, since even purely random data can be consistently mapped onto itself: the territory is the territory. Primeness as an analogue, again.

Abstraction and interpretation are both functions, mapping data onto other data. There is a sense in which they are the same function. There is another sense in which they are inverses. Both senses are true.

ii. Errplanations

Assuming no errors, then one piece of inconsistent data is enough to invalidate an entire interpretation. In practice, errors abound. We don’t throw out all of physics every time a grad student does too much LSD.

Sometimes locating the error is easy. The apple falling up is a hallucination, because you did LSD.

Sometimes locating the error is harder. I feel repulsion at the naive utilitarian idea of killing one healthy patient to save five. Is that an error in my feelings, and I should bite the bullet? Is that a true inconsistency, and I should throw out utilitarianism? Or is that an error in the framing of the question, and No True Utilitarian endorses that action?

Locating the error is meaningless without explaining the error. You hallucinated the apple because LSD does things to your brain. Your model of the world now includes the error. The error is predictable.

Locating the error without explaining it is attributing the error to phlogiston, or epicycles. There may be an error in my feelings about the transplant case, but it is not yet predictable. I cannot distinguish between a missing errplanation and a true inconsistency.

iii. Intuitions

If ethical frameworks are abterpretations of our moral intuitions, then there is a sense in which no ethical framework can be generally true – our moral intuitions do not always satisfy the axioms of preference, and cannot be consistently interpreted.

There is another sense in which there is a generally true ethical framework for any possible set of moral intuitions: there is always one satisfactory function, and it is the identity.

Primeness as an analogue.

### Gödel's Legacy: A game without end

June 28, 2020 - 21:50
Published on June 28, 2020 6:50 PM GMT

Cross-posted at Brick. A fun exploration, hopefully requiring only a medium level of technical familiarity.

"Not with a bang but a whimper"

It's 1930 and Kurt Gödel has just delivered a talk at Königsberg that carpet bombed Hilbert's Program.

...and no one noticed.

Godel's announcement, delivered during the summarizing session on the third and last day of the conference, was so understated and casual—so thoroughly undramatic—that it hardly qualified as an announcement, and no one present, with one exception, paid it any mind at all.
(source)

Well, almost no one. John von Neumann came up to Gödel after the talk, presumably to double check if he was going crazy or if Gödel had in fact just proved the existence of unprovable mathematical truths.

To understand what Gödel's Incompleteness Theorem says, it helps to see the meta-mathematical opinions of some of his peers. Wittgenstein opens his big famous book (not that I've actually read it) with:

One could put the whole sense of the book perhaps in these words: What can be said at all, can be said clearly; and whereof one cannot speak, thereof one must be silent.

This was a bold move in its own way. Contrast it with the following take, "Everything can be spoken of, you just need to get smarter" which is a deeply held ideal of Strawman STEM-Lord. Interestingly, both Wittgenstein and STEM-Lord agree on something, the exact something that Gödel smashed up in Königsberg. It's a very technical claim, best communicated with the following GIF:

Witt and STEM-Lord disagree on what bleeds, but they are in fervent agreement that what bleeds can be killed.

Unraveling Self-reference

Oh shit, it's true and false, or neither, paaaaaaaaaaaaaaaradox! The Wittgenstein-esque approach might be to claim the liar's sentence doesn't actually mean anything. There is no reality to it to speak of, so you can't speak of it. You are just getting yourself tied into a linguistic knot, tricking yourself into thinking that just because you can write this sentence it must mean something. Natural language is not a formal system and this sentence need not have a coherent "logical" "truth value". BOOM, checkmate hypothetical opponent!

Hidden in the explanation "Language isn't meant to avoid paradox and self-reference" is the claim "but formal math does avoid them!" The first part is true, the second isn't.

What Gödel did was to make a logical sentence in the syntax of Peano Arithmetic which had the meaning, "This statement is unprovable in Peano Arithmetic". The two tricky components of this were finding a way to mathematically encode self-reference ("This statement") and to talk about the system from within the system ("is unprovable in Peano Arithmetic"). Both are quite a challenge. If you mess around with the liar's paradox, you'll quickly get into infinite recursion when trying to unravel the self-reference.

(source)

People have known forever that self-reference can often lead to paradox. This is part of why it's easy to have the intuition that full natural language expressions need not have truth values. So if you make your mathematical systems more precise, and more limited than full natural language, then you can avoid self-reference, right?

Part of the sneakiness of Gödel's theorem is that it shows how self-reference sneaks in whether you want it or not. Just like a computer system is not safe merely because you didn't explicitly create a "HaCk ThIs SoFtWaRe" button, a formal mathematical system is not free of self-reference merely because it's not explicitly part of the design. Self-reference isn't just a part of Peano Arithmetic, it's an inevitable feature of any system that has a basic amount of complexity.
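To make "self-reference sneaks in" a bit more concrete, here is a minimal sketch in Python. The names `template`, `program`, and `run_and_capture` are illustrative and not from any source. Python has no "this program" primitive, yet ordinary string formatting is enough to build a program whose output is exactly its own source. This is only an analogy for Gödel's construction, not the construction itself.

```python
import contextlib
import io

# A template that, when filled in with its own repr, yields a full program.
template = 'template = %r\nprint(template %% template)'

# The program text: two lines of ordinary Python, no "self" primitive anywhere.
program = template % template


def run_and_capture(source: str) -> str:
    """Execute `source` in a fresh namespace and return what it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


# Compare the program's output with its own source (print adds a newline).
output = run_and_capture(program)
print(output == program + "\n")
```

Nobody designed string formatting to allow this; it falls out of the language being expressive enough to manipulate its own text.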

If you're having trouble wrapping your head around how something as simple as arithmetic could embed self-reference and the concept of provability, it's helpful to look at programming languages. Brainfuck can express anything that Python can express, because both of them reach a "sufficient level of complexity" called being Turing Complete. Gwern maintains a fun list of surprisingly Turing Complete systems (Pokemon Yellow and CSS are my favorites).

Speaking of programming, this segues nicely into the next topic :)

Turing and the Undecidable

Before there were computers, there were algorithms. Unambiguous, mechanistic, finite procedures which when carried out consistently lead to certain results. Algorithms go way back, but our understanding of how to clearly demarcate them is recent. The phrase "effective procedure" has been used to tag the intuitive and fuzzy grouping that set apart the accessible from the inaccessible. There's clearly a difference, so what's the difference?

A storm of research in the first half of the 1900s built a conclusive answer. They created the abstract mathematical backing of not only computers, but of the idea of algorithms in general. Turing made Turing Machines, Church made lambda calculus, and Kleene made recursion theory. Though very different in taste and texture, all of these systems turned out to capture the exact same thing; you could use any one to fully represent and simulate the others. These systems form the bar that I mentioned earlier; Turing Completeness is being able to do everything a Turing Machine can do.

Just as self-reference and incompleteness sneak in when a formal system gets complex enough, a whole class of "uncomputable/undecidable" problems emerges when a computational system reaches Turing Completeness. These are well formed and clearly defined computational problems which no algorithm can always answer correctly.

The most famous one is the Halting Problem:

Given the source code to a program and its input, determine if the program will ever terminate, or if it will get caught in a loop forever.

Turns out, there's no algorithm which answers this question correctly for all possible source code-input pairs. Each bold word is key to understanding what it means for a problem to be undecidable.

Sometimes people get confused by the halting problem. How can it be undecidable if I can clearly determine that:

def code(x): return x

halts for all inputs. Have I defeated computers? The answer is in the difference between a specific instance of a problem, and the problem class as a whole. "Solve the problem with this specific input" is incredibly different from "Solve the problem in full generality for all possible inputs."
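The gap between "this instance" and "all instances" can be sketched with the standard diagonal argument. The following is a toy model under stated assumptions: programs are zero-argument Python generators where each `yield` counts as one step, and `make_contrarian` and `finishes_within` are hypothetical helper names. Given any candidate decider, we build one specific program on which that decider is wrong.

```python
from typing import Callable, Iterator

Program = Callable[[], Iterator[None]]   # each `yield` is one step of work
Decider = Callable[[Program], bool]      # True means "this program halts"


def make_contrarian(halts: Decider) -> Program:
    """Build a program that does the opposite of what `halts` predicts for it."""
    def contrarian() -> Iterator[None]:
        if halts(contrarian):
            while True:      # predicted to halt, so loop forever
                yield
        # predicted to loop forever, so halt immediately
        return
        yield  # unreachable; its presence keeps this function a generator
    return contrarian


def finishes_within(program: Program, budget: int) -> bool:
    """Run `program` for at most `budget` steps; report whether it finished."""
    steps = program()
    for _ in range(budget):
        try:
            next(steps)
        except StopIteration:
            return True
    return False


# Two (deliberately naive) candidate deciders, each refuted by its contrarian.
for candidate in (lambda p: True, lambda p: False):
    contrarian = make_contrarian(candidate)
    claim = candidate(contrarian)
    observed = finishes_within(contrarian, budget=1000)
    print(claim, observed)
```

The step budget here is only a way to observe behavior we already know by construction. A decider that works by actually running its argument would recurse forever on its own contrarian, which is another way of being wrong.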

Compare this with Gödel's theorems. It's not that all statements are unprovable. It's that there exists a statement that's meaningful and can't be proven. This is not a surface level similarity. Secretly, the halting problem and Gödel's incompleteness theorems are the same thing. It's outside the already wildly expanding scope of this post, but the Curry-Howard Isomorphism is a crazy bridge that connects mathematical proofs to the execution of programs.

Suffice it to say, incompleteness and undecidability are two sides of the same coin, and I will casually flip-flop between them as much as I please. They are the formal refutation of "If it bleeds, we can kill it". They are pieces of mathematics that have been thoroughly and rigorously formalized, and yet provably defy being resolved.

But Wait, there's more!

Turing and Gödel show us that there are monsters out there that bleed and can't be killed, a shocking fact to certain philosophical frames. But mayhaps there are only a few such creatures. Unfortunately, most things that bleed can't be killed.

Rice's theorem proves that every non-trivial property of a program's behavior is undecidable. Not only are there undecidable problems, but there's an infinite hierarchy of increasingly difficult computational problems which couldn't be solved even if you had a magical oracle for the problems one rung down.

These sorts of results are not confined to decidability. With regard to Kolmogorov complexity, most strings are incompressible. Most boolean circuits have exponential complexity. The No-Free-Lunch theorem tells us there is no universal learning algorithm that outperforms all others on all learning tasks. Weierstrass summoned monsters that still roam.
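The incompressibility claim follows from counting: there are 2^n bit strings of length n but fewer than 2^(n-k) descriptions shorter than n-k bits, so at most a 2^-k fraction of strings can be shortened by k bits. As a rough illustration, the sketch below uses `zlib` as a stand-in compressor. This is an assumption for the sake of a runnable example, since Kolmogorov complexity itself is uncomputable, and `compressed_ratio` is a hypothetical helper name.

```python
import random
import zlib


def compressed_ratio(data: bytes) -> float:
    """Size after zlib compression divided by the original size."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
random_bytes = bytes(rng.getrandbits(8) for _ in range(100_000))
structured_bytes = b"abcdefgh" * 12_500  # same length, obvious pattern

print(f"random:     {compressed_ratio(random_bytes):.3f}")
print(f"structured: {compressed_ratio(structured_bytes):.3f}")
```

Strictly speaking the pseudo-random bytes do have a short description (the seed plus the generator), but a general-purpose compressor has no way to find it, which is its own small lesson about the gap between what exists and what can be found.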

Result after result in this vein can leave one feeling... small.

Gödel's Impact

It took some time before Gödel's proof really took hold in the mathematical world. Gödel himself was the sort of guy that the words reticent and reclusive were created to describe, and didn't push for recognition. Luckily, John von Neumann took it upon himself to spread the good word.

Within a few decades, after fully internalizing that their quest of submitting all of math to being provable was doomed, most mathematicians had moved on to other fields. Math departments began to thin. In 1963 UC Berkeley was the last university in the world to offer a mathematics major, which itself was discontinued in 19...

Ha, just kidding, math go brrrrrrrrrr

Despite destroying the contemporary understanding of what math is, we still sent people to space, created the internet, and made things like GPT-3. Why does the fact that most problems in mathematics are unsolvable not seem to matter in the slightest? How is that not a huge roadblock to the development of human knowledge?

Hilbert's Program was part of a large scale drive to create a "secure and unassailable" foundation for mathematics. This intent was in the water-supply ever since Euclid's The Elements, the poster-boy for mathematical rigor, had been tarnished with the discovery of non-euclidean geometry. Mathematicians felt burned by this betrayal, and so began the purging of intuition from math in favor of formal rigor.

But... why the intense fuss if it turns out "a secure and unassailable foundation" is not needed to actually do math? It feels like Gödel pulled a magic trick, yanking the foundations of math out from under itself, yet failing to disturb anything. It makes one a bit suspicious about what a foundation is supposed to be doing if it's not supporting the house.

How to "solve" the halting problem

In tHe ReAl WoRlD few people care about decidability. People "solve" the halting problem all day, every day: wait a few seconds, and if the program hasn't returned, conclude it never will. Even your grandma "solves" the halting problem when she eventually decides to reboot the computer after staring at a frozen Internet Explorer for a few minutes.
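The grandma heuristic is easy to sketch. In this toy model (an assumption, not anything from a real tool), a program is a Python generator that yields once per step, and `grandma_halts`, `collatz`, and `spin` are illustrative names. The function answers "halts" only if the program finishes within a fixed amount of patience.

```python
from typing import Callable, Iterator

Program = Callable[[int], Iterator[None]]  # takes an input, yields once per step


def collatz(n: int) -> Iterator[None]:
    """Halts when n reaches 1 (it does for every n anyone has tried)."""
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        yield


def spin(n: int) -> Iterator[None]:
    """Never halts, whatever the input."""
    while True:
        yield


def grandma_halts(program: Program, arg: int, patience: int) -> bool:
    """Guess 'halts' only if the program finishes within `patience` steps."""
    steps = program(arg)
    for _ in range(patience + 1):
        try:
            next(steps)
        except StopIteration:
            return True
    return False  # gave up waiting: a guess, not a proof


print(grandma_halts(collatz, 27, patience=1000))  # enough patience
print(grandma_halts(collatz, 27, patience=50))    # too impatient: wrong guess
print(grandma_halts(spin, 27, patience=10_000))
```

The heuristic is never wrong when it says "halts" and is sometimes wrong when it says "never". For most practical purposes that asymmetry is perfectly acceptable.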

The boundaries that engineers do care about are complexity bounds. "This problem is not undecidable!" is still useless for application if it turns out the most efficient algorithm won't finish before the heat death of the universe.

So maybe that's it. Fear not the specter of incompleteness, instead cower at the sight of the complex. Wail and gnash your teeth upon finding out the problem you want to solve is NP-Hard.

Once upon a time, I was sitting in office hours asking a professor about how code verification worked:

Me: How this work?

Prof: We've worked out how to represent the code as an SMT and just pass it all to an SMT solver

Me: How do SMT solvers work?

Prof: Often they reduce it to a boolean satisfiability problem and then just use a garden variety SAT solver.

Me: ......wut? SAT is supposed to be NP-hard?!

Prof: laughs oh yeah, it is. But in practice we can handle all sorts of NP-hard problems totally fine.

Me: mind melts out of ears

This completely blew my mind. Suddenly I lived in a new magical world.

How exactly does this work? Stackexchange has a nice answer. I want to home in on restricting the problem. Problem relaxation is a tried and true tactic for solving all kinds of challenges. If X is too hard, solve an easier problem. Sometimes this helps you solve the original problem, but sometimes you realize you didn't actually need to solve the harder problem, and your algorithm for the relaxed problem works just fine on all the problem instances you care about.
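One textbook instance of "restricting the problem" (my choice of example, not one the post names): general SAT is NP-hard, but if every clause is a Horn clause, meaning it has the shape "these variables together imply that one", plain forward chaining decides satisfiability in polynomial time. A minimal sketch, with `HornClause` and `solve_horn` as hypothetical names:

```python
from typing import FrozenSet, List, Optional, Set, Tuple

# A Horn clause "body implies head": every variable in `body` true forces
# `head` true. head=None encodes a constraint "not all of body can be true".
HornClause = Tuple[FrozenSet[str], Optional[str]]


def solve_horn(clauses: List[HornClause]) -> Optional[Set[str]]:
    """Return the minimal set of true variables, or None if unsatisfiable."""
    true_vars: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            if body <= true_vars:          # the clause's premise is satisfied
                if head is None:
                    return None            # a forbidden combination is forced
                if head not in true_vars:
                    true_vars.add(head)
                    changed = True
    return true_vars


rules: List[HornClause] = [
    (frozenset(), "rain"),                   # fact: rain
    (frozenset({"rain"}), "wet"),            # rain implies wet
    (frozenset({"wet", "cold"}), "ice"),     # wet and cold imply ice
]
print(solve_horn(rules))
print(solve_horn(rules + [(frozenset({"wet"}), None)]))  # adds "not wet"
```

Each pass either adds a variable or stops, so the loop runs at most once per variable plus one final pass. If your real-world instances happen to be Horn-shaped, the NP-hardness of general SAT never touches you.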

There's a sneaky difficulty to knowing what you actually want. Maybe you're a food delivery company and you think you need to solve the traveling salesman problem (NP-Hard), but upon further thought you realize this is Manhattan and everything is on a grid! Now you only care about a specific subset of traveling salesman problems, ones that have a very specific structure. Rejoice! This subset is actually tractable! It may even be efficient!

But for a lot of real world scenarios, there is no formalized sub-problem. You get that you're not dealing with the fully general version, but you don't understand your use cases enough to formalize them as a subset. Clearly, there is rich structure to the irl instances of the problems that you're solving, otherwise you wouldn't be making any progress. But until you can figure out exactly what that structure is and re-formalize a relaxation, you're in the position of trying to solve a formally "intractable" problem by trying something that very smart people think is a decent idea, and having it work out most of the time.

That's not to say that things are easy. There are plenty of problems that we can't currently efficiently solve or prove, ones which would be a HUGE boon to human capabilities if resolved. And Harvey Friedman seems to have devoted his whole life to showing how incompleteness sneaks into otherwise natural and normal mathematical domains.

The Vast Mathematical Wilderness

I've been sharing a lot of technical details, but what I really hope is to communicate an aesthetic sense. From Abram Demski in a post on something else:

To explain Gödel's first incompleteness theorem, Douglas Hofstadter likened truth and falsehood to white and black trees which never touch, but which interlace so finely that you can't possibly draw a border between them. [...] A gaping fractal hole extending self-similar tendrils into any sufficiently complex area of mathematics

The secret to math's generality is that we aren't actually dealing with vast swaths of the formal world. We live in a strange and structured bubble near the trunk of Gödel's tree. Remember earlier how I was describing that most formalizable problems are undecidable, and most decidable problems are mired in intractable complexity? And despite that our algorithms work on a freaky amount of the problems we actually interact with? It's because those pessimistic results aren't facts about our little bubble; they're facts about the jagged wastelands that lie beyond our bubble, out in the fractal folds of the wild.

This is both surprising and expected. Borges' Library of Babel is a sprawling haunt that contains every work of literature ever produced, and it's still endless enough to be mostly nonsensical gibberish. Of all the possible ways that letters can be combined, only a small fraction create words. As it is with words, so with math.

Gödel and Turing helped teach us that this wilderness exists. Our experience is what taught us that we live on the edge of this wilderness. We don't know the exact boundary of our little bubble. The bubble contains the problems that we happen to encounter and happen to care about. It's a boundary that's highly contextual and very human. One way to see the advance of knowledge is as the exploration of this boundary. It too will probably have an unresolvable fractal nature.

I could leave it there. "Gödel illuminated a giant wilderness, but luckily we almost never have to deal with it."

Except I don't think that's quite right. There are contexts that force you out of the predictable bubble, and into the wild.

Journey through the Fractal Folds

Self-reference is at the heart of all undecidability and incompleteness. If you're in a standard STEM environment you might think that self-reference is a weird corner case, when really it's what people spend most of their time doing. Predicting adversarial intelligent agents that run on similar hardware is the alt text on life. All agents are embedded slices of reality, but they aren't like other slices of reality; they're exactly the kind of strange-loops that engender self-reference and incompleteness.

The only thing harder than hitting a moving target is hitting a moving target that is watching you trying to hit it. The best move for you depends on what your opponent thinks their best move is, which depends on what you think your best move is; it's Sicilian reasoning all the way up. The classic game theory prescription is to play randomly, which is called a mixed strategy. This is closely related to minimax: "Minimize the maximum damage that the other can do to you".

Both mixed strategies and minimax black-box your opponent. They ignore any information you may have about your opponent and make no effort to anticipate the other's moves. They do so even when they could do better by peeking inside the black box.

You can beat most people who know the rules of chess with Scholar's mate, even though such an opening would let a master destroy you. In rock paper scissors, a lot of people default to a predictable cyclical strategy that you can take advantage of, even though a sufficiently smart player could catch on to your predictable strategy and reliably take advantage of you. Traditional game theory sacrifices any plausible yet uncertain edge you might have for the guarantee that you will never be taken advantage of.
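The trade-off can be made concrete with a small rock paper scissors sketch. Everything here is a toy under stated assumptions: the "predictable opponent" is a bot that strictly cycles rock, paper, scissors, and `uniform_mixed_value` and `exploit_cycler` are hypothetical names. The uniform mixed strategy earns exactly zero in expectation against any move, while a strategy that models the cycler wins nearly every round.

```python
from fractions import Fraction

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
COUNTER = {loser: winner for winner, loser in BEATS.items()}  # what beats each move


def payoff(mine: str, theirs: str) -> int:
    """+1 for a win, -1 for a loss, 0 for a tie."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1


def uniform_mixed_value(theirs: str) -> Fraction:
    """Expected payoff of playing each move with probability 1/3."""
    return sum(Fraction(payoff(m, theirs), 3) for m in MOVES)


def exploit_cycler(rounds: int) -> int:
    """Total score against a bot that cycles rock, paper, scissors, ..."""
    score, last_seen = 0, None
    for i in range(rounds):
        theirs = MOVES[i % 3]
        if last_seen is None:
            mine = "rock"  # no information yet
        else:
            # Assume the opponent advances one step in the cycle, then counter it.
            predicted = MOVES[(MOVES.index(last_seen) + 1) % 3]
            mine = COUNTER[predicted]
        score += payoff(mine, theirs)
        last_seen = theirs
    return score


print([uniform_mixed_value(m) for m in MOVES])
print(exploit_cycler(30))
```

The exploiting strategy is itself perfectly predictable, so an opponent who noticed it could turn the tables just as thoroughly. That is the edge the mixed strategy gives up in exchange for safety.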

In many human affairs, minimaxing and mixed strategies are not enough. Playing randomly might mean you can never be outsmarted, but your expected winnings might still be well below what it takes to survive. Sometimes trying to outsmart the competition may be your only choice.

Though a lot of science is inductive, competitive landscapes are often anti-inductive. There's explicit pressure to keep things unpredictable and incomprehensible. Flirting is anti-inductive:

Now, flirting isn't that crazy. The goal is not to make it so that NO ONE can interest you; it's to move around enough that the only way the other person can succeed is by ditching their canned strategy and paying attention.

The anti-induction of markets is a lot more intense. Warfare even more so. All your predictions of the enemy are based on their past behavior, behavior that may or may not have been done explicitly to fool you. There are no perfectly reliable signals, and you're forced to get down in the weeds and try your best to outcompete others locally, without fully generalized tactics.

None of this is news to people who study strategy. If you want to read more about OODA, here and here are great resources. Alternatively, you can learn everything you need to know by watching these robots:

Look at 1:35 for a perfect fake-out. Two bots are in a headlock, motors at full force. One stalls out. To avoid going off the edge, the other stalls too, and right away the first rams them at full speed, faster than they can reorient. Sumo robots is a game of PURE tempo. Every movement has a perfect counter. The only path to victory is to get inside the other's control loop.

The Ghost of Gödel

Every comment I've made so far about strategy has been legit, but none of it has needed Gödel or Turing to explain it. If anything, using the lens of competitive agents makes the halting problem more understandable. No, to see where undecidability and incompleteness (U&I) come into play, we need to go deeper.

Imagine amending the sumo bot rules: before each bout, both sumo bots are given the other's source code. What changes? This level of apparent determinacy might tempt you into thinking that there could be some way to always win; after all it's just totally predictable code, right? Nope, still undecidable.

There are other games that don't have the multi-polar quality of sumo bots and rock paper scissors, with every move having a perfect counter. Take computer security. This field feels much more like a rising tide than a balanced wheel. Eventually a new offense will be made that renders a lot of previous defense obsolete. Then a new defense will be made that renders a lot of old offense obsolete. Heuristic piles on top of heuristic and the tides rise.

Will this be a never-ending game of spy vs spy as well? Some aspects of security, like cryptography, seem to be a clear victory for defense. On the offense, polymorphic viruses exist that alter their own code while retaining logical identity to avoid detection, virus detection in full generality is impossible, and reliable detection of bounded length viruses is NP-complete.

Recall, with the bulk of engineering we don't fret when our problem is undecidable or NP-hard. We simply do our best and a silly amount of the time we are rewarded. We approximate, apply heuristics, and do what feels right. But that's what we see happen when we tackle stationary or moving targets. Computer security is engineering that's trying to hit a moving target that fights back.

Virus detection being undecidable is exactly what ensures that detection and evasion will be a never-ending progression of increasingly sophisticated heuristics. The fractal tendrils of incompleteness are what provide the inexhaustible landscape for intelligent systems to battle.

When it comes down to it, all games are open games. The rules, boundaries, and scoring all shift with time and intent. Games with rigid boundaries are only possible through the mutual agreement of all players. Soccer only works because everyone has agreed to abide by the rules. The more adversarial the context, the more the boundaries dissolve. Gödel's legacy is to show us that there is no limit to how far the boundaries can dissolve. He guarantees the inexhaustibility of the dark wilderness, a wilderness that becomes the new playing field for life as adversarial actors butt heads.

In this world, there are no fully general answers. Good approximations don't stay good approximations for long. Lacking the certainty and guarantees we are used to, all you can do is apply the full force of your being. Every victory is necessarily circumstantial.

What is called the spirit of the void is where there is nothing. It is not included in man’s knowledge. Of course the void is nothingness. By knowing things that exist, you can know that which does not exist. That is the void.

The Book of Five Rings


### Gary Marcus vs Cortical Uniformity

June 28, 2020 - 21:18
Published on June 28, 2020 6:18 PM GMT

Background / context

I wrote about cortical uniformity last year in Human Instincts, Symbol Grounding, and the Blank Slate Neocortex. (Other LessWrong discussion includes Alex Zhu recently and Jacob Cannell in 2015.) Here was my description (lightly edited, and omitting several footnotes that were in the original):

Instead of saying that the human brain has a vision processing algorithm, motor control algorithm, language algorithm, planning algorithm, and so on, in "Common Cortical Algorithm" (CCA) theory we say that (to a first approximation) we have a massive amount of "general-purpose neocortical tissue", and if you dump visual information into that tissue, it does visual processing, and if you connect that tissue to motor control pathways, it does motor control, etc.

CCA theory, as I'm using the term, is a simplified model. There are almost definitely a couple caveats to it:

1. There are sorta "hyperparameters" on the generic learning algorithm which seem to be set differently in different parts of the neocortex. For example, some areas of the cortex have higher or lower density of particular neuron types. There are other examples too. I don't think this significantly undermines the usefulness or correctness of CCA theory, as long as these changes really are akin to hyperparameters, as opposed to specifying fundamentally different algorithms. So my reading of the evidence is that if you put, say, motor nerves coming out of visual cortex tissue, the tissue could do motor control, but it wouldn't do it quite as well as the motor cortex does.

2. There is almost definitely a gross wiring diagram hardcoded in the genome—i.e., set of connections between different neocortical regions and each other, and other parts of the brain. These connections later get refined and edited during learning. Again, we can ask how much the existence of this innate gross wiring diagram undermines CCA theory. How complicated is the wiring diagram? Is it millions of connections among thousands of tiny regions, or just tens of connections among a few regions? Would the brain work at all if you started with a random wiring diagram? I don't know for sure, but for various reasons, my current belief is that this initial gross wiring diagram is not carrying much of the weight of human intelligence, and thus that this point is not a significant problem for the usefulness of CCA theory. (This is a loose statement; of course it depends on what questions you're asking.) I think of it more like: if it's biologically important to learn a concept space that's built out of associations between information sources X, Y, and Z, well, you just dump those three information streams into the same part of the cortex, and then the CCA will take it from there, and it will reliably build this concept space. So once you have the CCA nailed down, it kinda feels to me like you're most of the way there....

Marcus et al.'s challenge

Now, when I was researching that post last year, I had read one book chapter opposed to cortical uniformity and another book chapter in favor of cortical uniformity, which were a good start, but I've been keeping my eye out for more on the topic. And I just found one! In 2014 Gary Marcus, Adam Marblestone, and Thomas Dean wrote a little commentary in Science Magazine called The Atoms of Neural Computation, with a case against cortical uniformity.

Out of the various things they wrote, one stands out as the most substantive and serious criticism: They throw down a gauntlet in their FAQ, with a table of 10 fundamentally different calculations that they think the neocortex does. Can one common cortical algorithm really subsume or replace all those different things?

Well, I accept the challenge!!

But first, I'd better say something about what the common cortical algorithm is and does, with the caveat that nobody knows all the details, and certainly not me. (The following paragraph is mostly influenced by reading a bunch of stuff by Dileep George & Jeff Hawkins, along with miscellaneous other books and papers that I've happened across in my totally random and incomplete neuroscience and AI self-education.)

The common cortical algorithm (according to me, and leaving out lots of aspects that aren't essential for this post) is an algorithm that builds a bunch of generative models, each of which consists of predictions that other generative models are on or off, and/or predictions that input channels (coming from outside the neocortex—vision, hunger, etc.) are on or off. ("It's symbols all the way down.") All the predictions are attached to confidence values, and both the predictions and confidence values are, in general, functions of time (or of other parameters ... again, I'm glossing over details here). The generative models are compositional, because if two of them make disjoint and/or consistent predictions, you can create a new model that simply predicts that both of those two component models are active simultaneously. For example, we can snap together a "purple" generative model and a "jar" generative model to get a "purple jar" generative model. Anyway, we explore the space of generative models, performing a search with a figure-of-merit that kinda mixes self-supervised learning, model predictive control, and Bayesian(ish) priors. Among other things, this search process involves something at least vaguely analogous to message-passing in a probabilistic graphical model.
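The "snap together" step can be illustrated with a deliberately crude sketch. Everything in it is an assumption made for illustration: a model is reduced to a dictionary of on/off predictions with confidences, time dependence is ignored, the rule for merging confidences (take the larger) is arbitrary, and `Model` and `compose` are hypothetical names. It only shows the rule that two models compose when their predictions are disjoint or consistent.

```python
from typing import Dict, Optional, Tuple

# A toy "generative model": for each channel it has an opinion about,
# a predicted on/off state plus a confidence in [0, 1].
Model = Dict[str, Tuple[bool, float]]


def compose(a: Model, b: Model) -> Optional[Model]:
    """Merge two models if their predictions are disjoint or consistent."""
    merged: Model = dict(a)
    for channel, (state, confidence) in b.items():
        if channel in merged:
            other_state, other_confidence = merged[channel]
            if other_state != state:
                return None  # contradictory predictions: cannot snap together
            # Arbitrary choice for this sketch: keep the higher confidence.
            confidence = max(confidence, other_confidence)
        merged[channel] = (state, confidence)
    return merged


purple: Model = {"color_purple": (True, 0.9), "color_green": (False, 0.9)}
green: Model = {"color_green": (True, 0.9), "color_purple": (False, 0.9)}
jar: Model = {"shape_jar": (True, 0.8), "holds_things": (True, 0.6)}

purple_jar = compose(purple, jar)      # disjoint predictions: composes
impossible = compose(purple, green)    # contradictory predictions: None
print(purple_jar)
print(impossible)
```

The same contradiction check is, loosely, what lets competing models suppress each other: "purple" and "green" cannot both be active for the same object.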

OK, now let's dive into the Marcus et al. FAQ list:

• Marcus et al.'s computation 1: "Rapid perceptual classification", potentially involving "Receptive fields, pooling and local contrast normalization" in the "Visual system"

I think that "rapid perceptual classification" naturally comes out of the cortical algorithm, not only in the visual system but also everywhere else.

In terms of "rapid", it's worth noting that (1) many of the "rapid" responses that humans do are not done by the neocortex, and (2) the cortical message-passing algorithm supposedly involves both faster, less-accurate neural pathways (which prime the most promising generative models), as well as slower, more-accurate pathways (which, for example, properly do the "explaining away" calculation).

• Marcus et al.'s computation 2: "Complex spatiotemporal pattern recognition", potentially involving "Bayesian belief propagation" in "Sensory hierarchies"

The message-passing algorithm I mentioned above is either Bayesian belief propagation or something approximating it. Contra Marcus et al., Bayesian belief propagation is not just for spatiotemporal pattern recognition in the traditional sense; for example, to figure out what we're looking at, the Bayesian analysis incorporates not only the spatiotemporal pattern of visual input data, but also semantic priors from our other senses and world-model. Thus if we see a word with a smudged letter in the middle, we "see" the smudge as the correct letter, even when the same smudge by itself would be ambiguous.

• Marcus et al.'s computation 3: "Learning efficient coding of inputs", potentially involving "Sparse coding" in "Sensory and other systems"

I think that not just sensory inputs but every feedforward connection in the neocortex (most of which are neocortex-to-neocortex) has to be re-encoded into the data format that the neocortex knows what to do with, i.e. different possible forward inputs correspond to stimulation of different sparse subsets out of a pool of receiving neurons, wherein the sparsity is relatively uniform, and where all the receiving neurons in the pool are stimulated a similar fraction of the time (for efficient use of computational resources). So, Jeff Hawkins has a nice algorithm for this re-encoding process and again, I would put this (or something like it) as an interfacing ingredient on every feedforward connection in the neocortex.

• Marcus et al.'s computation 4: "Working memory", potentially involving "Continuous or discrete attractor states in networks" in "Prefrontal cortex"

To me, the obvious explanation is that active generative models fade away gradually when they stop being used, rather than turning off abruptly. Maybe that's wrong, or there's more to it than that; I haven't really looked into it.

• Marcus et al.'s computation 5: "Decision making", potentially involving "Reinforcement learning of action-selection policies in PFC/BG system" and "winner-take-all networks" in "prefrontal cortex"

I didn't talk about neural implementations in my post on how generative models are selected, but I think reinforcement learning (process (e) in that post) is implemented in the basal ganglia. As far as I understand, the basal ganglia just kinda listens broadly across the whole frontal lobe of the neocortex (the home of planning and motor control), and memorizes associations between arbitrary neocortical patterns and associated rewards, and then it can give a confidence-boost to whatever active neocortical pattern is anticipated to give the highest reward.

Winner-take-all is a combination of that basal ganglia mechanism, and the fact that generative models suppress each other when they make contradictory predictions.

• Marcus et al.'s computation 6: "Routing of information flow", potentially involving "Context-dependent tuning of activity in recurrent network dynamics, shifter circuits, oscillatory coupling, modulating excitation / inhibition balance during signal propagation", "common across many cortical areas"

Routing of information flow is a core part of the algorithm: whatever generative models are active, they know where to send their predictions (their message-passing messages).

I think it's more complicated than that in practice thanks to a biological limitation: I think the parts of the brain that work together need to be time-synchronized for some of the algorithms to work properly, but time-synchronization is impossible across the whole brain at once because the signals are so slow. So there might be some complicated neural machinery to dynamically synchronize different subregions of the cortex when appropriate for the current information-routing needs. I'm not sure. But anyway, that's really an implementation detail, from a high-level-algorithm perspective.

As usual, it's possible that there's more to "routing of information flow" that I don't know about.

• Marcus et al.'s computation 7: "Gain control", potentially involving "Divisive normalization", "common across many cortical areas"

I assume that divisive normalization is part of the common cortical algorithm; I hear it's been observed all over the neocortex and even hippocampus, although I haven't really looked into it. Maybe it's even implicit in that Jeff Hawkins feedforward-connection-interface algorithm I mentioned above, but I haven't checked.
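For readers who haven't met the term, here is a minimal sketch of divisive normalization in one common textbook form: each unit's drive, raised to a power, is divided by a constant plus the pooled drive of all units. The exponent of 2, the `sigma` default, and the function name are assumptions made for illustration, and nothing here is a claim about how the cortex or any particular model implements it.

```python
from typing import List


def divisive_normalization(drives: List[float], sigma: float = 1.0,
                           exponent: float = 2.0) -> List[float]:
    """Divide each unit's (exponentiated) drive by the pooled drive plus sigma.

    Assumes non-negative drives, as with firing rates.
    """
    powered = [d ** exponent for d in drives]
    pool = sigma ** exponent + sum(powered)
    return [p / pool for p in powered]


weak = divisive_normalization([1.0, 2.0])
strong = divisive_normalization([10.0, 20.0])  # same pattern, 10x the input
print(weak)
print(strong)
```

The gain-control property is that the tenfold stronger input produces only modestly larger responses, bounded below 1 in total, while the ratio between the two units stays the same.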

• Marcus et al.'s computation 8: "Sequencing of events over time", potentially involving "Feed-forward cascades" in "language and motor areas" and "serial working memory" in "prefrontal cortex"

I think that every part of the cortex can learn sequences; as I mentioned, that's part of the data structure for each of the countless generative models built by the cortical algorithm.

Despite what Marcus implies, I think the time dimension is very important even for vision, despite the impression we might get from ImageNet-solving CNNs. There are a couple reasons to think that, but maybe the simplest is the fact that humans can learn the "appearance" of an inherently dynamic thing (e.g. a splash) just as easily as we can learn the appearance of a static image. I don't think it's a separate mechanism.

(Incidentally, I started to do a deep dive into vision, to see whether it really needs any specific processing different than the common cortical algorithm as I understand it. In particular, the Dileep George neocortex-inspired vision model has a lot of vision-specific stuff, but (1) some of it is stuff that could have been learned from scratch, but they put it in manually for their convenience (this claim is in the paper, actually), and (2) some of it is stuff that fits into the category I'm calling "innate gross wiring diagram" in that block-quote at the top, and (3) some of it is just them doing a couple things a little bit different from how the brain does it, I think. So I wound up feeling like everything seems to fit together pretty well within the CCA framework, but I dunno, I'm still hazy on a number of details, and it's easy to go wrong speculating about complicated algorithms that I'm not actually coding up and testing.)

• Marcus et al.'s computation 9: "Representation and transformation of variables", potentially involving "population coding" or a variant in "motor cortex and higher cortical areas"

Population coding fits right in as a core part of the common cortical algorithm as I understand it, and as such, I think it is used throughout the cortex. The original FAQ table also mentions something about dot products here, which is totally consistent with some of the gory details of (my current conception of) the common cortical algorithm. That's beyond the scope of this article.

• Marcus et al.'s computation 10: "Variable binding", potentially involving "Indirection" in "PFC / BG loops" or "Dynamically partitionable autoassociative networks" or "Holographic reduced representations" in "higher cortical areas"

They clarify later that by "variable binding" they mean "the transitory or permanent tying together of two bits of information: a variable (such as an X or Y in algebra, or a placeholder like subject or verb in a sentence) and an arbitrary instantiation of that variable (say, a single number, symbol, vector, or word)."

I say, no problem! Let's go with a language example.

I'm not a linguist (as will be obvious), but let's take the sentence "You jump". There is a "you" generative model which (among other things) makes a strong prediction that the "noun" generative model is also active. There is a "jump" generative model which (among other things) makes a strong prediction that the "verb" generative model is also active. Yet another generative model predicts that there will be a sentence in which a noun will be followed by a verb, with the noun being the subject. So you can snap all of these ingredients together into a larger generative model, "You jump". There you have it!

Again, I haven't thought about it in any depth. At the very least, there are about a zillion other generative models involved in this process that I'm leaving out. But the question is, are there aspects of language that can't be learned by this kind of algorithm?

Well, some weak, indirect evidence that this kind of algorithm can learn language is the startup Gamalon, which tries to do natural language processing using probabilistic programming with some kind of compositional generative model, and it works great. (Or so they say!) Here's their CEO Ben Vigoda describing the technology on YouTube, and don't miss their fun probabilistic-programming drawing demo starting at 29:00. It's weak evidence because I very much doubt that Gamalon uses exactly the same data structures and search algorithms as the neocortex; I think they're only vaguely similar. (But I feel strongly that it's way more similar to the neocortex than a Transformer or RNN is, at least in the ways that matter.)

Conclusion

So, having read the Marcus et al. paper and a few of its references, it really didn't move me at all away from my previous opinion: I still think the Common Cortical Algorithm / Cortical Uniformity hypothesis is basically right, modulo the caveats I mentioned at the top. (That said, I wasn't 100% confident about that hypothesis before, and I'm still not.) If anyone finds the Marcus et al. paper more convincing than I did, I'd love to talk about it!

### The Illusion of Ethical Progress

June 28, 2020 - 12:33
Published on June 28, 2020 9:33 AM GMT

Here are two statements I used to believe.

1. The world's ethical systems have generally improved over time.
2. It follows that ethical systems probably will continue to improve into the future.

I think the first statement is an illusion. If the first statement is untrue then the second statement cannot follow from the first.

What does it mean for an ethical system to get "better"? Physics contains no such thing.

Take the universe and grind it down to the finest powder and sieve it through the finest sieve and then show me one atom of justice, one molecule of mercy. And yet… and yet you act as if there is some ideal order in the world, as if there is some… some rightness in the universe by which it may be judged.

― Terry Pratchett, Hogfather

To judge the quality of an ethical system you must do so through your own ethical system. Ethics are like Minkowski spacetime. You cannot judge ethics in absolute terms. You can only judge an ethical position relative to your own.

A universal standard of ethics must have practical utility in every society at every point in history. Today's fashions often judge ethics by its internal coherence (untenable in traditional Japan[1]) or universality (untenable in tribal pastoralist cultures[2]).

If you believe your society (or somewhere nearby you in ideatic space) is the pinnacle of ethical evolution then what is more likely?

1. Your society objectively is the pinnacle of ethical evolution.
2. You judge every ethical system by its distance to your own.

An ethical system similar to your own often seems like a "good ethical system". The illusion of ethical progress follows from this subjective metric. If ethical systems are one-dimensional then morals will appear to be getting better as often as they get worse. (Except for very recent history which will appear to have improved.) But ethical systems have many dimensions.

In the above picture you can see random walks through 3-dimensional space, representing 3 universes with 3 separate ethical evolutions. The higher the dimensionality of ethical space, the less likely an ethical system will walk back to a previous state and thus the more likely ethical evolution will appear to have a direction. Each 3-dimensional path appears to be going from one place to another even though they are all completely random. The more dimensions an ethical space has, the harder it is to distinguish a random walk from progress. Real ethical space has many more than 3 dimensions.
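The claim that higher-dimensional walks rarely revisit where they have been can be checked with a small simulation. What follows is only a rough sketch under stated assumptions, not the code behind the picture: it assumes a simple lattice walk (each step moves one unit along one randomly chosen axis), it uses "returns to the starting point" as a stand-in for "walks back to a previous state", and the function names, step count, and trial count are all made up for illustration.

```python
import random


def returns_to_start(dims, steps, rng):
    """Run one lattice walk; report whether it ever revisits its starting point.

    Assumed step rule (hypothetical, chosen for simplicity): pick one axis
    uniformly at random and move +1 or -1 along it.
    """
    position = [0] * dims
    for _ in range(steps):
        axis = rng.randrange(dims)
        position[axis] += rng.choice((-1, 1))
        if not any(position):  # every coordinate is zero again
            return True
    return False


def return_fraction(dims, steps=200, trials=500, seed=0):
    """Fraction of walks that come back to the start within `steps` steps."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    hits = sum(returns_to_start(dims, steps, rng) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for dims in (1, 3, 10):
        print(f"{dims:>2} dimensions: {return_fraction(dims):.2f} of walks returned")
```

Under these assumptions, the fraction of walks that return should fall steeply as the number of dimensions grows, which is the sense in which a purely random high-dimensional walk looks as though it is "going somewhere".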

Does this mean ethics is fundamentally relative?

No

Ethics is fundamentally subjective, but not relative.

In the Western intellectual tradition, ethics is a branch of philosophy. Western philosophy has no place for empiricism. Without empirical results, there is no way to compare ethical systems objectively against each other. Progress is indistinguishable from a random walk.

But there is a way to observe ethics in absolute terms. It is called "mysticism".

Have you ever noticed how Abraham, Jesus, Mohammad, Siddhartha and Ryokan all had a habit of going alone into the wilderness for several days at a time? Then they came back and made ethical pronouncements and people listened to them? The great mystics cut through the Gordian Knot of moral relativism by approaching ethics empirically.

The Snowmass Contemplative Group

In the early 1980s Father Thomas Keating, a Catholic priest, sponsored a meeting of contemplatives from many different religions. The group represented a few Christian denominations as well as Zen, Tibetan, Islam, Judaism, Native American & Nonaligned. They found the meeting very productive and decided to have annual meetings. Each year they have a meeting at a monastery of a different tradition, and share the daily practice of that tradition as a part of the meetings. The purpose of the meetings was to establish what common understandings they had achieved as a result of their diverse practices. The group has become known as the Snowmass Contemplative Group because the first of these meetings was held in the Trappist monastery in Snowmass, Colorado.

When scholars from different religious traditions meet, they argue endlessly about their different beliefs. When contemplatives from different religious traditions meet, they celebrate their common understandings. Because of their direct personal understanding, they were able to comprehend experiences which in words are described in many different ways. The Snowmass Contemplative Group has established seven Points of Agreement that they have been refining over the years:

1. The potential for enlightenment is in every person.
2. The human mind cannot comprehend ultimate reality, but ultimate reality can be experienced.
3. The ultimate reality is the source of all existence.
4. Faith is opening, accepting & responding to ultimate reality.
5. Confidence in oneself as rooted in the ultimate reality is the necessary corollary to faith in the ultimate reality.
6. As long as the human experience is experienced as separate from the ultimate reality it is subject to ignorance, illusion, weakness and suffering.
7. Disciplined practice is essential to the spiritual journey, yet spiritual attainment is not the result of one's effort but the experience of oneness with ultimate reality.

Saints and Psychopaths by William L. Hamilton

You cannot "judge" an ethical system objectively. But you can observe it objectively and you can measure it objectively. Such empiricism once formed the foundation for the Age of Reason. Mystics are less like moral philosophers arguing doctrine than they are scientists reconciling separate experiments.