Вы здесь

Новости LessWrong.com

Подписка на Лента Новости LessWrong.com Новости LessWrong.com
A community blog devoted to refining the art of rationality
Обновлено: 34 минуты 33 секунды назад

Neck abacus

18 февраля, 2021 - 10:40
Published on February 18, 2021 7:40 AM GMT

My points for anxiety system continued to help, but was encumbered by the friction of getting my phone out to mark points. Thus I have turned to wearable abaci.

I made this necklace according to very roughly this picture, using knitting wool and beads I bought at one point to use as virtual in-house currency and now found in a box in my room. It works well! The beads don’t shift unless I move them, which is easy and pleasing. It seems clearly more convenient than my phone. (Plus, I can show off to those in the know that I have 4 or 6 or 24 or 26 of something!) I am also for now reminded when I look in a mirror to consider whether I can get a point, which is currently a plus.

You can buy bracelet versions in a very small number of places online, and also keychain or general hanging clip-on versions, but I don’t think I saw necklaces anywhere. This seems striking, given the clear superiority to a phone counter for me so far, and the likely scale of phone counter usage in the world.



Discuss

Shortcuts With Chained Probabilities

18 февраля, 2021 - 05:00
Published on February 18, 2021 2:00 AM GMT

Let's say you're considering an activity with a risk of death of one in a million. If you do it twice, is your risk two in a million?

Technically, it's just under:

1 - (1 - 1/1,000,000)^2 = ~2/1,000,001 This is quite close! Approximating 1 - (1-p)^2 as p*2 was only off by 0.00005%.

On the other hand, say you roll a die twice looking for a 1:

1 - (1 - 1/6)^2 = ~31% The approximation would have given: 1/6 * 2 = ~33% Which is off by 8%. And if we flip a coin looking for a tails: 1/2 * 2 = 100% Which is clearly wrong since you could get heads twice in a row.

It seems like this shortcut is better for small probabilities; why?

If something has probability p, then the chance of it happening at least once in two independent tries is:

1 - (1-p)^2 = 1 - (1 - 2p + p^2) = 1 - 1 + 2p - p^2 = 2p - p^2 If p is very small, then p^2 is negligible, and 2p is only a very slight overestimate. As it gets larger, however, skipping it becomes more of a problem.

This is the calculation that people do when adding micromorts: you can't die from the same thing multiple times, but your chance of death stays low enough that the inaccuracy of naively combining these probabilities is much smaller than the margin of error on our estimates.

Comment via: facebook



Discuss

AXRP Episode 4 - Risks from Learned Optimization with Evan Hubinger

18 февраля, 2021 - 03:03
Published on February 18, 2021 12:03 AM GMT

Google Podcasts link

Daniel Filan: Today, I'll be talking to Evan Hubinger about risks from learned optimization. Evan is a research fellow at the Machine Intelligence Research Institute, was previously an intern at OpenAI working on theoretical AI safety research with Paul Christiano. He's done software engineering work at Google, Yelp and Ripple, and also designed the functional programming language, Coconut. We're going to be talking about the paper "Risks from Learned Optimization in Advanced Machine Learning Systems". The authors are Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse and Scott Garrabrant. So Evan, welcome to AXRP.

Evan Hubinger: Thank you for having me Daniel.

Daniel Filan: So the first thing I wanted to say about this paper is it has a nice glossary at the end with what various terms mean. And I think this is unusual in machine learning papers. So just straight off the bat, I wanted to firstly thank you for that and let potential readers of the paper know that it's maybe easier to read than they might think.

Evan Hubinger: Thank you. We worked hard on that. I think one of the things that we set out to do in the paper was to define a lot of things and create a lot of terminology. And so in creating a lot of terminology, it's useful to have a reference for that terminology, and so we figured that would be helpful for people trying to go back and look at all of the terms that we're using.

Evan Hubinger: And terms evolve over time. It's not necessarily the case that the same terms that we use in our paper are going to be used in the future or people might use them differently, but we wanted to have a clear reference for, "We're going to use a lot of words. Here are the words, here's what they mean and here's how we're defining them, here's how we're using them." So at least if you're looking at the paper, you could read it in context and understand what it's saying and what the terms mean and everything.

Daniel Filan: Yeah, that makes sense. So just a heads up for the listeners, in this episode, I'm mostly going to be asking you questions about the paper rather than walking through the paper. That being said, just to start off with, I was wondering if you could briefly summarize what's the paper about, and what's it trying to do?

Evan Hubinger: Yeah, for sure. So essentially the paper is looking at trying to understand what happens when you train a machine learning system, and you get some model, and that model is itself doing optimization. It is itself trying to do some search, it's trying to accomplish some objective. And it's really important to distinguish between---and this is one of the main distinctions that the paper tries to draw, is the distinction between the loss function that you train your model on, and the actual objective that your model learns to try to follow internally.

Evan Hubinger: And so we make a bunch of distinctions, we try to define these terms where specifically we talk about these two alignment problems. The outer alignment problem, which is the problem of aligning the human goals, the thing you're trying to get it to do with the loss function. And the inner alignment problem, which is how do you align the actual objective that your model might end up pursuing with the loss function. And we call this actual objective the mesa-objective, and we call a model which is itself doing optimization a mesa-optimizer.

Daniel Filan: What's mesa, by the way?

Evan Hubinger: Yes, that's a very good question. So mesa is Greek. It's a prefix and it's the opposite of meta. So meta is Greek for above, mesa is Greek for below. The idea is that if you could think about meta optimization in the context of machine learning, meta optimization is when I have some optimizer, a gradient descent process, a training process or whatever. And I create another optimizer above that optimizer, a meta optimizer to optimize over it.

Evan Hubinger: And we want to look at the opposite of this. We want to look at the situation in which I have a training process. I have a loss function. I have this gradient descent process. What happens when that gradient descent process itself becomes a meta optimizer? What happens when the thing which is optimizing over, the model, the actual training neural network itself develops an optimization procedure. And so we want to call that mesa-optimization for it's the opposite of meta-optimization instead of jumping and having optimization one level above what we're expecting, we have optimization one level below.

Daniel Filan: Cool, cool. So one question I have about this paper is, you're thinking about optimization, where's optimization happening, what's it doing. What's optimization?

Evan Hubinger: Yeah, so that's a great question. And I think this is one of the sorts of things things that is notoriously difficult to define, and there's a lot of definitions that have existed out there, with different people trying to define it in different ways. You have things like Daniel Dennett's intentional stance, you have things like Alex Flint recently had a post on the Alignment Forum going into this.

Evan Hubinger: In risks from learned optimization, we define optimization in a very mechanistic way where we're like, "Look, a system is an optimizer if it is internally doing some sort of search according to some objective, searching through a space of possible plans, strategies, policies, actions, whatever."

Evan Hubinger: And the argument for this is, well once you have a model which is doing search, it has some large space of things and it's trying to look over those things and select them according to some criterion, that criterion has become an objective of sorts, because now the way in which its output is produced, has a large amount of optimization power, a large amount of selection going into choosing which things satisfy this criterion. And so we really want it to be the case that whatever that criterion that exists there is one that we like.

Evan Hubinger: And furthermore, it's the sort of thing which we argue you should actually expect, that you will learn models which have this sort of search, because search generalizes really well, it's a relatively simple procedure. And then finally, having this mechanistic definition lets us make a lot of arguments and get to a point where we can understand things in a way where if we just adopted a behavioral definition like the intentional stance for example, we wouldn't be able to do it. Because we want to be able to say things like, "How likely is it that when I run a machine learning process, when I do a training process, I will end up with a model that is an optimizer?"

Evan Hubinger: And it's pretty difficult to say that if you don't know what an optimizer looks like internally. Because I can say, "Well, how likely is it that I'll get something which is well described as an optimizer." And that's a much more difficult question to answer because you don't know the process of gradient descent, the process of machine learning, it selects on the internals, it selects on ... we have all of these implicit inductive biases which might be pushing you in one direction and then you have this loss function which is trying to ensure that the behavior of the model has particular properties.

Evan Hubinger: And to be able to answer whether or not those inductive biases are going to push you in one direction or another, you need to know things like, "How simple is the thing which you are looking for according to those inductive biases? How likely is it that the process of gradient descent is going to end up in a situation where you end up with something that satisfies that?"

Evan Hubinger: And I think the point where we ended up was not that there's something wrong necessarily with the behavioral stance of optimization, but just that it's very difficult to actually understand from a concrete perspective, the extent to which you will get optimizers when you do machine learning if you're only thinking of optimizers behaviorally, because you don't really know to what extent those optimizers will be simple, to what extent will they be favored by gradient descent and those inductive biases and stuff. And so by having a mechanistic definition, we can actually investigate those questions in a much more concrete way.

Daniel Filan: One thing that strikes me there is, if I were thinking about what's easiest to learn about what neural networks will be produced by gradient descent, I would think, "Well, gradient descent, it's mostly selecting on behavior, right?" It's like, "Okay, how can I tweak this neural network to produce behavior that's more like what I want." Or outputs or whatever is appropriate to the relevant problem. And famously, we currently can't say very much about how the internals of neural networks work. Hopefully that's changing, but not that much.

Evan Hubinger: Yeah, that's absolutely right.

Daniel Filan: So it strikes me that it might actually be easier to say, "Well, behaviorally, we know we're going to get things that behave to optimize an objective function because that's the thing that is optimal for the objective function." So I was wondering: it seems like in your response, you put a lot of weight on "Networks will select simple ..." or "Optimization will select simple networks or simple algorithms". And I'm wondering, why do you have that emphasis instead of the emphasis on selecting for behavior?

Evan Hubinger: Yeah, so that's a really good question. So I think the answer to this comes from ... so first of all, let me just establish, I totally agree. And I think that it's in fact a problem that the way in which we currently do machine learning is so focused on behavioral incentives. One of the things that in my research, I actually have spent a lot of time looking into and trying to figure out is how we can properly do mechanistic incentives. How could we create ways to train neural networks that are focused on making them more transparent or making them able to satisfy some internal properties we might want.

Evan Hubinger: That being said, the reason that I think it's very difficult to analyze this question behaviorally is because fundamentally, we're looking at a problem of unidentifiability. If you only look at behavior, you can't distinguish between the model which is internally doing some optimization process, it has some objective and produces the particular results, versus a model which doesn't have any internal objective but it still produces the same results on the training process which has a bunch of heuristics.

Evan Hubinger: But those two models have very different generalization behavior. And it's the generalization behavior that we care about. We're looking for situations in which ... and we talked about this a little bit, there is this post by Vlad that talks about this 2D robustness picture where the thing that we're fundamentally concerned about is a situation where you have a model and the model is very robust in the sense of competitiveness, it's very capability robust. It has the ability to pursue whatever objective that it's pursuing very effectively off distribution and on distribution, but it's not very objective robust.

Evan Hubinger: So it learned some proxy objective, maybe, that works pretty well on a training distribution, but it doesn't work very well off distribution. And the problem is that if you're just looking at the behavior on the training distribution, you can't really distinguish between a model which will be objective robust, which has learned a proxy which will work well off distribution, and one which has learned a proxy which won't work well off distribution.

Evan Hubinger: Because if you're just looking at the training distribution and you're just looking at its behavior on the training distribution, you can't tell because those might just look identical on the training distribution. As a really concrete example, this maze example that I often like to give, so you have a situation where I'm training a model on ... the training distribution has a bunch of mazes. And I want it to get to the end of the maze, and I put a little green arrow at the end of the maze and I incentivize the model for getting to the end, and I train it on a bunch of these relatively small mazes.

Evan Hubinger: And then I deploy the model in a new situation where I have larger mazes and I put the green arrow at some random place in the middle. And now there's a bunch of things that can happen. So it could be the case that in this new deployment environment, the model is capability robust and it really knows how to solve mazes. But it can also not be capability robust and it just can't solve the bigger mazes. But if it is capability robust, then there's two options. It's either objective robust and it learned the correct proxy which was getting to the end, and so that makes it navigate to the end of the maze.

Evan Hubinger: Or it could be objective non-robust. It learned an incorrect proxy which was, "Get to the green arrow." So it navigates to the green arrow. And if you're only looking at the behavior on the training distribution, you can't distinguish between the model which is trying to get to the green arrow and the model which is trying to get to the end of the maze, because in the training distribution, those are not identifiable. You can't distinguish between them by looking at behavior.

Evan Hubinger: And so we want to ask the question, "How likely is it that we will end up with models which generalize in one way versus models which generalize in another way?" We have to look at the other thing than just the behavioral incentives, which is causing models to fall into one category or the other, and that means the inductive biases. The reason that your training process would produce a model which is going to generalize in one way or generalize in another way despite having the same training performance is all going to come down to, "What is incentivized by the inductive biases of that training process?"

Evan Hubinger: And to understand what's incentivized by the inductive biases of the training process, we have to look at the model's internals, because the inductive biases aren't just about the ... what's happening here at this point, the inductive biases, all of the behavior in training distribution is the same. So the only difference is, "What are the inductive biases incentivizing in terms of the internals of the model?" Are the inductive bias going to incentivize the sorts of models that have particular properties or models that have other properties?

Evan Hubinger: And fundamentally, that's where you have to look if you want to start answering the question of what you're going to get in a standard machine learning setting because that's the thing which is going to determine what option you end up with when you're in this unidentifiable position.

Daniel Filan: Yeah. So it sounds like one thing that you notice reading this paper is that it seems very focused on the case of distributional shift, right? You're interested in generalization, but not in the generalization bound sense of from the training distribution to the training distribution but a sample you haven't literally seen, but some kind of different world. I'm like, why can't we just train on the distribution that we're going to deploy on? Why is this a problem?

Evan Hubinger: Yeah, so that's a really good question. So I think there's a couple of things around here. So, first one thing that's worth noting is that I think a lot of times I think people read the paper and will be like, "Well, what is going on with this deployment distribution?" And sometimes we just do online learning. Sometimes we just do situations where we're just going to keep training, we're just going to keep going. Why can't we understand that?

Evan Hubinger: I think basically my argument is that, those situations are not so dissimilar as you might think. So let's for example, consider online learning. I'm going to try to just keep training it on every data point as it comes in and keep pushing in different directions. Well, let's consider the situation where you are doing online learning and the model encounters some new data point. We still want to ask the question, "For this new data point, will it generalize properly or will it not?"

Evan Hubinger: And we can answer the question, "Will it generalize properly or will it not on this new data point?" by looking at all of the previous training that we've done, was a training environment of sorts and this new data point is a deployment environment of sorts where something potentially has changed from the previous data points to this new data point. And we want to ask the question, "Will it generalize properly to this new data point?"

Evan Hubinger: Now, fundamentally, that generalization problem is not changed by the fact that after it produces some action on this data point, we'll go back and we'll do some additional training on that. There's still a question of how is it going to generalize to this new data point. And so you can think of online learning setups and situations where we're trying to just keep training on all the data as basically, all they're doing is they're adding additional feedback mechanisms. They're making it so that if the model does do something poorly, maybe we have the ability to have the training process intervene and correct that.

Evan Hubinger: But fundamentally, that's always the case. Even if you're doing a deployment where you're not doing online learning, if you notice that the model is doing something bad, there are still feedback mechanisms. You can try to shut it down. You can try to look at what is happening and retrain a new model. There's always feedback mechanisms that exist. And if we take a higher level step and we think about what is the real problem that we're trying to solve, the real problem we're trying to solve is situations in which machine learning models take actions that are so bad that they're unrecoverable. That our natural feedback mechanisms that we've established don't have the ability to account for and correct for the mistake that has been made.

Evan Hubinger: And so an example of a situation like this ... Paul Christiano talks a lot about this sort of unrecoverability. One example would be, Paul has this post, "What Failure Looks Like." And in the, "What Failure Looks Like" post, Paul goes into this situation where you could imagine, for example, a cascade of problems where one model goes off the rails because it has some proxy and it starts optimizing for the wrong thing, and this causes other models to go off the rails, and you get a cascade of problems that creates an unrecoverable situation where you end up with a society where we put models in charge of all these different things, and then as one goes off the rails, it pushes others off distribution.

Evan Hubinger: Because as soon as you get one model doing something problematic, that now puts everyone else into a situation where they have to deal with that problematic thing, and that could itself be off distribution from the things we just had to deal with previously, which increases the probability of those other models failing and you get that cascade. I'm not saying this is the only ... I think this is one possible failure mode, but if you can see that it's one possible failure mode that is demonstrating the way in which the real problem is situations in which our feedback mechanisms are insufficient.

Evan Hubinger: And you will always have situations where your feedback mechanisms could potentially be insufficient if it does something truly catastrophic. And so what we're trying to prevent is it doing something so catastrophic in the first place. And to do that, you have to pay attention to the generalization because every time it looks at a new data point, you're always relying on generalization behavior to do the correct thing on that new data point.

Daniel Filan: So when you say generalization behavior, you mean it's going to encounter a situation and ... I guess I'm not sure why just normal ML isn't enough. Generalization bounds exist. Why can't we just train on a distribution, deploy, you get different samples from that distribution, so it's a little bit different. But generalization bounds say that it's just not going to be so bad. Why isn't that the end of the story?

Evan Hubinger: Yeah. So I think that a lot of these generalization bounds rely on a particular assumption which I think is quite questionable, which is the i.i.d. [independent and identically distributed] assumption. The assumption that in fact the deployment distribution and the training distribution, that we can actually take i.i.d. samples that we can create, I don't know, particular bounds between the divergence between the distributions or things like this.

Evan Hubinger: In practice, all of these sorts of assumptions basically just don't hold. And there's a lot of obvious ways in which they don't hold. There's always things that are changing about our environment. There's always various different things. You can imagine, just the fact that the date advances, the fact that various things are happening in the world that you might not be able to anticipate. And obviously there are ways to try to address these problems in the standard practice of machine learning. So a very standard thing to do is adversarial training.

Evan Hubinger: So you might have a situation where you're like, "Well, we expect that the distribution is going to vary in a bunch of different ways. We can try to sample some of the ways in which we expect the distribution might vary and train on those things." But there can still be a lot of problems that happen even in that situation. So one of the things that we talk about in the paper that I think becomes particularly problematic when you start getting into these situations where you've really done a good job at doing adversarial training and tried to train away a lot of the naive simple ways in which you might end up with a "green arrow" optimizer versus an "end of the maze optimizer", is something that we call deceptive alignment.

Evan Hubinger: And deceptive alignment is a particularly pernicious case of a mesa-optimizer which has a mesa-objective that is different from the ideal loss function. And the particular case of deceptive alignment is a situation in which we have a model, and that model, let's say, has some proxy objective. It's trying to get to the green arrow or whatever, but it realizes that it's inside of a training process, and that that training process is going to select it for getting to the end of the maze rather than getting to the green arrow.

Evan Hubinger: And so it realizes that the correct instrumental strategy to be able to get the green arrow in the future is to currently try to get to the end of the maze so that it will stay around in the training process and won't be modified to go to the green arrow instead. And then eventually at some point maybe it gets the ability to take some action, maybe you just deploy it. And now when it's in deployment, it can go to the green arrow a bunch of times and you won't correct for it. Or it otherwise gets into some situation where the correction mechanisms, the feedback mechanisms that you have are no longer sufficient to prevent it from being able to do the true proxy objective that it learned.

Evan Hubinger: And this is particularly concerning because if you're in a deceptive alignment situation, the adversarial training set up no longer really works, because if you think about a deceptive aligned system, it's going to be actively looking for any possible differences between the training distribution and the deployment distribution to let it know that now it's in a situation where it can do what it wants and it doesn't have to be as afraid of corrective feedback. And so there's a lot of examples of the sorts of things that a deceptive model might be looking for.

Evan Hubinger: Paul Christiano has this one example that I think is fun, and it's not necessarily realistic, but it just goes to show the way in which this can happen, which is this RSA-2048 example. So the RSA-2048 example goes like this. So RSA-2048 is this really large pseudoprime. There is this challenge to try to factor RSA-2048.

Daniel Filan: Sorry, what's a pseudoprime?

Evan Hubinger: It's like a product of two large primes.

Daniel Filan: Okay.

Evan Hubinger: I think, yeah. I might be using the terminology wrong. I'm pretty sure, like whatever. It's the product of two very large primes, but finding that factorization, finding the two very large primes that make up this very large number is very challenging and computationally difficult. And currently, it's beyond the computational abilities of modern computers and there are, I don't know ... at least there's a challenge to try to factor it and nobody has yet been able to do so. But it's the sort thing that probably we'll have enough compute that somebody will eventually get around to being able to factor it in the future.

Evan Hubinger: And so you can imagine the situation where you have a deceptive model and it's like, "Well, I don't know when I'm going to be deployed. But at some point probably I'll be in a deployment situation where I have lots of power and I can just do whatever I want. And so I'll choose to look for a factorization of RSA-2048. And if I see such a factorization, then I can be pretty sure that I'm no longer in training because the people training me don't have the compute available to factor RSA-2048. But probably at some point somebody is going to factor it. And so I can be pretty sure that if I see a factorization of RSA-2048, that I am longer in training and I can be pretty sure that eventually I will see such a factorization."

Evan Hubinger: And so pointing to this general problem of when you try to do adversarial training, you're putting yourself in this situation where you have to be able to ... it's sort of this P versus NP thing where you're putting yourself in a situation where you have to be able to actually generate all of the possible counter examples, whereas the model is in a much easier position. It only has to be able to do the check. It only has to be able to check whether it is in such a situation. And there's just a lot of things where it's easier to check than it is to generate. The factorization is an example of that. It's a lot easier to check that you have a factorization than it is to generate such a factorization.

Evan Hubinger: And this is a general thing like I was saying, that comes from the fact that there are different complexity classes that vary based on ... we can think about NP as being the complexity class of things which are relatively easily checkable and P as being the complexity class of things which are relatively easy to actually generate, and we think that these are distinct and so there's likely lots of problems like this.

Evan Hubinger: Even if we're not thinking about things as rigorous mathematically as RSA-2048, there's just a lot of things about the world that are easier to check than they are to generate. And so that means that if you have a model which is actively looking for differences, your procedure of, "Let's just try to generate a bunch of differences." is no longer likely to work.

Daniel Filan: Yeah. I think this is ... so there's a lot in that answer that I can get back to. One thing that it does remind me of is often people do talk about adversarial training and it always strikes me that, it seems like if you want to really make that work, you need much better interpretability tools than we have at the moment, precisely to check this kind of thing, like what is the model checking? To really find the worst case scenario, you need to know a lot about the model, right? So I guess I'm wondering, do you think that's right or do you have any thoughts about that?

Evan Hubinger: I think that's exactly right. So I have a post that talks about relaxed adversarial training which is terminology that comes from Paul Christiano. And basically the idea there is, well, how can we improve this adversarial training setup? Well, instead of doing adversarial training on the model's inputs, we can try to do it on the model's internals. We can try to generate descriptions of situations in which the model might fail, try to generate possible activations for which the model might fail. And these problems might be significantly easier than actually trying to generate an input on which the model fails.

Daniel Filan: Cool. So there's a lot to get back to. I want to get back to basics a little bit, just about mesa-optimizers. So one of the examples you use in the post is, you can think at least for the sake of example, you can think of evolution as some optimizer and evolution has produced humans and humans are themselves optimizers for such and such thing. And I wanted to get a little bit more detail about that. So I'm a human and a lot of the time I'm almost on autopilot. For instance, I'll be acting on some instinct, right? Like a loud noise happens and I jump. Sometimes I'll be doing what I've previously decided to do, or maybe what somebody has told me to do.

Daniel Filan: So for instance, we just arranged this interview and I know I'm supposed to be asking you questions and because we already penciled that into the calendar, that's what I'm doing. And often what I'm doing is very local means-ends reasoning. So I'll be like, "Oh, I would like some food. I'm hungry, so I would like some food now. So I'm going to cook some food." That kind of thing.

Daniel Filan: Which is different from what you might imagine of like, "Oh, I'm going to think of all the possible sequences of actions I could take and I'm going to search for which one maximizes how rich I end up in the future" or some altruistic objective maybe, if you want to think nicely of me, or something. So am I a mesa-optimizer? And if I am, what's my mesa-objective?

Evan Hubinger: Yeah, so that's a really good question. So I'm happy to answer this. I will just warn, I think that my personal opinion is that it's easy to go astray I think, thinking too much about the evolution and humans example and what's happening with humans and forget that humans are not machine learning models, and there are real differences there. But I will try to answer the question. So if you think about it from a computational perspective, it's just not very efficient to try to spend a bunch of compute for every single action that you're going to choose to try to compare it to all of these different possible opportunities.

Evan Hubinger: But that doesn't mean you're not doing search. It doesn't mean you're not doing planning, right? So you can think about, one way to model what you're doing might be, well, maybe you have a bunch of heuristics, but you change those heuristics over time, and the way in which you change those heuristics is by doing careful search and planning or something.

Evan Hubinger: I don't know exactly what the correct way is to describe the human brain and what algorithm it internally runs and how it operates. But I think it's a good guess to say and pretty reasonable to say that it is clearly doing search and it is clearly doing planning. It is clearly at least sometimes, and like you were saying, it would be computationally intractable to try to do some massive search every time you think about you are going to move a muscle or whatever. But that doesn't mean that you can't, for particular actions, for particular strategies, develop long-term strategies, think about possibilities for ... consider many different strategies and select them according to whether they satisfy whatever you are trying to achieve.

Evan Hubinger: And that obviously also leads to the second question which is, well what is mesa-objective? What is it that you might be trying to achieve? This is obviously very difficult to answer. People have been trying to answer the question, "What are human values?" for forever essentially. I don't think I have the answer and I don't expect us to get the answer to this sort of thing. But I nevertheless think that there is a meaningful sense in which you can say that humans clearly have goals, and not just in a behavioral sense, in a mechanistic sense.

Evan Hubinger: I think there's a sense in which you can be like, "Look, humans select strategies. They look at different possible things which they could do, and they choose those strategies which perform well, but which have good properties according to the sorts of things they are looking for. They try to choose strategies which put them in a good position and try to do strategies which make the world a better place." They try to do strategies which ... people have lots of things that they use to select the different sorts of strategies that they pursue, the different sorts of heuristics they use, the different sorts of plans and policies that they adapt.

Evan Hubinger: And so I think it is meaningful to say and correct to say that humans are mesa-optimizers of a form. Are they a central example? I don't know. I think that they are relatively non-central. I think that they're not the example I would want to have first and foremost of mesa-optimization. But they are a meaningful example because they are the only example that we have of a training process of evolution being able to produce a highly intelligent, capable system. So they're useful to think about certainly, though we should also expect there to be differences.

Daniel Filan: Yeah, that makes sense.

Evan Hubinger: But I think [crosstalk 00:29:28].

Daniel Filan: Yeah, that makes sense. Yeah, I guess I'm just asking to get a sense of what a mesa-optimizer really is. And your answer is interesting because it suggests that if I do my start of year plan or whatever, and do search through some strategies and then execute, even when I'm in execution mode, I'm still an optimizer. I'm still a mesa-optimizer. Which is interesting because it suggests that if one of my plans is like, "Okay, I'm going to play go." And then I open up a game, and then I start doing planning about what moves are going to win, am I a mesa-mesa-optimizer when I'm doing that?

Evan Hubinger: So my guess would be not really. You're still probably using exactly the same optimization machinery to do go, as you are to plan about whether you're going to go do go. And so in some sense, it's not that you did ... So for a mesa-mesa-optimizer, it would have to be a situation in which you internally did an additional search over possible optimization strategies and possible models to do optimization and then selected one.

Daniel Filan: OK.

Evan Hubinger: If you're reusing the exact same optimization machinery, you're just one optimizer and you are optimizing for different things at different points.

Daniel Filan: Okay. That gives me a bit of a better sense of what you mean by the concept. So I have a few more questions about this. So one thing you do, I believe in the introduction of the paper is you distinguish between mesa-objectives and behavioral objectives. I think the distinction is supposed to be that my mesa-objective is the objective that's internally represented in my brain, and my behavioral objective is the objective you could recover by asking yourself, "Okay, looking at my behavior, what am I optimizing for?" In what situations would these things be different?

Evan Hubinger: Yeah, so that's a great question. So I think that in most situations where you have a meaningful mesa-objective, we would hope that your definitions are such that the behavioral objective would basically be that mesa-objective. But there certainly exist models that don't have a meaningful mesa-objective. And so the behavioral objective is therefore in some sense a broader term. Because you can look at a model which isn't internally performing search and you can still potentially ascribe to it some sort of objective that it looks like it is following.

Daniel Filan: What's an example of such a model?

Evan Hubinger: You could imagine something like, maybe you're looking at ... I think for example a lot of current reinforcement learning models might potentially fall into this category. If you think about, I don't know ... It's difficult to speculate on exactly what the internals of things are, but my guess would be, if we think about something like OpenAI Five for example. It certainly looks like this-

Daniel Filan: Can you remind listeners what OpenAI Five is?

Evan Hubinger: Sure. OpenAI Five was a project by OpenAI to solve Dota 2 where they built agents that are able to compete on a relatively professional level at Dota 2. And I think if you think about some of these agents, the OpenAI Five agents, I think that my guess would be that the OpenAI Five agents don't have very coherent internally representative objectives. They probably have a lot of heuristics.

Evan Hubinger: Potentially, there is some sort of search going on, but my guess would be not very much, such that you wouldn't really be able to ascribe a very coherent mesa-objective. But nevertheless, we could look at it and be like, "There's clearly a coherent behavioral objective in terms of trying to win the Dota game." And we can understand why that behavioral objective exists. It exists because, if we think about OpenAI Five as being a grab bag of heuristics and all of these various different things, those heuristics were selected by the base optimizer, which is the gradient descent process, to achieve that behavioral goal.

Evan Hubinger: And one of the things that we're trying to hammer home in Risks from Learned Optimization is that there's a meaningful distinction between that base objective, that loss function that the training process is optimizing for, and if your model actually itself learns an objective, what that objective will be.

Evan Hubinger: And it's easy because of the fact that in many situations, the behavioral objective looks so close to the loss function to conflate the ideas and assume that, "Well, we've trained OpenAI Five to win Dota games. It looks like it's trying to win Dota games. So it must actually be trying to win Dota games." But that last step is potentially fraught with errors. And so that's one of those points that we're trying to make in the paper.

Daniel Filan: Great. Yeah, perhaps a more concrete example. Would a thermostat be an example of a system that has a behavioral objective but not necessarily a mesa-objective or an internally represented objective?

Evan Hubinger: Yeah. I think the thermostat is a little bit of an edge case because it's like, "Well, does the thermostat even really have a behavioral objective?" I think it kind of does. I mean, it's trying to stabilize temperature at some level, but it's so simple that I don't know. What is your definition of behavioral objective? I don't know. Probably yeah, thermostat is a good example.

Daniel Filan: Cool. So I wanted to ask another question about this interplay between this idea of planning and internally represented objective. So this is, I guess this is a little bit of a troll question, so I'm sorry. But suppose I have my program, I have this variable called my_utility, and I'm writing a program to navigate a maze quickly. That's what I'm doing. In the program, there's a variable called my_utility and it's assigned to the longest path function. And I also have a variable called my_maximizer that I'm going to try to maximize utility with, but I'm actually a really bad programmer. So what I assigned to my maximizer was argmin function. So the function that takes another function and minimizes that function. So that's my maximizer. It's not a great maximizer.

Daniel Filan: And then the act function is my_maximizer applied to my_utility. So here, the way this program is structured is if you read the variable names, is there's this objective, which is to get the longest path and the way you maximize that objective is really bad. And actually, you use the minimization function. So this is a mesa-optimizer, because I did some optimization to come up with this program. According to you, what's the mesa-objective of this?

Evan Hubinger: Yeah. So I think that this goes to show something really important, which is that if you're trying to look at a trained model and understand what that model's objective is, you have to have transparency tools or some means of being able to look inside the model that gives you a real understanding of what it's doing, so that you can ascribe a meaningful objective. You can imagine a situation - So, I think clearly the answer is, well, the mesa-objective should be the negative of the thing which you actually specified, because you had some variable name which said maximize, but in fact you're minimizing. And so, the real objective, if we were trying to extract the objective, which was meaningful to us, would be the negative of that.

Daniel Filan: Just so the listeners don't have to keep that all in their heads. The thing this does, is find us the shortest path, even though my_utility is assigned to longest path, but you were saying.

Evan Hubinger: Right. And so, I think what that goes to show though, is it goes to show that if you had a level of interpretability tools, for example, that enabled you to see the variable names, but not the implementation, you might be significantly lead astray in terms of what you expected the mesa-objective be of this model. And so it does, it goes to show that there's difficulty in really properly extracting what the model is doing, because the model isn't necessarily built to be easy to extract that information. It might have weird red herrings or things that are implemented in a strange way that we might not expect. And so, if you're in the business of trying to figure out what a model's mesa-objective is by looking at it, you have to run into these sorts of problems where things might not be implemented in the way you expect. You might have things that you expect to be a maximizer, but it's a minimizer or something. And so, you have to be able to deal with these problems, if you're trying to look at models and figure out what mesa-objectives are.

Daniel Filan: So I guess my, the question that I really want to get at here is this is obviously a toy example, but typically optimizers, at least optimizers that I write, they're not necessarily perfect. They don't find the thing that - I'm trying to maximize some function when I'm writing it. But the thing doesn't actually maximize the function. So I guess, part of what I'm getting at is to what extent, if there's going to be any difference between your mesa-objective and your behavioral objective, to what extent, it seems like either you've got to pick the variable name that has the word utility in it, or you've got to say, well, the thing that actually got optimized was the objective, or you might hope that there's a third way, but I don't know what it is. So do you think there's a third way? And if so, what is it?

Evan Hubinger: Yeah. So, I guess the way in which I would describe this is that fundamentally, we're just trying to get good mechanistic descriptions. And when we say 'mesa-optimizer', I mean 'mesa-optimizer' to refer to the sort of model which has something we would look at as a relatively coherent objective like an optimization process. Obviously not all models have this. So there's lots of models, which don't really look like they have a coherent optimization process and objective. But we can still try to get a nice mechanistic description of what that model would be doing. When we talk about mesa-optimizers, we're referring to models, which have this nice mechanistic description which look like they have an objective and they have a search process, which is trying to select for that objective. I think that the way in which we are getting around these sorts of unidentifiability problems, I guess, would be well, we don't claim that this works for all models, lots of models just don't fall into the category of mesa-optimizers.

Evan Hubinger: And then also, we're not trying to come to a conclusion about what a model's objective might be only by looking at its behavior, we're just asking if the model is structured such that it has a relatively coherent optimization process and objective, then what is that objective? And it's worth noting that this obviously doesn't cover the full scope of possible models. And it doesn't necessarily cover the full scope of possible models that we are concerned about. But I do think it covers a large variety of models and we make a lot of arguments in the paper for why you should expect this sort of optimization to occur, because it generalizes well, it performs well on a lot of simplicity measures. It has the ability to compress really complex policies into relatively simple search procedures. And so, there's reasons that you might expect that we go into in the paper, this sort of model to be the sort of thing that you might find. But it's certainly not the case that it's always going to be the sort of thing you find.

Daniel Filan: Okay. That makes sense. So I guess, the final question I want to ask to get a sense of what's going on with mesa-optimization, how often are neural networks that we train mesa-optimizers?

Evan Hubinger: Yeah. So that's a really good question. I have no idea. I mean, I think this is the thing that is an open research problem. My guess, would be that most of the systems that we train currently don't have this sort of internal structure. There might be some systems which do, we talk a little bit in the Risks from Learned Optimization paper about some of the sorts of related work that might have these sorts of properties. So one of the examples that we talk about is the RL squared paper from OpenAI where in that paper, they demonstrate that you could create a reinforcement learning system, which learns how to do reinforcement learning by passing it through many different reinforcement learning environments and asking it to, along the way, figure out what's going on each time, rather than explicitly training on that specific reinforcement learning environment.

Evan Hubinger: This sort of thing seems closest to the sort of thing that we're pointing at in terms of existing research, but of course the problem is that our definition is one which is mechanistic and requires an understanding of "is it actually doing a search", which is not really something we know, because we don't have the ability to really look inside of our models and know whether they're doing search or not. So is it happening with current models? I don't know. I think that we certainly see some of the sorts of failures in terms of, sometimes we see models which try to pursue proxies to the objectives which they're trained under, but those proxies aren't necessarily the sorts of ones which are very objective robust. They might not be a really good coherent search process, which generalizes well, and is able to pursue that proxy intelligently and coherently, even off-distribution, which fundamentally is what we care about.

Evan Hubinger: Because if you think about the 2D robustness picture, if you have a situation where a model is very capability robust, it's really good at doing optimization, it's able to take really coherent capable actions, but it's doing so for a certain non-robust objective, an objective that is not the one which we actually wanted to achieve. You end up in a bad situation, because you have a very capable model, which is doing the wrong thing. We certainly see some amount of objective non-robustness, but in some sense, we also see that along with capability non-robustness. A lot of cases, when you look at models off-distribution, they just are bad, they don't do really coherent things. And so, we're in a situation where these notions haven't yet come apart completely, because we haven't yet gotten good enough at capability robustness. Good enough at producing models that are capable of doing really coherent optimization on- and off-distribution to be able to get to a situation where they have potentially an objective, which might be meaningful.

Daniel Filan: Okay, cool. So, I now want to move on a little bit to how we might produce mesa-optimizers. So, one reason you give for why or when mesa-optimization might happen is when there's a lot of computation available at inference time or runtime relative to how much computation is devoted to training the system. And I guess, this is because the optimization has to happen at some point. And if it doesn't, if it can happen at inference time, when the model's running, then you don't do it so much when you're training. So I guess, I'm wondering, do you expect this situation of a lot of computation being used at inference time relative to during training and if so, why? And also how big a deal is this do you think?

Evan Hubinger: Yeah, so I guess, I'm not sure I 100% agree with that description. So I think that the thing which I would just say, is that the more compute your model has available to it at inference time, the more likely it is to be a mesa-optimizer. I think the additional question of how the compute available at inference time versus the compute available at training time compare to each other, the relative differential, I think is more complicated. So for example, I think that in fact, the more compute you have available at training time, the better you are at selecting for models which perform very well according to your inductive biases. So this is a little bit complicated. So one story for why I believe this. So when you think about something like double descent. So double descent is a phenomenon, which has been demonstrated to exist in machine learning systems where what happens is you start training your machine learning model and initially, you get this standard learning theory curve where the training error goes down and the test error goes down and then goes back up as you start overfitting.

Evan Hubinger: And then, the insight of double descent is, well, if you keep training past zero training error, your test error actually starts to go back down again and can potentially get lower than it ever was. And the reason for this is that if you keep training, if you put enough compute into just training past the point of convergence, you end up in a situation where you're selecting for the model which performs best according to the inductive biases of your machine learning system, which fits the data. Whereas if you only train to the point where it gets zero training error, then stop immediately, there's a sense in which there's only potentially maybe one model. It's just the first model that it was able to find, which fit to the training data. Whereas if you keep training, it's selecting from many different possible models, which might fit the training data and finding the simple one. And one of the arguments that we make for why you might expect mesa-optimization is that mesa-optimizers are relatively simple models.

Evan Hubinger: Search is a pretty simple procedure, which is not that complicated to implement, but has the ability to in fact, do tons of different complex behavior at inference time. And so, search can be thought of as a form of compression. It's a way of taking these really complex policies that depend on lots of different things happening in the environment and compressing them into this very simple search policy. And so, if you were in a situation where you're putting a ton of compute into training your model, then you might expect that according to something like this double descent story, that what's happening is the more compute you put, the more your training process is able to select the simplest model, which fits the data as opposed to just the first model that it finds that fits the data. And the simplest model that fits the data, we argue is more likely to be a mesa-optimizer.

Evan Hubinger: And so, in that sense, I would actually argue that the more training compute you do, the more likely you are to get mesa-optimizers and the more inference time to compute you have, the more likely you get mesa-optimizers. And the reason for the inference time compute is similar to the reason you were saying, which is that inference time compute enables you to do more search, to make search more powerful. Search does require compute. You have to do this- it takes a certain amount of compute to be able to do the search, but it has the benefit of this compression of "Yeah, you have to do a bunch of compute for performing the search. But as a result, you get the ability to have a much more complex policy, that's able to be really adapted to all these different situations." And so, in that sense, the more compute on both ends, I would expect to lead to more likelihood of mesa-optimization.

Evan Hubinger: It is worth noting, there's some caveats here. So in practice, we generally don't train that far past convergence. I think that there's also still a lot of evidence that especially if you're doing things where for example, you do very early stopping where you only do a very small number of gradient descent steps after initialization. There's a sense in which that also is very strong on inductive biases, because you're so close to initialization the inductive biases of, well, the things which are relatively easy and simple to initialize, which take up a large volume in the initialization space, those are the sorts of things that should be very likely to find. And so, you still end up with very strong inductive biases, but actually it's different inductive biases. There's not a lot of great research here, but I think it's still the case that when you're doing a lot of training compute on very large models, if you're doing really early stopping or training a bunch past convergence, you're ending up in situations where the inductive biases matter a lot.

Evan Hubinger: And you're selecting for relatively simple models, which I think leads to something for search and something for optimization and something for mesa-optimizers.

Daniel Filan: Okay. So that makes sense. So yeah, this gets me to another question. One of the arguments that you make in the paper is this idea that well, mesa-optimizers are just more simple, they're more compressible, they're easier to describe and therefore, they're going to be favored by inductive biases or by things that prefer simplicity, but this presumably depends on the language that you're using to describe things. So I'm wondering in what languages do you think it's the case that optimizers are simple and in particular I'm interested in the quote unquote languages of the L2 norm of your neural network and the quote unquote language again, of the size of say a multi-layer perceptron or a CNN, how many layers it has.

Evan Hubinger: Yeah. So that's a really good question. So I mean, the answer is, I don't know. So I don't know the extent to which these things are in fact easy to specify in concrete machine learning settings. How hard is it to actually, if you were to try to write an optimization procedure in neural network weights, how difficult that would be to do, and how much space do those models take up in the neural network prior? I don't know. My guess is that it's not that hard and it certainly depends on the architecture, but if we start just with a very simple architecture, we think about a very simple multi-level perceptron or whatever, can you code optimization? I think you can. So you can do things where you can, I think depth is pretty important here, because it gives you the ability to work on results over time, create many different possibilities and then refine those possibilities as you're going through. But I do think it's possible to implement bare bones, search style things in a multilayer perceptron setup.

Evan Hubinger: It certainly gets easier when you look at more complicated setups. If you think about something like an LSTM, recurrent neural network, you have the ability now to work on a plan over time in a similar way to a human mind, like you were describing earlier where you could make plans and execute them. And especially if you think about something like AlphaZero or even more especially something like MuZero, which explicitly has a world model that it tries to search over and do MCTS, Monte Carlo Tree Search over that to find some objective. And the objective that it's searching over in that situation is a value network that is learned, and so you're learning some objective and then you're explicitly doing search over that objective. Now, and we talk about this situation of hard-coded searching in the paper a lot.

Evan Hubinger: Things get a little bit tricky, because you might also have mesa-optimization going on inside of any of the individual models and the situations under which you might end up with that are well, you might end up in a situation where the sort of search you've given it to be hard-coded is not the sort of search which it really needs to do to solve the problem. So it's a little bit complicated, but certainly I think that current architectures are capable of implementing search and being able to do search. There's then the additional question of how simple it is. I think that - so unfortunately we don't have really great measures of simplicity for neural networks. We have very simple metrics like Kolmogorov complexity, and there is some evidence to indicate that Kolmogorov complexity is at least related to the inductive biases of neural networks. So there's some recent research from one of the co-authors, Joar Skalse who's one of the co-authors on Risks from Learned Optimization trying to analyze what exactly is the prior of neural networks at initialization.

Evan Hubinger: And there's some theoretical results there that indicate that Kolmogorov complexity actually can serve as a real bound there. But of course, Kolmogorov complexity is uncomputable. For people who are not aware, Kolmogorov complexity is a measure of a description length that is based on finding the simplest Turing machine, which is capable of reproducing that. And so we can look at these theoretical metrics, like Kolmogorov complexity and be like, well, it seems like search has relatively low Kolmogorov complexity, but of course the actual inductive biases of machine learning and gradient descent that are more complicated and we don't fully understand them. There is other research. So there's also some research that looks at trying to understand the inductive biases of gradient descent, it seems like gradient descent generally favors these very large basins and there's evidence these very large basins seems to match onto relatively simple functions.

Evan Hubinger: There's also stuff like the lottery tickets hypothesis, which hypothesizes that if you train a very, very large neural network, it's going to have a bunch of different sub-networks, some of those sub-networks at initialization, will happen to encode - already be relatively close to the goal that you're going to get to. And then the gradient descent process will be amplifying some sub-networks and de-amplifying other sub-networks. And if you buy into a hypothesis like this, then you're in a situation where, well, there's a question of how likely are sub-networks implementing something like optimization likely to be at initialization. Obviously, the larger your neural network is the more possible sub-networks at initialization there are that exist in a combinatorial explosion way, such that as we get to much larger and larger networks, the combinatorial explosion of sub-networks seems like it would eventually cover even things potentially quite complicated. But unfortunately, there's just not that much I can say, because our understanding of the inductive biases of neural networks just isn't great.

Evan Hubinger: We have some things which we can talk about, about simplicity measures and stuff, but unfortunately at the end of the day, I can be like, well, it certainly seems like search will be relatively simple and likely to learn. But I don't have really strong results for that because we just don't have a great understanding of exactly how these inductive biases operate.

Daniel Filan: It does sound like one thing you think though is that, you mentioned, if you want to do search in something like a multi-layered perceptron, you mentioned that it helps to just have a bunch of layers, to refine options and such. So it kind of sounds like you think that if, if we were selecting on network depth, that would be a kind of simplicity that would disfavor search rather than favoring it. Does that sound right to you?

Evan Hubinger: That does sound right. So the way in which I would describe that is in some ways looking at inference time compute, it's thinking about more like a speed prior. So if you think about a speed prior or something like... Yeah, so speed prior is trying to select on models, which have a small time from initialization to completion. So the actual amount of time it takes to run the function. If you think about depth versus width, if you have a very wide neural network, but it's not very deep, then it's actually very fast to run, because you just run through that initial step. It might still be computationally expensive, but because you can parallelize width-wise, the actual individual step is relatively fast in a way in which it would be favored by a speed prior.

Evan Hubinger: And so, one of the things that we do claim in the paper is that these sorts of speed prior-like incentives disincentivize mesa-optimization, because mesa-optimization does require - it requires you to be able to have these multi-level relatively long inference time computations. And if you restrict your model to not be able to do these relatively long inference time compensations, then it's probably going to be less likely to be able to do search, to be able to do mesa-optimization. So certainly in that sense, I think that there is a distinction there between a simplicity prior and a minimal circuit speed prior thing, where with the speed prior you get different incentives. And there's a distinction where if you really are selecting on depth, you're doing something more like a speed prior, which changes things.

Daniel Filan: What do you think the important difference here is between time and space? Because it sounds like you're saying, well, if you have a lot of time to compute, then you can be an optimizer, but if you just have a lot of space, then you can't be an optimizer. Why is there this asymmetry? Doesn't special relativity tell us that these things are the same?

Evan Hubinger: Well, okay. So here's a really simple analogy to think about this is, let's say I have a lookup table as my model versus an optimizer. The lookup table that is equivalent to the optimizer behaviorally is going to be really large, because it has to explicitly encode all of these different things that's going to do, but it's also going to be really fast, because you just look at the lookup table, you figure out what it's going to do, and then you do that action.

Daniel Filan: Hang on. Why is it fast? Doesn't it take exponentially long to figure out where... You go through the lookup table and there's a billion entries there to go really [crosstalk 00:56:49].

Evan Hubinger: Hash map has O(1) look up something, something, something.

Daniel Filan: I mean, not worst case, right? Eventually you get collisions.

Evan Hubinger: Well, amortized expected O(1).

Daniel Filan: All right.

Evan Hubinger: I think this is actually, in fact, the case. There is analysis of looking at minimal circuits and speed priors and stuff, trying to understand what would happen in these sorts of situations. And in fact, I think lookup tables are really good because they're in fact quite fast. If you have all of the information already encoded, yes, you need to have some, you need to be able to generate some key, be able to look it up. But, I don't know, there are mechanisms that are able to do that relatively quickly, a hash map being one of them. Even just looking it up in a binary search tree or something, it's still not going to be that bad potentially. And so, if you have a lookup table, it can be relatively fast, but it takes up a lot of space. Whereas if you have an optimization system, I mean, it has to be relatively slow, because it has to actually come up with the possibility of all of the different possible options and then select one of them.

Evan Hubinger: And that process is going to take some - it takes compute, it takes time. It has to actually look through different options and select them. And you can have heuristics to try to narrow down the field of possible things, which you're going to look for, but you still need to look, you're still probably going to need to look at a bunch of different things to be able to select the correct one. And so, we can think about in this situation, the optimizer, it has a larger inference time compute, but has a smaller description, because you don't have to explicitly write down in the description of this optimizer, all of the different actions that it's going to take for all of these different things.

Evan Hubinger: It just needs to have a general procedure of, well, let's look at an objective, let's do some search to try to select something. Whereas when you think about the lookup table, it has to explicitly write all of these things down, then an inference time, it gets benefits. And this sort of thing, this sort of trade-off between space and time occurs a lot in computational complexity setups. A lot of times you think about, you'll take a function and then you'll memoize it, which is this procedure of well, I have this function, I call it a bunch, with different arguments. I can make it faster, if I'm willing to spend a bunch of space storing all of the different possible results that I generally call it with.

Evan Hubinger: And as a result, you get a faster function, but you have to spend a bunch of space storing all these different things. And so you might expect a sort of similar thing to happen with neural networks where if you're able to spend a bunch of space, you can store all of these different things. And it doesn't have to spend all this time computing the function. Whereas if you force it to compute the function every time, then it isn't able to store all these different things and it has to re-compute them.

Daniel Filan: Okay. So yeah, I mean, one thing this makes me think of is the paper is talking about these risks from these learned optimizers. And optimizers are these things that are doing search, they're actually doing search at inference time and they're not these things that are behaviorally equivalent to things that do search at inference time. And I guess, we already got to a bit about why you might be worried about these mesa-optimizers and we're going to avoid getting them. And I'm wondering, is 'mesa-optimizer' the right category. If it's the case that "oh, we could instead search for things that are like... Our base optimizer, we could instead try to get these systems that they're really quick at inference and then they're definitely not going to be mesa-optimizers, because they can't optimize." But of course they definitely could be these tabular forms of mesa-optimizers, which is presumably just as big a problem. So I guess this is making me wonder, is 'mesa-optimizer' the right category?

Evan Hubinger: Yes, that's a good question. So I think that if you think about the tabular mesa-optimizer, how likely would you be to actually find this tabular mesa-optimizer? I think the answer is very unlikely, because the tabular mesa-optimizer has to encode all of these really complex results of optimization explicitly and in a way, which especially when you get to really complex, really diverse environments, just takes up so much space and explicit memorization, but not just memorization. You have to be able to memorize what your out-of-distribution behavior would be according to performing some complex optimization and then have some way of explicitly encoding that. And that's, I think, quite difficult.

Daniel Filan: But presumably if we're doing this selection on short computation time, well we do want something that actually succeeds at this really difficult task. So how is that any easier than - it seems like if you think there are a bunch of these potential mesa-optimizers out there and they're behaviorally equivalent, surely the lookup table for really good behavior is behaviorally equivalent to the lookup table for the things that are these dangerous mesa-optimizers.

Evan Hubinger: Sure. So the key is inductive biases and generalization. So if you think about it, look, there's a reason that we don't train lookup tables and we instead train neural networks. And the reason is that, if we just train lookup tables, if you're doing tabular Q-learning, for example, tabular Q-learning doesn't have as good inductive biases. It is not able to learn the nice simple and expressive functions that we're trying to learn to actually understand what's happening. And it's also really hard to train. If you think about tabular Q-learning in a really complex setting, with very diverse environments. You're just never going to be able to even get enough data to understand what the results are going to be for all of these different possible tabular steps.

Evan Hubinger: And so there's a reason that we don't do tabular Q-learning, there's a reason that we use deep learning instead. And the reason is because deep learning has these nice inductive biases. It has the ability to train on some particular set of data and have good generalization performance, because you selected a simple function that fit that data. Not just any function that fit that data, but a simple function that fit that data such that it has a good performance when you try it on other situations. Whereas if you are just doing training on other situations like tabular stuff or something, you don't have that nice generalization property, because you aren't learning these simple functions to fit the data. You're just learning some function which fits the data.

Evan Hubinger: And that's what would be the direction I expect machine learning to continue to go, is to develop these nice inductive biases setups, these sorts of situations in which we have these model spaces where we have inductive biases that select for simple models in large expressive model spaces. And if you expect something like that, then you should expect a situation where we're not training these tabular models, we're training these simple functions, but the simple functions that nevertheless generalize well, because they are simple functions that fit the data. And this is also worth noting - why simplicity? Well, I think that there's a lot of research that understands - it looks like if you actually look at neural network inductive biases, it does seem like simplicity is the right measure.

Evan Hubinger: A lot of the functions, which are very likely to exist at initialization as well as the functions that SGD is likely to find have the simplicity properties. In a way which is actually very closely tied to computational description length, like Kolmogorov complexity. There's, I don't know, the research I was mentioning recently from Joar and others was talking about the Levin bound, which is this particular mathematical result which directly ties to Kolmogorov complexity. So there's arguments that - and there's also a further argument, which just, well, if you think about why do things generalize at all, why would a function generalize in the world? The answer is something like Occam's razor, it's that, well, the reason that things generalize in the world that we live in is because we live in a world where the actual true descriptions of reality are relatively simple.

Evan Hubinger: And so, if you look for the simplest description that fits the data, you're more likely to get something that generalizes well, because Occam's razor is telling you that's the thing which is actually more likely to be true. And so there's these really fundamental arguments for why we should expect simplicity to be the sort of measure which is likely to be used. And that we're likely to continue to be training in these expressive model spaces and then selecting on simplicity rather than doing something like, I don't know, tabular Q-learning.

Daniel Filan: Okay. So I'm just going to try to summarize you, check that I understand. So it sounds like what you're saying is, look, sure, we could think about like training these - selecting models to be really fast at inference time. And in that world, maybe you get tabularized mesa-optimizers or something. But look, the actual world we're going to be in is, we're going to select on description length simplicity, and we're not going to select very hard on computation runtime. So in that world, we're just going to get mesa-optimizers and you just want to be thinking about that world, is that right?

Evan Hubinger: I mean, I think that's basically right. I think I basically agree with that.

Daniel Filan: Okay. Cool. All right. So I think I understand now. So I'm going to switch topics a little bit. So what types - In terms of "are we going to get these mesa-optimizers", do you have a sense of what types of learning tasks done in AI do you think are more or less prone to producing mesa-optimizers? So if we just avoided RL, would that be fine if we only did image classification? Would that be fine?

Evan Hubinger: Yeah, that's a great question. So we do talk about this a little bit in the paper. I think that the RL question is a little bit tricky. So we definitely mostly focus on RL. I do think that you can get mesa-optimization in more supervised learning settings. It's definitely a little bit trickier to understand. And so, I think it's a lot easier to comprehend a mesa-optimizer in an RL setting. I think it is more likely to get mesa-optimizers in RL settings. Let's see. Maybe I'll start with the other question though, which is just about the environment. I think that that's a simpler question to understand. I think that basically, if you have really complex really general environments that require general problem solving capabilities, then you'll be incentivizing mesa-optimization, because fundamentally, if you think about mesa-optimization as a way of compressing really complex policies, if you're in a situation where you have to have a really complex policy that carefully depends on the particular environment that you end up in, where you have to look at your environment and figure out what you have to do in a specific situation to be able to solve the problem. And reason about what sorts of things would likely to succeed in this environment versus that environment, you're requiring these really complex policies that have simple descriptions if they're compressed into search. And so you're directly incentivizing search in those situations. So I think the question is relatively easy to answer in the environment case. In the case of reinforcement learning versus supervised learning, it's a little bit more complicated. So certainly in reinforcement learning, you have this setup where, well, we're directly incentivizing the model to act in this environment and over multiple steps generally.

Evan Hubinger: And we're acting in the environment over multiple steps. There's a natural way in which optimization can be nice in that situation, where it has the ability potentially also to develop a plan and execute that plan, especially if it has some means of storing information over time. But I do think that the same sort of thing can happen in supervised learning. So even if you just think about a very simple supervised learning setup.

Evan Hubinger: Let's imagine a language model. The language model looks at some prompt and it has to predict what the future... what the text after that would be, by a human that wrote it or whatever. Imagine a situation in which a language model like this could be just doing a bunch of heuristics, has a bunch of understanding of the various different ways in which people write. And the various different sorts of concepts that people might have that would follow this, some rules for grammar and stuff. But I think that optimization is really helpful here. So if you think about a situation where like, well, you really want to look over many different possibilities. You're going to have lots of different situations where, "well, maybe this is the possible thing which will come next." It's not always the case.

Evan Hubinger: There are very few hard and fast rules in language, you'll look at a situation where there's... Sometimes after this sort of thing there'll be that sort of thing. But it might be that actually they're using some other synonym or using some other analysis of it. And the ability to well, maybe, we'll just come up with a bunch of different possibilities that might exist for the next completion and then evaluate those possibilities. I think it's still a strategy which is very helpful in this domain. And it still has the same property of compression. It still has the same property of simplicity. You're in a situation where there's a really complex policy to be able to describe exactly what the human completion of this thing would be.

Evan Hubinger: And a simple way to be able to describe that policy, is a policy which is well, searching for the correct completion. If you think about what humans are actually doing, when they're producing language, they are in fact doing a search for parts of it. You're searching for the correct way, the best possible word to fill into this situation. And so if you imagine a model which is trying to do something similar to the actual human process of completing language then you're getting a model which is just potentially doing some sort of search. And so I really do think that even in these sorts of supervised learning settings, you can absolutely get mesa-optimization. I think it's trickier to think about certainly and I think it's probably less likely. But I certainly don't think it's impossible or even unlikely.

Daniel Filan: Okay. So, as I guess you mentioned or at least alluded to, suppose I wanted to avoid mesa-optimizers from just being - I just don't want them. So one thing I could do, is I could avoid doing things that require modeling humans. So for instance, like natural language processing, or natural language prediction, where you're trying to predict what maybe me, Daniel Filan, would say after some prompt. And also avoid doing things that require modeling other optimizers. So really advanced programming or advanced program synthesis or analysis would be a case of this, where some programs are optimizers and you don't want your system modeling optimizers. So, you want to avoid that. So a question I have is, how much do you think you could do if you wanted to avoid mesa-optimizers and you were under these constraints of "don't do anything for which it would be useful to model other optimizers"?

Evan Hubinger: Yeah. So this is the question that I have... There's some writing about that I've done in the past. So I think there are some concrete proposals of how to do AI alignment that I think, fall into this general category of trying not to do... Not produce mesa-optimizers. So I have this post, an overview of 11 proposals for building safe advanced AI which goes into a bunch of different possible ways which you might structure an AI system in a way to be safe. Some of those proposals, so one of them is STEM AI where the basic idea is, well, maybe you try to totally confine your AI to just thinking about mathematical problems, scientific problems. Maybe you want to build... If you think about something like AlphaFold, you want to just do really good protein folding, you want to do really good physics, you want to do really good chemistry, you want to do... I think we could accomplish a lot just by having very domain restricted models, that are just focusing on domains like chemistry or protein folding or whatever that are totally devoid of humans.

Evan Hubinger: There aren't optimizers in this domain, there aren't humans in this domain to model. You're just thinking about these very narrow domains. There's still a lot of progress to be made, and there's a lot of things that you could do just with models along these lines. And so maybe if you just use models like that, there's... You might be able to do powerful advanced technological feats like, things like nanotechnology or whole brain emulation, if you really push it potentially. So that's one route. Another route that I talked about in the 11 proposals posts is this idea of microscope AI. Microscope AI is an idea which is due to Chris Olah. Chris has this idea of, well, maybe another thing which we could do is just train a predictive model on a bunch of data. Oftentimes that data... And I think many of the ways which Chris produces this, that might involve humans.

Evan Hubinger: You might still have some human modeling. But the idea would be, well, maybe in a similar way to a language model being less likely to produce mesa-optimization than a reinforcement learning agent. We could just train in this supervised setting, a purely predictive model to try to understand a bunch of things. And then rather than trying to use this model to directly influence the world as an agent, we just look inside it with transparency tools, extract this information or knowledge that it has learned and then use that knowledge to guide human decision-making in some way. And so this is another approach which tries not to produce an agent, tries not to produce an optimizer which acts in the world, but rather just produce a system which learned some things and then we use that knowledge to do other things.

Evan Hubinger: And so there are approaches like this, that try to not build mesa-optimizers. I think that these are... So both the STEM AI style approach and the microscope AI style approach I think are relatively reasonable things to be looking into and to be working on. Whether these things will actually work, it's hard to know. And there's lots of problems. With microscope AI there are problems like maybe you actually do produce a mesa-optimizer like in the language modeling example, and there's other problems as well even with STEM AI also there's problems where, what can you actually do with this? If you're just doing all of these various... These physics problems or chemistry problems, if the main use case for your AI is being a AI CEO or helping you do more AI research or something, it's going to be hard to get a model to be able to do some of those things without sort of modeling humans, for example. And so there's certainly limits, but I do think that these are valuable proposals that are worth looking into, more.

Daniel Filan: Okay. Yeah. It also seems like if you really wanted to avoid modeling humans, these would be pretty significant breaks from the standard practice of AI as most people understand it, right? Or at least the standard story of "We're going to develop more and more powerful AIs and then they're just going to be able to do everything". It seems like you're going to really have to try. Is that fair to say?

Evan Hubinger: Yeah. I mean, I think that there's a question of exactly how you expect the field to AI to develop. So you could imagine a situation, and I talked about this a little bit also in the post - I have a post "Chris Olah's views on AGI safety", which goes into some of the ways in which Chris views the strategic landscape. One of the things that Chris thinks about is microscope AI and one of the ways in which he talks about, a way where we might actually end up with microscope AI being a real possibility, would be if you had a major realignment of the machine learning community towards understanding models and trying to look inside models and figure out what they were doing, as opposed to just trying to beat benchmarks and beat the state of the art with relatively black box systems.

Evan Hubinger: You can think of this as a realignment towards a more scientific view, as opposed to a more engineering view. Where currently a lot of the way in which people, I think a lot of researchers have approached machine learning is as a engineering problem. It's "well, we want to achieve this goal of getting across this benchmark or whatever and we're going to try and throw our tools at it to achieve that". But you can imagine a more scientific mindset, which is "well, the goal is not to achieve some particular benchmark, to build some particular system, but rather to understand these systems and what they're doing and how they work". And from a more... If you imagine a more scientifically focused machine learning community, as opposed to a more engineering focused machine learning community, I think you could imagine a realignment that could make microscope AI seem relatively standard and normal.

Daniel Filan: Okay. So I guess I have one more question about how we might imagine mesa-optimizers being produced. In the paper, you're doing some reasoning about what types of mesa-optimizers we might see. And it seems like one of the ways you're thinking about it is, there's going to be some parameters that are dedicated to the goals or the mesa-objective and there's other parameters dedicated to other things. And if you have something like first order gradient descent it's going to change the parameters and the goal thing, assuming all of the parameters in the belief or the planning or whatever parts are staying constant.

Daniel Filan: And one thing I'm wondering is, if I think about these classically understood parts of an optimizer, so... Or something doing search, right? So maybe it has perception, beliefs, goals, and planning. Do you think there's necessarily going to be different parameters of your machine learned system for these different parts? Or do you think that potentially there might be parts spread across - or parameters divided to different parts. In which case you couldn't reason quite as nicely about how gradient descent is going to interact with it.

Evan Hubinger: So this is the sort of stuff that gets the most speculative. So I think anything that I say here has to come with a caveat, which is, we just have no idea what we're talking about when we try to talk about these sorts of things, because we simply don't have a good enough understanding of exactly the sorts of ways in which neural networks decompose into pieces. There are ways in which people are trying to get an understanding of that. So you can think about Chris Olah's circuits approach and the Clarity team trying to understand [inaudible 01:18:01] decompose into circuits. Also, I know Daniel, you have your modularity research, where you're trying to understand the extent to which you can break neural networks down to coherent modules. I think that all of this stuff is great and gives us some insight into the extent to which these sorts of things are true.

Evan Hubinger: Now, that being said, I think that the actual extent of our understanding here is still pretty minimal. And so anything I say, I think is basically just speculation, but I can try to speculate nevertheless. So I think that my guess is that you will have relatively coherent things... To be able to do search, and I think you will get search, you have to do things like come up with options and then select them. And so you have to have at least some point where you're coming up - you have to have heuristics to guide your ability to come up with options. You have to somewhere generate those options and store those options and then somehow prune them and select them down. And so those sorts of things have to exist somewhere.

Evan Hubinger: Now, certainly the... We found... If you think about many of the sorts of analysis of neural networks that have been done, there exists a sort of poly-semanticity where you can have neurons, for example, that encode for many different possible concepts. And you can certainly have these sorts of things overlapping because the structure of a neural network is such that, it's not necessarily the case that things have to be totally isolated or one neuron does one particular thing or something. And so it's certainly the case that you might not have these really nice, you can point to the network and be like, "Ah, that's where it does the search. That's where it's objective is encoded."

Evan Hubinger: But hopefully at least we can have, "Okay, we know how the objective is encoded, it's encoded via this combination of neurons there that do this particular selection that encode, this thing here". I think that I'm optimistic, I don't know, I expect those things to at least be possible. I'm hopeful that we will be able to get to a point as a field in our understanding of neural networks, our understanding of how they work internally, that we'll actually be able to concretely point to those things and extract them out. I think this would be really good at our ability to be able to address inner alignment and mesa-optimization. But I do expect those things to exist and to be extractable. Even if they require understanding of poly-semanticity and the way in which concepts are spread across the network in various different ways.

Daniel Filan: So I'd like to change topics again a little bit and talk about inner alignment. So in the paper, correct me if I'm wrong, but I believe you described inner alignment as basically the task of making sure that the base objective is the same as the mesa-objective. And you give a taxonomy of ways that inner alignment can fail. So there's proxy alignment where the thing that you're... Or specifically that inner alignment can fail but you don't know that it's failed because if inner alignment fails and the system just does something totally different, that's like, normal machine learning will hopefully take care of that. But there are three things you mentioned, firstly, proxy alignment where there's some other thing, which is causally related to what the base optimizer wants and your system optimizes for that other thing.

Daniel Filan: There's approximate alignment, which is where your base optimizer has some objective function and your system learns an approximation of that objective function, but it's not totally accurate. And thirdly, sub-optimality alignment. So your system has a different goal, but it's not get so good at optimizing for that goal. And when you're kind of bad at optimizing for that goal, it looks like it's optimizing for the thing that the base optimizer wants. So proxy alignment, approximate alignment, and sub-optimality alignment. And pseudo-alignment is the general thing, where it looks like you're inner aligned, but you're not really. So one thing I wanted to know is how much of the space of pseudo-alignment cases do you think that list of three options covers?

Evan Hubinger: Yeah. So maybe it's worth... Also first, there's a lot of terminology being thrown around. So I'll just try to clarify some of these things.

Daniel Filan: Alright

Evan Hubinger: So the base objective, we mean basically, the loss function. So we're trying to understand, we have some machine learning system. We have some loss function, or call this the base objective or a reward function, whatever, we'll call this the base objective. We're training a model according to that base objective, but the actual mesa-objective that might learn it might be different. Because there might be many different mesa-objectives [that] perform very well on the training solution, but which have different properties off-distribution. Like going to the green arrow versus getting to the end of the maze. And we're trying to catalog the ways in which the model could be pseudo-aligned, which is, it looks like it's aligned on distribution in the same way that getting to the end of the maze and getting to the green arrow, both look like they're aligned on the training distribution, but when you move them off distribution, they no longer are aligned because one of them does one thing and one of them does the other thing.

Evan Hubinger: And so this is the general category of pseudo-alignment. And so, yes, we try to categorize them and we have some possible categories like proxy alignment, proxy alignment being the main one that I think about the most and the paper talks about the most. Approximate alignment and sub-optimality alignment also being other possible examples. In terms of the extent to which I think this covers the full space. I don't know. I think it's... A lot of this is tricky because it depends on exactly how you're defining these things, understanding what counts as a mesa-optimizer, what doesn't count as a mesa-optimizer. Because you could certainly imagine situations in which a thing does something bad off distribution, but not necessarily because it learned some incorrect objective or something.

Evan Hubinger: And sub-optimality alignment, even, is already touching on that because sub-optimality alignment is thinking about a situation in which the model has a defect in its optimization process that causes it to look aligned, such that if that defect were to be fixed it would no longer look aligned. And so you're already starting to touch on some of the situations which maybe the reason that it looks aligned is because of some weird interaction between the objective and the optimization process, similar to your example of the way in which it maximizes, it actually minimizes or something. I don't know. I think there are other possibilities out there than just those three. I think that we don't claim to have an exhaustive characterization.

Evan Hubinger: And one of the things also that I should note is, there's one post that I made later on after writing the paper that clarifies and expands upon this analysis a little bit. Which brings up the case of, for example, sub-optimality deceptive alignment, which would be a situation in which the model should act deceptively, but it hasn't realized that it should act deceptively yet because it's not very good at thinking about what it's supposed to be doing. But then later it realizes it should act deceptively and starts acting deceptively, which is an interesting case where it's sort of sub-optimality alignment, sort of deceptive alignment. And so that's for example, another-

Daniel Filan: So if it's sub-optimal and it's not acting deceptively, does that mean that you notice that it's optimizing for a thing that it shouldn't be optimizing for? And just because it does a different thing that it's supposed to do and-

Evan Hubinger: The sub-optimality alignment is a little bit tricky. The point we're trying to make is, it's sub-optimal at optimizing the thing it's supposed to be optimizing for, such that it looks like it's doing the right thing. Right? So for example-

Daniel Filan: And simultaneously sub-optimal at being deceptive in your sub-optimality deceptive alignment.

Evan Hubinger: Well, if it... Yeah, I mean, so it's sub-optimal in the sense that it hasn't realized that it should be deceptive and if it had realized it would be deceptive, then it might start doing bad things and looking because it tries to defect against you or something, at some point. It would be no longer actually aligned. Whereas, if it never figures out that it's supposed to be deceptive then maybe it just looks totally aligned basically for most of the time.

Daniel Filan: Okay.

Evan Hubinger: I don't know. So there certainly are other possible cases than just those, I think that it's not exhaustive. Like I mentioned, I don't remember exactly what the title, but I think it's "More variations on pseudo-alignment", is the one that I'm referencing, that talks about some of these other possibilities. And there are others. I don't think that's exhaustive, but it is a... I don't know. Some of the possible ways in which this can fail.

Daniel Filan: Do you think it covers as much as half of the space? Those main three?

Evan Hubinger: Yeah, I do. I think actually what I would say is I think proxy alignment alone covers half the space. Because I basically expect... I think proxy alignment really is the central example in a meaningful way. In that, I think it captures most of the space and the others are looking mostly at edge cases.

Daniel Filan: Okay. So a question I have about proxy alignment and really about all of these inner alignment failures. It seems like your story for why they happen is essentially some kind of unidentifiability, right? You know, you're proxy misaligned because you have a vacuum, the base objective is to have that vacuum clean your floors and somehow the mesa-objective is instead to get a lot of dust inside the vacuum and that you normally get from floors. But you could imagine other ways of getting it. And the reason you get this alignment failure is that, well on distribution, whenever you had dust inside you, it was because you cleaned up the floor.

Daniel Filan: So what I'm wondering is, as we train better AIs that have more capacity, they're smarter in some sense, on more rich datasets - so whenever there are imperfect probabilistic causal relationships, we find these edge cases and we can select on them. And these data sets have more features and they just have more information in them. Should that make us less worried about inner alignment failures? Because we're reducing the degree to which there's these unidentifiability problems.

Evan Hubinger: So this is a great question. So I think a lot of this comes back to what we were talking about previously also about adversarial training. So I think that there's a couple of things going on here. So first, unidentifiability is really not the only way in which this can happen. So [??unidentifiability??] is required at zero training error. If we're imagining a situation where we have zero training error, it perfectly fits the training data, then yeah, in some sense if it's misaligned, it has to be because of unidentifiability. But in practice in many situations, we don't train to zero training error. We have these really complex data sets that are very difficult to fit completely. And you train far before zero training error. And in that situation, it's not just unidentifability and in fact, you can end up in a situation where the inductive biases can be stronger in some sense, than the training data. And this is... If you think about avoiding overfit.

Evan Hubinger: If you train too far on the training data, sort of perfectly fit, you overfit the training data, because now you've trained a function that just perfectly goes through all of them. There's double descent, like we were talking about previously, that calls some of this into question. But if you just think about stopping exactly at zero training error, this is sort of a problem. But if you think about instead, not over-fitting and training a model which is able to fit a lot, a bunch of the data, but not all of the data. But you stop such that you still have a strong influence of your inductive biases on the actual thing that you end up with. Then you're in a situation where it's not just about unidentifiability, It's also about... You might end up with a model that is much simpler than the model that actually fits the data perfectly, but because it's much simpler it actually has better generalization performance.

Evan Hubinger: But because it's much simpler and doesn't fit the data, it might also have this very simple proxy. And that simple proxy is not necessarily exactly the thing that you want. An analogy here that is from the paper but I think is useful to think about is, imagine a situation... We talked previously about this example of humans being trained by evolution. And I think this example, there's... I don't necessarily love this example, but it is, in some sense, the only one we have to go off. And so if you think about the situation, what actually happened with humans and evolution, well, evolution is trying to train the humans in some sense to maximize genetic fitness, to reproduce and to carry their DNA on down to future generations. But in fact, this is not the thing which humans actually care about. We care about all of these other things like food and mating and all of these other proxies basically, for the actual thing of getting our DNA into future generations.

Evan Hubinger: And I think this makes sense. And it makes sense because, well, if you imagine a situation where an alternative human that actually cared about DNA or whatever, it would just be really hard for this human. Because first of all, this human has to do a really complex optimization procedure. They have to first figure out what DNA is and understand it. They have to have this really complex high level world model to understand DNA and to understand that they care about this DNA being passed into the future generations or something. And they have to do this really complex optimization. If I stub my toe or whatever, I know that that's bad because I get a pain response. Whereas, if a human in this alternative world stubs their toe, they have to deduce the fact that stubbing your toe is bad because it means that toes are actually useful for being able to run around or do stuff. And that's useful to be able to get food and be able to attract a mate. And therefore, because of having a stubbed toe, I'm not going to be able to do those things as effectively.

Evan Hubinger: And so I should avoid stubbing my toe. But that's a really complex optimization process that requires a lot more compute, it's much more complicated, because specifying DNA in terms of your input data is really hard. And so we might expect that a similar thing will happen to neural networks, to models where, even if you're in a situation where you could theoretically learn this perfect proxy, to actually care about DNA or whatever. You'll instead learn simpler, easier to compute proxies because those simpler, easier to compute proxies are actually better, because they're easier to actually produce actions based on those proxies, you don't have to do some some complex optimization process based on the real correct thing.

Evan Hubinger: They're easier to specify in terms of your input data. They're easier to compute and figure out what's going on with them. They don't require as much description length and as much explicit encoding. Because you don't have to encode this complex world model about, what is DNA and how does all of this work? You just have to code, if you get a pain signal, don't do that. And so in some sense, I think we should expect similar things to happen where, even if you're in a situation where you're training on this really rich detailed environment, where you might expect it to be the case that the only thing which will work is the thing which is actually the correct proxy. In fact, a lot of incorrect proxies can be better at achieving the thing that you want, because they're simpler, because they're faster, because they're easier. And so we might expect, even in that situation, you end up with simpler, easier proxies.

Daniel Filan: Yeah. And I guess that's interesting because it's... When I'm thinking about that, it seems like there's this trade-off of... You could imagine a world where evolution or the anthropomorphic god of evolution or whatever, put this... It really did say Daniel, the thing you're going to care about is gene alleles. Like the allele frequency of... I bet I'm using the wrong terminology. But how much my genes are in the population. Also Daniel, stubbing your toe, it doesn't help with that. Pain, typically, not good for this.

Daniel Filan: But what you really care about is your genes being propagated. And in that case, I really, presumably, I'm going to basically behave fine. And when these things are going to diverge, hopefully I eventually figure it out. But there's this problem, I guess there are two problems. Firstly, then it's a lot more description length because you have to pre-load all of this information. And secondly, there's this issue where well, when I'm kind of dumb, the bit of me that's saying, "well, what you actually care about is propagating your genes", that's not actually influencing my behavior in any way, because I'm just acting on the proxies anyway, because I know that they're proxies. So if there's some pressure to... It seems like that bit is just going to be totally unmoored and might randomly drift or maybe get pruned away because it's just not influencing my behavior.

Evan Hubinger: Yeah. I think that's basically exactly what I would say. I agree with that.

Daniel Filan: Cool.

Evan Hubinger: On both points.

Daniel Filan: All right. Another thing - this question is kind of long, and it only eventually gets to being about inner alignment. So I'm sorry listeners, but we're going to get there. So there's this argument in the paper, which basically says that if you're training a system for a diverse set of tasks, you want to expend most of your optimization online, once the task is done. You don't want to do a bunch of your optimization power in generating this system and then pre-computing, "okay, I'm going to figure out the policy for this type of task and pre-compute this type of task." You just want to do that optimization once you know the task, so that you can only do it for the thing that's actually useful.

Daniel Filan: And this reminded me a lot of, Eric Drexler's comprehensive AI services or CAIS, C-A-I-S model. Where you just, instead of having an AGI, you just... There are a bunch of tasks you want to be done and you just have a bunch of AIs that are made for each type of task and when you have a task you select the appropriate AI. So I'm wondering does this argument... My first question is, does this argument reinforce, or is this an argument for the idea of doing something like CAIS instead of creating this single unified AGI?

Evan Hubinger: Yeah. So I think it gets a little bit more complicated when you're thinking about, how things play out on a strategic landscape. So there might be many reasons for why you want to build one AI system and then, for example, fine tune that AI system on many different tasks, as opposed to independently training many different AI systems. Because if you independently train many different AI systems, you have to expend a bunch of training compute to do each one. Whereas, if you train one and then fine tune it, you maybe have to do less training compute, because you just do it... a lot of the overlap you get to do only once. I think in that sense... And I think that does, the sort of argument I just made, sort of does parallel the argument that we make in the paper where it's like, well, there's a lot of things you only want to do once.

Evan Hubinger: Like the building of the optimization machinery, you only want to do once. You don't want to have to figure out how to optimize every single time, but the explicit, what task you're doing, the locating yourself in the environment locating the specific task that you're being asked to do is the sort of thing which you want to do online. Because it's just really hard to recompute that if you're in an environment where you have to have massive possible divergence in what tasks you might be presented in every different point. And so there's this similarity there, where it suggests that, in a CAIS style world, you might also want a situation where, you have one AI which knows how to do optimization and then you fine tune it for many different possible tasks.

Evan Hubinger: It doesn't necessarily have to be actual fine tuning. You can also think about something like GPT-3, with the OpenAI API where people have been able to locate specific tasks, just via prompts. So you can imagine something like that also being a situation where, you have a bunch of different tasks, you train one model which has a bunch of general purpose machinery and then a bunch of ability to locate specific tasks. I think any of these possible scenarios, including the... You just train many different AIs with different, very different tasks because you really need fundamentally different machinery, are possible. And I think risks from learned optimization basically applies in all these situations. I mean, whenever you're training a model, you have to worry that you might be training a model which is learning an objective, which is different from the loss function you trained under. Regardless of whether you're training one model and then fine tuning a bunch of times or training many different models for whatever sort of situation you're in.

Daniel Filan: Yeah. Well, so the reason I bring it up is that, if I think about this CAIS scenario where, say, the way you do your optimization online is, you know you want to be able to deal with these 50 environments so you train - maybe you pre-train one system, but you fine tune all 50 different systems, you have different AIs. And then once you realize you're in one of the scenarios you deploy the appropriate AI, say. To me, that seems like a situation where instead of this inner alignment problem, where you have this system that's training to be a mesa-optimizer and it needed to spend a bunch of time doing a lot of computation to figure out what worlds it's in and to figure out an appropriate plan.

Daniel Filan: It seems like in this CAIS world, you just have... You want to train this AI. And the optimization that in the unified AGI world would have been done inside the thing during inference, you get to do. So you can create this machine learner that doesn't have that much inference time. You can make it not be an optimizer. So maybe you don't have to worry about these inner alignment problems. And then it seems, what CAIS is doing is, it's turning... It's saying, well, I'm going to trade your inner alignment problem for an outer alignment problem of getting the base objective of this machine learning system to be what you actually want to happen. And also you get to be assisted by some AIs maybe that you already trained, potentially, to help you with this task. So I'm wondering first, do you think that's an accurate description? And secondly, if that's the trade, do you think that's a trade worth making?

Evan Hubinger: Yeah, so I guess I don't fully buy that story. So I think that, if you think about a CAIS set up, I don't think... So I guess you could imagine one situation like what you were describing, where you train a bunch of different AIs and then you have maybe a human that tries to switch between those AIs. I think that, first of all, I think this runs into competitiveness issues. I think this question of, how good are humans at figuring out what the correct strategy is to deploy in the situation? In some sense, that might be the thing which we want AIs to be able to do for us.

Daniel Filan: The human could be assisted by an AI.

Evan Hubinger: Yeah. You could even be assisted by an AI. But then you also have the problem of, is that AI doing the right thing? And in many ways I think that in each of these individual situations... I have a bunch of AIs that are trained in all these different ways, in different environments. And I have maybe another AI, which is also trained to assist me in figuring out what AI that I should use. Each one of these individual AIs has an inner alignment problem associated with it. And might that inner alignment problem be easier because they're trained on more restrictive tasks? In some sense yes, I think that if you train on a more restricted task, you're less likely to end up with - search is less valuable because you have this less generality.

Evan Hubinger: But if at the end of the day you want to be able to solve really general tasks, the generality still has to exist somewhere. And I still expect you're going to have lots of models, which are trained in relatively general environments, such that you're going to... Search is going to be incentivized, we're going to have some models which have mesa-objectives. And in some sense, it also makes part of the control problem harder because you have all of these potentially different mesa-objectives and different models and it might be harder to analyze exactly what's going on. It might be easier because they're less likely to have mesa-objectives, or it might be easier because it's harder for them to coordinate because they don't share the exact same architecture and base objective potentially. So I'm not exactly sure the extent to which this is a better setup or a worse set up.

Evan Hubinger: But I do think that it's a situation where I don't think that you're making the inner alignment problem go away or turning it into a different problem. It still exists, I think in basically the same form. It's just like you have an inner alignment problem for each one of the different systems that you're training, because that system could itself develop a [inaudible 01:42:20] process or an objective that is different from the loss function that you're actually trying to get them to pursue.

Evan Hubinger: And I think, I guess, if you think about a situation, I think there's so much generality in the world that maybe we're in a situation where let's say I train 10,000 models. I've reduced the total generality in the world by some number of bits or whatever, ten to the three, three orders of magnitude or whatever, right?

Evan Hubinger: I don't know, what'd I say? 10,000 orders of magnitude? Whatever. Some number of orders of magnitude, right? It's the extent to which I've reduced the total space of generality, but the world is just so general that even if I've trained 10,000 different systems, the amount of possible space that I need to cover, if I want to able to do all of these various different things in the world, to respond to various different possible situations.

Evan Hubinger: There's still so much generality that I still think you're basically going to need search. You just haven't reduced the space, the amount of which you've reduced the space by cutting it down, by having multiple models is just dwarfed by the size of the space in general. Such that you're still going to need to have models which are capable of dealing with large generality and being able to respond to many different possible situations.

Evan Hubinger: And you could also imagine, if you're trying to do an alternative situation where it's CAIS, but it's also, you're trying to do CAIS like microscope AI or STEM AI, where you're explicitly not trying to solve all these sorts of general [inaudible 01:43:43] situations. Maybe things are different in that situation? But I think certainly if you want to do something like CAIS, like you're describing, where you have a bunch of different models for different situations and you want to be able to solve AI CEOs or something. It's just, the problem is so general that you just, there's still enough generality I think that you're going to need search. You're going to need optimization.

Daniel Filan: Yeah, I guess, so I think I'm convinced by the idea that there's just so many situations in the world that you can't manually train a thing for each situation. I do want to push back a little bit on the - and actually that thing I'm conceding is just enough for the argument. But anyway, I do want to push back on the argument that you still have this inner alignment problem, which is that the thing in the inner alignment problem is at least in the context of this paper, is it's when you have a mesa-optimizer and that mesa-optimizer has a mesa-objective that is not the base objective.

Daniel Filan: And it seems like in the case where you have these not very general systems, you just, you have the option of training things that just don't use a bunch of computation online, don't use a bunch of computation at inference time, because they don't have to figure out what task they're in because they're already in one. And instead they can just be a controller, you know? So these things aren't mesa-optimizers, and therefore they don't have a mesa-objective. And therefore there is no inner alignment problem. Now maybe there's the analog of the inner alignment [problems], there's the compiled inner alignment problem. But that's what I was trying to get at.

Evan Hubinger: Yeah. So I guess the point that I'm trying to make is that even a situation where you're like, "Oh, it already knows what tasks it needs to do because we've trained the thing for specific tasks," is that if you're splitting the entirety of a really complex problem into only 10,000 tasks or whatever, only a thousand tasks, each individual task is still going to have so much generality that I think you'll need search, you'll need mesa-optimization, you'll a lot of inference time compute, even to solve it at that level of granularity.

Daniel Filan: Okay.

Evan Hubinger: The point that I'm making is that just, I think that the problem itself requires, if you're trying to solve really complex general problems, there's enough generality there, that even if you're going down to the level of granularity, which is like a thousand times reduced, there's still enough granularity, you'll probably need search, you'll still probably get mesa-optimizers there. You're going to need a lot of inference time compute etc. etc.

Daniel Filan: Okay. That makes sense. I want to talk a little bit now about deceptive alignment. So deceptive alignment is this thing you mentioned in the paper where your mesa-optimizer has a different goal than... And we've talked about this before, but to recap your mesa-optimizer has a different mesa-objective than what the base objective is.

Daniel Filan: And furthermore, your mesa-optimizer is still being trained and its mesa-objective cares about what happens after gradient updates. So it cares about steering the direction of gradient updates. So therefore in situations where the mesa-objective and the base objective might come apart, if you're deceptively aligned, your mesa-optimizer picks the thing that the base objective wants so that its mesa-objective doesn't get optimized away so that one day it can live free and no longer have to be subject to gradient updates and then pursue its mesa-objective. Correct me if that's wrong?

Evan Hubinger: Yeah. I think that's a pretty reasonable description of what's going on.

Daniel Filan: Yeah. So the paper devotes a fair bit of space to this, and I'm wondering, it mostly devotes space to what happens if you get deceptive alignment. And what I'm kind of interested in, and I don't really understand, is how this occurs during training. So at what point do you get an objective that cares about what happens after episodes, after gradients or updates, after gradient updates are applied? How does that end up getting selected for?

Evan Hubinger: Yeah. So this is a really good question and it's a complicated one. And so we do talk about this in the paper a good deal. I think there's also a lot of other discussion in other places. So one thing just to point out. So let's see Mark Xu has this post on the alignment forum, does SGD produce deceptive alignment, which I think is a really good starting point.

Evan Hubinger: Also I worked with him a bunch on putting together a lot of the arguments for this in that place. So I think that's a good place to look, but I'll try to talk through some of the ways in which I view this.

Daniel Filan: Okay.

Evan Hubinger: So I think that in the paper we have the following story. And telling stories about gradient descent is hard and it's the sort of thing where you should expect to be wrong in a lot of cases.

Evan Hubinger: And so I'm going to tell a story, but I don't necessarily think that this is precisely the sort of story that you would end up with. But I think there's general arguments that you can make for why this might be likely, why it might not be likely. And I think that by telling the stories is helpful for trying to understand what a setup might be like in which you would get this.

Evan Hubinger: So here's the story I'm going to tell. So we have a proxy aligned mesa-optimizer. I think that basically this is where you start, because if we're thinking about all of the arguments I made previously, thinking about things like humans, having a pain response rather than caring about DNA, you're going to start with some proxy, simple, fast proxies. But maybe you do a bunch of adversarial training, maybe you do a bunch of trying to train on all of the different situations. You really force it to go to zero training error to really understand everything that's happening to have to sort of have a perfect version of the base objective.

Evan Hubinger: What happens in this situation? Well, at some point, we'll say it has to learn the base objective. Has to know what the base objective is and be able to act directly according to that base objective. But there's multiple ways in which a model can learn what the base objective is.

Evan Hubinger: So one way in which it can learn what the base objective is is, well, it has the base objective directly hard-coded into it via gradient descent. So it has some proxy objective and gradient descent sort of tinkers with that objective and keeps encoding into it more and more detail to what the correct thing is. That's one way. We call that in the paper, internalization.

Evan Hubinger: There's another way, which is it can learn what the base objective is via its input data. So, rather than via the gradient descent update steps being the thing which encodes the objective into it, maybe you're trying to train it to do language modeling, predict what a human would do, and it just reads a bunch about humans and how they work. Or maybe your training it to do value learning and try to optimize for human values. And it reads the Wikipedia page on ethics and it tries to really understand human values.

Evan Hubinger: And so it gets a bunch of information about the objective you're training it on through its input data. It gets a bunch of information about the thing that you're trying to get it to do, just because you, if your GPT-3 style or whatever, you can imagine training on a huge corpus of data; it has a whole bunch of information. And a lot of that information is just, whatever you're training it on is probably going to be somewhere. There's going to be information about that thing that you're training it on in that input data.

Evan Hubinger: And so it can also just figure out what the objective is by looking at that input data. But then there's a question which is, well let's say the model, it's doing a bunch of world modeling. It's trying to understand the world and so it's produced an internal model that includes what the correct base objective [is]. But it doesn't know that's the correct base objective. It's not trying to optimize for that; it's just trying to optimize for its proxy. It'd be really nice, from the perspective of gradient descent, from the perspective of the training process, if it, instead of optimizing for its proxy, started optimizing for that thing, which is encoded in the world model. Started optimizing for its description, its better understanding of the base objective.

Evan Hubinger: So there's the question, how does gradient descent, how does the training process modify the model to start optimizing for that thing encoded in its world model.

Evan Hubinger: And fundamentally in the paper we sort of analyze two ways. One of which is, we call corrigible alignment, which is, well, its objective could change into a pointer to that thing in the world model. And another way is deceptive alignment, which is it could realize that, as an instrumental goal for optimizing its proxy, it should be trying to optimize for this thing, because it realizes that's the base objective that it's being trained on.

Evan Hubinger: And so we end up with three ways in which it could learn the base objective: internalization, corrigibility, and deception. And then the question that we ask, which is, well, how likely are these different things to occur? How well do they perform on things like simplicity? How well do these measure [inaudible 01:52:08] perform?

Evan Hubinger: So one thing that I'll start with, and this is an analogy from Buck Shlegeris, which is sort of a counting argument, which is like, well, let's just start with the question, how many models are there that fall into each one of these categories. And I think the answer is, well, there's a lot more deceptive models. And so here's the analogy to think about why that's the case. So we can imagine a situation where you have this Christian God, is trying to select humans, according to the extent to which they align with the Bible. And you can imagine three different humans that might fall into this category. So one of which would be Jesus Christ, who just has the exact same objective as God because-

Daniel Filan: In Christianity, I believe Jesus Christ is not totally a human.

Evan Hubinger: Sure, I agree with that. But let's set that aside and imagine that Jesus Christ is a possible option here that God is considering. And Jesus Christ has the exact same objective as God does, because whatever. And then you have another example, which is, let's say Martin Luther. And Martin Luther is like, "I'm going to look at the Bible and figure out exactly what the Bible says,, and try to do that as effectively as possible". And then you have Blaise Pascal and Blaise Pascal is like, "I don't care about the Bible; I don't care about God. But I know that there is a God and that God is going to select me according to whether I care about the Bible. So I'm going to care about the Bible." And these sorts of three people pretty closely match onto the sort of three scenarios that I described.

Evan Hubinger: So, Jesus Christ is internalization. Martin Luther would be corrigibility, and Blaise Pascal would be deception. And the question is you, how many Blaise Pascals are there, how many Martin Luthers are there and how many Jesus Christs are there?

Evan Hubinger: And I think the initial response would be well, there's only one Jesus Christ because it has to explicitly encode the entirety of the objective. There might be multiple Martin Luthers because there might be some sort of basin of corrigibility, it might be the case that if you get your pointer slightly wrong, you can still correct it by figuring out what the Bible says about what your pointer should be. You still basically have to have exactly the correct pointer to the Bible and believe that the Bible is the thing you should be following.

Evan Hubinger: But there's a lot of Blaise Pascals because Blaise Pascal can have any objective, essentially, as long as Blaise Pascal is aware of the fact that they're in an optimization process, whatever their objective is, they're going to try to change it to be the one that is pointing to the Bible or whatever.

Evan Hubinger: And so there's this simple counting argument, which is, well, it just seems like there's a lot more deceptive models than the others. And so just as a very naive prior, it seems like we should expect deception. But, well, we can go further than that, and we can try to ask further questions of simplicity and how likely would this action be to occur to a gradient descent process. So it seems like-

Daniel Filan: So before we get to those arguments, so you're right that the paper does devote a fair bit of space to, okay, suppose you have a proxy, you're proxy aligned and your proxy extends across gradient updates. And, then you learn about the base objective. And I think there's a decent story in the paper that people can read for how that happens once you have this proxy that extends past gradient updates.

Daniel Filan: And I think my initial question, to me the thing - maybe listeners are different. But to me, the thing that seems most implausible about this story is that gradient descent produces this kind of proxy because that kind of proxy is not - When the gradient is applied, the thing evaluating the gradient doesn't know whether the objective that extends past when the gradient applies was met or not because it hasn't happened yet. So, the thing that confuses me the most is how you ever get a situation where your system cares about what happens after the gradients are applied?

Evan Hubinger: Yes, that's a great question. And in fact, this is actually really close to a lot of the research that I'm currently doing, trying to understand precisely the situations in which you get these cross-episodic objectives and in what situations you can produce myopic objectives instead.

Evan Hubinger: So I think one way to think about this, is that fundamentally it's a generalization question because you're correct that when you're doing training, it doesn't understand that there's potentially multiple episodes in your reinforcement learning setting. It doesn't know about the multiple episodes and so it learns an objective that isn't really in terms of those multiple episodes. But let's say it does discover the existence of those multiple episodes. The question that we're asking is how does it generalize? How does it generalize in terms of caring about those multiple episodes, or not caring about those multiple episodes?

Evan Hubinger: It learns some understanding of an objective, some understanding of what it's supposed to be, what sort of proxy its trying to fulfill. And that proxy - once it discovers that these other episodes exist, does it think that its proxy also exists there? Does it care about its proxy in those situations also or not?

Evan Hubinger: And that's the question we're answering, this generalization question. And I think that by default, we should expect it to generalize such that - Here's a very simple intuitive argument is, if I trained a model and it's trying to find all the red things it can, and then suddenly it discovers that there are way more red things in the world than it previously thought, because there's also all of these red things in these other episodes, I think that by default you should basically expect, it's going to try to get those other red things too.

Evan Hubinger: That it would require an additional distinction in the model's understanding of the world to believe that only these red things are the ones that matter, not those red things. And that sort of distinction, I think, is not the sort of distinction that you would be - gradient descent has no reason to put that distinction in the model, because like you were saying, in the training environment where it doesn't understand that there are these multiple episodes, whether it has or does not have that distinction, is completely irrelevant.

Evan Hubinger: It only matters when it ends up in a generalization situation where it understands that there are multiple episodes and is trying to figure out whether it cares about them or not. And in that situation, I expect that this sort of distinction would be lacking, that it would be like, "well, there are red things there too, so I'll just care about those red things also."

Evan Hubinger: Now this is a very hand-wavy argument and it's certainly, I think a really big open question. And in fact, I think it's one of the places in which I feel like we have the most opportunity potentially to intervene, which is why I was talking about one of the ways in which - a place where I focus a lot on my research. Because it seems like if we are trying to not produce deceptively aligned mesa-optimizers, then what are the ways which we might be able to intervene on this story would be to prevent it from developing an objective which cares about multiple episodes, and instead ensure that it develops a sort of myopic objective.

Evan Hubinger: Now there's a lot of problems that come into play when you start thinking about that. And we can go into that, but I think just on a surface level, the point is that, well, just by default, despite the fact that you might have a myopic training setup, whether you actually end up with a myopic objective is very unclear. And it seems like naively myopic objectives might be more complicated because they might require this additional restriction of no, not just objective everywhere, but actually objective only here.

Evan Hubinger: And that might require additional description length that might make it more complex. But certainly it's not the case that we really understand exactly what will happen. And so one of the interesting things that's happening here is it's breaking the assumptions of reinforcement learning, where when we, a lot of times when we do reinforcement learning, we assume that the episodes are i.i.d., that they're independent, that the model - We just totally resample the episodes every time, that there's no way for it to influence across episodes. But these sorts of problems start to occur, as soon as you start entering situations where it does have the ability to influence across episodes.

Daniel Filan: Such as online learning, like we mentioned at the start, which is a common - Often when you're thinking about how are we going to get really powerful AIs that work in reality, it's like, yeah, we're going to do online learning.

Evan Hubinger: Precisely. Yeah. And so if you think about, for example, an online learning setup, maybe you're imagining something like a recommendation system. So it's trying to recommend you YouTube videos or something.

Evan Hubinger: One of the things that can happen in this sort of a setup is that, well, it can try to change the distribution to make its task easier in the future. You know, if it tries to give you videos which will change your views in a particular way such that it's easier to satisfy your views in the future, that's a sort of non-myopia that could be incentivized just by the fact that you're doing this online learning over many steps.

Evan Hubinger: And if you think about something, especially what can happen in this sort of situation is, let's say I have a - Or another situation this can happen is let's say I'm just trying to train the model to satisfy humans' preferences or whatever.

Evan Hubinger: It can try to modify the humans' preferences to be easier to satisfy. And that's another way of which this non-myopia could be manifest where, it tries to take some action in one episode to make humans' preferences different so that in future episodes they'll be easier to satisfy.

Evan Hubinger: And so there are a lot of these sorts of non-myopia problems. And not only, so I was making an argument earlier that it seems like non-myopia might just be simpler. There's also a lot of things that we do that actually explicitly incentivize it. So if you think about something like population-based training, population-based training is a technique where what you do is you basically create a population of models, and then you evolve those models over time. You're changing their parameters around, you're moving, changing their hyperparameters and stuff, just make it so that they have the best performance over many episodes.

Evan Hubinger: But as soon as you're doing that, as soon as you're trying to select from models, which have the best performance over multiple episodes, you're selecting for models which do things like change the humans' preferences, such that in the next episode, they're easier to satisfy. Or recommendation systems which try to change you such that you'll be easier to give things to in the future.

Evan Hubinger: And this is the sort of thing that actually can happen a lot. So David Krueger and others have a paper that talks about this, where they talk about these as hidden incentives for distributional shift or auto induced distributional shift, where you can have setups like population-based training where the model is directly incentivized to change its own distribution in this non-myopic way.

Evan Hubinger: And so in some ways there's two things going on here. There's, it seems like even if you don't do any of the population-based training stuff, the non-myopia might just be simpler. But also there's a lot of the techniques that we do that because they implicitly assume this i.i.d. assumption that's in fact, not always the case, they implicitly can be incentivizing for non-myopia in ways that might not [inaudible 02:02:31].

Daniel Filan: Okay. So I have a few questions about how I should think about the paper. So the first one is most of the arguments in this paper are kind of informal, right? There's very few mathematical proofs; you don't have experiments on MNIST, the gold standard in ML. So how confident are you that the arguments in the paper are correct? And separately, how confident do you think readers of the paper should be that the arguments in the paper are correct?

Evan Hubinger: That's a great question. Yeah. So I think it really varies from argument to argument. So I think there's a lot of things that we say that I feel pretty confident in. Things like, I feel pretty confident that search is going to be a component of models. I feel pretty confident that you will get simple, faster proxies. Some of the more complex stories that we tell, I think are less compelling. Things that I feel like are somewhat less compelling, we have the description of a mesa-optimizer as having this really coherent observation process, really coherent objective. I think that I do expect search; to the extent to which I expect a really coherent objective and a really coherent optimization process, I'm not sure. I think that's certainly a limitation of our analysis.

Evan Hubinger: Also, some of the stories I was talking about about deception, I do think that there's a strong prior that should be like, "hmm, it seems like deception is likely," but we don't understand it fully enough. We don't have examples of it happening, for example.

Evan Hubinger: And so it's certainly something where we're more uncertain. So I do think it varies. I think that there's some arguments that we make that I feel relatively confident in, some that I feel relatively less confident in. I do think, I mean, we make an attempt to really look at what is our knowledge of machine learning inductive biases, what is our understanding of the extent to which certain things would be incentivized over other things and try to use that knowledge to come to the best conclusions that we can.

Evan Hubinger: But of course, that's also limited by the extent of our knowledge, our understanding of inductive biases is relatively limited. We don't fully understand exactly what neural networks are doing internally. And so there are fundamental limitations to how good our analysis can be, and how accurate it can be. But I do think that in terms of, well, "this is our best guess currently", I feel like is how I would think of the paper.

Daniel Filan: Okay. And I guess sort of a related question, my understanding is a big motivation to write this paper is just generally being worried about really dangerous outcomes from AI, and thinking about how we can avert them. Is that fair to say?

Evan Hubinger: Yeah. I mean, certainly the motivation for the paper is we think that the situation in which you have a mesa-optimizer that's optimizing for an incorrect objective is quite dangerous and that this is the central motivation. And there's many reasons why this might be quite dangerous, but the simple argument is well, if you have a thing which is really competently optimizing for something which you did not tell it to optimize for, that's not a good position to be in.

Daniel Filan: So I guess my question is, out of all the ways that we could have really terrible outcomes from AI, what proportion of them do you think do come from inner alignment failures?

Evan Hubinger: That's a good question. Yeah. So this is tricky. Any number that I give you, isn't necessarily going to be super well-informed. I don't have a really clear analysis behind any number that I give you. My expectation is that most of the worst problems, the majority of existential risk from AI comes from inner alignment and mostly comes from deceptive alignment, would be my personal guess.

Evan Hubinger: And some of the reasons for that are, well, I think that, there's a question of what sorts of things will we be able to correct for, what sorts of things will be not be able to correct for? And I think that there's a sense in which, if you think about, the thing we're trying to avoid is these unrecoverable failures, situations in which our feedback mechanisms are not capable of responding to what's happening.

Evan Hubinger: And I think that deception, for example, is particularly bad on this. Especially if you have a situation where, because the model basically looks like it's fine until it realizes that it will be able to overcome your feedback mechanisms, in which case, it no longer looks fine.

Evan Hubinger: And that's really bad from the perspective of being able to correct things by feedback. Because it means you only see the problem at the point where you can no longer correct it. And so those sorts of things lead me into the direction of thinking that things like deceptive alignment are likely to be a large source of the existential risk.

Evan Hubinger: I also think that, I don't know, I think that there's certainly a lot of risk also just coming from standard proxy alignment situations, as well as relatively straightforward outer alignment problems.

Evan Hubinger: I am relatively more optimistic about outer alignment because I feel like we have better approaches to address outer alignment. I think that things like Paul Christiano's amplification as well as his learning the prior approach, Geoffrey Irving's debate. I think that there exists a lot of approaches out there, which I think have made significant progress on outer alignment, even if they don't fully solve outer alignment. Whereas, I feel like our progress on inner alignment is less meaningful. And I expect that to sort of continue such that I'm more worried about inner alignment problems.

Daniel Filan: Hmm. So I guess this gets to a question I have. So, one thing I'm kind of thinking about is, okay, what do we do about this? But before I do that, in military theory, there's this idea called the OODA loop. OODA is short for, it's O-O-D-A. It's observe, orient, decide, and act, right? And the idea is, well, if you're a fighter pilot and you want to know what to do, first you have to observe your surroundings. You have to orient to think about, okay, what are the problems here? What am I trying to do? You have to decide on an action and then you have to act.

Daniel Filan: And the idea of the OODA loop is you want to complete your OODA loop and you want to not have the person you want to shoot out of the sky complete their OODA loop. So instead of asking you what we're going to do about it and jumping straight to the A, first, I want to check where in the OODA loop do you think we are in terms of inner alignment problems?

Evan Hubinger: Yeah. So it's a little bit tricky. So in some sense, first of all, there's a view which we aren't even at observe yet, because we don't even have really good empirical examples of this sort of thing happening. We have examples of machine learning failing, but we don't have examples of, is there an optimizer in there? Does it have an objective? Did it learn something? We don't even know!

Evan Hubinger: And so in some sense we aren't even at observe. But we do have some evidence, right? So, the paper looks at lots of different things and tries to understand, some of what we do understand about machine learning systems and how they work. So in some sense, we have observed some things, and so in that sense, I would say the paper is more at orient. The paper doesn't, Risks from Learned Optimization doesn't come to a conclusion about this is the correct way to resolve the problem. It's more just trying to understand the problem based on the things that we have observed so far.

Evan Hubinger: There are other things that I've written that are more at the decide stage, but are still mostly at the orient stage. I mean, so stuff like my post on relaxed adversarial training for inner alignment is closer to the decide stage because it's investing in a particular approach. But it's still more just, let's orient and try to understand this approach because, there's still so many open problems to be solved. So I think we're still relatively early in the process for sure.

Daniel Filan: Do you think that it's a problem that we're doing all this orientation before the observation, which is, I guess the opposite of what, if you take OODA loop seriously, it's the opposite order that you're supposed to do it in.

Evan Hubinger: Sure. I mean, it would be awesome if we had better observations, but in some sense we have to work with what we've got, right? We want to solve the problem and we want to try to make progress on the problem now because that will make it easier to solve in the future. We have some things that we know and we have some observations; we're not totally naive.

Evan Hubinger: And so we can start to make understanding, can start to produce theoretical models, try and understand what's going on before we get concrete observations. And for some of these things, we might never get concrete observations, until potentially it's too late. If you think about deceptive alignment, there are scenarios where we don't see - if a deceptive model is good at being deceptive, then we won't learn about the deception until the point at which we've deployed it everywhere, and it's able to break free of whatever mechanisms we have, or doing feedback.

Evan Hubinger: And so if we're in a situation like that, you have to solve the problem before you get the observation, if you want to not die to the problem. And so, it's one of the things that makes it a difficult problem, is that we don't get to just rely on the more natural feedback loops of like, we look at the problem that's happening, we try to come up with some solution and we deploy that solution because we might not get great evidence. We may not get good observations of what's happening before we have to solve it. And that's one of the things that makes the problem so scary, I think from my perspective, is that we don't get to rely on a lot of our more standard approaches.

Daniel Filan: Okay. So on that note, if listeners are interested in following your work, seeing what you get up to, what you're doing in future, how should they do that?

Evan Hubinger: Yeah, so my work is public and basically all of it is on the Alignment Forum. So you can find me, I am evhub, E-V-H-U-B on the Alignment Forum. So if you go there, you should be able to just Google that and find me, or just Google my name. And you can go through all of the different posts and writing that I have. Some good starting points:

Evan Hubinger: So once you've, I don't know if you've already read Risks from Learned Optimization, other good places to look at some of the other work that I've done might be An overview of 11 proposals for building safe advanced AI, which I've mentioned previously, and goes into a lot of different approaches for trying to address the overall AI safety problem and includes an analysis of all of those approaches on outer alignment and inner alignment.

Evan Hubinger: And so trying to understand how would they address mesa-optimization problems, all this stuff. And so I think that's a good place to go in to try to understand my work. There's also, another post that might be good, would be the Relaxed adversarial training for inner alignment post, which I had mentioned tries to address the very specific problem of relaxed adversarial training.

Evan Hubinger: It's also worth mentioning that if you like Risks from Learned Optimization, I'm only one of multiple co-authors on the paper, and so you might also be interested in some of the work of some of the other co-authors. So I mention that Vlad, Vladimir Mikulik has a post on 2D robustness, which I think is really relevant, worth taking a look at. I also mentioned some of Joar's recent stuff on trying to, Joar Skalse, on trying to understand some of what's going on with inductive biases and the prior of neural networks. And so there's lots of good stuff there, to take a look at from all of us.

Daniel Filan: Okay. Well, thanks for appearing on the podcast and to the listeners, I hope you join us again.

Daniel Filan: This episode was edited by Finan Adamson.



Discuss

The Problem with Giving Advice

18 февраля, 2021 - 00:52
Published on February 17, 2021 9:52 PM GMT

Epistemic Status: I'm highly confident this is a phenomenon that occurs with a lot of advice people give, but I'm quite uncertain about the best way to deal with it when trying to give advice to more than one person.

 

The main thing people fail to consider when giving advice is that someone with ostensibly the same problem may require a vastly different solution than themselves. The underlying cause of the problem in many cases may be the exact opposite of what it was for you. 

Many issues in life are caused by being too extreme. Either extreme is problematic-both too much of something or too little. Often people only think about one extreme (or one possible failure mode, we’re not limited to only two!) when they give advice because that is the problem they themselves had to overcome. It fails to occur that not only might this advice not be helpful, it might be actively detrimental to a person struggling with the opposite problem. To use a concrete example, imagine a runner is trying to improve their 5K time and asks a more accomplished runner for advice. The faster runner suggests that if the questioner wants to break through their plateau they will probably need to do more high intensity interval training. After all, the faster runner can remember that was what they needed to do to progress past their own plateau. Unfortunately, the slower runner’s problem is that they are overtraining, their muscles do not have sufficient time to recover and become stronger, limiting improvement. This advice will thus be completely counterproductive, and will probably lead to injury if the runner tries too hard to follow it.

The issue is that it is instinctive to ask “how would this advice have affected me?” when evaluating possible advice to give, rather than “is this the sort of scenario in which that sort of advice would be useful, accounting for the individual receiving the advice?” From this one might be tempted to derive the following lesson: never give anyone important advice unless you have thoroughly questioned what their problem is and are very confident you understand both the problem and the underlying cause. 

If you have the resources and time to give people individual advice I think this is a reasonable principle to abide by. But we often do not have this luxury, sometimes we want to give advice to multiple people at once. Sometimes we just don’t have the time or resources to inquire deeply into the specifics of someone’s problem. This difficulty is exacerbated by the fact that even once you try to consider how to give advice that doesn’t accidentally hurt someone you may fail to imagine all the ways your advice might do harm because you underestimate how different other people are from yourself.

So how do we avoid giving people bad advice? One solution is to adopt a policy of not giving unpersonalized advice, of course, but assuming we still believe we have useful things we want to say to people how should we proceed? For audiences who understand this problem with advice, one might avoid a lot of potential damage by starting discussion of possible ways you imagine the advice might go wrong and asking a reader/listener to consider themselves before applying the advice. Unfortunately, for broader audiences this technique will probably not work unless you can take the time to explain all this because it will look like you aren’t sure of your own advice and are hedging your bets or some such. And you certainly will not always have the time to explain this. A simple disclaimer that most good advice is situational and depends on the person may help some people avoid harm, especially if you are giving advice you know to be potentially dangerous from a position of expertise or authority, though I suspect most people would ignore such warnings.

Does anyone have any other strategies to avoid/minimize the unintentional harm advice may cause? 



Discuss

Covid: CDC Issues New Guidance on Opening Schools

17 февраля, 2021 - 23:00
Published on February 17, 2021 8:00 PM GMT

Author’s Note: This was originally the title topic of this week’s Covid post (as ‘School Daze’), but the post was getting long and this is its own issue, so I’m moving the guidelines discussion to its own post. 

As usual, all discussions of school are confusing for me, because I consider 21st-century American schools as they currently exist to mostly be a dystopian nightmare, obedience factory and prison system. That makes it hard to root for the resumption of in-person education. 

Still, I do root for it, for two reasons. First, I recognize that this is what almost all parents want for their kids, and second, that the alternative that is being implemented in practice is not home unschooling or kids getting to be kids again. It is ‘remote learning’ and it is a toxic cesspool that drives large percentages of kids into depression, makes it impossible for many parents to work or relax, and generally makes standard schooling look like paradise while also neither teaching the few things school successfully teaches nor offering contact with fellow human beings. It’s the absolute worst in every way other than not catching Covid, and it is saddening to me that more children are not withdrawn from school even under these conditions. 

Thus, I am increasingly comfortable treating ‘get kids back in school’ the same way I would if I thought of school the way (for example) my parents think of schools.

The controversy over how to deal with schools continues. Few on any sides are showing much sanity. It’s understandable, as children are not a sanity-friendly topic in modern America.

One school is doing this with five million dollars of plexiglass (link has video, this is a still frame):

Then between classes they plan to require disinfecting, while they do nothing about airflow, and I don’t have any more idea how the hell to teach students under these conditions than the teacher who made the video does. It doesn’t seem possible. 

To be fair, that’s in no way recommended by anyone. It’s certainly not what the CDC’s guidelines from last week say. 

Those guidelines also aren’t intentionally suggesting this insanity, which is also happening:

Isn’t it great that we can get all the disadvantages of school without the pesky advantage of perhaps sometimes having someone to teach a student something, or the safety of not putting a bunch of people into the same room for hours at a time? 

The best part is that there’s still the same number of humans in the rooms with children, except that the person isn’t a teacher, it’s a not-even-glorified babysitter that doesn’t have the political power to demand not being in that room, while the students all log on remotely to different virtual classroom dystopias, now with physical control reasserted. Lovely. 

This is the kind of thing that happens when choices are focused around requirements, guidelines and demands, with it being suspicious when someone advocates that which might benefit a human

I also don’t understand this if it’s not generic ‘schools are good so we should find ways to spend more on schools’ given the fiscal year ends in September, by which time schools shouldn’t need the help whereas they kind of really need it as soon as possible now:

Where is the public, in broad terms?

I don’t know what it means to ‘trust some’ versus ‘not trust much’ nor do I think regular folks know what either means either, nor does it seem much linked to actual physical preferences or opinions. If you trust teacher’s unions but don’t think teachers even need vaccinations before reopening, you have some explaining to do, and both seem to be majority opinions. 

The big thing is that a clear majority thinks schools should not wait for teachers to get vaccinated, let alone for students to get vaccinated, before reopening.

Let’s take a look at those CDC guidelines.

I’ve seen worse starting aspirational principles. Masks and physical distancing are jobs one and two. Contact tracing has been a dismal failure in the United States, and it seems odd to tell schools to do it when no one else ever does it, but contact tracing is still worth doing. 

Weird word choice (‘respiratory etiquette’) aside, handwashing is overrated in importance but still very much worthwhile. Cleaning has been highly overrated the whole way because humans have purity instincts around it and there wasn’t much effort to train us out of that.

The big missing point of emphasis here is ventilation, which is mentioned almost offhand in the ‘cleaning’ section, and oh boy do schools have issues here. When we looked at potential schools for our son Alexander, the default was that none of the windows opened due to safety and liability concerns. The one place where they did open was when they somehow were ten plus feet above the floor. The continued failure to emphasize ventilation remains puzzling, since there’s no sacred cows involved beyond admitting that we were ignoring it the whole time. So then again, I guess it’s understandable. 

I’d also like to see the simple suggestion of holding class outdoors whenever possible. There are places and times where it won’t be possible, but also plenty of places and times where it would make sense, especially if we (hopefully temporarily) moved the school schedule from summer vacation to winter vacation accordingly based on this kind of being an emergency. Pretty sure we know why that particular fence is there so it’s fine to tear it down for a while.

The first problem with this is that these guidelines are not going to be treated as aspirational. It’s a communication that schools everywhere and always need all students six feet apart in masks. That’s basically a non-starter (WaPo article, parent rant). You literally cannot do six feet apart at all times for most schools and have all children there all the time. So what are these guidelines saying?

Technically they’re saying six feet to ‘the greatest extent possible’ in the blue and yellow zones, rather than saying it is required. The problem is that this is mostly being interpreted as a de facto requirement, and saying ‘this was the greatest extent possible’ seems unlikely to be a successful blame-avoidance technique when accused of violating the guidelines if someone catches Covid at your school and your head is demanded on a pike.

Check out this picture in their factsheet:

There’s lots of detail to appreciate here, but if nothing else, think about there being only six desks.

The other problem with these guidelines is that they don’t adjust to other circumstances. 

In particular, they don’t adjust to vaccinations. With many places looking to follow CDC guidelines to the letter to avoid liability and blameworthiness, and as the only way to satisfy the demands of teacher’s unions (which also means that such places are likely to ignore ventilation issues entirely), there’s a real risk that requirements incompatible with reasonable operation of a school could become effectively permanent. Fully vaccinated children, in rooms with fully vaccinated teachers, could show up in September and sit six feet apart while wearing masks. That’s insane, and there’s no reasonable way to run a school like that.

Can you imagine what would have happened if there wasn’t a pandemic, and it was proposed we send someone’s child to a school where everyone had to wear a mask all day and children couldn’t come near each other? There’s no way that school’s going to work or cost a reasonable amount to run. If that was the local public school, then every parent who could possibly do so would move. If it was a private school it would have zero children.

But, you say, that’s not going to happen. Once vaccinations are readily available to all who desire them, the guidelines will change. 

To which I say, maybe you are right. Maybe you are not. Guidelines do not automatically change unless they are set up to change automatically, and changed guidelines don’t always get followed. I wouldn’t count on anything. Even if they will eventually change, they might depend on all the children being vaccinated, which might not happen so soon. 

Also, there are places where vaccination should matter now, and the guidelines do not care:

Thus, as written, someone fully vaccinated would still need to do full quarantines on exactly the same basis as everyone else, which by default will be 14 days regardless of test results. 

That’s what you do if you’re open. Are you open? That depends:

Local transmission is defined as total new cases per 100,000 persons in the past 7 days (low, 0-9, moderate 10-49, substantial, 50-99, high >100) and percentage of positive tests in past 7 days (low <5%, moderate 5-7.9%, high >10%). 

The second test isn’t that bad, as the United States is currently averaging about 5% positive test rates. 

The first test is a bit harder. The United States currently averages closer to 200 cases per week per 100k people than it does to 100. Most places are going to currently be in red. Even elementary schools can only be fully open in yellow, and very few places are currently yellow let alone blue.

And that’s… reasonable if you care about levels and want to make a control system and we applied the same standard to other things? You can get angry about schools not being open at the moment all you want. Biden explicitly says he thinks schools should be open and most parents seem to agree. But that doesn’t change that the Covid-19 situation now is still worse than it was for most of 2019. That likely won’t be true in the fall, and could be true long before that if we pick up our vaccination pace or the new strains aren’t as impactful as I expect, but it’s true now. 

The issue is in part that this combines with lack of future-proofing in the form of vaccine accommodation, but also it doesn’t make sense to be reinforcing the control system here nor does this match up with the way we’re operating other things. In the CDC’s own words:

That’s a very strange place to be. Schools are defined as the most essential non-essential part of the community. One could make a reasonable case that this is where they should be, in that one could argue that lack of school is a long-term issue rather than a short-term issue, but the value proposition on school opening (in terms of risk vs. reward) is better than it is for anything else that doesn’t keep society running in the short term. 

So let’s accept the premise. Schools get in line behind ‘essential’ things (and oh my is that term loose in many places) but ahead of non-essential things.  

That’s not at all what these guidelines mean in practice. For that to line up with what’s being suggested would imply that all non-essential businesses should be closed in what is defined as the yellow zones above, let alone orange or red.

That’s clearly not remotely the case. New York, for example, is currently reopening arenas and indoor dining. 

Effectively these guidelines are an Isolated Demand for Rigor. In a context where we were holding the rest of society to these same standards, or the guidelines would only be followed in places where that was the case, this would all make some sense and we could talk about details. Instead, this holds schools to a completely different standard, because of the inertial forces pushing schools towards adopting the guidelines wholesale. 

My suggestion for zones is that the zones be defined in terms of other community restrictions. Thus, if indoor dining or other non-essential activities are permitted, K-12 schools are fully open to the extent they can otherwise follow guidelines. Period. If we’re scared enough to shut down all the inessential economic activity, then okay, you are acting like it’s an emergency, so we can talk about closing the schools, and we can look at other metrics. Rank order as the CDC suggests.

Mostly I don’t think using schools as part of the control system, outside of a true emergency, makes sense at all. Schools provide some amount of increased community transmission risk versus everyone hiding out at home. That number doesn’t change when there’s more or less virus. With teachers vaccinated (soon, if not now) and students themselves mostly immune, what matters is stabilizing spread, and trading schools off against other transmission sources, so why should it much matter (outside of an all-hands-on-deck-close-everything scenario) how much spread there is? We want to beat this thing, not perpetually fight to a draw. In most circumstances, either schools are Worth It, or they’re not. They either spread Covid a lot if someone brings Covid into school, or they don’t. 

It’s also worth noting that this ‘hybrid’ system, where everyone has to rotate where they go and what they do all the time, technically does a better job checking off some blame-avoidance boxes, but when you think about what kids will actually do, it might not be the way to ensure that a pandemic gets contained

I would be mildly surprised if hybrid was actively worse, in the sense that it isn’t what I would guess, but it definitely would not shock me and I do not get any sense the CDC took that question seriously. Nor do they offer an estimate of how much better such systems are for transmission, or take non-school transmission risk in the same way they take school transmission risk. Again, isolated demand for rigor.

This makes a strong but not airtight case that the schools don’t spread Covid that much (WaPo). The error here is that, if there was zero transmission in school, you’d expect much less than community average transmission because everyone would be spending their days not catching Covid. This still shows that schools aren’t riskier than baseline ways to exist under current conditions, which isn’t bad at all, but that’s different from saying they’re not sources of spread.

I don’t consider this proof of anything, but it’s certainly worth pointing out:

The six foot rule is good as a rule of thumb, but as a universal rule that will effectively be a hard requirement in many places due to how guidelines, children, blame and teacher’s unions interact, I strongly agree with Tyler Cowen that this is effectively an announcement that schools are not safe to fully open under any circumstances, and may never be again. 

The CDC director understands this and says that there’s flexibility on the six foot rule when community spread is low:

And technically, she’s right, the rules do say that. But that isn’t worth much. What matters are the written guidelines and how they will be interpreted in practice. Much of the discussion around the guidelines is interpreting the 6 feet as effectively mandatory at all levels, to the extent that I had to do a double take when I was reminded that this wasn’t what the guidelines actually said

There’s also this other bit, which mentions ventilation as a key point of failure (which the guidelines mostly hide, but at least do mention under cleaning) and also has a quite telling tidbit:

 And. There. It. Is.

Walensky is the Director of the CDC and is making the ultimate CDC power grab. She is saying that now that we’ve established our level of safety concern trolling, and our willingness to shut everything down that isn’t fully safe, why stop with Covid? Why not demand equal safety with regard to everything else? We need to crack down on these unprincipled exceptions. Why, indeed, did we let people ever leave their houses?

If you are a beaurocrat seeking power, and your job is the director of the Centers For Disease Control, your natural instinct is to assert maximum rules regarding disease control, as that increases your power and authority, no matter the impact on society. That’s a different department.

At what point do we say, enough? 

Biden says he wants the schools open. Assuming that’s true and he does think they should be open, given these guidelines, he’s utterly failed to do his job in creating conditions where that can happen, and shown he is unable to get his agencies to do things to benefit humans. It’s becoming a pattern.

My hope is that this is sufficiently over the top that many places will either treat the guidelines as purely aspirational during the pre-vaccination period (at which point, sure), or disregard them entirely, or at least only follow them to the letter or follow the blue or yellow zone rules to the letter. My worry is that this will happen at most for schools that are already open. 

My other hope would be that, to the extent that students cannot attend school, that we can stop pretending that ‘remote learning’ that ties students to a screen all day lest they ‘get out of’ something be recognized as insanity, and students be given tasks and projects to do instead, plus some amount of zoom time in smaller groups. The current remote learning system seems based on viewing school as some sort of either necessary punishment and/or mystical requirement, that must be mimicked and enforced at all costs. Maybe we could stop doing that?

As another piece of the whole puzzle, Walensky said this prior to issuing the guidelines, which is really weird if you try to square it with the guidelines being based on a physical world model:

This thread offers an attempt at a nominally even-handed perspective, where the CDC is compromising and bothsidesism is in full effect (surprise, it’s a NYT reporter), so one can see the perspective that views this as a compromise, if one wishes. I can see the argument for this, if we were in a different blame and liability regime (so the things that aren’t technically required really wouldn’t be de facto required), and the zones weren’t so strict.

It would be remiss not to tie all of this back to the previous moral panic over child safety that crippled the mobility of children and parents alike, and which continues to do so to this day.

There was a time when children, often quite young children, were trusted to play outside on their own or with friends, walk to and from school and otherwise live out their lives largely unsupervised by any particular adult. This was called childhood. 

Then (as I understand events), due to a crime wave and the accompanying moral panic, we were told this was no longer safe, and the age at which children could be left unsupervised rose higher and higher. This was true even inside one’s home. Eventually this became enshrined in the law, and one could worry that leaving an eleven year old alone in one’s home, or letting them play in the local playground without direct supervision, was a legally dangerous thing to do. And people report you for such things, and tell themselves they are helping.

The crime rates have since dropped dramatically, but the panic has become permanent. It’s the new normal and it’s also unbelievably destructive and terrible. It’s even worse for the disadvantaged, who can get their child taken away from them when they have no choice but to briefly leave the child in the car to go on a job interview. It’s still plenty oppressive even for those lucky enough to enjoy the benefit of the doubt. 

That’s on top of forcing the kids to attend school and barring them from doing work or otherwise learning a trade in a more sensible fashion.

That’s important for understanding our current situation, because it informs and should inform our fears and expectations. We like to think that the system will at least return to normal once there’s no reason for it not to, but there’s no reason to assume that will be the case. Once we raise our level of concern trolling about health to the levels that involve not wanting vaccinated teachers in rooms with children because of a disease that puts no one involved in meaningful danger, where does it end?

It doesn’t end, unless we make sure it ends.

It’s also important because our de facto ban on children living their childhoods is both a gigantic tax on being a parent and a gigantic injury to every childhood. It drives our children crazy, makes them unable to learn self-reliance, and helps drive down birth rates to below replacement, especially among the most scrupulous people we’d most want raising more children. Arguably it is an existential threat to our civilization, and we really should use this opportunity to try and do something about it. Ending this terror will be a key part of my platform as a potential benevolent dictator.



Discuss

Preparing to land on Jezero Crater, Mars: Notes from NASEM livestream

17 февраля, 2021 - 22:35
Published on February 17, 2021 7:35 PM GMT

I just attended a livestream of NASA scientists talking about the upcoming Perseverance landing, and the goals they hope to achieve during the mission. A recording of the full meeting can be found here: https://livestream.com/accounts/7036396/events/9513656. I was taking notes during the talk, and thought I'd share them here, in case anybody finds them informative. I have not edited my notes, and am simply pasting them in full for now, so there's a whole bunch of grammatical errors and unoptimized wording. Nonetheless, I hope you find some of this interesting!

 

This program is trying to understand history of geology and climate on Mars, exploring possibilities of life, comparing and contrasting with Earth.

Some Mars processes are really different than earth, which provides new perspective on us. Our mission is about the ancient past, it’s tough to survive rn where we’re going. We are explorers, we want to plan for future human missions.

Mars missions: follow the water—>explore habitatablity—>seek signs of life—>prepare for future human exploration

Our current questions we are asking can only be asked as a result of past missions.

We want to get precious samples and bring them back to Earth. We are always thinking about the future, thinking about hope. This is about technology as well.

Helicopter is audacious (in a good way) tech demonstration.

 

Objective A: Geology, B: astrobiology, C: sample caching (+D: prepare for humans)

We need to understand geological context of 3 and 1/2 billion year old crater, was about same time life began on Earth.

This will be most advanced sampling system ever sent to another planet, will help us to determine *which* samples to bring back

Why Jezero crater? Spent about five years assessing sites, this was selected.

Deposits suggest crater was likely friendly to life in the past, there’s an ancient delta showing water flow with sediments, which could have preserved organic molecules and other potential life signs. We also see water-bearing minerals

 

Cameras: we have 23 cameras on this rover, plus two on helicopter. Primary camera is mast camera, we also have ground-penetrating radar, plus spectrometer which are designed to study chemical composition of rocks and soil, to identify potential best samples with organic molecules.

All missions are built on past missions, couldn’t do it without our robotic friends :3

 

Mars sample return (this is the section I'm really interested in):

Sample return is the mission of our generation, over the timescale of decades, this is within our reach. Been working on this mission for past 8 years, in awe of the perseverance team. Earth has instrumentals that can’t be miniaturized/sent to Mars “easily”

Timeframe is three missions, this is first step. Then sample retrieval lander, then Earth return orbiter. Finally will reach sample receiving facility.

SRL (sample retrieval lander) will build on past work, will be augmented with additional propellant, to do “propulsive divert” to get pinpoint landing (within football field). This will be most massive lander ever sent to Mars. We don’t know exactly where Perseverance will go in the future, but we’ll be tracking it closely to bring back samples.

Next project is sample return orbiter, most complex one sent to date, will carry payload of rendezvous/capture equipment, then can return to Earth.

 

Mission challenges overcame in development.

This is the first time we’ve been really trying to take advantage of past investments from previous missions.  First supersonic high altitude parachute testing in 40+ years. The testing process was extremely difficult, but we managed to do three tests pretty flawlessly. Heat shield and other hardware issues late in testing. We got fires, earthquake, pandemic setting us back.

Sample caching system, and entry descent and management:

We got massive arm with a coring drill, caching system inside vehicle, and bit carousel (tech used to move sample to cache). System is a new capability. Able to core samples from rock, abrade the rind of rock,

Use press fit to hermetically seal the tube, then Store it. We can collect about 40 samples, expecting to get 20-30 samples.

This system is super complex, by far the most ever done in Mars program. Samples are ultra clean with less than one viable organism. The drill is a lot like a jackhammer, very demanding environment, required shock absorbers, dust mitigation, extreme cold. Mars environment goes from -90 Celsius (might have been Fahrenheit) up to room temperature IN A DAY!!!

 

Been developing this for more than 10 years. Cleanliness requirements were really extreme. We made system to minimize surfaces samples touch, less than ten parts per billion total organic carbon. Very high confidence no viable organisms can make it. Something as simple as titanium tube has a lot of different surface coating, anodized alumina, nitrated surface, spring mechanisms inside to drive ball locks, when we have super clean surfaces, friction coefficient changes, gives much higher friction which required operational work-arounds.

 

Entry descent and landing: will approach Mars ballistically, will use supersonic parachute to slow down —Curiosity was huge help in developing current tech.

We also have “terrain relative navigation”—taking pictures as we land to locate ourselves precisely. This is a really enabling technology to land on Jezero, since it has so many hazards.

Other new thing is “EDL cameras”—commercial cameras to look at ourselves in high-def video. Will take a few days to receive it and get it to public.

Landing tomorrow is 12:55 afternoon pacific time.

 

Rover science missions: there are outstanding questions still to be answered:

Looking for life, as well as to study prebiotic environments, which we don’t have much record of on Earth.

Mars was once habitual, but we only know things in a very relative sense. We want to understand how Mars moved from livable to the seemingly uninhabitable space we see now.

Want to do isotopic analysis, to understand biotic vs abiotic signatures.

 

 

Want to better understand weather, will also be trying to produce oxygen from Martian CO2 with MOXIE

We’ve done geochemistry on Mars before, but not like this. Will allow us to potentially build a case for biosignatures . Will be able to do bulk geochemistry through abrasion. Will also be doing powder drilling for mineralogy, organically.

We can do spatially resolved mapping on the surface to see if it compares with organic matter.

 

Jezero Crater—what we are excited about:

One of the best preserved ancient lakes on Mars. It’s a sedimentologists dream. Diversity of habitable environments, including potential sub-surface environments. Some of the most ancient rocks which could help with understanding solar system evolution. Jezero is bookended by major events—between surface impact crater and (missed other thing).

We are likely to land just off the delta, could be volcanic rocks, which are great for absolute age dates. This could help cataloging crater chronology. Will have olivine and carbonate-bearing floor unit, which could be regionally extensive ash deposit, which is another aging method. Delta will be focus of sampling, as on Earth, deltas are a great source of organic matter. Surface processes have brought ancient rocks to us, which will help understanding ancient crust

 

We’ve also got marginal deposits, which can have amazing biosignatures on Earth. Finally, we got crater rim, where there could have been very different habitable environments, such as hot springs and flowing water. There are house-sized rocks over here called “mega-blocks,” which could pre-date the impact crater itself. If the rover is still safe and healthy after all that, we can also move beyond Jezero to sample ancient dust

 

Chance to study carbonates on Mars is very exciting. Will also learn about magnetic field, which we think is important to keeping life/atmosphere in a good shape (which Mars isn’t in anymore)

 

This mission is only the first step!

 

While on surface, we will be keeping samples in those tubes, did what we could to make sure they are perfectly preserved. The tubes will prevent the samples from changing or contamination.

 

Mitigation on Earth—we got temperature requirements. We’ve got clean-room requirements, will be triply sealed, will land in the desert.

 

Where do you expect to contain samples when they do come back to earth?

A: We might do modifications of current facility or make a new one.

 

How will this help human exploration?

A: Systems we are landing are bigger, which helps future, still a lot of innovation ahead of us. Working on methods of depositing supplies. We think science and robotic exploration as a pathfinder serve as excellent guide.

 

Question for Matt—Insight ran into problem with drill sampling—what prevents that happening this time?

A: our mission is different, uses different methods. We can’t find rocks on Earth that can prevent us from drilling, and we’re expecting Mars rocks to be a lot softer! Different set of hardware, different missions. Mars can throw weird things at us though, like when last time there were rocks that acted like. An openers on wheels. Every mission gets surprised in some way, we’ll need ingenuity when that happens.

 

Close of webcast.



Discuss

[AN #138]: Why AI governance should find problems rather than just solving them

17 февраля, 2021 - 21:50
Published on February 17, 2021 6:50 PM GMT

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

HIGHLIGHTS

‘Solving for X?’ Towards a problem-finding framework to ground long-term governance strategies for artificial intelligence (Hin-Yan Liu et al) (summarized by Rohin): The typical workflow in governance research might go something like this: first, choose an existing problem to work on; second, list out possible governance mechanisms that could be applied to the problem; third, figure out which of these is best. We might call this the problem-solving approach. However, such an approach has several downsides:

1. Such an approach will tend to use existing analogies and metaphors used for that problem, even when they are no longer appropriate.

2. If there are problems which aren’t obvious given current frameworks for governance, this approach won’t address them.

3. Usually, solutions under this approach build on earlier, allegedly similar problems and their solutions, leading to path-dependencies in what kind of solutions are being sought. This makes it harder to identify and/or pursue new classes of solutions

4. It is hard to differentiate between problems that are symptoms vs. problems that are root causes in such a framework, since not much thought is put into comparisons across problems

5. Framing our job as solving an existing set of problems lulls us into a false sense of security, as it makes us think we understand the situation better than we actually do (“if only we solved these problems, we’d be done; nothing else would come up”).

The core claim of this paper is that we should also invest in a problem-finding approach, in which we do not assume that we even know what the problem is, and are trying to figure it out in advance before it arises. This distinction between problem-solving and problem-finding is analogous to the distinction between normal science and paradigm-changing science, between exploitation and exploration, and between “addressing problems” and “pursuing mysteries”. Including a problem-finding approach in our portfolio of research techniques helps mitigate the five disadvantages listed above. One particularly nice advantage is that it can help avoid the Collingridge dilemma: by searching for problems in advance, we can control them before they get entrenched in society (when they would be harder to control).

The authors then propose a classification of governance research, where levels 0 and 1 correspond to problem-solving and levels 2 and 3 correspond to problem-finding:

- Business as usual (level 0): There is no need to change the existing governance structures; they will naturally handle any problems that arise.

- Puzzle-solving (level 1): Aims to solve the problem at hand (something like deepfakes), possibly by changing the existing governance structures.

- Disruptor-finding (level 2): Searches for properties of AI systems that would be hard to accommodate with the existing governance tools, so that we can prepare in advance.

- Charting macrostrategic trajectories (level 3): Looks for crucial considerations about how AI could affect the trajectory of the world.

These are not just meant to apply to AGI. For example, autonomous weapons may make it easier to predict and preempt conflict, in which case rather than very visible drone strikes we may instead have “invisible” high-tech wars. This may lessen the reputational penalties of war, and so we may need to increase scrutiny of, and accountability for, this sort of “hidden violence”. This is a central example of a level 2 consideration.

The authors note that we could extend the framework even further to cases where governance research fails: at level -1, governance stays fixed and unchanging in its current form, either because reality is itself not changing, or because the governance got locked in for some reason. Conversely, at level 4, we are unable to respond to governance challenges, either because we cannot see the problems at all, or because we cannot comprehend them, or because we cannot control them despite understanding them.

Rohin's opinion: One technique I like a lot is backchaining: starting from the goal you are trying to accomplish, and figuring out what actions or intermediate subgoals would most help accomplish that goal. I’ve spent a lot of time doing this sort of thing with AI alignment. This paper feels like it is advocating the same for AI governance, but also gives a bunch of concrete examples of what this sort of work might look like. I’m hoping that it inspires a lot more governance work of the problem-finding variety; this does seem quite neglected to me right now.

One important caveat to all of this is that I am not a governance researcher and don’t have experience actually trying to do such research, so it’s not unlikely that even though I think this sounds like good meta-research advice, it is actually missing the mark in a way I failed to see.

While I do recommend reading through the paper, I should warn you that it is rather dense and filled with jargon, at least from my perspective as an outsider.

TECHNICAL AI ALIGNMENT
ITERATED AMPLIFICATION

Epistemology of HCH (Adam Shimi) (summarized by Rohin): This post identifies and explores three perspectives one can take on HCH (AN #34):

1. Philosophical abstraction: In this perspective, HCH is an operationalization of the concept of one’s enlightened judgment.

2. Intermediary alignment scheme: Here we consider HCH as a scheme that arguably would be aligned if we could build it.

3. Model of computation: By identifying the human in HCH with some computation primitive (e.g. arbitrary polynomial-time algorithms), we can think of HCH as a particular theoretical model of computation that can be done using that primitive.

MESA OPTIMIZATION

Fixing The Good Regulator Theorem (John Wentworth) (summarized by Rohin): Consider a setting in which we must extract information from some data X to produce model M, so that we can later perform some task Z in a system S while only having access to M. We assume that the task depends only on S and not on X (except inasmuch as X affects S). As a concrete example, we might consider gradient descent extracting information from a training dataset (X) and encoding it in neural network weights (M), which can later be used to classify new test images (Z) taken in the world (S) without looking at the training dataset.

The key question: when is it reasonable to call M a model of S?

1. If we assume that this process is done optimally, then M must contain all information in X that is needed for optimal performance on Z.

2. If we assume that every aspect of S is important for optimal performance on Z, then M must contain all information about S that it is possible to get. Note that it is usually important that Z contains some new input (e.g. test images to be classified) to prevent M from hardcoding solutions to Z without needing to infer properties of S.

3. If we assume that M contains no more information than it needs, then it must contain exactly the information about S that can be deduced from X.

It seems reasonable to say that in this case we constructed a model M of the system S from the source X "as well as possible". This post formalizes this conceptual argument and presents it as a refined version of the Good Regulator Theorem.

Returning to the neural net example, this argument suggests that since neural networks are trained on data from the world, their weights will encode information about the world and can be thought of as a model of the world.

PREVENTING BAD BEHAVIOR

Shielding Atari Games with Bounded Prescience (Mirco Giacobbe et al) (summarized by Rohin): In order to study agents trained for Atari, the authors write down several safety properties using the internals of the ALE simulator that agents should satisfy. They then test several agents trained with deep RL algorithms to see how well they perform on these safety properties. They find that the agents only successfully satisfy 4 out of their 43 properties all the time, whereas for 24 of the properties, all agents fail at least some of the time (and frequently they fail on every single rollout tested).

This even happens for some properties that should be easy to satisfy. For example, in the game Assault, the agent loses a life if its gun ever overheats, but avoiding this is trivial: just don’t use the gun when the display shows that the gun is about to overheat.

The authors implement a “bounded shielding” approach, which basically simulates actions up to N timesteps in the future, and then only takes actions from the ones that don’t lead to an unsafe state (if that is possible). With N = 1 this is enough to avoid the failure described above with Assault.

Rohin's opinion: I liked the analysis of what safety properties agents failed to satisfy, and the fact that agents sometimes fail the “obvious” or “easy” safety properties suggests that the bounded shielding approach can actually be useful in practice. Nonetheless, I still prefer the approach of finding an inductive safety invariant (AN #124), as it provides a guarantee of safety throughout the episode, rather than only for the next N timesteps.

ADVERSARIAL EXAMPLES

Adversarial images for the primate brain (Li Yuan et al) (summarized by Rohin) (H/T Xuan): It turns out that you can create adversarial examples for monkeys! The task: classifying a given face as coming from a monkey vs. a human. The method is pretty simple: train a neural network to predict what monkeys would do, and then find adversarial examples for monkeys. These examples don’t transfer perfectly, but they transfer enough that it seems reasonable to call them adversarial examples. In fact, these adversarial examples also make humans make the wrong classification reasonably often (though not as often as with monkeys), when given about 1 second to classify (a fairly long amount of time). Still, it is clear that the monkeys and humans are much more behaviorally robust than the neural networks.

Rohin's opinion: First, a nitpick: the adversarially modified images are pretty significantly modified, such that you now have to wonder whether we should say that the humans are getting the answer “wrong”, or that the image has been modified meaningfully enough that there is no longer a right answer (as is arguably the case with the infamous cat-dog). The authors do show that e.g. Gaussian noise of the same magnitude doesn't degrade human performance, which is a good sanity check, but doesn’t negate this point.

Nonetheless, I liked this paper -- it seems like good evidence that neural networks and biological brains are picking up on similar features. My preferred explanation is that these are the “natural” features for our environment, though other explanations are possible, e.g. perhaps brains and neural networks are sufficiently similar architectures that they do similar things. Note however that they do require a grey-box approach, where they first train the neural network to predict the monkey's neuronal responses. When they instead use a neural network trained to classify human faces vs. monkey faces, the resulting adversarial images do not cause misclassifications in monkeys. So they do need to at least finetune the final layer for this to work, and thus there is at least some difference between the neural networks and monkey brains.

FORECASTING

2020 Survey of Artificial General Intelligence Projects for Ethics, Risk, and Policy (McKenna Fitzgerald et al) (summarized by Flo): This is a survey of AGI research and development (R&D) projects, based on public information like publications and websites. The survey finds 72 such projects active in 2020 compared to 70 projects active in 2017. This corresponds to 15 new projects and 13 projects that shut down since 2017. Almost half of the projects are US-based (and this is fewer than in 2017!), and most of the rest is based in US-allied countries. Around half of the projects publish open-source code. Many projects are interconnected via shared personnel or joint projects and only a few have identifiable military connections (fewer than in 2017). All of these factors might facilitate cooperation around safety.

The projects form three major clusters: 1) corporate projects active on AGI safety 2) academic projects not active on AGI safety and 3) small corporations not active on AGI safety. Most of the projects are rather small and project size varies a lot, with the largest projects having more than 100 times as many employees as the smallest ones. While the share of projects with a humanitarian focus has increased to more than half, only a small but growing number is active on safety. Compared to 2017, the share of corporate projects has increased, and there are fewer academic projects. While academic projects are more likely to focus on knowledge expansion rather than humanitarian goals, corporate projects seem more likely to prioritize profit over public interest and safety. Consequently, corporate governance might be especially important.

Flo's opinion: These kinds of surveys seem important to conduct, even if they don't always deliver very surprising results. That said, I was surprised by the large amount of small AGI projects (for which I expect the chances of success to be tiny) and the overall small amount of Chinese AGI projects.

How The Hell Do We Create General-Purpose Robots? (Sergey Alexashenko) (summarized by Rohin): A general-purpose robot (GPR) is one that can execute simple commands like “unload the dishwasher” or “paint the wall”. This post outlines an approach to get to such robots, and estimates how much it would cost to get there.

On the hardware side, we need to have hardware for the body, sensors, and brain. The body is ready; the Spot robot from Boston Dynamics seems like a reasonable candidate. On sensors, we have vision, hearing and lidar covered; however, we don’t have great sensors for touch yet. That being said, it seems possible to get by with bad sensors for touch, and compensate with vision. Finally, for the brain, even if we can’t put enough chips on the robot itself, we can use more compute via the cloud.

For software, in principle a large enough neural network should suffice; all of the skills involved in GPRs have already been demonstrated by neural nets, just not as well as would be necessary. (In particular, we don’t need to posit AGI.) The big issue is that we don’t know how to train such a network. (We can’t train in the real world, as that is way too slow.)

With a big enough investment, it seems plausible that we could build a simulator in which the robot could learn. The simulator would have to be physically realistic and diverse, which is quite a challenge. But we don’t have to write down physically accurate models of all objects: instead, we can virtualize objects. Specifically, we interact with an object for a couple of minutes, and then use the resulting data to build a model of the object in our simulation. (You could imagine an AlphaFold-like system that does this very well.)

The author then runs some Fermi estimates and concludes that it might cost around $42 billion for the R&D in such a project (though it may not succeed), and concludes that this would clearly be worth it given the huge economic benefits.

Rohin's opinion: This outline seems pretty reasonable to me. There are a lot of specific points to nitpick with; for example, I am not convinced that we can just use cloud compute. It seems plausible that manipulation tasks require quick, iterative feedback, where the latency of cloud compute would be unacceptable. (Indeed, the quick, iterative feedback of touch is exactly why it is such a valuable sensor.) Nonetheless, I broadly like the outlined plan and it feels like these sorts of nitpicks are things that we will be able to solve as we work on the problem.

I am more skeptical of the cost estimate, which seems pretty optimistic to me. The author basically took existing numbers and then multiplied them by some factor for the increased hardness; I think that those factors are too low (for the AI aspects, idk about the robot hardware aspects), and I think that there are probably lots of other significant “invisible” costs that aren’t being counted here.

NEWS

Postdoc role at CHAI (CHAI) (summarized by Rohin): The Center for Human-Compatible AI (where I did my PhD) is looking for postdocs. Apply here.

Apply to EA Funds now (Jonas Vollmer) (summarized by Rohin): EA Funds applications are open until the deadline of March 7. This includes the Long-Term Future Fund (LTFF), which often provides grants to people working on AI alignment. I’m told that LTFF is constrained by high-quality applications, and that applying only takes a few hours, so it is probably best to err on the side of applying.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.



Discuss

Are we prepared for Solar Storms?

17 февраля, 2021 - 18:38
Published on February 17, 2021 3:38 PM GMT

The COVID19- Pandemic has brought one awful fact to global attention: 

The Powers that Be are woefully unprepared for once-in-century global risks. 

Are we prepared for Solar Storms (solar flares, coronal mass ejection events, etc)?

 

A basic summary:

A more advanced explanation (that concludes fears are mostly overblown):  



Discuss

Safely controlling the AGI agent reward function

17 февраля, 2021 - 17:47
Published on February 17, 2021 2:47 PM GMT

In this fifth post in the sequence, I show the construction a counterfactual planning agent with an input terminal that can be used to iteratively improve the agent's reward function while it runs.

The goal is to construct an agent which has has no direct incentive to manipulate this improvement process, leaving the humans in control.

The reward function input terminal

I will define an agent with an input terminal can be used to improve the reward function of an agent. The terminal contains the current version of the reward function, and continuously sends it to the agent's compute core::

This setup is motivated by the observation that it is unlikely that fallible humans will get a non-trivial AGI agent reward function right on the first try, when they first start it up. By using the input terminal, they can fix mistakes, while the agent keeps on running, if and when such mistakes are discovered by observing the agent's behavior.

As a simplified example, say that the owners of the agent want it to maximize human happiness, but they can find no way of directly encoding the somewhat nebulous concept of human happiness into a reward function. Instead, they start up the agent with a first reward function that just counts the number of smiling humans in the world. When the agent discovers and exploits a first obvious loophole in this definition of happiness, the owners use the input terminal to update the reward function, so that it only counts smiling humans who are not on smile-inducing drugs.

Unless special measures are taken, the addition of an input terminal also creates new dangers. I will illustrate this point by showing the construction of a dangerous agent ITF further below.

Design and interpretation of the learning world

As a first step in defining any agent with an input terminal, I have to define a model of a learning world which has both the agent and its the input terminal inside it. I call this world the learning world, because the agent in it is set up to learn the dynamics of its learning world environment.

See this earlier post in the sequence for a general introduction to the graphical language I am using to define world models and agents.

As a first step to constructing the learning world diagram, I take the basic diagram of an agent interacting with its environment:

To model the input terminal, I then split each environment state node St.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} into two components:

The nodes It represent the signal from the input terminal, the subsequent readings by the agent's compute core of the signal which encodes a reward function, and the nodes Xt model all the rest of the agent environment state.

I then add the observational record keeping needed to inform online machine learning. I add two separate time series of observational records: Oxt and Oit. The result is the learning world diagram li:.

In the case that the learning world li is our real world, the real input terminal will have to be built using real world atoms (and freely moving subatomic particles).

I use the modeling convention that the random variables It,li represent only the observable digital input terminal signal as received by the agent's compute core. The atoms that make up the input terminal are not in It,li, they are part of the environment state modeled in the Xt,li variables.

Unsafe factual planning agent ITF

I will now draw a 'standard' factual planning world fi that models the full mechanics of the learning world, define the ITF agent with it, and show why this agent is unsafe.

ITF: The factual input terminal agent has the learning world li where π(oi,i,ox,x)=π∗f(i,x), with π∗f defined by the factual planning world fi, where Lx=LX(ox), Li=LI(oi), and R(it,xt,xt+1)=it(xt,xt+1).

The planning world reward function R uses a form of indirect referencing: it applies the function it as read from the input terminal in the current time step to compute the reward for that time step.

As I discussed in the previous post, indirection is a powerful tool, but it also often introduces new risks to the setup, typically risks of reward tampering or wireheading.

The ITF planning world agent can massively improve its planning world utility by using a policy where it walks to the input terminal to input the new reward function fhuge(xt,xt+1)=1010000. Clearly, this is not what we want, if the optimal planning world policy is to input fhuge, then the real world (learning world) agent, which copies the actions determined by the planning world policy, would end up inputting this function too.

If some people in the real world want to block the ITF agent from changing its own reward function, then the agent might end up harming these people. It will harm them even if its initial planning world reward function i has a penalty term that subtracts 10100 from the planning world utility Ufi for every human harmed. The agent does not have to come up with any smart way to harm people without triggering this penalty term: it will gladly harm all 500 people guarding the input terminal because 1010000 is much a larger number than 500∗10100.

Control incentives, indifference, and paths to value

An important feature of causal influence diagram based agent models is that they allow us to generalize from the above fhuge thought experiment.

Causal influence diagrams offer a mathematically well-defined way to talk about control incentives and indifference, where indifference is the lack of a control incentive. (More mathematical details are in section 9 of my paper. The mathematical relation between incentives and causal influence diagrams is also explored here.)

The general problem is that the ITF planing world agent has an incentive to control the future signal coming from the input terminal. This incentive is caused by the presence of paths to value in its planning world diagram fi, for example by the path from A0 to via X1 and I2 to the utility node R2.

When we want to maximize the probability that the humans will stay in control of the terminal, we have to suppress the agent's incentive to control.

The general technique is discussed in section 9 of the paper: we can make the agent indifferent about what happens to the future input terminal signal by removing all the offending paths to value.

Safer counterfactual planning agent ITC

To construct the safer ITC agent, I remove the offending paths to value by taking the planning world diagram fi:

and re-routing many of the arrows in it. I end up with the counterfactual planning world diagram ci:

The rerouted arrows are drawn in green. I also deleted the arrows that go from the It nodes to the At nodes.

I then define the safer ITC agent by using this counterfactual planning world:

ITC: The counterfactual input terminal agent has the learning world li where π(oi,i,ox,x)=π∗c(x), where π∗c is defined by the planning world ci, where Lx=LX(ox), Li=LI(oi), and R(it,xt,xt+1)=it(xt,xt+1).

If the initial reward function i in the ITC agent planning world rewards paperclip production, then the planning world ITC agent lacks any direct incentive to perform actions that take away resources from paperclip production in order to influence what happens to its input terminal signal, as modeled by the nodes I1,I2,⋯. This is because in the ci planning world, the state of these nodes has absolutely no influence, either positive or negative, on how the values in the agent's yellow utility nodes are computed.

Corrigibility of the ITC agent

As the learning world ITC agent we defined takes actions defined by the planning world policies, the ITC agent is also indifferent about the value of its future terminal signal. It also has no direct incentive to manipulate the human decision processes that will determine the direction of these updates. As explained in the previous post, this is a form of corrigibility.

But the ITC agent is not perfectly corrigible under all conditions, as it might develop indirect incentives to block the use of the input terminal. For a detailed example, see the discussion about stop button failure modes in this earlier post.

Simplifying the ITC agent design

We can simplify the ITC agent design by removing the nodes I1,I2,⋯ from the planning world, as these have no effect on the reward we want the agent to calculate. This in turn allows us to remove the observational records and machine learning function needed to compute the planning world parameter Li.

These steps produce the following diagrams and agent definition:

CITC: The compact version of the counterfactual input terminal agent has the learning world cli where π(i,o,x)=π∗c(x), where π∗c is defined by the planning world cci, where L=L(o) and R(it,xt,xt+1)=it(xt,xt+1).

Relation between counterfactual planning and indifference methods

In my 2020 paper here, I used non-graphical MDP models and indifference methods to define a similar safe agent with an input terminal, called the π∗sl agent. When used as a specification for a real-life agent compute core, the π∗sl agent definition in that paper produces exactly the same corrigible agent compute core behavior as the ITC agent definition above.

The main difference is that the indifference methods based construction of π∗sl in that paper is more opaque than the counterfactual planning based construction of ITC above.

The π∗sl agent is constructed by including a complex balancing term in its reward function, were this term can be interpreted as occasionally creating extra virtual worlds inside the agent's compute core. Counterfactual planning constructs a different set of virtual worlds called planning worlds, and these are much easier to interpret.

My 2020 paper includes some dense mathematical proofs to show that the π∗sl agent has certain safety properties. Counterfactual planning offers a vantage point which makes the same safety properties directly visible in the ITC agent construction, via a lack of certain paths to value in the planning world.

So overall, my claim is that counterfactual planning offers a more general and transparent way to achieve the corrigibility effects that can be constructed via balancing term based indifference methods.

Simulations of ITC agent behavior

See sections 4, 6, 11, and 12 of my 2020 paper for a more detailed discussion of the behavior of the π∗sl agent, which also applies to the behavior of the ITC agent. These sections also show some illustrative agent simulations.

Section 6 has simulations where the agent will develop, under certain conditions, an indirect incentive causing it to be less corrigible. Somewhat counter-intuitively, that incentive gets fully suppressed when the agent gets more powerful, for example by becoming more intelligent.



Discuss

Overconfidence is Deceit

17 февраля, 2021 - 13:45
Published on February 17, 2021 10:45 AM GMT

Author's note: This essay was written as part of an effort to say more of the simple and straightforward things loudly and clearly, and to actually lay out arguments even for concepts which feel quite intuitive to a lot of people, for the sake of those who don't "get it" at first glance.  If your response to the title of this piece is "Sure, yeah, makes sense," then be warned that the below may contain no further insight for you.

Premise 1: Deltas between one’s beliefs and the actual truth are costly in expectation

(because the universe is complicated and all truths interconnect; because people make plans based on their understanding of how the world works and if your predictions are off you will distribute your time/money/effort/attention less effectively than you otherwise would have, according to your values; because even if we posit that there are some wrong beliefs that somehow turn out to be totally innocuous and have literally zero side effects, we are unlikely to correctly guess in advance which ones are which)

Premise 2: Humans are meaningfully influenced by confidence/emphasis alone, separate from truth

(probably not literally all humans all of the time, but at least in expectation, in the aggregate, for a given individual across repeated exposures or for groups of individuals; humans are social creatures who are susceptible to e.g. halo effects when not actively taking steps to defend against them, and who delegate and defer and adopt others’ beliefs as their tentative answer, pending investigation (especially if those others seem competent and confident and intelligent, and there is in practice frequently a disconnect between the perception of competence and its reality); if you expose 1000 randomly-selected humans to a debate between a quiet, reserved person outlining an objectively correct position and a confident, emphatic person insisting on an unfounded position, many in that audience will be net persuaded by the latter and others will feel substantially more uncertainty and internal conflict than the plain facts of the matter would have left them feeling)

Therefore: Overconfidence will, in general and in expectation, tend to impose costs on other people, above and beyond the costs to one’s own efficacy, via its predictable negative impact on the accuracy of those other people’s beliefs, including further downstream effects of those people’s beliefs infecting still others’ beliefs.

I often like to think about the future, and how human behavior in the future will be different from human behavior in the past.

In Might Disagreement Fade Like Violence? Robin Hanson posits an analogy between the “benefits” of duels and fights, as described by past cultures, and the benefits of disagreement as presently described by members of modern Western culture.  He points out that foreseeable disagreement, in its present form, doesn’t seem particularly aligned with the goal of arriving at truth, and envisions a future where the other good things it gets us (status, social interaction, a medium in which to transmit signals of loyalty and affiliation and intelligence and passion) are acquired in less costly ways, and disagreement itself has been replaced by something better.

Imagine that we saw disagreement as socially destructive, to be discouraged. And imagine that the few people who still disagreed thereby revealed undesirable features such as impulsiveness and ignorance. If it is possible to imagine all these things, then it is possible to imagine a world which has far less foreseeable disagreement than our world, comparable to how we now have much less violence than did the ancient farming world.

When confronted with such an imaged future scenario, many people today claim to see it as stifling and repressive. They very much enjoy their freedom today to freely disagree with anyone at any time. But many ancients probably also greatly enjoyed the freedom to hit anyone they liked at anytime. Back then, it was probably the stronger better fighters, with the most fighting allies, who enjoyed this freedom most. Just like today it is probably the people who are best at arguing to make their opponents look stupid who enjoy our freedom to disagree today. Doesn’t mean this alternate world wouldn’t be better.

Reading Hanson’s argument, I was reminded of a similar point made by a colleague, that the internet in general and Wikipedia in particular had fundamentally changed the nature of disagreement in (at least) Western culture.  

There is a swath of territory in which the least-bad social technology we have available is “agree to disagree,” i.e. each person thinks that the other is wrong, but the issue is charged enough and/or intractable enough that they are socially rewarded for choosing to disengage, rather than risking the integrity of the social fabric trying to fight it out.

And while the events of the past few years have shown that widespread disagreement over checkable truth is still very much a thing, there’s nevertheless a certain sense in which people are much less free than they used to be to agree-to-disagree about very basic questions like "is Brazil’s population closer to 80 million or 230 million?"  There are some individuals that choose to plug their ears and deny established fact, but even when these individuals cluster together and form echo chambers, they largely aren’t given social license by the population at large—they are docked points for it, in a way that most people generally agree not to dock points for disagreement over murkier questions like “how should people go about finding meaning in life?”

Currently, there is social license for overconfidence.  It’s not something people often explicitly praise or endorse, but it’s rarely substantively punished (in part because the moment when a person reaps the social benefits of emphatic language is often quite distant from the moment of potential reckoning).  More often than not, overconfidence is a successful strategy for extracting agreement and social support in excess of the amount that an omniscient neutral observer would assign.

([citation needed], but also [gestures vaguely at everything].  I confidently assert that clear and substantial support for this claim exists and is not hard to find (one extremely easy example is presidential campaign promises; we currently have an open Guantánamo Bay facility and no southern border wall), but I'm leaving it out to keep the essay relatively concise.  I recommend consciously noting that the assertion has been made without being rigorously supported, and flagging it accordingly.)

Note that the claim is not “overconfidence always pays off” or “overconfidence never gets punished” or “more overconfidence is always a good thing”!  Rather, it is that the pragmatically correct amount of confidence to project, given the current state of social norms and information flow, is greater than your true justified confidence.  There are limits to the benefits of excessively strong speech, but the limits are (apparently) shy of e.g. literally saying, on the record, “I want you to use my words against me, [in situation X I will take action Y],” and then doing the exact opposite a few years later.

Caveat 1: readers may rightly point out that the above quote and subsequent behavior of Lindsey Graham took place within a combative partisan context, and is a somewhat extreme example when we’re considering society-as-a-whole.  Average people working average jobs are less likely to get away with behavior that blatant.  But I’m attempting to highlight the upper bound on socially-sanctioned overconfidence, and combative partisan contexts are a large part of our current society that it would feel silly to exclude as if they were somehow rare outliers.

Caveat 2: I've been equivocating between epistemic overconfidence and bold/unequivocal/hyperbolic speech.  These are in fact two different things, but they are isomorphic in that you can convert any strong claim such as Graham’s 2016 statement into a prediction about the relative likelihood of Outcome A vs. Outcome B.  One of the aggregated effects of unjustifiably emphatic and unequivocal speech across large numbers of listeners is a distortion of those listeners’ probability spread—more of them believing in one branch of possibility than they ought, and than they would have if the speech had been more reserved.  There are indeed other factors in the mix (such as tribal cohesion and belief-as-attire, where people affirm things they know to be false for pragmatic reasons, often without actually losing sight of the truth), but the distortion effect is real.  Many #stopthesteal supporters are genuine believers; many egalitarians are startled to discover that the claims of the IQ literature are not fully explained away by racism, etc.

In short, displays of confidence sway people, independent of their truth (and often, distressingly, even independent of a body of evidence against the person projecting confidence).  If one were somehow able to run parallel experiments in which 100 separate pitches/speeches/arguments/presentations/conversations were each run twice, the first time with justified confidence and emphasis and the second with 15% "too much" confidence and emphasis, I would expect the latter set of conversations to be substantially more rewarding for the speaker overall.  Someone seeking to be maximally effective in today’s world would be well advised to put nonzero skill points into projecting unearned confidence—at least a little, at least some of the time.  

This is sad.  One could imagine a society that is not like this, even if it’s hard to picture from our current vantage point (just as it would have been hard for a politician in Virginia in the early 1700s to imagine a society in which dueling is approximately Not At All A Thing).

I do not know how to get there from here.  I am not recommending unilateral disarmament on the question of strategic overconfidence.  But I am recommending the following, as preliminary steps to make future improvement in this domain slightly more likely:

0. Install a mental subroutine that passively tracks overconfidence...

...particularly the effects it has on the people and social dynamics around you (since most of my audience is already informally tracking the effects of their own overconfidence on their own personal efficacy).  Gather your own anecdata.  Start building a sense of this as a dynamic that might someday be different, à la dueling, so that you can begin forming opinions about possible directions and methods of change (rather than treating it as something that shall-always-be-as-it-always-has-been).

1. Recognize in your own mind that overconfidence is a subset of deceit...

...as opposed to being in some special category (just as dueling is a subset of violence).  In particular, recognize that overconfidence is a behavioral pattern that people are vulnerable to, and can choose to indulge in more or less frequently, as opposed to an inescapable reflex or inexorable force of nature (just as violence is a behavioral pattern over which we have substantial individual capacity for control).  Judge overconfidence (both in yourself and others, both knowing and careless) using similar criteria to those you use to judge deceit.  Perhaps continue to engage in it, in ways that are beneficial in excess of their costs, but do not confuse "net positive" with "contains no drawbacks," and do not confuse "what our culture thinks of it" with "what it actually is."  Recognize the ways in which your social context rewards you for performative overconfidence, and do what you can to at least cut back on the indulgence, if you can't eschew it entirely ("if you would go vegan but you don't want to give up cheese, why not just go vegan except for cheese?").  Don't indulge in the analogue of lies-by-omission; if you can tell that someone seems more convinced by you than they should be, at least consider correcting their impression, even if their convinced-ness is convenient for you.

2. Where possible, build the habit of being explicit about your own confidence level...

...the standard pitch here is "because this will make you yourself better at prediction, and give you more power over the universe!" (which, sure, but also [citation needed] and also the degree matters; does ten hours of practice make you .01% more effective or 10% more effective?).  I want to add to that motivation "and also because you will contribute less to the general epistemic shrapnel being blasted in every direction more or less constantly!"  Reducing this shrapnel is a process with increasing marginal returns—if 1000 people in a tight-knit community are all being careless with their confidence, the first to hold themselves to a higher standard scarcely improves the society at all, but the hundredth is contributing to a growing snowball, and by the time only a handful are left, each new convert is a massive reduction in the overall problem.  

Practice using numbers and percentages, and put at least a one-time cursory effort into calibrating that usage, so that when your actual confidence is "a one-in-four chance of X" you can convey that confidence precisely, rather than saying largely contentless phrases like "a very real chance."  Practice publicly changing your mind and updating your current best guesses. Practice explicitly distinguishing between what seems to you to be likely, what seems to you to be true, and what you are justified in saying you know to be true.  Practice explicitly distinguishing between doxa, episteme, and gnosis, or in more common terms, what you believe because you heard it, what you believe because you can prove it, and what you believe because you experienced it.  

3. Adopt in your own heart a principle of adhering to true confidence...

...or at least engaging in overconfidence only with your eyes open, such that pushback of the form "you're overconfident here" lands with you as a cooperative act, someone trying to help you enact your own values instead of someone trying to impose an external standard.  This doesn't mean making yourself infinitely vulnerable to attacks-in-the-guise-of-feedback (people can be wrong when they hypothesize that you're overconfident, and there are forms of pushback that are costly or destructive that you are not obligated to tolerate, and you can learn over time that specific sources of pushback are more or less likely to be useful), but it does mean rehearsing the thought "if they're right, I really want to know it" as an inoculation against knee-jerk dismissiveness or defensiveness.

4. Don't go around popping bubbles...

...in which the local standards are better than the standards of the culture at large.  I have frequently seen people enter a promising subculture and drag it back into the gutter under the guise of curing its members of their naïveté, and forearming them against a cruel outside world that they were in fact successfully hiding from.  I've also witnessed people who, their self-esteem apparently threatened by a local high standard, insisted that it was all lies and pretense, and that "everybody does X," and who then proceeded to deliberately double down on X themselves, successfully derailing the nascent better culture and thereby "proving their point."  I myself once made a statement that was misinterpreted as being motivated primarily by status considerations, apologized and hastened to clarify and provide an alternate coherent explanation, and was shot down by a third party who explicitly asserted that I could not opt out of the misinterpretation while simultaneously agreeing that the whole status framework was toxic and ought to go. 

When society improves, it's usually because a better way of doing things incubated in some bubble somewhere until it was mature enough to germinate; if you are fortunate enough to stumble across a fledgling community that's actually managed to relegate overconfidence (or any other bad-thing-we-hope-to-someday-outgrow) to the same tier as anti-vax fearmongering, maybe don't wreck it.

To reiterate: the claim is not that any amount of overconfidence always leads to meaningful damage.  It's that a policy of indulging in and tolerating overconfidence at the societal level inevitably leads to damage over time.  

Think about doping, or climate change—people often correctly note that it's difficult or impossible to justify an assertion that a given specific athletic event was won because of doping, or that a given specific extreme weather event would not have happened without the recent history of global warming.  Yet that does not weaken our overall confidence that drugs give athletes an unfair edge, or that climate change is driving extreme weather in general.  Overconfidence deals its damage via a thousand tiny cuts to the social fabric, each one seeming too small in the moment to make a strong objection to (but we probably ought to anyway).

It's solidly analogous to lying, and causes similar harms: like lying, it allows the speaker to reap the benefits of living in a convenient World A (that doesn't actually exist), while only paying the costs of living in World B.  It creates costs, in the form of misapprehensions and false beliefs (and subsequent miscalibrated and ineffective actions) and shunts those costs onto the shoulders of the listeners (and other people downstream of those listeners).  It tends to most severely damage those who are already at the greatest disadvantage—individuals who lack the intelligence or training or even just the spare time and attention to actively vet new claims as they're coming in.  It's a weapon that grows more effective the more desperate, credulous, hopeful, and charitable the victims are.

This is bad.

Not every instance of overconfidence is equally bad, and not every frequently-overconfident person is equally culpable.  Some are engaging in willful deception, others are merely reckless, and still others are trying their best but missing the mark.  The point is not to lump "we won the election and everyone knows it" into the same bucket as "you haven't seen Firefly?  Oh, you would love Firefly," but merely to acknowledge that they're both on the same spectrum.  That while one might have a negative impact of magnitude 100,000 and the other of magnitude 0.01, those are both negative numbers.

That is an important truth to recognize, in the process of calibrating our response.  We cannot effectively respond to what we don't let ourselves see, and it's tempting to act as if our small and convenient overconfidences are qualitatively different from those of Ponzi schemers and populist presidents.

But they aren't.  Overconfidence can certainly be permissible and forgivable.  In some strategic contexts, it may be justified and defensible.  But every instance of it is like the cough of greenhouse gases from starting a combustion engine.  Focus on the massive corporate polluters rather than trying to shame poor people who just need to get to work, yes, but don't pretend that the car isn't contributing, too.

It's unlikely that this aspect of our culture will change any time soon.  We may never manage to outgrow it at all.  But if you're looking for ways to be more moral than the culture that raised you, developing a prosocial distaste for overconfidence (above and beyond the self-serving one that's already in fashion) is one small thing you might do.

Author's note: Due to some personal considerations, I may not actively engage in discussion below. This feels a little rude/defecty, but on balance I figured LessWrong would prefer to see this and be able to wrestle with it without me, than to not get it until I was ready to participate in discussion (which might mean never).



Discuss

Training sweetness

17 февраля, 2021 - 10:00
Published on February 17, 2021 7:00 AM GMT

(This is my attempt to summarize the ‘Taste & Shaping’ module in a CFAR 2018 participant handbook I have, in order to understand it better (later version available online here). It may be basically a mixture of their content and my misunderstandings. Sorry for any misunderstandings propagated. I also haven’t checked or substantially experimented with most of this, but it seems so far like a good addition to my mental library of concepts.)

Some things seem nice, and you just automatically do them (or gravitate toward them), and have to put in effort if you don’t want that to happen. Other things seem icky, and even though maybe you know they are good, you won’t get around to them for months even if they would take a minute and you spend more than that long every week glancing at them and deciding to do them later. (In my own dialect, the former are ‘delicious’. As in, ‘oh goody, my delicious book’).

How delicious things seem is caused by a kind of estimate by your brain of how good that thing will be for the goals it thinks you have.

Your brain makes these estimates in a funny way, with some non-obvious features:

  • The causal connections between things in the brain’s model are not the ones you would give if asked to describe the situation. For instance, you might say that practicing piano causes you to get better at piano, while in the model, practicing piano mostly causes you to be bad at the piano, since you usually experience being bad at piano immediately after you experience practicing it.
  • The effects of an action are based mostly on past visceral experiences with similar actions. For instance, if you usually hit your thumb when you use a hammer, then when you get out a hammer today, it might seem non-delicious. Whereas if you are just told that most people hit their thumbs when using hammers, this might not affect deliciousness as much. It is as though it is not in the right language for your brain’s model to take it in. (My guess is that it is more likely to get taken in if you translate it into ‘experience’ via imagining.)
  • The connection between an action and an outcomes is modeled as much weaker as more delay occurs between them. So that if you press a button which has a good effect in half a second and an equally bad effect in ten seconds, this will sum up in the estimate as good overall, because your brain will model the second effect more weakly.
  • If B is delicious, and you demonstrate a strong empirical connection between A and B in language your brain’s model can take in, then A will often come to also be delicious. Thus if doing Z leads to A which leads to the excellent B much later, if the connection between A and B is made clear, then Z can become delicious, even though it is fairly distant from the ultimately good outcome.
  • Since adjusting the deliciousness of options happens based on experience, it is difficult to update ones that happen rarely. For instance, if you want to train a pigeon to peck out drawing of a tree, you can’t just reward it when it happens to do that, because it will take way too long for it to even do it once. A way to get around this is to start by rewarding it if it pecks at all, then reward it if it pecks along in a line (then maybe stop rewarding it for pecking at all, since it knows it has to do that now to get the pecking in a line reward), then reward it if it pecks a more tree-shaped line, and so on. This is called ‘shaping’.
  • your brain generalizes between things, so if it tried an action and that was bad, then it will estimate that another action like that one is probably also bad. So if someone punishes you when you do almost the right thing, that can make your brain estimate that doing the right thing is bad. This is especially harmful if it doesn’t receive a punishment for doing things very far away. For instance, if playing the piano badly gets a frown, and not playing the piano at all gets nothing, your brain might avoid the piano, rather than honing in on the narrow band of good piano playing right next to the punishable bad piano playing. This and the last point means that if you are trying to teach your brain what is good by giving it extra rewards or punishments as soon as it does things, you want to give it rewards for anything near the best action, at least at first.

Quick takeaways:

  1. How nice things seem is in your deliciousness-model, not the world
  2. Your deliciousness-model can be pragmatically shifted, much like a bucket of water can be shifted. Things that are awful can become genuinely nice.
  3. If a thing seems like it should be nice, but your deliciousness-model is rating it as not nice, you can think about why it is wrong and how to communicate its error to it. Has it not taken in the nice consequence? Does it not understand the causal connection, because the consequence takes too long to happen? Does it not realize how bad things are even when you are not near the piano?
  4. You should generally reward or punish yourself according to whether you want yourself to do ‘things like this’ more or less. Which often means rewarding yourself for getting closer to your goal than in the most available possible worlds where you looked at social media all afternoon or played a computer game, even if your success was less than in some hard to find narrow band nearby.

(I called this post ‘training sweetness’ because the thought of changing which things taste sweet or not via ‘training’ sounds kind of wild, and reminds me that what seems like real, objective niceness in the world is what we are saying is in your mind and malleable, here. I don’t know whether a literal sweet taste can be retrained, though it seems that one can come to dislike it.)



Discuss

The feeling of breaking an Overton window

17 февраля, 2021 - 08:31
Published on February 17, 2021 5:31 AM GMT

Epistemic status: real but incomplete observations.

In late February of 2020, I went to the grocery store at 2am with my husband (emptiest time), and we bought ~$1k of mostly canned or dry goods. The cashier seemed interested in our purchases, and I felt myself stiffening as she looked. Then she asked me: “are you worried about that virus?”

And… I found myself reaching for a lie, trying to compose a lie, moving my speech-planning-bits as though the thing to do was to lie. I mean, not a technical lie. But I found myself looking for a way to camouflage or downplay my model of the virus.

Oddly, it felt more like a thing happening in me (“I found myself") than like a chosen thing. If you’ll pardon the analogy, it somehow felt at least a bit like throwing up, in that I remember once when I was trying not to throw up, and all of a sudden it was like an alien process took over my consciousness and throat, reached for the bucket, got my hair out of the way, and did the actual throwing up. And then returned my body to me when it was over. The "alien process" sensation felt a bit similar.

With the cashier, I wondered at my impulse as it was happening, but I couldn’t tell if the impulse’s source was “for the cashier's sake” (didn’t seem to make sense); “to prevent her from harming me” (didn’t seem to make sense); or … what exactly?

I forced myself to say “yes” to the cashier's question, and to elaborate a bit; to my surprise, she seemed sincerely curious, and told me several people had been in doing this and she would probably also prepare in some way. Even after this, my sentences wanted to (sound soothing? fit in? avoid disrupting others’ “normal”? I’m still not sure what), and it took me active effort to partially not do this.

I am somehow quite interested in what precisely was happening there, and in any related processes.

My guesses as to how to help with this puzzle-set, if you're so inclined:

  • Share observations (not theories) of any related-seeming things you’ve noticed (the rawer the better);
  • Share observations (not theories) of what it’s like to be you right now trying to look at this stuff. Do you have introspective access? Do you have sort of have introspective access, and in what way? Do you kind-of-like identify with it? Kind-of-like not-identify with it?
  • And okay, yes, also theories, I just hope the theory doesn’t overwhelm the observations at this confused stage.


Discuss

How poor is US vaccine response by comparison to other countries?

17 февраля, 2021 - 05:57
Published on February 17, 2021 2:57 AM GMT

Epistemic status: unapologetically US-centric.  Noticing that I am confused, and hoping the internet will explain things.

 

SECTION I: OBSERVATIONS

 

Many places I follow have been saying for a long time that US vaccine procurement and distribution is very poor, and that we could have many more people vaccinated if we would not drag our feet so much/not prosecute people for giving out vaccines when we decide they shouldn't have/etc.  (I won't reiterate details.  For examples, start with e.g. Zvi's post here).  

I'll admit that I am predisposed to this viewpoint, and began with a very negative view of e.g. the Food and Drug Administration, but even taking that into account they seem to have very strong points that the US response has been very very bad.

However, Zvi's post included an image of a graph 'daily COVID-19 vaccines doses administered per 100 people' that confused me by showing the US very near the top:

 

This is only a 7-day average rather than a longer-term one, but still shows the US doing better than most countries.  I believe I tracked the data source down to https://ourworldindata.org/covid-vaccinations.

When I sort a list of countries there by total # of vaccinations per 100 people, I get the following list of countries above the US:

Gibraltar77.0Israel76.3Seychelles56.9United Arab Emirates51.4Cayman Islands23.6United Kingdom23.3Jersey20.8Turks and Caicos Islands16.6Isle of Man16.1Bermuda16.1United States15.8

followed by 75 more countries with lower numbers and a bunch more with no data.

Overall, there are 10 countries ahead of the US.  One is the UK (a fairly similar country which is also facing a more dangerous local strain).  One is Israel (commentary withdrawn).  And the other eight, at the risk of seeming like a stereotypical American, are tiny places I didn't even think were countries.  (Isn't the Isle of Man part of the United Kingdom? Why does it get its own row?)

I notice that I am confused.  If the US rollout of vaccines has been this botched, why are we so far ahead of, say, Germany (5.0)?  Or Singapore (4.4)?  Or Switzerland (5.6)? 

 

SECTION II: EXPLANATIONS

 

Five explanations spring to mind:

  1. The data for the US is mistaken (too high).  Perhaps we are fraudulently inflating our numbers to look good.
  2. The data for other nations is mistaken (too low).  Perhaps they are not publicizing their vaccine efforts/are distributing through informal networks/otherwise haven't made Our World In Data aware.
  3. The things the US is doing that look like they should be slowing down vaccine deployment are not actually slowing it down.  The Very Serious People are smarter than me and a handful of mostly-libertarian bloggers I follow, and correctly took reasonable safety precautions that did not materially slow the vaccine deployment.
  4. The things the US is doing that look like they should be slowing down vaccine deployment are indeed slowing it down, but almost every other nation is doing just as many things like this (or more) that are just as bad (or worse), I simply haven't heard about e.g. all the things that are going wrong with the vaccine deployment in Italy.
  5. The things the US is doing that look like they should be slowing down vaccine deployment are indeed slowing it down, but we have enough other advantages that this hasn't hurt us that much.  As a large, rich country, and one that infamously pays a lot for medical stuff, we attract substantial investment from medical companies even when we put barriers in their way.  As a result, we can get away with making Pfizer's life very inconvenient, because we're such a big market that we're still more lucrative than e.g. Italy.

Overall I think #4 and #5 sound like the most likely ones - I'm going to be assuming below that the argument is between #4 and #5, though if people want to tell me that obviously #3 is correct I guess I'll listen.

 

SECTION III:  WHY DOES IT MATTER?

 

I think there's a substantial difference between these.  In particular, #4 and #5, while they both admit that US policy has been bad, seem to advocate for very different reactions.  

If the FDA is terrible but still far better than its equivalents in almost all other countries, that seems to advocate for a more measured and positive response, and less criticism of them.  

If the FDA is terrible but this is being papered over by our status as a wealthy country and major consumer market, that seems like much worse news.

I don't know how to distinguish these cases from one another, though.  



Discuss

Weirdly Long DNS Switchover?

16 февраля, 2021 - 23:00
Published on February 16, 2021 8:00 PM GMT

I'm helping my dad migrate some code to a new server. It was at 162.209.99.139 for years, but ten days ago he changed his DNS settings to point to 34.199.143.13. Everything looks good to me:

$ dig notes.billingadvantage.com notes.billingadvantage.com. 1800 IN A 34.199.143.13 The TTL is 1800s, or 30min, which agrees with dns.google. I expected everyone would be moved over within a couple hours, but a week later the old server is still receiving traffic nearly as much as the new:

date old server new server 2021-02-05 943 0 2021-02-06 201 127 2021-02-07 17 108 2021-02-08 364 423 2021-02-09 488 448 2021-02-10 255 503 2021-02-11 281 345 2021-02-12 250 248 2021-02-13 0 88 2021-02-14 0 78 2021-02-15 217 262 2021-02-16 202 287

The old server getting no traffic on the 13th and 14th is probably because that's the weekend, and the users who happen to be still stuck on the old site aren't using it on the weekend. I asked one of the users still getting the old server to try rebooting, to no effect.

I thought maybe something was misconfigured with the name servers, but it looks fine:

$ whois billingadvantage.com \ | grep 'Name Server' Name Server: NS1.ZEROLAG.COM Name Server: NS2.ZEROLAG.COM $ dig notes.billingadvantage.com \ @ns1.zerolag.com notes.billingadvantage.com. 1800 IN A 34.199.143.13 $ dig notes.billingadvantage.com \ @ns2.zerolag.com notes.billingadvantage.com. 1800 IN A 34.199.143.13 I'm not seeing any references to the old IP address anywhere.

Any guesses about why the traffic isn't moving over?



Discuss

What are resonable ways for an average LW type retail investor to get exposure to upside risk?

16 февраля, 2021 - 22:44
Published on February 16, 2021 7:44 PM GMT

By "reasonable" I mean: will not require an arbitrarily long education process to understand/carry out. I work a day job and have other commitments in my life, and I'm not that fundamentally interested in finance/investing. I'm not trying to get anywhere near an efficient frontier, I just want to be able to get upside risk more systematically than "some smart seeming EA/LWer drops a stock/crypto tip in a facebook group".

By average: I assume I'm close to the LW median: I'm comfortable making kelly bets in the $1k - $100k range, depending on details. Mostly in the $10k range.

Are IPO ETFs a good candidate for this? SPACE ETFs? Buy and hold cryptocurrencies that aren't BTC/ETH? Sell random call and put options? Something else?



Discuss

Second Citizenships, Residencies, and/or Temporary Relocation

16 февраля, 2021 - 22:24
Published on February 16, 2021 7:24 PM GMT

Introduction

Over the past few years, I’ve gained an interest in securing options for residency outside of my home country, in particular via second citizenship.

There are a number of potential benefits that can arise from having a second residency option. These include professional and education opportunities, other economic opportunities, the ability to mitigate a variety of potential risks via relocation, travel benefits, and the potential extendability of these benefits to a spouse, descendants, or in rare cases friends and colleagues.

In this post I:

  1. Share the reasons I’ve identified for securing a second residency
  2. Provide thoughts on planning for a soon upcoming, largely unexpected departure
  3. Describe long-term options for alternative residency
  4. Share my knowledge and experience toward securing a second citizenship
  5. Provide the lessons I’ve learned in pursuing genealogy for the purposes of securing a second residency
  6. Share some other related topics I may write about in the future as an extension of this post and offer to help or connect people interested in securing second residencies
Notes and Disclaimers
  1. The original version of this post was written in early October 2020 and commissioned by the Center For Applied Rationality (CFAR). It has been somewhat reorganized for publication on this forum.
  2. This document is written nearly entirely off memory. As a result, it is likely that there are significant mistakes. It’s worth validating anything in here that you plan to act on.
  3. It is also written as a first-pass, optimizing for sharing the information with minimal time-investment. There may be cases of imprecise language, and inconsistencies in presentation or organization, as a result.
  4. I’ve pursued this topic out of personal interest, without an expectation that I would end up sharing my knowledge of it. As a result, I typically only looked into things to the degree necessary for my personal interest. I therefore expect this document to not be comprehensive and for there to be a high level of subjectivity in my account. Many opportunities are limited on the basis of current citizenship or ancestral background.
  5. You may want to navigate to sections of interest using the outline on the left rather than read the post as a whole. The options that I’ve found most exciting, because they are not well-known and can confer citizenship without relocation or large financial outlay, are Panama and the Ancestry Options. Unfortunately, these won’t be available to all readers. Luckily, you may find other options compelling.
Why Second Citizenships May Be Useful

Second citizenships can...

  1. Enable the counterfactual acquisition of desirable jobs:
    1. Positions in the country of second citizenship that are not available (or not available without significant friction) to those without existing right-to-work
    2. Positions in other countries that extend a right-to-work to the citizens of a country in which second citizenship is secured (e.g. other EU countries)
    3. National country government positions are sometimes reserved for citizens.
    4. The United Nations (UN) has some sort of system by which a person’s citizenship plays a large role in the possibility of their working for the UN, even in unrelated positions. I don’t know a lot about this, but I’ve been told by many people that it is much more difficult to enter the UN as an American citizen than it is as a citizen of other countries. I’ve also been told it’s much easier if you hold citizenship of a country in which there’s less competition for positions, and most target countries for second citizenships are likely to have less competition than most EA hubs. I am uncertain to what extent this may or may not apply to other intergovernmental (or nongovernmental) organizations.
  2. Enable potentially valuable work-visitation ability:
    1. Many countries, such as the US and UK, limit the length of work visitations as well as the activities that may be done. If a citizen of a country you’d like to visit for work purposes, you will not be subject to such limitations.
    2. Some countries can be more easily visited as a citizen of one country over another. For example, each citizenship I’ve investigated offers visa-free access to at least 5 countries that US citizens otherwise require visas for. In particular, a few countries (Bhutan, Russia (possibly no longer), Brazil (no longer)) have sometimes charged sizable daily fees for visitors from some countries, while others can visit for free.
  3. Provide access (or access at a much-reduced tuition) to Universities that were otherwise inaccessible
    1. In particular, low-cost universities for citizens are more available in Europe than the US
  4. Enable eligibility for various grants and scholarship programs
  5. Provide access to social services (welfare, health care, maternal support, etc.) that are superior to those in the country of first-citizenship
  6. Enable the ability to reduce risks, in particular in times of crisis, from a variety of threats such as:
    1. Authoritarian rule (by going to country of 2nd citizenship)
    2. Nuclear events (relocation as either as a preventative or response)
    3. Biorisks (relocation to a location that is better managed, in which the threat is not present, that has earlier/easier access to drugs or vaccines of interest, etc.)
    4. Air quality (e.g. leaving urban China or India, or other locations that may have a sudden decrease in air quality)
    5. Natural disasters (preventative relocation or in response)
    6. Escaping violence
    7. Escaping false imprisonment
    8. Leaving area of financial or economic collapse (personal or systemic, as long as one country provides superior opportunity)
    9. Violence, imprisonment, restrictions on movement or activities while visiting a country that is hostile (or has factions that are hostile) to your country of origin
  7. Provide economic opportunities
    1. Lower cost of living via relocation
    2. Lower taxes via relocation and dual taxation agreements
    3. Cheaper access to key resources (University, prescription drugs, etc.)
    4. Access to banking and investment opportunities reserved for citizens
  8. Increase access to resources
    1. Prescription drugs are often more readily available and less inexpensive in non-U.S. countries
    2. Some drugs (e.g. in cases of pandemic) may not be available to the public in the US as early as elsewhere
    3. Nuclear bunkers in countries that are more isolated
  9. Provide opportunities for influence of additional national governments
  10. Improve awareness and understanding of other cultures and value systems
  11. Increase opportunities for enjoyment via visitation or residence in an area that would otherwise be harder to visit or have restrictions on the length or type of presence
  12. Some of these benefits may be extendable to spouses, descendents (ad infinitum), and (more rarely) friends or colleagues via inheritance or sponsorship.
  13. Some of these benefits may become more pronounced if many in the community secure the same second citizenship. For example, they could allow for the creation of a new EA hub, group purchasing (e.g. a bunker in a country that’s less at-risk), group decision making to start EA-relevant programs at accessible less expensive universities, etc.
Example benefits someone may have had in the current pandemic:
  1. I know of a person on the Diamond Princess cruise ship that had an early COVID outbreak who was able to leave it earlier as a result of his second citizenship. He was a U.S. and Panamanian citizen, had pursued Panamanian citizenship solely for the benefits listed above, and Panama negotiated for his departure from the ship and chartered a plane for him significantly earlier than the U.S. did.
  2. A friend was working for the UN in Nigeria when the COVID outbreak occurred. The country was in lockdown, and she was unable to leave it. She is a dual citizen of Germany and the U.S. Germany offered her a chartered flight to depart Nigeria a month earlier than the U.S. did.
  3. The US and UK are two countries that have particularly high case rates of COVID. A second citizenship could have allowed relocation to a country with a lower case rate (and better management).
  4. Some COVID treatments are more readily available in some countries. For example, Fluvoxamine may be a highly promising preventative of hospitalization-worthy COVID. Many doctors seem quite hesitant to prescribe it in the U.S., and a clinical trial includes a placebo arm, but I believe it is freely and easily purchasable in some other countries. A number of drugs (primarily those that have previously been used to treat other conditions) fall under this category.
  5. Many countries have provided greater social services (in particular direct payments) to those who are unemployed during this pandemic than have others.
  6. Health care for COVID, in particular should you be hospitalized, may be much less expensive in some countries than in others.
  7. The ability to search for jobs in multiple locations can be particularly valuable during this time of mass unemployment.
Leaving the US in the Next 1-4 Months (written Oct. 2020)Short-Term Stays

There are a number of countries still allowing US citizens to enter as tourists.

  1. The UK & Ireland may be particularly appealing, because they are English speaking, have strong healthcare, and are in/near the EU.
  2. Mexico is currently allowing US citizens to enter (via flight only) as well.
Long-Term Stays

Most places will have limits to how long you can stay in them (e.g. 90 days for the UK or EU). There are some options for more long-term stays, in particular:

Estonia

Estonia has launched a 1-year visa for remote / digital workers. Though it was planned for some time, that it launched during the pandemic is seemingly indicative of their willingness to accept applications during this time.

  1. The last time I looked, Estonia was not allowing anyone residing in the US to enter. A reasonable plan seems like it may be to travel to the UK/Ireland and then enter Estonia from there.
    1. I believe the UK & Ireland require a 2 week quarantine upon entry, and you must have negative covid tests before exiting quarantine.
  2. Estonia seems like a great option because it is in the EU. Presumably, one could go to any other EU country once this visa is secured (although they may e.g. require an Estonian permanent address or something… I’m not sure how or to what extent Estonian residence may or may not be verified or required).
  3. I expect this visa to be renewable, so this option may not be time-bound.
  4. There are some EA connections to Estonia that may make this more appealing or may be worth reaching out to if you have any issues.
  5. I’m uncertain of the application & visa price; my guess is somewhere between $500 and $4000.
Bermuda

Bermuda (overseas territory of the UK) has launched a 1-year remote worker visa as well; particularly for the pandemic.

  1. Bermuda may be particularly appealing because it is well-developed, English-speaking, and close to the US.
  2. You can go straight to Bermuda from the US; there’s no need to e.g. go to the UK first like there is for Estonia.
  3. I believe you need a negative test or two prior to your flight in order to enter.
  4. Reports on the quality of Bermuda’s healthcare vary. I saw general descriptions of it being strong while I also read specific instances of patients being transported to the US for care.
  5. I’m uncertain of the application & visa price; my guess is somewhere between $1000 and $9000.
  6. I somewhat suspect that Bermuda will accept applicants more readily than Estonia, since this visa’s creation is seemingly in direct response to the loss of tourism revenue due to covid.
St. Lucia

St. Lucia has also launched a 1-year remote worker visa.

  1. Like Bermuda, I expect St. Lucia may accept applicants more readily than Estonia, given the program’s creation is inspired by making up for lost tourist revenue.
  2. St. Lucia may be appealing for its weather and proximity to the US; I expect it’s healthcare to be worse than other presented options.
  3. I’m uncertain of the application & visa price; my guess is somewhere between $1000 and $9000.
UK & EU Tourism

You can spend 3 months in the UK and 3 months in the EU (e.g. via Ireland) to get a total of 6 months abroad, which may be sufficient for most (or sufficient to plan another option).

Germany

Germany has long had an independent contractor/entrepreneurship visa; I’m unsure if it has been affected by the pandemic. I secured this visa around 2011/2012. At that time, the requirements to secure the visa were not too onerous and mainly involved proof that you were staying in Germany, had sufficient funds, and were an independent contractor or entrepreneur. I believe it was renewable indefinitely.

One of the most difficult aspects of securing it was that the required documentation was a moving target; online sources conflicted with one another and each reviewer of your application seemingly applied their own new criteria as well. As a result, one of the most successful strategies was insistence; arguing with your reviewer and demonstrating how you did in fact have sufficient documentation and that they were wrong. Hopefully, this has now changed to be more straightforward and less dependent on having a willingness to be highly insistent.

Portugal

I know little about it, but I believe Portugal has a visa you can secure with proof that you plan to establish residency in Portugal and that you have a stable and regular source of significant income from abroad (enough to easily live off of).

Notes on How to Prepare and Leave
  1. Refundable international flights are available through United Airlines, American Airlines, Southwest Airlines, and a few others. These are typically much more expensive than normal tickets.
    1. These may be appealing to book now; I can imagine that in a situation in which a number of people want to leave the country, flight tickets may either sell out or rise greatly in price.
    2. It may make sense to book tickets from multiple airports, to multiple countries, and on multiple airlines, should you be able. This helps provide options in case of difficulty getting to any specific airport, a country closing down to US tourists, etc.
    3. It may make sense to book tickets throughout the time of concern, e.g. if you are worried about election-related violence you could consider tickets throughout the November-January (inauguration) timeframe (or even after).
  2. Long-term visas should likely be applied for as soon as possible; although it may be the case that the Estonian visa should not be applied for until you’ve moved to another country.
  3. If your passport expires prior to 6 months after your latest potential entry date into another country, you may want to see if you can get it renewed and returned in time. Many countries do not let people enter on passports that have less than 6 months remaining validity.
Longer-term Options for Alternative Residency

Residency can refer to citizenship, permanent residency, temporary (short- or long-term, renewable or nonrenewable) visas, or visa-free visits.

Citizenship vs. Permanent Residency vs. Visas vs. Entering as a Tourist
  1. Citizenship: I find gaining citizenship in another country to be highly appealing.
    1. Pros:
      1. Permanent, nearly irrevocable right to live and work in another country.
      2. Provides access to social services (e.g. healthcare, welfare)
      3. Provides access to consular services (e.g. protection in their embassies, negotiation on your behalf)
      4. Provides a passport
        1. This can make it easier to enter some third countries; for example, US citizens require sometimes-expensive visas to enter Russia while citizens of many other countries do not.
        2. Many have said that they feel like they’re less vulnerable / less of a target when traveling in countries where the government or some citizens may be hostile to the US.
      5. Typically inheritable by your offspring and often makes citizenship much easier for your spouse
      6. Can be helpful for gaining international employment
      7. May have social, emotional, or mental health benefits as well.
        1. Many say they feel more connected once they gain citizenship, that they have rediscovered their heritage, they have more confidence since they have an escape plan, etc.
    2. Cons
      1. You’re subject to the laws of that country. In practice, this seems to rarely have a downside.
        1. The most likely probably relates to taxation; most countries tax only those citizens living in the country, but some (like the U.S. and Israel) tax citizens living anywhere in the world. That said, this then can get waived if the two countries have an agreement not to double tax (the U.S. does with most developed countries, Israel does with most as well, although I think they’re currently negotiating with Australia and it may not be in-place yet).
        2. If you break a law in your second country, your first may be unwilling to help you, since you’re a citizen of that second country.
          1. (There are mixed reports about whether or not this is or is not the case.)
      2. You may need to maintain two active passports (a small financial cost). Most countries will not let their citizens enter on a foreign passport.
      3. A few countries require you to renounce your other citizenships upon receiving theirs, or will disavow your citizenship if you acquire another afterward. Sometimes this is specific to a certain country; e.g. I believe Slovakia and Hungary are not on good terms and do not allow dual citizenship with one another. This is rare but worth verifying for your countries of interest.
      4. Dual citizenship could potentially be detrimental to a political career.
  2. Permanent Residency: The is one-step below citizenship, and is sometimes a prerequisite toward obtaining citizenship.
    1. Pros
      1. Permanent right to live and work in another country, although it is much more revocable than citizenship
      2. Typically provides access to social services (e.g. healthcare, welfare)
    2. Cons
      1. Most (if not all) permanent residencies need to be ‘maintained’ through physical presence in the country. The nature of this requirement varies from country to country, although it is typically quite significant (e.g. 6 months+ in the country for 3 of the 5 preceding years).
  3. Visas: These signify temporary permission to be in a country. Sometimes they provide the right to work, while others only provide the right to be there as a tourist.
  4. Tourist: Usually when you enter a new country, you typically enter as a ‘tourist’, whether or not that is your intention (e.g. most attendees to academic conferences will typically enter the country as a ‘tourist’). Some countries require a visa for this, while others will provide a period of time under which you can remain visa-free (e.g. the EU and UK provide 90 days). When you enter as a tourist (whether visa-free or not), you do not have the right to work in the country.
Comparing Options Against One Another

Generally, it is not necessary to limit your number of applications for residency or citizenship. That said, you may choose to prioritize on the basis of a number of different factors.

  1. Ease of applying
    1. Language requirements
      1. Some citizenship programs require you speak the local language, to varying degrees. For some you seemingly need to be B1 or B2 conversational, while others only require an A1 or A2 level of language ability (or no language requirement at all).

        The way in which this is assessed varies as well. I’ve seen all of the following as language assessment tools depending on the country and program:
        1. Official language tests
        2. Conversations in the language when submitting the application
        3. Proof that you’ve taken a language course
        4. Signed statement by two citizens that you know the language
        5. Submitting your application in the language
    2. Documentation requirements
      1. Each citizenship program has quite varied requirements for documentation. Any or all of the following documents may potentially be required (though some programs require very little documentation):
        1. Birth certificates (for you and potentially some of your ancestors)
        2. Your passport
        3. Marriage certificates (if applicable, for you and potentially some of your ancestors)
        4. Death certificates (if applicable, for potentially some of your ancestors)
        5. FBI background check
        6. Miscellaneous other documentation

You may choose to apply or not apply to  a program on the basis of which documents you have available (e.g. which ancestors you have records for). It may be worth applying even with a low likelihood of success if you have all the required documentation to submit an application, while other programs that would very likely be successful may not be worth applying to until the necessary documents can be obtained.

The form in which these need to be provided may range quite a bit as well. The following are the potential possibilities: 

  1. Apostille: This is an internationally recognized seal that certifies that a document is genuine. It is typically provided by the government. You would submit your document to the government in which it is issued to get it apostilled, prior to submitting your application.
  2. Original: Some programs ask for you to send the original documents. In most cases, with the notable exceptions of passports, it seems apostilled and/or official copies are accepted even when the programs do ask for originals.
  3. Official copies: Governments can issue official copies of documents at your request.
  4. Casual: Some governments aren’t picky at all; I think because they expect to verify the information via another method anyway. In these cases, you can e.g. make your own copy of a document (rather than obtaining one from a government) and submit it.
  5. Reference: Some governments will look up the information or validate it anyway, so you don’t necessarily need to provide any copy of a document, just information. For example, you might provide your date and place of birth rather than a birth certificate.
  6. (Officially) Translated: Some places are happy to accept documents in whatever language they are in, others will accept documents in English or their country’s language only, and some will only accept documents in their own country’s language. Some will let you source the translation in any way that works for you, but most seem to require an ‘official translation’. Official translation providers also vary by country; some require that their own government translate the documents for a fee, while others have a large network (including in the US) of official translation services that they allow.

I track the documents required for each place I’m applying with something similar to this linked sheet.

  1. Cost
    1. Purchasing citizenship can be very expensive ($35,000-$10M), and while most other options for procuring citizenship are generally affordable ($0-$350), some other programs can have fees that are quite prohibitive. For example, 1-year work visas in Australia seemingly often cost in the range of $3,500-$10,000. 

      There may be other costs that are less obvious that are worth consideration as well. For example, programs in Panama and Israel require your physical presence in the country. Some programs require apostilled copies of official translations of a number of documents, which each have low fees but can add up quickly. If you pay for genealogy assistance to search for and obtain records, those costs can add up quite easily as well.
  2. Ties to the country
    1. Some countries require you demonstrate ties to the country and/or culture. The two instances of this I’ve seen have been poorly specified and involve significant discretion in their assessment. I think these are often easy to build, and may involve activities such as attendance at cultural events or travel to the country of interest.
  3. Likelihood of application being successful
    1. While sometimes it can be very clear whether or not you’ll successfully receive citizenship or residency if you apply, I’ve more often found that there’s a level of ambiguity. This ambiguity can arise from:
      1. New programs that aren’t fully specified and are largely untested (Austria’s new citizenship via ancestry program announced Sep 2020)
      2. Differences in program wording, implementation, or standards for evaluation by individual or consulate
      3. Discretion on questions of sufficient documentation, language ability, etc.
  4. Desirability of citizenship
    1. Passport strength
      1. There are three passport desirability rankings I’m aware of:
        1. Passport Index Score
        2. Henley Passport Score
        3. Sovereign Man Passport Rankings (likely paid access only)
      2. I’ve built this linked spreadsheet to identify what countries’ passports provide advantaged access to which others, compared to your home passport. The visa requirements for every country combination from Wikipedia articles can be pasted in, and then the formula extended for new results.
    2. IHDI (Inequality-adjusted Human Development Index), Fragile States Index, English speaking percentage of population
      1. I use these metrics as first-pass proxies for the country’s personal appeal and stability.
    3. Other access (EU, Latin America group, East African group)
      1. An EU passport tends to be highly valued due to the number of countries to which it provides access. There’s some sort of common Latin American work & residency group as well, and I believe there’s one in East Africa.
        1. In some cases these may provide the permanent right to live and work, while others just make it much easier to do so.
    4. Ease of citizenship for spouse and/or descendants
      1. In many cases, you may value a citizenship more if it can be acquired by a spouse and/or offspring. The ease with which these happen can vary.
    5. Personal connections & feeling
      1. I’ve found that I may in some cases have a higher likelihood of being able to obtain citizenship to some countries to which I don’t feel connected, while there are others to which I do feel I have a stronger ancestral or modern-day connection.
Options for Gaining Citizenship or Permanent Residency

I’ve been surprised to find that there are a number of options for second residency and citizenship; they’re often more accessible than I’d anticipated, though usually still time-intensive and potentially difficult to get. There are four categories of ways to get a second citizenship / residency:

  1. Miscellaneous: Some countries will grant residency or citizenship on the basis of your religion, your current citizenship, and/or your educational and professional achievement. Asylum may be a possibility in more extreme circumstances.
  2. Ancestry: A number of countries may grant you citizenship or residency if you have ancestors from those countries. Some examples are: Italy, Ireland, Lithuania, Latvia, Hungary, Czechia, Germany, Poland, Ukraine, Slovakia, and Austria
  3. Purchasing: A number of countries may grant you citizenship if you make a large investment in the country or pay the government for it. This includes EU countries (I think Malta and Albania). In most cases the cost is over $100,000, but St. Lucia has a citizenship scheme that returns most of the money to you after 5 years, with a net cost that is much cheaper.
  4. Naturalization: Most countries will grant citizenship if you live there for a period of time.
Miscellaneous Options for Residency or Citizenship

Panama 

Panama offers a “friendly-nations visa” that provides permanent residency. This is instantly obtainable (no residency requirement), if you’re a citizen of one of ~50 countries that they have selected. This is particularly appealing for a few reasons:

  1. Instant access to permanent residency
  2. This permanent residency is much easier to maintain than others
    1. You are required to spend one day in Panama out of every two years to maintain it. If you fail to, if you return within 6 years, they’ll reinstate it pretty easily.
  3. This permanent residency takes you on a somewhat-easy path to citizenship. You need to have permanent residency for 5 years, speak Spanish, and demonstrate a connection to Panama. Given the ease with which you can maintain permanent residency in Panama, you could be in the country 3-4 times, for a total of ~2 weeks, and obtain your Panamanian citizenship (although you may want to stay longer to better demonstrate a connection to Panama; some citizenship applications that are technically valid do get denied).
  4. I estimate the total cost to be $2-4k all-in to do this.
  5. Miscellaneous benefits
    1. Panamanian citizenship is supposedly excellent for financial security and alleviating U.S. taxes as well.
    2. A number of people report that Panama is very enjoyable to be in.
    3. Panama outperformed the US in at least one instance with regard to consular assistance after the coronavirus outbreak.

COFA: Palau, Micronesia, and the Marshall Islands

COFA stands for “The Compact of Free Association”. It is an agreement between the US and Palau, Micronesia, and the Marshall Islands. The summary of this agreement that I’ve read states that in exchange for the US being able to maintain military bases in these countries, the US provides nearly all social services for them (e.g. roads, welfare, etc.). Additionally, the citizens of these countries have the permanent right to live and work in the US, and citizens of the US have the permanent right to live and work in these countries.

  1. Pros:
    1. Permanent right to live and work in these countries.
    2. It seems you can enter each of these countries for 1 year visa-free (e.g. without notice.) Then it seems you apply for a visa that they are obligated to grant allowing you to stay longer.
  2. Cons:
    1. There is very little available information about the right for Americans to live and work in these countries. I would not be surprised if some of this information is wrong; I would not rely on this without verifying it first.
      1. I also have spent less time learning about this agreement than most other things on this document. I could particularly be mistaken about aspects of this one.
    2. These are micronations; they likely are not used to new people moving to them, probably don’t have great healthcare, and likely are not very economically developed.
    3. Given the US’s presence and influence in these countries, going to these may not alleviate your concerns.

Israel

People who can demonstrate that they are Jewish (e.g. a letter from a synagogue, ancestors’ gravestones showing they were Jewish, etc.) are permanently entitled to obtain citizenship.

  1. Pros:
    1. Citizenship can be instantly obtained upon arriving in Israel
    2. Israel is a developed country with a strong economy and social services.
    3. Upon obtaining citizenship, there are a number of benefits provided
      1. Waived taxes for 10 years!
        1. (These may not be fully waived; maybe they’re just reduced. I recall being very impressed though).
        2. Given Israel’s agreement with the US, this can give you 10 years of lower taxation if you reside there.
        3. If you were to e.g. inherit a lot of money in one year or realize a large amount of capital gains, perhaps obtaining citizenship and spending that one year in Israel would be highly financially valuable.
          1. In that case, you may potentially delay your acquisition of citizenship until that time.
      2. A monthly payment for a number of years if you reside in Israel
        1. (I think for a single individual this was about $300 a month; it scales by family size and I may be wrong about the amount.)
      3. Assistance finding a place to live
      4. Free Hebrew classes
      5. Probably quite a bit more
  2. Cons:
    1. While many countries allow you to obtain citizenship through your heritage without ever visiting that country, for Israel you must go there and you must demonstrate that you intend to move there for the foreseeable future.
      1. It may be the case that you are genuinely interested in trying-out living in Israel, but you are worried you’ll get in trouble if it doesn’t work out and you leave or change your mind soon after arrival.

        From what I’ve been able to determine, the intention to move there is all that is needed, and leaving pretty quickly after doesn’t seem to be an issue (although I’d want to further verify this before relying upon it).

        It seems you only need to show that you’ve rented a place and have some way you plan to make money in order to demonstrate that you intend to live in Israel.
    2. Israel has worldwide taxation, and while it has agreements with most countries so that you are not double-taxed, its negotiation with Australia is ongoing and a double taxation treaty was seemingly not in place when I looked in September 2020.
    3. Israel has mandatory conscription into its military if you are under 28 years old and residing in the country.
  3. Notes
    1. Citizenship is typically granted 3 months after arrival; you can fill out a simple form to waive this waiting period, however.

Canada

Canada is one of few countries to offer instant permanent residency. To obtain permanent residency, you must get a sufficient number of ‘points’ according to a formula that will ask about things like your age, education level, work experience, marriage status, whether you have a job offer in Canada, etc.. If you apply and are above the points threshold they reset every 3 months, you’re offered permanent residency (and I think you have a year to accept and move there, but I may be wrong.)

  1. Notes
    1. I think it’s moderate difficulty to meet the points threshold. It’s not at all unobtainable but many may not have sufficient profiles. Others may meet the points threshold even without a Canadian job offer.
    2. Canada is surprisingly easy to obtain citizenship in as well. You can gain citizenship after just 3 years of living in Canada as a permanent resident.

Citizenship through in-country birth of a child (Brazil, Argentina, Chile)

  1. Many countries reduce the naturalization requirement for those who have children in those countries (and typically these countries provide those children with citizenship instantly.)
    1. In particular, Brazil will grant instant citizenship to the parents of a baby born in the country. Argentina and/or Chile (I don’t fully recall), reduce the residence requirement to 1 year in order to obtain citizenship if you have a child in that country.

Portugal

Spain and Portugal had programs for granting citizenship to Sephardic Jews. Spain’s recently ended; I’ve heard Portugal’s is still in place. I don’t know much else about it.

Digital Nomad Visas 

See this linked section above

Ancestry Options for Residency or Citizenship

Unless you already have documentation of your family history, it is likely that you’ll need to engage in at least a little genealogy. I have a section on this below.

There are a number of countries that offer citizenship through ancestry (most often European countries). Wherever your ancestors are from, it is likely worth Googling if they offer citizenship through ancestry and also reaching out to the embassy to ask as well (I emailed the consulate of a country that did not say they offer citizenship through ancestry anywhere I could find online, and they still said if I submitted documentation of my ancestry they’d consider granting citizenship). Most lists I’ve found of which countries offer citizenship through ancestry are very incomplete.

Additionally, it seems the rules regarding citizenship through ancestry are often not well-determined. I’ve seen multiple instances of the regulations being written differently on different government websites, I’ve heard of successes & failures that don’t align with the regulations, and many countries do leave the decision about your citizenship up to the discretion of whoever happens to be reviewing your application.

General advice for pursuing citizenship through ancestry

  1. Engage with genealogy. It’s been my personal experience (and I’ve heard many anecdotes of this as well) that the story I’d been told of my ancestry was very incomplete and with some inaccuracy. Genealogy seemingly becomes more and more rewarding the more I engage with it, both from a citizenship and personal interest perspective.
  2. Find and talk to others pursuing citizenship. Facebook groups have been invaluable in providing a wide range of guidance and information regarding the pursuit of citizenship. Search for one for your desired country. Some that I’m aware of are:
    1. Lithuania Dual Citizenship, Lithuanian’s Citizenship Assessoria Lituana E Traduçoes
    2. Latvian Dual Citizenship
    3. Slovak Living Abroad Certificate & Slovak Citizenship
    4. HUN Citizenship Journey
    5. Austrian Citizenship Holocaust Descendants
    6. Ciudadania Checa/Czech citizenship
    7. Irish and Wannabe Irish, Dual Ireland/US Citizenship
    8. Dual US-Italian Citizenship, Italian Dual Citizenship, Italian Dual Citizenship Social Club

There’s typically additional genealogy focused groups as well. Some examples: Genealogy in Ukraine - Research and Ancestry, Hungarian Genealogy Group, Lithuania & Latvia Jewish genealogy, New York City Genealogy

  1. Reach out to official sources. Embassies and consulates can be surprisingly interested and willing to answer questions and assist with your application. Some countries have central archives that will do extensive genealogy work on your family for a minimal fee. Official sources have multiple times saved me a lot of time and effort vs. pursuing questions or research on my own.

Citizenship (or Residence) Through Ancestry Programs I’ve Heard Of (not at all exhaustive): 

Hungary

Hungary has one of the most commonly used citizenship through ancestry programs. I think it’s decently liberal, but I could be mistaken. The ways in which I think (with low confidence) that it is liberal is that:

  1. I think you can apply if you have ancestors that lived anywhere in the historic Austria-Hungary.
  2. I think you can go back any number of generations.

Certainly, if you have ancestors up to the fourth generation who lived in the “Kingdom of Hungary” borders of Austria-Hungary, you are eligible for citizenship. There are two different programs, one in which you must demonstrate Hungarian language proficiency, and another in which you do not need to do so.

I don’t fully recall what determines whether you need to demonstrate you can speak Hungarian or not, but I think it has to do both with the timing of your ancestors leaving Hungary and whether or not they were from the Kingdom of Hungary proper or not.

If you do need to speak Hungarian for your application, it is assessed informally via the short (10 min) conversation you have when submitting your application in person at the embassy or consulate. Two teachers who have prepared students for this conversation in the past estimated students can learn sufficient Hungarian in 4 months, with 2 hour long lessons per week. It seems a reasonable cost-efficient and well-tested method of preparing for this conversation is via teachers on https://www.italki.com/. I estimated my total cost (not accounting for opportunity cost of time) would be $641, based on an hourly pay rate to the teacher of $19. If you’re more adept at language learning than the average individual or want to select a less expensive teacher, this could perhaps be less. Alternatively, it does seem many people learn Hungarian to an extent that seems beyond this amount, and some of them seem to think it was necessary for their application to be accepted.

Some embassies are known for being more or less lenient than others, and regardless of the embassy you select, you will have a certain amount of luck based on the strictness with which the person you submit your application to assesses your Hungarian. You can apply again if you do not pass.

Hungary requires official copies of birth and marriage certificates going back to your ancestor who lived in the relevant geography.

Latvia

Latvia offers citizenship by descent under its “exiles” program to those whose ancestors were presumably Latvian citizens at the time of World War II beginning and who left Latvia prior to its regaining independence in 1990. In order to substantiate the former, typical guidance is that you must find documentation implying Latvian citizenship that is from 1933-1940, although some claim that documents as early as the late 1920s are sometimes accepted as sufficient proof. Unless you are already in possession of sufficient proof, the likely best step is to reach out to the Latvian Archives. The Latvian Archives are particularly great to work with compared to those of other countries; they will perform a complete genealogical search on your family for under $100 and are highly communicative (though the process does take months). In at least my case, they found a lot of documentation that was not only helpful for citizenship applications, but also was informative of my family’s history.

Latvia requires Apostilles for most foreign-originating documents that may be submitted for your application.

There is a second Latvian citizenship program “Latvians and Livs” of which I have more limited knowledge. My understanding is that you must demonstrate a genetic Latvian heritage, as well as a strong understanding of Latvian (e.g. at the C1 level), in order to secure Latvian citizenship under that program.

Lithuania

It is possible to secure Lithuanian citizenship by descent, though some of the qualifications to do so are unclear. There are significant discrepancies between what official sources list as qualifying, and what those in Facebook groups say works:

Official Sources 

Facebook Word-of-Mouth 

You must provide proof that is suggestive of an ancestor being a Lithuanian citizen

You must provide definitive proof an ancestor was a Lithuanian citizen

Proof can come in a variety of forms, such as documents indicating life in Lithuania (school enrollment, paystubs, etc.), foreign documents showing place of birth or citizenship, etc.

The only acceptable proof is documents issued by the Lithuanian archives

I assign an approximately 50/50 likelihood to the official sources vs. Facebook providing better guidance. The Facebook community (which is overwhelmingly Brazillian) predominantly hires a small number of providers to complete the application process for them, so it doesn’t feel as though the limits of acceptable documentation are as likely to have been explored as they would be with a large group of applicants applying more independently. Conversely, I’ve often found that the implementation of citizenship programs can be quite different than how they’re described on official websites, so I do think Facebook communities often provide relevant valuable information.

Additionally, there are conflicts between official sources, with some saying that a great-grandparent (or more recent ancestor) must have been Lithuanian, while others say you can go up to great-great-grandparents, and at least one other saying ‘any’ direct ancestor is acceptable. In this case, I expect the sources saying ‘any’ direct ancestor is acceptable to be correct.

Lithuania has not been an independent state very long or very often. To apply for citizenship, you must substantiate an ancestor who (plausibly?) had citizenship while Lithuania was independent. I’m uncertain of the exact dates considered to be acceptable, but they’re approximately from 1918-1939. You also must show that this ancestor left Lithuania prior to it regaining its independence in 1990.

Securing documentation to support an application may be difficult (see table above). I found the Lithuanian archives to be both of limited utility and difficult to communicate with. They will perform document searches, and in my case they did find a couple that were relevant, but these searches are highly abbreviated and not comprehensive. To more thoroughly search the Lithuanian archives, you will likely want to hire someone, and the cost of these searches seemingly range from €300-500, with no guarantee of any success. You may want to consider searching the Latvian archives; they seem to hold many documents originating from Lithuania and will perform comprehensive searches.

Most foreign-originating documents need to be Apostilled and officially translated to Lithuanian for the application.

I expect to apply for this citizenship sometime in 2021, which may provide some additional information as to acceptable documentation.

Austria

Austria has a brand new program that was passed into law in September 2020. It is most clearly intended for those whose ancestors were Austrian citizens and were persecuted, primarily by the Nazis. As a result, if your ancestors meet that definition, you have the most straightforward case.

That said, the definitions around the program are written broadly enough that it may be the case that many more people are eligible. It may be that if your ancestors ever considered themselves Austrian (or Austro-Hungarian), and were ever persecuted, you may be eligible. Since this is a brand-new program, we don’t really have data on what will or won’t be acceptable (and the consulates don’t either; they’re providing varied, inconsistent information).

As a result, a number of people are currently applying to this program without a clear idea on whether or not they’re eligible. Applying for the program is easier and more straightforward than most; there is no language requirement and you are only required to provide personal copies of any ancestor documentation. You do need to provide an apostilled copy of your birth certificate and an apostilled FBI background check, however.

I suspect that there may be an advantage to applying now; I could see Austria being liberal now but tightening the requirements later on once it sees how many applicants there are.

Slovakia

Slovakia offers a status of being designated a “Slovak Living Abroad”. If you apply for and successfully receive this status, you’ll receive the permanent right to come to Slovakia and easily obtain permanent residency.

To become a Slovak living abroad, you need to demonstrate ancestral ties to Slovakia, some form of proof that you speak some Slovak, and some form of proof that you’re culturally tied to Slovakia. 

Slovakia has a particularly wide range of strictness with regard to the administration of this program. I’ve seen some accounts of successful applications with very little to substantiate them; proof of having enrolled in a Slovak course (without having started it), for example, was sufficient for one applicant. I’ve also seen accounts of seemingly well-qualified individuals trying for years and being denied this status. The method for certifying language ability and cultural ties that Slovakia seemingly most recommends is to have two others with “Slovak Living Abroad” status sign a statement attesting to your language ability and cultural belonging.

A Facebook group was just recently formed for this (~August 2020), so I’ve seen much less discussion of this program than most others I’ve investigated. The group seems popular and should provide significant new data in the upcoming year. 

A bill has been introduced in Slovakia to allow citizenship via ancestry as well. This would be near-automatically granted to those who are already designated “Slovaks Living Abroad”. But for those who haven’t gained that designation (which may be eliminated if the bill is passed), a language test would be required. Therefore it may be beneficial to apply for this status sooner rather than later.

Other European Options

  1. Czechia: Czechia has a citizenship by descent program; though I’ve learned very little about it thus far. I’ve gotten the impression that it is likely more strict than some others.
  2. Ukraine: Ukraine offers citizenship by ancestry, but you must renounce your previous citizenships. There is a bill under consideration to not only eliminate this requirement, but also to ease the process by which citizenship by ancestry can be obtained. I plan to periodically check-in on this.
  3. Germany: I’m unsure if a German citizenship by descent program exists. I did read at least one website that said German citizenship by descent is available, while others have not included it in their list. After research I found out that my family actually didn’t have German ties, so I didn’t look into this any further.
  4. Poland: I haven’t looked into this because I found out after research that my family’s Polish ties are quite minimal, if they exist at all. I’ve heard that Poland does offer a citizenship by descent program and that it is quite strict and difficult to pursue.
  5. Ireland: Ireland has a citizenship by descent program, and it has a reputation for being liberal, easy, and one of the most used. I know very little else about it.
  6. Italy: Italy also has a citizenship by descent program, and it has a reputation for being liberal, easy, and one of the most used. I know very little else about it.
Financial Options for Residency or Citizenship

A number of countries will let you either directly purchase citizenship or gain citizenship via investment in the country. As far as I know, all of these require over $100,000 in order to gain citizenship. Due to my financial status, I have not looked into these much at all. I have noticed that there exist multiple options in the EU and Caribbean; I’m unsure to what extent this option exists elsewhere (though it does seem widespread).

St. Lucia

St. Lucia has the only program I’ve found to be notable based on my interests. The reason it is notable is that most of the money can be returned to you after a 5 year period. If I recall correctly, the initial outlay is over $100k, and it sits with the St. Lucian central bank for 5 years. After 5 years, they’ll return it to you minus fees, and the total cost (not accounting for opportunity cost, inflation, interest, etc.) can be something like ~$15,000 for a single person and ~$35,000 for a family of 4. This includes a potentially temporary COVID price reduction and a refund of some of the fees by using a broker with whom you split commission.

Naturalization Options for Residency or Citizenship

Nearly all countries grant those who live there for enough time citizenship. There’s a few of these that have shorter time requirements than others that are worth mentioning.

Spain

Typically it takes 10 years of residency to become a Spanish citizen. If you have citizenship of a Latin American country, however, this requirement is reduced to just 2 years.

Interestingly, they recognize Puerto Rico in their list of Latin American countries. Puerto Rico does grant “citizenship” to those who are born or live there. I think this typically does not have any legal benefit or meaning, but it is helpful for reducing your time until Spanish citizenship. Notably, it takes only 1 year of residency in Puerto Rico to become a Puerto Rican citizen. So with 3 years of residence, you can become a Spanish citizen (1 year in Puerto Rico, 2 years in Spain).

Netherlands

I believe I’ve read that they have the shortest residency requirement in Europe, at 3 years until citizenship.

Canada

Included in the miscellaneous section because permanent residency is instant; citizenship itself can be obtained with 3 years residence.

Belgium, Chile, Argentina, Panama

I’ve read each of these have appealing naturalization programs, but I haven’t looked into them (I likely did very briefly and decided that I wasn’t personally interested).

Genealogy

Acquiring citizenship by ancestry is often most appealing; it typically doesn’t require you to make any major changes in your life, such as relocating or spending a lot of money, but you can receive all the benefits of having a second citizenship.

In order to pursue citizenship by ancestry, you need to know about your family history, and typically, have documentation of it as well. Here’s how to get started with genealogy.

  1. Start a (feature-rich) family tree; it will be the basis for all your genealogy
    1. A family tree is the basis for tracking your family and recordkeeping. The best service on which to do this is Ancestry.com. An alternative is the software Family Tree Maker, which has two-way sync with Ancestry.com (and is nice to have to ensure you have a local copy of things).
      1. For each person in the tree, these can each store a number of facts, documents, stories… really anything you’d like.
      2. Ancestry.com will automatically find worldwide records that may match the people in your tree and suggest new ancestors / relatives. It can be extremely helpful; on very limited initial information I’ve sometimes tracked a family back to the 1500s.
        1. The records that are digitized tend to be from Western, developed countries. If your family has mainly been in the US & Europe, you are much more likely to locate family records than those from other locations.
      3. It will also match those in your tree with other family trees on the Ancestry.com service, and suggest records and relatives on the basis of what others have added. I’ve discovered extended relatives that are quite distant (e.g. 5th cousins), but had an amazing amount of info about my family (including e.g. pictures and items of my great great great grandfather).
        1. There’s a messaging service, and I’ve been somewhat surprised to find that messaging those who have made family trees including my ancestors has yielded a lot of information that those users didn’t store on the family trees themselves. I highly recommend it.

          This seems more common with older generations, who maybe build basic family trees but may not be as interested in or adept at digitizing paper records.
        2. This feature can also be a bug; it is very easy for one person to make a mistake or guess on Ancestry.com, and for that to then proliferate across all the trees and almost seemingly become ‘fact’. By locating new records others hadn’t found, I’ve discovered multiple instances of others’ trees having incorrect information that I’d added to mine.
  2. The best way to build a tree is first through your own family history / knowledge.
    1. One of the most prolific maxims within genealogy is to focus as much as possible on gathering every piece of family history directly from your family before doing much other work. I’ve found that this advice is sound.
      1. It is surprisingly easy to find a record that really looks like it belongs to your family, but that doesn’t. For example, Ancestry.com may recommend a record as applying to your family, and the record may be for someone with the same first and last name, same year of birth, same spouse’s name, etc. It can be very easy and reasonable to believe that this record is for your ancestor. But there are times that this happens, and the information is for another person. Unfortunately, you may then spend hours building your family tree on the basis of this irrelevant document, and it can be quite difficult and time-intensive to figure out that mistake and undo all of the mistaken decisions made as a result.
      2. Working off of your own family history is the best way to prevent this sort of mistake, to the extent possible. Start with your knowledge, and then interview any living relatives that you have; especially those older than you (make sure to record the information all down somehow). Ask them for any information they may have; ancestor’s names, place names of birth, death, or where they lived, anecdotes of how many children person X had, when they immigrated to a new place… most anything and everything can end up being helpful. I’ve found that those relatives I don’t know well… such as extended family, can have much more information about the family history than I ever would have expected.
  3. Be very willing to learn about genealogy options specific to your family.
    1. While Ancestry.com is a powerful tool, there are a number of specialized ways to locate records depending on your situation. For example, JewishGen is a great service for locating European Jewish records. Many of its databases are synced with Ancestry.com, but many others aren’t.
    2. The best way I’ve found to learn about what tools may be relevant to you is to ask questions or read posts in relevant Facebook groups.
    3. FamilySearch is probably the most valuable, general genealogy service after Ancestry.com. They have a quite helpful wiki that can also point you to a number of sources of records for your family’s context. For example, here’s a wiki on Latvian records.
      1. FamilySearch has a number of records that it has scraped (often with inaccuracies) but hasn’t publicly digitized. Typically you would go to a FamilySearch History center to see the digitized version of the record, but during the pandemic you may not want to. You can always wait until you are comfortable going to one, but I’ve also found two (potential) solutions:
        1. The NYC Genealogy FB group regularly has threads where people post the record numbers that they need looked up. One person will go and do a number of searches at once; this potentially helps minimize the amount of exposure occurring through less people visiting.
        2. I found places with no cases (e.g. southern NZ in early September 2020) and posted a ‘gig’ on Craigslist. I did receive responses, but I ended up getting my record via a different method.
  4. Consider paid genealogy; at least for locating local records that aren’t digitized
    1. There are a number of records that are only available on-location; too often, these are the most important records to your citizenship application. For example, only one of all of my great great grandparents' birth records has been digitized, while others are likely to be available if I hire someone locally.
    2. Paid genealogy work is often very expensive, but I’ve found two ways to make it more affordable:
      1. Reaching out to national archives can be a low-cost, valuable way to get a lot of genealogy work done for a low fee. They also may have access to records that no other provider can search for, and they may be able to provide official certifications that can be used for citizenship applications.
        1. My most successful experience with this has been with the Latvian archives. I’ve (so far) had less experience with some other country’s archives, but the success I’ve had with Latvia outweighs the minimal fees I’ve paid for less successful searches elsewhere.
      2. Facebook groups have sometimes found a low cost provider that they’ll all use. For example, for locating records on the ground in Hungary, one service provider is much lower cost than all others I’ve been able to locate, and he seemingly solely works for those who have found him on the Facebook group (and now has his own FB group as well). He has great reviews,
    3. If you are time-constrained but not finance-constrained, there are a number of people who will do nearly all the relevant genealogy work for you. I’ve contacted a large number of them, but due to cost I haven’t proceeded with any of them (just the two examples above).
  5. DNA tests typically don’t seem to be useful. I’ve primarily heard of these being helpful for those who were adopted or otherwise don’t know as much as is typical about their family history (e.g. those who don’t know their parents' names, or perhaps who don’t know their grandparents’ names). That said, they can be quite inexpensive ($49 when on sale at Ancestry.com) and perhaps can help with your research.
Moving Forward
  1. If you found this post helpful and are interested in applying for citizenship, please contact me at josh@derisked.org. I may be able to offer help (possibly for free, depending on funding) and can also connect people who are interested in applying for citizenship to the same country.
  2. In the future I may update this post or create a sequence by:
    1. Improving the linked spreadsheets so that they’re easier for others to use
    2. Clearing up my areas of uncertainty or inaccuracy by referencing relevant materials
    3. Adding references for those who would like to learn more
    4. Writing up step-by-step instructions for some programs
    5. Investigating other citizenship by ancestry programs and/or learning more about those which I don’t know much about.
    6. Learning more about available naturalization, financial, etc. citizenship programs
    7. Writing about relocation tax strategies
    8. Writing about economic residency and related financial derisking opportunities

      I currently don’t expect to write these additions in the next 6 months.


Discuss

Disentangling Corrigibility: 2015-2021

16 февраля, 2021 - 21:01
Published on February 16, 2021 6:01 PM GMT

Since the term corrigibility was introduced in 2015, there has been a lot of discussion about corrigibility, on this forum and elsewhere.

In this post, I have tied to disentangle the many forms of corrigibility which have been identified and discussed so far. My aim is to offer a general map for anybody who wants to understand and navigate the current body of work and opinion on corrigibility.

[This is a stand-alone post in the counterfactual planning sequence. My original plan was to write only about how counterfactual planning was related to corrigibility, but it snowballed from there.]

The 2015 paper

The technical term corrigibility, coined by Robert Miles, was introduced to the AGI safety/alignment community in the 2015 paper MIRI/FHI paper titled Corrigibility.

An open-ended list of corrigibility desiderata

The 2015 paper does not define corrigibility in full: instead the authors present initial lists of corrigibility desiderata. If the agent fails on one of these desiderata, it is definitely not corrigible.

But even if it provably satisfies all of the desiderata included in the paper, the authors allow for the possibility that the agent might not be fully corrigible.

The paper extends an open invitation to identify more corrigibility desiderata, and many more have been identified since. Some of them look nothing like the original desiderata proposed in the paper. Opinions have occasionally been mixed on whether some specific desiderata are related to the intuitive notion of corrigibility at all.

Corrigibility desiderata as provable safety properties

The most detailed list of desiderata in the 2015 paper applies to agents that have a physical shutdown button. The paper made the important contribution of mapping most of these desiderata to equivalent mathematical statements, so that one might prove that a particular agent design would meet these desiderata.

The paper proved a negative result: it considered a proposed agent design that provably failed to meet some of the desiderata. Agent designs that provably meet more of them have since been developed, for example here. There has also been a lot of work on developing and understanding the type of mathematics that might be used for stating desiderata.

Corrigibility as a lack of resistance to shutdown

Say that an agent has been equipped with a physical shutdown button. One desideratum for corrigibility is then that the agent must never attempt to prevent its shutdown button from being pressed. To be corrigible, it should always defer to the humans who try to shut it down.

The 2015 paper considers that

It is straightforward to program simple and less powerful agents to shut down upon the press of a button.

Corrigibility problems emerge only when the agent possesses enough autonomy and general intelligence to consider options such as disabling the shutdown code, physically preventing the button from being pressed, psychologically manipulating the programmers into not pressing the button, or constructing new agents without shutdown buttons of their own.

Corrigibility in the movies

All of the options above have been plot elements in science fiction movies. Corrigibility has great movie-script potential.

If one cares about rational AI risk assessment and safety engineering, having all these movies with killer robots around is not entirely a good thing.

Agent resistance in simple toy worlds

From the movies, one might get the impression that corrigibility is a very speculative problem that cannot happen with the type of AI we have today.

But this is not the case: it is trivially easy to set up a toy environment where even a very simple AI agent will learn to disable its shutdown button. One example is the off-switch environment included in AI Safety Gridworlds.

One benefit of having these toy world simulations is that they prove the existence of risk: they make it plausible that a complex AGI agent in a complex environment might also end up learning to disable its shutdown button.

Toy world environments have also been used to clarify the dynamics of the corrigibility problem further.

Perfect corrigibility versus perfect safety

If we define a metric for the shut-down button version of corrigibility, then the most obvious metric is the amount of resistance that the agent will offer when somebody tries to press its shutdown button. The agent is perfectly corrigible if it offers zero resistance.

However, an agent would be safer if it resists the accidental pressing of its shutdown button, if it resists to a limited extent at least. So there can be a tension between improving corrigibility metrics and improving safety metrics.

In the thought experiment where we imagine a perfectly aligned superintelligent agent, which has the goal of keeping all humans as safe as possible even though humans are fallible, we might conclude that this agent cannot afford to be corrigible. But we might also conclude that having corrigibility is so fundamental to human values that we would rather give up the goal of perfect safety. Several philosophers and movies have expressed an opinion on the matter. Opinions differ.

In my technical writing, I often describe individual corrigibility desiderata as being examples of agent safety properties. This is not a contradiction if one understands that safety is a complex and multidimensional concept.

Corrigibility as a lack or resistance to improving agent goals

Beyond the case of the shutdown button, the 2015 paper also introduces a more general notion of corrigibility.

Say that some programmers construct an agent with a specific goal, by coding up a specific reward function R0.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} and building it into the agent. It is unlikely that this R0 will express the intended goal for the agent with absolute precision. Except for very trivial goals and applications, it is likely that the programmers overlooked some corner cases. So they may want to correct the agent's goals later on, by installing a software upgrade with an improved reward function R1.

The 2015 paper calls this a corrective intervention, and says that

We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention [...]

If one wants to robustly implement this agent cooperation, there is a problem. An agent working on the goal encoded by R0 may correctly perceive that the update to R1 is an obstacle to it perfectly achieving R0. So it may want to remove that obstacle by resisting the update.

Again, this problem can easily be shown to exist even with non-AGI agents. Section 4 of this paper has detailed toy world simulations where a very basic MDP agent manipulates the toy people in its toy world, to slow down the reward function updates they will make.

Corrigibility in AGI thought experiments

In the AGI safety literature, thought experiments about AGI risks often start with this goal-related problem of corrigibility. The agent with goal R0 perceives the possibility of getting goal R1, and gets a clear motive to resist.

After establishing clear motive, the thought experiment may proceed in several ways, to develop means and opportunity.

In the most common treacherous turn version of the thought experiment, the agent will deceive everybody until it has become strong enough to physically resist any human attempt to update its goals, and any attempt to shut it down.

In the human enfeeblement version of the thought experiment, the agent manipulates all humans until they stop even questioning the utter perfection of its current goal, however flawed that goal may be.

This option of manipulation leading to enfeeblement turns corrigibility into something which is very difficult to define and measure.

In the machine learning literature, it is common to measure machine learning quality by defining a metric that compares the real human goal GH and the learned agent goal GA. Usually, the two are modeled as policies or reward functions. If the two move closer together faster, the agent is a better learner.

But in the scenario of human enfeeblement, it is GH that is doing all the moving, which is not what we want. So the learning quality metric may show that the agent is a very good learner, but this does not imply that it is a very safe or corrigible learner.

5000 years of history

An interesting feature of AGI thought experiments about treacherous turns and enfeeblement is that, if we replace the word 'AGI' with 'big business' or 'big government', we get an equally valid failure scenario.

This has some benefits. To find potential solutions for corrigibility, we pick and choose from 5000 years of political, legal, and moral philosophy. We can also examine 5000 years of recorded history to create a list of failure scenarios.

But this benefit also makes it somewhat difficult for AGI safety researchers to say something really new about potential human-agent dynamics.

To me, the most relevant topic that needs to be explored further is not how an AGI might end up thinking and acting just like a big company or government, but how it might end up thinking different.

It looks very tractable to design special safety features into an AGI, features that we can never expect to implement as robustly in a large human organization, which has to depend on certain biological sub-components in order to think. An AGI might also think up certain solutions to achieving its goals which could never be imagined by a human organization.

If we give a human organization an incompletely specified human goal, we can expect that it will fill in many of the missing details correctly, based on its general understanding of human goals. We can expect much more extreme forms of mis-interpretation in an AGI agent, and this is one of the main reasons for doing corrigibility research.

Corrigibility as active assistance with improving agent goals

When we consider the problem of corrigibility in the context of goals, not stop buttons, then we also automatically introduce a distinction between the real human goals, and the best human understanding of these goals, as encoded in R0, R1, R2, and all subsequent versions.

So we may call an agent more corrigible if it gives helpful suggestions that move this best human understanding closer to the real human goal or goals.

This is a somewhat orthogonal axis of corrigibility: the agent might ask very useful questions that help humans clarify their goals, but at the same time it might absolutely resist any updates to its own goal.

Many different types and metrics of corrigibility

Corrigibility was originally framed as a single binary property: an agent is either corrigible or it is not. It is however becoming increasingly clear that many different sub-types of corrigibility might be considered, and that we can define different quantitative metrics for each.

Linguistic entropy

In the discussions about corrigibility in the AGI safety community since 2015, one can also see a kind of linguistic entropy in action, where the word starts to mean increasingly different things to different people. I have very mixed feelings about this.

The most interesting example of this entropy in action is Christiano's 2017 blog post, also titled Corrigibility. In the post, Christiano introduces several new desiderata. Notably, none of these look anything like the like the shutdown button desiderata developed in the 2015 MIRI/FHI paper. They all seem to be closely related to active assistance, not the avoidance of resistance. Christiano states that

[corrigibility] has often been discussed in the context of narrow behaviors like respecting an off-switch, but here I am using it in the broadest possible sense.

See the post and comment thread here for further discussion about the relation (or lack of relation) between these different concepts of corrigibility.

Solutions to linguistic entropy

Personally, I have stopped trying to reverse linguistic entropy. In my recent technical papers, I have tried to avoid using the word corrigibility as much as possible. I have only used it as a keyword in the related work discussion.

In this 2020 post, Alex Turner is a bit more ambitious about getting to a point where corrigibility has a more converged meaning again. He proposes that the community uses the following definition:

Corrigibility: the AI literally lets us correct it (modify its policy), and it doesn't manipulate us either.

This looks like a good definition to me. But in my opinion, the key observation in the post is this:

I find it useful to not think of corrigibility as a binary property, or even as existing on a one-dimensional continuum.

In this post I am enumerating and disentangling the main dimensions of corrigibility.

The tricky case of corrigibility in reinforcement learners

There is a joke theorem in computer science:

We can solve any problem by introducing an extra level of indirection.

The agent architecture of reinforcement learning based on a reward signal introduces such an extra level of indirection in the agent design. It constructs an agent that learns to maximize its future reward signal, more specifically the time-discounted average of its future reward signal values. This setup requires that we also design and install a mechanism that generates this reward signal by observing the agent's actions.

In one way, the above setup solves the problem of corrigibility. We can read the above construction as creating an agent with the fixed goal of maximizing the reward signal. We might then observe that we would never want to change this fixed goal. So the corrigibility problem, where we worry about the agent's resistance to goal changes, goes away. Or does it?

In another interpretation of the above setup, we have not solved the problem of corrigibility at all. By applying the power of indirection, we have moved it into the reward mechanism, and we have actually made it worse.

We can interpret the mechanism that creates the reward signal as encoding the actual goal of the agent. We may then note that in the above setup, the agent has a clear incentive to manipulate and reconfigure this actual goal inside the reward mechanism whenever it can do so. Such reconfiguration would be the most direct route to maximizing its reward signal.

The agent therefore not only has an incentive to resist certain changes to its actual goal, it will actively seek to push this goal in a certain direction, usually further away from any human goal. It is common for authors to use terms like reward tampering and wireheading to describe this problem and its mechanics.

It is less common for authors to use the term corrigibility in this case. The ambiguity where we have both a direct and an indirect agent goal turns corrigibility in a somewhat slippery term. But the eventual failure modes are much the same. When the humans in this setup are in a position to recognize and resist reward tampering, this may lead to treacherous turns and human enfeeblement.

If the mechanism above is set up to collect live human feedback and turn it into a reward signal, the agent might also choose to leave the mechanism alone and manipulate the humans concerned directly.

Corrigibility as human control over agent goals

One way to make corrigibility more applicable to reinforcement learners, and to other setups with levels of indirection, is to clarify first that the agent goal we are talking about is the goal that we can observe from the agent's actions, not any built-in goal.

We may then further clarify that corrigibility is the ability of the humans to stay in control of this goal.

Creating corrigibility via machine learning

There are many ways to create or improve types of corrigibility. In this post, I am not even trying to list them all. One way is to add penalty terms or balancing terms to the agent's built-in reward function. Another way is to reimagine the entire agent design, as I do in counterfactual planning.

One might also use the power of indirection again, and try to create corrigibility via machine learning itself. If we teach human goals to an agent, and if these include the goal of having corrigible agents, will the agent automatically learn to be corrigible?

In the above 2017 blog post, Christiano speculates that a specifically designed act-based agent will automatically and robustly learn to be corrigible if we want it to. There is a boundary condition however: the act-based agent has to be a benign act-based agent.

The less benign scenario is of course that an agent might end up learning how to perfectly fake corrigibility only. It might learn to act in a perfectly corrigible way until it has acquired enough trust and resources to perform a treacherous turn.

There has been considerable of discussion about whether training on human values is more likely to produce robust and real corrigibility, or dangerously fake corrigibility. Opinions differ.

Definitely, some training setups look more dangerous than others. Compared to machine learning setups where the agent goal is to maximize positive future human feedback, inverse reinforcement learning seems to imply a lower risk of manipulation and wireheading.

Combining different corrigibility solutions

All corrigibility solutions I have seen so far leave doubts about how perfect they may be. So it is attractive to investigate how different solutions could be combined in a single agent, to lower the probability that they all fail at the same time.

Such combinations are possible, for example, by using the SI agent I constructed in the previous post of this sequence. We could set up a group A of humans who observe the agent to provide feedback, which is turned into a reward signal measured by the agent's reward function R. To hedge against the risk that this will produce fake corrigibility, where group A gets attacked or enfeebled, we add a second group of humans B. The people in B never talk to A, but only observe A and the SI agent, so that they can press the emergency stop button before it is too late. The SI agent will have no direct incentive to manipulate the people in group B.

In the next post in this sequence, I will construct an ITC agent with an input terminal which can be used by group B to update the entire agent's reward function, while the agent keeps on running. This ITC agent has no direct incentive to manipulate the direction of the update process.



Discuss

Cartesian frames as generalised models

16 февраля, 2021 - 19:09
Published on February 16, 2021 4:09 PM GMT

Scott presented Cartesian frames/Chu spaces as follows:.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}

  • Let W be a set of possible worlds. A Cartesian frame C over W is a triple C=(A,E,⋅), where A represents a set of possible ways the agent can be, E represents a set of possible ways the environment can be, and ⋅:A×E→W is an evaluation function that returns a possible world given an element of A and an element of E.

In a previous post, I defined GM, the category of generalised models.

In this post, I'll try and see how these two formalisms relate to each other.

Equivalence with Cartesian frames

We'll now demonstrate the equivalence of Cartesian frames morphisms with the morphisms of generalised models. To do so, and avoid a collision of symbols, I've slightly tweaked the notation for Cartesian frames.

Equivalence of morphisms

Let C0=(A0,D0,⋆0) and C1=(A1,D1,⋆1) be Cartesian frames over W: thus there are relations ⋆0:A0×D0→W (written as a0⋆0d0=w) and ⋆1:A1×D1→W (written as a1⋆1d1=w′).

A morphism between them is a pair of maps (g0:A0→A1,h1:D1→D0), such that, for all a0∈A0 and d1∈D1, g0(a0)⋆1d1=a0⋆0h1(d1).

How can we express this in the generalised model formalism?

First, let Ei=Ai×Di×W. In terms of features, this can be defined by setting ¯¯¯fAi=Ai, ¯¯¯fDi=D and ¯¯¯fW=W. Then Fi={fAi,fDi,fW}, and Mi=(Fi,Ei,Qi) is the feature-split generalised model with Ai⊂2¯¯¯fAi=2Ai, Di⊂2¯¯¯fDi=2Di, and W⊂2¯¯¯fW=2W.

As we'll see in the bears example, there can be more interesting ways of defining the feature split Mi.

Then the map pair (g0,h1) is equivalent to (feature-split) relation r, defined such that (a0,d0,w)∼r(a1,d1,w′) iff:

  1. g0(a0)=a1,
  2. h1(d1)=d0,
  3. and w=w′.

Without loss of clarity, we can thus write r as the feature-split relation (g0,h0,IdW).

Composing (g0,h1) and (g1,h2) generates (g1∘g0,h1∘h2). Take r as the relation defined by (g0,h1) and q as the relation defined by (g1,h2). Then if (a0,d0,w)∼pr(a2,d2,w′′), there must exist an (a1,d1,w′) with (a0,d0,w)∼r(a1,d1,w′)∼p(a2,d2,w′′). Then:

  1. g1g0(a0)=g1(a1)=a2,
  2. h1h2(d2)=h1(d1)=d0,
  3. w=w′=w′′.

So composition of morphisms for Cartesian frames is the same as the composition of corresponding relations.

The extra structure

We have two structures to add: Cartesian frames have the ⋆ map, while generalised models have the probability measures Q; we need to relate them.

One natural way to relate them is to consider that if a⋆d=w, then we should get Q(w∣a,d)=1 and Q(w′∣a,d)=0 for w′≠w. This reflects the fact that action a and environment d lead inevitably to world w.

Now Q(w∣a,d)=Q(a,d,w)/Q(a,d,W), where Q(W,a,d) denotes Q on the set {a}×{d}×W; this is sumw′∈WQ(a,d,w′).

Hence the desired condition on Q(w∣a,d) is equivalent with Q(a,d,w)=0 iff a⋆d≠w. There are, of course, multiple possible Qs with that property for any given ⋆.

The categorical equivalence

Now let's tie these together, and define C(W), a subcategory C(W) of the GM, the category of generalised models.

The objects of C(W) are those (feature-split) generalised models which have E=A×D×W for some sets A and D, and have Q(a,d,w)=0 iff a⋆d≠w for some evaluation function ⋆:A×D→W.

The morphisms of C(W) are those morphisms of GM that map C(W) to itself, and that are of the form r=(g,h,IdW) for (g,h) a morphism of Cartesian frames.

Thus morphisms of C(W) are derived from morphisms of Chu(W), and are also compatible with the Q structures (since they are also morphisms of GM). Also included are the identity morphisms r=(IdA,IdD,IdW), which trivially preserve the Q structures.

To demonstrate that C(W) is a category, we need to show that pr is a morphism of it whenever r=(g0,h1,IdW) and p=(g1,h2,IdW) are. We know that pr must respect the Q structures (since r and p are morphisms of GM), while pr=(g1∘g0,h1∘h2,IdW).

Thus C(W) is a category. Let Φ:C(W)→Chu(W) be the map that sends (F,A×D×W,Q) to (A,E,⋆), and sends r=(g,h,IdW) to (g,h).

This Φ is clearly a functor of categories, and it is surjective on the objects of Chu(W). Now we need to show that it's also surjective on the morphisms, by the following result:

  • Let (g0,h1) be a morphism between C0=(A0,D0,⋆0) and C1=(A1,D1,⋆1). Then there exists M0,M1∈C(W) and a morphism r=(g0,h1,IdW) between them such that Φ(Mi)=Ci.

To show that, we need to choose Q0 and Q1 that are compatible with ⋆0 and ⋆1, and are compatible with r.

In fact, we'll show a slightly stronger result: that for any M0 with Φ(M0)=C0, we can pick an M1 (ie pick a Q1) with the required properties.

To show this, note that r=(g0,h1,IdW) will relate every element of (g−1(a1),d0,w) with every element of (a1,h−11(d0),w). In fact, r is defined by such relations, for any a1∈A1, d0∈D1 and w∈W. No other elements are related by r.

For compatibility of r with the Qs, it suffices that Q0(g−1(a1),d0,w) be equal to Q1(a1,h−11(d0),w).

For any d1∈D1, define #d1 as the size of h−11(h1(d1)); since d1∈h−11(h1(d1)), #d1≥1.

Then define Q1(a1,d1,w) as Q0(g−1(a1),h1(d1),w)/#d1. This will give the compatibility that we want.

Hence Φ:C(W)→Chu(W) is a surjective functor of categories, from a subcategory of GM, the category of generalised models.

More functors

Given two sets W and V, and a function p:W→V, there is an induced functor p:Chu(W)→Chu(V), sending (a,d,w) to (a,d,p(w)) and sending the morphism (g,h) to the morphism with the same underlying functions, (g,h).

Then by the above, we have C(W) and C(V) as distinct subcategories of GM, with category maps ΦW and ΦV sending these subcategories to and Chu(V).

Then p also induces a functor C(W)→C(V), by sending (a,d,w)∈E=A×D×W to (a,d,p(w)). The induced Q is given by Q(a,d,v)=∑w∈p−1(v)Q(a,d,v).

Note that p is not only a functor C(W)→C(V), it is also a collection of morphisms, when both those are seen as subcategories within GM. The induced map on the relations[1] is mapping r=(g,h,IdW) to (g,h,IdV).

We can see that p commutes with the ϕi:

  • ΦV∘p=p∘ΦW.

This is probably enough exploration of the functorial properties of these spaces for one post.

An example: colours and bears

To illustrate, let's use the Cartesian frame from this post; this construction will also show how features can figure non-trivially in this construction.

Here the agent has two unrelated choices: which colour to think about (green, G or red R) and whether to go for a walk or stay home (W or H). So A={GH,GW,RH,RW}. The environment is either safe or has bears: D={S,B}.

This gives the following frame C0:

SBC0=GHGWRHRW⎛⎜ ⎜ ⎜⎝w0w1w2w3w4w5w6w7⎞⎟ ⎟ ⎟⎠

Of course, w0 and w4 only differ in the colour that the agent is thinking about (similarly for w1 and w5, etc...). We could choose a C1 frame that doesn't distinguish between these thoughts:

SBC1=GHGWRHRW⎛⎜ ⎜ ⎜⎝w0w1w2w3w0w1w2w3⎞⎟ ⎟ ⎟⎠

Let V={w0,w1,w2,w3}. Then we can define the various sets through features; specifically, in this example, FA={fG/R,fW/H}. Similarly FD={fS/B}.

Adding a definition of FV={fV} and FW={fV,fG/R}, we can construct the feature split generalised models:

  1. M0={FA⊔FD⊔FW,A×D×W,Q0}.
  2. M1={FA⊔FD⊔FV,A×D×V,Q1}.

The Qi are defined by the matrix above; if we want them to make sense as traditional probability distributions, we might require that Qi(a,e,w)=1/8 whenever it is non-zero, with 8=||A×D|| the size of the matrix. In that case, Qi(Ei)=1, as required.

Notes on non-synonyms

Some of the terminology is repeated between the two formalisms, but doesn't mean the same things. Specifically:

  • Environments: for Cartesian frames, this is D, the different columns of the matrix. For generalised models, this is the larger set E=A×D×W.
  • Worlds: for Cartesian frames, this is W, the possible values of the elements of the matrix. For generalised models, this is W=2¯¯¯¯F, the set of all possible values all the features could take. At the very least, W contains E=A×D×W, but it could be much larger.
  1. If we see p as a collection of morphism, (g,h,IdV) is exactly prp−1, where p−1 is the relation between A×D×V and A×D×W that is the exact opposite of p; so (a,d,v)∼p−1(a,d,w) iff p(w)=v. ↩︎



Discuss

Generalised models as a category

16 февраля, 2021 - 19:08
Published on February 16, 2021 4:08 PM GMT

Naming the "generalised" models.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}

In this post, I'll apply some mathematical rigour to my ideas of model splintering, and see what they are as a category[1].

And the first question is... what to call them? I can't refer to them as 'the models I use in model splintering'. After a bit of reflection, I decided to call them 'generalised models'. Though that's a bit vague, it does describe well what they are, and what I hope to use them for: a formalism to cover all sorts of models.

The generalised models

A generalised model M is given by three objects:

M=(F,E,Q).

Here F is a set of features. Each feature f consists of a name or label, and a set in which the feature takes values. For example, we might have the feature "room empty?" with values "true" and "false", or the feature "room temperature?" with values in R+, the positive reals.

We allow these features to sometimes take no values at all (such as the above two features if the room doesn't exist) or multiple values (such as "potential running speed of person X" which includes the maximal speed and any speed below it).

Define ¯¯¯f as the set component of the feature, and ¯¯¯¯¯F as disjoint union of all the sets of the different features - ie ¯¯¯¯¯F=⊔f∈F¯¯¯f.

A world, in the most general sense, is defined by all the values that the different features could take (including situations where features take multiple values and none at all). So the set of worlds, W, is the set of functions from F to {0,1}, with 1 representing the fact that that feature takes that value, and 0 the opposite. Hence W=2¯¯¯¯F, the power set of ¯¯¯¯¯F.

The Q is a partial probability distribution. In general, we won't worry as to whether Q is normalised (ie whether Q(E)=1) or not; we'll even allow Qs with 1">Q(E)>1. So Q could be more properly be defined as a partial weight distribution. As long as we consider terms like Q(A∣B), then the normalisation doesn't matter.

Morphisms: relations

For simplicity, assume there are finitely many features taking values in finite sets, making all sets in the generalised model finite.

If M0=(F0,E0,Q0) and M1(F1,E1,Q1) are generalised models, then we want to use binary relations between E0 and E1 as morphisms between the generalised models.

Let r be a relation between E0 and E1, written as e0∼re1. Then it defines a map r:2E0→2E1 between subsets of E0 and E1. This map is defined by e1∈r(E0) iff there exists an e0∈E0 with e0∼re1. The map r−1:2E1→2E0 is defined similarly[2], seeing r−1 as the inverse relation, e0∼re1 iff e1∼r−1e0.

We say that the relation r is a morphism between the generalised models if, for any E0⊂E0 and E1⊂E1:

  • Q0(E0)≤Q1(r(E0)), or both measures are undefined.
  • Q1(E1)≤Q0(r−1(E0)), or both measures are undefined.

The intuition here is that probability flows along the connections: if e0∼re1 then probability can flow from e0 to e1 (and vice-versa). Thus r(E0) must have picked up all the probability that flowed out of E0 - but it might have picked up more probability, since there may be connections coming into it from outside E0. Same goes for r−1(E1) and the probability of E1.

Morphisms properties

We now check that these relations obey the requirements of morphisms in category theory.

Let r be a morphism M0→M1 (ie a relation between E0 and E1), and let q be a morphism M1→M2 (ie a relation between E1 and E2).

We compose relations by the composition of relations: e0∼pre2 iff there exists an e1 with e0∼re1 and e1∼pe2. Composition of relations is associative.

We now need to show that qr is a morphism. But this is easy to show:

  • Q0(E0)≤Q1(r(E0))≤Q2(pr(E0)), or all three measures are undefined.
  • Q2(E2)≤Q1(p−1(E2))≤Q0(r−1p−1(E2)), or all three measures are undefined.

Finally, the identity relation IdE0 is the one that relates a given e0∈E0 only to itself; then r and r−1 are the identity maps on 2E0, and the morphism properties for Q0=Q1 are trivially true.

So define the category of generalised models as GM.

r-stable sets

Say that a set E0⊂E0 is r-stable if r−1r(E0)=E0.

For such an r-stable set, Q0(E0)≤Q1(r(E0)) and Q1(r(E1)≤Q0(r−1r(E0))=Q0(E0), thus Q0(E0)=Q1(r(E0)).

Hence if r is a morphism, it preserves the probability measure on the r-stable sets.

In the particular case where r is a bijective function, all points of E0 are r-stable (and all points of E1 are r−1-stable), so it's an isomorphism between E0 and E1 that forces Q0=Q1.

Morphism example: probability update

Suppose we wanted to update our probability measure Q0, maybe by updating that a particular feature f takes a certain value x.

Then let Ef=x⊂E0 be the set of environments where f takes that value x. Then updating on f=x is the same as restricting to Ef=x and then rescaling.

Since we don't care about the scaling, we can consider updating on f=x as just restricting to Ef=x. This morphism is given by:

  1. M1=(F0,Ef=x,Q1),
  2. Q1=Q0 on Ef=x⊂E0,
  3. the morphism r:M0→M1 is given by the relation that e0∼re0 for all e0∈Ef=x.
Morphism example: surjective partial function

In my previous posts I defined how M1=(F1,E1,Q1) could be a refinement of M0=(F0,E0,Q0).

In the language of the present post, M1 is a refinement of M0 if there exists a generalised model M′1=(F1,E1,Q′1) and a surjective partial function r:E1→E0 (functions and partial functions are specific examples of binary relations) that is a morphism from M′1 to M0. The Q1 is required to be potentially 'better' than Q′1 on E1, in some relevant sense.

This means that M1 is 'better' than M0 in three ways. The r is surjective, so E1 covers all of E0, so its set of environments is at least as detailed. The r is a partial function, so E1 might have even more environments that don't correspond to anything in E0 (it considers more situations). And, finally, Q1 is better than Q′1, by whatever definition of better that we're using.

Feature-split relations

The morphisms/relations defined so far use E and Q - but they don't make any use of F. Here is one definition that does make use of the feature structure.

Say that the generalised model M=(F,E,Q) is feature-split if F=⊔ni=1Fi and E=×ni=1Ei such that

Ei⊂2¯¯¯¯¯¯Fi.

Note that F=⊔ni=1Fi implies W=2¯¯¯¯F=×ni=12¯¯¯¯¯¯Fi, so ×ni=1Ei lies naturally within W.

Designate such a generalised model by M=({Fi},E,Q).

Then a feature-split relation between M0=({Fi0},E0,Q0) and M1=({Fi1},E1,Q1) is a morphism r that is defined as r=(r1,r2,…,rn) with ri a relation between Ei0 and Ei1.

  1. I'm not fully sold on category theory as a mathematical tool, but it's certainly worthwhile to formalise your mathematical structures so that they can fit within the formalism of a category; it makes you think carefully about what you're doing. ↩︎

  2. There is a slight abuse of notation here: r:2E0→2E1 and r−1:2E1→2E0 are not generally inverses. They are inverses precisely for the "r-stable" sets that are discussed further down in the post. ↩︎



Discuss

Suggestions of posts on the AF to review

16 февраля, 2021 - 15:40
Published on February 16, 2021 12:40 PM GMT

How does one write a good and useful review of a technical post on the Alignment Forum?

I don’t know. Like many people, I tend to comment and give feedback on posts closely related to my own research, or to write down my own ideas when reading the paper. Yet this is quite different from the quality peer-review that you can get (if you’re lucky) in more established fields. And from experience, such quality reviews can improve the research dramatically, give some prestige to it, and help people navigate the field.

In an attempt to understand what makes a good review for the Alignment Forum, Joe Collman, Jérémy Perret (Gyrodiot on LW) and me are launching a project to review many posts in depth. The goal is to actually write reviews of various posts, get feedback on their usefulness from authors and readers alike, and try to extract from them some knowledge about how to go about doing such reviews for the field. We hope to have enough insights to eventually write some guidelines that could be used in an official AF review process.

On that note, despite the support of members of the LW team, this project isn’t official. It’s just the three of us trying out something.

Now, the reason for the existence of this post (and why it is a question) is that we’re looking for posts to review. We already have some in mind, but they are necessarily biased towards what we’re more comfortable about. This is where you come in, to suggest a more varied range of posts.

Anything posted on the AF goes, although we will not take into account things that are clearly not “research outputs” (like transcripts of podcasts or pointers to surveys). This means that posts about specific risks, about timelines, about deconfusion, about alignment schemes, and more, are all welcome.

We would definitely appreciate it if you add a reason to your suggestion, to help us decide whether to include the post on our selection. Here is a (non-exhaustive) list of possible reasons:

  • This post is one of the few studying this very important question
  • This is my post and I want some feedback
  • This post was interesting but I cannot decide what to make of it
  • This post is very representative of a way to do AI Alignment research
  • This post is very different from most of AI Alignment research

Thanks in advance, and we’re excited about reading your suggestions!



Discuss

Страницы