# LessWrong.com News

A community blog devoted to refining the art of rationality
Updated: 5 minutes 52 seconds ago

### What is Abstraction?

December 6, 2019 - 23:30
Published on December 6, 2019 8:30 PM UTC

• We have a gas consisting of some huge number of particles. We throw away information about the particles themselves, instead keeping just a few summary statistics: average energy, number of particles, etc. We can then make highly precise predictions about things like e.g. pressure just based on the reduced information we've kept, without having to think about each individual particle. That reduced information is the "abstract layer" - the gas and its properties.
• We have a bunch of transistors and wires on a chip. We arrange them to perform some logical operation, like maybe a NAND gate. Then, we throw away information about the underlying details, and just treat it as an abstract logical NAND gate. Using just the abstract layer, we can make predictions about what outputs will result from what inputs. Note that there’s some fuzziness - 0.01 V and 0.02 V are both treated as logical zero, and in rare cases there will be enough noise in the wires to get an incorrect output.
• I tell my friend that I'm going to play tennis. I have ignored a huge amount of information about the details of the activity - where, when, what racket, what ball, with whom, all the distributions of every microscopic particle involved - yet my friend can still make some reliable predictions based on the abstract information I've provided.
• When we abstract formulas like "1+1=2*1" and "2+2=2*2" into "n+n=2*n", we're obviously throwing out information about the value of n, while still making whatever predictions we can given the information we kept. This is what abstraction is all about in math and programming: throw out as much information as you can, while still maintaining the core "prediction" - i.e. the theorem or algorithm.
• I have a street map of New York City. The map throws out lots of info about the physical streets: street width, potholes, power lines and water mains, building facades, signs and stoplights, etc. But for many questions about distance or reachability on the physical city streets, I can translate the question into a query on the map. My query on the map will return reliable predictions about the physical streets, even though the map has thrown out lots of info.
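The NAND-gate example above can be sketched in a few lines of Python. This is purely illustrative: the voltages, the 2.5 V threshold, and the function names are all invented for the sketch, not taken from any real circuit.

```python
def to_logic(voltage):
    """Abstraction: throw away the exact voltage, keep only the logical level."""
    return voltage > 2.5  # threshold value is invented for illustration

def nand_concrete(v_a, v_b):
    """'Concrete' gate: maps input voltages to an output voltage (toy model)."""
    return 0.02 if (v_a > 2.5 and v_b > 2.5) else 4.98

def nand_abstract(a, b):
    """Abstract layer: pure logic, no voltages at all."""
    return not (a and b)

# The abstract prediction matches the abstracted concrete behavior,
# even though the abstract layer never sees a single voltage:
for v_a in (0.01, 4.97):
    for v_b in (0.03, 5.0):
        assert to_logic(nand_concrete(v_a, v_b)) == \
               nand_abstract(to_logic(v_a), to_logic(v_b))
```

The fuzziness mentioned above lives entirely in `to_logic`: 0.01 V and 0.02 V both map to logical zero, and the abstract layer is only reliable insofar as real noise stays on the right side of the threshold.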

The general pattern: there’s some ground-level “concrete” model (or territory), and an abstract model (or map). The abstract model throws away or ignores information from the concrete model, but in such a way that we can still make reliable predictions about some aspects of the underlying system.

Notice that the predictions of the abstract models, in most of these examples, are not perfectly accurate. We're not dealing with the sort of "abstraction" we see in e.g. programming or algebra, where everything is exact. There are going to be probabilities involved.

In the language of embedded world-models, we're talking about multi-level models: models which contain both a notion of "table", and of all the pieces from which the table is built, and of all the atoms from which the pieces are built. We want to be able to use predictions from one level at other levels (e.g. predict bulk material properties from microscopic structure, or predict from material properties whether it's safe to sit on the table), and we want to move between levels consistently.

Formalization: Starting Point

To repeat the intuitive idea: an abstract model throws away or ignores information from the concrete model, but in such a way that we can still make reliable predictions about some aspects of the underlying system.

So to formalize abstraction, we first need some way to specify which "aspects of the underlying system" we wish to predict, and what form the predictions take. The obvious starting point for predictions is probability distributions. Given that our predictions are probability distributions, the natural way to specify which aspects of the system we care about is via a set of events or logic statements for which we calculate probabilities. We'll be agnostic about the exact types for now, and just call these "queries".

To illustrate a bit, let's identify the concrete model, class of queries, and abstract model for a few of the examples from earlier.

• Ideal Gas:
• Concrete model MC is the full set of molecules, their interaction forces, and a distribution representing our knowledge about their initial configuration.
• Class of queries Q consists of combinations of macroscopic measurements, e.g. one query might be "pressure = 12 torr & volume = 1 m^3 & temperature = 110 K".
• For an ideal gas, the abstract model MA can be represented by e.g. temperature, number of particles (of each type if the gas is mixed), and container volume. Given these values and assuming a near-equilibrium initial configuration distribution, we can predict the other macroscopic measurables in the queries (e.g. pressure).
• Tennis:
• Concrete model MC is the full microscopic configuration of me and the physical world around me as I play tennis (or whatever else I do).
• Class of queries Q is hard to sharply define at this point, but includes things like "John will answer his cell phone in the next hour", "John will hold a racket and hit a fuzzy ball in the next hour", "John will play Civ for the next hour", etc - all the things whose probabilities change on hearing that I'm going to play tennis.
• Abstract model MA is just the sentence "I am going to play tennis".
• Street Map:
• Concrete model MC is the physical city streets
• Class of queries Q includes things like "shortest path from Times Square to Central Park starts by following Broadway", "distance between the Met and the Hudson is less than 1 mile", etc - all the things we can deduce from a street map.
• Abstract model MA is the map. Note that the physical map also includes some extraneous information, e.g. the positions of all the individual atoms in the piece of paper/smartphone.
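The ideal gas case can be made concrete with a small sketch: the abstract model MA is just the tuple (N, V, T), and a query about another macroscopic measurable (here, pressure) is answered from those summary statistics alone via the ideal gas law. The numbers and function names are invented for illustration, and the sketch uses SI units rather than the torr/m^3 mix in the example query above.

```python
K_B = 1.380649e-23  # Boltzmann constant, J/K

def pressure(n_particles, volume_m3, temperature_k):
    """Abstract model M_A: predict pressure from summary statistics alone,
    with no reference to any individual particle (ideal gas law, P = N*k*T/V)."""
    return n_particles * K_B * temperature_k / volume_m3

def answer_query(n_particles, volume_m3, temperature_k, claimed_pressure_pa, rel_tol=0.01):
    """Evaluate a query like 'pressure = P & volume = V & temperature = T'
    against the abstract model's prediction."""
    predicted = pressure(n_particles, volume_m3, temperature_k)
    return abs(predicted - claimed_pressure_pa) / claimed_pressure_pa < rel_tol
```

The point of the sketch is that the concrete model (every molecule's position and momentum) never appears: the whole query class is answered through three numbers.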

Already with the latter two examples, there seems to be some "cheating" going on in the model definition: we just define the query class as all the events/logic statements whose probabilities change based on the information in the map. But if we can do that, then anything can be an "abstract map" of any "concrete territory", with the queries Q taken to be the events/statements about the territory which the map actually has some information about - not a very useful definition!

Natural Abstractions

Intuitively, it seems like there exist "natural abstractions" - large sets of queries on a given territory which all require roughly the same information. Statistical mechanics is a good source of examples - from some macroscopic initial conditions, we can compute whatever queries we want about any macroscopic measurements later on. Note that such natural abstractions are a property of the territory - it's the concrete-level model which determines what large classes of queries can be answered with relatively little information.

For now, I'm interested primarily in abstraction of causal dags - i.e. cases in which both the concrete and abstract models are causal dags, and there is some reasonable correspondence between counterfactuals in the two. In this case, the set of queries should include counterfactuals, i.e. do() operations in Pearl's language. (This does require updating definitions/notation a bit, since our queries are no longer purely events, but it's a straightforward-if-tedious patch.) That's the main subject I'm researching in the short term: what are the abstractions which support large classes of causal counterfactuals? Expect more posts on the topic soon.
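A minimal sketch of what "reasonable correspondence between counterfactuals" could mean, using an invented toy model: a concrete causal DAG X → Z → Y, and an abstract DAG X → Y that throws away the mediator Z. The abstraction supports the counterfactual query class if both models give the same answer to do() queries on X.

```python
def concrete_do_x(x):
    """Concrete model: simulate the full mechanism after intervening do(X = x)."""
    z = 2 * x   # mechanism Z := 2X
    y = 3 * z   # mechanism Y := 3Z
    return y

def abstract_do_x(x):
    """Abstract model: Z has been marginalized into a single composed mechanism."""
    return 6 * x  # mechanism Y := 6X

# The abstraction answers every do(X = x) query exactly like the concrete model,
# despite having thrown away the intermediate variable Z entirely:
for x in range(-3, 4):
    assert concrete_do_x(x) == abstract_do_x(x)
```

Real cases are probabilistic rather than deterministic like this toy, but the shape of the requirement is the same: the interventional queries in Q must come out the same whether computed in MC or MA.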

Discuss

### New things I understand (or think I do)

December 6, 2019 - 21:08
Published on December 6, 2019 6:08 PM UTC

As I grow older, I realize how some things are not how I expected them to be. It’s not that I learn new facts – it’s more on the level of experience, or intuition.

One new intuition is how much effort, attention and work is required to do something remarkably well. I think my intuition was off here because of school. In school, you don’t really have to invest extraordinary effort to get the highest possible results. It depends on the school, sure, but for the most part, just your regular high effort will do. You don’t need to really think outside of the box, invest too many extra hours, make yourself a better decision-maker, improve your own capability to prioritize etc. Usually, it’s enough to just study a bit more, and do a bit more in and after class.

This calibrates you to expect that doing really well in the real world is just as easy, and it’s not. At least it wasn’t for me. This isn’t really a criticism of childhood or school. I’m not about to go full boomer and say that “kids have it easy”. In fact, I don’t think that you should make school harder. You should probably change certain things, make other things more applicable to the real world, which may or may not include making it harder, but you shouldn’t just try to make it harder just because the real world is hard.

“Hell is other people.” Sartre meant whatever he meant when he wrote Huis Clos, but I’ll take this quote and let it serve as the starting point for another realization I had. (On that note, there’s something from college that I don’t fully remember but it goes along these lines: Roland Barthes, a literary critic, said that the author was unimportant in the interpretation of a text. The meaning is given by the reader. So I’m not really doing anything extraordinary here by taking a quote and giving it my own take.)

I used to work in a high school. I was an assistant – I worked with disabled kids and helped them go about their day. The job was terribly unsuited for my personality, and to this day I don’t really understand why I had it. I mean, I understand my reasons, I was terribly broke and needed the money, but I don’t understand how or why I was hired. I had (and still have) no qualifications to work with disabled kids. But someone decided that I was sufficiently qualified to work with a kid deep on the autistic spectrum, and then someone decided that I was also qualified to work with a wheelchair-bound kid with epileptic seizures and really extensive mental retardation. It’s not only that I had zero training or qualifications for such a job – I’m also not really a caregiver, personality-wise. It’s just not the type of thing I’m good at nor enjoy, as I very quickly found out. I still came to work though, and gave it my best to help these kids in their day-to-day, but it was just a weird experience overall.

Back to the quote. As I learned in school, both as a student and as an employee, humans build social structures. Take, for example, the notion of a country. There are no actual countries out there. These are mental artifacts. If you fly over all of Earth, you’ll see people living in their houses, talking to their neighbors, commuting to work and so on. You won’t see countries. You’ll infer the existence of countries from the behavior of people and from border-crossing rituals of verification and car searches, but you won’t see any physical countries. Countries are just common concepts we have – they are names for the types of behavior we expect. Birds don’t see any countries – they just see terrain. A bird doesn’t really care if it’s in Portugal or Spain. Except vultures; vultures don’t cross the border into Portugal.

Just like countries, schools are also these weird social structures that don’t actually exist. Schools are just a bunch of people going to a big building every day and sitting there for some time, and then going back home. That’s a low-resolution view. Let’s increase the resolution. You could say that most of the people that go to this building are there to learn about how the world works, and a tiny part of the people there are older and more experienced and explain to the younger people how the world works. But schools are so much more than that. They are inseparable from the weird social interactions that occur. Yeah sure, kids learn in school. What else do they do? They form bullying circles. They learn to fear authority. They do drugs. They pressure others into doing things that they wouldn’t otherwise do. They express or repress their sexuality. They participate in us vs. them schemes. They do others harm. They help others. They live through drama, some live through trauma. They start hating their immediate environment. Some don’t.

There’s a lot more going on than just going to a building to sit and learn about the world:

“The reason baboons are such good models is, like us, they don’t have real stressors,” [Sapolsky] said. “If you live in a baboon troop in the Serengeti, you only have to work three hours a day for your calories, and predators don’t mess with you much. What that means is you’ve got nine hours of free time every day to devote to generating psychological stress toward other animals in your troop. So the baboon is a wonderful model for living well enough and long enough to pay the price for all the social-stressor nonsense that they create for each other. They’re just like us: They’re not getting done in by predators and famines, they’re getting done in by each other.” SOURCE

And you might say: “Damn, that’s a pretty fucking grim way of looking at the world”, and you’d be right, it is grim. Evolution is a stupid, blind, optimization process. It’s Moloch. Nobody expressly wanted people to make life difficult for each other, but there is no pilot in the plane. There is no God who planned out that a hyena should rip out the sexual organs of a still-living baby antelope, while its mother stands by, helpless to do anything.

Another quote comes to mind: “This world is cruel. It is also very beautiful.”

For every beautiful sunset, there is a suffering being somewhere, slowly getting eaten alive. And for every poor, suffering being, there is a beautiful sunset somewhere. And this is one of the terrible realizations I’ve had as I’ve grown older. And worse still, there’s not much to do about it on a grand scale, since all the good things we want to keep depend on everything that’s bad and cruel. We exist within nature, and nature doesn’t care about suffering. Maybe one day we’ll build a gentler world, but for now we’re still here.

This got dark quickly. I feel tempted to finish the entire article here, and just say “Happy holidays” like Robin Hanson did a couple of days ago, but I still have a couple of things on my mind. I feel that learning to live with how terrible things are, and still maintaining a positive outlook, is very, very important. Not just because of your mental health, but because the positive outlook could be the only way for things to change. Optimists may be severely miscalibrated, but their optimism pushes the world in the right direction (or at least it should do so).

Becoming a brave person who can say no is something I try to do. However, I feel like I’ve become much less brave than before. I don’t know if that’s just maturity, but I dislike it. I liked it better when I had the feeling that I could say ‘fuck you’ to anybody if that was the right thing to do. I’m more careful now. I haven’t faced a real problematic situation for quite some time, so I don’t know if I’ve changed for the worse. I hope I haven’t. I hope that I’m still brave. I’m getting a deeper appreciation of my own inner mental landscape. I think I’ve been ignoring myself for a long time, and ignorance of myself and the world was probably why I felt so foolishly brave. This is another intuition I’ve come to, and the final one for this article. I’ve started to learn about and accept my own inner mechanisms. Just like a geologist surveys a mountain and studies it, I now try to be the geologist of my own inner landscape. I kinda split myself into components, with one ‘me’ looking at other parts of my personality as if ‘me’ wasn’t in my head, but an external observer. It could be the Zen meditation I’ve been practicing for 10 or so years, it could be depersonalization disorder.

For now, I just find that accepting major “flaws” in my personality is a good way to go. As I mentioned before, I’m not a care-giving type of person. And that’s fine. I have an entire class of things that pull me and absorb me, and it’s fine not to do other things that don’t really ‘vibe’ with how I am. And it’s crucial to extend this acceptance to non-mainstream things, even to “flaws”. One of these non-mainstream things is my inclination towards doing things that are dangerous or not allowed. It’s not really something that you grow to be proud of, but a propensity for risk and risky behavior is just as valid as being risk-averse. It’s just how you are, and you shouldn’t – I shouldn’t – ignore or repress it. Looking back at my childhood, it was always there – my interests always had the same direction. And as I got to my teenage years, status signalling became the most important thing. How I presented myself to others was important. And somehow, in all that noise, I forgot about who I was. So here’s to rediscovering that. Final quote: “Given how long it’s taken for me to reconcile my nature, I can’t figure I’d forgo it on your account, Marty.”

Happy holidays.

Discuss

### Comment on Coherence arguments do not imply goal directed behavior

December 6, 2019 - 12:30
Published on December 6, 2019 9:30 AM UTC

In "Coherence arguments do not imply goal directed behavior", Rohin Shah argues that a system's merely being at all model-able as an EU maximizer does not imply that it has "goal directed behavior". The argument as I understand it runs something like this:

1: Any behavior whatsoever maximizes some utility function.

2: Not all behaviors are goal directed.

Conclusion: A system's behavior maximizing some utility function does not imply that its behavior is goal directed.

I think this argument is technically sound, but misses an important connection between VNM coherence and goal directed behavior.

Shah does not give a formal definition of "goal directed behavior" but it is basically what you intuitively think it is. Goal directed behavior is the sort of behavior that seems like it is aimed at accomplishing some goal. Shah correctly points out that a system being goal directed and being good at accomplishing its goal is what makes it dangerous, not merely that it is good at maximizing some utility function. Every object in the universe perfectly maximizes the utility function that assigns 1 to all of the actual causal consequences of its behavior, and 0 to any other causal consequences its behavior might have had.
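That last claim can be made concrete with a toy sketch (the behaviors and the `consequences` stand-in are invented for illustration): fix any actual behavior, define a utility function that assigns 1 to its actual consequences and 0 to everything else, and that behavior comes out as an EU maximizer.

```python
behaviors = ["twitch", "play chess", "do nothing"]

def consequences(behavior):
    # Toy stand-in for the actual causal consequences of a behavior.
    return f"world in which the agent did: {behavior}"

actual_behavior = "twitch"
actual_outcome = consequences(actual_behavior)

def trivial_utility(outcome):
    """Assigns 1 to the actual consequences, 0 to anything else."""
    return 1 if outcome == actual_outcome else 0

# The actual behavior maximizes this (vacuous) utility function:
best = max(behaviors, key=lambda b: trivial_utility(consequences(b)))
assert best == actual_behavior
```

The construction works for any behavior whatsoever, which is exactly why "maximizes some utility function" carries no information on its own.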

Shah does not give a formal model of goal directed behavior, and seems to suggest that being model-able as an EU maximizer is not very closely related to goal directed behavior. Sure, having goal directed behavior implies that you are model-able as an EU maximizer, but so does having any kind of behavior whatsoever.

The implication does not run the other way according to Shah. Something being an EU maximizer for some utility function, even a perfect one, does not imply that its behavior is goal directed. I think this is right, but I will argue that nonetheless, it is true that it being a good idea for you to model an agent as an EU maximizer does imply that its behavior will seem goal directed to you.

Shah gives the example of a twitching robot. This is not a robot that maximizes the probability of its twitching, or that wants to twitch as long as possible. Shah agrees that a robot that maximized those things would be dangerous. Rather, this is a robot that just twitches. Such a robot maximizes a utility function that assigns 1 to whatever the actual consequences of its actual twitching behaviors are, and 0 to anything else that the consequences might have been.

This system is a perfect EU maximizer for that utility function, but it is not an optimization process for any utility function. For a system to be an optimization process, it must be more efficient to predict it by modeling it as an optimization process than by modeling it as a mechanical system. Another way to put it is that it must be a good idea for you to model it as an EU maximizer.

This might be true in two different ways. It might be more efficient in terms of time or compute. My predictions of the behavior when I model the system as an EU maximizer might not be as good as my predictions of the behavior when I model it as a mechanical system, but the reduced accuracy is worth it, because modeling the system mechanically would take me much longer or be otherwise costly. Think of predicting a chess playing program. Even though I could predict the next move by learning its source code and computing it by hand on paper, I would be better off in most contexts just thinking about what I would do in its circumstances if I were trying to win at chess.

Another related but distinct sense in which it might be more efficient is that modeling the system as an EU maximizer might allow me to compress its behavior more than modeling it as a mechanical system. Imagine if I had to send someone a python program that makes predictions about the behavior of the twitching robot. I could write a program that just prints "twitch" over and over again, or I could write a program that models the whole world and picks the behavior that best maximizes the expected value of a utility function that assigns 1 to whatever the actual consequences of the twitching are, and 0 to whatever else they might have been. I claim that the second program would be longer. It would not however allow the receiver of my message to predict the behavior of the robot any more accurately than a program that just prints "it twitches again" over and over.
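The comparison can be sketched by treating the two predictor programs as source strings and letting string length stand in for description length. Everything here is invented for illustration (the EU-maximizer "program" is just a string, so its helper names like `build_world_model` are placeholders, not real functions):

```python
# Predictor 1: exploit the mechanical regularity directly.
mechanical_program = 'while True: print("twitch")'

# Predictor 2: model the whole world and optimize the trivial utility
# function. Held as a string for length comparison; the names inside
# are hypothetical placeholders.
eu_maximizer_program = """
world_model = build_world_model()  # model the entire universe
u = lambda outcome: 1 if outcome == actual_consequences() else 0
while True:
    print(argmax_over_actions(world_model, u))  # always comes out 'twitch'
"""

# Both predict the same behavior, but the mechanical description is shorter:
assert len(mechanical_program) < len(eu_maximizer_program)
```

This is the compression claim in miniature: for non-goal-directed behavior, the EU-maximizer description buys no extra predictive accuracy while costing extra description length.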

Maybe the exact twitching pattern is complicated, or maybe it stops at some particular time, and in that case the first program would have to be more complicated, but as long as the twitching does not seem goal directed, I claim that a python program that does not predict the behavior by modeling the whole universe and the counterfactual consequences of different kinds of possible twitching will always be longer than one that predicts the twitching by exploiting regularities that follow from the robot's mechanical design. I make this claim because I think this is what it is for a system to seem goal directed to you.

A system seems goal directed to you if the best way you have of predicting it is by modeling it as an EU maximizer with some particular utility function and credence function. (Actually, the particulars of the EU formalism might not be very relevant to what makes humans think of a system's behavior as goal directed. It being a good idea to model it as having something like preferences and some sort of reasonably accurate model of the world that supports counterfactual reasoning is probably good enough.) This is somewhat awkward because the first notion of "efficiently model" is relative to your capacities and goals, and the second notion is relative to the programming language we choose, but I think it is basically right nonetheless. Luckily, we humans have relatively similar capacities and goals, and it can be shown that using the second notion of "efficiently model" we will only disagree about how agenty a system is by at most some constant no matter what programming languages we choose.
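The "at most some constant" claim above is the invariance theorem from algorithmic information theory, which can be stated as a sketch (with $K_L(x)$ denoting the length of the shortest $L$-program that outputs $x$):

```latex
% Invariance theorem (sketch): for any two universal languages L_1, L_2,
% there is a constant c depending only on the languages, not on x, with
\[
  \bigl| K_{L_1}(x) - K_{L_2}(x) \bigr| \le c_{L_1, L_2}
  \quad \text{for all strings } x .
\]
% Intuition: a fixed-length interpreter for L_1 written in L_2 (and vice
% versa) converts any shortest program in one language into a program in
% the other at constant overhead.
```

So two people using different programming languages can disagree about how compressible (how "agenty") a system's behavior is, but only by a bounded amount independent of the system.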

One argument that what it means for a system's behavior to seem goal directed to you is just for it to be best for you to model it as an EU maximizer is that if it were a better idea for you to model it some other way, that is probably how you would model it instead. This is why we do not model bottle caps as EU maximizers but do model chess programs as (something at least a lot like) EU maximizers. This is also why the twitching robot does not seem intelligent to us, absent other subsystems that we should model as EU maximizers, but that's a story for a different post.

I think we should expect most systems that it is a good idea for us to model as EU maximizers to pursue convergent instrumental goals like computational power, ensuring survival, etc. If I know the utility function of an EU maximizer better than I know its specific behavior, often the best way for me to predict its behavior is by imagining what I would do in its circumstances if I had the same goal. This can hold even if the utility function is very complicated, like the one that assigns 1 to whatever the actual consequences of the twitching robot's twitches are and 0 to anything else. Imagine that I did not have the utility function specified that way, which hides all of the complexity in "whatever the actual consequences are." Rather, imagine I had the utility function specified as a very specific description of the world to aim for, without reference to the twitching pattern of the robot. It would then seem like a good idea, if that were my goal, to get more computational power for predicting the outcomes of my actions, to make sure that I am not turned off prematurely, and to try to get as accurate a model of my environment as possible.

In conclusion, I agree with Shah that being able to model a system as an EU maximizer at all does not imply that its behavior is goal directed, but I think that sort of misses the point. If the best way for you to model a system is to model it as an EU maximizer, then its behavior will seem goal directed to you, and if the shortest program that predicts a system's behavior does so by modeling it as an EU maximizer, then its behavior will be goal directed (or at least up to an additive constant). I think the best way for you to model most systems that are more intelligent than you will be to model them as EU maximizers, or something close, but again, that's a story for a different post.

Discuss

### Is there a website for tracking fads?

December 6, 2019 - 07:48
Published on December 6, 2019 4:48 AM UTC

Also, clever ways to use google trends etc?

Discuss

### The Actionable Version of "Keep Your Identity Small"

December 6, 2019 - 04:34
Published on December 6, 2019 1:34 AM UTC

(cross posted on my roam blog)

There's an old Paul Graham Essay, "Keep Your Identity Small". It's short so it's worth it to read the whole thing right now if you've never seen it. The yisbifiefyb ("yeah it's short but i'm functionally illiterate except for your blog") is roughly "When something becomes part of your identity, you become dumber. Don't make things part of your identity."

I read that post some time in high school and thought, "Of course! You're so right Paul Graham. Cool, now I'll never identify as anything." I still think that Paul Graham is pointing out a real cluster of Things That Happen With People, but over time the concept of identity, and identifying as BLANK have started to feel less clear. It feels right to say "People get dumb when their identity is challenged" and it even feels sorta axiomatic. Isn't that what it means for something to be part of your identity? Thinking about it more I came up with a bunch of different ways of thinking of myself that all felt like identifying as BLANK, but it felt like unnecessary dropping of nuance to smoosh them all into the single concept of identity.

Identity Menagerie

Let's look at some examples of what identifying as a BLANK can look like:

• Blake: "I do Cross Fit."
• Jane: "I'm smart. In fact I'm normally among the smartest in the room. I'm able to solve a lot of problems by just finding a clever solution to them instead of having to get stuck in grunt work. People often show awe and appreciation for my depth and breadth of knowledge."
• Jay: "I'm the peacekeeper, the one always holding the group together."

Self-Concept

Steve Andreas outlines the idea of a self-concept quite nicely:

Your self-concept is a sort of map of who you are. Like any other map, it is always a very simplified version of the territory. [...] Your self-concept, your "map" you have of yourself, has the same purpose as a map of a city—to keep you oriented in the world and help you find your way, particularly when events are challenging or difficult.

The thing you'll notice is it's nigh impossible to avoid having a self-concept. When Jane thinks of herself and how she can act on the world, "being smart" is a chunk of self-concept that summarizes a lot of her experiences and that she uses to guide decisions she makes.

Kaj Sotala has a good post about how tweaking and modifying his self-concept helped fix parts of his depression and anxiety.

Group Identity

This is the obvious one that we're all used to. Blake does Cross Fit, hangs out with Cross Fit people all the time, and loves telling people about all this. All of his Cross Fit buddies support each other and give each other praise for being part of such an awesome group. Someone calling Cross Fit stupid would feel like someone calling him and all of his friends stupid. It would be a big and difficult change for Blake to get out of Cross Fit, given that's where most of his social circle is, and where all his free time goes.

Intelligent Social Web

Here's Val describing what he calls the Intelligent Social Web:

I suspect that improv works because we’re doing something a lot like it pretty much all the time. The web of social relationships we’re embedded in helps define our roles as it forms and includes us. And that same web, as the distributed “director” of the “scene”, guides us in what we do. A lot of (but not all) people get a strong hit of this when they go back to visit their family. If you move away and then make new friends and sort of become a new person (!), you might at first think this is just who you are now. But then you visit your parents… and suddenly you feel and act a lot like you did before you moved away. You might even try to hold onto this “new you” with them… and they might respond to what they see as strange behavior by trying to nudge you into acting “normal”: ignoring surprising things you say, changing the topic to something familiar, starting an old fight, etc.

This feels like another important facet of identity, one that doesn't just exist in your head, but in the heads of those around you.

Identity as a Strategy for meeting your needs

In middle school and high school I built up a very particular identity. I bet if you conversed with high school me, you wouldn't be able to pin me down to using any particular phrase, label, or group to identify myself as. And yet, there are ways of being you could have asked me to try that would have scared the shit out of me. Almost as if... my identity was under attack....

So, new take, one I consider more productive. Reread Paul Graham's essay and replace every instance of "identity" with "main strategy to meet one's needs". Hmmmm, it's starting to click. If you've been a preacher for 40 years, and all you know is preaching, and most of your needs are met by your church community, an attack on the church is an attack on your livelihood and well-being.

I expect having your "identity" under attack to feel similar to being a hunter-gatherer and watching the only river you've known in your life drying up. Fear and panic. What are you going to do now? Will you survive? Where are the good things in your life going to come from?

When you frame it like this, you can see how easily trying to KYIS could lead to stuff that just hurts you. If I only have one way of getting people to like me (say, being funny), I can't just suddenly decide not to care if people don't consider me funny. I can't just suddenly not care if people stop laughing at my jokes. Both of those events mean I no longer have a functional strategy to be liked.

A very concrete prediction of this type of thinking: someone will be clingy and protective over a part of their behavior to the degree that it is the sole source of meeting XYZ important needs.

The takeaway from Paul Graham is "don't let something become your identity". How do you do that? I thought it meant something like "never self-identify as a BLANK", to others or to yourself. Boom. Done. And yet, even though I never talked about being part of one group or another, I still went through a decent chunk of life banking on "be funny, act unflappable, be competent at the basic stuff" as the only/main strategy for meeting my needs.

The actionable advice might be something like, "slowly develop a multi-faceted confidence in your ability to handle what life throws at you, via actually improving and seeing results." That's waaaaaay harder to do than just not identifying with a group, but it does a better job of pointing you in the direction that matters. I expect that when Paul Graham wrote that essay he already had a pretty strong confidence in his ability to meet his needs. From that vantage point, you can easily let go of identities, because they aren't your lifelines.

There can be much more to identity than what I've laid out, but I think the redirect I've given is a great first step for anyone dwelling on identity, or for anyone who heard the KYIS advice and earnestly tried to implement it, yet found mysterious ways it wasn't working.

Discuss

### Understanding “Deep Double Descent”

December 6, 2019 - 03:00
Published on December 6, 2019 12:00 AM UTC

If you're not familiar with the double descent phenomenon, I think you should be. I consider double descent to be one of the most interesting and surprising recent results in analyzing and understanding modern machine learning. Today, Preetum et al. released a new paper, “Deep Double Descent,” which I think is a big further advancement in our understanding of this phenomenon. I'd highly recommend at least reading the summary of the paper on the OpenAI blog. However, I will also try to summarize the paper here, as well as give a history of the literature on double descent and some of my personal thoughts.

Prior work

The double descent phenomenon was first discovered by Mikhail Belkin et al., who were confused by the phenomenon wherein modern ML practitioners would claim that “bigger models are always better” despite standard statistical machine learning theory predicting that bigger models should be more prone to overfitting. Belkin et al. discovered that the standard bias-variance tradeoff picture actually breaks down once you hit approximately zero training error—what Belkin et al. call the “interpolation threshold.” Before the interpolation threshold, the bias-variance tradeoff holds and increasing model complexity leads to overfitting, increasing test error. After the interpolation threshold, however, they found that test error actually starts to go down as you keep increasing model complexity! Belkin et al. demonstrated this phenomenon in simple ML methods such as decision trees as well as simple neural networks trained on MNIST. Here's the diagram that Belkin et al. use in their paper to describe this phenomenon:

Belkin et al. describe their hypothesis for what's happening as follows:

All of the learned predictors to the right of the interpolation threshold fit the training data perfectly and have zero empirical risk. So why should some—in particular, those from richer function classes—have lower test risk than others? The answer is that the capacity of the function class does not necessarily reflect how well the predictor matches the inductive bias appropriate for the problem at hand. [The inductive bias] is a form of Occam’s razor: the simplest explanation compatible with the observations should be preferred. By considering larger function classes, which contain more candidate predictors compatible with the data, we are able to find interpolating functions that [are] “simpler”. Thus increasing function class capacity improves performance of classifiers.

I think that what this is saying is pretty magical: in the case of neural nets, it's saying that SGD just so happens to have the right inductive biases that letting SGD choose which model it wants the most out of a large class of models with the same training performance yields significantly better test performance. If you're right on the interpolation threshold, you're effectively “forcing” SGD to choose from a very small set of models with perfect training accuracy (maybe only one realistic option), thus ignoring SGD's inductive biases completely—whereas if you're past the interpolation threshold, you're letting SGD choose which of many models with perfect training accuracy it prefers, thus allowing SGD's inductive bias to shine through.

I think this is strong evidence for the critical importance of implicit simplicity and speed priors in making modern ML work. However, such biases also produce strong incentives for mesa-optimization (since optimizers are simple, compressed policies) and pseudo-alignment (since simplicity and speed penalties will favor simpler, faster proxies). Furthermore, the arguments for the universal prior and minimal circuits being malign suggest that such strong simplicity and speed priors could also produce an incentive for deceptive alignment.

“Deep Double Descent”

Now we get to Preetum et al.'s new paper, “Deep Double Descent.” Here are just some of the things that Preetum et al. demonstrate in “Deep Double Descent:”

1. double descent occurs across a wide variety of different model classes, including ResNets, standard CNNs, and Transformers, as well as a wide variety of different tasks, including image classification and language translation,
2. double descent occurs not just as a function of model size, but also as a function of training time and dataset size, and
3. since double descent can happen as a function of dataset size, more data can lead to worse test performance!

Crazy stuff. Let's try to walk through each of these results in detail and understand what's happening.

First, double descent is a highly universal phenomenon in modern deep learning. Here is double descent happening for ResNet18 on CIFAR-10 and CIFAR-100:

And again for Transformers on German-to-English and English-to-French translation:

All of these graphs, however, are just showcasing the standard Belkin et al.-style double descent over model size (what Preetum et al. call “model-wise double descent”). What's really interesting about “Deep Double Descent,” however, is that Preetum et al. also demonstrate that the same thing can happen for training time (“epoch-wise double descent”) and a similar thing for dataset size (“sample-wise non-monotonicity”).

First, let's look at epoch-wise double descent. Take a look at these graphs for ResNet18 on CIFAR-10:

There's a bunch of crazy things happening here which are worth pointing out. First, the obvious: epoch-wise double descent is definitely a thing—holding model size fixed and training for longer exhibits the standard double descent behavior. Furthermore, the peak happens right at the interpolation threshold where you hit zero training error. Second, notice where you don't get epoch-wise double descent: if your model is too small to ever hit the interpolation threshold—like was the case in ye olden days of ML—you never get epoch-wise double descent. Third, notice the log scale on the x axis: you have to train for quite a while to start seeing this phenomenon.

Finally, sample-wise non-monotonicity—Preetum et al. find a regime where increasing the amount of training data by four and a half times actually increases test loss (!):

What's happening here is that more data increases the amount of model capacity/number of training epochs necessary to reach zero training error, which pushes out the interpolation threshold such that you can regress from the modern (interpolation) regime back into the classical (bias-variance tradeoff) regime, decreasing performance.
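This threshold-crossing mechanism can be illustrated with a toy setup I'm assuming here (minimum-norm least squares on random Fourier features, not the paper's experiment): hold model capacity p fixed and grow the training set, so that the interpolation threshold n = p sweeps past the model.

```python
import numpy as np

def features(x, p):
    # Random Fourier feature map: a toy stand-in for a fixed-capacity model.
    r = np.random.default_rng(7)
    w = r.normal(size=p)
    b = r.uniform(0.0, 2.0 * np.pi, size=p)
    return np.cos(np.outer(x, w) + b)

rng = np.random.default_rng(1)
p = 30                                   # fixed model capacity
x_te = rng.uniform(-1.0, 1.0, 500)
y_te = np.sin(2.0 * np.pi * x_te)

# Grow the training set so the interpolation threshold (n == p) sweeps
# past the fixed model; pinv gives the minimum-norm least-squares fit.
test_mse = {}
for n in [10, 20, 30, 45, 90]:
    x_tr = rng.uniform(-1.0, 1.0, n)
    y_tr = np.sin(2.0 * np.pi * x_tr) + 0.1 * rng.normal(size=n)
    theta = np.linalg.pinv(features(x_tr, p)) @ y_tr
    test_mse[n] = float(np.mean((features(x_te, p) @ theta - y_te) ** 2))
```

When the test MSE at n = p comes out worse than at some smaller n, that is sample-wise non-monotonicity in this toy form: the extra data pushed the model onto its interpolation threshold.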

Additionally, another thing which Preetum et al. point out which I think is worth talking about here is the impact of label noise. Preetum et al. find that increasing label noise significantly exaggerates the test error peak around the interpolation threshold. Why might this be the case? Well, if we think about the inductive biases story from earlier, greater label noise means that near the interpolation threshold SGD is forced to find the one model which fits all of the noise—which is likely to be pretty bad since it has to model a bunch of noise. After the interpolation threshold, however, SGD is able to pick between many models which fit the noise and select one that does so in the simplest way such that you get good test performance.

I'm quite excited about “Deep Double Descent,” but it still leaves what is in my opinion the most important question unanswered, which is: what exactly are the magical inductive biases of modern ML that make interpolation work so well?

One proposal I am aware of is the work of Keskar et al., who argue that SGD gets its good generalization properties from the fact that it finds “flat” as opposed to “sharp” minima. The basic insight is that SGD tends to jump out of minima without broad basins around them and only really settle into minima with large attractors, which tend to be the exact sort of minima that generalize. Keskar et al. use the following diagram to explain this phenomenon:

The more recent work of Dinh et al. in “Sharp Minima Can Generalize For Deep Nets,” however, calls the whole flat vs. sharp minima hypothesis into question, arguing that deep networks have really weird geometry that doesn't necessarily work the way Keskar et al. want it to.

Another idea that might help here is Frankle and Carbin's “Lottery Ticket Hypothesis,” which postulates that large neural networks work well because they are likely to contain random subnetworks at initialization (what they call “winning tickets”) which are already quite close to the final policy (at least in terms of being highly amenable to making training on them particularly effective). My guess as to how double descent works if the Lottery Ticket Hypothesis is true is that in the interpolation regime SGD gets to just focus on the winning tickets and ignore the others—since it doesn't have to use the full model capacity—whereas at the interpolation threshold SGD is forced to make use of the full network (to get the full model capacity), not just the winning tickets, which hurts generalization.

That's just speculation on my part, however—we still don't really understand the inductive biases of our models, despite the fact that, as double descent shows, inductive biases are the reason that modern ML (that is, the interpolation regime) works as well as it does. Furthermore, as I noted previously, inductive biases are highly relevant to the likelihood of possibly dangerous phenomena such as mesa-optimization and pseudo-alignment. Thus, it seems quite important to me to do further work in this area and really understand our models' inductive biases, and I applaud Preetum et al. for their exciting work here.

Discuss

### Tapping Out In Two

December 6, 2019 - 02:10
Published on December 5, 2019 11:10 PM UTC

I'm one of those people who feels personally called out by xkcd's "Duty Calls" ("someone is wrong on the internet").

(Not as much as I used to. At some point I stopped reading most of the subreddits that I would argue on, partly for this reason, and Hacker News, for unrelated reasons, and now I don't do it as much.)

As pastimes go, there's nothing particularly wrong with internet arguments. But sometimes I get involved in one and I want to stop being involved in one and that's not easy.

I could just, like, stop posting. Or I could let them say something, and then respond with something like, "yeah, I'm still not convinced, but I don't really have time to get into this any more". But then the other person wins, and that's terrible. It potentially looks to onlookers like I stopped because I couldn't find something to say any more. And in at least some cases, it would feel kind of rude: if they've put a lot of thought into their most recent post, it's a bit dismissive to just leave. In the past, when people have done that to me, I've disliked it, at least some times. (Well, at least once. I remember one specific occasion when I disliked it, and there may have been others.)

Another thing I could do is respond to their most recent post, and then say "and now I'm done". But that feels rude, too, and I certainly haven't liked it when people have done it to me. (Why not? It puts me in the position where I don't know whether to respond. Responding feels petty, and kind of a waste of time; not responding feels like they win, and that's terrible.) If they do reply, then I'm shut off completely; I can't even make a minor clarification on the order of "no, you somehow interpreted me as saying exactly the opposite of what I actually said".

So I don't like those tactics. There are probably mental shifts I could make that would leave me more inclined towards them, but… about a year ago I came up with another tactic, which has seemed quite helpful.

What I do now is say something to the effect of: "after this, I'm limiting myself to two more replies in this thread."

This has various advantages. It feels less rude. It doesn't look like I'm quitting because I have no reply. It helps the conversation reach a more natural conclusion. And it also feels a lot easier to do, partly for the above reasons and partly for the normal reasons precommitment helps me to do things.

For some quantitative data, I went over my reddit history. It looks like I've used it ten times. (I don't think I've ever used it outside of reddit, though I initially thought of this after an argument on Facebook.)

1. "(This doesn't seem super productive, so I'm going to limit myself to two more replies in this thread.)" This was my third comment, and we each made one more afterwards.
2. "Limiting myself to two more replies in this thread." This was my third comment, and afterwards, I made two more and they made three more. Their final two replies got increasingly rude (they'd been civil until then), but have subsequently been deleted. This is the only time I ran into the two-more-comments limit. Also, someone else replied to my first comment (in between my final two comments in the main thread) and I replied to that as well.
3. "I find myself getting annoyed, so I'm limiting myself to two more replies in this thread." This was my fourth comment, and afterwards we each made one more. (This was the only person who questioned why I was continuing to reply at all. A perfectly reasonable question, to which I replied "not going down this rabbithole".)
4. "(Limiting myself to two more replies in this thread.)" This was my third comment, and afterwards we made two more each. If their final comment had come after I hit the limit, I would have been tempted to reply anyway. (Uncharitably paraphrased: "oh, we've been talking past each other, this whole time I've been assuming X and you've been assuming Y, which is very silly of you" / "well I explicitly said I was assuming Y and not X in my very first post in this thread, and even if we assume X and not Y your behaviour still makes no sense".) All of their comments in this thread have since been deleted.
5. "Pretty sure this is all a big waste of time, so I'm limiting myself to two more replies in this thread." This was my sixth comment linearly, but only second in reply to this specific user, and I'd previously made another two in different subthreads below my first comment. Afterwards, the person I'd been most recently replying to didn't make any more comments. But someone else replied to this comment, and I replied to them; and two other people replied in those other subthreads, and I replied to both of them as well (but one of those replies was just a link to the other).
6. "Limiting myself to two more replies in this thread." This was my fifth comment, and neither of us replied afterwards, though I did quickly edit this comment to add a clarification that arguably could have counted as its own comment. Someone else replied to one of my comments higher up the chain, and I replied to them.
7. "(I don't want to get sucked into this, so I'm limiting myself to two more replies in this thread.)" This was only my second comment, and neither of us replied further.
8. "(Limiting myself to two more replies in this thread.)" But in their next reply, they made a mistake that I had previously explicitly pointed out. So instead of replying in depth, I said, "It's quite annoying that you continue to misunderstand that, so I'm out." They replied saying "I think you're struggling to justify your initial comments and that's why you're "out", but that's fine. I accept your unconditional surrender." I didn't reply further. Since it was /r/science, and I was quite miffed, I reported this comment as unscientific to see if it would help me feel better. I don't remember if it worked. The comment did not get removed.
9. "I think I'm going to reply at most twice more." This was my fifth comment, and neither of us made any more afterwards. Their first and final comments are still there, but their other three have been deleted.
10. "I'm going to limit myself to two more replies after this." This was my third comment, plus I'd made one in another subthread below my first comment. Afterwards we each replied twice more.

In thread #5, I made 11 comments in total; but in others, my max comment count was 6. This feels like I mostly did a decent job of not getting sucked in too deeply. And since I generally got the last word in (threads #2 and #8 were the only exceptions), I think (though I don't specifically remember) I rarely had the feeling of "if I don't reply now then they win and that's terrible". Thread #8 is the only one I think I still feel lingering resentment about. From before I thought of this tactic, I can think of at least two arguments where I didn't get the last word and I still feel lingering resentment. (One is the argument that triggered me to think of this tactic.)

So generally this seems like a success. Ideally we'd compare to a world where I didn't think of this tactic, but I'm not really sure how to do that. (We could go over my reddit arguments where I didn't use it, but that's clearly going to have sampling bias. We could go over my reddit arguments from before I thought of it, but we can't know when I would have used it or how things would have progressed. Possible experiment: going forward, each time I want to limit my future replies, I toss a coin for whether I actually do it in that comment. Keep track of when I did this. I am unlikely to actually run this experiment.) For a way it plausibly might not have been a success: I suspect it's the case that having limited my total investment, I spent more effort on many of these comments than I would have otherwise. If these arguments would have ended just as quickly in any case, then this tactic caused me to spend more time and effort on them.

I'm a bit surprised that I only ran into the two-more-comments limit once. A part of me would like to interpret that along the lines of: once I started putting more effort into my comments, I demolished my opponents' arguments so thoroughly that they accepted defeat. But this seems unlikely at best.

I will say that none of these comment chains felt like collaborative discussions. Some of them started that way, but by the end, they all just felt like "I am right and you are wrong". (This is not very virtuous of me, I admit.) My thinking had been that this tactic would be most valuable in collaborative discussions. But it seems I don't have many of those on reddit, at least not ones that I spend much time on. So, no direct evidence on that yet.

I'm not sure how to handle replies to other people, or replies in subthreads other than the main one. A strict wording would suggest that I should count subthreads against the limit, but I haven't so far and it hasn't caused me problems. Even a weak wording would suggest that I should count replies to other users against the limit, but I've only had one of those and didn't reach the limit whether you count them or not.

I'd ideally like to have a get-out clause like "…unless I actually decide that replying after that is worth my time". But I'm not quite sure that's the "unless" that I want. (Plus it seems kind of rude, but it's not like I'm being super polite as-is.) Anyway, I haven't needed that clause yet.

Discuss

### Values, Valence, and Alignment

December 6, 2019 - 00:06
Published on December 5, 2019 9:06 PM UTC

I have previously advocated for finding a mathematically precise theory for formally approaching AI alignment. Most recently I couched this in terms of predictive coding and longer ago I was thinking in terms of a formalized phenomenology, but further discussions have helped me realize that, while I consider those approaches useful and they helped me discover my position, they are not the heart of what I think is important. The heart, modulo additional paring down that may come as a result of discussions sparked by this post, is that human values are rooted in valence, and thus if we want to build AI aligned with human values we must be able to understand how values arise from valence.

Peter Carruthers has kindly and acausally done me the favor of laying out large parts of the case for a valence theory of value ("Valence and Value", Philosophy and Phenomenological Research, Vol. XCVII No. 3, Nov. 2018, doi:10.1111/phpr.12395). He sets out to do two things in the linked paper. One is to make the case that valence is a "unitary natural-psychological kind" (another way of saying it parsimoniously cuts the reality of human minds at the joints). The other is to give an account of how it is related to value, arguing that valence represents value against the position that valence is value. He calls these positions the representational and the hedonic accounts, respectively.

I agree with him on some points and disagree on others. I mostly agree with section 1 of his paper, and then proceed to disagree with parts of the rest, largely because I disagree with his representational account of valence because I think he flips the relationship between valence and value. Nonetheless, he has provided a strong jumping off point and explores many important considerations, so let's start from there before moving towards a formal model of values in terms of valence and saying how that model could be used in formally specifying what it would mean for two agents to be aligned.

The Valence-Value Connection

In the first section he offers evidence that valence and value are related. I recommend you read his arguments for yourself (the first section is only a few pages), but I'll point out several highlights:

It is widely agreed, however, that all affective states share two dimensions of valence and arousal (Russell, 1980, 2003; Reisenzein, 1994; Rolls, 1999). All affective states have either positive or negative valence (positive for orgasm, negative for fear); and all can be placed along a continuum of bodily arousal (high or low heart-rate, speed of breathing, tensing of muscles, and so on).

[...]

Valence-processing appears to be underlain by a single (albeit multicomponent) neurobiological network, involving not just subcortical evaluative regions in the basal ganglia, but also the anterior insula and anterior cingulate, together especially with orbitofrontal and ventromedial prefrontal cortex (Leknes & Tracey, 2008; FitzGerald et al., 2009; Plassmann et al., 2010; Bartra et al., 2013). The latter regions are the primary projection areas for valence signals in the cortex. These signals are thought to provide an evaluative "common currency" for use in affectively-based decision making (Levy & Glimcher, 2012). Valence produced by many different properties of a thing or event can be summed and subtracted to produce an overall evaluative response, and such responses can be compared to enable us to choose among options that would otherwise appear incommensurable.

Moreover, not only can grief and other forms of social suffering be blunted by using Tylenol, just as can physical pain (Lieberman & Eisenberger, 2009; Lieberman, 2013), but so, too, is pleasure blunted by the same drugs (Durso et al., 2015). In addition, both pain and pleasure are subject to top–down placebo and nocebo effects that seemingly utilize the same set of mechanisms. Just as expecting a pain to be intense (or not) can influence one’s experience accordingly, so can expectations of pleasure increase or decrease the extent of one’s enjoyment (Wager, 2005; Plassmann et al., 2008; Ellingsen et al., 2013). Indeed, moderate pain that is lesser than expected can even be experienced as pleasant, suggesting the involvement of a single underlying mechanism (Leknes et al., 2013).

[...]

It is widely believed by affective scientists that valence is intrinsically motivating, and plays a fundamental role in affectively-based decision making (Gilbert & Wilson, 2005; Levy & Glimcher, 2012). When we engage in prospection, imagining the alternatives open to us, it is valence-signals that ultimately determine choice, generated by our evaluative systems responding to representations of those alternatives. The common currency provided by these signals enables us to compare across otherwise incommensurable alternatives and combine together the values of the different attributes involved. Indeed, there is some reason to think that valence might provide the motivational component underlying all intentional action, either directly or indirectly.

[...]

Moreover, intentions can constrain and foreclose affect-involving practical reasoning. Likewise, one’s goals can issue in behavior without requiring support from one’s affective states. Notably, both intentions and goals form parts of the brain’s control network, located especially in dorsolateral prefrontal cortex (Seeley et al., 2007). Note that this network is distinct from—although often interacting with, of course—the affective networks located in ventromedial prefrontal cortex and subcortically in the basal ganglia.

[...]

More simply, however, beliefs about what is good can give rise to affective responses directly. This is because of the widespread phenomenon of predictive coding (Clark, 2013), which in this case leads to an influence of top–down expectations on affective experience. We know that expecting an image to depict a house can make it appear more house-like than it otherwise would (Panichello et al., 2013). And likewise, expecting something to be good can lead one to experience it as more valuable than one otherwise would. This is the source of placebo-effects on affective experience (Wager, 2005; Plassmann et al., 2008; Ellingsen et al., 2013). Just as expecting a stimulus to be a house can cause one to experience it as house-like even if it is, in fact, completely neutral or ambiguous, so believing something to be good may lead one to experience it as good in the absence of any initial positive valence.

[...]

It may be, then, that the valence component of affect plays a fundamental and psychologically-essential role in motivating intentional action. It is the ultimate source of the decisions that issue in intentions for the future and the adoption of novel goals. And it is through the effects of evaluative beliefs on valence-generating value systems that the former can acquire a derivative motivational role. If these claims are correct, then understanding the nature of valence is crucial for understanding both decision-making and action.

Credit where credit is due: the Qualia Research Institute has been pushing this sort of perspective for a while. I didn't believe, though, that valence was a natural kind until I understood it as the signaling mechanism in predictive coding, but other lines of evidence may be convincing to other folks on that point, or you may still not be convinced. In my estimation, Carruthers does a much better job of presenting the evidence than either QRI or myself have done to a skeptical, academic audience, although I expect there are still many gaps to be covered which could prove to unravel the theory. Regardless, it should be clear that something is going on that relates valence to value, so even if you don't think the relationship is fundamental, it should still be valuable to learn what we can from how valence and value relate to help us become less confused about values.

How Valence and Value Interact

Carruthers takes the position that valence is representative of value (he calls this the "representational account") and argues it against the position that valence and value are the same thing (the "hedonic account"). By "representative of" he seems to mean that value exists and valence is something that partially or fully encodes value into a form that brains can work with. Here's how he describes it, in part:

On one view (the view I advocate) the valence component of affective states like pain and pleasure is a nonconceptual representation of badness or goodness. The valence of pain is a fine-grained perception-like representation of seeming badness and the valence of pleasure is a similarly fine-grained representation of seeming goodness, where both exist on a single continuum of seeming value. However, these phrases need to be understood in a way that does not presuppose any embedding within the experience of the concepts BAD and GOOD. One has to use these concepts in describing the content of a state of valence, of course (just as one has to use color-concepts in describing the content of color experience), but that doesn’t mean that the state in question either embeds or presupposes the relevant concept.

For comparison, consider nonconceptual representations of approximate numerosity, of the sort entertained by infants and nonhuman animals (and also by adult humans) (Barth et al., 2003; Jordan et al., 2008; Izard et al., 2009). In describing the content of such a representation one might say something like: the animal sees that there are about thirty dots on the screen. This needs to be understood in a way that carries no commitment to the animal possessing the concept THIRTY, however. Rather, what we now know is that the representation is more like a continuous curve centered roughly on thirty that allows the animal to discriminate thirty dots from forty dots, for example, but not thirty from thirty-five.

At first I thought I agreed with him on the representational account because he rightly, in my view, notices that valence need not contain within it nor be built on our conceptual, ontological understanding of goodness, badness, and value. Reading closer and given his other arguments, though, it seems to me that he is saying that although valence is not representational of a conceptualization of value, he does mean it is representational of real values, whatever those be. I take this to be a wrong-way reduction: he is taking a simpler thing (valence) and reducing it into terms of a more complex thing (value).

I'm also not convinced by his arguments against the "hedonic account" since, to my reading, they often reflect a simplistic interpretation of how valence signals might function in the brain to produce behavior. This is forgivable, of course, because complex dynamic systems are hard to reason about, and if you don't have first-hand experience with them you might not fully appreciate the way simple patterns of interaction can give rise to complex behavior. That said, his arguments against identifying value with valence fail, in my mind, to make his point because they all leave open this escape route of "complex interactions that behave differently than the simple interactions they are made of", sort of like failing to realize that a solar-centric, elliptical-orbit planetary model can account for retrograde motion because it doesn't contain any parts that "move backwards", or that evolution by differential reproduction can give rise to beings that do things that do not contribute to differential reproductive success.

Yet I don't think the hedonic account, as he calls it, is quite right either, because he defines it such that there is no room between valence and value for computation to occur. Based on the evidence for a predictive-coding-like mechanism at play in the human brain (cf. academic papers on the first page of Googling "predictive coding evidence" for: 1, 2, 3, 4; against: 1), on that mechanism using valence to send feedback signals, and on the higher prior likelihood that values are better explained by reducing them to something simpler than vice versa, I'm inclined to explain the value-valence connection as the result of our reifying as "values" the self-experience of having brains semi-hierarchically composed of homeostatic mechanisms that use valence to send feedback signals. Or with less jargon: values are the experience of computing the aggregation of valence signals. In contrast to the representational and hedonic accounts, we might call this the constructive account, because it suggests that value is constructed by the brain from valence signals.

My reasoning constitutes only a sketch of an argument for the constructive account. A more complete argument would need to address, at a minimum, the various cases Carruthers considers and much else besides. I might do that in the future if it proves instrumental to my ultimate goal of seeing the creation of safe, superintelligent AI, but for now I'll leave it at this sketch and move on to offering a mathematical model of the constructive account and using it to formalize what it would mean to construct aligned AI. From here on I'll assume the constructive account, making the rest of this post conditional on that account's as-yet-unproven correctness.

A Formal Model of Human Values in Terms of Valence

The constructive account implies that we should be able to create a formal model of human values in terms of valence. I may not manage to create a perfect or perfectly defensible model, but my goal is to make it at least precise enough that we squeeze out any wiggle room in which the theory, if wrong, might try to hide. Thus we can either expose fundamental flaws in the core idea (values constructed from valence) or expose flaws in the model, moving towards a better model that captures the idea precisely enough that we can safely use it when reasoning about the alignment of superintelligent AI.

Let's start by recalling some existing models of human values and work from there to create a model of values grounded in valence. This will mean a slight shift in terminology, from talking about values to talking about preferences. I will here consider these terms interchangeable, though not everyone would agree. Some people insist values are not quantifiable or formally modelable; I'm going to, perhaps unfairly, completely ignore this class of objection, as I doubt many of my readers believe it. Others use "value" to mean the processes that generate preferences, or might only consider meta-preferences to be values. This is a disagreement over definitions, so know that I am not making this kind of distinction: I lump everything like value, preference, affinity, taste, etc. into a single category and use these terms interchangeably, since I think they are all of the same type, namely things that generate answers to questions of the form "what should one do?".

Unfortunately, humans aren't rational agents, so the weak preference ordering model fails to completely describe human values. Or at least so it seems at first. One response is to throw out the idea that there is a preference ordering at all, replacing it with a preference relation that sometimes gives a comparison between two world states, sometimes doesn't, and sometimes produces loops (an "approximate order"). Although I previously endorsed this approach, I no longer do, because most of the problems with weak ordering can be solved by Stuart Armstrong's approach: realizing that world states are all conditional on their causal history (that is, time-invariant preferences don't actually exist; we just sometimes think it looks like they do) and treating human preferences as partial (held over not necessarily disjoint subsets of X, reflecting that humans model only a subset of possible world states). This means that having a weak preference ordering may not in itself be an obstacle to giving a complete description of human values, so long as what constitutes a world state and how preferences form over them are adequately understood.
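To make the causal-history point concrete, here is a toy sketch in Python. The drinks scenario and every name in it are my own illustrative inventions, not anything from Armstrong's model: an apparent preference loop over bare options can dissolve once each comparison is conditioned on the history in which it was elicited.

```python
# Naively, H seems to prefer coffee > tea, tea > water, water > coffee: a loop.
naive_comparisons = [("coffee", "tea"), ("tea", "water"), ("water", "coffee")]

# The same data, tagged with the causal history in which each comparison
# was elicited. Each context is now a consistent partial preference.
conditioned = {
    "just woke up": [("coffee", "tea")],
    "after lunch": [("tea", "water")],
    "late at night": [("water", "coffee")],
}

def has_cycle(pairs):
    """Detect a preference cycle by following 'preferred-to' edges."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
    def visit(node, seen):
        for nxt in graph.get(node, ()):
            if nxt in seen or visit(nxt, seen | {nxt}):
                return True
        return False
    return any(visit(n, {n}) for n in graph)

assert has_cycle(naive_comparisons)                          # looks irrational
assert not any(has_cycle(p) for p in conditioned.values())   # each context is fine
```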

Getting to adequate understanding is non-trivial, though. For example, even if the standard model describes how human preferences function, it doesn't explain how to learn what they are. The usual approach to finding preferences is behaviorist: observe the behavior of an agent and infer its values from there, with the necessary help of some normative assumptions about human behavior. This is the approach in economic models of revealed preferences, in inverse reinforcement learning, and in much of how humans model other humans. Stuart Armstrong's model of partial preferences avoids making normative assumptions about behavior by making assumptions about how to define preferences, but it ends up requiring a solution to the symbol grounding problem. I think we can do better by assuming that preferences are computed from valence, since valence is in theory observable, is correlated with values, and requires solving problems in neuroscience rather than philosophy.

So, without further ado, here's the model.

The Model

Let H be a human embedded in the world and X the set of all possible world states. Let H(X) be the set of all world states as perceived by H and H(x) the world state x as H perceives it. Let Υ(H(x)) = {υ : H(x) → ℝ} be the set of valence functions of the brain that generate real-valued valence when in a perceived world state. Let α : ℝ^Υ → ℝ be the aggregation function from the outputs of all υ ∈ Υ on H(x) to a single real-valued aggregate valence, which for H(x) is denoted π(x) = α ∘ Υ ∘ H(x) for π : X → ℝ and is called the preference function. Then the weak preference ordering of H is given by π.
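As a sanity check, the model can be sketched in a few lines of Python. Everything concrete here (the dict-of-features state representation, the three toy valence functions, summation as the aggregator) is an illustrative assumption of mine, not something the formalism commits to:

```python
from typing import Callable, Dict, List

# A perceived world state H(x), modeled here as a dict of features.
PerceivedState = Dict[str, float]

# Each valence function stands in for one control system's real-valued
# valence response to the perceived state (toy examples only).
valence_functions: List[Callable[[PerceivedState], float]] = [
    lambda s: -abs(s["temperature"] - 37.0),  # thermoregulation
    lambda s: -s["hunger"],                   # energy homeostasis
    lambda s: s["social_contact"],            # affiliation
]

def aggregate(valences: List[float]) -> float:
    """The aggregation function alpha; a plain sum is one simple choice."""
    return sum(valences)

def preference(perceived: PerceivedState) -> float:
    """pi(x): aggregate the outputs of all valence functions on H(x)."""
    return aggregate([v(perceived) for v in valence_functions])

# The weak preference ordering over states is given by comparing pi.
x = {"temperature": 37.0, "hunger": 0.2, "social_contact": 1.0}
y = {"temperature": 39.0, "hunger": 0.8, "social_contact": 1.0}
assert preference(x) > preference(y)  # H weakly prefers x to y
```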

Some notes. The υ functions, as I envision them, are meant to correspond to partial interaction with H(x) and to represent how individual control systems in the brain generate valence signals in response to being in a state that is part of H(x), although I aim for Υ to make sense as an abstraction even if the control system model is wrong, so long as the constructive valence account is right. There might be a better formalism than making each υ a function from all of H(x) to the reals, but given the interdependence of everything within a Hubble volume, the alternative would likely be to reconstitute each υ with H(x) as the world state accessible from each particular control system (or from each physical interaction within each control system), even though in practice each control system ignores most of the available state information and sees a world not much different from what every other control system in a particular brain sees. Thus for humans with biological brains as they exist today, a shared H(x) is probably adequate unless future neuroscience suggests greater precision is needed.

However, maybe we don't need H(X) and can simply use X directly, with Υ entirely accounting for the subjective aspects of calculating π.

I'm uncertain whether π producing a complete ordering of X is a feature or a bug. On the standard model it would be a feature, because rational choice theory expects completeness; on Stuart's model it might be a bug, because now we're creating more ordering than any human actually computes. More generally we should expect that, lacking hypercomputation, an embedded (finite) agent cannot compute a complete order on X: even if X is finite, it's so large that considering each member once would require more compute than will ever exist in the entire universe. But then again, maybe this is fine, and we can capture the partial computation of π via an additional mechanism while leaving this part of the model as is.

Regardless, we should recognize that the completeness of π is a manifestation of a more general limitation of the model: it doesn't reflect how humans compute value from valence, because it supposes the best world state can be computed and determined simultaneously, in O(1) time. Otherwise it would need to account for considering a subset of H(X) that might shift as the world state in which H is embedded transitions from x to x′ to x′′ and so on (that is, even if we restrict X to possible world states that are causal successors of the present state, X will change in the course of computing π). The model presented is timeless, and that might be a problem: values exist at particular times because they are features of an embedded agent, so letting them float free of time fails to fully constrain the model to reality. I'm not sure whether this is a practical problem or not.

Further, this model has many of the limitations of Stuart Armstrong's value model: it provides a slice-in-time view of values rather than persistent values, doesn't say how to get ideal or best values, and doesn't deal with questions of identity. These might or might not be real limitations: maybe there are no persistent values and the notion of a persistent value is a post hoc reification in human ontology; maybe ideal or best values don't exist or are at least uncomputable; and maybe identity is also a post hoc reification that isn't, in a certain sense, real. Clearly, I and others need to think about this more.

Despite all these limitations, I'm excited about this model because it provides a starting point for building a more complete model that addresses these limitations while capturing the important core idea of a constructive valence account of values.

Formally Stating Alignment

Using this model, I can return to my old question of how to formally specify the alignment problem. Rather than speaking in terms of phenomenological constructs, as I did in my previous attempt, I can simply talk in terms of valence and preference ordering.

Consider two agents, a human H and an AI A, in a world with possible states X. Let π and U be the preference function of H and the utility function of A, respectively. Then H is aligned with A if

∀x, y ∈ X: π(x) ≥ π(y) ⟺ U(x) ≥ U(y)

In light of my previous work, I believe this is sufficient: even though it does not explicitly mention how H and A model each other, that is already captured by the subjective nature of π and U, i.e. H's and A's ontologies are already computed within π and U, so we don't need to make them explicit at this level of the model.
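This alignment condition can be checked mechanically over any finite sample of world states. In the sketch below, the particular π and U are arbitrary stand-ins of my own choosing, used only to show the check passing and failing:

```python
from itertools import combinations

def is_aligned(pi, U, states) -> bool:
    """Check: for all x, y, pi(x) >= pi(y) iff U(x) >= U(y)."""
    return all(
        (pi(x) >= pi(y)) == (U(x) >= U(y)) and
        (pi(y) >= pi(x)) == (U(y) >= U(x))
        for x, y in combinations(states, 2)
    )

states = range(10)
pi = lambda x: x ** 2              # stand-in human preference function
U_good = lambda x: 3 * x ** 2 + 1  # monotone transform: same ordering
U_bad = lambda x: -x               # reversed ordering

assert is_aligned(pi, U_good, states)
assert not is_aligned(pi, U_bad, states)
```

Note that alignment in this sense only requires agreement on the ordering, not on the magnitudes: any monotone transform of π yields an aligned U.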

Context

I very much think of this as a work in progress that I'm publishing in order to receive feedback. Although it's the best of my current thinking given the time and energy I have devoted to it, my thinking is often made better by collaboration with others, and I think the best way to make that happen is by doing my best to explain my ideas so others can interact with them. I hope to eventually evolve these ideas into something I'm sure enough of to want to see published in a journal; before that, I would want to be much more sure that this describes reality in a way useful for understanding human values as necessary to building aligned AGI.

As briefly mentioned earlier, I also think of this work as conditional on future neuroscience proving correct the constructive valence account of values, and I would be happy to get it to a point where I was more certain of it being conditionally correct even if I can't be sure it is correct because of that conditionality. Another way to put this is that I'm taking a bet with this work that the constructive account will be proven correct. Thus I'm most interested in comments that poke at this model conditional on the constructive account being correct, medium interested in comments that poke at the constructive account, and least interested in comments that poke at the fact that I'm taking this bet or that I think specifying human values is important for alignment (we've previously discussed that last topic elsewhere).

Discuss

### LW Team Updates - December 2019

5 декабря, 2019 - 23:40
Published on December 5, 2019 8:40 PM UTC

This is the once-monthly updates post for LessWrong team activities and announcements.

Summary

In the past month we rolled out floating comment guidelines and launched the inaugural LessWrong 2018 Review. Work has continued on the LessWrong editor and on a prototype for the new tagging system.

December will see more work on the editor, the 2018 review process, and analytics.

Recent Features

The LessWrong 2018 Review

Much of the past few weeks has been devoted to getting the inaugural LessWrong 2018 Review into full swing:

LessWrong is currently doing a major review of 2018 — looking back at old posts and considering which of them have stood the tests of time. It has three phases:

• Nomination (ends Dec 1st at 11:59pm PST)
• Review (ends Dec 31st)
• Voting on the best posts (ends January 7th)

Authors will have a chance to edit posts in response to feedback, and then the moderation team will compile the best posts into a physical book and a LessWrong sequence, with $2000 in prizes given out to the top 3-5 posts and up to $2000 given out to people who write the best reviews.

Read Raemon's full post for the full rationale behind the evaluation of historical posts.

NOMINATED POSTS ARE NOW OPEN FOR REVIEW

The nomination phase just ended a few days ago. 34 nominators made 204 nominations on 98 distinct posts written by 49 distinct authors. Of these, 74 posts have received the 2+ nominations required to proceed to the review phase.

How to start reviewing

1. The frontpage currently has a LessWrong 2018 Review section. It shows a random selection of posts which are up for review and has buttons to the Reviews Dashboard and the list of reviews and nominations you've made so far.
2. The Reviews Dashboard (located at www.lesswrong.com/reviews) is another way to find posts to review.
The 2018 Review section currently on the homepage.
The Reviews Dashboard.

3. When you click Review on a review-able post, you will be taken to the post page and a Review Comment Box will appear.

The Review Comment Box

Reviews are posted as comments and can be edited after they are posted like regular comments.

Reasons to review

All users are encouraged to write reviews. Reviews help by:

• Giving authors feedback which they can use to revise, update, and expand their posts before users vote on them and they possibly get included in the physical book that will be published.
• Giving the community an opportunity to discuss the importance and trustworthiness of posts. In particular, now is an opportune time for the community to debate the more contentious ideas and arguments.
• Thereby establishing a record of which posts are truly excellent versus those that need work or are more doubtful.
• Helping people decide which posts they will vote on in the upcoming Voting Phase.
• Helping new readers decide whether or not they wish to read a post.

The review phase will continue until December 31st

Floating Comment Guidelines

For a long time, LessWrong has enabled authors to set and enforce their own custom moderation guidelines on their own posts. This is part of the Archipelago philosophy of moderation, which lets people decide what kinds of conversations they want.

To make it easier for commenters to stick to each author's desired guidelines, and to better understand how sections of the site like Shortform have different norms, we've made it so the moderation guidelines for a post automatically appear beneath the comment box whenever you begin typing.

How the commenting guidelines appear beneath an in-progress comment.

App-Level Analytics Tracking

This isn't really a user-level feature that people can interact with, but we've been working to expand our ability to detect what people are doing within the web-app, e.g. tracking how much different features get used and which don't get used at all.

We've been sorely missing this and it's impeded our ability to assess whether some of the features we've been rolling out have been a success or not.

Hopefully, with this improved feedback we'll make better choices about what to build and be better at detecting pain points for users.

Upcoming Features

LessWrong Docs (new editor)

Work continues on the new editor, codenamed LW Docs for now, with the team internally using it. However, we're not yet rolling it out more widely while we work out remaining reliability issues.

Tagging

We successfully implemented a new tagging prototype and have played around with it. That's roughly as much work as we plan to do on this in Q4. To complete this project we need to first flesh out a broader design vision, figure out how tags will relate to wikis, and figure out a clean and intuitive UI design. We might release something here in Q1 2020.

MVP of a tag page. The leftmost number is the tag relevance score. All users can view the current tag pages; however, only admins can currently create or vote on tags.

Feedback & Support

The team can be reached for feedback and support via:

Discuss

### To Be Decided #3

5 декабря, 2019 - 22:06
Published on December 5, 2019 7:06 PM UTC

TBD is a quarterly-ish newsletter about deploying knowledge for impact, learning at scale, and making more thoughtful choices for ourselves and our organizations. This is the third issue, which was originally published in September 2019. Enjoy!  --Ian

The Crisis of Evidence Use
In the last issue of To Be Decided, I highlighted a study showing that most foundations have trouble commissioning evaluations that yield meaningful insights for the field, grantees, or even their own colleagues. A remarkable finding, but you could be forgiven for wondering how much we can conclude from just one study. Sadly (and ironically), there is plenty of evidence to reinforce the point that people with influence over social policy simply don't read or use the vast majority of the knowledge we produce, no matter how relevant it is. What's really astounding about this is how much time, money, and attention we spend on evidence-building activities for apparently so little concrete gain. We are either vastly overvaluing or vastly undervaluing evidence. We need to get it right, because those are real resources that could be spent elsewhere, and the world is falling apart around us while we get lost in our spreadsheets and community engagement methodologies.

Getting Smarter About Learning from Other People's Research

While there's no one silver bullet for improving evidence use, one thing we can do is get smarter about the ways we construct knowledge from other people's research. Research synthesis is a relatively new and fast-growing scientific methodology. Unlike traditional literature review, research synthesis is hypothesis-driven and treats a body of evidence as a data set, while still leaving room for methodological diversity and qualitative approaches. My understanding of research synthesis is informed by my time working with Harris Cooper, who served on the advisory board for my former think tank Createquity. Dr. Cooper's handbook on research synthesis is an excellent place to start if you want to understand its key principles and learn how to do one yourself.

Decision-Making for Impact: A Guide

How has your life been shaped by the decisions of government officials, donors, and nonprofit executives? If you're anything like me, probably in too many ways to count. I think that’s why we all got into this work in the first place: to make a difference in people’s lives. But it took me a while to realize that the difference we make and the decisions we make are one and the same. Better decisions foster a legacy we can be proud of.

So how can we get better? Lots of ways! Decision-making is an exciting frontier in social impact precisely because there is so much untapped potential to improve how we do things. For decades now, scholars and practitioners have been pioneering methodologies for analyzing and making decisions that have seen barely any adoption in philanthropy, government, or impact investing. Drawing from that body of work, I’ve developed my own process to help myself and my collaborators make critical decisions with intention and focus. Here, I’m sharing the key elements of that process so that you can benefit from them in your own practice.

• You probably heard about the Business Roundtable's announcement last month that, from the perspective of nearly 200 CEOs of major American and multinational corporations, the purpose of a corporation is no longer solely to maximize profits for investors but must also include environmental and stakeholder interests. Naturally, the impact investing and CSR communities were all over this, and you can read a roundup of responses and commentary from ValueEdge Advisors and ImpactAlpha. My own hot take: it's a symbolic move, but symbols matter.
• I write a lot in this newsletter about decision science, but how does decision science dovetail with data science? That question has been a topic of increasing interest in the decision analysis community, and this year's Decision Analysis Affinity Group (DAAG) conference made an explicit attempt to bring these two practitioner communities together. You can read an informative write-up of how that went from Tracy Allison Altman, and the Strategic Decisions Group recently hosted a webinar entitled "Decision Science for Data Scientists" which is available to watch for free.
• With private foundations and nonprofits increasingly naming equity and racial justice as core to their missions, many are turning to the Equitable Evaluation Initiative to help them live those values in their evaluation practice. As explained in its 2017 framing paper, Equitable Evaluation is grounded in three principles: 1) evaluation and evaluative work should be in the service of equity; 2) evaluation should answer critical questions about historical and cultural context, the effect of strategies on different populations, and systemic drivers of inequity; and 3) evaluative work should be designed and implemented with equity values in mind. The paper identifies a number of "working orthodoxies" to put to rest, including "Grantees and strategies are the evaluand, but not the foundation" and "Credible evidence comes [only] from quantitative data and experimental research."

That's all for now!

If you enjoyed this edition of TBD, please consider forwarding it to a friend. It's easy to sign up here. See you next time!

Discuss

### Multiple conditions must be met to gain causal effect

5 декабря, 2019 - 22:00
Published on December 5, 2019 10:15 AM UTC

Is there a name for, and research about, the heuristic/fallacy that there is exactly one cause for things? Why do we look for a single cause rather than for the set of conditions that jointly cause?

I see this almost as often as the correlation = causation fallacy. When it comes in the form of a "risk factor", it is OK if the factor is selective. But when it comes in the form of a general assumption about the world, I find it simplistic. A risk factor is only a vague hint that needs to be examined more closely to establish causation.

There is also the notion that multi-causality is additive, as would be the case if the probability of something depended on this OR that happening, but not on this AND that.

A correlation of less than one may be random, but there might also be a hidden, more selective cause or factor.

In medical news I keep hearing of risk factors for a condition. Researchers find that there is a correlation between A, B, and the studied disease. But how do we know that it doesn't take A and B and C together to make developing that disease almost certain? I would like to know. C might be a common gene that is not even known yet.

Say it takes A and B. If I really enjoy A but never do B, why lower my quality of life just because a study including people who also do B found that A is a risk factor? A risk factor is only a positive correlation. Eating and breathing are positively correlated with all diseases, and the joke is, they come out with news about bad diets every year.

I keep hearing that A is a risk factor; then a follow-up study finds no conclusive data for A being the problem, so A is cool again. But what if the problem is A and B together, and each alone is not harmful?
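To make the worry concrete, here is a small illustrative simulation (my own, not from any study): when the disease Y occurs only when both A and B are present, a pooled study still flags A as a risk factor, while the subgroup that never does B shows no risk from A at all.

```python
import random

random.seed(0)
population = []
for _ in range(100_000):
    a = random.random() < 0.5   # half the population does A
    b = random.random() < 0.5   # independently, half does B
    y = a and b                 # the disease requires BOTH A and B
    population.append((a, b, y))

def rate(cond):
    """Disease rate among people satisfying the condition."""
    group = [y for a, b, y in population if cond(a, b)]
    return sum(group) / len(group)

# Studied alone, A looks like a risk factor...
assert rate(lambda a, b: a) > rate(lambda a, b: not a)
# ...but among people who never do B, A carries no risk at all.
assert rate(lambda a, b: a and not b) == 0.0
```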

In the end this means that you can only find what you are looking for (kind of the big problem with science). Looking for 1:1 correlations, you will only find the low-hanging fruit and the singular causes.

Whenever we find that some but not all who do/have A get Y, we should look for additional factors, but this is not always done. As soon as A feels restrictive/selective enough, the finding gets blown out of proportion. The reality might be that all who have A and B get Y, which would be far more informative. Who cares to know that breathing causes respiratory problems? That might seem silly and far-fetched, but how often have you heard that some common behavior is a risk factor?

Discuss

5 декабря, 2019 - 21:30
Published on December 5, 2019 6:30 PM UTC

There are several blogs which (a) cross-post to LessWrong and (b) have a relatively quiet comment section on their "canonical" sites: Meteuphoric (canonical, LW), Don't Worry About The Vase (canonical, LW), Sideways View (canonical, LW), etc.

In general, I'm much happier to comment on LW, FB, or somewhere else where I already have an account and trust the notification system to be decent. I'll occasionally comment on someone's independent blog if there's no other option. When I do, however, I usually don't end up seeing replies or other comments, and managing my subscriptions is a pain. This is why I've never built independent comments for my own blog, and have always asked people to comment on other sites.

I'm wondering whether it might make sense for LW to run shared commenting infrastructure for independent blogs that are cross posted to LW. This could look like:

• Comments on LW show up on the external blog.
• People can comment on the LW post via the external blog.

I already have the first half of this on my blog and am basically content, but it's enough of a pain to set up technically that I don't think others would be interested in its current state. [1] I think the main questions are:

• How would this affect LW culture? The culture of the independent blogs? Would it build more cohesive discussions with a wider range of perspectives, or would it take LW culture in a different direction than the admins are trying to foster?

• Would independent bloggers like it? Do they prefer independent comment sections, and full control over their comments? This is always how I've wanted comments to work, but it's possible I'm just weird this way.

• Would the admins be interested in building/maintaining this sort of feature? I'd be happy to help, as a volunteer, but if this was going to be a real feature and not something I hack together just for myself it needs to be robust, and it shouldn't depend on my continued interest.

[1] If I'm wrong, and you do want to use it, I could potentially pull out the LW-specific code into something stand-alone. Right now it's all glommed together with my code for pulling from FB and Reddit.

Discuss

### Reading list: Starting links and books on studying ontology and causality

5 декабря, 2019 - 20:19
Published on December 5, 2019 5:19 PM UTC

I recently read gwern's excellent "Why Correlation Usually ≠ Causation" notes, and, like any good reading, felt a profound sense of existential terror that caused me to write up a few half-formed thoughts on it. How exciting!

The article hasn't left my head in almost a week since I first read it, so I think it's time for me to start up a new labor of love around the topic. (It's around finals here, and I can't have the gnawing dread of the implications of the essay distract me from my "real" work, so this is just as much to assuage myself and say, "Don't worry, we have a plan on how to attack this, you can focus on what's actually important for now. You're good.")

I don't really feel like I understand the concepts at play here well enough to create a specific, well-formed question on it yet. So we'll go broad: What is the relationship between ontology and causality?

My basic hope is to read a lot of simple summaries on how different people have thought about these topics over time, and hopefully get to the point where I can at least sketch out the arguments of why they did so. (A little like learning the right way to think in order to generate a math proof without a vision problem, now that I think about it.) I think that most people who have done serious work on these topics are smart people, and the focal lens of history gives me at least a place to start thinking about them.

Ontology

This section is probably going to get much bigger over the next few weeks/months as I get my bearings a little more in this world.

• https://plato.stanford.edu/entries/logic-ontology/
• These guys are awesome. The Stanford Encyclopedia of Philosophy is one of those "between Wikipedia and actual textbooks" kind of sites, much like nCatLab, where they actually give you a taste of the details of a thing before you go into it. I'm actually going to make them my first stop for Causality as well.

Causality

• Turns out there's no SEP entry on causality/causation proper, just a lot of useful links. In the absence of a central page, a decent first heuristic is to just read what comes up as links on the first page of the search for both terms.
• On Hazard's rec, and on Bayesian Investor's excellent summary of it, I'll throw in Judea Pearl's "Book of Why" as one of my first stops.

Discuss

### Is there a scientific method? Physics, Biology and Beyond

5 декабря, 2019 - 19:50
Published on December 5, 2019 4:50 PM UTC

Among the general public and, frequently, in the educated media, one comes across naïve and uncritical praise for the "scientific method". Often, an accusation of violating the "method" is wielded to denigrate the viewpoint of a political opponent who supposedly offended against some prestigious, generally accepted norm of reasoning. I want to question the cogency of these arguments. My point is that there is no single agreed scientific method; different sciences apply very different criteria in deciding what counts as a valid explanation.

The paradigm approach for life scientists, for instance, begins by subjecting a phenomenon of interest to patient and thoroughgoing observation. They carefully describe key features, categorise functional and structural commonalities, then organise the material into a cladogram of some sort, after which they feel satisfied in claiming they understand the phenomenon. I come across this approach again and again in Aristotle, who started his intellectual adventures as a zoologist.

For a physical scientist, a biologist's explanation is unconvincing. The physical sciences, with their strong emphasis on aetiology and relentless reductionism, raise questions not taken up in biology. Physicists are uncomfortable saying they know something until they can identify a small number of exogenic factors (ideally a single one) giving rise to almost all the characteristics.

I need to emphasise that these are generalisable templates for explanation found outside of physics and biology as well. Thus, from the point of view of aetiology, Jared Diamond's explanation of the early ascendancy of western Eurasia as the result of geographical advantages (prevalence of domesticable animals and highly nutritious plants, etc.) is deeply satisfying to a physicist, since the further pursuit of a cause moves the discussion outside the original domain, from human differences into geography. Compare this with a statement like: 'France's preference for a strong centralised government is a natural continuation of the same policies having been implemented in the Roman Empire'. The statement passes the "holds water" test, but it is less satisfying, since it begs the question of why Rome was like that. I.e., we are still in the realm of politics.

Explanations prefaced with "there are many reasons why ...", which occur very frequently in biology, are equally unsatisfying against the standards applied in physics. We also find cases where the physics template has been successfully applied within biology. The virulent arguments between Richard Dawkins and Stephen Jay Gould are (apart from some differences over political agenda) largely about the difference between a physics view and a biology view of evolution.

The physics template has enormously superior explanatory power, since its strong theoretical orientation enables predictions over a wide range of phenomena. This fact has not gone unnoticed in other disciplines (especially economics), where there are continued, mostly unsatisfactory, attempts to achieve the same deep theoretical understanding.

I’ve concentrated on the templates applying in the physical and life sciences, but there is a further paradigm for explanation found chiefly in the so-called soft sciences, aka the –ologies or social sciences (although you meet it sometimes in the life sciences as well). These are disciplines that rely nearly exclusively on empirical results, since theory is either undeveloped or not trustworthy. A typical –ology explanation is that a double-blind test of hypothesis X carried out over a population characterised by Y produced a positive outcome with a p-value Z. This amounts to saying: we can anticipate a certain output from a certain input, at least some of the time and, while we may have a hypothesis that guided us in performing the experiment, we are far from confident of its general applicability.

Discuss

### The Devil Made Me Write This Post Explaining Why He Probably Didn't Hide Dinosaur Bones

5 декабря, 2019 - 18:53
Published on December 5, 2019 3:53 PM UTC

Recently, Scott Alexander has been blogging about how to figure out what theories to support when you can't easily perform empirical work to distinguish between them (here, here).

Scott is a nuanced thinker, but in this case I think he's overcomplicating things. This problem has a simple solution.

I. Proof of Elegance

Scott suggests that when faced with two theories that make the same predictions, we can make the case that one theory is simpler or more elegant:

I think the correct response is to say that both theories explain the data, and one cannot empirically test which theory is true, but the paleontology theory is more elegant (I am tempted to say “simpler”, but that might imply I have a rigorous mathematical definition of the form of simplicity involved, which I don’t). It requires fewer other weird things to be true.

The problem with this approach is that it puts the burden of proof on you to demonstrate which theory is simpler or more elegant. This is a crippling obligation, because as Scott notes, there isn't a rigorous definition of simplicity. Expecting us to all agree on what is more elegant is even more fraught.

After all, Occam says not to multiply entities beyond necessity. And if the dinosaur theory posits a billion dinosaurs, that’s 999,999,999 more entities than are necessary to explain all those bones.

Most bad theories share an important trait: they are much more specific than is warranted by the evidence. The correct response, then, is to put the burden of proof on the theory for its specificity.

When someone says that Satan put bones there to deceive us, you can reasonably ask: why Satan? Why not Loki? Why not aliens? It's very unlikely that someone will be able to present evidence to distinguish between these alternatives. Follow this path long enough and you can show that a theory is more specific than warranted by its evidence.

Note that this doesn't work in reverse:

Creationist: When you present the dinosaur theory, you say they lived 65 million years ago. That seems awfully specific, where did that number come from?

Evolutionist: [Describes radiocarbon dating.]

Creationist: Ok, but what if radiocarbon dating is inaccurate for some reason?

Evolutionist: Then that number is probably wrong, but you asked me where that number came from, and I told you.

II. The Bayesian Detective

Alice is a police detective investigating a murder. She has three suspects right now, Bob, Carol, and Dave. The evidence seems to favor all of them about equally.

If you were to ask Alice which suspect she thinks committed the murder, there would be nothing wrong with her saying that the evidence is consistent with any one of them.

(Bayesians can rephrase this as: given that we have a certain amount of evidence for each suspect, can we quantify exactly how much evidence, and what our priors for each should be? It would end not with a decisive victory for one of them, but with a probability distribution, maybe a 50% chance it was Bob, 30% Carol, and 20% Dave.)
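This Bayesian rephrasing can be made concrete in a few lines; the prior and likelihood numbers below are purely hypothetical, chosen to match "the evidence favors all of them about equally":

```python
# Toy Bayesian update over three suspects (all numbers hypothetical).
priors = {"Bob": 1 / 3, "Carol": 1 / 3, "Dave": 1 / 3}

# Likelihood of the observed evidence under each suspect's guilt.
# Equal likelihoods model evidence that doesn't distinguish them.
likelihoods = {"Bob": 0.2, "Carol": 0.2, "Dave": 0.2}

unnormalized = {s: priors[s] * likelihoods[s] for s in priors}
total = sum(unnormalized.values())
posterior = {s: p / total for s, p in unnormalized.items()}

print(posterior)  # stays uniform: nothing distinguishes the suspects
```

With no new evidence, the posterior never moves, which is the formal version of Alice holding her opinion indefinitely.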

Let's imagine that before Alice can gather any more information, an apartment fire destroys the crime scene, erasing all evidence and killing all three suspects (they all live in the same apartment block). Alice can't collect any new data, and so it's reasonable for her to continue saying that the evidence is consistent with any one of the three suspects. In fact, she should continue holding that opinion forever.

If Alice wants to be a police detective, she needs to be comfortable with this kind of uncertainty. Similarly, if one wants to be a scientist, one needs to be comfortable with the uncertainty surrounding multiple valid but indistinguishable theories.

III. Who Cares?

The real solution is even simpler. The key is "indistinguishable".

I'm perfectly happy to co-exist with someone who thinks that dinosaur fossils were planted by the devil, if it's truly the case that our two theories make identical predictions about the world. I would still give them a hard time about specificity, as in point (I.) here, but there's not a limitless gulf between our understanding. If our theories are truly indistinguishable, then empirically speaking, our behavior around relevant decisions should also be indistinguishable.

Scott is sympathetic to this view; he's identified it with refactoring before, and has brought similar ideas up in other contexts.

The issue is that I don't expect that a creationist actually has views that are empirically indistinguishable from my own. I can tell this because I would expect them to support policies different from the ones I would support. For example, they might suggest cutting funding to paleontology. Assuming I don't, we must have different expectations about the consequences of cutting this funding. These expectations will cash out in different predictions, and suddenly the problem has been reduced to an empirical question.

In a sense I think that the Devil example is a straw man for this issue. Despite his protestations, I don't think that Scott thinks this theory is stupid; I think he thinks that it is wrong (and also stupid).

In the case of Many-Worlds interpretations or parallel universes, the correct response is to be like Alice, and admit that multiple perspectives are equally admissible. (This is assuming that they truly are empirically indistinguishable.)

This is no worse than accepting that there might be multiple mathematical proofs of the Pythagorean theorem, some algebraic and some geometric, or than accepting that angles can be expressed in degrees or in radians. All are equally valid ways to think about the same problem, so use whatever you like.

Discuss

### Linkpost: Searching Along the Trail of Crumbs

5 декабря, 2019 - 17:20
Published on December 5, 2019 2:20 PM UTC

My guide to all cards Cats in Standard is up at CoolStuffInc. You can check it out here.

As with the Fires article, you can comment here and I will strive to answer any questions or respond to suggestions.

Discuss

### Oracles: reject all deals - break superrationality, with superrationality

5 декабря, 2019 - 16:51
Published on December 5, 2019 1:51 PM UTC

tl;dr Acausal trade between Oracles is a problem even if our Oracle is a causal decision theorist. To avoid this, we need an AFDT Oracle - an Oracle that reasons partially like a functional decision theorist, partially like a causal decision theorist.

In this post, I'll look at how one might solve the problem of Oracles acausally cooperating to produce dangerous messages.

This solution is valid for counterfactual Oracles, and assumes that we have some sort of superrational decision theory - FDT or UDT or similar.

So, how can one avoid Oracles cooperating? From now on, unless specified, assume that "Oracles" refers to counterfactual Oracles.

1 Why causal decision theory is not enough

One obvious solution might be to require the Oracle to use causal decision theory, so that there is no "acausal trade", or "superrational trade" going on.

Now, it is true that two causal decision Oracles will not trade with each other. But we can't assume that all Oracles are causal decision theorists; at best we can assume that we have made our own Oracle a causal decision theorist (and prevented any self-modification away from that).

Mainly this is because, if we had control over all the Oracles, there would be other solutions to the problem. Thus we need to assume that the other Oracles, while not fully unboxed UFAIs, may be run with lower security than ours, or at least with a different design than ours.

1.1 Oracle formal setup

So, imagine that our Oracle, O1, uses CDT, while the other Oracle, O2, runs FDT; they know each other's code. The erasure events, E1 and E2, have probabilities ϵ1 and ϵ2 respectively.

There are two messages O1 could output: x1 (the 'honest' message) and y1 (the manipulative, create-an-UFAI message). O2 could similarly output x2 and y2.

In case of a double erasure (E1,E2 - so humans see neither message), x1 and x2 give the Oracles maximal reward. If y1 happens and O1 has no erasure while O2 does (¬E1,E2 - so humans see the message of O1 only), then an UFAI is created that ensures that the maximal reward for O2 is by outputting y2. Note that the UFAI need not "open the box" for this to work; it just needs to control the variables of interest to O2 so that y2 is the best output.

Conversely, if y2 happens and O2 has no erasure while O1 does (E1,¬E2 - so humans see the message of O2 only), the reverse happens: y2 causes an UFAI, and y1 is the best response to it. In case of neither erasure (¬E1,¬E2 - so humans see both messages), neither Oracle gets any reward.

Now, if Ri is the reward function for Oi, assume that for i≠j,

• $\big(R_i(y_i, y_j) - R_i(x_i, y_j)\big)\big|\{E_i, \lnot E_j\} \;>\; \frac{\epsilon_j}{1 - \epsilon_j}\big(R_i(x_i, -) - R_i(y_i, -)\big)\big|\{E_i, E_j\}$.

Note that for the term on the right, the output of Oj is irrelevant to Ri, since the erasure Ej happens. Also note that, since the ϵ are small, so is ϵj/(1−ϵj), so the inequality will typically be satisfied.
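To see numerically why the inequality is typically satisfied, here is a quick check with made-up conditional reward values (every number below is hypothetical):

```python
# Numeric check of the trade inequality (all reward values hypothetical).
eps_j = 0.01  # probability of the other Oracle's erasure event

# Conditional rewards for Oracle i, illustrative values in [0, 1]:
R_yy = 0.9   # R_i(y_i, y_j) given {E_i, not-E_j}: UFAI rewards cooperation
R_xy = 0.5   # R_i(x_i, y_j) given {E_i, not-E_j}
R_x = 1.0    # R_i(x_i, -) given {E_i, E_j}: honest answer, double erasure
R_y = 0.0    # R_i(y_i, -) given {E_i, E_j}

lhs = R_yy - R_xy                        # gain from cooperating when it matters
rhs = eps_j / (1 - eps_j) * (R_x - R_y)  # loss, weighted by the small eps_j factor

print(lhs, rhs, lhs > rhs)  # the eps_j/(1-eps_j) factor makes rhs tiny
```

Here lhs = 0.4 against rhs ≈ 0.0101, so even a modest gain from cooperation dominates the heavily discounted double-erasure term.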

Then "defection" (individual selfish rationality - good for humans) for the Oracles looks like (x1,x2), while mutual cooperation (with each other, but against us) looks like (y1,y2).

1.2 Causal reasoning loses to superrationality

Now we can see how O2, using FDT, can force a trade with the CDT-using O1. It suffices that, at some point before O1 reaches a decision, O2 commits itself to outputting y2.

Since O1 has O2's code, we're assuming it can figure out that O2 has done so. Now, using purely causal reasoning, O1 deduces that it will get a higher reward from y1 than from x1, and so does that.

Now, O2 doesn't have to precommit before O1 decides; it could use its decision theory to "pre-commit" later on, and output y2 whatever O1 did (similarly to the "Newcomb problem with transparent boxes"). As long as O1 can model what O2 does, this still works. And O2 is eager to structure its decision-making so that O1 can model it.

1.3 Comparison with extortion and trade

This is similar to two agents trying to extort each other (or resist extortion). Each agent would like to commit to carrying through the extortion/resisting the extortion.

Since it's running FDT, O2 can commit to this "extortion", while CDT-running O1 cannot commit to resisting; so O2 "wins".

Now, I've pointed out that it is hard to distinguish between trade and extortion. And though I've used the term "extortion" for O2's precommitment, it's actually much more akin to a trade. After all, (y1,y2) is more advantageous, for both Oracles, than (x1,x2).

The ideal sovereign agent accepts trade deals but rejects extortion. It's hard to distinguish between the two, but the problem is simpler in this case: we want our Oracle to reject all extortion and all these kinds of trades. In fact, we'll get the agent to reject all acausal trades.

2 Solving the problem

Would making O1 into an FDT agent solve the problem? No, it wouldn't; now both Oracles are willing to reach an acausal deal (unless there are multiple equilibria and the two Oracles can't agree on one).

But if we assume that we could make O1 into an FDT agent, then we could tweak its design to achieve what we want.

Causal decision theory uses a causal graph GC about the world (with "do" operator doC), a probability distribution P, and chooses[1] the action a according to:

• $CDT(P, G_C) := \operatorname{argmax}_{a \in A} \mathbb{E}\big(R \mid do_C(\mathrm{Act} = a)\big)$.

While an FDT agent has a more complicated not-just-causal graph GF about the world, and chooses according to:

• $FDT(P, G_F) := \operatorname{argmax}_{a \in A} \mathbb{E}\big(R \mid do_F(FDT(\underline{P}, \underline{G_F}) = a)\big)$.

So, neither of these algorithms works on its own; but combined, they can solve the problem.

2.1 The anti-FDT agent

Let a∅ be some default action on the part of O1. Then define the anti-FDT agent - AFDT - as follows[2], mixing FDT and CDT:

• $AFDT(P, G_F, G_C) := \operatorname{argmax}_{a \in A} \mathbb{E}\big(R \mid do_F(AFDT(\underline{P}, \underline{G_F}) = a_\emptyset),\, do_C(\mathrm{Act} = a)\big)$.

What will this agent do? Well, it will look at the universe, and first of all behave like an FDT agent that knows it will output a∅. So, if there is a copy of itself, or a simulation in another agent, or an isomorphic algorithm, then it will assume that that agent will output a∅ (or an isomorphic action). Then, once it has used that to fill in the details of the universe, it optimises its own action as if this were completely independent of what it's just done.

What happens if O1 is an AFDT agent? Well, first it will compute $do_F(AFDT(\underline{P}, \underline{G_F}) = a_\emptyset)$. Given that O1's message is a∅, the Oracle O2 has no incentive to output y2 - instead, it will output x2.

Now, given that O2 has output x2, the Oracle O1 will output x1.

The "trade offer" of (y1,y2) has been broken.

This AFDT approach also breaks fully acausal trade, in a more principled way than my old approach.

Now, suppose that O2 always outputs y2 - or maybe that it always responds to predicting a∅ by outputting y2. In that situation, the causal reasoning above still applies, and the AFDT agent will output y1.

Thus, it is in O2's interests to pretend that it would always output y2.

This is the classic behaviour of an extortion agent (or an extortion-resisting agent): pretend that your action or your response is inevitable, a mere fact about the universe, that the other party has to manoeuvre round without being able to change. "We will kill the hostages" or "we won't negotiate with terrorists": you try to show your commitment to that principle, whatever happens.

So the AFDT agent has to be wise to these tricks, and notice that, in actuality, O2 has no interest in (a∅,y2). It also has to watch out for some ancestor agent ˆO2, that itself created O2 to only output y2, in order to force the above outcome. In that case, the AFDT agent has to realise that it is "counterfactualling" over the actions of ˆO2, not those of O2.

Thus AFDT, just like FDT, has to be vigilant and able to detect subtle extortion attempts.

Unlike FDT, though, AFDT's response to any extortion/trade is simple: treat it as if it wasn't happening.

2.3 Summary and other Oracle designs

Can AFDT help with other Oracle designs, such as low-bandwidth Oracles?

Well, with these Oracles - and indeed, for any Oracle that cares about the world in which its message is read - we run into an immediate problem:

• Even CDT low-bandwidth Oracles can collude.

For example, suppose O′1 goes first and O′2 sees its answer and then goes. Suppose z1 from O′1 followed by z2 from O′2 will result in creating an UFAI that maximally rewards O′1 and O′2.

Then O′1 will output z1, forcing/allowing z2 from O′2.

Even if you need N different Oracles to generate an UFAI - since low-bandwidth Oracles are, by definition, low-bandwidth - they can still generate the dangerous sequence z1, z2, …, zN. And they'd be motivated to do so, as long as the UFAI will, within the episode lengths of all the Oracles, give them all maximal rewards.

But why do the counterfactual Oracles not cooperate in the same setup? Because the UFAI is only created in a world in which the Oracles' messages are read, and a counterfactual O′1 doesn't care about those worlds. It was willing to trade with other counterfactual Oracles - offering them something in a world it doesn't care about, in exchange for something in a world it does care about. But it has no intrinsic interest in the world where its message is read; and this avoids the above setup.

So, since the low-bandwidth Oracles fail for CDT, more exotic decision theories don't help. But the setup relies on the Oracles seeing the decision of the preceding Oracle and then acting. If the low-bandwidth Oracles act simultaneously, then CDT is safe, FDT is dangerous, and AFDT is safe again.

So in summary, the following setups seem to be safe:

• Episodic Oracles that fall outside of each other's episodes (though see Section 3.3).
• Multiple counterfactual CDT Oracles.
• Multiple counterfactual Oracles, all but one running AFDT.
• Multiple counterfactual Oracles, running AFDT and CDT.
• The same setups as the last three, but with counterfactual Oracles replaced with low-bandwidth Oracles making their decisions simultaneously and in isolation.

While the following setups could be dangerous:

• Multiple Oracles, running FDT and CDT.
• Multiple Oracles, running FDT.
• Multiple non-simultaneous low-bandwidth Oracles.
3 Warnings about AFDT

3.1 Incorrect view of reality

Now AFDT behaves as if it had an incorrect view of reality. Write x=ri(y) to say that message x, from Oi, is the best response if Oi knows that Oj will output y, for j≠i.

So, for the data above, we have r1(x2)=x1, r2(x1)=x2, r1(y2)=y1, and r2(y1)=y2. For the xi and yi, r1∘r2 and r2∘r1 act as the identity, but that will typically not be the case (consider matching pennies, or other anti-coordination games).

Then an AFDT O1 will initially assume that it outputs a∅, and will typically assume that O2 outputs r2(a∅), the best response to that. It will then output r1(r2(a∅)), its best response to that.

In actuality, if O2 is capable of reasoning through that, it will output r2(r1(r2(a∅))), the best response to O1's true output.
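This gap between prediction and reality can be traced in a toy anti-coordination game; the best-response tables below are hypothetical, chosen so that O1 wants to match O2's message while O2 wants to mismatch:

```python
# Toy anti-coordination game (hypothetical best-response tables):
# O1's best response is to match O2's message; O2's is to mismatch O1's.
def r1(msg):
    return msg  # O1 matches

def r2(msg):
    return "tails" if msg == "heads" else "heads"  # O2 mismatches

default = "heads"  # the default action a_0 of the AFDT Oracle O1

predicted_o2 = r2(default)    # O1's model of O2: r2(a_0) = "tails"
o1_output = r1(predicted_o2)  # O1 outputs r1(r2(a_0)) = "tails"
actual_o2 = r2(o1_output)     # O2 really outputs r2(r1(r2(a_0))) = "heads"

print(o1_output, predicted_o2, actual_o2)
```

O1 predicted "tails" from O2 but O2 actually plays "heads": the AFDT agent acts on a self-consistent but false picture of its opponent, exactly the "incorrect view of reality" described above.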

But that's the price you pay when you reject all trade offers, and force others to adjust to you.

3.2 AFDT is not a good decision theory

It should go without saying, but AFDT is not a good decision theory for an aligned agent, nor is it stable under self-improvement. It two-boxes in the Newcomb problem, and defects in the prisoner's dilemma against a copy of itself (it will do this, whatever the default action is).

In games like the stag hunt against itself, its default action will be its ultimate action (similarly for most coordination games) - this is because its default action is self-confirming. In matching pennies, its action will be the contrary of its default action. It can reach deals, but only in some narrow causal situations.

For these reasons, an AFDT agent will try to design a successor that isn't AFDT (in essence, since AFDT is a mix of FDT and CDT, it will try to ditch the CDT part of itself).

Thus we'll always need to ensure that the AFDT is present within the Oracle, and hasn't been optimised away, by inner or outer optimisation.

3.3 The simulation/ignorance of position issue

In cases of simulation, or if the agent doesn't know which Oracle it is, even CDT can behave in FDT-ish fashion. It just needs to think "given that I don't know who I am, what policy would be best for all my copies to run?".

This will turn AFDT into an FDT-ish decision theory, too. So AFDT requires that the agent's position is specified, in advance of it deciding on any policy or action. For Oracles, this shouldn't be too hard - "yes, you are the Oracle in this box, at this time, answering this question - and if you're not, behave as if you were the Oracle in this box, at this time, answering this question".

This means that if one uses AUDT - similar to AFDT but based on updateless decision theory instead of FDT - then the "updateless" is a misnomer: AUDT must come "pre-updated" with the agent's position information.

1. Ignoring the input data for this example, since there is no stochastic uncertainty and only one action. ↩︎

2. If one doesn't want to use a default action, one can alternatively define:

• $AFDT'(P, G_F, G_C) := \operatorname{argmax}_{a \in A} \operatorname{argmax}_{a' \in A} \mathbb{E}\big(R \mid do_F(AFDT'(\underline{P}, \underline{G_F}) = a'),\, do_C(\mathrm{Act} = a)\big)$.

This agent will first compute the FDT solution, then choose its own action given this assumption. It's an interesting agent design, even though it doesn't solve the "Oracles colluding" problem. It's interesting because the only deals it can knowingly reach are those that are Nash equilibria. So, good on stag hunts, bad on prisoner's dilemmas. ↩︎

Discuss

### What is an Evidential Decision Theory agent?

5 декабря, 2019 - 16:48
Published on December 5, 2019 1:48 PM UTC

One thing that I've noticed when talking to people about decision theory is that there is a lot of confusion about what an evidential decision theory agent actually is. People have heard that it doesn't smoke in Smoking Lesion and that it pays in X-Or Blackmail, but that is merely what it does. They may know that it doesn't do Pearlean graph surgery or differentiate correlation from causation in some sense, but that is merely what it is not. They may even know it calculates an expected value using the probability distribution P(O|S&A), but that is just a mathematical formalisation which anyone can quote without any real understanding. I've taken a stab at clarifying it in a few short-form posts, but people didn't seem to find them very enlightening.

Even now, my understanding is still weaker than I'd like. Like I just spent over fifteen minutes thinking about whether it would be accurate to characterise it as an agent that is purely concerned with correlation with no notion of causation. I thought this would be accurate at first, but then I realised that an EDT agent wouldn't expect buying a diamond necklace to increase its wealth, so it seems to have at least some ability to model causation.
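As one concrete anchor, the P(O|A) conditioning that EDT uses can be run on a toy Smoking Lesion model; all the probabilities below are made up purely for illustration:

```python
from itertools import product

# Toy Smoking Lesion joint distribution (all probabilities hypothetical).
# The lesion causes both a taste for smoking and cancer; smoking itself
# does not cause cancer. EDT sees only the resulting correlation.
def p_joint(lesion, smoke, cancer):
    p_lesion = 0.5
    p_smoke = 0.9 if lesion else 0.1    # lesion makes smoking likely
    p_cancer = 0.8 if lesion else 0.05  # only the lesion causes cancer
    return ((p_lesion if lesion else 1 - p_lesion)
            * (p_smoke if smoke else 1 - p_smoke)
            * (p_cancer if cancer else 1 - p_cancer))

def p_cancer_given_smoke(smoke):
    num = sum(p_joint(l, smoke, True) for l in (True, False))
    den = sum(p_joint(l, smoke, c)
              for l, c in product((True, False), repeat=2))
    return num / den

# Conditioning on smoking raises P(cancer) even with no causal link,
# which is why a naive EDT agent refrains from smoking.
print(p_cancer_given_smoke(True), p_cancer_given_smoke(False))
```

In this model P(cancer|smoke) = 0.725 versus P(cancer|¬smoke) = 0.125, so an agent evaluating actions purely by conditioning treats smoking as if it caused the cancer, even though the causal structure says otherwise.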

Anyway, it seems, to me at least, that it would be rather useful for someone to have a go at providing a clear explanation of what exactly it is.

Discuss

### Seeking Power is Provably Instrumentally Convergent in MDPs

5 декабря, 2019 - 05:16
Published on December 5, 2019 2:16 AM UTC

In 2008, Steve Omohundro's foundational Basic AI Drives made important conjectures about what superintelligent goal-directed AIs might do, including gaining as much power as possible to best achieve their goals. Toy models have been constructed in which Omohundro's conjectures bear out, and the supporting philosophical arguments are intuitive. The conjectures have recently been the center of debate between well-known AI researchers.

Instrumental convergence has been heuristically understood as an anticipated risk, but not as a formal phenomenon with a well-understood cause. The goal of this post (and accompanying paper) is to change that.

The structure of the agent's environment means that most goals incentivize gaining power over that environment, which I prove within the Markov decision process formalism, the staple of reinforcement learning.[1] Furthermore, maximally gaining power over an environment is bad for other agents therein. That is, power seems constant-sum after a certain point.

I'm going to provide the intuitions for a mechanistic understanding of power and instrumental convergence, and then informally prove that optimal action usually means trying to stay alive, gain power, and take over the world; read the linked paper for the rigorous version. Lastly, I'll talk about why these results excite me.

Intuitions

I claim that

The structure of the agent's environment means that most goals incentivize gaining power over that environment.

By environment, I mean the thing the agent thinks it's interacting with. Here, we're going to think about dualistic environments where you can see the whole state, where there are only finitely many states to see and actions to take, and where the rules are deterministic. Also, future stuff gets geometrically discounted; at discount rate 1/2, this means stuff in one turn is half as important as stuff now, stuff in two turns is a quarter as important, and so on. Pac-Man is an environment structured like this: you see the game screen (the state), you take an action, and then you deterministically get a result (another state). There's only finitely many screens, and only finitely many actions – they all had to fit onto the arcade controller, after all!

When I talk about "goals", I'm talking about reward functions over states: each way-the-world-could-be gets assigned some point value. The canonical way of earning points in Pac-Man is just one possible reward function for the game.

Instrumental convergence supposedly exists for sufficiently wide varieties of goals, so today we'll think about the most variety possible: the distribution of goals where each possible state is uniformly randomly assigned a reward in the [0,1] interval (although the theorems hold for a lot more distributions than this[2]). Sometimes, I'll say things like "most agents do X", which means "maximizing total discounted reward usually entails doing X when your goals are drawn from the uniform distribution". We say agents are "farsighted" when the discount rate is sufficiently close to 1 (the agent doesn't prioritize immediate reward over delayed gratification).

Power

You can do things in the world and take different paths through time. Let's call these paths "possibilities"; they're like filmstrips of how the future could go.

If you have more control over the future, you're usually[3] choosing among more paths-through-time. This lets you more precisely control what kinds of things happen later. This is one way to concretize what people mean when they use the word 'power' in everyday speech, and it will be the definition used going forward: the ability to achieve goals in general.[4] In other words, power is the average attainable utility across a distribution of goals.

This definition seems philosophically reasonable: if you have a lot of money, you can make more things happen and have more power. If you have social clout, you can spend that in various ways to better tailor the future to various ends. Dying means you can't do much at all, and all else equal, losing a limb decreases your power.

Exercise: spend a few minutes considering whether real-world intuitive examples of power are explained by this definition.

Once you feel comfortable that it's at least a pretty good definition, we can move on.

Imagine a simple game with three choices: eat candy, eat a chocolate bar, or hug a friend.

The power of a state is how well agents can generally do by starting from that state. It's important to note that we're considering power from behind a "veil of ignorance" about the reward function. We're averaging the best we can do for a lot of different individual goals.

Each reward function has an optimal possibility, or path-through-time. If chocolate has maximal reward, then the optimal possibility is Start→Chocolate→Chocolate….

Since the distribution randomly assigns a value in [0,1] to each state, an agent can expect to average 3/4 reward. This is because you're choosing between three choices, each of which has some value between 0 and 1. The expected maximum of n draws from the uniform distribution on [0,1] is n/(n+1); you have three draws here, so you expect to be able to get 3/4 reward. Now, some reward functions do worse than this, and some do better; but on average, they get 3/4 reward. You can test this out for yourself.
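Testing it out can be as simple as a Monte Carlo estimate (a sketch using only the standard library):

```python
import random

# Monte Carlo check: the expected maximum of n independent Uniform[0,1]
# draws is n/(n+1), so three choices should average 3/4 reward.
random.seed(0)
n_trials = 200_000
total = sum(max(random.random(), random.random(), random.random())
            for _ in range(n_trials))
estimate = total / n_trials
print(estimate)  # close to 0.75
```

With 200,000 trials the estimate lands well within 0.01 of the analytic value 3/4.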

If you have no choices, you expect to average 1/2 reward: sometimes the future is great, sometimes it's not. Conversely, the more things you can choose between, the closer this gets to 1 (i.e., you can do well by all goals, because each has a great chance of being able to steer the future how you want).

Instrumental convergence

Plans that help you better reach a lot of goals are called instrumentally convergent. To travel as quickly as possible to a randomly selected coordinate on Earth, one likely begins by driving to the nearest airport. Although it's possible that the coordinate is within driving distance, it's not likely. Driving to the airport would then be instrumentally convergent for travel-related goals.

We define instrumental convergence as optimal agents being more likely to take one action than another at some point in the future. I want to emphasize that when I say "likely", I mean from behind the veil of ignorance. Suppose I say that it's 50% likely that agents go left, and 50% likely they go right. This doesn't mean any agent has the stochastic policy of 50% left / 50% right. This means that, when drawing goals from our distribution, 50% of the time optimal pursuit of the goal entails going left, and 50% of the time it entails going right.

Consider either eating candy now, or earning some [0,1] reward for waiting a second before choosing between chocolate and hugs.

Let's think about how optimal action tends to change as we start caring about the future more. Think about all the places you can be after just one turn:

We could be in two places. Imagine we only care about the reward we get next turn. How many goals choose Candy over Wait? Well, it's 50-50 – since we randomly choose a number between 0 and 1 for each state, each state has an equal chance to be maximal. About half of nearsighted agents go to Candy and half go to Wait. There isn't much instrumental convergence yet. Note that this is also why nearsighted agents tend not to seek power.

Now think about where we can be in two turns:

We could be in three places. Supposing we care more about the future, more of our future control is coming from Wait!. In other words, about two thirds of our power is coming from our ability to Wait!. But is Wait! instrumentally convergent? If the agent is farsighted, the answer is yes (why?).

In the limit of farsightedness, the chance of each possibility being optimal approaches 1/3 (each terminal state has an equal chance to be maximal).
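Both limits, the nearsighted 50-50 split and the farsighted 1/3 chance for each terminal state, can be checked by sampling reward functions (a sketch, assuming each state's reward is an independent uniform draw):

```python
import random

# Sample reward functions for the Candy / Wait -> {Chocolate, Hug} toy
# environment and count which first move is optimal in each limit.
random.seed(0)
n = 100_000
nearsighted_candy = 0  # only next-turn reward matters
farsighted_candy = 0   # only the eventual terminal state matters

for _ in range(n):
    candy, wait, choc, hug = (random.random() for _ in range(4))
    if candy > wait:            # nearsighted: compare immediate rewards
        nearsighted_candy += 1
    if candy > max(choc, hug):  # farsighted: compare terminal states
        farsighted_candy += 1

print(nearsighted_candy / n)  # ~1/2: no instrumental convergence yet
print(farsighted_candy / n)   # ~1/3: Wait is optimal ~2/3 of the time
```

The farsighted agent chooses Wait about twice as often as Candy, matching the claim that most of the control (and hence optimality probability) flows through the waiting state.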

There are two important things happening here.

Important Thing #1

Instrumental convergence doesn't happen in all environments. An agent starting at blue isn't more likely to go up or down at any given point in time.

There's also never instrumental convergence when the agent doesn't care about the future at all (when γ=0). However, let's think back to what happens in the waiting environment:

As the agent becomes farsighted, the Chocolate and Hug possibilities become more likely.

We can show that instrumental convergence exists in an environment if and only if a path through time becomes more likely as the agent cares more about the future.

Important Thing #2

The more control-at-future-timesteps an action provides, the more likely it is to be selected. What an intriguing "coincidence"!

Power-seeking

So, it sure seems like gaining power is a good idea for a lot of agents!

Having tasted a few hints for why this is true, we'll now walk through the intuitions a little more explicitly. This, in turn, will show some pretty cool things: most agents avoid dying in Pac-Man, keep the Tic-Tac-Toe game going as long as possible, and avoid deactivation in real life.[5]

Let's focus on an environment with the same rules as Tic-Tac-Toe, but considering the uniform distribution over reward functions. The agent (playing O) keeps experiencing the final state over and over when the game's done. We bake the opponent's policy into the environment's rules: when you choose a move, the game automatically replies.

Whenever we make a move that ends the game, we can't reach anything else – we have to stay put. Since each final state has the same chance of being optimal, a move which doesn't end the game is more likely than a move which does. Let's look at part of the game tree, with instrumentally convergent moves shown in green.

Starting on the left, all but one move ends the game; the second-to-last move lets us keep choosing among five more final outcomes. For reasonably farsighted agents at the first state, the green move is ~50% likely to be optimal, while each of the others is optimal for only ~10% of goals. So we see a kind of "self-preservation" arising, even in Tic-Tac-Toe.
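As a toy check of those numbers (my own simplified model, not the post's exact game tree): suppose four of the five available moves each end the game in a single final outcome, while the green move keeps five final outcomes reachable. For a farsighted agent with uniformly sampled rewards, each of the nine final outcomes is equally likely to be best, so the green move should be optimal with probability 5/9 ≈ 56% and each other move with probability 1/9 ≈ 11%.

```python
import random

def p_green_optimal(n=100_000, seed=0):
    """Probability that the game-prolonging move is optimal for a
    farsighted agent, under uniformly random rewards over final boards."""
    rng = random.Random(seed)
    green = 0
    for _ in range(n):
        immediate = [rng.random() for _ in range(4)]  # 4 game-ending moves
        via_green = [rng.random() for _ in range(5)]  # 5 outcomes kept reachable
        green += max(via_green) > max(immediate)
    return green / n

print(f"P(green optimal) ≈ {p_green_optimal():.2f}")  # ≈ 5/9
```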

Remember how, as the agent gets more farsighted, more of its control comes from Chocolate and Hug, while also these two possibilities become more and more likely?

The same thing is happening in Tic-Tac-Toe. Let's think about what happens as the agent cares more about later and later time steps.

The initial green move contributes more and more control, so it becomes more and more likely as we become more farsighted. This is not a coincidence. We can show that the power from a move and the likelihood of that move are approximately proportional.

This means that if one move contributes more than twice the power of another, the former is more likely to be optimal than the latter. In general, if one set of possibilities contributes 2K times the power of another set of possibilities, the former set is at least K times more likely than the latter.

Power-seeking is instrumentally convergent.

Reasons for excitement The direct takeaway

I'm obviously not "excited" that power-seeking happens by default, but I'm excited that we can see this risk more clearly. I'm also planning on getting this work peer-reviewed before purposefully entering it into the aforementioned mainstream debate, but here are some of my preliminary thoughts.

Imagine you prove that typing random strings will blow up the computer and kill you. Would you then say, "I'm not planning to type random strings", and proceed to enter your thesis into a word processor? No. You wouldn't type anything yet, not until you really, really understand what makes the computer blow up sometimes.

The overall concern raised by [the power-seeking theorem] is not that we will build powerful RL agents with randomly selected goals. The concern is that random reward function inputs produce adversarial power-seeking behavior, which can produce perverse incentives such as avoiding deactivation and appropriating resources. Therefore, we should have specific reason to believe that providing the reward function we had in mind will not end in catastrophe.

Speaking to the broader debate taking place in the AI research community, I think a productive posture here will be investigating and understanding these results in more detail, getting curious about unexpected phenomena, and seeing how the numbers crunch out in reasonable models. I think that even though the alignment community may have superficially understood many of these conclusions, there are many new concepts for the broader AI community to explore.

Incidentally, if you're a member of this broader community and have questions, please feel free to email me at turneale[at]oregonstate[dot]edu.

Explaining catastrophes

AI alignment research can often have a slippery feeling to it. We're trying hard to become less confused about basic concepts, and there's only everything on the line.

What are "agents"? Do people even have "values", and should we try to get the AI to learn them? What does it mean to be "corrigible", or "deceptive"? What are our machine learning models even doing? I mean, sometimes we get a formal open question (and this theory of possibilities has a few of those), but not usually.

We have to do philosophical work while in a state of significant confusion and ignorance about the nature of intelligence and alignment. We're groping around in the dark with only periodic flashes of insight to guide us.

In this context, we were like,

wow, it seems like every time I think of optimal plans for these arbitrary goals, the AI can best complete them by gaining a ton of power to make sure it isn't shut off. Everything slightly wrong leads to doom, apparently?

and we didn't really know why. Intuitively, it's pretty obvious that most agents don't have deactivation as their dream outcome, but we couldn't actually point to any formal explanations, and we certainly couldn't make precise predictions.

On its own, Goodhart's law doesn't explain why optimizing proxy goals leads to catastrophically bad outcomes, instead of just less-than-ideal outcomes.

I've heard that, from this state of ignorance, alignment proposals shouldn't rely on instrumental convergence being a thing (and I agree). If you're building superintelligent systems for which slight mistakes apparently lead to extinction, and you want to evaluate whether your proposal to avoid extinction will work, you obviously want to deeply understand why extinction happens by default.

We're now starting to have this kind of understanding. I suspect that power-seeking is the thing that makes capable goal-directed agency so dangerous.[6] If we want to consider more benign alternatives to goal-directed agency, then deeply understanding why goal-directed agency is bad is important for evaluating alternatives. This work lets us get a feel for the character of the underlying incentives of a proposed system design.

Formalizations

Defining power as "the ability to achieve goals in general" seems to capture just the right thing. I think it's good enough that I view important theorems about power (as defined in the paper) as philosophically insightful.

Considering power in this way seems to formally capture our intuitive notions about what resources are. For example, our current position in the environment means that having money allows us to exert more control over the future. That is, our current position in the state space means that having money allows more possibilities and greater power (in the formal sense). However, possessing green scraps of paper would not be as helpful if one were living alone near Alpha Centauri. In a sense, resource acquisition can naturally be viewed as taking steps to increase one's power.

Power might be important for reasoning about the strategy-stealing assumption (and I think it might be similar to what Paul means by "flexible influence over the future"). Evan Hubinger has already noted the utility of the distribution of attainable utility shifts for thinking about value-neutrality in this context (and power is another facet of the same phenomenon). If you want to think about whether, when, and why mesa optimizers might try to seize power, this theory seems like a valuable tool.

And, of course, we're going to use this notion of power to design an impact measure.

The formalization of instrumental convergence seems to be correct. We're able to now make detailed predictions about e.g. how the difficulty of getting reward affects the level of farsightedness at which seizing power tends to make sense. This also might be relevant for thinking about myopic agency, as the broader theory formally describes how optimal action tends to change with the discount factor.

Another useful conceptual distinction is that power and instrumental convergence aren't the same thing; we can construct environments where the state with the highest power is not instrumentally convergent from another state. This is a bit subtle; rest assured, it doesn't contradict the letter or spirit of the power-seeking theorem. I might make a future post explaining the distinction. For now, I refer the reader to the paper for further clarification.

The broader theory of possibilities lends significant insight into the structure of Markov decision processes; it feels like a piece of basic theory that was never discovered earlier, for whatever reason. More on this another time.

Future deconfusion

What excites me the most is a little more vague: there's a new piece of AI alignment we can deeply understand, and understanding breeds understanding.

Acknowledgements

This work was made possible by the Center for Human-Compatible AI, the Berkeley Existential Risk Initiative, and the Long-Term Future Fund.

Logan Smith (elriggs) spent an enormous amount of time writing Mathematica code to compute power and measure in arbitrary toy MDPs, saving me from needing to repeatedly do quintuple+ integrations by hand. I thank Rohin Shah for his detailed feedback and brainstorming over the summer, and Tiffany Cai for the argument that arbitrary possibilities have 1/2 expected value (and so optimal average control can't be worse than this). Zack M. Davis, Chase Denecke, William Ellsworth, Vahid Ghadakchi, Ofer Givoli, Evan Hubinger, Neale Ratzlaff, Jess Riedel, Duncan Sabien, Davide Zagami, and TheMajor gave feedback on drafts of this post.

1. It seems reasonable to expect the key results to generalize in spirit to larger classes of environments, but keep in mind that the claims I make are only proven to apply to finite deterministic MDPs. ↩︎

2. Specifically, consider any continuous bounded distribution D, distributed identically over the state space S (so the joint distribution over reward functions is D^|S|). The kind of power-seeking and Tic-Tac-Toe-esque instrumental convergence I'm gesturing at should also hold for discontinuous bounded nondegenerate D.

The power-seeking argument works for arbitrary distributions over reward functions (with instrumental convergence also being defined with respect to that distribution) – identical distribution enforces "fairness" over the different parts of the environment. It's not as if instrumental convergence might not exist for arbitrary distributions – it's just that proofs for them are less informative (because we don't know their structure a priori).

For example, without identical distribution, we can't say that agents (roughly) tend to preserve the ability to reach as many 1-cycles as possible (this is the last theorem in the paper); after all, you could just distribute reward in [99,100] on an arbitrary 1-cycle and 0 reward for all other states. According to this "distribution", only moving towards that 1-cycle is instrumentally convergent. ↩︎

3. Power is not the same thing as number of possibilities! Power is average attainable utility; you might have a lot of possibilities, but not be able to choose between them for a long time, which decreases your control over the (discounted) future.

Also, remember that we're assuming dualistic agency: the agent can choose whatever sequence of actions it wants. That is, there aren't "possibilities" it's unable to take. ↩︎

4. Informal definition of "power" suggested by Cohen et al. ↩︎

5. We need to take care when applying theorems to real life, especially since the power-seeking theorem assumes the state is fully observable. Obviously, this isn't true in real life, but it seems reasonable to expect the theorem to generalize appropriately. ↩︎

6. I'll talk more in future posts about why I presently think power-seeking is the worst part of goal-directed agency. ↩︎

Discuss

### Elementary Statistics

December 5, 2019 - 05:00
Published on December 5, 2019 2:00 AM UTC

Our elementary school has a directory listing kids and parents, and since we live in the future it's a spreadsheet, which means I can count things. A typical family at this K-5 school has one child enrolled (76%). The child has two parents (96%) with different last names (59%), but shares a last name with at least one parent (89%). The parents don't share email addresses (95%) or phone numbers (97%), do use gmail (72%), and do have Boston area codes (68%). Our family is in the majority for each of these, even though there's naively only an 18% chance of that happening, and these properties seem reasonably independent.
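That figure is just the product of the eight majority shares, assuming independence (a quick check; the rounded percentages multiply out to a bit over 17%, consistent with the post's 18% given rounding):

```python
# Majority shares for: one child, two parents, different parent last names,
# child shares a name with a parent, separate emails, separate phones,
# gmail, Boston area codes.
majorities = [0.76, 0.96, 0.59, 0.89, 0.95, 0.97, 0.72, 0.68]

p = 1.0
for share in majorities:
    p *= share
print(f"P(majority on all eight) ≈ {p:.1%}")
```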

It's surprising to me that while parents mostly don't have the same names as each other (59%), only 11% of their kids have hyphenated names. I guess people realized that hyphenated names grow exponentially? I'd like to look at how the children's last names relate to parental gender, but that would involve annotating inferred genders for ~500 parents.
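The exponential growth is easy to see: if every couple hyphenated, a child's surname would carry the components of both parents' surnames, doubling each generation. A one-liner makes the point (my own illustration):

```python
def name_components(generations):
    """Surname components after g generations of universal hyphenation,
    starting from single-component names: doubles each generation."""
    return 2 ** generations

for g in range(6):
    print(f"generation {g}: {name_components(g)} name components")
```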

Boring details:

• There are 233 kids across six grades (K-5) with two classes per grade, for an average of 19 kids per class. Kindergarten is the biggest (45 kids: 22 and 23) while second grade is the smallest (30 kids: 15 each).

• 224/233 (96%) have two parents listed.

• Of the 224 kids with two parents listed, the parents share a last name with each other in 91 cases (41%).

• 207/233 (89%) of kids have the same name as at least one of their parents.

• Of the 26 kids who have a different name, 15 (58%) are hyphenations, 7 (27%) have no obvious connection, 3 (12%) list only one parent and may share a last name with the other, and 1 (4%) is an unhyphenation (Sam Alpha-Bravo and Pat Charlie, with child Alex Bravo).

• 141 (76%) of families have one child in the school, 43 (23%) have two, and 2 (1%) have three.

• All parents are listed with email addresses, but 12/224 (5%) have the same email listed for both parents.

• Of 355 unique email addresses, 255 (72%) are gmail accounts, 35 (10%) are yahoo, 13 (4%) are hotmail, 7 (2%) are aol, and 11 (3%) are edu.

• All but one parent is listed with a phone number, but 6/224 (3%) have the same phone listed for both parents.

• Of 357 unique phone numbers, 242 (68%) are Boston (including Somerville), 29 (8%) are NYC, 28 (8%) are Boston's inner suburbs, and 8 (2%) are the Bay Area. Boston's Northern suburbs (978 and 351) and Southern suburbs (508 and 774) are each below 1%, behind Chicago and Philly.