# LessWrong.com News

A community blog devoted to refining the art of rationality
Updated: 39 minutes 7 seconds ago

### Practical Considerations Regarding Political Polarization

January 28, 2019 - 01:26
Published on January 27, 2019 10:26 PM UTC

HAROLD WASHINGTON LIBRARY

6TH FLOOR NORTH STUDY ROOM
** BYO SNACKS **

~~~~~~~~~~~~~
~~~~~~~~~~~~~

https://www.lesswrong.com/posts/QGDgMr3za43WuNZHu/the-context-is-conflict

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
LONG FORM TOPIC PROPOSAL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The goal of this discussion is to answer a very specific question that seems to come up frequently: is it OK to punch a Nazi? But more pragmatically: what are the effects of punching a Nazi?

This question is a stand-in for a more serious question about communication. Some people think it's important to communicate angrily, loudly, adversarially, essentially as if we're at war with the other side, because we have to win. Some people advocate civil discourse, because they think that adversarial communication is counter-productive, pushing the sides further apart and making things worse. The first people counter that silence validates and supports oppression. The second people counter that there is something between silence and war. And so forth, ad nauseam.

This is my poor summary of a frequently recurring argument with, I think, very real practical implications. So please bring your own versions of the debate.

I suspect that a lot of rationalists will favor the civil-discourse side, so I want to give a bit of a boost to the other side so that we carefully consider both possible answers. Social movement theory (history) suggests that change often occurs only after violence, and that changing social conventions in particular requires being disruptive and making people uncomfortable. I.e., people need to be pushed past their tipping point: an equilibrium cannot be exited slowly. It's the revolution-vs-evolution problem. Malcolm X vs. MLK. There is really no easy answer.

I couldn't find any rationalist-sphere readings on this specific topic, so the readings I recommend for this discussion deal with polarization and conflict more generally.

Discuss

### [Question] Why is this utilitarian calculus wrong? Or is it?

January 28, 2019 - 00:08
Published on January 27, 2019 6:32 PM UTC

Suppose that I value a widget at $30. Suppose that the widget costs the widget-manufacturer $20 to produce, but, due to monopoly power on their part, they can charge $100 per widget. The economic calculus for this problem is as follows. $30 (widget valuation) - $100 (widget price) = -$70 to me; $100 (widget price) - $20 (widget cost) = $80 to widget producers. $80 - $70 = +$10 total value. Ordinarily, this wouldn't imply that utilitarians are required to spend all their money on widgets, because for a function to convert dollars to utils u($), u'($) > 0, u''($) < 0, and widget-producers usually have higher $ than widget consumers.
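The arithmetic above can be sanity-checked with a few lines (the variable names are mine, not the author's):

```python
# Surplus arithmetic from the widget example.
valuation = 30  # what the widget is worth to me, in dollars
price = 100     # the monopoly price
cost = 20       # the producer's cost

consumer_surplus = valuation - price  # -70 to me
producer_surplus = price - cost       # +80 to the widget producers
total_surplus = consumer_surplus + producer_surplus  # +10 total value

print(consumer_surplus, producer_surplus, total_surplus)  # -70 80 10
```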

But suppose the widget monopolist is a poor worker commune. The profits go directly to the workers who, on average, have lower $ than I do. It seems like buying widgets would be more moral than, say, donating $80 to the same group of poor people ($80 - $80 = $0), because the widget purchase slightly compensates me for the donation in a way that is greater than the cost to the recipient of producing the widget. And yet, I feel even less moral compunction to buy widgets than I do to donate $80 to GiveDirectly. Is this just an arbitrary, unjustifiable, subconscious desire to shove economic transactions into a separate domain from charitable donations, or is there actually some mistake in the utilitarian logic here? If there isn't a mistake in the logic, is this something that the Open Philanthropy Project should be looking at?

[Question inspired by a similar question at the end of chapter 7 of Steven Landsburg's The Armchair Economist]

Discuss

### "The Unbiased Map"

January 27, 2019 - 22:08
Published on January 27, 2019 7:08 PM UTC

“Long have we suffered under the tyranny of maps.

Biased maps which show topography, but not population.

Wretched maps which speak of religion, but not languages.

Divisive maps which paint with the color of Party, but not the color of economic conditions.

Dirty maps which show crop yield across the heartland, but neglect Fiber Optic Internet coverage.

Our age calls for better maps, maps free from the bias of these old maps, perfect maps.

Imagine the day of the unbiased map. The map which shows both how to get to the airport via public transit and GDP by county. The holy map demonstrating last year’s rainfall and the distribution of seminaries and rabbinical schools. The ancestral map depicting migration of immigrants and American tribes in 1491.

Don’t give me an atlas which pretends at perfection but hisses red herrings from hydra-heads. I want the real thing, a map which doesn’t end at some arbitrary border whether it be the county line, or the sphere of earth. A map which can show the world as known by the Qing Dynasty, Strabo, Majorcan Jews, and the Aztecs. A map of Elon Musk’s neurons and a map of the solar system.

Today’s maps enlighten as the Brothers Grimm, through a bundle of fairy tales. There are no ethical maps under capitalism, all of them drip with the status quo. None show me the world that should be, none provide directions to Valhalla, all show but the thin surface of Reality. And for Mankind, the surface does not satisfy!”

Discuss

### Prediction Contest 2018: Scores and Retrospective

January 27, 2019 - 20:20
Published on January 27, 2019 5:20 PM UTC

Way back in April 2018, I announced a Prediction Contest, in which the person who made the best predictions on a bunch of questions on PredictionBook ahead of a 1st July deadline would win a prize after they all resolved in January 2019, which is now.

It was a bit of an experiment; I had no idea how many people were up for practicing predictions to try to improve their calibration, and decided to throw a little money and time at giving it a try. And in the spirit of reporting negative experimental results: the answer was three, all of whom I greatly appreciate for their participation. I don't regret running the experiment, but I'm going to pass on running a Prediction Contest 2019. I don't think this necessarily rules out trying to practically test and compete in rationality-related areas in other ways later, though.

The Results

Our entrants were bendini, bw, and Ialaithion, and their ranked log scores were:

bw: -9.358221122

Ialaithion: -9.594999615

bendini: -10.0044594

This was sufficiently close that changing a single question's resolution could tip the results, so they were all pretty good. That said, bw came out ahead, and even managed to beat averaging everyone's predictions: if you simply took the average prediction (including non-entrants') as of the entry deadline and made that your prediction, you'd have got -9.576568147.
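The post doesn't spell out its scoring rule, but a common convention for log scores in this range (assumed in this sketch) is the sum of natural-log probabilities assigned to the outcomes that actually occurred:

```python
import math

def log_score(predictions, outcomes):
    """Sum of ln(probability assigned to what actually happened).

    predictions: probabilities given to "yes" for each question;
    outcomes: the True/False resolutions. Higher (closer to 0) is better.
    """
    return sum(
        math.log(p if resolved_yes else 1 - p)
        for p, resolved_yes in zip(predictions, outcomes)
    )

# A forecaster who said 80% on two questions that both resolved "yes":
print(log_score([0.8, 0.8], [True, True]))  # about -0.446
```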

The full calculations for each of the log scores, as well as my own log score and the results of feeding the predictions as of prediction time to a simple model rather than simply averaging them, are in a spreadsheet here.

I'll be in touch with bw to sort out their prize this evening, and thanks to everyone who participated and who helped with finding questions to use for it.

Discuss

### Freely Complying With the Ideal: A Theory of Happiness

January 27, 2019 - 15:28
Published on January 27, 2019 12:28 PM UTC

Epistemic status: somewhat empirical, having gone from badly depressed to astoundingly happy within a year through a lot of different experiences (and no drugs). I don’t know whether what I do that makes me happy is idiosyncratic to me and a subgroup of people like me, or universal. The discussion will tell (although many will be unaware of their own preferences/potential for preference changes, as I was a year ago).

This post is neither an attempt to predict happiness nor to teach Three Easy Steps To Be Happy™. I try to verbalize what I believe causes me to be so happy in such a consistent way.

Over the last few months, I have repetitively done some pretty difficult things, almost entirely because they were part of a path that 10+years-ideal-me would have taken at my age to become him.

One year ago, I asked myself if there was a version of the world in which I’d want to be alive: the answer was yes. That ideal has changed greatly since then; along with the milestones my preferences evolved. Still, the nature of the purpose that guides me remains the same: I try to walk the path that ideal me would be walking.

I think that taking, now, the actions that will bring about the ideal you or your ideal world will bring immense happiness into your life. I'll lay out my explanation of why that is.

Kahneman talks about the tyranny of the remembering self. We take pictures instead of enjoying the moment; in anticipation we choose not the experiences that will be the nicest but those that will be the most memorable, etc. The problem, according to him, is that this doesn't optimize for overall happiness, which includes experiential happiness as well. We don't gain as much value from fondly remembering memorable events as we lose by choosing them over more agreeable ones (because we overestimate how much time we'll spend remembering).

Well, guess what you do have to remember all the time, as if you’d uncontrollably pressed “repeat” on the memory player? You.

Even if you aren’t pressing the button, everything is a reminder. The people, of your social status; the opposite sex, of your sexual marketplace value; your tests, job, understanding of the world... of your intelligence; and so on.

I‘ll admit here that it gets a little confusing what’s experiential and what’s remembering. Being, you experience you, how you think, how your self and your thoughts change; but everything around you also reminds you how much you suck if you think you do, how much you rock if you think you do, how much you [whatever you think of yourself] if you [whatever you think of yourself]. And then there are the memories of what you’ve done, from plaguing to inspiring.

When you choose to take the path that your ideal self would take rather than whatever the default derp state is (what I call Complying With the Ideal versus The Couch), it is true that the actions themselves can suck more. For instance, if ideal you is in great shape, you’ll go to the gym. Pull-ups are worse than Facebook browsing (straight pain vs. a mild feeling of interest, possibly followed by mild self-hatred or disgust). That said, the self you experience and are reminded of is just so much better to you, because they're walking the path you want. Annoying, difficult actions are quickly forgotten; they don’t really make one unhappy. Are you unhappy because you’ve taken out the trash? Of course not. Unless you‘re not doing it freely, from your own agency or sense of responsibility. This is where the “Free” in the title makes its appearance.

I went back home to France for Christmas. Even though almost all emotions were positive, we were having fun, and I was there for too short a time to get into fights, after a couple of days I felt my happiness fall somewhat drastically. Things felt heavy, complicated, similar to how they felt when I was depressed. I couldn‘t understand why until I came back to London.

Yes the emotions were high, but I literally wasn’t in the driver’s seat of my life. We had to go from place to place, my parents driving us. X amount of time there, enjoy, time‘s up, get in the car. Y amount of time in that other place, enjoy, time‘s up, get in the car. And so on.

I quickly found myself miserable because I wasn’t making the choices for myself, I wasn’t free. I was given a script, I had to play the part. I couldn’t not play it, much less play my own script. This was quick to make me excessively tired, and from there things went downhill.

The rationalist community is acutely aware of rationalization. Whatever you like, whatever your tribe is, you’ll rationalize why that is Good, more Logical, Better, Smarter, whatever you think is a valuable quality, a reason to pursue. And we think that is Bad. With which I tend to agree.

There are however aspects to rationalization which we can use to our advantage.

Two examples.

Your brain is so good at coming up with logical-sounding stuff to justify preferences and emotions that if you look at something and say “Oh, this reminds me of... when I was a kid...” you will come up with something true, that you are truly reminded of with a genuine feeling that the thing and the memory have something in common — despite that, as the words left your mouth, you had absolutely no idea of what you were going to say. Rationalization is fast.

If someone that you don’t know insults you, it triggers your internal alarms and conflicts with your idea of yourself. You are not a loser people can blatantly disrespect. ...unless they were your friend. In order for reality to match your ego/map, you will laugh it off, and rationalize that the person must be joking and actually nice, or your friend, and before you know whether that’s true, you will like them more. Under certain conditions, negatives will be rationalized as positives.

The way that rationalization will work to your advantage is that, if you’re doing something that bears some degree of pain for you, whether physical or mental, once you plow through your initial resistance to doing it, your brain will quickly rationalize that you like it or that some higher purpose makes it worth the pain.

When someone forces you to do something, you don’t want to do it. Plowing through the resistance, although arguably doable, becomes much harder. Self-inflicted hardship, however, when plowed through, will be not just tolerable but some kind of mystical, I-must-accomplish-type journey. Mystical, probably, because your brain can’t fabricate a good logical argument, so it just feeds you a feeling to induce perseverance. It creates engagement. Which in itself is already great, because your brain goes from a hindrance to your final goal (when you resist the initial pain) to a facilitator.

The reason freedom is important is that it allows you to do something that you want, i.e., it enables you to Comply With the Ideal. Rationalization of pain can only go so far. You can rationalize liking hardships, but if it’s aimless pain forever, you’ll soon find yourself out of strength. There comes a point where the pain has to face the Why question (and more often than not, it’s the first time the pain arises). As you lose your agency over yourself (i.e., you get less choice in what you do), even when you chose to lose it, the Why slowly fades away and you find yourself lacking that ultimate justification — the one that’s not a rationalization — reaching your ideal self. And you give up.

The need for agency is, by the way, why coaching doesn’t work, why advice isn’t listened to. People wish “if only someone could drive my life for me”. But that’s terrible and that’s not real. As soon as the coach leaves, you’re left to your weak brain circuitry. You don’t get a driver’s license for jumping in cabs.

My point is this:

1. When you comply with your ideal, you’ll see reasons everywhere why you’re awesome; you’ll be proud; you won’t feel ashamed to rest; you won’t beat yourself up over taking breaks, because they’ll feel well deserved.
2. Thanks to rationalization, the pain of the path will be toned down once it’s clear to your brain that you have locked in on the path and will not flinch. You will want it in a weird kind of way. When all I want to do is quit, I'll often double the reps of my last set at the gym, just because. Overall I’m very happy about it, but when it happens a part of me wants to punch me in the face. The beauty is that it doesn’t need to make sense as long as it’s making you do it.
3. Because the rationalization mechanism stops working effectively when you’re not acting from your own agency, you have to be the person who chooses and walks the path for yourself.

I’m not sure I believe it after all, that Happiness = Freedom + Compliance With the Ideal. It seems to me that what happiness fundamentally comes down to is the happy feeling, and you can get that by learning to reframe negatives as lessons/funny/interesting/unimportant and everything else as good/funny/interesting/beautiful/soothing (which I’ve done, and it works very, very well). But maybe you need to like yourself to begin with, to be able to think positively.

The post is submitted nonetheless. If anything it’ll trigger an interesting discussion, and after all the argument could still be true.

Discuss

### Confessions of an Abstraction Hater

January 27, 2019 - 08:50
Published on January 27, 2019 5:50 AM UTC

I've written about the cost of abstraction before. Once you've been in the IT industry for a couple of decades, and once you've read a couple of million lines of legacy code, you become healthily suspicious of any kind of abstraction. Not that we can do without abstraction. We need it to be able to write code at all. However, each time you encounter an abstraction in the code that could have been avoided, you get a little bit sadder. And some codebases are sadder than Romeo and Juliet and King Lear combined.

Remember reading an unfamiliar codebase the last time? Remember how you thought that the authors were a bunch of incompetent idiots?

People may argue that this is because legacy stuff is necessarily convoluted, but hey, at that point you were just skimming through the codebase, and you didn't understand it deeply enough to tell your typical enterprise legacy monstrosity from the work of an architectural genius. The reason you were annoyed was that you were overwhelmed by the sheer amount of unfamiliar abstraction. (To prove that, consider what your opinion of the codebase was a few months later, after getting familiar with it. It looked much better, no?)

Keep that feeling in mind. Think of it when writing new code. How will a person who doesn't know the first thing about this codebase feel when reading it?

The options are not palatable. Either you try to be clever, use abstraction a lot and they'll think you are a moron. Or you get rid of all unnecessary abstraction. You'll make their life much less frustrating but they'll think you are some kind of simpleton. (And they'll probably refactor the code to make it look more clever.)

I want to give a very basic example of the phenomenon.

Imagine that the requirements are that your program does A, B, C, D and E, in that order.

You can do it in the dumbest possible way:

```c
void main() {
    // Do A.
    ...
    // Do B.
    ...
    // Do C.
    ...
    // Do D.
    ...
    // Do E.
    ...
}
```

Or maybe you notice that B, C and D are kind of related and comprise a logical unit of work:

```c
void foo() {
    // Do B.
    ...
    // Do C.
    ...
    // Do D.
    ...
}

void main() {
    // Do A.
    ...
    foo();
    // Do E.
    ...
}
```

But C would probably be better off as a stand-alone function. You can imagine a case where someone would like to call it from elsewhere:

```c
void bar() {
    // Do C.
    ...
}

void foo() {
    // Do B.
    ...
    bar();
    // Do D.
    ...
}

void main() {
    // Do A.
    ...
    foo();
    // Do E.
    ...
}
```

Now think of it from the point of view of a casual reader, someone who's just skimming through the code.

When they look at the first version of the code, they may think the author was a simpleton, but they can read it with ease. It looks like a story. You can read it as if it were a novel. There's nothing confusing there. The parts come in the correct order:

A B C D E

But when skimming through the refactored code that's no longer the case. What you see is:

C B D A E

It's much harder to get a grip on what's going on there, but at least they'll appreciate the author's cleverness.

Or maybe they won't.

January 27th, 2019

Discuss

### Río Grande: judgment calls

January 27, 2019 - 06:50
Published on January 27, 2019 3:50 AM UTC

In a particularly bad recent bout of anxiety, I learned what seemed to me to be a new and exciting mental move, which I named ‘making a judgment call’. However, if you had asked me previously whether I made judgment calls, I would have said ‘yes’, so describing what I am talking about here is perhaps somewhat subtle.

When I have to decide about something and the relevant issues are not certain, I think I have usually waited for the world to resolve enough uncertainty to make the decision clear. For instance, for it to become clear that the food is safe to eat, or sufficiently likely to be unsafe that it should be thrown out. I mean, I thought and collected evidence, but the procedure was to go through these steps until the answer was returned. The thing I would have called ‘making a judgment call’ was something like ‘think until it becomes clear that the food is safe’ (then ‘the food is safe’ is your judgment), or perhaps ‘think until it is clear at a higher level that more thinking isn’t worthwhile’. You make a judgment by becoming confident in the expected value calculation. It isn’t necessarily an explicit calculation, but you feel confident enough that one side is right.

But you can also just stop before anything is clear. Before you even have a clear assessment of the relevant uncertainties and expected values, or which heuristics are solidly applicable in this case. Instead of waiting until the best way to act reveals itself to you, you can make a decision. You can just say ‘nah, the food is fine, I judged it so’.

Something like that felt like a mental motion that I didn’t know that I could do. A bit like learning to wiggle your ears, when you didn’t know where the muscles were. A friend asked me what this mental motion felt like. I can’t remember what I said, but now I’d say it has a sense of ownership and mineness. I suppose because what was more a feature of the circumstance has been replaced by my own will.

Surely I have always often done something like this in other kinds of cases, e.g. when I’m deciding where to put the tomato on my lunch plate I don’t do any kind of implicit EV calculation that I’m aware of. But the mental motion there feels different—the situation doesn’t present itself as a choice in the same way perhaps. I think I just follow some feeling of what is right. Which seems like a different interesting avenue of decision making exploration, but I shan’t go into it here.

I think my ability to do this comes and goes, and I might be wrong that it is a distinct thing, or a thing I hadn’t done much before. I don’t have a detailed recollection of my mental processes in general. But this is what it seemed like to me.

This schema of there being some passive process which can be replaced with an active decision—and of being able to make a decision where you didn’t know you could—reminds me a bit of the kind of mistake where you jump to trying to get things (or assuming that you are trying to get things) because you feel desire for them. Or the one where you jump to thinking that you believe a thing, because it is displayed in your head sometimes. But actually you can choose what to pursue, and what to believe, and it’s way better. Well, you can also choose what decisions to make.

Perhaps in general, you can observe who you are or you can choose who you are. (Or, either way you are choosing, but maybe choosing badly because you haven’t noticed that you have a choice). And these aren’t different ways of seeing the world, they are different sets of processes you can run.

Discuss

### "Forecasting Transformative AI: An Expert Survey", Gruetzemacher et al 2019

January 27, 2019 - 05:34
Published on January 27, 2019 2:34 AM UTC

Discuss

### Building up to an Internal Family Systems model

January 26, 2019 - 15:25
Published on January 26, 2019 12:25 PM UTC

Introduction

Internal Family Systems (IFS) is a psychotherapy school/technique/model which lends itself particularly well to being used alone or with a peer. For years, I had noticed that many of the kinds of people who put a lot of work into developing their emotional and communication skills, some within the rationalist community and some outside it, kept mentioning IFS.

So I looked at the Wikipedia page about the IFS model, and bounced off, since it sounded like nonsense to me. Then someone brought it up again, and I thought that maybe I should reconsider. So I looked at the WP page again, thought “nah, still nonsense”, and continued to ignore it.

This continued until I participated in CFAR mentorship training last September, where we had a class on CFAR’s Internal Double Crux (IDC) technique. IDC clicked really well for me, so I started using it a lot and also facilitating it for some friends. However, once we started using it on more emotional issues (as opposed to just things with empirical facts pointing in different directions), we started running into some weird things which it felt like IDC couldn’t quite handle… things which reminded me of how people had been describing IFS. So I finally read up on it, and have been successfully applying it ever since.

In this post, I’ll try to describe and motivate IFS in terms which are less likely to give people in this audience the same kind of a “no, that’s nonsense” reaction as I initially had.

Epistemic status

This post is intended to give an argument for why something like the IFS model could be true and could work. It’s not really an argument that IFS is correct. My reason for thinking in terms of IFS is simply that I was initially super-skeptical of it (more on the reasons for my skepticism later), but then started encountering things which it turned out IFS predicted - and I only found out about IFS predicting those things after I familiarized myself with it.

Additionally, I now feel that IFS gives me significantly more gears for understanding the behavior of both other people and myself, and it has been significantly transformative in addressing my own emotional issues. Several other people who I know report it having been similarly powerful for them. On the other hand, aside from a few isolated papers with titles like “proof-of-concept” or “pilot study”, there seems to be conspicuously little peer-reviewed evidence in favor of IFS, meaning that we should probably exercise some caution.

I think that, even if not completely correct, IFS is currently the best model that I have for explaining the observations that it’s pointing at. I encourage you to read this post in the style of learning soft skills - trying on this perspective, and seeing if there’s anything in the description which feels like it resonates with your experiences.

But before we talk about IFS, let’s first talk about building robots. It turns out that if we put together some existing ideas from machine learning and neuroscience, we can end up with a robot design that pretty closely resembles IFS’s model of the human mind.

What follows is an intentionally simplified story, which is simpler than either the full IFS model or a full account that would incorporate everything that I know about human brains. Its intent is to demonstrate that an agent architecture with IFS-style subagents might easily emerge from basic machine learning principles, without claiming that all the details of that toy model would exactly match human brains. A discussion of what exactly IFS does claim in the context of human brains follows after the robot story.

Wanted: a robot which avoids catastrophes

Suppose that we’re building a robot that we want to be generally intelligent. The hot thing these days seems to be deep reinforcement learning, so we decide to use that. The robot will explore its environment, try out various things, and gradually develop habits and preferences as it accumulates experience. (Just like those human babies.)

Now, there are some problems we need to address. For one, deep reinforcement learning works fine in simulated environments where you’re safe to explore for an indefinite duration. However, it runs into problems if the robot is supposed to learn in a real life environment. Some actions which the robot might take will result in catastrophic consequences, such as it being damaged. If the robot is just doing things at random, it might end up damaging itself. Even worse, if the robot does something which could have been catastrophic but narrowly avoids harm, it might then forget about it and end up doing the same thing again!

How could we deal with this? Well, let’s look at the existing literature. Lipton et al. (2016) proposed what seems like a promising idea for addressing the part about forgetting. Their approach is to explicitly maintain a memory of danger states - situations which are not the catastrophic outcome itself, but from which the learner has previously ended up in a catastrophe. For instance, if “being burned by a hot stove” is a catastrophe, then “being about to poke your finger in the stove” is a danger state. Depending on how cautious we want to be and how many preceding states we want to include in our list of danger states, “going near the stove” and “seeing the stove” can also be danger states, though then we might end up with a seriously stove-phobic robot.

In any case, we maintain a separate storage of danger states, in such a way that the learner never forgets about them. We use this storage of danger states to train a fear model: a model which is trying to predict the probability of ending up in a catastrophe from some given novel situation. For example, maybe our robot poked its robot finger at the stove in our kitchen, but poking its robot finger at stoves in other kitchens might be dangerous too. So we want the fear model to generalize from our stove to other stoves. On the other hand, we don’t want it to be stove-phobic and run away at the mere sight of a stove. The task of our fear model is to predict exactly how likely it is for the robot to end up in a catastrophe, given some situation it is in, and then make it increasingly disinclined to end up in the kinds of situations which might lead to a catastrophe.

This sounds nice in theory. On the other hand, Lipton et al. are still assuming that they can train their learner in a simulated environment, and that they can label catastrophic states ahead of time. We don’t know in advance every possible catastrophe our robot might end up in - it might walk off a cliff, shoot itself in the foot with a laser gun, be beaten up by activists protesting technological unemployment, or any number of other possibilities.

So let’s take inspiration from humans. We can’t know beforehand every bad thing that might happen to our robot, but we can identify some classes of things which are correlated with catastrophe. For instance, being beaten or shooting itself in the foot will cause physical damage, so we can install sensors which indicate when the robot has taken physical damage. If these sensors - let’s call them “pain” sensors - register a high amount of damage, we consider the situation to have been catastrophic. When they do, we save that situation and the situations preceding it to our list of dangerous situations. Assuming that our robot has managed to make it out of that situation intact and can do anything in the first place, we use that list of dangerous situations to train up a fear model.
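A minimal sketch of that never-forgotten danger-state storage, loosely after the idea attributed to Lipton et al. (2016); the class and method names here are invented for illustration:

```python
import random

class DangerMemory:
    """Permanent store of situations that preceded a catastrophe.

    The key property is that entries are never evicted, so the learner
    cannot "forget" a narrowly avoided disaster; samples from this store
    are later used to train the fear model.
    """

    def __init__(self, horizon=3):
        self.horizon = horizon  # how many preceding states count as dangerous
        self.states = []

    def record_catastrophe(self, trajectory):
        # Keep the last few situations leading up to the catastrophe.
        self.states.extend(trajectory[-self.horizon:])

    def sample(self, k):
        # Draw a training batch for the fear model.
        return random.sample(self.states, min(k, len(self.states)))

memory = DangerMemory()
memory.record_catastrophe(["kitchen", "near_stove", "reach", "touch_stove"])
print(memory.states)  # ['near_stove', 'reach', 'touch_stove']
```

Widening `horizon` corresponds to the trade-off described above: include "going near the stove" and "seeing the stove" and you risk a stove-phobic robot.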

At this point, we notice that this is starting to remind us about our experience with humans. For example, the infamous Little Albert experiment. A human baby was allowed to play with a laboratory rat, but each time that he saw the rat, a researcher made a loud scary sound behind his back. Soon Albert started getting scared whenever he saw the rat - and then he got scared of furry things in general.

Something like Albert’s behavior could be implemented very simply using something like Hebbian conditioning to get a learning algorithm which picks up on some features of the situation, and then triggers a panic reaction whenever it re-encounters those same features. For instance, it registers that the sight of fur and loud sounds tend to coincide, and then it triggers a fear reaction whenever it sees fur. This would be a basic fear model, and a “danger state” would be “seeing fur”.
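Such a basic fear model could be only a few lines; here is a toy version of the Hebbian idea, with an invented learning rate and threshold:

```python
class HebbianFear:
    """Toy Hebbian fear model: features that co-occur with a scare gain
    weight, and enough total weight over a scene's features triggers fear."""

    def __init__(self, rate=0.5, threshold=0.4):
        self.weights = {}  # feature -> learned association with danger
        self.rate = rate
        self.threshold = threshold

    def observe(self, features, scary):
        if scary:  # strengthen every feature present during the scare
            for f in features:
                self.weights[f] = self.weights.get(f, 0.0) + self.rate

    def is_afraid(self, features):
        return sum(self.weights.get(f, 0.0) for f in features) >= self.threshold

fear = HebbianFear()
fear.observe({"fur", "loud_noise"}, scary=True)
print(fear.is_afraid({"fur"}))  # True: fur alone now triggers the fear reaction
```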

Wanting to keep things simple, we decide to use this kind of an approach as the fear model of our robot. Also, having read Consciousness and the Brain, we remember a few basic principles about how those human brains work, which we decide to copy because we’re lazy and don’t want to come up with entirely new principles:

• There’s a special network of neurons in the brain, called the global neuronal workspace. The contents of this workspace are roughly the same as the contents of consciousness.
• We can thus consider consciousness a workspace which many different brain systems have access to. It can hold a single “chunk” of information at a time.
• The brain has multiple different systems doing different things. When a mental object becomes conscious (that is, is projected into the workspace by a subsystem), many systems will synchronize their processing around analyzing and manipulating that mental object.

So here is our design:

• The robot has a hardwired system scanning for signs of catastrophe. This system has several subcomponents. One of them scans the “pain” sensors for signs of physical damage. Another system watches the “hunger” sensors for signs of low battery.
• Any of these “distress” systems can, alone or in combination, feed a negative reward signal into the global workspace. This tells the rest of the system that this is a bad state, from which the robot should escape.
• If a certain threshold level of “distress” is reached, the current situation is designated as catastrophic. All other priorities are suspended and the robot will prioritize getting out of the situation. A memory of the situation and the situations preceding it is saved to a dedicated storage.
• After the experience, the memory of the catastrophic situation is replayed in consciousness for analysis. This replay is used to train up a separate fear model which effectively acts as a new “distress” system.
• As the robot walks around its environment, sensory information about the surroundings will enter its consciousness workspace. When it plans future actions, simulated sensory information about how those actions would unfold enters the workspace. Whenever the new fear model detects features in either kind of sensory information which it associates with the catastrophic events, it will feed “fear”-type “distress” into the consciousness workspace.
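The distress side of this design can be sketched briefly. The sensor names, buffer length, and threshold value below are all illustrative assumptions:

```python
class DistressMonitor:
    """Toy version of the hardwired distress layer described above."""

    CATASTROPHE_THRESHOLD = 0.8

    def __init__(self):
        self.danger_memories = []  # dedicated storage, later replayed
        self.recent_states = []    # rolling buffer of preceding situations

    def step(self, state, pain, hunger):
        self.recent_states.append(state)
        self.recent_states = self.recent_states[-5:]  # keep the last few
        # Any distress subsystem alone (or several in combination) can
        # raise the overall distress level.
        distress = max(pain, hunger)
        if distress >= self.CATASTROPHE_THRESHOLD:
            # Save the situation plus the situations preceding it, so a
            # replay can later train a new fear model on this episode.
            self.danger_memories.append(list(self.recent_states))
        return -distress  # negative reward fed into the workspace

monitor = DistressMonitor()
monitor.step("kitchen", pain=0.0, hunger=0.1)
monitor.step("near_stove", pain=0.0, hunger=0.1)
reward = monitor.step("touching_stove", pain=0.9, hunger=0.1)
# the stove episode, with its preceding states, is now in danger_memories
```

Note that the preceding states are saved along with the catastrophe itself: the replay needs them, since the cues worth fearing typically appear before the damage does.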

So if the robot sees things which remind it of poking at a hot stove, it will be inclined to go somewhere else; if it imagines doing something which would cause it to poke at the hot stove, then it will be inclined to imagine doing something else.

Introducing managers

But is this actually enough? We've now basically set up an algorithm which warns the robot when it sees things which have previously preceded a bad outcome. This might be enough for dealing with static tasks, such as not burning yourself at a stove. But it seems insufficient for dealing with things like predators or technological unemployment protesters, who might show up in a wide variety of places and actively try to hunt you down. By the time you see a sign of them, you're already in danger. It would be better if we could learn to avoid them entirely, so that the fear model would never even be triggered.

As we ponder this dilemma, we surf the web and run across this blog post summarizing Saunders, Sastry, Stuhlmüller & Evans (2017). They are also concerned with preventing reinforcement learning agents from running into catastrophes, but have a somewhat different approach. In their approach, a reinforcement learner is allowed to do different kinds of things, which a human overseer then allows or blocks. A separate “blocker” model is trained to predict which actions the human overseer would block. In the future, if the robot would ever take an action which the “blocker” predicts the human overseer would disallow, it will block that action. In effect, the system consists of two separate subagents, one subagent trying to maximize rewards and the other subagent trying to block non-approved actions.

Since our robot has a nice modular architecture into which we can add various subagents which are listening in and taking actions, we decide to take inspiration from this idea. We create a system for spawning dedicated subprograms which try to predict and block actions which would cause the fear model to be triggered. In theory, this is unnecessary: given enough time, even standard reinforcement learning should learn to avoid the situations which trigger the fear model. But again, trial-and-error can take a very long time to learn exactly which situations trigger fear, so we dedicate a separate subprogram to the task of pre-emptively figuring it out.

Each fear model is paired with a subagent that we’ll call a manager. While the fear model has associated a bunch of cues with the notion of an impending catastrophe, the manager learns to predict which situations would cause the fear model to trigger. Despite sounding similar, these are not the same thing: one indicates when you are already in danger, the other is trying to figure out what you can do to never end up in danger in the first place. A fear model might learn to recognize signs which technological unemployment protesters commonly wear. Whereas a manager might learn the kinds of environments where the fear model has noticed protesters before: for instance, near the protester HQ.

Then, if a manager predicts that a given action (such as going to the protester HQ) would eventually trigger the fear model, it will block that action and promote some other action. We can use the interaction of these subsystems to try to ensure that the robot only feels fear in situations which resemble the catastrophic situation closely enough to actually be dangerous. At the same time, the robot will be unafraid to act in situations which could eventually lead to a danger zone but are themselves safe to be in.
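A manager can be sketched as a simple learned veto over (situation, action) pairs. The interface and the example situation names below are assumptions made for illustration:

```python
class Manager:
    """Toy manager: learns which (situation, action) pairs have previously
    led to its paired fear model firing, and vetoes them pre-emptively."""

    def __init__(self):
        self.blocked = set()

    def observe_outcome(self, situation, action, fear_triggered):
        # The manager trains on the fear model's output, not on actual
        # catastrophes -- which is exactly what makes Goodharting possible.
        if fear_triggered:
            self.blocked.add((situation, action))

    def filter_actions(self, situation, candidate_actions):
        # Block actions predicted to end in fear; let the rest through.
        return [a for a in candidate_actions
                if (situation, a) not in self.blocked]

manager = Manager()
manager.observe_outcome("city", "go_to_protester_hq", fear_triggered=True)
manager.filter_actions("city", ["go_to_protester_hq", "go_home"])
# only "go_home" survives the veto
```

The division of labor is visible in the code: the fear model fires when danger cues are already present, while the manager steers action selection so those cues are never encountered.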

As an added benefit, we can recycle the manager component to also do the same thing as the blocker component in the Saunders et al. paper originally did. That is, if the robot has a human overseer telling it in strict terms not to do some things, it can create a manager subprogram which models that overseer and likewise blocks the robot from doing things which the model predicts that the overseer would disapprove of.

Putting together a toy model

If the robot does end up in a situation where the fear model is sounding an alarm, then we want to get it out of the situation as quickly as possible. It may be worth spawning a specialized subroutine just for this purpose. Technological unemployment activists could, among other things, use flamethrowers that set the robot on fire. So let’s call these types of subprograms dedicated to escaping from the danger zone, firefighters.

So how does the system as a whole work? First, the different subagents act by sending into the consciousness workspace various mental objects, such as an emotion of fear, or an intent to e.g. make breakfast. If several subagents are submitting identical mental objects, we say that they are voting for the same object. On each time-step, one of the submitted objects is chosen at random to become the contents of the workspace, with each object having a chance to be selected that’s proportional to its number of votes. If a mental object describing a physical action (an “intention”) ends up in the workspace and stays chosen for several time-steps, then that action gets executed by a motor subsystem.

Depending on the situation, some subagents will have more votes than others. E.g. a fear model submitting a fear object gets a number of votes proportional to how strongly it is activated. Besides the specialized subagents we’ve discussed, there’s also a relatively standard reinforcement learning algorithm, which is just taking whatever actions (that is, sending to the workspace whatever mental objects) it thinks will produce the greatest reward. This subagent only has a small number of votes.

Finally, there’s a self-narrative agent which is constructing a narrative of the robot’s actions as if it was a unified agent, for social purposes and for doing reasoning afterwards. After the motor system has taken an action, the self-narrative agent records this as something like “I, Robby the Robot, made breakfast by cooking eggs and bacon”, transmitting this statement to the workspace and saving it to an episodic memory store for future reference.
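The voting dynamics described above can be sketched in a few lines. The mental-object names, vote counts, and the three-step execution delay are illustrative assumptions:

```python
import random

EXECUTION_DELAY = 3  # time-steps an intention must hold the workspace

def workspace_step(submissions, rng=random):
    """One time-step: choose a mental object with probability proportional
    to its pooled vote count (identical objects share their votes)."""
    objects = list(submissions)
    weights = [submissions[o] for o in objects]
    return rng.choices(objects, weights=weights, k=1)[0]

def run(votes_over_time):
    """Execute an intention only after it has stayed in the workspace for
    EXECUTION_DELAY consecutive steps -- the window during which other
    subagents can still displace it with their own mental objects."""
    held, streak, executed = None, 0, []
    for submissions in votes_over_time:
        current = workspace_step(submissions)
        streak = streak + 1 if current == held else 1
        held = current
        if held.startswith("intend:") and streak >= EXECUTION_DELAY:
            executed.append(held)
            streak = 0
    return executed

run([{"intend:cook": 5}] * 3)  # an unopposed intention gets executed
run([{"fear": 5}] * 10)        # a fear object never reaches the motor system
```

The delay between an intention entering the workspace and being executed is what makes the "final veto" possible: a subagent with enough votes can replace the intention before the motor system acts on it.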

Consequences of the model

Is this design any good? Let’s consider a few of its implications.

First, in order for the robot to take physical actions, the intent to do so has to be in its consciousness for a long enough time for the action to be taken. If there are any subagents that wish to prevent this from happening, they must muster enough votes to bring into consciousness some other mental object replacing that intention before it’s been around for enough time-steps to be executed by the motor system. (This is analogous to the concept of the final veto in humans, where consciousness is the last place to block pre-consciously initiated actions before they are taken.)

Second, the different subagents do not see each other directly: they only see the consequences of each other’s actions, as that’s what’s reflected in the contents of the workspace. In particular, the self-narrative agent has no access to information about which subagents were responsible for generating which physical action. It only sees the intentions which preceded the various actions, and the actions themselves. Thus it might easily end up constructing a narrative which creates the internal appearance of a single agent, even though the system is actually composed of multiple subagents.

Third, even if the subagents can’t directly see each other, they might still end up forming alliances. For example, if the robot is standing near the stove, a curiosity-driven subagent might propose poking at the stove (“I want to see if this causes us to burn ourselves again!”), while the reinforcement learning system might propose cooking dinner, since that’s what it predicts will please the human owner. Now, a manager trying to prevent a fear model agent from being activated will eventually learn that if it votes for the RL system’s intention to cook dinner (which it saw earlier), then the curiosity-driven agent is less likely to get its intentions into consciousness. Thus, no poking at the stove, and the manager’s and the RL system’s goals end up aligned.

Fourth, this design can make it really difficult for the robot to even become aware of the existence of some managers. A manager may learn to support any other mental processes which block the robot from taking specific actions. It does this by voting in favor of mental objects which orient behavior towards anything else. This might manifest as something subtle, such as a mysterious lack of interest towards something that sounds like a good idea in principle, or just repeatedly forgetting to do something, as the robot always seems to get distracted by something else. The self-narrative agent, not having any idea of what’s going on, might just explain this as “Robby the Robot is forgetful sometimes” in its internal narrative.

Fifth, the reinforcement learning subagent here is doing something like rational planning, but given its weak voting power, it’s likely to be overruled if other subagents disagree with it (unless some subagents also agree with it). If some actions seem worth doing, but there are managers which are blocking them and the reinforcement learning subagent doesn’t have an explicit representation of them, this can manifest as all kinds of procrastinating behaviors and numerous failed attempts by the RL system to “try to get itself to do something”, using various strategies. But as long as the managers keep blocking those actions, the system is likely to remain stuck.

Sixth, the purpose of both managers and firefighters is to keep the robot out of a situation that has been previously designated as dangerous. Managers do this by trying to pre-emptively block actions that would cause the fear model agent to activate; firefighters do this by trying to take actions which shut down the fear model agent after it has activated. But the fear model agent activating is not actually the same thing as being in a dangerous situation. Thus, both managers and firefighters may fall victim to Goodhart’s law, doing things which block the fear model while being irrelevant for escaping catastrophic situations.

For example, “thinking about the consequences of going to the activist HQ” is something that might activate the fear model agent, so a manager might try to block just thinking about it. This has the obvious consequence that the robot can’t think clearly about that issue. Similarly, once the fear model has already activated, a firefighter might Goodhart by supporting any action which helps activate an agent with a lot of voting power that’s going to think about something entirely different. This could result in compulsive behaviors which were effective at pushing the fear aside, but useless for achieving any of the robot’s actual aims.

At worst, this could cause loops of mutually activating subagents pushing in opposite directions. First, a stove-phobic robot runs away from the stove as it was about to make breakfast. Then a firefighter trying to suppress that fear, causes the robot to get stuck looking at pictures of beautiful naked robots, which is engrossing and thus great for removing the fear of the stove. Then another fear model starts to activate, this one afraid of failure and of spending so much time looking at pictures of beautiful naked robots that the robot won’t accomplish its goal of making breakfast. A separate firefighter associated with this second fear model has learned that focusing the robot’s attention on the pictures of beautiful naked robots even more is the most effective action for keeping this new fear temporarily subdued. So the two firefighters are allied and temporarily successful at their goal, but then the first one - seeing that the original stove fear has disappeared - turns off. Without the first firefighter’s votes supporting the second firefighter, the fear manages to overwhelm the second firefighter, causing the robot to rush into making breakfast. This again activates its fear of the stove, but if the fear of failure remains strong enough, it might overpower its fear of the stove so that the robot manages to make breakfast in time...

Hmm. Maybe this design isn’t so great after all. Good thing we noticed these failure modes, so that there aren’t any mind architectures like this going around being vulnerable to them!

The Internal Family Systems model

But enough hypothetical robot design; let’s get to the topic of IFS. The IFS model hypothesizes the existence of three kinds of “extreme parts” in the human mind:

• Exiles are said to be parts of the mind which hold the memory of past traumatic events, which the person did not have the resources to handle. They are parts of the psyche which have been split off from the rest and are frozen in the time of the traumatic event. When something causes them to surface, they tend to flood the mind with pain. For example, someone may have an exile associated with times when they were romantically rejected in the past.
• Managers are parts that have been tasked with keeping the exiles permanently exiled from consciousness. They try to arrange a person’s life and psyche so that exiles never surface. For example, managers might keep someone from reaching out to potential dates due to a fear of rejection.
• Firefighters react when exiles have been triggered, and try to either suppress the exile’s pain or distract the mind from it. For example, after someone has been rejected by a date, they might find themselves drinking in an attempt to numb the pain.
• Some presentations of the IFS model simplify things by combining Managers and Firefighters into the broader category of Protectors, and so only talk about Exiles and Protectors.

Exiles are not limited to being created from the kinds of situations that we would commonly consider seriously traumatic. They can also be created from things like relatively minor childhood upsets, as long as the child didn’t feel like they could handle the situation.

IFS further claims that you can treat these parts as something like independent subpersonalities. You can communicate with them, consider their worries, and gradually persuade managers and firefighters to give you access to the exiles that have been kept away from consciousness. When you do this, you can show them that you are no longer in the situation which was catastrophic before, and now have the resources to handle it if something similar was to happen again. This heals the exile, and also lets the managers and firefighters assume better, healthier roles.

As I mentioned in the beginning, when I first heard about IFS, I was turned off by it for several different reasons. For instance, here were some of my thoughts at the time:

1. The whole model about some parts of the mind being in pain, and other parts trying to suppress their suffering. The thing about exiles was framed in terms of a part of the mind splitting off in order to protect the rest of the mind against damage. What? That doesn’t make any evolutionary sense! A traumatic situation is just sensory information for the brain, it’s not literal brain damage: it wouldn’t have made any sense for minds to evolve in a way that caused parts of it to split off, forcing other parts of the mind to try to keep them suppressed. Why not just… never be damaged in the first place?
2. That whole thing about parts being personalized characters that you could talk to. That… doesn’t describe anything in my experience.
3. Also, how does just talking to yourself fix any trauma or deeply ingrained behaviors?
4. IFS talks about everyone having a “True Self”. Quote from Wikipedia: “IFS also sees people as being whole, underneath this collection of parts. Everyone has a true self or spiritual center, known as the Self to distinguish it from the parts. Even people whose experience is dominated by parts have access to this Self and its healing qualities of curiosity, connectedness, compassion, and calmness. IFS sees the therapist's job as helping the client to disentangle themselves from their parts and access the Self, which can then connect with each part and heal it, so that the parts can let go of their destructive roles and enter into a harmonious collaboration, led by the Self.” That… again did not sound particularly derived from any sensible psychology.

Hopefully, I’ve already answered my past self’s concerns about the first point. The model itself talks in terms of managers protecting the mind from pain, exiles being exiled from consciousness in order for their pain to remain suppressed, etc. Which is a reasonable description of the subjective experience of what happens. But the evolutionary logic - as far as I can guess - is slightly different: to keep us out of dangerous situations.

The story of the robot describes the actual “design rationale”. Exiles are in fact subagents which are “frozen in the time of a traumatic event”, but they didn’t split off to protect the rest of the mind from damage. Rather, they were created as an isolated memory block to ensure that the memory of the event wouldn’t be forgotten. Managers then exist to keep the person away from such catastrophic situations, and firefighters exist to help escape them. Unfortunately, this setup is vulnerable to various failure modes, similar to those that the robot is vulnerable to.

With that said, let’s tackle the remaining problems that I had with IFS.

Personalized characters

IFS suggests that you can experience the exiles, managers and firefighters in your mind as something akin to subpersonalities - entities with their own names, visual appearances, preferences, beliefs, and so on. Furthermore, this isn’t inherently dysfunctional, nor indicative of something like Dissociative Identity Disorder. Rather, even people who are entirely healthy and normal may experience this kind of “multiplicity”.

Now, it’s important to note right off that not everyone has this to a major extent: you don’t need to experience multiplicity in order for the IFS process to work. For instance, my parts feel more like bodily sensations and shards of desire than subpersonalities, but IFS still works super-well for me.

In the book Internal Family Systems Therapy, Richard Schwartz, the developer of IFS, notes that if a person’s subagents play well together, then that person is likely to feel mostly internally unified. On the other hand, if a person has lots of internal conflict, then they are more likely to experience themselves as having multiple parts with conflicting desires.

I think that this makes a lot of sense, assuming the existence of something like a self-narrative subagent. If you remember, this is the part of the mind which looks at the actions that the mind-system has taken, and then constructs an explanation for why those actions were taken. (See e.g. the posts on the limits of introspection and on the Apologist and the Revolutionary for previous evidence for the existence of such a confabulating subagent with limited access to our true motivations.) As long as all the exiles, managers and firefighters are functioning in a unified fashion, the most parsimonious model that the self-narrative subagent might construct is simply that of a unified self. But if the system keeps being driven into strongly conflicting behaviors, then it can’t necessarily make sense of them from a single-agent perspective. Then it might naturally settle on something like a multiagent approach and experience itself as being split into parts.

Kevin Simler, in Neurons Gone Wild, notes how people with strong addictions seem particularly prone to developing multi-agent narratives:

This American Life did a nice segment on addiction a few years back, in which the producers — seemingly on a lark — asked people to personify their addictions. "It was like people had been waiting all their lives for somebody to ask them this question," said the producers, and they gushed forth with descriptions of the 'voice' of their inner addict:

"The voice is irresistible, always. I'm in the thrall of that voice."

"Totally out of control. It's got this life of its own, and I can't tame it anymore."

"I actually have a name for the voice. I call it Stan. Stan is the guy who tells me to have the extra glass of wine. Stan is the guy who tells me to smoke."

This doesn’t seem like it explains all of it, though. I’ve frequently been very dysfunctional, and have found very intuitive the notion of the mind being split into parts. Yet I mostly still don’t seem to experience my subagents anywhere near as person-like as some others clearly do. I know at least one person who ended up finding IFS because of having all of these talking characters in their head, and who was looking for something that would help them make sense of it. Nothing like that has ever been the case for me: I did experience strongly conflicting desires, but they were just that, strongly conflicting desires.

I can only surmise that it has something to do with the same kinds of differences which cause some people to think mainly verbally, others mainly visually, and others yet in some other hard-to-describe modality. Some fiction writers spontaneously experience their characters as real people who speak to them and will even bother the writer when at the supermarket, and some others don’t.

It’s been noted that the mechanisms which we use to model ourselves and other people overlap - not very surprisingly, since both we and other people are (presumably) humans. So it seems reasonable that some of the mechanisms for representing other people would sometimes also end up spontaneously recruited for representing internal subagents or coalitions of them.

Why should this technique be useful for psychological healing?

Okay, suppose it’s possible to access our subagents somehow. Why would just talking with these entities in your own head, help you fix psychological issues?

Let’s consider that a person having exiles, managers and firefighters is costly in the sense of constraining that person’s options. If you never want to do anything that would cause you to see a stove, that limits quite a bit of what you can do. I strongly suspect that many forms of procrastination and failure to do things we’d like to do are mostly a manifestation of overactive managers. So it’s important not to create those kinds of entities unless the situation really is one which should be designated as categorically unacceptable to end up in.

The theory for IFS mentions that not all painful situations turn into trauma: just ones in which we felt helpless and like we didn’t have the necessary resources for dealing with it. This makes sense, since if we were capable of dealing with it, then the situation can’t have been that catastrophic. The aftermath of the immediate event is important as well: a child who ends up in a painful situation doesn’t necessarily end up traumatized, if they have an adult who can put the event in a reassuring context afterwards.

But situations which used to be catastrophic and impossible for us to handle before, aren’t necessarily that any more. It seems important to have a mechanism for updating that cache of catastrophic events and for disassembling the protections around it, if the protections turn out to be unnecessary.

How does that process usually happen, without IFS or any other specialized form of therapy?

Often, by talking about your experiences with someone you trust. Or writing about them in private or in a blog.

In my post about Consciousness and the Brain, I mentioned that once a mental object becomes conscious, many different brain systems synchronize their processing around it. I suspect that the reason why many people have such a powerful urge to discuss their traumatic experiences with someone else, is that doing so is a way of bringing those memories into consciousness in detail. And once you’ve dug up your traumatic memories from their cache, their content can be re-processed and re-evaluated. If your brain judges that you now do have the resources to handle that event if you ever end up in it again, or if it’s something that simply can’t happen anymore, then the memory can be removed from the cache and you no longer need to avoid it.

I think it’s also significant that, while something like just writing about a traumatic event is sometimes enough to heal, often it’s more effective if you have a sympathetic listener who you trust. Traumas often involve some amount of shame: maybe you were called lazy as a kid and are still afraid of others thinking that you are lazy. Here, having friends who accept you and are willing to nonjudgmentally listen while you talk about your issues, is by itself an indication that the thing that you used to be afraid of isn’t a danger anymore: there exist people who will stay by your side despite knowing your secret.

Now, when you are talking to a friend about your traumatic memory, you will be going through cached memories that have been stored in an exile subagent. A specific memory circuit - one of several circuits specialized for the act of holding painful memories - is active and outputting its contents into the global workspace, from which they are being turned into words.

Meaning that, in a sense, your friend is talking directly to your exile.

Could you hack this process, so that you wouldn’t even need a friend, and could carry this process out entirely internally?

In my earlier post, I remarked that you could view language as a way of joining two people’s brains together. A subagent in your brain outputs something that appears in your consciousness, you communicate it to a friend, it appears in their consciousness, subagents in your friend’s brain manipulate the information somehow, and then they send it back to your consciousness.

If you are telling your friend about your trauma, you are in a sense joining your workspaces together, and letting some subagents in your workspace communicate with the “sympathetic listener” subagents in your friend’s workspace.

So why not let a “sympathetic listener” subagent in your workspace hook up directly with the traumatized subagents that are also in your own workspace?

I think that something like this happens when you do IFS. You are using a technique designed to activate the relevant subagents in a very specific way, which allows for this kind of a “hooking up” without needing another person.

For instance, suppose that you are talking to a manager subagent which wants to hide the fact that you’re bad at something, and starts reacting defensively whenever the topic is brought up. Now, one way by which its activation could manifest, is feeding those defensive thoughts and reactions directly into your workspace. In such a case, you would experience them as your own thoughts, and possibly as objectively real. IFS calls this “blending”; I’ve also previously used the term “cognitive fusion” for what’s essentially the same thing.

Instead of remaining blended, you then use various unblending / cognitive defusion techniques that highlight the way by which these thoughts and emotions are coming from a specific part of your mind. You could think of this as wrapping extra content around the thoughts and emotions, and then seeing them through the wrapper (which is obviously not-you), rather than experiencing the thoughts and emotions directly (which you might experience as your own). For example, the IFS book Self-Therapy suggests this unblending technique (among others):

Allow a visual image of the part [subagent] to arise. This will give you the sense of it as a separate entity. This approach is even more effective if the part is clearly a certain distance away from you. The further away it is, the more separation this creates.

Another way to accomplish visual separation is to draw or paint an image of the part. Or you can choose an object from your home that represents the part for you or find an image of it in a magazine or on the Internet. Having a concrete token of the part helps to create separation.

I think of this as something like, you are taking the subagent in question, routing its responses through a visualization subsystem, and then you see a talking fox or whatever. And this is then a representation that your internal subsystems for talking with other people can respond to. You can then have a dialogue with the part (verbally or otherwise) in a way where its responses are clearly labeled as coming from it, rather than being mixed together with all the other thoughts in the workspace. This lets the content coming from the sympathetic-listener subagent and the exile/manager/firefighter subagent be kept clearly apart, allowing you to consider the emotional content as you would as an external listener, preventing you from drowning in it. You’re hacking your brain so as to work as the therapist and the client at the same time.

The Self

IFS claims that, below all the various parts and subagents, there exists a “true self” which you can learn to access. When you are in this Self, you exhibit the qualities of “calmness, curiosity, clarity, compassion, confidence, creativity, courage, and connectedness”. Being at least partially in Self is said to be a prerequisite for working with your parts: if you are not, then you are not able to evaluate their models objectively. The parts will sense this, and as a result, they will not share their models properly, preventing the kind of global re-evaluation of their contents that would update them.

This was the part that I was initially the most skeptical of, and which made me most frequently decide that IFS was not worth looking at. I could easily conceptualize the mind as being made up of various subagents. But then it would just be numerous subagents all the way down, without any single one that could be designated the “true” self.

But let’s look at IFS’s description of how exactly to get into Self. You check whether you seem to be blended with any part. If you are, you unblend from it. Then you check whether you might also be blended with some other part. If you are, you unblend from it as well. You then keep doing this until you can find no part that you might be blended with. All that’s left are those “eight Cs”, which just seem to be a kind of a global state, with no particular part that they would be coming from.

I now think that “being in Self” represents a state where no particular subagent is getting a disproportionate share of voting power, and everything is processed by the system as a whole. Remember that in the robot story, catastrophic states were situations in which the organism should never end up. A subagent kicking in to prevent that from happening is a kind of priority override to normal thinking. It blocks you from being open and calm and curious because some subagent thinks that doing so would be dangerous. If you then turn off or suspend all those priority overrides, the mind’s default state absent any override seems to be one with the qualities of the Self.
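
The override framing in this summary can be sketched as a toy program. This is purely my own illustration of the model described above, with invented names like `Subagent` and the situation labels; nothing here comes from the IFS literature itself:

```python
# Toy sketch of "Self as the default state, protectors as priority overrides".
# All names and situations are made up for illustration.

DEFAULT_SELF_STATE = {"calm": True, "curious": True, "open": True}

class Subagent:
    def __init__(self, name, feared_situations):
        self.name = name
        self.feared_situations = set(feared_situations)

    def override_active(self, situation):
        # The override fires only in situations this subagent learned to fear.
        return situation in self.feared_situations

def current_state(subagents, situation):
    active = [s for s in subagents if s.override_active(situation)]
    if not active:
        return DEFAULT_SELF_STATE          # no overrides: you are "in Self"
    return {"blended_with": [s.name for s in active]}

protectors = [Subagent("stove-avoider", {"near_stove"})]
print(current_state(protectors, "at_desk"))     # default Self qualities
print(current_state(protectors, "near_stove"))  # an override blocks Self
```

On this toy picture, "unblending" corresponds to suspending the active overrides one by one until `current_state` falls through to the default, which matches the unblending loop described earlier.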

This actually fits at least one model of the function of positive emotions pretty well. Fredrickson (1998) suggests that an important function of positive emotions is to make us engage in activities such as play, exploration, and savoring the company of other people. Doing these things has the effect of building up skills, knowledge, social connections, and other kinds of resources which might be useful for us in the future. If there are no active ongoing threats, then that implies that the situation is pretty safe for the time being, making it reasonable to revert to a positive state of being open to exploration.

The Internal Family Systems Therapy book makes a somewhat big deal out of the fact that everyone, even most traumatized people, ultimately has a Self which they can access. It explains this in terms of the mind being organized to protect against damage, and with parts always splitting off from the Self when it would otherwise be damaged. I think the real explanation is much simpler: the mind is not accumulating damage, it is just accumulating a longer and longer list of situations not considered safe.

As an aside, this model feels like it makes me less confused about confidence. It seems like people are really attracted to confident people, and that to some extent it’s also possible to fake confidence until it becomes genuine. But if confidence is so attractive and we can fake it, why hasn’t evolution just made everyone confident by default?

Turns out that it has. The reason why faked confidence gradually turns into genuine confidence is that by forcing yourself to act in confident ways which felt dangerous before, your mind gets information indicating that this behavior is not as dangerous as you originally thought. That gradually turns off those priority overrides that kept you out of Self originally, until you get there naturally.

The reason why being in Self is a requirement for doing IFS is the existence of conflicts between parts. For instance, recall the stove-phobic robot having a firefighter subagent that caused it to retreat from the stove into watching pictures of beautiful naked robots. This triggered a subagent which was afraid of the naked-robot-watching preventing the robot from achieving its goals. If the robot now tried to do IFS and talk with the firefighter subagent that caused it to run away from stoves, this might bring to mind content which activated the exile that was afraid of not achieving things. Then that exile would keep flooding the mind with negative memories, trying to achieve its priority override of “we need to get out of this situation”, and preventing the process from proceeding. Thus, all of the subagents that have strong opinions about the situation need to be unblended from before integration can proceed.

IFS also has a separate concept of “Self-Leadership”. This is a process where various subagents eventually come to trust the Self, so that they allow the person to increasingly remain in Self even in various emergencies. IFS views this as a positive development, not only because it feels nice, but because doing so means that the person will have more cognitive resources available for actually dealing with the emergency in question.

I think that this ties back to the original notion of subagents being generated to invoke priority overrides for situations which the person originally didn’t have the resources to handle. Many of the subagents IFS talks about seem to emerge from childhood experiences. A child has many fewer cognitive, social, and emotional resources for dealing with bad situations, in which case it makes sense to just categorically avoid them, and invoke special overrides to ensure that this happens. A child’s cognitive capacities, models of the world, and abilities to self-regulate are also less developed, so she may have a harder time staying out of dangerous situations without having some priority overrides built in. An adult, however, typically has many more resources than a child does. Even when faced with an emergency situation, it can be much better to be able to remain calm and analyze the situation using all of one’s subagents, rather than having a few of them take over all the decision-making. Thus, it seems to me - both theoretically and practically - that developing Self-Leadership is really valuable.

That said, I do not wish to imply that it would be a good goal to never have negative emotions. Sometimes blending with a subagent, and experiencing resulting negative emotions, is the right thing to do in that situation. Rather than suppressing negative emotions entirely, Self-Leadership aims to get to a state where any emotional reaction tends to be endorsed by the mind-system as a whole. Thus, if feeling angry or sad or bitter or whatever feels appropriate to the situation, you can let yourself feel so, and then give yourself to that emotion without resisting it. As a result, negative emotions become less unpleasant to experience, since there are fewer subagents trying to fight against them. Also, if it turns out that being in a negative emotional state is no longer useful, the system as a whole can just choose to move back into Self.

Final words

I’ve now given a brief summary of the IFS model, and explained why I think it makes sense. This is of course not enough to establish the model as true. But it might help in making the model plausible enough to at least try out.

I think that most people could benefit from learning and doing IFS on themselves, either alone or together with a friend. I’ve been saying that exiles/managers/firefighters tend to be generated from trauma, but it’s important to realize that these events don’t need to be anything immensely traumatic. The kinds of ordinary, normal childhood upsets that everyone has had can generate these kinds of subagents. Remember, just because you think of a childhood event as trivial now, doesn’t mean that it felt trivial to you as a child. Doing IFS work, I’ve found exiles related to memories and events which I thought left no negative traces, but actually did.

Remember also that it can be really hard to notice the presence of some managers: if they are doing their job effectively, then you might never become aware of them directly. “I don’t have any trauma so I wouldn’t benefit from doing IFS” isn’t necessarily correct. Rather, the cues that I use for detecting a need to do internal work are:

• Do I have the qualities associated with Self, or is something blocking them?
• Do I feel like I’m capable of dealing with this situation rationally, and doing the things which feel like good ideas on an intellectual level?
• Do my emotional reactions feel like they are endorsed by my mind-system as a whole, or is there a resistance to them?

If not, there is often some internal conflict which needs to be addressed - and IFS, combined with some other practices such as Focusing and meditation, has been very useful for learning to resolve those internal conflicts.

Even if you don’t feel convinced that doing IFS personally would be a good idea, I think adopting its framework of exiles, managers and firefighters is useful for better understanding the behavior of other people. Their dynamics will be easier to recognize in other people if you’ve had some experience recognizing them in yourself, however.

If you want to learn more about IFS, I would recommend starting with Self-Therapy by Jay Earley. In terms of What/How/Why books, my current suggestions would be:

This post was written as part of research supported by the Foundational Research Institute. Thank you to everyone who provided feedback on earlier drafts of this article: Eli Tyre, Elizabeth Van Nostrand, Jan Kulveit, Juha Törmänen, Lumi Pakkanen, Maija Haavisto, Marcello Herreshoff, Qiaochu Yuan, and Steve Omohundro.

Discuss

### Future directions for narrow value learning

January 26, 2019 - 05:36
Published on January 26, 2019 2:36 AM UTC

Narrow value learning is a huge field that people are already working on (though not by that name) and I can’t possibly do it justice. This post is primarily a list of things that I think are important and interesting, rather than an exhaustive list of directions to pursue. (In contrast, the corresponding post for ambitious value learning did aim to be exhaustive, and I don’t think I missed much work there.)

You might think that since so many people are already working on narrow value learning, we should focus on more neglected areas of AI safety. However, I still think it’s worth working on because long-term safety suggests a particular subset of problems to focus on; that subset seems quite neglected.

For example, a lot of work is about how to improve current algorithms in a particular domain, and the solutions encode domain knowledge to succeed. This seems not very relevant for long-term concerns. Some work assumes that a handcoded featurization is given (so that the true reward is linear in the features); this is not an assumption we could make for more powerful AI systems.

I will speculate a bit on the neglectedness and feasibility of each of these areas, since for many of them there isn’t a person or research group who would champion them whom I could defer to about the arguments for success.

The big picture

This category of research is about how you could take narrow value learning algorithms and use them to create an aligned AI system. Typically, I expect this to work by having the narrow value learning enable some form of corrigibility.

As far as I can tell, nobody outside of the AI safety community works on this problem. While it is far too early to stake a confident position one way or the other, I am slightly less optimistic about this avenue of approach than one in which we create a system that is directly trained to be corrigible.

Avoiding problems with goal-directedness. How do we put together narrow value learning techniques in a way that doesn’t lead to the AI behaving like a goal-directed agent at each point? This is the problem with keeping a reward estimate that is updated over time. While reward uncertainty can help avoid some of the problems, it does not seem sufficient by itself. Are there other ideas that can help?

Dealing with the difficulty of “human values”. Cooperative IRL makes the unrealistic assumption that the human knows her reward function exactly. How can we make narrow value learning systems that deal with this issue? In particular, what prevents them from updating on our behavior that’s not in line with our “true values”, while still letting them update on other behavior? Perhaps we could make an AI system that is always uncertain about what the true reward is, but how does this mesh with epistemics, which suggest that you can get to arbitrarily high confidence given sufficient evidence?
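
To make the tension at the end concrete, here is a toy Bayesian update of my own (not from the post): a learner that starts maximally uncertain between two candidate reward functions still reaches arbitrarily high confidence once one hypothesis consistently explains the observations better.

```python
# Toy illustration: posterior over two reward hypotheses A and B after
# n i.i.d. observations, where A assigns each observation likelihood 0.6
# and B assigns it 0.4. The numbers are arbitrary; the point is the trend.

def posterior(prior_a, likelihood_a, likelihood_b, n_observations):
    """Posterior probability of hypothesis A after n observations."""
    pa = prior_a * likelihood_a ** n_observations
    pb = (1 - prior_a) * likelihood_b ** n_observations
    return pa / (pa + pb)

for n in [0, 10, 50]:
    print(n, posterior(0.5, 0.6, 0.4, n))
```

The posterior starts at 0.5 and climbs toward 1 as n grows, which is exactly why "keep the AI uncertain about the true reward" is in tension with ordinary Bayesian epistemics.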

Human-AI interaction

This area of research aims to figure out how to create human-AI systems that successfully accomplish tasks. For sufficiently complex tasks and sufficiently powerful AI, this overlaps with the big picture concerns above, but there are also areas to work on with subhuman AI with an eye towards more powerful systems.

Assumptions about the human. In any feedback system, the update that the AI makes on the human feedback depends on the assumption that the AI makes about the human. In Inverse Reward Design (IRD), the AI system assumes that the reward function provided by a human designer leads to near-optimal behavior in the training environment, but may be arbitrarily bad in other environments. In IRL, the typical assumption is that the demonstrations are created by a human behaving Boltzmann rationally, but recent research aims to also correct for any suboptimalities they might have, and so no longer assumes away the problem of systematic biases. (See also the discussion in Future directions for ambitious value learning.) In Cooperative IRL, the AI system assumes that the human models the AI system as approximately rational. COACH notes that when you ask a human to provide a reward signal, they provide a critique of current behavior rather than a reward signal that can be maximized.

Can we weaken the assumptions that we have to make, or get rid of them altogether? Barring that, can we make our assumptions more realistic?
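
As one concrete reference point, the Boltzmann-rationality assumption mentioned above can be sketched as follows. This is the textbook form of the model, not code from any of the cited papers:

```python
import math

# Boltzmann-rational human model: the human picks action a with
# probability proportional to exp(beta * Q(a)). Better actions are more
# likely, but mistakes still happen; beta controls how noisy the human is.

def boltzmann_policy(q_values, beta):
    """P(a) ∝ exp(beta * Q(a)); beta -> 0 is uniform, beta -> inf is optimal."""
    weights = {a: math.exp(beta * q) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

q = {"good": 1.0, "bad": 0.0}
print(boltzmann_policy(q, beta=0.0))   # uniform: the human looks random
print(boltzmann_policy(q, beta=5.0))   # near-deterministic: nearly optimal
```

An IRL algorithm built on this assumption inverts the model: given observed action frequencies, it infers which Q (and hence which reward) makes those actions likely. Systematic biases are precisely the human behaviors this model cannot express.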

Managing interaction. How should the AI system manage its interaction with the human to learn best? This is the domain of active learning, which is far too large a field for me to summarize here. I’ll throw in a link to Active Inverse Reward Design, because I already talked about IRD and I helped write the active variant.

Human policy. The utility of a feedback system is going to depend strongly on the quality of the feedback given by the human. How do we train humans so that their feedback is most useful for the AI system? So far, most work is about how to adapt AI systems to understand humans better, but it seems likely there are also gains to be had by having humans adapt to AI systems.

Finding and using preference information

New sources of data. So far preferences are typically learned through demonstrations, comparisons or rankings; but there are likely other useful ways to elicit preferences. Inverse Reward Design gets preferences from a stated proxy reward function. An obvious one is to learn preferences from what people say, but natural language is notoriously hard to work with so not much work has been done on it so far, though there is some. (I’m pretty sure there’s a lot more in the NLP community that I’m not yet aware of.) We recently showed that there is even preference information in the state of the world that can be extracted.

Handling multiple sources of data. We could infer preferences from behavior, from speech, from given reward functions, from the state of the world, etc. but it seems quite likely that the inferred preferences would conflict with each other. What do you do in these cases? Is there a way to infer preferences simultaneously from all the sources of data such that the problem does not arise? (And if so, what is the algorithm implicitly doing in cases where different data sources pull in different directions?)

Acknowledging Human Preference Types to Support Value Learning talks about this problem and suggests some aggregation rules but doesn’t test them. Reward Learning from Narrated Demonstrations learns from both speech and demonstrations, but they are used as complements to each other, not as different sources for the same information that could conflict.

I’m particularly excited about this line of research -- it seems like it hasn’t been explored yet and there are things that can be done, especially if you allow yourself to simply detect conflicts, present the conflict to the user, and then trust their answer. (Though this wouldn’t scale to superintelligent AI.)
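
A minimal sketch of the "detect conflicts and trust the user's answer" idea might look like the following. The interface, source functions, and option names are all invented for illustration; this is not from any of the papers above:

```python
# Each preference source maps an option pair to the option it prefers,
# or None if it has no information about that pair. We only fall back
# to asking the human when informative sources disagree (or none exist).

def aggregate(pair, sources, ask_user):
    votes = {s(pair) for s in sources if s(pair) is not None}
    if len(votes) == 1:
        return votes.pop()          # all informative sources agree
    return ask_user(pair)           # conflict or no data: defer to the human

# Toy sources: demonstrations favor tea, the stated reward favors coffee.
from_demos = lambda pair: "tea" if "tea" in pair else None
from_stated_reward = lambda pair: "coffee" if "coffee" in pair else None

choice = aggregate(("tea", "coffee"),
                   [from_demos, from_stated_reward],
                   ask_user=lambda pair: "tea")
print(choice)  # the sources conflict, so the (simulated) user settles it
```

As the post notes, deferring every conflict to the user doesn't scale to superintelligent AI, but it is a workable starting point for studying how conflicts between data sources actually arise.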

Generalization. Current deep IRL algorithms (or deep anything algorithms) do not generalize well. How can we infer reward functions that transfer well to different environments? Adversarial IRL is an example of work pushing in this direction, but my understanding is that it had limited success. I’m less optimistic about this avenue of research because it seems like in general function approximators do not extrapolate well. On the other hand, I and everyone else have the strong intuition that a reward function should take fewer bits to specify than the full policy, and so should be easier to infer. (Though not based on Kolmogorov complexity.)

The next AI Alignment Forum sequences post will be in the sequence on Iterated Amplification.

The next post in this sequence will be 'Conclusion to the sequence on value learning' by Rohin Shah, on Sunday 27th Jan.

Discuss

### Is intellectual work better construed as an exploration or a performance?

January 26, 2019 - 00:59
Published on January 25, 2019 9:59 PM UTC

I notice I rely on two metaphors of intellectual work:

1. intellectual work as exploration – intellectual work is an expedition through unknown territory (a la Meru, a la Amundsen & the South Pole). It's unclear whether the expedition will be successful; the explorers band together to function as one unit & support each other; the value of the work is largely "in the moment" / "because it's there"; the success of the exploration is mostly determined by objective criteria.

Examples: Andrew Wiles spending six years in secrecy to prove Fermat's Last Theorem, Distill's essays on machine learning, Donne Martin's data science portfolio (clearly a labor of love)

2. intellectual work as performance – intellectual work is a performative act with an audience (a la Black Swan, a la Super Bowl halftime shows). It's not clear that any given performance will succeed, but there will always be a "best performance"; performers tend to compete & form factions; the value of the work accrues afterward / the work itself is instrumental; the success of the performance is mostly determined by subjective criteria.

Examples: journal impact factors, any social science result that's published but fails to replicate, academic dogfights on Twitter

Clearly both of these metaphors do work – I'm wondering which is better to cultivate on the margin.

My intuition is that it's better to lean on the image of intellectual work as exploration; curious what folks here think.

Discuss

### Clarifying Logical Counterfactuals

January 25, 2019 - 18:05
Published on January 25, 2019 3:05 PM UTC

This post expands on and clarifies my previous analysis of Newcomb's problem to provide a general, but still somewhat informal approach to logical counterfactuals:

• MIRI's Functional Decision Theory paper defines subjunctive dependence to refer to the situation where two decision processes involve the same calculation. This can result from causation, such as if I calculate an equation, then tell you the answer, and we both implement a decision based on this sum. It can also occur non-causally, such as a prediction being linked to a decision as per Newcomb's problem.
• As explained in this article about the Co-operation Game, logical counterfactuals are more about your knowledge of the state of the world than the world itself. Suppose there are two people who can choose A or B. Suppose that a predictor knows that both people will choose A conditional on them being told one of the following two facts: a) the other person will choose A, or b) the other person will choose the same as you. Then whether your decision is modelled to subjunctively depend on the other person depends on which of the two facts you are told. Going further than the original post, one person might be told a) and the other b), so that the first sees themselves as not subjunctively linked to the second, while the second sees themselves as subjunctively linked to the first.
• A decision problem requires uncertainty about what decision you are going to implement. If you already know you are going to choose A, then whether A or B would be better is beside the point. If you have perfect knowledge of the environment and perfect knowledge of yourself, then producing a decision theory problem requires you to erase some knowledge about yourself.
• If we erased your knowledge of your decision, but left yourself with the knowledge of what a perfect predictor predicted about you, then you would still know what decision you would inevitably be making and we still wouldn't have a decision theory problem. So we need to erase this too.
• Concepts like counterfactuals and decisions exist in the map and not the territory. We can build up towards them as follows. Starting with the territory, we use some process to produce a model. We can then imagine constructing different models by switching out or altering parts of the model. Consistent models can be labelled Raw Counterfactuals. These represent what could have been, and provide a coherent notion of counterfactuals separate from any discussion of decisions.
• Causal Decision Theory uses its own notion of counterfactuals, which we'll term Decision Counterfactuals. These are created by performing world surgery on models of causal processes. Unlike raw counterfactuals, decision counterfactuals are inconsistent. Suppose that you actually defect in the Prisoner's Dilemma, but we are considering the counterfactual where you cooperate. Up until the point of the decision you are the kind of person who defects, but when we arrive at the decision, you magically cooperate.
• Inconsistency is already one mark against decision counterfactuals, but a more important argument is noticing that decision counterfactuals are ultimately justified as an approximation of raw counterfactuals. Decision counterfactuals are supposed to be about how the world could have been if you'd made a different decision, but how does it make sense to say the world could have been in some inconsistent state? Well, we could perform world surgery further back to change your past personality so that your decision is consistent, and then again further back so that it is consistent for you to have ended up with that personality, but that is a lot of extra work. In theoretical decision theory problems it usually doesn't make a difference, and in practical decision theory problems sufficient data is unavailable. So most of the time we just perform world surgery at the point of the decision and work under the assumption that performing surgery further back wouldn't make a difference.
• This approximation breaks down when performing world surgery to make a counterfactual decision consistent with the past requires us to change an element of the environment that is important for the specific problem. For example, in Newcomb's problem, changing your current decision requires changing your past self which involves changing a predictor which is considered part of the environment. In this case, it makes sense to fall back to raw counterfactuals and build a new decision theory on top of these.
• Functional decision theory, as normally characterised, is closer to this ideal as world surgery isn't just performed on your decision, but also on all decisions that subjunctively depend on you. This removes a bunch of inconsistencies, however, assuming that f(x)=b when f(x) really equals a is an inconsistency which needs to be ironed out in order to produce a general theory.
• It should be clear that the question: "What would the world be like if f(x)=b instead of a?" is underdefined as stated. There are two possible paths towards a more solid definition. The first is to find an f' such that f'(x)=b and also with certain unstated similarities to f and substitute this into all subjunctively linked processes. The second is to see what consequences can be logically deduced from f(x)=b without ever directly or indirectly invoking the fact that f(x)=a. The second is easier to do informally, but I suspect that the first approach would be more amenable to formalisation.
• Before we can address the question of logical counterfactuals, we probably need to understand what it means to make a decision. As I explained in a previous post, there's a sense in which you don't so much 'make' a decision as implement one. If you make something, it implies that it didn't exist before and now it does. In the case of decisions, it nudges you towards believing that the decision you were going to implement wasn't set, then you made a decision, and then it was. But given complete knowledge of yourself and the environment, there was only ever one decision you could implement. Which is why I said above that sometimes you'll have to erase information in order to produce a decision theory problem.
• When "you" is defined down to the atom, you can only implement one decision. In order to imagine counterfactual actions, we must also imagine a counterfactual version of "you" who decides slightly differently. And in order to imagine a consistent world, we must also imagine the predictor making a different prediction. So the prediction is linked to your decision, but this is a function of how counterfactuals are constructed. It is invalid to say that the prediction would be different if you "made" a different decision. Instead, it is more accurate to say that the prediction is different if you shift counterfactuals. So no backwards causation is required.
• Causal decision theorists construct these counterfactuals incorrectly and hence they believe that they can change their decision without changing the prediction. They fail to realise that they can't actually "change" their decision as there is a single decision that they will inevitably implement. We can't change our decision, all we can do is imagine counterfactuals and as we've seen above, that inevitably involves imagining changing you as well and possibly other processes that are subjunctively linked to you.
• If you have perfect self-knowledge, then constructing these counterfactuals will involve erasing some of it such that we end up with multiple possible versions of you. For example, we can examine Newcomb's Problem by imagining substituting in either an agent who always one-boxes or an agent who always two-boxes. We can then see that the agent who always one-boxes will do better. We could also consider agents who one-box in varying proportions. Once this information is known, no other information about the agent is relevant to the outcome.
• Things become more complicated if we want to erase less information about the agent. For example, we might want an agent to know that it is a utility maximiser as this might be relevant to evaluating the outcome the agent will receive from future decisions. Suppose that if you one-box in Newcomb's you'll then be offered a choice of $1000 or $2000, but if you two-box you'll be offered $0 or $10 million. We can't naively erase all information about your decision process in order to compute a counterfactual of whether you should one-box or two-box. Here the easiest solution is to "collapse" all of the decisions and ask about a policy instead that covers all three decisions that may be faced. Then we can calculate the expectations without producing any inconsistencies in the counterfactuals as it then becomes safe to erase the knowledge that it is a utility maximiser.
• The 5 and 10 problem isn't an issue with the forgetting approach. If you keep the belief that you are a utility maximiser, then the only choice you can implement is 10, so we don't have a decision problem. We can define all the possible strategies as follows: probability p of choosing 5 and probability 1-p of choosing 10, then forget everything about yourself except that you are one of these strategies. There's no downside, as there is no need for an agent to know whether or not it is a utility maximiser. So we can solve the 5 and 10 problem without epsilon exploration.
• When will we be able to use this forgetting technique? One initial thought might be the same scope that FDT is designed to be optimal on - problems where the reward only depends on your outputs or predicted outputs. Because only the output or predicted output matters and not the algorithm, these can be considered fair, unlike a problem where an alphabetical decision theorist (picks the first decision ordered in alphabetical order) is rewarded and every other type of agent is punished.
• However, some problems where this condition doesn't hold also seem fair. Like suppose there are long programs and short programs (in terms of running time). Further suppose programs can output either A or B. The reward is then determined purely based on these two factors. Now suppose that there exists a program that can calculate the utilities in each of these four worlds and then based upon this either terminate immediately or run for a long time and then output its choice of A or B. Assume that if it terminates immediately after the decision it'll qualify as a short program, while if it terminates after a long time it is a long program. Then, as a first approximation, we can say that this is also a fair problem since it is possible to win in all cases.
• It's actually slightly more complex than this. An AI usually doesn't have to win on only one problem. Adding code to handle more situations will extend the running time and may prevent such an AI from always being able to choose the dominant option. An AI might also want to do things like double check calculations or consider whether it is actually running on consistent outcomes, so winning the problem might put limitations on the AI in other ways.
• But nonetheless, so long as we can write such a program that picks the best option, we can call the problem "fair" in a limited sense. It's possible to extend our definition of "fair" further. Like suppose that it's impossible to analyse all the options and still return in a short amount of time. This isn't a problem if the short options are worth less utility than the long options.
• Comparing the method in this post to playing chicken with the universe: using raw counterfactuals clarifies that such proof-based methods are simply tricks that allow us to forget without forgetting.
• In general, I would suggest that logical counterfactuals are about knowing which information to erase such that you can produce perfectly consistent counterfactuals. Further, I would suggest that if you can't find information to erase that produces perfectly consistent counterfactuals then you don't have a decision theory problem. Future work could explore exactly when this is possible and general techniques for making this work as the current exploration has been mainly informal.

Notes:

• Epistemic Status: Exploratory
• I changed my terminology from Point Counterfactuals to Decision Counterfactuals and Timeless Counterfactuals to Raw Counterfactuals in this post as this better highlights where these models come from.

Discuss

### [Question] For what do we need Superintelligent AI?

January 25, 2019 - 18:01
Published on January 25, 2019 3:01 PM UTC

Most human problems could be solved by humans or by slightly-above-human AIs. Which practically useful tasks – which we could define now – require a very high level of superintelligence? I see two main such tasks:

1) Prevention of the creation of other potentially dangerous superintelligences.

I could imagine other tasks, like near-light-speed space travel, but they are neither urgent nor necessary.

For which other tasks do we need superintelligence?

Discuss

### "AlphaStar: Mastering the Real-Time Strategy Game StarCraft II", DeepMind [won 10 of 11 games against human pros]

January 24, 2019 - 23:49

### Thoughts on reward engineering

January 24, 2019 - 23:15
Published on January 24, 2019 8:15 PM UTC

Note: This is the first post from part five: possible approaches of the sequence on iterated amplification. The fifth section of the sequence breaks down some of these problems further and describes some possible approaches.

Suppose that I would like to train an RL agent to help me get what I want.

If my preferences could be represented by an easily-evaluated utility function, then I could just use my utility function as the agent’s reward function. But in the real world that’s not what human preferences look like.

So if we actually want to turn our preferences into a reward function suitable for training an RL agent, we have to do some work.

This post is about the straightforward parts of reward engineering. I’m going to deliberately ignore what seem to me to be the hardest parts of the problem. Getting the straightforward parts out of the way seems useful for talking more clearly about the hard parts (and you never know what questions may turn out to be surprisingly subtle).

The setting

To simplify things even further, for now I’ll focus on the special case where our agent is taking a single action a. All of the difficulties that arise in the single-shot case also arise in the sequential case, but the sequential case also has its own set of additional complications that deserve their own post.

Throughout the post I will imagine myself in the position of an “overseer” who is trying to specify a reward function R(a) for an agent. You can imagine the overseer as the user themselves, or (more realistically) as a team of engineers and/or researchers who are implementing a reward function intended to express the user’s preferences.

I’ll often talk about the overseer computing R(a) themselves. This is at odds with the usual situation in RL, where the overseer implements a very fast function for computing R(a) in general (“1 for a win, 0 for a draw, -1 for a loss”). Computing R(a) for a particular action a is strictly easier than producing a fast general implementation, so in some sense this is just another simplification. I talk about why it might not be a crazy simplification in section 6.

Contents
1. Long time horizons. How do we train RL agents when we care about the long-term effects of their actions?
2. Inconsistency and unreliability. How do we handle the fact that we have only imperfect access to our preferences, and different querying strategies are not guaranteed to yield consistent or unbiased answers?
3. Normative uncertainty. How do we train an agent to behave well in light of its uncertainty about our preferences?
4. Widely varying reward. How do we handle rewards that may vary over many orders of magnitude?
5. Sparse reward. What do we do when our preferences are very hard to satisfy, such that they don’t provide any training signal?
6. Complex reward. What do we do when evaluating our preferences is substantially more expensive than running the agent?
• Conclusion.
• Appendix: harder problems.
1. Long time horizons

A single decision may have very long-term effects. For example, even if I only care about maximizing human happiness, I may instrumentally want my agent to help advance basic science that will one day improve cancer treatment.

In principle this could fall out of an RL task with “human happiness” as the reward, so we might think that neglecting long-term effects is just a shortcoming of the single-shot problem. But even in theory there is no way that an RL agent can learn to handle arbitrarily long-term dependencies (imagine training an RL agent to handle 40 year time horizons), and so focusing on the sequential RL problem doesn’t address this issue.

I think that the only real approach is to choose a reward function that reflects the overseer’s expectations about long-term consequences — i.e., the overseer’s task involves both making predictions about what will happen, and value judgments about how good it will be. This makes the reward function more complex and in some sense limits the competence of the learner by the competence of the reward function, but it’s not clear what other options we have.
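In code, folding the overseer's predictions and value judgments into the reward might look like the following sketch (the outcome model, outcomes, and values here are hypothetical illustrations, not anything from the post):

```python
def expectation_based_reward(action, predict_outcomes, value_of):
    """Reward for a single action: the overseer's subjective expectation
    of long-term value. `predict_outcomes` maps an action to the
    overseer's predictive distribution over outcomes, and `value_of`
    is the overseer's value judgment for each outcome."""
    return sum(p * value_of(outcome)
               for outcome, p in predict_outcomes(action).items())

# Hypothetical overseer beliefs: funding basic science is predicted to
# improve long-term outcomes more often than doing nothing.
def predict_outcomes(action):
    if action == "fund_cancer_research":
        return {"better_treatment": 0.8, "no_change": 0.2}
    return {"better_treatment": 0.5, "no_change": 0.5}

def value_of(outcome):
    return {"better_treatment": 1.0, "no_change": -1.0}[outcome]
```

Note that this makes the reward only as good as the overseer's predictive model, which is exactly the sense in which the learner's competence is limited by the reward function's competence.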

Before computing the reward function R(a), we are free to execute the action a and observe its short-term consequences. Any data that could be used in our training process can just as well be provided as an input to the overseer, who can use the auxiliary input to help predict the long-term consequences of an action.

2. Inconsistency and unreliability

A human judge has no hope of making globally consistent judgments about which of two outcomes are preferred — the best we can hope for is for their judgments to be right in sufficiently obvious cases, and to be some kind of noisy proxy when things get complicated. Actually outputting a numerical reward — implementing some utility function for our preferences — is even more hopelessly difficult.

Another way of seeing the difficulty is to suppose that the overseer’s judgment is a noisy and potentially biased evaluation of the quality of the underlying action. If R(a) and R(a′) are both big numbers with a lot of noise, but the two actions are actually quite similar, then the difference will be dominated by noise. Imagine an overseer trying to estimate the impact of drinking a cup of coffee on Alice’s life by estimating her happiness in a year conditioned on drinking the coffee, estimating her happiness conditioned on not drinking the coffee, and then subtracting the two estimates.

We can partially address this difficulty by allowing the overseer to make comparisons instead of assessing absolute value. That is, rather than directly implementing a reward function, we can allow the overseer to implement an antisymmetric comparison function C(a, a′): which of two actions a and a′ is better in context? This function can take real values specifying how much one action is better than another, and should be antisymmetric.

In the noisy-judgments model, we are hoping that the noise or bias of a comparison C(a, a′) depends on the actual magnitude of the difference between the actions, rather than on the absolute quality of each action. This hopefully means that the total error/bias does not drown out the actual signal.

We can then define the decision problem as a zero-sum game: two agents propose different actions a and a′, and receive rewards C(a, a′) and C(a′, a). At the equilibrium of this game, we can at least rest assured that the agent doesn’t do anything that is unambiguously worse than another option it could think of. In general, this seems to give us sensible guarantees when the overseer’s preferences are not completely consistent.
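A minimal sketch of this zero-sum setup (the candidate actions, quality scores, and noise model are all made up for illustration): two proposers are repeatedly paired, the proposer of a receives C(a, a′) and the proposer of a′ receives C(a′, a) = -C(a, a′), and we read off the empirical winner.

```python
import random

def comparison(a, b, rng):
    """Hypothetical antisymmetric comparison C(a, b): positive iff a is
    better. Per the noisy-judgments model, noise scales with the true
    difference rather than the absolute quality of each action."""
    diff = a - b  # toy stand-in for the overseer's judgment
    return diff + rng.gauss(0, 0.1 * abs(diff) + 0.01)

def best_by_comparison_game(candidates, rounds, rng):
    """Play the zero-sum game: sample pairs, credit the proposer of `a`
    with C(a, b) and the proposer of `b` with -C(a, b) (antisymmetry),
    then return the candidate with the highest average reward."""
    totals = {a: 0.0 for a in candidates}
    counts = {a: 1 for a in candidates}  # start at 1 to avoid div by zero
    for _ in range(rounds):
        a, b = rng.sample(candidates, 2)
        c = comparison(a, b, rng)
        totals[a] += c
        totals[b] -= c
        counts[a] += 1
        counts[b] += 1
    return max(candidates, key=lambda a: totals[a] / counts[a])
```

At equilibrium the winner is never unambiguously worse than another candidate it was compared against, which is the guarantee the post describes.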

One subtlety is that in order to evaluate the comparison C(a, a′), we may want to observe the short-term consequences of taking action a or action a′. But in many environments it will only be possible to take one action. So after looking at both actions we will need to choose at most one to actually execute (e.g. we need to estimate how good drinking coffee was, after observing the short-term consequences of drinking coffee but without observing the short-term consequences of not drinking coffee). This will generally increase the variance of C, since we will need to use our best guess about the action which we didn’t actually execute. But of course this is a source of variance that RL algorithms already need to contend with.

3. Normative uncertainty

The agent is uncertain not only about its environment, but also about the overseer (and hence the reward function). We need to somehow specify how the agent should behave in light of this uncertainty. Structurally, this is identical to the philosophical problem of managing normative uncertainty.

One approach is to pick a fixed yardstick to measure with. For example, our yardstick could be “adding a dollar to the user’s bank account.” We can then measure C(a, a′) as a multiple of this yardstick: “how many dollars would we have to add to the user’s bank account to make them indifferent between taking action a and action a′?” If the user has diminishing returns to money, it would be a bit more precise to ask: “what chance of replacing a with a′ is worth adding a dollar to the user’s bank account?” The comparison C(a, a′) is then the inverse of this probability.

This is exactly analogous to the usual construction of a utility function. In the case of utility functions, our choice of yardstick is totally unimportant — different possible utility functions differ by a scalar, and so give rise to the same preferences. In the case of normative uncertainty that is no longer the case, because we are specifying how to aggregate the preferences of different possible versions of the overseer.

I think it’s important to be aware that different choices of yardstick result in different behavior. But hopefully this isn’t an important difference, and we can get sensible behavior for a wide range of possible choices of yardstick — if we find a situation where different yardsticks give very different behaviors, then we need to think carefully about how we are applying RL.

For many yardsticks it is possible to run into pathological situations. For example, suppose that the overseer might decide that dollars are worthless. They would then radically increase the value of all of the agent’s decisions, measured in dollars. So an agent deciding what to do would effectively care much more about worlds where the overseer decided that dollars are worthless.

So it seems best to choose a yardstick whose value is relatively stable across possible worlds. To this effect we could use a broader basket of goods, like 1 minute of the user’s time + 0.1% of the day’s income + etc. It may be best for the overseer to use common sense about how important a decision is relative to some kind of idealized influence in the world, rather than sticking to any precisely defined basket.

It is also desirable to use a yardstick which is simple, and preferably which minimizes the overseer’s uncertainty. Ideally by standardizing on a single yardstick throughout an entire project, we could end up with definitions that are very broad and robust, while being very well-understood by the overseer.

Note that if the same agent is being trained to work for many users, then this yardstick is also specifying how the agent will weigh the interests of different users — for example, whose accents will it prefer to spend modeling capacity on understanding? This is something to be mindful of in cases where it matters, and it can provide intuitions about how to handle the normative uncertainty case as well. I feel that economic reasoning is useful for arriving at sensible conclusions in these situations, but there are other reasonable perspectives.

4. Widely varying reward

Some tasks may have widely varying rewards — sometimes the user would only pay 1¢ to move the decision one way or the other, and sometimes they would pay $10,000.

If small-stakes and large-stakes decisions occur comparably frequently, then we can essentially ignore the small-stakes decisions. That will happen automatically with a traditional optimization algorithm — after we normalize the rewards so that the “big” rewards don’t totally destroy our model, the “small” rewards will be so small that they have no effect.

Things get more tricky when small-stakes decisions are much more common than the large-stakes decisions. For example, if the importance of decisions is power-law distributed with an exponent of 1, then decisions of all scales are in some sense equally important, and a good algorithm needs to do well on all of them. This may sound like a very special case, but I think it is actually quite natural for there to be several scales that are all comparably important in total.

In these cases, I think we should do importance sampling — we oversample the high-stakes decisions during training, and scale the rewards down by the same amount, so that the contribution to the total reward is correct. This ensures that the scale of rewards is basically the same across all episodes, and lets us apply a traditional optimization algorithm.
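A sketch of that importance-sampling scheme (the episode stakes and rewards below are invented): sample episodes in proportion to their stakes, then divide each reward by the factor by which it was oversampled, so its expected contribution to the training signal is unchanged.

```python
import random

def importance_sampled_batch(episodes, num_samples, rng):
    """`episodes` is a list of (stakes, reward) pairs. Oversample
    high-stakes episodes in proportion to their stakes, then scale each
    reward down by its oversampling factor relative to uniform sampling,
    keeping the expected training signal unbiased."""
    stakes_total = sum(s for s, _ in episodes)
    n = len(episodes)
    weights = [s for s, _ in episodes]
    batch = []
    for stakes, reward in rng.choices(episodes, weights=weights, k=num_samples):
        oversampling = (stakes / stakes_total) * n  # vs uniform prob 1/n
        batch.append(reward / oversampling)
    return batch
```

After this transformation the scale of the per-episode rewards is roughly uniform, so a traditional optimization algorithm can be applied directly.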

Further problems arise when there are some very high-stakes situations that occur very rarely. In some sense this just means the learning problem is actually very hard — we are going to have to learn from few samples. Treating different scales as the same problem (using importance sampling) may help if there is substantial transfer between different scales, but it can’t address the whole problem.

For very rare+high-stakes decisions it is especially likely that we will want to use simulations to avoid making any obvious mistakes or missing any obvious opportunities. Learning with catastrophes is an instantiation of this setting, where the high-stakes settings have only downside and no upside. I don’t think we really know how to cope with rare high-stakes decisions; there are likely to be some fundamental limits on how well we can do, but I expect we’ll be able to improve a lot over the current state of the art.

5. Sparse reward

In many problems, “almost all” possible actions are equally terrible. For example, if I want my agent to write an email, almost all possible strings are just going to be nonsense.

One approach to this problem is to adjust the reward function to make it easier to satisfy — to provide a “trail of breadcrumbs” leading to high reward behaviors. I think this basic idea is important, but that changing the reward function isn’t the right way to implement it (at least conceptually).

Instead we could treat the problem statement as given, but view auxiliary reward functions as a kind of “hint” that we might provide to help the algorithm figure out what to do. Early in the optimization we might mostly optimize this hint, but as optimization proceeds we should anneal towards the actual reward function.
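That annealing schedule could be as simple as a linear mix between the hint and the actual reward (a sketch; the linear schedule is my own illustrative choice):

```python
def annealed_reward(proxy_reward, true_reward, step, anneal_steps):
    """Early in training, mostly optimize the proxy 'hint'; by
    `anneal_steps`, only the actual reward function remains."""
    w = min(step / anneal_steps, 1.0)  # fraction of annealing completed
    return (1.0 - w) * proxy_reward + w * true_reward
```

The key point is that the problem statement never changes: the proxy only shapes the early optimization trajectory.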

Typical examples of proxy reward functions include “partial credit” for behaviors that look promising; artificially high discount rates and careful reward shaping; and adjusting rewards so that small victories have an effect on learning even though they don’t actually matter. All of these play a central role in practical RL.

A proxy reward function is just one of many possible hints. Providing demonstrations of successful behavior is another important kind of hint. Again, I don’t think that this should be taken as a change to the reward function, but rather as side information to help achieve high reward. In the long run, we will hopefully design learning algorithms that automatically learn how to use general auxiliary information.

6. Complex reward

A reward function that intends to capture all of our preferences may need to be very complicated. If a reward function is implicitly estimating the expected consequences of an action, then it needs to be even more complicated. And for powerful learners, I expect that reward functions will need to be learned rather than implemented directly.

It is tempting to substitute a simple proxy for a complicated real reward function. This may be important for getting the optimization to work, but it is problematic to change the definition of the problem.

Instead, I hope that it will be possible to provide these simple proxies as hints to the learner, and then to use semi-supervised RL to optimize the real hard-to-compute reward function. This may allow us to perform optimization even when the reward function is many times more expensive to evaluate than the agent itself; for example, it might allow a human overseer to compute the rewards for a fast RL agent on a case by case basis, rather than being forced to design a fast-to-compute proxy.

Even if we are willing to spend much longer computing the reward function than the agent itself, we still won’t be able to find a reward function that perfectly captures our preferences. But it may be just as good to choose a reward function that captures our preferences “for all that the agent can tell,” i.e. such that, conditioned on two outcomes receiving the same expected reward, the agent cannot predict which of them we would prefer. This seems much more realistic, once we are willing to have a reward function with much higher computational complexity than the agent.

Conclusion

In reinforcement learning we often take the reward function as given. In real life, we are only given our preferences — in an implicit, hard-to-access form — and need to engineer a reward function that will lead to good behavior. This presents a bunch of problems. In this post I discussed six problems which I think are relatively straightforward. (Straightforward from the reward-engineering perspective — the associated RL tasks may be very hard!)

Understanding these straightforward problems is important if we want to think clearly about very powerful RL agents. But I expect that most of our time will go into thinking about harder problems, for which we don’t yet have any workable approach. These harder problems may expose more fundamental limits of RL, that will require substantially new techniques to address.

Appendix: harder problems

Informed oversight

The process that produces a decision may encode important information about the quality of that decision, and if we throw out that information then a very bad decision may nevertheless receive a high reward. For example, if we want an agent to produce original art, we may not be able to evaluate a painting without understanding the process that produced it (and, in particular, what other artwork it was inspired by).

In some cases, recovering information about the process seems feasible. For example, if the agent is using attention or an external memory, the overseer can check what art the agent looked at. But in other cases it seems much more challenging: if the agent’s weights have memorized properties of particular artwork, we would need new techniques in order to make that information available to the overseer.

It is not clear whether there is any general way to give the overseer enough insight into the agent’s thought process to avoid problems like this; there may be some inherent limitations on delegation. Either way, it would be nice to understand the situation much better.

Capability amplification

By folding our preferences and long-term predictions into the reward function, we are in some sense limiting the agent to be no stronger than the overseer. If we start with a human, then RL will only ever produce human-level reasoning about long-term consequences or about “what is good.”

So we need some other mechanism for producing better long-term reasoning. I think that the most promising approach is to treat “reasoning well” as a short-term problem, for which humans can provide adequate oversight, and then to run that reasoning process for longer in order to produce more powerful reasoning.

Reward hacking / security

If our reward function has any problems, a powerful RL agent may tend to find actions a that exploit those problems in order to receive a high reward. In order to get good consequences, our reward function needs to be sufficiently secure that the agent can’t attack it in this way.

The first line of defense against an attack is to avoid immediately executing the agent’s action. Instead, the overseer can inspect that action and try to ensure that it isn’t intended as an attack. This leaves attacks which the overseer cannot recognize as attacks, or which do damage even when the overseer looks at them.

If the techniques from the previous sections actually allow the overseer to evaluate the agent’s actions, then they can probably also allow the overseer to detect attacks. Security during the evaluation itself is an additional question, though.

The main cause for hope is if the overseer can (1) be smarter than the agent which is trying to attack it, and (2) have access to some information about the agent’s thought process. Hopefully (2) allows the overseer to overcome the disadvantage of the “position of the interior” — if the agent picks a particular attack vector, the overseer can “watch them thinking” and then devote its energies to trying to detect or defend against that particular attack.

Discuss

### From Personal to Prison Gangs: Enforcing Prosocial Behavior

January 24, 2019 - 21:07
Published on January 24, 2019 6:07 PM UTC

This post originally appeared here; I've updated it slightly and posted it here as a follow-up to this post.

David Friedman has a fascinating book on alternative legal systems. One chapter focuses on prison law - not the nominal rules, but the rules enforced by prisoners themselves.

The unofficial legal system of California prisoners is particularly interesting because it underwent a phase change in the 1960’s.

Prior to the 1960’s, prisoners ran on a decentralized code of conduct - various unwritten rules roughly amounting to “mind your own business and don’t cheat anyone”. Prisoners who kept to the code were afforded some respect by their fellow inmates. Prisoners who violated the code were ostracized, making them fair game for the more predatory inmates. There was no formal enforcement; the code was essentially a reputation system.

In the 1960’s, that changed. During the code era, California’s total prison population was only about 5000, with about 1000 inmates in a typical prison. That’s quite a bit more than Dunbar’s number, but still low enough for a reputation system to work through second-order connections. By 1970, California’s prison population had ballooned past 25000; today it is over 170000. The number of prisons also grew, but not nearly as quickly as the population, and today’s prisoners frequently move across prisons anyway. In short, a decentralized reputation system became untenable. There were too many other inmates to keep track of.

As the reputation system collapsed, a new legal institution grew to fill the void: prison gangs. Under the gang system, each inmate is expected to affiliate with a gang (though most are not formal gang members). The gang will explain the rules, often in written form, and enforce them on their own affiliates. When conflict arises between affiliates of different gangs, the gang leaders negotiate settlement, with gang leaders enforcing punishments on their own affiliates. (Gang leaders are strongly motivated to avoid gang-level conflicts.) Rather than needing to track reputation of everyone individually, inmates need only pay attention to gangs at a group level.

Of course, inmates need some way to tell who is affiliated with each gang - thus the rise of racial segregation in prison. During the code era, prisoners tended to associate by race and culture, but there was no overt racial hostility and no hard rules against associating across race. But today’s prison gangs are highly racially segregated, making it easy to recognize the gang affiliation of individual inmates. They claim territory in prisons - showers or ball courts - and enforce their claims, resulting in hard racial segregation.

The change from a small, low-connection prison population to a large, high-connection population was the root cause. That change drove a transition from a decentralized, reputation-based system to prison gangs. This, in turn, involved two further transitions. First, a transition from decentralized, informal unwritten rules to formal written rules with centralized enforcement. Second, a transition from individual to group-level identity, in this case manifesting as racial segregation.

Generalization

This is hardly unique to prisons. The pattern is universal among human institutions. In small groups, everybody knows everybody. Rules are informal, identity is individual. But as groups grow:

• Rules become formal, written, and centrally enforced
• Identity becomes group-based.

Consider companies. I work at a ten-person company. Everyone in the office knows everyone else by name, and everyone has some idea of what everyone else is working on. We have nominal job titles, but everybody works on whatever needs doing. Our performance review process is to occasionally raise the topic in weekly one-on-one meetings.

Go to a thousand or ten thousand person company, and job titles play a much stronger role in who does what. People don’t know everyone, so they identify others by department or role. They understand what a developer or a manager does, rather than understanding what John or Allan does. Identity becomes group-based. At the same time, hierarchy and bureaucracy are formalized.

The key parameter here is number of interactions between each pair of people (you should click that link, it's really cool). In small groups, each pair of people has many interactions, so people get to know each other. In large groups, there are many one-off interactions between strangers. Without past interactions to fall back on, people need other ways to figure out how to interact with each other. One solution is formal rules, which give guidance on interactions with anyone. Another solution is group-based identity - if I know how to interact with lawyers at work in general, then I don’t need to know each individual lawyer.
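The scaling claim can be made concrete with a toy model (my own back-of-the-envelope illustration, not from the post): if each person spreads a fixed budget of interactions uniformly over the rest of the group, repeat interactions per pair fall off roughly as one over group size.

```python
def interactions_per_pair(interactions_per_person, group_size):
    """Toy model: each person's fixed interaction budget is spread
    uniformly over the other group_size - 1 members, so a given pair
    meets interactions_per_person / (group_size - 1) times."""
    return interactions_per_person / (group_size - 1)

# In a 10-person office, every pair interacts constantly; in a
# 10,000-person company, most pairs effectively never interact.
```

Once the per-pair count drops below the handful of interactions needed to build a working model of someone, reputation stops paying its way and formal rules and group identities take over.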

In this regard, prisons and companies are just microcosms of society in general.

Society

At some point over the past couple hundred years, society underwent a transition similar to that of the California prison system.

In 1800, people were mostly farmers, living in small towns. The local population was within an order of magnitude of Dunbar’s number, and generally small enough to rely on reputation for day-to-day dealings.

Today, that is not the case [citation needed].

Just as in prisons and companies, we should expect this change to drive two kinds of transitions:

• A transition from informal, decentralized rules to formal, written, centrally-enforced rules.
• A transition from individual to group-level identity.

This can explain an awful lot of the ways in which society has changed over the past couple hundred years, as well as how specific social institutions evolve over time.

To take just a few examples…

• Regulation. As people have more one-off interactions, reputation becomes less tenable, and we should expect formal regulation to grow. Conversely, regulations are routinely ignored among people who know each other.
• Litigation. Again, with more one-off interactions, we should expect people to rely more on formal litigation and less on informal settlement. Conversely, people who interact frequently rarely sue each other - and when they do, it’s expected to mess up the relationship.
• Professional licensing. Without reputation, people need some way to signal that they are safe to hire. We should expect licensing to increase as pairwise interactions decrease.
• Credentialism. This is just a generalization of licensing. As reputation fails, we should expect people to rely more heavily on formal credentials - “you are your degree” and so forth.
• Stereotyping. Without past interactions with a particular person, we should expect people to generalize based on superficially “similar” people. This could be anything from the usual culprits (race, ethnicity, age) to job roles (actuaries, lawyers) to consumption signals (iphone, converse, fancy suit).
• Tribalism. From nationalism to sports fans to identity politics, an increasing prevalence of group-level identity means an increasing prevalence of tribal behavior. In particular, I'd expect that social media outlets with more one-off or low-count interactions are characterized by more extreme tribalism.
• Standards for impersonal interactions. “Professionalism” at work is a good example.

I’ve focused mostly on negative examples here, but it’s not all bad - even some of these examples have upsides. When California’s prisons moved from an informal code to prison gangs, the homicide rate dropped like a rock; the gangs hate prison lockdowns, so they go to great lengths to prevent homicides. Of course, gangs have lots of downsides too. The point which generalizes is this: bodies with centralized power have their own incentives, and outcomes will be “good” to exactly the extent that the incentives of the centralized power align with everybody else’s incentives and desires.

Consider credentialism, for example. It’s not all bad - to the extent that we now hire based on degree rather than nepotism, it’s probably a step up. But on the other hand, colleges themselves have less than ideal incentives. Even setting aside colleges’ incentives, the whole credential system shoehorns people into one-size-fits-all solutions; a brilliant patent clerk would have a much more difficult time making a name in physics today than a hundred years ago.

Takeaway

Of course, all of these examples share one critical positive feature: they scale. That’s the whole reason things changed in the first place - we needed systems which could scale up beyond personal relationships and reputation.

This brings us to the takeaway: what should you do if you want to change these things? Perhaps you want a society with less credentialism, regulation, stereotyping, tribalism, etc. Maybe you like some of these things but not others. Regardless, surely there’s something somewhere on that list you’re less than happy about.

The first takeaway is that these are not primarily political issues. The changes were driven by technology and economics, which created a broader social graph with fewer repeated interactions. Political action is unlikely to reverse any of these changes; the equilibrium has shifted, and any policy change would be fighting gravity. Even if employers were outlawed from making hiring decisions based on college degree, they’d find some work-around which amounted to the same thing. Even if the entire federal register disappeared overnight, de-facto industry regulatory bodies would pop up. And so forth.

So if we want to e.g. reduce regulation, we should first focus on the underlying socioeconomic problem: fewer interactions. A world of Amazon and Walmart, where every consumer faces decisions between a million different products, is inevitably a world where consumers do not know producers very well. There’s just too many products and companies to keep track of the reputation of each. To reduce regulation, first focus on solving that problem, scalably. Think amazon reviews - it’s an imperfect system, but it’s far more flexible and efficient than formal regulation, and it scales.

Now for the real problem: online reviews are literally the only example I could come up with where technology offers a way to scale-up reputation-based systems, and maybe someday roll back centralized control structures or group identities. How can we solve these sorts of problems more generally? Please comment if you have ideas.

Discuss

### Is Agent Simulates Predictor a "fair" problem?

January 24, 2019 - 16:18
Published on January 24, 2019 1:18 PM UTC

It's a simple question, but I think it might help if I add in some context. In the paper introducing Functional Decision Theory, it is noted that it is impossible to design an algorithm that performs well on all decision problems, since some of them can be specified to be blatantly unfair, i.e. punish every agent that isn't an alphabetical decision theorist.

The question then arises, how do we define which problems are or are not fair? We start by noting that some people consider Newcomb's-like problems to be unfair since your outcome depends on a predictor's prediction, which is rooted in an analysis of your algorithm. So what makes this case any different from only rewarding the alphabetical decision theorist?

The paper answers that the prediction only depends on the decision you end up making, and that any other internal details are ignored. Since the predictor only cares about your decision and not how you come to it, the problem seems fair. I'm inclined to agree with this reasoning, but a similar line of reasoning doesn't seem to hold with Agent Simulates Predictor. Here the algorithm you use is relevant, as the predictor can only predict the agent if its algorithm is below a certain level of complexity; otherwise it may make a mistake.

Please note that this question isn't about whether this problem is worth considering; life is often unfair and we have to deal with it the best that we can. The question is about whether the problem is "fair", where I roughly understand "fair" as meaning that it belongs to a certain class of problems (which I can't specify at this moment; I suspect that would require its own separate post) where we should be able to achieve the optimal result in each problem.

Discuss

### In what way has the generation after us "gone too far"?

January 24, 2019 - 13:22
Published on January 24, 2019 10:22 AM UTC

<no text needed, open to interpretation>

Discuss

### The human side of interaction

January 24, 2019 - 13:14
Published on January 24, 2019 10:14 AM UTC

The last few posts have motivated an analysis of the human-AI system rather than an AI system in isolation. So far we’ve looked at the notion that the AI system should get feedback from the user and that it could use reward uncertainty for corrigibility. These are focused on the AI system, but what about the human? If we build a system that explicitly solicits feedback from the human, what do we have to say about the human policy, and how the human should provide feedback?

Interpreting human actions

One major free variable in any explicit interaction or feedback mechanism is what semantics the AI system should attach to the human feedback. The classic examples of AI risk are usually described in a way where this is the problem: when we provide a reward function that rewards paperclips, the AI system interprets it literally and maximizes paperclips, rather than interpreting it pragmatically as another human would.

(Aside: I suspect this was not the original point of the paperclip maximizer, but it has become a very popular retelling, so I’m using it anyway.)

Modeling this classic example as a human-AI system, we can see that the problem is that the human is offering a form of “feedback”, the reward function, and the AI system is not ascribing the correct semantics to it. The way it uses the reward function implies that the reward function encodes the optimal behavior of the AI system in all possible environments -- a moment’s thought is sufficient to see that this is not actually the case. There will definitely be many cases and environments that the human did not consider when designing the reward function, and we should not expect that the reward function incentivizes the right behavior in those cases.

So what can the AI system assume if the human provides it a reward function? Inverse Reward Design (IRD) offers one answer: the human is likely to provide a particular reward function if it leads to high true utility behavior in the training environment. So, in the boat race example, if we are given the reward “maximize score” on a training environment where this actually leads to winning the race, then “maximize score” and “win the race” are about equally likely reward functions, since they would both lead to the same behavior in the training environment. Once the AI system is deployed on the environment in the blog post, it would notice that the two likely reward functions incentivize very different behavior. At that point, it could get more feedback from humans, or it could do something that is good according to both reward functions. The paper takes the latter approach, using risk-averse planning to optimize the worst-case behavior.
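The IRD reasoning above can be sketched in a few lines. This is a toy illustration, not the paper's actual algorithm: the two candidate reward functions, the trajectories, and all numbers are made up to mirror the boat race story, and "risk-averse planning" is reduced to a worst-case choice over the candidate rewards.

```python
import math

# Two hypothetical candidate "true" reward functions over trajectories.
candidate_rewards = {
    "maximize_score": lambda traj: traj["score"],
    "win_the_race":   lambda traj: 10.0 if traj["finished"] else 0.0,
}

# In the training environment, maximizing score also wins the race,
# so both rewards rank the trajectories identically.
train_trajs = [
    {"score": 10.0, "finished": True},   # proxy-optimal behavior
    {"score": 0.0,  "finished": False},
]

# IRD-style likelihood: a candidate reward is likely to be the true one
# if the behavior induced by the given proxy achieves high value under
# that candidate in the training environment.
proxy = candidate_rewards["maximize_score"]
proxy_optimal = max(train_trajs, key=proxy)
beta = 1.0
weights = {name: math.exp(beta * r(proxy_optimal))
           for name, r in candidate_rewards.items()}
z = sum(weights.values())
posterior = {name: w / z for name, w in weights.items()}
# Both candidates explain the proxy equally well here, so each gets 0.5.

# Deployment environment: score can now be farmed without finishing.
deploy_trajs = [
    {"score": 50.0, "finished": False},  # reward hacking
    {"score": 5.0,  "finished": True},   # actually wins
]

# Risk-averse choice: maximize the worst case over the likely rewards.
def worst_case(traj):
    return min(r(traj) for r in candidate_rewards.values())

safe_choice = max(deploy_trajs, key=worst_case)
# The score-farming trajectory scores 0 under "win_the_race", so the
# risk-averse agent prefers the trajectory that does well under both.
```

In this toy, the ambiguity only becomes visible (and decision-relevant) at deployment, which is exactly the point of the boat race example.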

Similarly, with inverse reinforcement learning (IRL), or learning from preferences, we need to make some sort of assumption about the semantics of the human demonstrations or preferences. A typical assumption is Boltzmann rationality: the human is assumed to take better actions with higher probability. This effectively models all human biases and suboptimalities as noise. There are papers that account for biases rather than modeling them as noise. A major argument against the feasibility of ambitious value learning is that any assumption we make will be misspecified, and so we cannot infer the “one true utility function”. However, it seems plausible that we could have an assumption that would allow us to learn some values (at least to the level that humans are able to).
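The Boltzmann rationality assumption can be made concrete with a tiny Bayesian reward inference sketch. Everything here is illustrative (the two candidate rewards, the action set, and the rationality parameter are invented); the point is just that a "mistaken" demonstration gets absorbed as noise rather than ruling a reward out.

```python
import math

actions = ["a", "b", "c"]

# Two hypothetical candidate reward functions over single actions.
rewards = {
    "likes_a": {"a": 1.0, "b": 0.0, "c": 0.0},
    "likes_b": {"a": 0.0, "b": 1.0, "c": 0.0},
}

beta = 2.0  # rationality parameter: infinity = perfectly optimal, 0 = random

def boltzmann_likelihood(action, reward):
    # Boltzmann rationality: P(action | reward) ∝ exp(beta * reward(action)),
    # i.e. better actions are taken with higher probability, not certainty.
    weights = {x: math.exp(beta * reward[x]) for x in actions}
    return weights[action] / sum(weights.values())

def posterior(demo):
    prior = {name: 1.0 / len(rewards) for name in rewards}
    unnorm = {name: prior[name] * math.prod(boltzmann_likelihood(a, r)
                                            for a in demo)
              for name, r in rewards.items()}
    z = sum(unnorm.values())
    return {name: p / z for name, p in unnorm.items()}

# A demonstration that mostly chooses "a", with one suboptimal "b".
# The single "b" lowers the likelihood under "likes_a" but is treated
# as noise instead of refuting it outright.
post = posterior(["a", "a", "b", "a"])
```

With a hard-optimality assumption instead, the single "b" would assign probability zero to "likes_a", which is why modeling suboptimality (even crudely, as noise) matters.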

The human policy

Another important aspect is how the human actually computes feedback. We could imagine training human overseers to provide feedback in the manner that the AI system expects. Currently we “train” AI researchers to provide reward functions that incentivize the right behavior in the AI systems. With IRD, we only need the human to extensively test their reward function in the training environment and make sure the resulting behavior is near optimal, without worrying too much about generalization to other environments. With IRL, the human needs to provide demonstrations that are optimal. And so on.

(Aside: This is very reminiscent of human-computer interaction, and indeed I think a useful frame is to view this as the problem of giving humans better, easier-to-use tools to control the behavior of the AI system. We started with direct programming, then improved upon that to reward functions, and are now trying to improve to comparisons, rankings, and demonstrations.)

We might also want to train humans to give more careful answers than they would have otherwise. For example, it seems really good if our AI systems learn to preserve option value in the face of uncertainty. We might want our overseers to think deeply about potential consequences, be risk-averse in their decision-making, and preserve option value with their choices, so that the AI system learns to do the same. (The details depend strongly on the particular narrow value learning algorithm -- the best human policy for IRL will be very different from the best human policy for CIRL.) We might hope that this requirement only lasts for a short amount of time, after which our AI systems have learnt the relevant concepts sufficiently well that we can be a bit more lax in our feedback.

Learning human reasoning

So far I’ve been analyzing AI systems where the feedback is given explicitly, and there is a dedicated algorithm for handling the feedback. Does the analysis also apply to systems which get feedback implicitly, like iterated amplification and debate?

Well, certainly these methods will need to get feedback somehow, but they may not face the problem of ascribing semantics to the feedback, since they may have learned the semantics implicitly. For example, a sufficiently powerful imitation learning algorithm will be able to do narrow value learning simply because humans are capable of narrow value learning, even though it has no explicit assumption of semantics of the feedback. Instead, it has internalized the semantics that we humans give to other humans’ speech.

Similarly, both iterated amplification and debate inherit the semantics from humans by learning how humans reason. So they do not have the problems listed above. Nevertheless, it probably still is valuable to train humans to be good overseers for other reasons. For example, in debate, the human judges are supposed to say which AI system provided the most true and useful information. It is crucial that the humans judge by this criterion, in order to provide the right incentives for the AI systems in the debate.

Summary

If we reify the interaction between the human and the AI system, then the AI system must make some assumption about the meaning of the human’s feedback. The human should also make sure to provide feedback that will be interpreted correctly by the AI system.

The next AI Alignment Forum sequences post will be 'Thoughts on Reward Engineering' by Paul Christiano in the sequence on Iterated Amplification.

The next post in this sequence will be 'Future directions in narrow value learning' by Rohin Shah, on Friday 25th Jan.

Discuss

### Allowing a formal proof system to self-improve while avoiding Löbian obstacles

January 24, 2019 - 02:04
Published on January 23, 2019 11:04 PM UTC

[Epistemic status: I can't spot a mistake, but I am not confident that there isn't one; if you find anything, please say so. Posting largely because the community value of a new good idea is larger than any harm that might be caused by a flawed proof.]

Suppose you have an automatic proof checker. It's connected to a source of purported proofs, some correct and some not. The proof checker wants to reject all false proofs and accept as many of the true proofs as possible. It also wants to be able to update its own proof framework.

Let S be the set of statements, B the set of candidate proof certificates, and L = {S×B→{⊤,⊥}} the set of checker programs (programs that map a statement and a certificate to true or false),

and V⊂L be the set of programs that have the property that

∀v∈V:∀s∈S:(∃b∈B:v(s,b))⟹s

In other words, V is the set of all programs that never prove false statements. We should never leave V or need to talk about any program not in it.

For v∈V, write v[s] to mean ∃b∈B:v(s,b), i.e. v proves s.

A simple setup would consist of a starting program p1∈V and a rule that says:

If p1[p2[s]⟹s], then you can add p2 to your list of trusted provers. That is, if p1 proves the soundness of p2, then you can accept any statement when given a proof of it in p2.

The Löbian obstacle is that p2 must be strictly weaker than p1: p1 can prove any statement that p2 can, and p1 can prove the soundness of p2, but p2 can't prove its own soundness. This means that each trusted prover has to be strictly weaker than the one that generated it. You could start with PA+3^^^3 and say that a few weakenings aren't a problem, but that isn't an elegant solution.

Note: You can't get around this problem using cycles. Suppose

a[b[s]⟹s]
b[c[s]⟹s]

This would imply

a[b[c[s]⟹s]⟹(c[s]⟹s)]
a[b[c[s]⟹s]]
a[c[s]⟹s]

So any cycle could be shrunk by 1, and inductively, shrunk to a self trusting system.

I propose that you instead use the following rule:

If p1[p2[s]⟹p1[s]], then accept any future proofs that are accepted by p2, and give p2 all rights given to p1, including taking p2[p3[s]⟹p2[s]] to mean that p3 is accepted, and so on recursively. If p1 is sound, then p1[p2[s]⟹p1[s]]⟹(p2[s]⟹p1[s]), so anything that is proven by p2 can be proven by p1. If p1 isn't sound, it can prove anything anyway.
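The bookkeeping behind this rule can be sketched in code. This is only a plumbing toy, not a real proof system: statements are opaque strings, the "certificates" are trivial stand-ins, and the meta-statement "∀s: p2[s]⟹p1[s]" is represented as a plain string rather than formalized. All names are invented for illustration.

```python
# Provers are functions (statement, certificate) -> bool.

class Checker:
    def __init__(self, base_prover):
        # Start with a single trusted prover p1.
        self.trusted = [base_prover]

    def accepts(self, statement, certificate):
        # A statement is accepted if any trusted prover validates it.
        return any(p(statement, certificate) for p in self.trusted)

    def register(self, new_prover, meta_statement, meta_certificate):
        # The key rule: trust new_prover only if some already-trusted
        # prover certifies "new_prover[s] ⟹ trusted[s]" (relative
        # soundness, not absolute soundness -- this is what dodges Löb).
        if self.accepts(meta_statement, meta_certificate):
            self.trusted.append(new_prover)
            return True
        return False

def p1(s, b):
    # Toy stand-in verifier: the "certificate" is the statement itself.
    return b == s

def p2(s, b):
    # Syntactic sugar on top of p1: also accepts a compressed certificate.
    # Anything p2 accepts, p1 also accepts (with certificate b = s).
    return b == s or b == s.replace(" ", "")

chk = Checker(p1)
meta = "forall s: p2[s] implies p1[s]"  # would be a formal statement for real
registered = chk.register(p2, meta, meta)  # p1 "checks" the meta proof
ok = chk.accepts("two plus two is four", "twoplustwoisfour")  # via p2
```

The structural point the toy captures is that `register` never asks any prover to certify its own soundness, only an implication back to an existing trusted prover.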

Note that if p2 is a straightforward isomorphism of p1 then they are easily proven equivalent. However, if p2 says "run Turing machine T for length(b) steps, if it doesn't halt, check if b is a valid proof of s, if T does halt, return ⊤" then it could be equivalent to p1 from a second order logic perspective, but p1 can't prove that T never halts.

Still, this rule allows us to prove anything provable in p1 and only things provable in p1, while also allowing the user to add shorthands and syntactic sugar.

Note that the "proof" b could just be the number of processor cycles you want to run a proof search for before giving up. In fact, this framework lets you swap and mix between hand proved results, and automatically proved results (with search time cut off) as you see fit.

This formalism allows any system containing a proof checker to automatically upgrade itself to a version that is in the same Gödelian equivalence class but is better suited to the available hardware.

Discuss