# LessWrong.com News

A community blog devoted to refining the art of rationality

### The "Commitment Races" problem

August 23, 2019 - 04:58
Published on August 23, 2019 1:58 AM UTC

[Epistemic status: Strong claims vaguely stated and weakly held. I expect that writing this and digesting feedback on it will lead to a much better version in the future.]

This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016. In short, here is the problem:

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible. When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in (building AGI and letting it self-modify) may be one of these times unless we think carefully about this problem and how to avoid it.

For this post I use "consequentialists" to mean agents that choose actions entirely on the basis of the expected consequences of those actions. For my purposes, this means they don't care about historical facts such as whether the options and consequences available now are the result of malicious past behavior. (I am trying to avoid trivial definitions of consequentialism according to which everyone is a consequentialist because e.g. "obeying the moral law" is a consequence.) This definition is somewhat fuzzy and I look forward to searching for more precision some other day.

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible

Consequentialists are bullies; a consequentialist will happily threaten someone insofar as they think the victim might capitulate and won't retaliate.

Consequentialists are also cowards; they conform their behavior to the incentives set up by others, regardless of the history of those incentives. For example, they predictably give in to credible threats unless reputational effects weigh heavily enough in their minds to prevent this.

In most ordinary circumstances the stakes are sufficiently low that reputational effects dominate: Even a consequentialist agent won't give up their lunch money to a schoolyard bully if they think it will only invite more bullying later. But in some cases the stakes are high enough, or the reputational effects low enough, for this not to matter.

So, amongst consequentialists, there is sometimes a huge advantage to "winning the commitment race." If two consequentialists are playing a game of Chicken, the first one to throw out their steering wheel wins. If one consequentialist is in position to seriously hurt another, it can extract concessions from the second by credibly threatening to do so--unless the would-be victim credibly commits to not give in first. If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can "move first" can get much more than the one that "moves second." In general, because consequentialists are cowards and bullies, the consequentialist who makes commitments first will predictably be able to massively control the behavior of the consequentialist who makes commitments later. As the folk theorem shows, this can even be true in cases where games are iterated and reputational effects are significant.
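The Chicken example can be sketched in a few lines of Python (the payoff numbers here are my own illustrative assumptions, not anything canonical):

```python
# Toy payoff matrix for Chicken; the numbers are illustrative assumptions.
# Keys are (player1_action, player2_action); values are (p1_payoff, p2_payoff).
PAYOFFS = {
    ("swerve", "swerve"): (0, 0),
    ("straight", "swerve"): (1, -1),
    ("swerve", "straight"): (-1, 1),
    ("straight", "straight"): (-10, -10),
}

def best_response_p2(p1_action):
    """A consequentialist player 2 maximizes its own payoff given
    player 1's now-fixed action."""
    return max(("swerve", "straight"), key=lambda a: PAYOFFS[(p1_action, a)][1])

# Player 1 wins the commitment race: it visibly throws out its steering wheel.
committed = "straight"
response = best_response_p2(committed)
print(response, PAYOFFS[(committed, response)])
```

Because player 2 conditions only on consequences, the commitment alone dictates the outcome: the committer drives straight and collects the favorable payoff.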

Note: "first" and "later" in the above don't refer to clock time, though clock time is a helpful metaphor for imagining what is going on. Really, what's going on is that agents learn about each other, each on their own subjective timeline, while also making choices (including the choice to commit to things) and the choices a consequentialist makes at time t are cravenly submissive to the commitments they've learned about by t.

Logical updatelessness and acausal bargaining combine to create a particularly important example of a dangerous commitment race. There are strong incentives for consequentialist agents to self-modify to become updateless as soon as possible, and going updateless is like making a bunch of commitments all at once. Since real agents can't be logically omniscient, one needs to decide how much time to spend thinking about things like game theory and what the outputs of various programs are before making commitments. When we add acausal bargaining into the mix, things get even more intense. Scott Garrabrant, Wei Dai, and Abram Demski have described this problem already, so I won't say more about that here. Basically, in this context, there are many other people observing your thoughts and making decisions on that basis. So bluffing is impossible and there is constant pressure to make commitments quickly before thinking longer. (That's my take on it anyway)

Anecdote: Playing a board game last week, my friend Lukas said (paraphrase) "I commit to making you lose if you do that move." In rationalist gaming circles this sort of thing is normal and fun. But I suspect his gambit would be considered unsportsmanlike--and possibly outright bullying--by most people around the world, and my compliance would be considered cowardly. (To be clear, I didn't comply. Practice what you preach!)

When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in may be one of these times.

This situation is already ridiculous: There is something very silly about two supposedly rational agents racing to limit their own options before the other fellow limits his. But it gets worse.

Sometimes commitments can be made "at the same time"--i.e. in ignorance of each other--in such a way that they lock in an outcome that is disastrous for everyone. (Think both players in Chicken throwing out their steering wheels simultaneously.)

Here is a somewhat concrete example: Two consequentialist AGI think for a little while about game theory and commitment races and then self-modify to resist and heavily punish anyone who bullies them. Alas, they had slightly different ideas about what counts as bullying and what counts as a reasonable request--perhaps one thinks that demanding more than the Nash Bargaining Solution is bullying, and the other thinks that demanding more than the Kalai-Smorodinsky Bargaining Solution is bullying--so many years later they meet each other, learn about each other, and end up locked into all-out war.

I'm not saying disastrous AGI commitments are the default outcome; I'm saying the stakes are high enough that we should put a lot more thought into preventing them than we have so far. It would really suck if we create a value-aligned AGI that ends up getting into all sorts of fights across the multiverse with other value systems. We'd wish we built a paperclip maximizer instead.

Objection: "Surely they wouldn't be so stupid as to make those commitments--even I could see that outcome coming. A better commitment would be..."

Reply: The problem is that consequentialist agents are motivated to make commitments as soon as possible, since that way they can influence the behavior of other consequentialist agents who may be learning about them. Of course, they will balance these motivations against the countervailing motive to learn more and think more before doing drastic things. The problem is that the first motivation will push them to make commitments much sooner than would otherwise be optimal. So they might not be as smart as us when they make their commitments, at least not in all the relevant ways. Even if our baby AGIs are wiser than us, they might still make mistakes that we haven't anticipated yet. The situation is like the centipede game: Collectively, consequentialist agents benefit from learning more about the world and each other before committing to things. But because they are all bullies and cowards, they individually benefit from committing earlier, when they don't know so much.
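The timing trade-off in this reply can be rendered as a toy simulation (every payoff and probability here is an assumption made up for illustration): committing earlier than the other agent wins concessions, but commitments made before learning enough are more likely to be incompatible.

```python
import random

random.seed(0)

# Toy commitment-race model (all numbers are illustrative assumptions):
# each agent picks a commitment time t in [0, 1]. A commitment made at
# time t turns out incompatible with the other's with probability (1 - t),
# since less has been learned by then.
def expected_payoff(t_me, t_other, trials=20_000):
    total = 0.0
    for _ in range(trials):
        clash = random.random() < (1 - min(t_me, t_other))
        if clash:
            total += -10           # locked-in conflict hurts both
        elif t_me < t_other:
            total += 3             # earlier committer extracts concessions
        else:
            total += 1             # later committer capitulates
    return total / trials

patient = expected_payoff(1.0, 1.0)   # both wait and learn fully
defect = expected_payoff(0.9, 1.0)    # committing a bit sooner looks tempting
race = expected_payoff(0.1, 0.1)      # both reason that way and race
print(patient, defect, race)
```

Unilaterally committing a bit earlier beats waiting, so each agent is pulled toward earlier commitments; but when both race to commit early, the expected outcome is far worse than if both had waited, which is the centipede-like structure described above.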

Objection: "Threats, submission to threats, and costly fights are rather rare in human society today. Why not expect this to hold in the future, for AGI, as well?"

1. Devastating "Grim trigger" commitments are much more possible with AGI--just alter the code! Inigo Montoya is a fictional character and even he wasn't able to summon lifelong commitment on a whim; it had to be triggered by the brutal murder of his father.

2. Credibility is much easier also, especially in an acausal context (see above.)

3. Some AGI bullies may be harder to retaliate against than humans, lowering their disincentive to make threats.

4. AGI may not have sufficiently strong reputation effects in the sense relevant to consequentialists, partly because threats can be made more devastating (see above) and partly because they may not believe they exist in a population of other powerful agents who will bully them if they show weakness.

5. Finally, these terrible things do happen to some extent even among humans today--especially in situations of anarchy. Hopefully we can do better.

Objection: "Any AGI that falls for this commit-now-before-the-others-do argument will also fall for many other silly do-X-now-before-it's-too-late arguments, and thus will be incapable of hurting anyone."

Reply: That would be nice, wouldn't it? Let's hope so, but not count on it. Indeed perhaps we should look into whether there are other arguments of this form that we should worry about our AI falling for...

Anecdote: A friend of mine, when she was a toddler, would threaten her parents: "I'll hold my breath until you give me the candy!" Imagine how badly things would have gone if she was physically capable of making arbitrary credible commitments. Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn't actually commit to anything then.

Conclusion

Overall, I'm not certain that this is a big problem. But it feels to me that it might be, especially if acausal trade turns out to be a real thing. I would not be surprised if "solving bargaining" turns out to be even more important than value alignment, because the stakes are so high. I look forward to a better understanding of this problem.

Many thanks to Abram Demski, Wei Dai, John Wentworth, and Romeo Stevens for helpful conversations.


### Analysis of a Secret Hitler Scenario

August 23, 2019 - 04:37
Published on August 23, 2019 1:24 AM UTC

Secret Hitler is a social deception game in the tradition of Mafia, The Resistance, and Avalon [1]. You can read the rules here if you aren't familiar. I haven't played social deception games regularly since 2016, but in my mind it's a really good game that represented the state of the art in the genre at that time. I'm going to discuss an interesting situation in which I reasoned poorly.

I was a Liberal in a ten-player game. The initial table setup, displaying the relevant players, was:

We passed a fascist article in the first round. The next government had Marek as President and, as Chancellor, a player 3 to the right of Marek [2]. Marek passed a fascist article, inspected Sam, and declared Sam was Fascist, to Sam's counterclaim that Marek was Fascist. Sam was to the left of Marek, so we had no data about him. The table generally supported Sam, but I leaned towards believing Marek.

At the beginning of the game my view of possibilities looked roughly like:

But of course I'm already approximating. From my perspective an individual is only 4/9 likely to be Fascist, and the events that two individuals are Fascist are not independent. A more careful calculation would have been:

• There are $\binom{9}{4}$ possible distributions of Fascists, in $\binom{7}{4}$ of which both Marek and Sam are good, for a probability of 5/18.
• There are $\binom{7}{3}$ ways for Marek to be Fascist and Sam to be Liberal. Of course $\binom{7}{4}=\binom{7}{3}$, so we get 5/18 again.
• The remaining event, that both are evil, must then have probability 3/18.
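The bullets above can be checked mechanically (counting over the 9 other seats, 4 of which are Fascist):

```python
from math import comb

# 10-player game, I'm Liberal: 4 of the other 9 players are Fascist.
total = comb(9, 4)                 # ways to place the 4 Fascists

both_good = comb(7, 4) / total     # all Fascists among the other 7 players
marek_bad = comb(7, 3) / total     # Marek Fascist, Sam Liberal
sam_bad = comb(7, 3) / total       # Sam Fascist, Marek Liberal
both_bad = comb(7, 2) / total      # both Fascist

print(both_good, marek_bad, both_bad)  # 5/18, 5/18, 3/18
```

The four cases sum to 1, confirming the 5/18, 5/18, 5/18, 3/18 split.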

So a better perspective would have been:

I don't think I could do this math consistently in game, though, so I'll do the rest of the analysis with my original priors. I've included it here for reference in your future 10-player games of Secret Hitler.

When Marek declared Sam was Fascist, the only scenario confidently eliminated is that both are Liberal: any reasonable Liberal player is truth-promoting and has no reason to lie. At first glance it seems that the possibility that both are Fascist is also eliminated, as the Fascists should have no reason to fight each other. This doesn't strictly hold, though. The Fascists only need one Fascist in a government to likely sink it, and if they reason that the Liberals will reason that one of them must be good, then they have good reason to pick fights with each other. But in practice, in an accusation situation, the table often opts to pick neither, so it's a risky move.

Now we get into the questionable deduction I made during the game. If Marek is a fascist and he inspects a liberal he has a choice to make. He can accuse the liberal of being fascist to sow distrust among the liberals. Or he can tell the truth to garner trust with the liberals. If Marek is liberal and he inspects a fascist then he has no choice to make. He will declare that the fascist is a fascist.

In the moment I figured there was about a 50-50 chance that a Fascist who inspected a Liberal would call them a Fascist rather than a Liberal. Let's call Fascists who lie about Liberals' identities bold and Fascists who tell the truth timid. I discounted the possibility that both were Fascist and I reasoned with probabilities from the first square. So given that Marek had accused Sam, that meant Marek was a Liberal with probability 75%.

The problem is I didn't adequately update on Marek's fascist presidency. In my mental model of the game it's not that improbable to draw 3 fascist articles; this model is derived from the fact that it generally happens once or twice a game. But it's still an unlikely event in the sense that I should consider it evidence that the president was fascist. From the reports of the first group, 3 fascist articles had already been buried. Even if I distrust them, I can still guess at least 2 fascist articles were buried. So the probability that 3 fascist articles were drawn again is at least $\binom{11}{5}/\binom{14}{5}\approx 0.23$. Going into the investigation I had a belief that Marek was Fascist with probability about 50%, but I should have already updated to ~88% that he was a fascist as opposed to an unlucky liberal (100% of the time he didn't draw 3 fascist articles, he buried a liberal and is a fascist; half the rest of the time he's a fascist by my prior). Given that, even with my conjecture that Fascists make false accusations only half the time, I should have guessed Marek was more likely fascist than Sam. Marek's accusation demonstrated Marek is not a timid fascist, which I conjectured to be half the fascist probability mass. By Bayes' Theorem I should have updated to Marek being fascist with probability:

$$P(\text{Marek Fascist}\mid\text{not timid}) = \frac{P(\text{not timid}\mid\text{Marek Fascist})\,P(\text{Marek Fascist})}{P(\text{not timid})} = \frac{0.5\cdot 0.88}{0.12 + 0.88/2} \approx 0.79$$
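The whole chain of in-game updates can be reproduced in a few lines (taking the post's own modelling assumptions, e.g. the 50% "timid Fascist" conjecture, as given):

```python
from math import comb

# The post's deck estimate: chance the 3 drawn articles were all fascist.
p_three_fascist = comb(11, 5) / comb(14, 5)

# If he didn't draw 3 fascist articles he buried a liberal (certain Fascist);
# otherwise fall back on the 50% prior.
p_marek_fascist = (1 - p_three_fascist) * 1.0 + p_three_fascist * 0.5

# Bayes update once the accusation rules out "timid Fascist"
# (timidity conjectured to be half the Fascist probability mass).
prior = 0.88
posterior = (0.5 * prior) / ((1 - prior) + prior / 2)

print(round(p_three_fascist, 2), round(p_marek_fascist, 2), round(posterior, 2))
```

This recovers the post's ≈0.23, ≈0.88, and ≈0.79 figures.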

I definitely computed badly in the moment. I think my model building was also bad in a number of other ways which are harder for me to put numbers on:

• I thought Marek was unlikely to pick a fight since he seemed relatively new, and he was very quiet after his accusation. In my mind, people who lie about other people's identities come prepared to fight, and Marek seemed sort of timid. The counterpoint is that lying is the obvious level-one strategy for fascists; a reasonable person might think it's just what fascists are supposed to do.
• I didn't factor in the probability that both were fascists at all.
• All the players seemed to jump on Marek. I think part of the reason I defended him was a contrarian bias. But being contrarian against too big a consensus in Secret Hitler is important: if everyone agrees about something, then some fascists are agreeing.
• On the meta level, I was very sleepy and knew myself not to be reasoning that well or remembering basic facts that accurately. I probably should have deferred to the group.

Thanks for reading and let me know if you have any other thoughts about the position.

1. The development Mafia -> The Resistance -> Avalon -> Secret Hitler represents substantial progress in board gaming technology. There are also a lot of amazing adjacent games like Two Rooms and a Boom and One Night Ultimate Werewolf. I'm thankful to live in a time of extraordinary board game technological progress. ↩︎

2. Our meta was such that the chancellorship rotated counterclockwise as the presidency rotated clockwise, to see the maximum number of players. In later games our meta updated to make the chancellor 3 to the left of the president, nein-ing on them if we got a favorable result, which seems to be a very powerful strategy for the liberals. ↩︎


### Thoughts from a Two Boxer

August 23, 2019 - 03:57
Published on August 23, 2019 12:24 AM UTC

I'm writing this for blog day at MSFP. I thought about a lot of things here, like category theory, the 1-2-3 conjecture, and Paul Christiano's agenda. I want to start by thanking everyone for having me and saying I had a really good time. At this point I intend to go back to thinking about the stuff I was thinking about before MSFP (random matrix theory). But I learned a lot and I'm sure some of it will come to be useful. This post is about (my confusion about) decision theory.

Before the workshop I hadn't read much besides Eliezer's paper on FDT, and my impression was that it was mostly a good way of thinking about making decisions and at least represented progress over EDT and CDT. After thinking more carefully about some canonical thought experiments, I'm no longer sure. I suspect many of the concrete thoughts which follow will be wrong in ways that illustrate very bad intuitions. In particular, I think I am implicitly guided by non-example number 5 of an aim of decision theory in Wei Dai's post on the purposes of decision theory. I welcome any corrections or insights in the comments.

The Problem of Decision Theory

First I'll talk about what I think decision theory is trying to solve. Basically, I think decision theory is the theory of how one should[1] decide on an action after one already understands: the actions available, the possible outcomes of those actions, the probabilities of those outcomes, and the desirability of those outcomes. In particular, the answers to the listed questions are only adjacent to decision theory; I sort of think answering all of those questions is in fact harder than the question posed by decision theory itself. Before doing any reading I would have naively expected that the problem of decision theory, as stated here, was trivial, but after pulling on some edge cases I see there is room for a lot of creative and reasonable disagreement.

A lot of the actual work in decision theory is the construction of scenarios in which ideal behavior is debatable or unclear. People choose their own philosophical positions on what is rational in these hairy situations and then construct general procedures for making decisions which they believe behave rationally in a wide class of problems. These constructions are a concrete version of formulating properties one would expect an ideal decision theory to have.

One such property is that an ideal decision theory shouldn't choose to self-modify in some wide, vaguely defined class of "fair" problems. An obviously unfair problem would be one in which the overseer gives CDT $10 and any other agent $0. One of my biggest open questions in decision theory is where the line between fair and unfair problems should lie. At this point I am not convinced that any problem where agents in the environment have access to our decision theory's source code, or to copies of our agent, is a fair problem. But my impression from hearing and reading what people talk about is that this is a heretical position.

Newcomb's Problem

Let's discuss Newcomb's problem in detail. In this problem there are two boxes, one of which you know contains a dollar. In the other box, an entity predicting your action may or may not have put a million dollars. They put in the million dollars if and only if they predict you will take only one box. What do you do if the predictor is 99 percent accurate? How about if it is perfectly accurate? What if you can see the contents of the boxes before you make your decision?
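As a quick sanity check on the stakes (box amounts as in the problem statement; the rest is just arithmetic), here are the expected dollars of each fixed policy against a predictor of accuracy p:

```python
# Expected dollars for each fixed policy against a predictor that is
# correct with probability p ($1 in the visible box, $1,000,000
# conditionally in the opaque box).
def ev_one_box(p):
    return p * 1_000_000            # predictor foresaw one-boxing, box filled

def ev_two_box(p):
    return (1 - p) * 1_000_000 + 1  # opaque box filled only when predictor erred

for p in (0.99, 1.0):
    print(p, ev_one_box(p), ev_two_box(p))
```

At 99% accuracy the gap is roughly $990,000 versus $10,001, which is why the pull toward one-boxing feels so strong despite the causal argument for two-boxing.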

An aside on why Newcomb's problem seems important: it is sort of like a prisoner's dilemma. To see the analogy, imagine you're playing a classic prisoner's dilemma against a player who can reliably predict your action and then chooses to match it. Newcomb's problem seems important because prisoner's dilemmas seem like simplifications of situations which really do occur in real life. The tragedy of prisoner's dilemmas is that game theory suggests you should defect, but the real world seems like it would be better if people cooperated.

Newcomb's problem is weird to think about because the predictor's and the agent's behaviors are logically connected but not causally. That is, if as an outside observer you tell me what the agent does or what the predictor predicts, I can guess the other with high probability. But once the predictor has predicted, the agent could still take either option, and flip-flopping won't flip-flop the predictor. Still, one may argue you should one-box because being a one-boxer going into the problem means you will likely get more utility. I disagree with this view and see Newcomb's problem as punishing rational agents.

If Newcomb's problem is ubiquitous, and one imagines an agent walking down the street constantly being Newcombed, it is indeed unfortunate if they are doomed to two-box; they'll end up with far fewer dollars. But this thought experiment is missing an important piece of real-world detail, in my view: how the predictors predict the agent's behavior. There are three possibilities:

• The predictors have a sophisticated understanding of the agent's inner workings and use it to simulate the agent to high fidelity.
• The predictors have seen many agents like our agent doing problems like this problem and use this to compute a probability of our agent's choice and compare it to a decision threshold.
• The predictor has been following the behavior of our agent and uses this history to assign its future behavior a probability.

In the third bullet the agent should one-box if they predict they are likely to be Newcombed often[2]. In the second bullet they should one-box if they predict that members of their population will be Newcombed often and they derive more utility from the extra dollars their population will get than from the extra dollar they could get for themselves. I have already stated that I see the first bullet as an unfair problem.
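A minimal sketch of the third bullet (the history-based predictor; the payoff numbers and the predictor's simple majority rule are assumptions of mine) shows why the answer there depends on how often you expect to be Newcombed:

```python
# Toy history-based predictor: it predicts "one-box" iff the agent has
# mostly one-boxed so far. The agent follows a fixed policy for n rounds.
def run(policy, n):
    history, total = [], 0
    for _ in range(n):
        predicted_one = history.count("one") >= len(history) / 2
        box_b = 1_000_000 if predicted_one else 0
        if policy == "one":
            total += box_b
        else:
            total += box_b + 1   # two-boxer also grabs the visible dollar
        history.append(policy)
    return total

print(run("one", 10), run("two", 10))
```

The consistent two-boxer wins exactly one filled box (before its track record catches up with it), while the consistent one-boxer is rewarded on every later encounter, so frequent Newcombing strongly favors the one-boxing track record.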

My big complaint with mind reading is that there just isn't any mind reading. All my understanding of how people behave comes from observing how they behave in general, how the specific human I'm trying to understand behaves, whatever they have explicitly told me about their intentions, and whatever self-knowledge I have that I believe is applicable to all humans. Nowhere in the current world do people have to make decisions under the condition of being accurately simulated.

Why then do people develop so much decision theory intended to be robust in the presence of external simulators? I suppose it's because there's an expectation that this will be a major problem in the future, which should be solved philosophically before it is practically important. Mind reading could become important to humans if mind surveillance became possible and were deployed. I don't think such a thing is possible in the near term, or likely even in the fullness of time. But I also can't think of any insurmountable physical obstructions, so maybe I'm too optimistic.

Mind reading is relevant to AI safety because whatever AGI is created will likely be a program on a computer somewhere, which could reason that its program stack is fully transparent or that its creators are holding copies of it for predictions.

Conclusion

Having written that last paragraph, I suddenly understand why decision theory in the AI community is the way it is. I guess I wasn't properly engaging with the premises of the thought experiment. If one actually told me I was about to do a Newcomb experiment I would still two-box, because knowing I was in the real world I wouldn't really believe that an accurate predictor would be deployed against me. But an AI can practically be simulated and, what's more, can reason that it is just a program run by a creator that could have created many copies of it.

I'm going to post this anyway, since it's blog-day and not important-quality-writing day, but I'm not sure this post has much of a purpose anymore.

1. This may read like I'm already explicitly guided by the false purpose Wei Dai warned against. My understanding is that the goal is to understand ideal decision making. Just not for the purposes of implementation. ↩︎

2. I don't really know anything but I imagine the game theory of reputation is well developed ↩︎


August 23, 2019 - 03:21
Published on August 23, 2019 12:21 AM UTC

This post is a result of numerous discussions with other participants and organizers of the MIRI Summer Fellows Program 2019.

I recently (hopefully :-) ) dissolved some of my confusion about agency. In the first part of the post, I describe a concept that I believe to be central to most debates around agency. I then briefly list some questions and observations that remain interesting to me.

A(Θ)-morphization

Architectures

Consider the following examples of "architectures":

Example (architectures)

1. "Agenty" according to me:
1. Monte Carlo tree search algorithm, parametrized by the number of rollouts made each move and utility function (or heuristic) used to evaluate positions.
2. (semi-vague) "Classical AI-agent" with several interconnected modules (utility function and world model, actions, planning algorithm, and observations used for learning and updating the world model).
3. (vague) Human parametrized by their goals, knowledge, and skills (and, of course, many other details).
2. "Non-agenty" according to me:
1. A hard-coded sequence of actions.
2. Look-up table.
3. Random generator (outputting x∼π on every input, for some probability distribution π).
3. Multi-agent systems:
1. Ant colony.
2. Company (consisting of individual employees, operating within an economy).
3. Comprehensive AI services.

Working definition: Architecture A(Θ) is some model parametrizable by θ ∈ Θ that receives inputs, produces outputs, and possibly keeps an internal state. We denote specific instances of A(Θ) as A(θ).
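The working definition can be rendered as a tiny Python interface (the class and method names here are my own, just to make the definition concrete):

```python
from typing import Generic, TypeVar

Theta = TypeVar("Theta")

class Architecture(Generic[Theta]):
    """A(Θ): a model parametrized by θ ∈ Θ that maps inputs to outputs
    and may keep an internal state; A(θ) is a specific instance."""
    def __init__(self, theta: Theta):
        self.theta = theta
        self.state = None

    def step(self, observation):
        raise NotImplementedError

# A "non-agenty" instance from Example 2.2: a look-up table.
class LookupTable(Architecture[dict]):
    def step(self, observation):
        return self.theta.get(observation)

table = LookupTable({"ping": "pong"})
print(table.step("ping"))  # pong
```

Here θ is the table itself; a Monte Carlo tree search architecture would instead take (rollout count, evaluation function) as its θ.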

Generalizing anthropomorphization

A standard item in the human mental toolbox is anthropomorphization: modeling various things as humans (specifically, ourselves) with "funny" goals or abilities. We can make the same mental move for architectures other than humans:

Working definition (A(Θ)-morphization): Let X be something that we want to predict or understand and let A(Θ) be an architecture. Then any model A(θ) is an A(Θ)-morphization of X.

Anthropomorphization works well for other humans and some animals (curiosity, fear, hunger). On the other hand, it doesn't work so well for rocks, lightning, and AGIs --- not that that prevents us from using it anyway. We can measure the usefulness of A(Θ)-morphization by the degree to which it makes good predictions:

Working definition (prediction error): Suppose X exists in a world W and $\vec{E}=(E_1,\dots,E_n)$ is a sequence of variables (events about X) that we want to predict. Suppose that $\vec{e}=(e_1,\dots,e_n)$ is how $\vec{E}$ actually unfolds and $\vec{\pi}=(\pi_1,\dots,\pi_n)$ is the prediction obtained by A(Θ)-morphizing X as A(θ). The prediction error of A(θ) (w.r.t. X and $\vec{E}$ in W) is the expected Brier score of $\vec{\pi}$ with respect to $\vec{e}$.

Informally, we say that A(Θ)-morphizing X is accurate (resp. not accurate) if the corresponding prediction error is low (resp. high).[1]
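Concretely, for binary events the Brier score in the definition is just the mean squared gap between the predicted probabilities and what actually happened (a standard formula, shown here for the binary case only):

```python
# Brier score for binary events: mean squared difference between the
# probabilities A(θ) assigned and the realized outcomes (0 or 1).
def brier(predictions, outcomes):
    return sum((p - e) ** 2 for p, e in zip(predictions, outcomes)) / len(outcomes)

outcomes = [1, 1, 0, 1]
accurate = brier([0.9, 0.8, 0.2, 0.7], outcomes)      # low error
uninformative = brier([0.5, 0.5, 0.5, 0.5], outcomes)  # higher error
print(accurate, uninformative)
```

An A(Θ)-morphization of X is accurate in the sense above exactly when its score looks like the first case rather than the second.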

When do we call things agents?

Claim: I claim that in many situations where we ask "Is X an agent?", we should instead be asking "Does X exhibit agent-like behavior?". And better still, we should explicitly operationalize this latter question as "Is A(Θ)-morphizing X accurate?". (A related question is how difficult it is for us to "run" A(θ). Indeed, we anthropomorphize so many things precisely because it is cheap for us to do so.)

Relatedly, I believe we already implicitly do this operationalization: Suppose you talk to your favorite human H about agency. H will likely subconsciously associate agency with certain architectures, maybe such as those in Example 1.1-3. Moreover, H will ascribe varying degrees of agency to different architectures --- for me, 1.3 seems more agenty than 1.1. Similarly, there are some architectures that H will associate with "definitely not an agent". I conjecture that, according to H, some X exhibits agent-like behavior if it can be accurately predicted via A(Θ)-morphization for some agenty-to-H architecture A(Θ). Similarly, H would say that X exhibits non-agenty behavior if H can accurately predict it using some non-agenty-to-H architecture.

*Critically, exhibiting agent-like and non-agenty behavior are not mutually exclusive,* and I think this causes most of the confusion around agency. Indeed, we humans seem very agenty but, at the same time, determinism implies that there exists some hard-coded behavior that we enact. A rock rolling downhill can be viewed as merely obeying the non-agenty laws of physics, but what if it "wants to" get as low as possible?

If we ban the concept of agency, which interesting problems remain?

"Agency" often comes up when discussing various alignment-related topics, such as the following:

Optimizer?

How do we detect whether X performs (or is capable of performing) optimization? How do we detect this from X's architecture (or causal origin) rather than by looking at its behavior? (This seems central to the topic of mesa-optimization.)

Agent-like behavior vs agent-like architecture.

Consider the following conjecture: "Suppose some X exhibits agent-like behavior. Does it follow that X physically contains an agent-like architecture, such as the one from Example 1.2?" This conjecture is false --- as an example, Q-learning is a "fairly agenty" architecture that leads to intelligent behavior, yet the resulting RL "agent" has a fixed policy and thus functions as a large look-up table. A better question would thus be whether there exists an agent-like architecture causally upstream of X. This question also has a negative answer, as witnessed by the example of an ant colony --- agent-like behavior without agent-like architecture, produced by the "non-agenty" optimization process of evolution. Nonetheless, a general version of the question remains: if some X exhibits agent-like behavior, does it follow that there exists some interesting physical structure[2] causally upstream of X?[3]
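The Q-learning example can be made concrete in a few lines (the toy corridor environment and all hyperparameters are my own illustrative choices): the training loop looks agenty, but the deployed artifact is literally a table.

```python
import random

random.seed(0)

# Tiny corridor MDP: states 0..3, actions L/R, reward 1 only on reaching
# state 3. Q-learning with epsilon-greedy exploration does the "agenty" part.
Q = {(s, a): 0.0 for s in range(4) for a in ("L", "R")}
for _ in range(2000):
    s = 0
    while s != 3:
        if random.random() < 0.2:                     # explore
            a = random.choice(("L", "R"))
        else:                                          # exploit
            a = max("LR", key=lambda act: Q[(s, act)])
        s2 = max(s - 1, 0) if a == "L" else s + 1
        r = 1.0 if s2 == 3 else 0.0
        Q[(s, a)] += 0.5 * (r + 0.9 * max(Q[(s2, "L")], Q[(s2, "R")]) - Q[(s, a)])
        s = s2

# The deployed "agent" is just this fixed look-up table from states to actions:
policy = {s: max("LR", key=lambda act: Q[(s, act)]) for s in range(3)}
print(policy)  # {0: 'R', 1: 'R', 2: 'R'}
```

Once learning stops, nothing agenty remains at runtime: behavior is a dictionary lookup, which is the point of the conjecture's counterexample.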

Moral standing.

Suppose there is some X which I model as having some goals. When taking actions, should I give weight to those goals? (The answer to this question seems more related to consciousness than to A(Θ)-morphization. Note also that a particularly interesting version of the question can be obtained by replacing "I" with "AGI"...)

PC or NPC?

When making plans, should we model X as a part of the environment, or does it enter our game-theoretical considerations? Is X able to model us?

Creativity, unbounded goals, environment-generality.

In some sense, AlphaZero is an extremely capable game-playing agent. On the other hand, if we "gave it access to the internet", it wouldn't do anything with it. The same cannot be said for humans and unaligned AGIs, who would not only be able to orient themselves in this new environment but would eagerly execute elaborate plans to increase their influence. How can we tell whether some X is more like the former or the latter?

To summarize, I believe that many arguments and confusions surrounding agency can disappear if we explicitly use A(Θ)-morphization. This should allow us to focus on the problems listed above. Most definitions I gave are either semi-formal or informal, but I believe they could be made fully formal in more specific cases.

Regarding feedback: Suggestions for a better name super-welcome! If you know of an application for which such formalization would be useful, please do let me know. Pointing out places where you expect a useful formalization to be impossible is also welcome.

1. Distinguishing between "small enough" and "too big" prediction errors seems non-trivial, since some environments are naturally more difficult to predict than others. Formalizing this will likely require additional insights. ↩︎

2. An example of such "interesting physical structure" would be an implementation of an optimization architecture. ↩︎

3. Even if true, this conjecture will likely require some additional assumptions. Moreover, I expect "randomly-generated look-up tables that happen to stumble upon AGI by chance" to serve as a particularly relevant counterexample. ↩︎

Discuss

### Logical Optimizers

23 августа, 2019 - 02:54
Published on August 22, 2019 11:54 PM UTC

Epistemic status: I think the basic idea is more likely than not sound. Probably some mistakes. Looking for a sanity check.

Black box description

The following is a way to Foom an AI while leaving its utility function and decision theory as blank spaces. You could plug any uncomputable or computationally intractable behavior you might want in, and get an approximation out.

Suppose I were handed a hypercomputer and allowed to run code on it without worrying about mindcrime, and the hypercomputer were then removed, allowing me to keep 1 GB of data from the computations. Then I am handed a magic human utility function, as code on a memory stick. This approach would allow me to use that situation to make an FAI.

Example algorithms

Suppose you have a finite set of logical formulas, each of which evaluates to some real number. A logical optimizer is an algorithm that takes those formulas and tries to output the formula whose value is maximal.

One such algorithm is to run a logical inductor to estimate each rn (the value of the n-th formula) and then pick the formula with the highest estimate.

Suppose the formulas were

1) "3+4"

2) "10 if P=NP else 1"

3) "0 if P=NP else 11"

4) "2*6-3"

When run with a small amount of compute, these algorithms would pick option (4). They are in a state of logical uncertainty about whether P=NP, and act accordingly.

Given vast amounts of compute, they would pick either (2) or (3).
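A minimal sketch of the bounded-compute case might look like the following. The encoding of the four formulas and the subjective probability of P=NP are assumptions for illustration; a real logical inductor would refine these estimates as compute grows, rather than using a fixed probability.

```python
# Hypothetical sketch: a bounded "logical optimizer" choosing among the four
# formulas above. With limited compute it cannot resolve P=NP, so it scores
# the conditional formulas by expected value under a subjective probability.

def estimate(formula, p_np=0.5):
    """Estimate a formula's value under logical uncertainty about P=NP."""
    kind, *args = formula
    if kind == "const":
        return args[0]
    if kind == "if_p_eq_np":  # value_if_true if P=NP else value_if_false
        v_true, v_false = args
        return p_np * v_true + (1 - p_np) * v_false

formulas = [
    ("const", 3 + 4),        # (1) "3+4"
    ("if_p_eq_np", 10, 1),   # (2) "10 if P=NP else 1"
    ("if_p_eq_np", 0, 11),   # (3) "0 if P=NP else 11"
    ("const", 2 * 6 - 3),    # (4) "2*6-3"
]

best_index = max(range(len(formulas)), key=lambda i: estimate(formulas[i]))
# With p_np = 0.5 the estimates are [7, 5.5, 5.5, 9], so option (4) wins.
```

If the optimizer later became confident that P=NP (p_np near 1), option (2)'s estimate would rise to about 10 and it would win instead, matching the "vast amounts of compute" case.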

We might choose to implicitly represent a set of propositions in some manner instead of explicitly stating them. This would mean that a Logical Optimizer could optimize without needing to explicitly consider every possible expression. It could use an evolutionary algorithm. It could rule out swaths of propositions based on abstract reasoning.

Self Improvement

Now consider some formally specifiable prior over sets of propositions called P. P could be a straightforward simplicity based prior, or it could be tuned to focus on propositions of interest.

Suppose α1,...,αn is a finite set of programs, each of which takes in a set of propositions R={r1,...,rm} and outputs one of them. If a program fails to choose a number from 1 to m quickly enough, pick randomly (or default to 1).

Let C(α,R)=ri be the choice made by the program α.

Let S(α)=∑R P(R)⋅C(α,R) be the average value of the proposition chosen by the program α, weighted by the prior P over sets of propositions.

Now, maximizing S(α) over all short programs α is something a Logical Optimizer is capable of doing. Thus, Logical Optimizers can produce other, perhaps more efficient, Logical Optimizers in finite time.
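The scoring rule S(α) can be sketched concretely. The toy prior below and the two candidate "programs" are invented for illustration; real proposition values would themselves be estimates under logical uncertainty rather than known numbers.

```python
# Hypothetical sketch of the scoring rule S(α): each "program" α picks one
# proposition value from a set R, and S averages that choice's value over
# a prior P on sets of propositions.

prior = [  # pairs (R, P(R)): a toy prior over sets of proposition values
    ([3.0, 1.0, 4.0], 0.5),
    ([2.0, 7.0],      0.3),
    ([5.0],           0.2),
]

def pick_first(R):   # a weak optimizer: always choose r1
    return R[0]

def pick_max(R):     # a stronger optimizer: choose the largest ri
    return max(R)

def S(alpha):
    """Average value of alpha's chosen proposition, weighted by the prior."""
    return sum(p * alpha(R) for R, p in prior)

# S(pick_first) = 0.5*3 + 0.3*2 + 0.2*5 = 3.1
# S(pick_max)   = 0.5*4 + 0.3*7 + 0.2*5 = 5.1
```

Searching over programs for the one with the highest S, as the self-improvement step describes, would here select `pick_max` over `pick_first`.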

Odds and ends

Assuming that you can design a reasonably efficient Logical Optimizer to get things started, and that you can choose a sensible P, you could get a FOOM towards a Logical Optimizer of almost maximal efficiency.

Note that Logical Optimizers aren't AIs. They have no concept of empirical uncertainty about an external world. They do not perform Bayesian updates. They barely have a utility function. You can't put one in a prisoner's dilemma. They only resolve a certain kind of logical uncertainty.

On the other hand, a Logical Optimizer can easily be converted into an AI by defining a prior, a notion of Bayesian updating, an action space and a utility function.

Just maximize over actions a∈A in expressions of the form "Starting with prior P and updating it based on evidence E, what will your utility be if you take action a?"
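A minimal sketch of this conversion, with the Logical Optimizer's evaluation of each utility expression replaced by a made-up stand-in table (the action names, evidence, and utilities are all hypothetical):

```python
# Hypothetical sketch: turning a logical optimizer into a decision-maker by
# maximizing over actions. `expected_utility` stands in for evaluating the
# expression "Starting with prior P and updating on evidence E, if you take
# action a then your utility will be ...".

ACTION_SPACE = ["left", "right", "wait"]

def expected_utility(action, evidence):
    # Toy stand-in for the Logical Optimizer's estimate of the expression.
    table = {("left", "wind"): 1.0, ("right", "wind"): 3.0, ("wait", "wind"): 2.0}
    return table[(action, evidence)]

def choose_action(evidence):
    return max(ACTION_SPACE, key=lambda a: expected_utility(a, evidence))
```

The point is that all the "AI" machinery lives in the expressions being scored; the optimizer itself only compares their estimated values.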

I suspect that Logical Optimizers are safe, in the sense that you could get one to FOOM on real world hardware, without homomorphic encryption and without disaster.

Logical Optimizers are not clever-fool-proof: a clever fool could easily turn one into a paper clip maximizer. Do not put a FOOMed one online.

I suspect that one sensible route to FAI is to FOOM a logical optimizer, and then plug in some uncomputable or otherwise unfeasible definition of friendliness.

Discuss

### Mechanistic Corrigibility

23 августа, 2019 - 02:20
Published on August 22, 2019 11:20 PM UTC

Acceptability

To be able to use something like relaxed adversarial training to verify a model, a necessary condition is having a good notion of acceptability. Paul Christiano describes the following two desiderata for any notion of acceptability:

1. "As long as the model always behaves acceptably, and achieves a high reward on average, we can be happy."
2. "Requiring a model to always behave acceptably wouldn't make a hard problem too much harder."

While these are good conditions that any notion of acceptability must satisfy, there may be many different possible acceptability predicates that meet both of these conditions—how do we distinguish between them? Two additional major conditions that I use for evaluating different acceptability criteria are as follows:

1. It must be not that hard for an amplified overseer to verify that a model is acceptable.
2. It must be not that hard to find such an acceptable model during training.

These conditions are different than Paul's second condition in that they are statements about the ease of training an acceptable model rather than the ease of choosing an acceptable action. If you want to be able to do some form of informed oversight to produce an acceptable model, however, these are some of the most important conditions to pay attention to. Thus, I generally think about choosing an acceptability condition as trying to answer the question: what is the easiest-to-train-and-verify property such that all models that satisfy that property[1] (and achieve high average reward) are safe?

Act-Based Corrigibility

One possible candidate property that Paul has proposed is act-based corrigibility, wherein an agent respects our short-term preferences, including those over how the agent itself should be modified. Not only is such an agent corrigible, Paul argues, but it will also want to make itself more corrigible, since having it be more corrigible is a component of our short-term preferences (Paul calls this the "broad basin" of corrigibility). While such act-based corrigibility would definitely be a nice property to have, it's unclear how exactly an amplified overseer could go about verifying such a property. In particular, if we want to verify such a property, we need a mechanistic understanding of act-based corrigibility rather than a behavioral one, since behavioral properties can only be verified by testing every input, whereas mechanistic properties can be verified just by inspecting the model.

One possible mechanistic understanding of corrigibility is corrigible alignment as described in "Risks from Learned Optimization," which is defined as the situation in which "the base objective is incorporated into the mesa-optimizer's epistemic model and [the mesa-optimizer's] objective is modified to 'point to' that information." While this gives us a starting point for understanding what a corrigible model might actually look like, there are still a bunch of missing pieces that have to be filled in. Furthermore, this notion of corrigibility looks more like instrumental corrigibility rather than act-based corrigibility, which as Paul notes is significantly less likely to be robust. Mechanistically, we can think of this lack of robustness as coming from the fact that "pointing" to the base objective is a pretty unstable operation: if you point even a little bit incorrectly, you'll end up with some sort of corrigible pseudo-alignment rather than corrigible robust alignment.

We can make this model more act-based, and at least somewhat mitigate this robustness problem, however, if we imagine pointing to only the human's short-term preferences. The hope for this sort of a setup is that, as long as the initial pointer is "good enough," there will be pressure for the mesa-optimizer to make its pointer better in the way in which its current understanding of short-term human preferences recommends, which is exactly Paul's "broad basin" of corrigibility argument. This requires it to be not that hard, however, to find a model with a notion of the human's short-term preferences as opposed to their long-term preferences that is also willing to correct that notion based on feedback.

In particular, it needs to be the case that it is not that hard to find an agent which will correct mistakes in its own prior over what the human's short-term preferences are. From a naive Bayesian perspective, this seems unlikely, as it seems strange for an agent to be incentivized to change its own prior. However, this is actually a very natural state for an agent to be in: if I trust your beliefs about X more than I trust my own, then that means I would endorse a modification of my prior to match yours. In the context of act-based corrigibility, we can think about this from a mechanistic perspective as having a pre-prior that encodes a belief that the human prior over human short-term preferences is to be preferred. Furthermore, pre-priors are generally epistemically valuable for agents to have, as a pre-prior can encourage an agent to correct its own cognitive biases. Thus, agents with pre-priors should be incentivized by most training processes, and thus shouldn't be too difficult to find.

Indifference Corrigibility

Instrumental and act-based corrigibility are not the only forms of corrigibility that have been discussed in the literature, however: there's also indifference corrigibility, wherein the agent is indifferent to modifications the human might make to it such as shutting it off. While this form of corrigibility doesn't in and of itself guarantee acceptability, if you had a way of verifying indifference corrigibility, there might be ways to turn that back into act-based corrigibility.

First, however, if we want to be able to verify indifference corrigibility, we are going to need a mechanistic understanding of it. Let's suppose we accept the argument that a competitive model will likely be a mesa-optimizer such that it will be running some sort of optimization process coupled with some sort of mesa-objective. Furthermore, let's suppose that its optimization process is computed over some sort of world model, which I think is a reasonable assumption for any competitive mesa-optimizer that wants to be able to take actions in the world. In such a case, we can think of indifference corrigibility as a form of stop gradient in the mesa-optimizer's world model which its optimization process doesn't optimize through,[2] which is a highly mechanistic property that we might expect an amplified overseer to be able to verify.

Again, however, we need agents with stop gradients to form a natural enough class such that they are not that hard to find. While it might naively seem as if agents with stop gradients would be very rare, in fact the exact opposite might be the case. Consider an agent such as the mesa-optimizer described previously which is running an optimization process over some world model, but which has preferences over the state of the world rather than the state of its world model. Such an agent will have an incentive to keep its world model aligned with the world, but only in one direction: it will want to modify its world model to match the world, but not modify the world to match its world model. This is a type of stop gradient, as it is a restriction on the agent to only minimize the difference between its world model and the world through one pathway but not through another. This example seems to suggest that stop gradients aren't that uncommon among agents such that it might be fairly achievable to find such an agent.
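The one-directional world-model update above can be sketched numerically. This is a toy analogy, not a claim about how real mesa-optimizers are implemented: the loss measures the mismatch between world state and model, and the "stop gradient" is the restriction that only the model moves to reduce it.

```python
# Hypothetical numerical sketch of the "stop gradient" analogy: the agent
# minimizes the mismatch between the world state w and its model m, but the
# gradient of the loss is only allowed to update the model, never the world.

def loss(w, m):
    return (w - m) ** 2

def grad_wrt_model(w, m):
    return -2 * (w - m)   # d/dm (w - m)^2

def grad_wrt_world(w, m):
    return 2 * (w - m)    # d/dw (w - m)^2 -- deliberately never used below

def update(w, m, lr=0.1):
    # The stop gradient: only m moves to match w; w is left untouched,
    # even though changing w would also reduce the loss.
    return w, m - lr * grad_wrt_model(w, m)

w, m = 5.0, 0.0
for _ in range(100):
    w, m = update(w, m)
# The model tracks the world (m -> 5.0) while the world state is unchanged.
```

An agent without the stop gradient would be free to descend `grad_wrt_world` as well, i.e. to change the world to match its model, which is exactly the direction of optimization the corrigible agent is supposed to refrain from.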

Even if we can train and verify indifference corrigibility, however, one of the big remaining problems with such an approach is that an indifference corrigible mesa-optimizer might do search in such a way that it finds a policy which is not itself indifference corrigible. We can think about this problem as arising from the fact that indifference corrigibility doesn't tile, though I more often think of this problem as the problem of "forwarding the guarantee," wherein even if we know how to do safe search on the outside, we have to make sure that any search inside of our model is similarly safe as well. This is a particularly interesting type of inner alignment problem in that, rather than being about how to ensure that a mesa-optimizer's objective is aligned, it is about how to ensure that a mesa-optimizer's search is safe even given that its objective is aligned. However, it seems plausible that this sort of problem could be resolved by ensuring that the model has a meta-preference towards any policies it produces also respecting the same stop gradient. In particular, the overseer could verify that any search over policies done by the model enforce the constraint that every policy have such a stop gradient.

Even once we can verify that our model is indifference corrigible and that it will forward that guarantee to any other search it might perform, however, there is still the question of how we might be able to use such a mechanism to produce a safe system. One way in which indifference corrigibility could be used to produce safety is to enforce that our model behave myopically. We can think of a myopic agent as one that only considers how best to answer the single question that you give to it rather than considering any sort of long-term consequences, which can be implemented as a stop gradient preventing any optimization outside of that domain. While myopia on its own is still insufficient to guarantee acceptability, it seems like it would at least prevent deceptive alignment, as one of the conditions for deceptive alignment is that the mesa-optimizer must have something to gain from cooperating now and then defecting later, which is not true for a myopic agent. Thus, if directed at a task which we are confident is outer aligned, such as pure supervised amplification (training a model to approximate a human consulting that model), and combined with a scheme for preventing standard pseudo-alignment (such as adversarial training), myopia verification might be sufficient to resolve the rest of the inner alignment problem by preventing deceptive alignment.

Conclusion

If we want to be able to do relaxed adversarial training to produce safe AI systems, we are going to need a notion of acceptability which is not that hard to train and verify. Corrigibility seems to be one of the most promising candidates for such an acceptability condition, but for that to work we need a mechanistic understanding of exactly what sort of corrigibility we're shooting for and how it will ensure safety. Though I think that both of the paths considered here look promising, further progress in understanding exactly what these different forms of corrigibility look like from a mechanistic perspective is likely to be necessary.

1. Or at least all models that we can find that satisfy that property. ↩︎

2. Thanks to Scott Garrabrant for the stop gradient analogy. ↩︎

Discuss

### Response to Glen Weyl on Technocracy and the Rationalist Community

23 августа, 2019 - 02:14
Published on August 22, 2019 11:14 PM UTC

Economist Glen Weyl has written a long essay, "Why I Am Not A Technocrat", a major focus of which is his differences with the rationalist community.

I feel like I've read a decent number of outsider critiques of the rationalist community at this point, and Glen's critique is pretty good. It has the typical outsider critique weakness of not being fully familiar with the subject of its criticism, balanced by the strength of seeing the rationalist community from a perspective we're less familiar with.

As I was reading Glen's essay, I took some quick notes. Afterwards I turned them into this post.

Glen's Strongest Points

The fundamental problem with technocracy on which I will focus (as it is most easily understood within the technocratic worldview) is that formal systems of knowledge creation always have their limits and biases. They always leave out important considerations that are only discovered later and that often turn out to have a systematic relationship to the limited cultural and social experience of the groups developing them. They are thus subject to a wide range of failure modes that can be interpreted as reflecting on a mixture of corruption and incompetence of the technocratic elite. Only systems that leave a wide range of latitude for broader social input can avoid these failure modes.

So far, this sounds a lot like discussions I've seen previously of the book Seeing Like a State. But here's where Glen goes further:

Yet allowing such social input requires simplification, distillation, collaboration and a relative reduction in the social status and monetary rewards allocated to technocrats compared to the rest of the population, thereby running directly against the technocratic ideology. While technical knowledge, appropriately communicated and distilled, has potentially great benefits in opening social imagination, it can only achieve this potential if it understands itself as part of a broader democratic conversation.

...

Technical insights and designs are best able to avoid this problem when, whatever their analytic provenance, they can be conveyed in a simple and clear way to the public, allowing them to be critiqued, recombined, and deployed by a variety of members of the public outside the technical class.

Technical experts therefore have a critical role precisely if they can make their technical insights part of a social and democratic conversation that stretches well beyond the role for democratic participation imagined by technocrats. Ensuring this role cannot be separated from the work of design.

...

[When] insulation is severe, even a deeply “well-intentioned” technocratic class is likely to have severe failures along the corruption dimension. Such a class is likely to develop a strong culture of defending its distinctive class expertise and status and will be insulated from external concerns about the justification for this status.

...

Market designers have, over the last 30 years designed auctions, school choice mechanisms, medical matching procedures, and other social institutions using tools like auction and matching theory, adapted to a variety of specific institutional settings by economic consultants. While the principles they use have an appearance of objectivity and fairness, they play out against the contexts of societies wildly different than those described in the models. Matching theory uses principles of justice intended to apply to an entire society as a template for designing the operation of a particular matching mechanism within, for example, a given school district, thereby in practice primarily shutting down crucial debates about desegregation, busing, taxes, and other actions needed to achieve educational fairness with a semblance of formal truth. Auction theory, based on static models without product market competition and with absolute private property rights and assuming no coordination of behavior across bidders, is used to design auctions to govern the incredibly dynamic world of spectrum allocation, creating holdout problems, reducing competition, and creating huge payouts for those able to coordinate to game the auctions, often themselves market design experts friendly with the designers. The complexities that arise in the process serve to make such mass-scale privatizations, often primarily to the benefit of these connected players and at the expense of the taxpayer, appear the “objectively” correct and politically unimpeachable solution.

...

[Mechanism] designers must explicitly recognize and design for the fact that there is critical information necessary to make their designs succeed that a) lies in the minds of citizens outside the technocratic/designer class, b) will not be translated into the language of this class soon enough to avoid disastrous outcomes and c) does not fit into the thin formalism that designers allow for societal input.

...

In order to allow these failures to be corrected, it will be necessary for the designed system to be comprehensible by those outside the formal community, so they can incorporate the unformalized information through critique, reuse, recombination and broader conversation in informal language. Let us call this goal “legibility”.

...

There will in general be a trade-off between fidelity and legibility, just as both will have to be traded off against optimality. Systems that are true to the world will tend to become complicated and thus illegible.

...

Democratic designers thus must constantly attend, on equal footing, in teams or individually, to both the technical and communicative aspects of their work.

(Please let me know if you think I left out something critical)

A famous quote about open source software development states that "given enough eyeballs, all bugs are shallow". Nowadays, with critical security bugs in open-source software like Heartbleed, the spirit of this claim isn't taken for granted anymore. One Hacker News user writes: "[De facto eyeball shortage] becomes even more dire when you look at code no one wants to touch. Like TLS. There were the Heartbleed and goto fail bugs which existed for, IIRC, a few years before they were discovered. Not surprising, because TLS code is generally some of the worst code on the planet to stare at all day."

In other words, if you want critical feedback on your open source project, it's not enough just to put it out there and have lots of users. You also want to make the source code as accessible as possible--and this may mean compromising on other aspects of the design.

Academic or other in-group status games may encourage the use of big words. But we'd be better off rewarding simple explanations--not only are simple explanations more accessible, they also demonstrate deeper understanding. If we appreciated simplicity properly:

• We'd incentivize the creation of more simple explanations, promoting accessibility. And people wouldn't dismiss simple explanations for being "too obvious".

• Intellectuals would realize that even if a simple idea required lots of effort to discover, it need not require lots of effort to grasp. Verification is much quicker than search.

At the very least, I think, Glen wants our institutions to be like highly usable software: The internals require expertise to create and understand, but from a user's perspective, it "just works" and does what you expect.

Another point Glen makes well is that just because you are in the institution design business does not mean you're immune to incentives. The importance of self-skepticism regarding one's own incentives has been discussed before around here, but this recent post probably comes closest to Glen's position: that you really can't be trusted to monitor yourself.

Finally, Glen talks about the insularity of the rationalist community itself. I think this critique was true in the past. I haven't been interacting with the community in person as much over the past few years, so I hesitate to talk about the present, but I think he's plausibly right. I also think there may be an interesting counterargument that the rationalist community does a better job of integrating perspectives across multiple disciplines than your average academic department.

Possible Points of Disagreement

Although I think Glen would find some common ground with the recent post I linked, it's possible he would also find points of disagreement. In particular, habryka writes:

Highlighting accountability as a variable also highlights one of the biggest error modes of accountability and integrity – choosing too broad of an audience to hold yourself accountable to.

There is a tradeoff between the size of the group that you are being held accountable by, and the complexity of the ethical principles you can act under. Too large of an audience, and you will be held accountable by the lowest common denominator of your values, which will rarely align well with what you actually think is moral (if you've done any kind of real reflection on moral principles).

Too small or too memetically close of an audience, and you risk not enough people paying attention to what you do, to actually help you notice inconsistencies in your stated beliefs and actions. And, the smaller the group that is holding you accountable is, the smaller your inner circle of trust, which reduces the amount of total resources that can be coordinated under your shared principles.

I think a major mistake that even many well-intentioned organizations make is to try to be held accountable by some vague conception of "the public". As they make public statements, someone in the public will misunderstand them, causing a spiral of less communication, resulting in more misunderstandings, resulting in even less communication, culminating into an organization that is completely opaque about any of its actions and intentions, with the only communication being filtered by a PR department that has little interest in the observers acquiring any beliefs that resemble reality.

I think a generally better setup is to choose a much smaller group of people that you trust to evaluate your actions very closely, and ideally do so in a way that is itself transparent to a broader audience. Common versions of this are auditors, as well as nonprofit boards that try to ensure the integrity of an organization.

Common wisdom is that it's impossible to please everyone. And specialization of labor is a foundational principle of modern society. If I took my role as a member of "the public" seriously and tried to provide meaningful and fair accountability to everyone, I wouldn't have time to do anything else.

It's interesting that Glen talks up the value of "legibility", because from what I understand, Seeing Like a State emphasizes its disadvantages. Seeing Like a State discusses legibility in the eyes of state administrators, but Glen doesn't explain why we shouldn't expect similar failure modes when "the general public" is substituted for "state administration".

(It's possible that Glen doesn't mean "legibility" in the same sense the book does, and a different term like "institutional legibility" would pinpoint what he's getting at. But there's still the question of whether we should expect optimizing for "institutional legibility" to be risk-free, after having observed that "societal legibility" has downsides. Glen seems to interpret recent political events as a result of excess technocracy, but they could also be seen as a result of excess populism--a leader's charisma could be more "legible" to the public than their competence.)

Anyway, I assume Glen is aware of these issues and working to solve them. I'm no expert, but from what I've heard of RadicalxChange, it seems like a really cool project. I'll offer my own uninformed outsider's perspective on institution design, in the hope that the conceptual raw material will prove useful to him or others.

My Take on Institution Design

I think there's another model which does a decent job of explaining the data Glen provides:

• Human systems are complicated.

• Greed finds & exploits flaws in institutions, causing them to decay over time.

• There are no silver bullets.

From the perspective of this model, Glen's emphasis on legibility could be seen as yet another purported silver bullet. However, I don't see a compelling reason for it to succeed where previous bullets failed. How, concretely, are random folks like me supposed to help address the corruption Glen identifies in the wireless spectrum allocation process? There seems to be a bit of a disconnect between Glen's description of the problem and his description of the solution. (Later Glen mentions the value of "humanities, continental philosophy, or humanistic social sciences"--I'd be interested to hear specific ideas from these areas, which aren't commonly known, that he thinks are quite important & relevant for institution design purposes.)

As a recent & related example, a decade or two ago many people were talking about how the Internet would revitalize & strengthen democracy; nowadays I'd guess most would agree that the Internet has failed as a silver bullet in this regard. (In fact, sometimes I get the impression this is the only thing we can all agree on!)

Anyway... What do I think we should do?

• All untested institution designs have flaws.

• The challenge of institution design is to identify & fix flaws as cheaply as possible, ideally before the design goes into production.

Under this framework, it's not enough merely to have the approval of a large number of people. If these people have similar perspectives, their inability to identify flaws offers limited evidence about the overall robustness of the design.

Legibility is useful for flaw discovery in this framework, just as cleaner code could've been useful for surfacing flaws like Heartbleed. But there are other strategies available too, like offering bug bounties for the best available critiques.

Experiments and field trials are a bit more expensive, but it's critical to actually try things out, and resolve disagreements among bug bounty participants. Then there's the "resume-building" stage of trialing one's institution on an increasingly large scale in the real world. I'd argue one should aim to have all the kinks worked out before "resume-building" starts, but of course, it's important to monitor the roll-out for problems which might emerge--and ideally, the institution should itself have means with which it can be patched "in production" (which should get tested during experimentation & field trials).

The process I just described could itself be seen as an untested institution which is probably flawed and needs critiques, experiments, and field testing. (For example, bug bounties don't do anything on their own for legibility--how can we incentivize the production of clear explanations of the institution design in need of critiques?) Taking everything meta, and designing an institutional framework for introducing new institutions, is the real silver bullet if you ask me :-)

Probable Points of Disagreement

Given Glen's belief in the difficulty of knowledge creation, the importance of local knowledge, and the limitations of outside perspectives, I hope he won't be upset to learn that I think he got a few things wrong about the rationalist community. (I also think he got some things wrong about the EA community, but I believe he's working to fix those issues, so I won't address them.)

Glen writes:

if we want to have AIs that can play a productive role in society, our goal should not be exclusively or even primarily to align them with the goals of their creators or the narrow rationalist community interested in the AIAP.

This doesn't appear to be a difference of opinion with the rationalist community. In Eliezer's CEV paper, he writes about the "coherent extrapolated volition of humankind", not the "coherent extrapolated volition of the rationalist community".

However, now that MIRI's research is non-disclosed by default, I wonder whether it would be wise for them to publicly state that their research is for the benefit of all in a charter, as OpenAI has done, rather than only in a paper published in 2004.

Glen writes:

The institutions likely to achieve [constraints on an AI's power] are precisely the same sorts of institutions necessary to constrain extreme capitalist or state power.

An unaligned superintelligent AI which can build advanced nanotechnology has no need to follow human laws. On the flip side, an aligned superintelligent AI can design better institutions for aggregating our knowledge & preferences than any human could.

Glen writes:

A primary goal of AI design should be not just alignment, but legibility, to ensure that the humans interacting with the AI know its goals and failure modes, allowing critique, reuse, constraint etc. Such a focus, while largely alien to research on AI and on AIAP

This actually appears to me to be one of the primary goals of AI alignment research. See 2.3 in this paper or this parable. It's not alien to mainstream AI research either: see research on explainability and interpretability (pro tip: interpretability is better).

In any case, if the alignment problem is actually solved, legibility isn't needed, because we know exactly what the system's goals are: the goals we gave it.

Conclusion

As I said previously, I have not investigated RadicalxChange in very much depth, but my superficial impression is that it is really cool. I think it could be an extremely high leverage project in a world where AGI doesn't come for a while, or gets invented slowly over time. My personal focus is on scenarios where AGI is invented relatively rapidly relatively soon, but sometimes I wonder whether I should focus on the kind of work Glen does. In any case, I am rooting for him, and I hope his movement does an astonishing job of inventing and popularizing nearly flawless institution designs.

Discuss

### Why so much variance in human intelligence?

August 23, 2019 - 01:36
Published on August 22, 2019 10:36 PM UTC

Epistemic status: Practising thinking aloud. There might be an important question here, but I might be making a simple error.

There is a lot of variance in general competence between species. Here is the standard Bostrom/Yudkowsky graph to display this notion.

There's a sense that while some mice are more genetically fit than others, they're broadly all just mice, bound within a relatively narrow range of competence. Chimps should not be worried about most mice, in the short or long term, but they also shouldn't worry especially so about peak mice - there's no incredibly strong or cunning mouse they ought to look out for.

However, my intuition is very different for humans. While I understand that humans are all broadly similar, and that a single human cannot have a complex adaptation that is not universal [1], I also believe that humans differ massively in cognitive capacities, in ways that lead to major disparities in general competence. The difference between someone who understands calculus and someone who does not is the difference between someone who can build a rocket and someone who cannot. I have tried to teach people that kind of math, and sometimes succeeded, and sometimes failed to teach even basic fractions.

I can try to operationalise my hypothesis: it seems plausible to me that if average human intelligence were at the level we would today score as an IQ of 75, society could not have built rockets or done a lot of other engineering and science.

(Sidenote: I think the hope of iterated amplification is that this is false. That if I have enough humans with hard limits to how much thinking they can do, stacking lots of them can still produce all the intellectual progress we're going to need. My initial thought is that this doesn't make sense, because there are many intellectual feats like writing a book or coming up with special relativity that I generally expect individuals (situated within a conducive culture and institutions) to be much better at than groups of individuals (e.g. companies).

This is also my understanding of Eliezer's critique, that while it's possible to get humans with hard limits on cognition to make mathematical progress, it's by running an algorithm on them that they don't understand, not running an algorithm that they do understand, and only if they understand it do you get nice properties about them being aligned in the same way you might feel many humans are today.

It's likely I'm wrong about the motivation behind Iterated Amplification though.)

This hypothesis doesn't imply that someone who can do successful abstract reasoning is strictly more competent than a whole society of people who cannot. The Secret of Our Success talks about how smart modern individuals stranded in forests fail to develop basic food preparation techniques that other, more primitive cultures were able to develop.

I'm saying that a culture with no people who can do calculus will in the long run score basically zero against the accomplishments of a culture with people who can.

One question is why we're in a culture so precariously balanced on this split between "can take off to the stars" and "mostly cannot". An idea I've heard before is that a culture which could easily become technologically mature would arrive later than a culture just barely able to do so, because evolution works over much longer time scales than cultural and technological innovation. As such, if you observe yourself to be in a culture that is able to become technologically mature, you're probably in "the stupidest such culture that could get there, because if it could be done at a stupider level then it would've happened there first."

As such, we're a species whereby if we try as hard as we can, if we take brains optimised for social coordination and make them do math, then we can just about reach technical maturity (i.e. build nanotech, AI, etc).

That may be true, but the question I want to ask about is what is it about humans, culture and brains that allows for such high variance within the species, that isn't true about mice and chimps? Something about this is still confusing to me. Like, if it is the case that some humans are able to do great feats of engineering like build rockets that land, and some aren't, what's the difference between these humans that causes such massive changes in outcome? Because, as above, it's not some big complex genetic adaptation some have and some don't. I think we're all running pretty similar genetic code.

Is there some simple amount of working memory that's required to do complex recursion? Like, 6 working memory slots makes things way harder than 7?

I can imagine that there are many hacks, and not a single thing. I'm reminded of the story of Richard Feynman learning to count time, where he'd practice being able to count out a whole minute. He'd do it while doing the laundry, while cooking breakfast, and so on. He later met the mathematician John Tukey, who could do the same, but they fiercely disagreed about what was possible while counting. Tukey said you couldn't do it while reading the newspaper, and Feynman said he could. Feynman said you couldn't do it while having a conversation, and Tukey said he could. They then both surprised each other by doing exactly what they had said they could.

It turned out Feynman was hearing numbers being spoken, whereas Tukey was visualising the numbers ticking over. So Feynman could still read at the same time, and his friend could still listen and talk.

The idea here is that if you're unable to use one type of cognitive resource, you may make up for it with another. This is probably the same situation as when you make trade-offs between space and time in computational complexity.

So I can imagine different humans finding different hacky ways to build up the skill of doing very abstract truth-tracking thinking. Perhaps you have a little less working memory than average, but a great capacity for visualisation, and you primarily work in areas that lend themselves to geometric / spatial thinking. Or perhaps your culture is very conducive to abstract thought in some way.

But even if this is right I'm interested in the details of what the key variables actually are.

[1] Note: humans can lack important pieces of machinery.

Discuss

### Logical Counterfactuals and Proposition graphs, Part 1

August 23, 2019 - 01:06
Published on August 22, 2019 10:06 PM UTC

I will use Greek letters to represent an arbitrary symbol, upper case for single symbols, lower case for strings.

Respecifying Propositional logic

The goal of this first section is to reformulate first order logic in a way that makes logical counterfactuals easier. Let's start with propositional logic.

We have a set of primitive propositions p,q,r,... as well as the symbols ⊤,⊥. We also have the symbols ∨,∧, which are technically functions from Bool²→Bool but will be written p∨q rather than ∨(p,q). There is also ¬:Bool→Bool.

Consider the equivalence rules.

1. α≡¬¬α

2. α∧β≡β∧α

3. (α∧β)∧γ≡α∧(β∧γ)

4. ¬α∧¬β≡¬(α∨β)

5. α∧⊤≡α

6. ¬α∨α≡⊤

7. ⊥≡¬⊤

8. α∧(β∨γ)≡(α∧β)∨(α∧γ)

9. ⊥∧α≡⊥

10. α≡α∧α

Theorem

Any tautology provable in propositional logic can be created by starting at ⊤ and repeatedly applying equivalence rules.

Proof

First consider α⟹β to be shorthand for ¬α∨β.

Lemma

We can convert ⊤ into any of the 3 axioms.

α⟹(β⟹α) is a shorthand for

¬α∨(¬β∨α)≡1

¬¬(¬α∨(¬β∨α))≡4

¬(¬¬α∧¬(¬β∨α))≡4

¬(¬¬α∧(¬¬β∧¬α))≡1

¬(¬¬α∧(β∧¬α))≡2

¬(¬¬α∧(¬α∧β))≡3

¬((¬¬α∧¬α)∧β)≡4

¬(¬(¬α∨α)∧β)≡6

¬(¬⊤∧β)≡7

¬(⊥∧β)≡9

¬⊥≡7

Similarly

(α⟹(β⟹γ))⟹((α⟹β)⟹(α⟹γ))

(¬α⟹¬β)⟹(β⟹α)

(if these can't be proved, add that they ≡⊤ as axioms)

End Lemma

Whenever you have α∧(α⟹β), that is equiv to

α∧(¬α∨β)≡8

(α∧¬α)∨(α∧β)≡1

(¬¬α∧¬α)∨(α∧β)≡4

¬(¬α∨α)∨(α∧β)≡6

¬⊤∨(α∧β)≡1

¬¬(¬⊤∨(α∧β))≡4

¬(¬¬⊤∧¬(α∧β))≡1

¬(⊤∧¬(α∧β))≡2

¬(¬(α∧β)∧⊤)≡5

¬¬(α∧β)≡1

α∧β

This means that you can create and apply axioms. For any tautology, look at the proof of it in standard propositional logic. Call the statements in this proof p1,p2,p3...

Suppose we have already found a sequence of substitutions from ⊤ to p1∧p2∧...∧pi−1.

Whenever pi is a new axiom, use (5.) to get p1∧p2...∧pi−1∧⊤, then convert ⊤ into the instance of the axiom you want (substituting α and β with arbitrary propositions in the above proof schema).

Using substitution rules (2.) and (3.) you can rearrange the terms representing lines in the proof and ignore their bracketing.

Whenever pi is produced by modus ponens from the previous pj and pk=pj⟹pi, duplicate pk with rule (10.), move one copy next to pj and use the previous procedure to turn pj∧(pj⟹pi) into pj∧pi. Then move pi to the end.

Once you reach the end of the proof, duplicate the result and unwind all the working back to ⊤, which can be removed by rule (5.)

Corollary

If {p,q,r}⊢s, then p∧q∧r≡p∧q∧r∧s

Because p∧q∧r⟹s is a tautology and can be applied to get s.

Corollary

Any contradiction is reachable from ⊥

The negation of any contradiction k is a tautology.

⊥≡¬⊤≡¬¬k≡k

Intuitive overview perspective 1

An illustration of rule 4. ¬α∧¬β≡¬(α∨β) in action.

We can consider a proposition to be a tree with a root. The nodes are labeled with symbols. The axiomatic equivalences become local modifications to the tree structure, which are also capable of duplicating and merging identical subtrees by (10.). Arbitrary subtrees can be created or deleted by (5.).

We can merge nodes with identical subtrees into a single node. This produces a directed acyclic graph, as shown above. Under this interpretation, all we have to do is test node identity.

Intuitive overview perspective 2

Consider each possible expression to be a single node within an infinite graph.

Each axiomatic equivalence above describes an infinite set of edges. To get a single edge, substitute the generic α,β... with particular expressions. For example, if you take (2. α∧β≡β∧α ) and substitute α:=p∨q and β:=¬q, you find an edge between the nodes (p∨q)∧¬q and ¬q∧(p∨q).

Here is a connected subsection of the graph. Note that, unlike the previous graph, this one is cyclic and edges are not directed.

All statements that are provably equivalent in propositional logic will be within the same connected component of the graph. All statements that can't be proved equivalent are in different components, with no path between them.

Finding a mathematical proof becomes an exercise in navigating an infinite maze.
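
To make the maze-navigation picture concrete, here is a small sketch (my own, not from the post) that treats each expression as a node, applies a few of the equivalence rules at every subterm to generate edges, and uses breadth-first search to find a path between two nodes - which is exactly a proof of equivalence. Only rules 1, 2 and 4 are implemented, and the nested-tuple expression encoding is made up for illustration:

```python
from collections import deque

# Expressions are nested tuples: ('not', e), ('and', a, b), ('or', a, b);
# primitive propositions are strings like 'p'.

def root_rewrites(e):
    """Rewrites applicable at the root, using rules 1, 2 and 4 only."""
    out = set()
    out.add(('not', ('not', e)))                      # rule 1: a -> not not a
    if isinstance(e, tuple):
        if e[0] == 'not' and isinstance(e[1], tuple) and e[1][0] == 'not':
            out.add(e[1][1])                          # rule 1: not not a -> a
        if e[0] == 'and':
            out.add(('and', e[2], e[1]))              # rule 2: commutativity
            if all(isinstance(c, tuple) and c[0] == 'not' for c in e[1:]):
                out.add(('not', ('or', e[1][1], e[2][1])))    # rule 4 ->
        if e[0] == 'not' and isinstance(e[1], tuple) and e[1][0] == 'or':
            out.add(('and', ('not', e[1][1]), ('not', e[1][2])))  # rule 4 <-
    return out

def neighbours(e):
    """All expressions one rewrite away (at the root or inside any subterm)."""
    result = set(root_rewrites(e))
    if isinstance(e, tuple):
        for i in range(1, len(e)):
            for sub in neighbours(e[i]):
                result.add(e[:i] + (sub,) + e[i + 1:])
    return result

def find_path(start, goal, max_steps=4):
    """Breadth-first search through the rewrite graph; a path is a proof."""
    frontier, seen = deque([(start, [start])]), {start}
    while frontier:
        e, path = frontier.popleft()
        if e == goal:
            return path
        if len(path) > max_steps:
            continue
        for n in neighbours(e) - seen:
            seen.add(n)
            frontier.append((n, path + [n]))
    return None

# ¬p ∧ ¬q  -> (commute) ->  ¬q ∧ ¬p  -> (rule 4) ->  ¬(q ∨ p)
path = find_path(('and', ('not', 'p'), ('not', 'q')),
                 ('not', ('or', 'q', 'p')))
```

Since the rules are sound, two statements in different connected components (such as p and ¬p) simply have no path between them, and the search comes back empty.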

In the next Post

We will see how to extend the equivalence-based proof system to an arbitrary first order theory, and what connectedness looks like there. We might even get on to infinite dimensional vector spaces, and why any of this relates to logical counterfactuals.

Discuss

### Time Travel, AI and Transparent Newcomb

August 23, 2019 - 01:04
Published on August 22, 2019 10:04 PM UTC

Epistemic status: has "time travel" in the title.

Let's suppose, for the duration of this post, that the local physics of our universe allows for time travel. The obvious question is: how are paradoxes prevented?

We may not have any idea how paradoxes are prevented, but presumably there must be some prevention mechanism. So, in a purely Bayesian sense, we can condition on paradoxes somehow not happening, and then ask what becomes more or less likely. In general, anything which would make a time machine more likely to be built should become less likely, and anything which would prevent a time machine being built should become more likely.

In other words: if we're trying to do something which would make time machines more likely to be built, this argument says that we should expect things to mysteriously go wrong.

For instance, let's say we're trying to build some kind of powerful optimization process which might find time machines instrumentally useful for some reason. To the extent that such a process is likely to build time machines and induce paradoxes, we would expect things to mysteriously go wrong when trying to build the optimizer in the first place.

On the flip side: we could commit to designing our powerful optimization process so that it not only avoids building time machines, but also actively prevents time machines from being built. Then the mysterious force should work in our favor: we would expect things to mysteriously go well. We don't need time-travel-prevention to be the optimization process' sole objective here, it just needs to make time machines sufficiently less likely to get an overall drop in the probability of paradox.

Discuss

### Embedded Naive Bayes

August 23, 2019 - 00:40
Published on August 22, 2019 9:40 PM UTC

Suppose we have a bunch of earthquake sensors spread over an area. They are not perfectly reliable (in terms of either false positives or false negatives), but some are more reliable than others. How can we aggregate the sensor data to detect earthquakes?

One natural procedure: ask a seismologist to assign each sensor an intuitive reliability score, add up the scores of the sensors which fire, and declare an earthquake whenever the total "earthquake score" is high enough. It turns out that this procedure is equivalent to a Naive Bayes model.

Naive Bayes is a causal model in which there is some parameter θ in the environment which we want to know about - i.e. whether or not there’s an earthquake happening. We can’t observe θ directly, but we can measure it indirectly via some data {xi} - i.e. outputs from the earthquake sensors. The measurements may not be perfectly accurate, but their failures are at least independent - one sensor isn’t any more or less likely to be wrong when another sensor is wrong.

We can represent this picture with a causal diagram:

From the diagram, we can read off the model’s equation: P[θ,{xi}]=P[θ]∏iP[xi|θ]. We’re interested mainly in the posterior probability P[θ|{xi}]=(1/Z)P[θ]∏iP[xi|θ] or, in log odds form,

L[θ|{xi}] = ln(P[θ]/P[∼θ]) + ∑i ln(P[xi|θ]/P[xi|∼θ])

Stare at that equation, and it’s not hard to see how the seismologist’s procedure turns into a Naive Bayes model: the seismologist’s intuitive scores for each sensor correspond to the “evidence” from the sensor, ln(P[xi|θ]/P[xi|∼θ]). The “earthquake score” then corresponds to the posterior log odds of an earthquake. The seismologist has unwittingly adopted a statistical model. Note that this is still true regardless of whether the scores used are well-calibrated or whether the assumptions of the model hold - the seismologist is implicitly using this model, and whether the model is correct is an entirely separate question.
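
As a sanity check on this equivalence, here is a small sketch (the sensor reliabilities, prior, and readings are all made up for illustration - none of these numbers come from the post) showing that summing per-sensor evidence scores in log odds form gives exactly the same posterior as applying Bayes' rule directly to the full Naive Bayes model:

```python
from math import log, exp

# Hypothetical numbers: (P[sensor fires | quake], P[sensor fires | no quake]).
sensors = [(0.9, 0.1), (0.7, 0.3), (0.6, 0.4)]
prior_quake = 0.01

def log_odds_posterior(readings):
    """L[quake | readings]: prior log odds plus per-sensor evidence scores."""
    L = log(prior_quake / (1 - prior_quake))
    for (p_q, p_nq), x in zip(sensors, readings):
        if x:                                   # sensor fired
            L += log(p_q / p_nq)
        else:                                   # sensor stayed silent
            L += log((1 - p_q) / (1 - p_nq))
    return L

def posterior_direct(readings):
    """Bayes' rule applied directly to the joint Naive Bayes model."""
    def joint(quake):
        p = prior_quake if quake else 1 - prior_quake
        for (p_q, p_nq), x in zip(sensors, readings):
            p_fire = p_q if quake else p_nq
            p *= p_fire if x else 1 - p_fire
        return p
    return joint(True) / (joint(True) + joint(False))

readings = [1, 1, 0]
p = 1 / (1 + exp(-log_odds_posterior(readings)))   # log odds -> probability
```

The two computations agree exactly; a seismologist with miscalibrated scores is effectively running the same sums with wrong per-sensor probabilities plugged in.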

The Embedded Naive Bayes Equation

Let’s formalize this a bit.

We have some system which takes in data x, computes some stuff, and spits out some f(x). We want to know whether a Naive Bayes model is embedded in f(x). Conceptually, we imagine that f(x) parameterizes a probability distribution over some unobserved parameter θ - we’ll write P[θ;f(x)], where the “;” is read as “parameterized by”. For instance, we could imagine a normal distribution over θ, in which case f(x) might be the mean and variance (or any encoding thereof) computed from our input data. In our earthquake example, θ is a binary variable, so f(x) is just some encoding of the probability that θ=True.

Now let’s write the actual equation defining an embedded Naive Bayes model. We assert that P[θ;f(x)] is the same as P[θ|x] under the model, i.e.

P[θ;f(x)] = P[θ|x] = (1/Z)P[θ]∏iP[xi|θ]

We can transform to log odds form to get rid of the Z:

L[θ;f(x)] = ln(P[θ]/P[∼θ]) + ∑i ln(P[xi|θ]/P[xi|∼θ])

Let’s pause for a moment and go through that equation. We know the function f(x), and we want the equation to hold for all values of x. θ is some hypothetical thing out in the environment - we don’t know what it corresponds to, we just hypothesize that the system is modelling something it can’t directly observe. As with x, we want the equation to hold for all values of θ. The unknowns in the equation are the probability functions P[θ;f(x)], P[θ] and P[xi|θ]. To make it clear what’s going on, let’s remove the probability notation for a moment, and just use functions G and {gi}, with θ written as a subscript:

∀θ,x: Gθ(f(x)) = cθ + ∑i gθi(xi)

This is a functional equation: for each value of θ, we want to find functions G, {gi}, and a constant c such that the equation holds for all possible x values. The solutions G and {gi} can then be decoded to give our probability functions P[θ;f(x)] and P[xi|θ], while c can be decoded to give our prior P[θ]. Each possible θ-value corresponds to a different set of solutions Gθ, {gθi}, cθ.

This particular functional equation is a variant of Pexider’s equation; you can read all about it in Aczel’s Functional Equations and Their Applications, chapter 3. For our purposes, the most important point is: depending on the function f, the equation may or may not have a solution. In other words, there is a meaningful sense in which some functions f(x) do embed a Naive Bayes model, and others do not. Our seismologist’s procedure does embed a Naive Bayes model: let G be the identity function, c be zero, and gi(xi)=si·xi (with si the seismologist's score for sensor i), and we have a solution to the embedding equation with f(x) given by our seismologist’s add-all-the-scores calculation (although this is not the only solution). On the other hand, a procedure computing f(x)=x1^(x2^x3) for real-valued inputs x1, x2, x3 would not embed a Naive Bayes model: with this f(x), the embedding equation would not have any solutions.
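
To see numerically that some functions f embed a model this way and others don't, here is a sketch of my own for the special case where G is the identity and the inputs are binary: f then embeds iff it is additively separable, i.e. f(x) = c + Σi gi(xi), which we can test by checking that every mixed second difference vanishes. The score values and the non-additive counterexample below are made up for illustration (the post's real-valued counterexample can't be brute-forced, so a product of binary inputs stands in for it):

```python
from itertools import product

def is_additively_separable(f, n):
    """For binary inputs, f(x) = c + sum_i g_i(x_i) holds for some g_i, c
    iff every mixed second difference over each pair of coordinates
    vanishes (holding all other coordinates fixed)."""
    for i in range(n):
        for j in range(i + 1, n):
            for rest in product([0, 1], repeat=n):
                def f_at(bi, bj):
                    x = list(rest)
                    x[i], x[j] = bi, bj
                    return f(x)
                diff = f_at(1, 1) - f_at(1, 0) - f_at(0, 1) + f_at(0, 0)
                if abs(diff) > 1e-9:
                    return False
    return True

scores = [2.0, -1.0, 0.5]                           # made-up sensor scores
f_scores = lambda x: sum(s * xi for s, xi in zip(scores, x))
f_tangled = lambda x: x[0] * x[1] * x[2]            # non-additive interaction
```

The add-all-the-scores f passes the test, while the entangled f fails: no choice of per-input functions gi can reproduce it.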

Discuss

### Intentional Bucket Errors

August 22, 2019 - 23:02
Published on August 22, 2019 8:02 PM UTC

I want to illustrate a research technique that I use sometimes. (My actual motivation for writing this is to make it so that I don't feel as much like I need to defend myself when I use this technique.) I am calling it intentional bucket errors, after a CFAR concept called bucket errors. The bucket errors concept is about noticing when multiple different concepts/questions are stored in your head as a single concept/question. By noticing this, you can then think about the different concepts/questions separately.

What are Intentional Bucket Errors

Bucket errors are normally thought of as a bad thing - it has "errors" right in the name. However, I want to argue that bucket errors can sometimes be useful, and that you might want to consider having some bucket errors on purpose. You can do this by taking multiple different concepts and just pretending that they are all the same. This usually only works if the concepts started out sufficiently close together.

Like many techniques that work by acting as though you believe something false, you should use this technique responsibly. The goal is to pretend that the concepts are the same to help you gain traction on thinking about them, but then to also be able to go back to inhabiting the world where they are actually different.

Why use Intentional Bucket Errors

Why might you want to use intentional bucket errors? For one, maybe the concepts actually are the same, but they look different enough that you won't let yourself consider the possibility. I think this is especially likely to happen if the concepts come from very different fields or areas of your life. Sometimes it feels silly to draw strong connections between e.g. human rationality, AI alignment, evolution, economics, etc., but such connections can be useful.

Also I find this useful for gaining traction. There is something useful about constrained optimization for being able to start thinking about a problem. Sometimes it is harder to say something true and useful about X than it is to say something true and useful that simultaneously applies to X, Y, and Z. This is especially true when the concepts you are conflating are imagined solutions to problems.

For example, maybe I have an imagined solution to counterfactuals that has a hole in it that looks like understanding multi-level world models. Then, maybe I also have an imagined solution to tiling that also has a hole in it that looks like understanding multi-level world models. I could view these as two separate problems. The desired properties of my MLWM theory for counterfactuals might be different from the desired properties for tiling. I have these two different holes I want to fill, and one strategy, which superficially looks like it makes the problem harder, is to try to find something that can fill both holes simultaneously. However, this can sometimes be easier, because different use cases help you triangulate the simple theory from which the specific solutions can be derived.

A lighter (maybe epistemically safer) version of intentional bucket errors is just to pay a bunch of attention to the connections between the concepts. This has its own advantages in that the relationships between the concepts might be interesting. However, I personally prefer to just throw them all in together, since this way I only have to work with one object, and it takes up fewer working memory slots while I'm thinking about it.

Examples

Here are a some recent examples where I feel like I have used something like this, to varying degrees.

How the MtG Color Wheel Explains AI Safety is obviously the product of conflating many things together without worrying too much about how all the clusters are wrong.

In How does Gradient Descent Interact with Goodhart, the question at the top about rocket designs and human approval is really very different from the experiments that I suggested, but I feel like learning about one might help my intuitions about the other. This was actually generated at the same time as I was thinking about Epistemic Tenure, which for me was partially about the expectation that there is good research and a correlated proxy of justifiable research, and that even though our group idea-selection mechanism is going to optimize for justifiable research, it is better if the inner optimization loops in the humans do not directly follow those incentives. The connection is a bit of a stretch in hindsight, but believing in the connection was instrumental in giving me traction on thinking about all the problems.

Embedded Agency has a bunch of this, just because I was trying to factor a big problem into a small number of subfields, but the Robust Delegation section can sort of be described as "Tiling and corrigibility kind of look similar if you squint. What happens when I just pretend they are two instantiations of the same problem?"

Discuss

### Computational Model: Causal Diagrams with Symmetry

August 22, 2019 - 20:54
Published on August 22, 2019 5:54 PM UTC

Consider the following program:

f(n):
  if n == 0:
    return 1
  return n * f(n-1)

Let’s think about the process by which this function is evaluated. We want to sketch out a causal DAG showing all of the intermediate calculations and the connections between them (feel free to pause reading and try this yourself).

Here’s what the causal DAG looks like:

Each dotted box corresponds to one call to the function f. The recursive call in f becomes a symmetry in the causal diagram: the DAG consists of an infinite sequence of copies of the same subcircuit.

More generally, we can represent any Turing-computable function this way. Just take some pseudocode for the function, and expand out the full causal DAG of the calculation. In general, the diagram will either be finite or have symmetric components - the symmetry is what allows us to use a finite representation even though the graph itself is infinite.
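
To make this concrete, here is a sketch (my own construction, not from the post) which unrolls the example program into an explicit DAG of primitive operations; each recursive call contributes another copy of the same three-node subcircuit:

```python
def trace_factorial(n):
    """Unroll the evaluation of f(n) into an explicit DAG.
    Nodes map a name to (operation, list of parent node names)."""
    dag = {}
    def f(k):
        dag[f'n={k}'] = ('const', [])
        dag[f'test_{k}'] = ('== 0', [f'n={k}'])
        if k == 0:
            dag[f'f({k})'] = ('return 1', [f'test_{k}'])
        else:
            f(k - 1)                 # recursive call: one more copy below
            dag[f'f({k})'] = ('mul', [f'n={k}', f'f({k - 1})'])
    f(n)
    return dag

dag = trace_factorial(3)
# Every call with k > 0 contributes the same three-node pattern
# {n=k, test_k, f(k)}, wired to the copy for k-1 beneath it.
```

The symmetry is visible in the data structure: the only thing distinguishing one dotted box from the next is the index k and the wire to the copy below.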

Why would we want to do this?

For our purposes, the central idea of embedded agency is to take these black-box systems which we call “agents”, and break open the black boxes to see what’s going on inside.

Causal DAGs with symmetry are how we do this for Turing-computable functions in general. They show the actual cause-and-effect process which computes the result; conceptually they represent the computation rather than a black-box function.

In particular, a causal DAG + symmetry representation gives us all the natural machinery of causality - most notably counterfactuals. We can ask questions like “what would happen if I reached in and flipped a bit at this point in the computation?” or “what value would f(5) return if f(3) were 11?”. We can pose these questions in a well-defined, unambiguous way without worrying about logical counterfactuals, and without adding any additional machinery. This becomes particularly important for embedded optimization: if an “agent” (e.g. an organism) wants to plan ahead to achieve an objective (e.g. find food), it needs to ask counterfactual questions like “how much food would I find if I kept going straight?”.
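
A minimal sketch of such a counterfactual query on the factorial computation (the do-style override interface here is my own invention for illustration, not machinery from the post):

```python
def factorial_with_interventions(n, do=None):
    """Evaluate f(n), but allow reaching into the computation's DAG and
    forcing the node f(k) to a chosen value (do maps k to that value)."""
    do = do or {}
    if n in do:
        return do[n]   # intervention: node cut loose from its usual parents
    if n == 0:
        return 1
    return n * factorial_with_interventions(n - 1, do)

unintervened = factorial_with_interventions(5)
# "What value would f(5) return if f(3) were 11?"  ->  5 * 4 * 11
counterfactual = factorial_with_interventions(5, do={3: 11})
```

The intervention severs the edge from f(2) into f(3) and propagates the forced value downstream, which is exactly Pearl-style surgery on the computation's DAG.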

The other main reason we would want to represent functions as causal DAGs with symmetry is because our universe appears to be one giant causal DAG with symmetry.

Because our universe is causal, any computation performed in our universe must eventually bottom out in a causal DAG. We can write our programs in any language we please, but eventually they will be compiled down to machine code and run by physical transistors made of atoms which are themselves governed by a causal DAG. In most cases, we can represent the causal computational process at a more abstract level - e.g. in our example program, even though we didn’t talk about registers or transistors or electric fields, the causal diagram we sketched out would still accurately represent the computation performed even at the lower levels.

This raises the issue of abstraction - the core problem of embedded agency. My own main use-case for the causal diagram + symmetry model of computation is formulating models of abstraction: how can one causal diagram (possibly with symmetry) represent another in a way which makes counterfactual queries on the map correspond to some kind of counterfactual on the territory? Can that work when the “map” is a subDAG of the territory DAG? It feels like causal diagrams + symmetry are the minimal computational model needed to get agency-relevant answers to this sort of question.

Learning

The traditional ultimate learning algorithm is Solomonoff Induction: take some black-box system which spews out data, and look for short programs which reproduce that data. But the phrase “black-box” suggests that perhaps we could do better by looking inside that box.

To make this a little bit more concrete: imagine I have some python program running on a server which responds to http requests. Solomonoff Induction would look at the data returned by requests to the program, and learn to predict the program’s behavior. But that sort of black-box interaction is not the only option. The program is running on a physical server somewhere - so, in principle, we could go grab a screwdriver and a tiny oscilloscope and directly observe the computation performed by the physical machine. Even without measuring every voltage on every wire, we may at least get enough data to narrow down the space of candidate programs in a way which Solomonoff Induction could not do. Ideally, we’d gain enough information to avoid needing to search over all possible programs.

Compared to Solomonoff Induction, this process looks a lot more like how scientists actually study the real world in practice: there’s lots of taking stuff apart and poking at it to see what makes it tick.

In general, though, how to learn causal DAGs with symmetry is still an open question. We’d like something like Solomonoff Induction, but which can account for partial information about the internal structure of the causal DAG, rather than just overall input-output behavior. (In principle, we could shoehorn this whole thing into traditional Solomonoff Induction by treating information about the internal DAG structure as normal old data, but that doesn’t give us a good way to extract the learned DAG structure.)

We already have algorithms for learning causal structure in general. Pearl’s Causality sketches out some such algorithms in chapter 2, although they’re only practical for either very small systems or very large amounts of data. Bayesian structure learning can handle larger systems with less data, though sometimes at the cost of a very large amount of compute - i.e. estimating high-dimensional integrals.

However, in general, these approaches don’t directly account for symmetry of the learned DAGs. Ideally, we would use a prior which weights causal DAGs according to the size of their representation - i.e. infinite DAGs would still have nonzero prior probability if they have some symmetry allowing for finite representation, and in general DAGs with multiple copies of the same sub-DAG would have higher probability. This isn’t quite the same as weighting by minimum description length in the Solomonoff sense, since we care specifically about symmetries which correspond to function calls - i.e. isomorphic subDAGs. We don’t care about graphs which can be generated by a short program but don’t have these sorts of symmetries. So that leaves the question: if our prior probability for a causal DAG is given by a notion of minimum description length which only allows compression by specifying re-used subcircuits, what properties will the resulting learning algorithm possess? Is it computable? What kinds of data are needed to make it tractable?
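
As a toy illustration of this kind of prior (entirely my own sketch - the real open question above is much harder), here is a description-length measure over circuits written as nested tuples, which charges for each distinct subcircuit only once, so re-used subcircuits make a graph cheaper to describe:

```python
def description_length(expr, seen=None):
    """Toy description length for circuits written as nested tuples
    (op, child, child, ...): one unit per *distinct* subcircuit, so a
    re-used subcircuit is paid for once and thereafter referenced free."""
    if seen is None:
        seen = set()
    if expr in seen:
        return 0        # already described: a reference, not a fresh copy
    seen.add(expr)
    if not isinstance(expr, tuple):
        return 1        # a leaf: an input or constant
    return 1 + sum(description_length(c, seen) for c in expr[1:])

shared = ('add', ('mul', 'a', 'b'), ('mul', 'a', 'b'))    # re-used subcircuit
unshared = ('add', ('mul', 'a', 'b'), ('mul', 'c', 'd'))  # no re-use
```

Under a prior like exp(-description_length), the circuit with the repeated subcircuit is strictly more probable than the one without, even though both have the same tree size - compression is allowed only via re-used subcircuits, not via arbitrary short programs.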

Discuss

### Simulation Argument: Why aren't ancestor simulations outnumbered by transhumans?

August 22, 2019 - 20:29
Published on August 22, 2019 9:07 AM UTC

This is a point of confusion I still have with the simulation argument: upon learning that we are in an ancestor simulation, should we be any less surprised? It would be odd for a future civilization to dedicate a large fraction of its computational resources to simulating early 21st century humans instead of happy transhumans living in base reality; shouldn't we therefore be equally perplexed that we aren't transhumans?

I guess the question boils down to the choice of reference classes, so what makes the reference class "early 21st century humans" so special? Why not widen the reference class to include all conscious minds, or narrow it down to the exact quantum state of a brain?

Furthermore, if you're convinced by the simulation argument, why not believe that you're a Boltzmann brain instead, using the same line of argument?

Discuss

### [AN #62] Are adversarial examples caused by real but imperceptible features?

22 August 2019 - 20:10
Published on August 22, 2019 5:10 PM UTC

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

Call for contributors to the Alignment Newsletter (Rohin Shah): I'm looking for content creators and a publisher for this newsletter! Apply by September 6.

Adversarial Examples Are Not Bugs, They Are Features (Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom et al) (summarized by Rohin and Cody): Distill published a discussion of this paper. This highlights section will cover the full discussion; all of these summaries and opinions are meant to be read together.

Consider two possible explanations of adversarial examples. First, they could arise because the model "hallucinates" a signal that is not useful for classification and becomes very sensitive to this feature. We could call these "bugs", since they don't generalize well. Second, they could be caused by features that do generalize to the test set, but can be modified by an adversarial perturbation. We could call these "non-robust features" (as opposed to "robust features", which can't be changed by an adversarial perturbation). The authors argue that at least some adversarial perturbations fall into the second category of being informative but sensitive features, based on two experiments.

If the "hallucination" explanation were true, the hallucinations would presumably be caused by the training process, the choice of architecture, the size of the dataset, but not by the type of data. So one thing to do would be to see if we can construct a dataset such that a model trained on that dataset is already robust, without adversarial training. The authors do this in the first experiment. They take an adversarially trained robust classifier, and create images whose features (final-layer activations of the robust classifier) match the features of some unmodified input. The generated images only have robust features because the original classifier was robust, and in fact models trained on this dataset are automatically robust.

If the "non-robust features" explanation were true, then it should be possible for a model to learn on a dataset containing only non-robust features (which will look nonsensical to humans) and still generalize to a normal-looking test set. In the second experiment (henceforth WrongLabels), the authors construct such a dataset. Their hypothesis is that adversarial perturbations work by introducing non-robust features of the target class. So, to construct their dataset, they take an image x with original label y, adversarially perturb it towards some class y' to get image x', and then add (x', y') to their dataset (even though to a human x' looks like class y). They have two versions of this: in RandLabels, the target class y' is chosen randomly, whereas in DetLabels, y' is chosen to be y + 1. For both datasets, if you train a new model on the dataset, you get good performance on the original test set, showing that the "non-robust features" do generalize.
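
The WrongLabels construction can be sketched with a stand-in linear model and a single targeted signed-gradient step (a simplification of the paper's iterated attacks; the model, dimensions, and step size here are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_class, dim = 3, 10
W = rng.normal(size=(n_class, dim))   # stand-in "trained" linear classifier

def targeted_perturbation(x, target, eps=0.5):
    """Nudge x toward class `target` under the linear logits W @ x.

    The gradient of the target logit w.r.t. x is just W[target], so one
    signed-gradient step suffices for this toy model (real attacks use
    iterated PGD on a neural net).
    """
    return x + eps * np.sign(W[target])

def make_wronglabels(xs, ys, deterministic=True):
    """Relabel each (x, y) as (x', y'), with x' perturbed toward y'.

    deterministic=True mimics DetLabels (y' = y + 1 mod n_class);
    deterministic=False picks y' at random, mimicking RandLabels.
    """
    dataset = []
    for x, y in zip(xs, ys):
        y_t = (y + 1) % n_class if deterministic else rng.integers(n_class)
        dataset.append((targeted_perturbation(x, y_t), y_t))
    return dataset
```

Training a fresh model on the returned dataset and testing it on the original, unperturbed test set is the experiment that shows non-robust features generalize.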

Rohin's opinion: I buy this hypothesis. It's a plausible explanation for brittleness towards adversarial noise ("because non-robust features are useful to reduce loss"), and why adversarial examples transfer across models ("because different models can learn the same non-robust features"). In fact, the paper shows that architectures that did worse in WrongLabels (and so presumably are bad at learning non-robust features) are also the ones to which adversarial examples transfer the least. I'll leave the rest of my opinion to the opinions on the responses.

Read more: Paper and Author response

Response: Learning from Incorrectly Labeled Data (Eric Wallace): This response notes that all of the experiments are of the form: create a dataset D that is consistent with a model M; then, when you train a new model M' on D you get the same properties as M. Thus, we can interpret these experiments as showing that model distillation can work even with data points that we would naively think of "incorrectly labeled". This is a more general phenomenon: we can take an MNIST model, select only the examples for which the top prediction is incorrect (labeled with these incorrect top predictions), and train a new model on that -- and get nontrivial performance on the original test set, even though the new model has never seen a "correctly labeled" example.

Rohin's opinion: I definitely agree that these results can be thought of as a form of model distillation. I don't think this detracts from the main point of the paper: the reason model distillation works even with incorrectly labeled data is probably because the data is labeled in such a way that it incentivizes the new model to pick out the same features that the old model was paying attention to.

Response: Robust Feature Leakage (Gabriel Goh): This response investigates whether the datasets in WrongLabels could have had robust features. Specifically, it checks whether a linear classifier over provably robust features trained on the WrongLabels dataset can get good accuracy on the original test set. This shouldn't be possible since WrongLabels is meant to correlate only non-robust features with labels. It finds that you can get some accuracy with RandLabels, but you don't get much accuracy with DetLabels.

The original authors can actually explain this: intuitively, you get accuracy with RandLabels because it's less harmful to choose labels randomly than to choose them explicitly incorrectly. With random labels on unmodified inputs, robust features should be completely uncorrelated with accuracy. However, with random labels followed by an adversarial perturbation towards the label, there can be some correlation, because the adversarial perturbation can add "a small amount" of the robust feature. However, in DetLabels, the labels are wrong, and so the robust features are negatively correlated with the true label, and while this can be reduced by an adversarial perturbation, it can't be reversed (otherwise it wouldn't be robust).

Rohin's opinion: The original authors' explanation of these results is quite compelling; it seems correct to me.

Response: Adversarial Examples are Just Bugs, Too (Preetum Nakkiran): The main point of this response is that adversarial examples can be bugs too. In particular, if you construct adversarial examples that explicitly don't transfer between models, and then run WrongLabels with such adversarial perturbations, then the resulting model doesn't perform well on the original test set (and so it must not have learned non-robust features).

It also constructs a data distribution where every useful feature of the optimal classifier is guaranteed to be robust, and shows that we can still get adversarial examples with a typical model, showing that it is not just non-robust features that cause adversarial examples.

In their response, the authors clarify that they didn't intend to claim that adversarial examples could not arise due to "bugs", just that "bugs" were not the only explanation. In particular, they say that their main thesis is “adversarial examples will not just go away as we fix bugs in our models”, which is consistent with the point in this response.

Rohin's opinion: Amusingly, I think I'm more bullish on the original paper's claims than the authors themselves. It's certainly true that adversarial examples can arise from "bugs": if your model overfits to your data, then you should expect adversarial examples along the overfitted decision boundary. The dataset constructed in this response is a particularly clean example: the optimal classifier would have an accuracy of 90%, but the model is trained to accuracy 99.9%, which means it must be overfitting.

However, I claim that with large and varied datasets with neural nets, we are typically not in the regime where models overfit to the data, and the presence of "bugs" in the model will decrease. (You certainly can get a neural net to be "buggy", e.g. by randomly labeling the data, but if you're using real data with a natural task then I don't expect it to happen to a significant degree.) Nonetheless, adversarial examples persist, because the features that models use are not the ones that humans use.

It's also worth noting that this experiment strongly supports the hypothesis that adversarial examples transfer because they are real features that generalize to the test set.

Response: Adversarial Example Researchers Need to Expand What is Meant by ‘Robustness’ (Justin Gilmer et al): This response argues that the results in the original paper are simply a consequence of a generally accepted principle: "models lack robustness to distribution shift because they latch onto superficial correlations in the data". This isn't just about L_p norm ball adversarial perturbations: for example, one recent paper shows that if the model is only given access to high frequency features of images (which look uniformly grey to humans), it can still get above 50% accuracy. In fact, when we do adversarial training to become robust to L_p perturbations, then the model pays attention to different non-robust features and becomes more vulnerable to e.g. low-frequency fog corruption. The authors call for adversarial examples researchers to move beyond L_p perturbations and think about the many different ways models can be fragile, and to make them more robust to distributional shift.

Rohin's opinion: I strongly agree with the worldview behind this response, and especially the principle they identified. I didn't know this was a generally accepted principle, though of course I am not an expert on distributional robustness.

One thing to note is what is meant by "superficial correlation" here. It means a correlation that really does exist in the dataset, that really does generalize to the test set, but that doesn't generalize out of distribution. A better term might be "fragile correlation". All of the experiments so far have been looking at within-distribution generalization (aka generalization to the test set), and are showing that non-robust features do generalize within-distribution. This response is arguing that there are many such non-robust features that will generalize within-distribution but will not generalize under distributional shift, and we need to make our models robust to all of them, not just L_p adversarial perturbations.

Response: Two Examples of Useful, Non-Robust Features (Gabriel Goh): This response studies linear features, since we can analytically compute their usefulness and robustness. It plots the singular vectors of the data as features, and finds that such features are either robust and useful, or non-robust and not useful. However, you can get useful, non-robust features by ensembling or contamination (see response for details).

Response: Adversarially Robust Neural Style Transfer (Reiichiro Nakano): The original paper showed that adversarial examples don't transfer well to VGG, and that VGG doesn't tend to learn similar non-robust features as a ResNet. Separately, VGG works particularly well for style transfer. Perhaps since VGG doesn't capture non-robust features as well, the results of style transfer look better to humans? This response and the author's response investigate this hypothesis in more detail and find that it seems broadly supported, but there are still finicky details to be worked out.

Rohin's opinion: This is an intriguing empirical fact. However, I don't really buy the theoretical argument that style transfer works because it doesn't use non-robust features, since I would typically expect that a model that doesn't use L_p-fragile features would instead use features that are fragile or non-robust in some other way.

Technical AI alignment

Problems

Problems in AI Alignment that philosophers could potentially contribute to (Wei Dai): Exactly what it says. The post is short enough that I'm not going to summarize it -- it would be as long as the original.

Iterated amplification

Delegating open-ended cognitive work (Andreas Stuhlmüller): This is the latest explanation of the approach Ought is experimenting with: Factored Evaluation (in contrast to Factored Cognition (AN #36)). With Factored Cognition, the idea was to recursively decompose a high-level task until you reach subtasks that can be directly solved. Factored Evaluation still does recursive decomposition, but now it is aimed at evaluating the work of experts, along the same lines as recursive reward modeling (AN #34).

This shift means that Ought is attacking a very natural problem: how to effectively delegate work to experts while avoiding principal-agent problems. In particular, we want to design incentives such that untrusted experts under the incentives will be as helpful as experts intrinsically motivated to help. The experts could be human experts or advanced ML systems; ideally our incentive design would work for both.

Currently, Ought is running experiments with reading comprehension on Wikipedia articles. The experts get access to the article while the judge does not, but the judge can check whether particular quotes come from the article. They would like to move to tasks that have a greater gap between the experts and the judge (e.g. allowing the experts to use Google), and to tasks that are more subjective (e.g. whether the judge should get Lasik surgery).

Rohin's opinion: The switch from Factored Cognition to Factored Evaluation is interesting. While it does make it more relevant outside the context of AI alignment (since principal-agent problems abound outside of AI), it still seems like the major impact of Ought is on AI alignment, and I'm not sure what the difference is there. In iterated amplification (AN #30), when decomposing tasks in the Factored Cognition sense, you would use imitation learning during the distillation step, whereas with Factored Evaluation, you would use reinforcement learning to optimize the evaluation signal. The switch would be useful if you expect the reinforcement learning to work significantly better than imitation learning.

However, with Factored Evaluation, the agent that you train iteratively is one that must be good at evaluating tasks, and then you'd need another agent that actually performs the task (or you could train the same agent to do both). In contrast, with Factored Cognition you only need an agent that is performing the task. If the decompositions needed to perform the task are different from the decompositions needed to evaluate the task, then Factored Cognition would presumably have an advantage.

Miscellaneous (Alignment)

Clarifying some key hypotheses in AI alignment (Ben Cottier et al): This post (that I contributed to) introduces a diagram that maps out important and controversial hypotheses for AI alignment. The goal is to help researchers identify and more productively discuss their disagreements.

Near-term concerns

Privacy and security

Evaluating and Testing Unintended Memorization in Neural Networks (Nicholas Carlini et al)

Machine ethics

Towards Empathic Deep Q-Learning (Bart Bussmann et al): This paper introduces the empathic DQN, which is inspired by the golden rule: "Do unto others as you would have them do unto you". Given a specified reward, the empathic DQN optimizes for a weighted combination of the specified reward, and the reward that other agents in the environment would get if they were a copy of the agent. They show that this results in resource sharing (when there are diminishing returns to resources) and avoiding conflict in two toy gridworlds.

Rohin's opinion: This seems similar in spirit to impact regularization methods: the hope is that this is a simple rule that prevents catastrophic outcomes without having to solve all of human values.

AI strategy and policy

AI Algorithms Need FDA-Style Drug Trials (Olaf J. Groth et al)

Other progress in AI

Critiques (AI)

Evidence against current methods leading to human level artificial intelligence (Asya Bergal and Robert Long): This post briefly lists arguments that current AI techniques will not lead to high-level machine intelligence (HLMI), without taking a stance on how strong these arguments are.

News

Ought: why it matters and ways to help (Paul Christiano): This post discusses the work that Ought is doing, and makes a case that it is important for AI alignment (see the summary for Delegating open-ended cognitive work above). Readers can help Ought by applying for their web developer role, by participating in their experiments, and by donating.

Project Proposal: Considerations for trading off capabilities and safety impacts of AI research (David Krueger): This post calls for a thorough and systematic evaluation of whether AI safety researchers should worry about the impact of their work on capabilities.

Discuss

### Implications of Quantum Computing for Artificial Intelligence Alignment Research

22 August 2019 - 13:33
Published on August 22, 2019 10:33 AM UTC

[Crossposted from arXiv]

ABSTRACT: We introduce a heuristic model of Quantum Computing and apply it to argue that a deep understanding of quantum computing is unlikely to be helpful to address current bottlenecks in Artificial Intelligence Alignment. Our argument relies on the claims that Quantum Computing leads to compute overhang instead of algorithmic overhang, and that the difficulties associated with the measurement of quantum states do not invalidate any major assumptions of current Artificial Intelligence Alignment research agendas. We also discuss tripwiring, adversarial blinding, informed oversight and side effects as possible exceptions.

KEYWORDS: Quantum Computing, Artificial Intelligence Alignment, Quantum Speedup, Quantum Obfuscation, Quantum Resource Asymmetry.

EPISTEMIC STATUS: Exploratory, we could have overlooked key considerations.

Introduction

Quantum Computing (QC) is a disruptive technology that may not be too far over the horizon. Small proof-of-concept quantum computers have already been built [1] and major obstacles to large-scale quantum computing are being heavily researched [2].

Among its potential uses, QC will allow breaking classical cryptographic codes, simulating large quantum systems, and performing faster search and optimization [3]. This last use case is of particular interest to Artificial Intelligence (AI) Strategy. In particular, variants of Grover's algorithm can be exploited to gain a quadratic speedup in search problems, and some recent Quantum Machine Learning (QML) developments have led to exponential gains in certain Machine Learning tasks [4] (though with important caveats which may invalidate their practical use [5]).

These ideas have the potential to exert a transformative effect on research in AI (as noted in [6], for example). Furthermore, the technical aspects of QC, which put some physical limits on the observation of the inner workings of a quantum machine and hinder the verification of quantum computations [7], may pose an additional challenge for AI Alignment concerns.

In this short article we introduce a heuristic model of quantum computing that captures the most relevant characteristics of QC for technical AI Alignment research.

We then apply our model to abstractly answer in which areas we expect knowledge of QC might be relevant, and discuss four specific avenues of current research where it might come into play: tripwiring, adversarial blinding, informed oversight and avoiding side effects.

A model of Quantum Computing for AI Alignment

Here we give a very short and simplified introduction to QC for AI Alignment Researchers. For a longer introduction to QC we recommend Quantum Computing for the Very Curious [8]. If you're already familiar with QC you may want to check our technical refresher in the footnotes in appendix A.

We introduce three heuristics, which to the best of our knowledge capture all relevant aspects of QC for AI Alignment concerns:

1. Quantum speedup - quantum computers are, at most, as powerful as classical computers allowed to run for an exponential amount of time.

QC usually provides only quadratic advantages (e.g. Grover database search [9], quantum walks [10]), although in some cases the speedup is exponential with respect to the best known classical algorithms (e.g. Shor [11], Hamiltonian simulation [12], HHL [13]).

Technically, the class of problems efficiently solvable with a probabilistic classical computer (BPP) is a subset of the class of problems efficiently solvable by a quantum computer (BQP), which is itself a subset of the class of problems solvable in exponential time by a classical deterministic computer (EXP). This means that quantum computers are at least as fast as classical computers, but no more than exponentially faster [14, 15].
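
The canonical example of a quadratic (rather than exponential) quantum advantage is Grover search, which a small statevector simulation can illustrate (a toy classical simulation, not real quantum hardware):

```python
import numpy as np

def grover_search(n_qubits, marked):
    """Statevector simulation of Grover search over N = 2**n_qubits items.

    Classical unstructured search needs O(N) oracle queries on average,
    while Grover needs only about (pi/4) * sqrt(N) iterations.
    """
    N = 2 ** n_qubits
    state = np.full(N, 1 / np.sqrt(N))    # uniform superposition
    n_iter = int(np.pi / 4 * np.sqrt(N))
    for _ in range(n_iter):
        state[marked] *= -1               # oracle: flip the marked sign
        state = 2 * state.mean() - state  # diffusion: invert about the mean
    return state, n_iter

state, n_iter = grover_search(8, marked=42)
# 12 iterations over 256 items concentrate nearly all probability on item 42
```

For 256 items, a classical search expects about 128 oracle queries, while the simulation above uses 12 Grover iterations, matching the quadratic speedup described in the heuristic.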

2. Quantum obfuscation - there is no efficient way of reading the state of a quantum computer while it is operating.

Quantum operations cannot copy quantum states (the “no-cloning theorem”) [16, 17], and performing a partial or total measurement of a quantum state will collapse that part of the state, resulting in loss of information.

To recover information from a quantum state one has several inefficient options: computing the inner product of the state with another vector (usually through a procedure called the swap test [18]), performing many measurements on many identically prepared states to gather statistics on the entries (tomography [19]), or using amplitude estimation [20] to estimate a single amplitude.

The first procedure scales quadratically with the required precision and destroys the state, though it is independent of the dimension of the vector. The latter two require a number of repetitions at least linear in the dimension of the vector, which grows exponentially with the number of qubits.
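
The shot cost of the tomography option can be caricatured with a single-basis sampling experiment (real tomography measures in many bases; everything here is a simplified illustration): each of the 2^n outcome probabilities must be estimated from repeated measurements of fresh copies of the state.

```python
import numpy as np

def estimate_probabilities(state, shots, rng=None):
    """Estimate |amplitude|^2 for each basis state by sampling.

    Each shot measures a fresh, identically prepared copy of `state`,
    collapsing that copy; the information is extracted statistically,
    never by reading the state directly.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    probs = np.abs(state) ** 2
    outcomes = rng.choice(len(state), size=shots, p=probs)
    return np.bincount(outcomes, minlength=len(state)) / shots

n_qubits = 6
N = 2 ** n_qubits                    # dimension grows exponentially in qubits
state = np.full(N, 1 / np.sqrt(N))   # uniform 6-qubit state
est = estimate_probabilities(state, shots=10 * N)
max_err = np.abs(est - 1 / N).max()  # shrinks roughly as 1/sqrt(shots)
```

Holding the per-entry error fixed, the required number of shots scales with the dimension N, which is the inefficiency the quantum obfuscation heuristic points at.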

3. Quantum isolation - a quantum computer cannot interact with the classical world without its state becoming at least partially classical.

In other words, if a quantum computer creates a side channel to the outside world during its computation, it destroys its coherence and randomizes its state according to well-defined rules.

This is directly derived from the postulates of Quantum Mechanics, and in particular the collapse of the wave function when it interacts with the outside world.

In the following two sections we look at how this model can be applied to gather insight on the phases of research in AI Alignment and kinds of alignment strategies where QC may or may not be relevant.

Bottlenecks in Artificial Intelligence Alignment research

In this section we introduce a simplified way of thinking about the different phases of research through which we expect the field of AI Alignment to go, and reason about the relevance of quantum computing during each of these phases.

Looking at some landmark achievements in computer science, it seems that most research begins with working on the formalization of a problem, which then is followed by a period where researchers try to find solutions to the problem, at first just theoretical, then inefficient and finally practical implementations (see for example the history of chess playing, from Shannon’s seminal paper in 1950 to the Deeper Blue vs Kasparov match in 1997 [21]).

We expect research in AI Alignment to develop in a similar fashion, and while we have some formalized frameworks to handle some subsets of the problem (see for example IRL [22]), there is no agreed upon formalization that captures the essence of the whole alignment problem.

On the other hand, theoretical proposals for QC applications are mostly concerned with speeding up classical algorithms, sometimes with notable improvements (see for example Shor’s algorithm for factorization [11]), and in some rare cases it has inspired the creation of novel algorithmic strategies [23]. In no case that we know of has QC led to a formalization insight of the kind that we believe AI Alignment is bottlenecked on.

That is, QC has so far only helped find efficient solutions to problems that were already properly formalized, while we believe that the most significant problems in AI Alignment have not yet matured into proper formalizations.

This observation serves as an empirical verification of the quantum speedup heuristic, which instructs us to think of quantum computing as a black-box accelerator rather than a novel approach to algorithmic design; thus we should not expect formalization insights to come from QC. In other words, QC may lead to what would be equivalent to compute overhang, but not to significant insight overhang.

We conclude that while QC may help in a later phase of AI Alignment research with making safe AI algorithms practical and competitive, it is very unlikely that it will lead to novel theoretical insights that fundamentally change how we think about AI Alignment.

As a side note, the same reasoning applies to AI capabilities research; QC is unlikely to lead to new formal insights in that field. However, the quantum speedup may enable the practical use of algorithms which were previously considered inefficient. This is concerning to the extent that we expect compute overhang to lead to more opaque and/or less safe algorithms.

Alignment Strategies: incentive design versus active oversight

In this section we introduce a distinction between two broad, complementary strategies for achieving AI Alignment, incentive design and active oversight, and reason about how QC may interact with each.

By incentive design we mean static strategies, where the design of an agent is verified to have certain safety properties that incentivize the agent to pursue desirable goals.

By active oversight we refer to dynamic strategies, where an agent, which may or may not be safe, is monitored, and if certain metrics indicate unsafeness, an intervention is made to safely interrupt or modify the agent.

We believe that a complete solution to the AI Alignment problem will include both elements of incentive design and active oversight.

Since we can treat QC as a black box accelerator according to the quantum speedup heuristic, we can see that QC does not present any additional challenges for incentive design. In other words, since designing the right incentives for the behaviour of the agent is independent of the hardware and implementation efficiency, QC is unlikely to help with this endeavour.

Instead of concerning themselves with the intricacies of QC, AI Alignment researchers may just treat QC as a black box and reason about classical equivalents instead. This is analogous to how AI Safety researchers do not need to think about whether an agent will be implemented over GPUs or serial processors and only focus on the formalisms.

Those alternatives may be exponentially slower; however, if the safety of our agent design relies on the assumption that the agent will not be able to acquire exponentially more resources for computation we cannot really deem it safe, making this a moot point.

One possible counter-argument to our line of reasoning above is that quantum computing may influence which AI designs end up being successful, and knowledge about the AI designs that are going to be more popular in the future can allow us to make extra assumptions to facilitate incentive design. We leave as an open question whether we can reliably predict features of QC-powered agents that would allow us to make these kinds of simplifying assumptions.

Moving on from incentive design, we point out that QC may pose an additional challenge to active oversight due to quantum obfuscation, as it might interfere with measurements of the internal state of a quantum computer. In the next section we explore some AI Alignment research areas where this fact comes into play.

Review of AI Alignment research agendas

We have reviewed the AI Alignment agendas of Concrete Problems in AI Safety [24], MIRI’s research agendas [25] and Iterated Distillation and Amplification [26].

Most of their work focuses on incentive design as the de facto route to a solution. As such, QC does not affect the main assumptions and considered solutions. However, we have been able to identify four areas of active oversight where QC may be a relevant consideration: tripwiring, adversarial blinding, informed oversight and avoiding side effects.

We discuss each of these areas in depth.

Tripwiring

Tripwiring is a class of mechanisms that alert us to, or directly prevent, attempts by an agent to hack its reward function [24]. If a particular tripwire design relies on access to the internal state of the agent, quantum obfuscation may impede its effective implementation.

For example, in the Transparency section of Techniques for optimizing worst-case performance [27], Paul Christiano writes:

“While training a model, we are free to look inside and ‘see what it’s thinking’. If we could actually understand those thoughts, we might be able to more easily predict under what conditions it would fail malignantly.”

Unfortunately, due to quantum obfuscation, this seems rather complicated if the agent has quantum capabilities. One will only be able to efficiently read the inputs and outputs of the quantum subroutines used, not their inner operation.

Nevertheless, there might be easier ways for a misaligned agent to obfuscate what it is doing, such as classically encoding its hidden plans in ways that an overseer will not understand (for example, it can encode its state with quantum-resistant cryptography [28]). However, some early work suggests that quantum-based obfuscation may be significantly more powerful than classical alternatives [29].

Another consideration is that quantum obfuscation offers a more plausible excuse for obfuscation than classical cryptography. For a classical agent that doesn't need to encrypt information during its expected workflow, any encryption attempt might indicate misalignment and trigger a shutdown. If, in contrast, the QC agent hides important information in a quantum superposition, this might be misaligned behavior or just a consequence of wanting to exploit the quantum speedup.

Finally, one could imagine some kind of quantum tripwire that monitors the quantum computer from within the superposition: for example, a quantum bit that flags whether the tripwire has been activated. Since quantum computing allows for conditional operations, one could imagine freezing the state if such a state has been reached in the superposition, only allowing it to proceed while it has not. Then one could perform amplitude estimation to check that the flag has not been activated in any branch of the computation, and otherwise kill it.
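
A toy version of this flag-qubit tripwire, with the flag as the most significant qubit of a small statevector; here "amplitude estimation" is replaced by directly summing probability mass in the flag=1 half, which a real overseer could not do efficiently (the states and probabilities below are purely illustrative):

```python
import numpy as np

def tripwire_probability(state):
    """Probability that the flag qubit (most significant qubit) is |1>.

    For an n-qubit statevector of length 2**n, the flag=1 branch is the
    upper half of the vector. A real overseer would estimate this via
    amplitude estimation rather than by inspecting amplitudes directly.
    """
    half = len(state) // 2
    return float(np.sum(np.abs(state[half:]) ** 2))

# 3 qubits: one flag qubit plus two work qubits
safe = np.zeros(8)
safe[:4] = 0.5               # all mass in the flag=0 branch
tripped = np.zeros(8)
tripped[:4] = np.sqrt(0.24)  # 96% of the mass still has flag=0
tripped[4:] = np.sqrt(0.01)  # 4% of the mass has set the flag
```

An overseer could then halt the computation whenever the estimated flag probability exceeds some threshold.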

Adversarial blinding

Adversarial blinding consists of using adversarial techniques to blind the agent with respect to some variables. This technique could be used to make it difficult for the agent to understand how its reward is generated and therefore how it can be hacked [24].

In the case where there is an asymmetry of resources and the overseer has access to quantum computers but the agent does not, we know that there are distributions that are just not efficiently computable classically (they would require exponential resources) but are efficiently computable on a quantum computer. This suggests the possibility of using one of those distributions to generate a difficult-to-hack reward.

Informed oversight

In the informed oversight scenario, the verifier checks that the behavior of an agent is safe, and the requirements on the verifier do not grow too much as the agent becomes more and more powerful [30].

If there is an asymmetry in resources where the verifier has access to QC while the agent does not, the verifier may exploit the quantum speedup to perform certain checks in polynomial time that would take the agent exponential time to perform itself.

In particular, quantum speedup may help with the amplification step of IDA, where we give an agent access to QC to design and verify a stronger successor, whose resources are constrained so that it does not have access to QC during the verification step.

It is, however, unclear whether this scheme of only giving the agent access to QC after it has been verified is feasible, and whether we can reliably prevent the agent from accessing QC while it is being verified.

Side effects and AI boxing

We would like to be able to reliably detect and prevent AI systems from interacting with the world in unexpected ways (this is referred to in the literature as avoiding side effects [24]).

Per the quantum isolation heuristic, when a process implemented in a quantum computer tries to affect the outside world, it necessarily collapses part of the state (unless the state is already classical).

One may think of taking advantage of this phenomenon to monitor whether a quantum agent has interacted with the outside world. There are, however, some associated difficulties.

Firstly, and most importantly, we cannot distinguish a partial collapse from constructive interference when the amplitude of the subset of qubits we are interested in is 1. That is, part of the state may be classical without our being able to tell whether this is due to collapse or because all the states in the superposition agree on the value of that particular subset of qubits.

Secondly, to tell whether the process has (partially) collapsed we need to measure it, which will collapse it in turn; even worse, we will not be able to know whether the state had collapsed before our measurement or because of it. Thus we would need to resort to one of the inefficient ways of measuring the quantum state, like amplitude estimation.

All this means it does not seem possible to strongly and efficiently infer interaction with the outside world through learning that part of the state of the agent’s mind is classical.

Relatedly, AI boxing refers to containment strategies in which an AI system is isolated from its environment to prevent unexpected side effects [31].

The quantum isolation heuristic may mislead us into thinking that boxing is easier to achieve in QC. However, this is not the case; as long as a process inside a QC is willing to partially collapse part of its state, it can escape its quantum box. In fact, one of the most active areas of current research in QC is how to isolate the quantum system so that it does not collapse unexpectedly.

So, in conclusion, one may think of quantum computers as oracles, but that does not mean that they are naturally boxed already.

Conclusion and open questions

We have presented a simplified model of quantum computing for AI Alignment, summarized in two heuristics: the quantum speedup heuristic (QC allows a quadratic and sometimes exponential speedup) and the quantum obfuscation heuristic (qubits cannot be read without destroying the computation).

Our main conclusion is that we do not expect Quantum Computing to be relevant to the current technical AI Alignment research agendas.

In particular, we do not expect QC techniques to be relevant for safety until we are actually interested in making algorithms and protocols as efficient as possible.

Furthermore, we expect QC to not be relevant for general incentive design purposes. However, we have also listed some current research problems related to active oversight where it might be helpful to have our simplified model of quantum computing in mind. In any case we do not expect these issues to have high relevance right now, as most current work in AI Alignment falls under incentive design strategies rather than active oversight.

Some further questions that came up during our research:

• Can we expect humanity to develop reliable QC before AGI? How do the developments in each field interact with one another? Will quantum machine learning significantly increase AI capabilities? Since one of the most straightforward and promising applications of QC is material science, should we expect QC to lead to further improvements in non-QC computing technology? How does that affect AI Capabilities?
• How does QC affect AI Governance? How easy is it to deploy powerful QC-powered AI systems? How easily can we monitor the labs that have QC capabilities? If QC is relevant to the development of AGI, can the fact that QC capabilities are more concentrated affect the dynamics of development of AGI?
• How does the quantum speedup possibility affect AI design? Can developments in QC lead to opaque AI, as the compute improvements allow raw search design to be used instead of design from first principles? Relatedly, what AI designs would be particularly favored by QC technology versus raw improvements in compute?
• Is quantum obfuscation the most efficient obfuscation strategy a misaligned agent can have access to? Can classical cryptography be used to obfuscate information?
• Is asymmetry of resources a reasonable assumption to make in verifier / agent scenarios? How can asymmetry of QC resources be exploited for safety purposes?
• How could we design quantum tripwires? What are their strengths and limitations?
• How would we go about implementing an adversarial blinding scheme based on quantum distribution sampling?
• In the case where QC is relevant for the design of advanced AIs, can we expect to have an actual quantum agent in the future, or will it just be a classical agent with access to quantum subroutines, in a CAIS fashion?

Article by Jaime Sevilla (FHI summer research fellow) and Pablo Moreno (PhD student in QC, Complutense University of Madrid).

We want to thank Linh Chi Nguyen, Adrian Hutter, Anders Sandberg, Max Daniel, Richard Möhn and Daniel Eth for incredibly useful feedback, editing and discussion.

Appendix A: Speedy technical introduction to Quantum Computing

Quantum states are unit vectors in a complex vector space. A basis vector is just a classical state, whereas any other vector is called a superposition (a linear combination of basis states) [32]. Quantum Computing is based on unitary transformations of these quantum states. Non-unitary dynamics can be introduced via measurements: a measurement projects the quantum state onto a basis state (a classical state) with probability equal to the squared modulus of the amplitude of that state (its coefficient in the linear combination).
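This measurement rule can be illustrated with a small numpy sketch of our own (`measure` is a hypothetical helper, not from any quantum library): outcomes are sampled with probability equal to the squared modulus of each amplitude, and the post-measurement state is the corresponding basis vector.

```python
import numpy as np

def measure(state, rng=np.random.default_rng(0)):
    """Project a quantum state onto a basis (classical) state.

    Outcome i is drawn with probability |amplitude_i|^2; the
    post-measurement state is the corresponding basis vector.
    """
    probs = np.abs(state) ** 2
    outcome = rng.choice(len(state), p=probs)
    collapsed = np.zeros_like(state)
    collapsed[outcome] = 1.0
    return outcome, collapsed

# Equal superposition of two basis states: each outcome has probability 1/2.
state = np.array([1.0, 1.0], dtype=complex) / np.sqrt(2)
outcome, collapsed = measure(state)
assert np.abs(collapsed[outcome]) == 1.0  # the state has collapsed
```

Repeating the measurement on fresh copies of `state` would yield outcome 0 about half the time, matching the squared amplitudes.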

Bibliography

[1] Córcoles, A. D., et al. "Demonstration of a Quantum Error Detection Code Using a Square Lattice of Four Superconducting Qubits". Nature Communications, vol. 6, no. 1, November 2015, p. 6979. doi:10.1038/ncomms7979.

[2] Almudever, C. G., et al. "The Engineering Challenges in Quantum Computing". Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, 2017, pp. 836-45. doi:10.23919/DATE.2017.7927104.

[3] de Wolf, Ronald. "The Potential Impact of Quantum Computers on Society". Ethics and Information Technology, vol. 19, no. 4, December 2017, pp. 271-76. doi:10.1007/s10676-017-9439-z.

[4] Tacchino, Francesco, et al. "An Artificial Neuron Implemented on an Actual Quantum Processor". npj Quantum Information, vol. 5, no. 1, 2019, p. 26. doi:10.1038/s41534-019-0140-4.

[5] Aaronson, Scott. "Quantum Machine Learning Algorithms: Read the Fine Print". https://www.scottaaronson.com/papers/qml.pdf.

[6] Dafoe, Allan. "AI Governance: A Research Agenda". Future of Humanity Institute, University of Oxford, https://www.fhi.ox.ac.uk/wp-content/uploads/GovAIAgenda.pdf.

[7] Mahadev, Urmila. "Classical Verification of Quantum Computations". 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), IEEE, 2018, pp. 259-67. doi:10.1109/FOCS.2018.00033.

[8] Matuschak, Andy, and Michael Nielsen. "Quantum Computing for the Very Curious". https://quantum.country/qcvc.

[9] Grover, Lov K. "A Fast Quantum Mechanical Algorithm for Database Search". Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing (STOC '96), ACM Press, 1996, pp. 212-19. doi:10.1145/237814.237866.

[10] Szegedy, M. "Quantum Speed-Up of Markov Chain Based Algorithms". 45th Annual IEEE Symposium on Foundations of Computer Science, IEEE, 2004, pp. 32-41. doi:10.1109/FOCS.2004.53.

[11] Shor, Peter W. "Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer". SIAM Journal on Computing, vol. 26, no. 5, October 1997, pp. 1484-509. doi:10.1137/S0097539795293172.

[12] Berry, Dominic W., et al. "Hamiltonian Simulation with Nearly Optimal Dependence on All Parameters". 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, IEEE, 2015, pp. 792-809. doi:10.1109/FOCS.2015.54.

[13] Harrow, Aram W., et al. "Quantum Algorithm for Linear Systems of Equations". Physical Review Letters, vol. 103, no. 15, October 2009, p. 150502. doi:10.1103/PhysRevLett.103.150502.

[14] Aaronson, Scott. "BQP and the Polynomial Hierarchy". Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC '10), ACM Press, 2010, p. 141. doi:10.1145/1806689.1806711.

[15] "Petting Zoo". Complexity Zoo, https://complexityzoo.uwaterloo.ca/Petting_Zoo.

[16] Wootters, W. K., and W. H. Zurek. "A Single Quantum Cannot Be Cloned". Nature, vol. 299, no. 5886, October 1982, pp. 802-03. doi:10.1038/299802a0.

[17] Scarani, Valerio, et al. "Quantum Cloning". Reviews of Modern Physics, vol. 77, no. 4, November 2005, pp. 1225-56. doi:10.1103/RevModPhys.77.1225.

[18] Schuld, Maria, and Francesco Petruccione. "Supervised Learning with Quantum Computers". Springer, 2018.

[19] Cramer, Marcus, et al. "Efficient Quantum State Tomography". Nature Communications, vol. 1, no. 1, 2010, p. 149. doi:10.1038/ncomms1147.

[20] Brassard, Gilles, et al. "Quantum Amplitude Amplification and Estimation". Contemporary Mathematics, edited by Samuel J. Lomonaco and Howard E. Brandt, vol. 305, American Mathematical Society, 2002, pp. 53-74. doi:10.1090/conm/305/05215.

[21] "CS221: Deep Blue". https://stanford.edu/~cpiech/cs221/apps/deepBlue.html.

[22] Ng, Andrew Y., and Stuart J. Russell. "Algorithms for Inverse Reinforcement Learning". ICML, vol. 1, 2000.

[23] Tang, Ewin. "A Quantum-Inspired Classical Algorithm for Recommendation Systems". Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC 2019), ACM Press, 2019, pp. 217-28. doi:10.1145/3313276.3316310.

[24] Amodei, Dario, et al. "Concrete Problems in AI Safety". arXiv preprint arXiv:1606.06565 (2016).

[25] Demski, Abram, and Scott Garrabrant. "Embedded Agency". arXiv preprint arXiv:1902.09469 (2019).

[26] Cotra, Ajeya. "Iterated Distillation and Amplification". Medium, April 29, 2018, https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616.

[27] Christiano, Paul. "Techniques for Optimizing Worst-Case Performance". Medium, 2018, https://ai-alignment.com/techniques-for-optimizing-worst-case-performance-39eafec74b99.

[28] Chen, Lily, et al. "Report on Post-Quantum Cryptography". National Institute of Standards and Technology, https://nvlpubs.nist.gov/nistpubs/ir/2016/nist.ir.8105.pdf.

[29] Alagic, Gorjan, and Bill Fefferman. "On Quantum Obfuscation". arXiv:1602.01771 [quant-ph], February 2016, http://arxiv.org/abs/1602.01771.

[30] Christiano, Paul. "The Informed Oversight Problem". Medium, July 4, 2017, https://ai-alignment.com/the-informed-oversight-problem-1b51b4f66b35.

[31] Armstrong, Stuart, et al. "Thinking Inside the Box: Controlling and Using an Oracle AI". Minds and Machines, vol. 22, no. 4, November 2012, pp. 299-324. doi:10.1007/s11023-012-9282-2.

[32] Nielsen, Michael A., and Isaac L. Chuang. "Quantum Computation and Quantum Information". 10th anniversary ed., Cambridge University Press, 2010.


### Markets are Universal for Logical Induction

August 22, 2019 - 09:44
Published on August 22, 2019 6:44 AM UTC

Background

The general idea of logical induction is to assign probability-like numbers to logical statements like “the 10^10^10^10th digit of pi is 3”, and to refine these “probabilities” over time as the system thinks more. To create these “probabilities”, each statement is associated with an asset in a prediction market, which eventually pays $1 if the statement is proven true, or $0 if it is proven false. The “probabilities” are then the prices of these assets.
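As a toy illustration of the payoff structure (our own notation, not from the paper): an asset for a statement trades at a price in [0, 1], and once the statement is settled each share pays $1 or $0.

```python
def settle(holdings: dict, truth: dict) -> float:
    """Value of a portfolio once statements are proven or disproven.

    holdings maps statement -> number of shares held;
    truth maps statement -> True ($1 payoff) or False ($0 payoff).
    """
    return sum(n * (1.0 if truth[s] else 0.0) for s, n in holdings.items())

# Buying one share of a statement at price 0.3 nets $0.70 if it is
# proven true -- so a price of 0.3 plays the role of a 30% probability.
price = 0.3
profit = settle({"phi": 1}, {"phi": True}) - price
assert abs(profit - 0.7) < 1e-12
```

If the statement were disproven instead, the share would pay $0 and the buyer would simply lose the price paid.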

(It’s also possible that a statement is never proven or disproven, and one of the many interesting results of the paper is that logical inductors assign useful prices in that case too.)

The logical induction paper has two main pieces. First, it introduces the logical induction criterion: a system which assigns prices to statements over time is called a “logical inductor” if the prices cannot be exploited by any polynomial-time trading algorithm. The paper then shows that this criterion implies that the prices have a whole slew of useful, intuitive, probability-like properties.

The second main piece of the paper proves that at least one logical inductor is computable: the paper constructs an (extremely slow) algorithm to compute inexploitable prices for logical statements. The algorithm works by running a prediction market in which every possible polynomial-time trader is a participant. Naturally, the prices in this market turn out to be inexploitable by any polynomial-time trader - so, this giant simulated prediction market is a logical inductor.

Our Goal

An analogy: one could imagine a decision theorist making a list of cool properties they want their decision theory to have, then saying “well, here’s one possible decision algorithm which satisfies these properties: maximize expected utility”. That would be cool and useful, but what we really want is a theorem saying that any possible decision algorithm which satisfies the cool properties can be represented as maximizing expected utility.

This is analogous to the situation in the logical induction paper: there’s this cool criterion for handling logical uncertainty, and it implies a bunch of cool properties. The paper then says “well, here’s one possible algorithm which satisfies these properties: simulate a prediction market containing every possible polynomial-time trader”. That’s super cool and useful, but what we really want is a theorem saying that any possible algorithm which satisfies the cool properties can be represented as a prediction market containing every possible polynomial-time trader.

That’s our goal. We want to show that any possible logical inductor can be represented by a market of traders - i.e. there is some market of traders which produces exactly the same prices.

The Proof

We’ll start with a slightly weaker theorem: any prices which are inexploitable by a particular trader T can be represented by a market in which T is a participant.

Conceptually, we can sketch out the proof as:

• The “trader” T is a function which takes in prices P, and returns a portfolio T(P) specifying how much of each asset it wants to hold.
• The rest of the market is represented by an aggregate trader M (for “market”), which takes in prices P and returns a portfolio M(P) specifying how much of each asset the rest of the market wants to hold.
• The “market maker” mechanism chooses prices so that the total portfolio held by everyone is zero - i.e. T(P) + M(P) = 0. This means that any long position held by T must be balanced by a short position held by M, and vice versa, at the market prices.
• We know the trader’s function T, and we know the prices P which we want to represent, so we solve for M: M(P) = -T(P)

… so the market containing M and T reproduces the prices P, as desired. One problem: this conceptual “proof” works even if the prices are exploitable. What gives?
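Before addressing that problem, the naive construction in the bullets above can be written down directly (a toy sketch in our own notation, ignoring budgets): given the trader's demand function T and the target prices P, define M as its pointwise negation, so the clearing condition holds at P by construction.

```python
import numpy as np

def make_counterparty(T):
    """Aggregate trader M that exactly offsets T at every price vector."""
    return lambda P: -T(P)

# A toy trader (our own example): buy one share of anything priced
# below 0.5, short one share of anything priced at or above 0.5.
T = lambda P: np.where(P < 0.5, 1.0, -1.0)
M = make_counterparty(T)

P = np.array([0.2, 0.7, 0.5])
assert np.allclose(T(P) + M(P), 0.0)  # the market clears at these prices
```

As the text notes, this sketch "works" for any prices whatsoever, which is exactly why budgeting is needed.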

The main trick here is budgeting: the real setup gives the traders limited budget, and they can’t keep trading if they go broke. Since M is doing exactly the opposite of T, M will go broke when T has unbounded gains - i.e. when T exploits the prices (this is basically the definition of exploitation used in the paper). But if the prices are inexploitable, then T’s possible gains are bounded, therefore M’s possible losses are bounded, and M can keep counterbalancing T’s trades indefinitely.

Let’s formalize that a bit. I’ll re-use notation from the logical induction paper without redefining everything, so check the paper for full definitions.

First, let’s write out the correct version of “T(P) + M(P) = 0”. The missing piece is budgeting: rather than letting traders trade directly, the logical induction algorithm builds “budgeted” traders B_bT(T) and B_bM(M), where bT and bM are the two traders’ starting budgets. At each time t, the market maker mechanism then finds prices P_t for which

B_bT(T)(P_t, t) + B_bM(M)(P_t, t) = 0

The budgeting function is a bit involved; see the paper for more. The important points are that:

• B_bT(T) will exploit the prices as long as T does
• Budgeting doesn’t change anything at all as long as the traders don’t put more money on the line than they have available; otherwise it scales down the trader’s investments to match their budget.
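A minimal sketch of the scaling idea (our own simplification, not the paper's exact budgeting function): if a trade would put more money on the line than the trader's remaining budget, shrink the whole trade proportionally.

```python
import numpy as np

def budgeted(trade: np.ndarray, prices: np.ndarray, budget: float) -> np.ndarray:
    """Scale a trade down so its worst-case loss fits within `budget`.

    Worst case for a long position is the asset paying $0 (lose the
    price paid); for a short position, the asset paying $1 (lose 1 - price).
    """
    losses = np.where(trade > 0, trade * prices, -trade * (1 - prices))
    at_risk = float(np.sum(losses))
    if at_risk <= budget:
        return trade                       # within budget: unchanged
    return trade * (budget / at_risk)      # otherwise scale down proportionally

prices = np.array([0.5, 0.5])
trade = np.array([4.0, -4.0])              # $2 at risk long + $2 short = $4
scaled = budgeted(trade, prices, budget=1.0)
assert np.allclose(scaled, trade / 4)      # shrunk so only $1 is at risk
```

The paper's actual budgeting function is more involved (it accounts for accumulated wealth over time), but it shares this key property: small trades pass through untouched, and a broke trader's influence vanishes.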

Enforcing the second piece involves finding the worst-possible world for each trader’s portfolio. M’s worst-possible world is B_bT(T)’s best-possible world, so we reason:

• Since T cannot exploit the prices, neither can B_bT(T)
• Since B_bT(T) cannot exploit the prices, its best-case gain is bounded
• Since B_bT(T)’s best-case gain is bounded, the opposite strategy −B_bT(T)’s maximum loss is bounded
• We can set M’s budget bM higher than this maximum loss, and set M = −B_bT(T), so that B_bM(M) = −B_bT(T), as desired.

To recap: given a series of prices P_t over time, inexploitable by trader T, we constructed a market containing T which reproduces the prices P_t.

The proof approach generalizes easily to more traders: simply replace B_bT(T) with ∑_i B_bTi(T_i) to sum over the contribution of each individual trader, then select M to balance them out, as before. Since the prices are inexploitable by every trader, the traders’ aggregate best-case gains are bounded above, so M’s worst-case loss is bounded below. This works even with infinitely many traders, as long as the total budget of all traders remains finite.

In particular, if we consider all polynomial-time traders (with budgeting and other details handled as in the paper), we find that any prices satisfying the logical induction criterion can be represented by a market containing all of the polynomial-time traders.

The Bigger Picture

Why does this matter?

First and foremost, it neatly characterizes the class of logical inductors: there are degrees of freedom in budgeting, and a huge degree of freedom in choosing the trader M which shares the market with our polynomial-time traders, but that’s it - that’s all we need to represent all possible logical inductors. (Note that this does not mean that simulating a prediction market containing all possible polynomial-time traders is the only way to implement a logical inductor - just that there is always a prediction market which produces the same prices as the logical inductor.)

Second, the proof is short and simple enough to generalize well. We should expect a similar technique to work under other notions of exploitability, and in scenarios beyond logical induction. This ties in to a general research path I’ve been playing with: many conditions of “inexploitability” or “efficiency” which we typically associate with utility functions instead produce markets when we relax the assumptions somewhat. Since markets are strictly more general than utility functions, and typically satisfy the same inexploitability/efficiency criteria under more general assumptions, these kinds of results suggest that we should use markets of subagents in many models where we currently use utilities - see “Why Subagents?” for more along these lines.


### Announcement: Writing Day Today (Thursday)

August 22, 2019 - 07:48
Published on August 22, 2019 4:48 AM UTC

The MIRI Summer Fellows Program is having a writing day, where the participants are given a whole day to write whatever LessWrong / AI Alignment Forum posts that they like. This is to practice the skills needed for forum participation as a research strategy.

On an average day LessWrong gets about 5-6 posts, but last year this writing day generated 28 posts. It's likely something similar will happen again, starting mid-to-late afternoon PDT.

It's pretty overwhelming to try to read-and-comment on 28 posts in a day, so we're gonna make sure this isn't the only chance you'll have to interact with these posts. In the coming weeks I'll post roundups, highlighting 3-5 of the posts, giving them a second airing for comments and discussion.


### Call for contributors to the Alignment Newsletter

August 21, 2019 - 21:21
Published on August 21, 2019 6:21 PM UTC

TL;DR: I am looking for (possibly paid) contributors to write summaries and opinions for the Alignment Newsletter. This is currently experimental, but I estimate ~80% chance that it will become long-term, and so I’m looking for people who are likely to contribute at least 20 summaries over the course of their tenure at the newsletter (see caveats in the post). To apply, read this doc, write an example summary + opinion, and fill out this form by Friday, September 6. I am also looking for someone to take over the work of publishing the newsletter (~1-3 hours per week); please send me an email if you’d be interested in this.

Roles I am looking for

Publisher: Once all of the summaries and opinions are written, you would turn them into an actual newsletter, send it out for proofreading, fix any typos found, update the database, etc. This currently takes me around half an hour per newsletter. Ideally, you would also take on some tasks that I haven’t found the time for: improving the visual design of the newsletter, A/B testing different versions to see what people engage with, publicity, and so on, for a total of ~1-3 hours per week.

Since I don’t yet have the setup to pay people to help with the newsletter, I am only looking for expressions of interest. If you think you’d be interested in this role, click this link to email me at rohinmshah@berkeley.edu with the subject line “Interested in publisher role for Alignment Newsletter EOM”. If I do end up hiring for the publisher role I’ll reach out to you with more details.

The rest of this doc will be focused on the more substantial role:

Content creator: You would choose articles that you’re interested in, and write summaries and opinions for them, that would then be published in the newsletter.

Why am I looking for content creators?

In the past few months, I haven’t been allocating as much time to the newsletter (you may have noticed they’re coming out every other week now). There have been many other things that seem more important to do. This is both because I’m more optimistic about the other work I’m doing, and because I no longer find it as useful to read papers as I did when I started the newsletter. As a result, I now have over 100 articles that I would probably want to send out, but haven’t gotten around to yet. This is also partly because there’s just more stuff coming out now. (I mentioned some of these points in the retrospective.)

Another reason for more content creators is that as I have learned more since starting the newsletter, I have developed my own idiosyncratic beliefs, and I think I have become worse at intuitively interpreting other posts from the author’s perspective rather than my own. (In other words, I would perform worse at an Ideological Turing Test of their position than I would have in the past, unless I put in a lot of effort into it.) I expect that with more writers the newsletter will better reflect a diversity of opinions.

Why should you do it?

It’s impactful. See the retrospective for more on this point. I’m not currently able to get a (normal length) newsletter out every week; you’d likely be causally responsible for getting back to weekly newsletters.

You will improve your analytical writing skills. Hopefully clear.

You’ll learn more about safety by reading papers. You could do this by yourself, but by summarizing the papers, you’re also providing a valuable service for everyone else.

You might learn more about AI safety by getting feedback from me. This is a “might” because I don’t know how much feedback I will end up giving to you about your summaries and opinions that’s actually about key ideas in AI safety (as opposed to feedback about the writing itself).

You might build career capital. I certainly have built career capital by creating this newsletter -- it has made me well known in some communities. I don’t know to what extent this will transfer to you.

You might be paid. Currently this is experimental, so I haven’t actually thought much about payment. I expect that I could get a grant to pay you if I ended up deciding that it would be worth it. However, it might be that dealing with all of the paperwork + tax implications cancels out any time savings, though I think this is unlikely. If this is an important factor to you, please do let me know when you apply.

Qualifications
• Likely to contribute at least 20 summaries to the newsletter over time, at least 4 of which are in the first month (for onboarding purposes). Alternatively, if you have deep expertise in a topic that the newsletter covers infrequently, such as formal verification, you should be likely to summarize relevant papers for at least the next 6 months.
• Basic familiarity with AI safety arguments
• Medium familiarity with the topic that you want to write summaries about
• Good writing skills (though I recommend just applying regardless and letting me evaluate based on your example summary)
Application process

Fill out this form. The main part of the application is to write an example summary and opinion for an article (which I may send out in the newsletter, if you give me permission to). Ideally you would write a summary on one of the articles from the list below, but if there isn’t an article in the subarea you’d like to write on, you can choose some other article (that hasn’t already been summarized in the newsletter) and summarize that. The whole process should take 1-4 hours, depending on how much time you put into the summary and opinion.

List of articles:


### Two senses of “optimizer”

August 21, 2019 - 19:23
Published on August 21, 2019 4:02 PM UTC

The word “optimizer” can be used in at least two different ways.

First, a system can be an “optimizer” in the sense that it is solving a computational optimization problem. A computer running a linear program solver, a SAT-solver, or gradient descent, would be an example of a system that is an “optimizer” in this sense. That is, it runs an optimization algorithm. Let “optimizer_1” denote this concept.
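For instance, a few lines of gradient descent are an optimizer_1 in this sense: they solve a computational optimization problem without pushing on anything outside the computation (a toy sketch of our own):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a function given its gradient: a pure optimizer_1."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # step downhill along the gradient
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
assert abs(x_min - 3.0) < 1e-6
```

Nothing about running this code systematically steers the world outside the program toward any target.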

Second, a system can be an “optimizer” in the sense that it optimizes its environment. A human is an optimizer in this sense, because we robustly take actions that push our environment in a certain direction. A reinforcement learning agent can also be thought of as an optimizer in this sense, but confined to whatever environment it is run in. This is the sense in which “optimizer” is used in posts such as this. Let “optimizer_2” denote this concept.

These two concepts are distinct. Say that you somehow hook up a linear program solver to a reinforcement learning environment. Unless you do the “hooking up” in a particularly creative way there is no reason to assume that the output of the linear program solver would push the environment in a particular direction. Hence a linear program solver is an optimizer_1, but not an optimizer_2. On the other hand, a simple tabular RL agent would eventually come to systematically push the environment in a particular direction, and is hence an optimizer_2. However, such a system does not run any internal optimization algorithm, and is therefore not an optimizer_1. This means that a system can be an optimizer_1 while not being an optimizer_2, and vice versa.
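The tabular case can be made concrete (a toy sketch of our own construction, with a hypothetical two-state environment): the agent contains no internal optimization algorithm, just a lookup-table update rule, yet it comes to systematically push its environment toward rewarded states.

```python
import random

def q_learning(step, n_states, n_actions, episodes=500, lr=0.5, eps=0.1):
    """Tabular Q-learning: no inner optimizer_1, just a table update,
    but the learned policy steers the environment toward reward.

    `step(s, a)` returns (next_state, reward, done).
    """
    random.seed(0)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2, r, done = step(s, a)
            target = r if done else r + max(Q[s2])
            Q[s][a] += lr * (target - Q[s][a])
            s = s2
    return Q

# Hypothetical two-state environment: action 1 reaches the rewarded
# state; action 0 ends the episode with no reward.
def step(s, a):
    return (1, 1.0, True) if a == 1 else (0, 0.0, True)

Q = q_learning(step, n_states=2, n_actions=2)
assert Q[0][1] > Q[0][0]  # learned to steer the environment toward reward
```

No step of this program runs an optimization algorithm over a search space; the environment-steering behavior emerges from repeated table updates, which is what makes it an optimizer_2 but not an optimizer_1.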

There are some arguments related to AI safety that seem to conflate these two concepts. In Superintelligence (pg 153), on the topic of Tool AI, Nick Bostrom writes that:

A second place where trouble could arise is in the course of the software’s operation. If the methods that the software uses to search for a solution are sufficiently sophisticated, they may include provisions for managing the search process itself in an intelligent manner. In this case, the machine running the software may begin to seem less like a mere tool and more like an agent. Thus, the software may start by developing a plan for how to go about its search for a solution. The plan may specify which areas to explore first and with what methods, what data to gather, and how to make best use of available computational resources. In searching for a plan that satisfies the software’s internal criterion (such as yielding a sufficiently high probability of finding a solution satisfying the user-specified criterion within the allotted time), the software may stumble on an unorthodox idea. For instance, it might generate a plan that begins with the acquisition of additional computational resources and the elimination of potential interrupters (such as human beings).

To me, this argument seems to make an unexplained jump from optimizer_1 to optimizer_2. It begins with the observation that a powerful Tool AI would be likely to optimize its internal computation in various ways, and that this optimization process could be quite powerful. In other words, a powerful Tool AI would be a strong optimizer_1. It then concludes that the system might start pursuing convergent instrumental goals – in other words, that it would be an optimizer_2. The jump between the two is not explained.

The implicit assumption seems to be that an optimizer_1 could turn into an optimizer_2 unexpectedly if it becomes sufficiently powerful. It is not at all clear to me that this is the case – I have not seen any good argument to support this, nor can I think of any myself. The fact that a system is internally running an optimization algorithm does not imply that the system is selecting its output in such a way that this output optimizes the environment of the system.

The excerpt from Superintelligence is just one example of an argument that seems to slide between optimizer_1 and optimizer_2. For example, some parts of Dreams of Friendliness seem to be doing so, or at least it's not always clear which of the two is being talked about. I’m sure there are more examples as well.

Be mindful of this distinction when reasoning about AI. I propose that “consequentialist” (or perhaps "goal-directed") is used to mean what I have called “optimizer_2”. I don’t think there is a need for a special word to denote what I have called “optimizer_1” (at least not once the distinction between optimizer_1 and optimizer_2 has been pointed out).

Note: It is possible to raise a sort of embedded agency-like objection against the distinction between optimizer_1 and optimizer_2. One might argue that:

There is no sharp boundary between the inside and the outside of a computer. An “optimizer_1” is just an optimizer whose optimization target is defined in terms of the state of the computer it is installed on, whereas an “optimizer_2” is an optimizer whose optimization target is defined in terms of something outside the computer. Hence there is no categorical difference between an optimizer_1 and an optimizer_2.

I don’t think that this argument works. Consider the following two systems:

• A computer that is able to very quickly solve very large linear programs.
• A computer that solves linear programs, and tries to prevent people from turning it off as it is doing so, etc.

System 1 is an optimizer_1 that solves linear programs, whereas system 2 is an optimizer_2 that is optimizing the state of the computer that it is installed on. These two things are different. (Moreover, the difference isn’t just that system 2 is “more powerful” than system 1 – system 1 might even be a better linear program solver than system 2.)
