
### Do Sufficiently Advanced Agents Use Logic?

LessWrong.com News - September 13, 2019 - 22:53
Published on September 13, 2019 7:53 PM UTC

This is a continuation of a discussion with Vanessa from the MIRIxDiscord group. I'll make some comments on things Vanessa has said, but those should not be considered a summary of the discussion so far. My comments here are also informed by discussion with Sam.

1: Logic as Proxy

1a: The Role of Prediction

Vanessa has said that predictive accuracy is sufficient; consideration of logic is not needed to judge (partial) models. A hypothesis should ultimately ground out to perceptual information. So why is there any need to consider other sorts of "predictions" it can make? (IE, why should we think of it as possessing internal propositions which have a logic of their own?)

But similarly, why should agents use predictive accuracy to learn? What's the argument for it? Ultimately, predicting perceptions ahead of time should only be in service of achieving higher reward.

We could instead learn from reward feedback alone. A (partial) "hypothesis" would really be a (partial) strategy, helping us to generate actions. We would judge strategies on (something like) average reward achieved, not even trying to predict precise reward signals. The agent still receives incoming perceptual information, and strategies can use it to update internal states and to inform actions. However, strategies are not asked to produce any predictions. (The framework I'm describing is, of course, model-free RL.)
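As a deliberately minimal sketch of this framework (my construction, just to fix ideas; the function names and toy environment are hypothetical), strategies are judged purely by empirical average reward, with no prediction step anywhere:

```python
def run_episode(strategy, env_step, horizon=100):
    """Roll out a strategy and judge it ONLY by total reward received.
    The strategy may keep internal state and use percepts however it
    likes, but it is never asked to produce a prediction."""
    internal, percept, total = None, None, 0.0
    for _ in range(horizon):
        action, internal = strategy(percept, internal)
        percept, reward = env_step(action)
        total += reward
    return total

def select_strategy(strategies, env_step, trials=10):
    """Model-free selection: rank candidate strategies by empirical
    average reward, with no model of the world anywhere."""
    def avg_reward(s):
        return sum(run_episode(s, env_step) for _ in range(trials)) / trials
    return max(strategies, key=avg_reward)
```

Note that nothing stops an individual `strategy` from running a predictive model internally; the selection criterion just never looks at predictions, only at reward.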

Intuitively, it seems as if this is missing something. A model-based agent can learn a lot about the world just by watching, taking no actions. However, individual strategies can implement prediction-based learning within themselves. So, it seems difficult to say what benefit model-based RL provides beyond model-free RL, besides a better prior over strategies.

It might be that we can't say anything recommending model-based learning over model-free in a standard bounded-regret framework. (I actually haven't thought about it much -- but the argument that model-free strategies can implement models internally seems potentially strong. Perhaps you just can't get much in the AIXI framework because there are no good loss bounds in that framework at all, as Vanessa mentions.) However, if so, this seems like a weakness of standard bounded-regret frameworks. Predicting the world seems to be a significant aspect of intelligence; we should be able to talk about this formally somehow.

Granted, it doesn't make sense for bounded agents to pursue predictive accuracy above all else. There is a computational trade-off, and you don't need to predict something which isn't important. My claim is something like, you should try and predict when you don't yet have an effective strategy. After you have an effective strategy, you don't really need to generate predictions. Before that, you need to generate predictions because you're still grappling with the world, trying to understand what's basically going on.

If we're trying to understand intelligence, the idea that model-free learners can internally manage these trade-offs (by choosing strategies which judiciously choose to learn from predictions when it is efficacious to do so) seems less satisfying than a proper theory of learning from prediction. What is fundamental vs non-fundamental to intelligence can get fuzzy, but learning from prediction seems like something we expect any sufficiently intelligent agent to do (whether it was built-in or learned behavior).

(It should also be noted that a reward-learning framework presumes we get feedback about utility at all. If we get no feedback about reward, then we're forced to only judge hypotheses by predictions, and make what inferences about utility we will. A dire situation for learning theory, but a situation where we can still talk about rational agency more generally.)

1b: The Analogy to Logic

My argument is going to be that if achieving high reward is task A, and predicting perception is task B, logic can be task C. Like task B, it is very different from task A. Like task B, it nonetheless provides useful information. Like task B, it seems to me that a theory of (boundedly) rational agency is missing something without it.

The basic picture is this. Perceptual prediction provides a lot of good feedback about the quality of cognitive algorithms. But if you really want to train up some good cognitive algorithms for yourself, it is helpful to do some imaginative play on the side.

One way to visualize this is an agent making up math puzzles in order to strengthen its reasoning skills. This might suggest a picture where the puzzles are always well-defined (terminating) computations. However, there's no special dividing line between decidable and undecidable problems -- any particular restriction to a decidable class might rule out some interesting (decidable but non-obviously so) stuff which we could learn from. So we might end up just going with any computations (halting or no).

Similarly, we might not restrict ourselves to entirely well-defined propositions. It makes a lot of sense to test cognitive heuristics on scenarios closer to life.

Why do I think sufficiently advanced agents are likely to do this?

Well, just as it seems important that we can learn a whole lot from prediction before we ever take an action in a given type of situation, it seems important that we can learn a whole lot by reasoning before we even observe that situation. I'm not formulating a precise learning-theoretic conjecture, but intuitively, it is related to whether we could reasonably expect the agent to get something right on the first try. Good perceptual prediction alone does not guarantee that we can correctly anticipate the effects of actions we have never tried before, but if I see an agent generate an effective strategy in a situation it has never intervened in before (but has had opportunity to observe), I expect that internally it is learning from perception at some level (even if it is model-free in overall architecture). Similarly, if I see an agent quickly pick up a reasoning-heavy game like chess, then I suspect it of learning from hypothetical simulations at some level.

Again, "on the first try" is not supposed to be a formal learning-theoretic requirement; I realize you can't exactly expect anything to work on the first try with learning agents. What I'm getting at has something to do with generalization.

2: Learning-Theoretic Criteria

Part of the frame has been learning-theory-vs-logic. One might interpret my closing remarks from the previous section that way; I don't know how to formulate my intuition learning-theoretically, but I expect that reasoning helps agents in particular situations. It may be that the phenomena of the previous section cannot be understood learning-theoretically, and only amount to a "better prior over strategies" as I mentioned. However, I don't want it to be a learning-theory-vs-logic argument. I would hope that something learning-theoretic can be said in favor of learning from perception, and in favor of learning from logic. Even if it can't, learning theory is still an important component here, regardless of the importance of logic.

I'll try to say something about how I think learning theory should interface with logic.

Vanessa said some relevant things in a comment, which I'll quote in full:

Heterodox opinion: I think the entire MIRIesque (and academic philosophy) approach to decision theory is confused. The basic assumption seems to be, that we can decouple the problem of learning a model of the world from the problem of taking a decision given such a model. We then ignore the first problem, and assume a particular shape for the model (for example, causal network) which allows us to consider decision theories such as CDT, EDT etc. However, in reality the two problems cannot be decoupled. This is because the type signature of a world model is only meaningful if it comes with an algorithm for how to learn a model of this type.

For example, consider Newcomb's paradox. The agent makes a decision under the assumption that Omega behaves in a certain way. But, where did the assumption come from? Realistic agents have to learn everything they know. Learning normally requires a time sequence. For example, we can consider the iterated Newcomb's paradox (INP). In INP, any reinforcement learning (RL) algorithm will converge to one-boxing, simply because one-boxing gives it the money. This is despite RL naively looking like CDT. Why does it happen? Because in the learned model, the "causal" relationships are not physical causality. The agent comes to believe that taking the one box causes the money to appear there.

In Newcomb's paradox EDT succeeds but CDT fails. Let's consider an example where CDT succeeds and EDT fails: the XOR blackmail. The iterated version would be IXB. In IXB, classical RL doesn't guarantee much because the environment is more complex than the agent (it contains Omega). To overcome this, we can use RL with incomplete models. I believe that this indeed solves both INP and IXB.

Then we can consider e.g. counterfactual mugging. In counterfactual mugging, RL with incomplete models doesn't work. That's because the assumption that Omega responds in a way that depends on a counterfactual world is not in the space of models at all. Indeed, it's unclear how can any agent learn such a fact from empirical observations. One way to fix it is by allowing the agent to precommit. Then the assumption about Omega becomes empirically verifiable. But, if we do this, then RL with incomplete models can solve the problem again.

The only class of problems that I'm genuinely unsure how to deal with is game-theoretic superrationality. However, I also don't see much evidence the MIRIesque approach has succeeded on that front. We probably need to start with just solving the grain of truth problem in the sense of converging to ordinary Nash (or similar) equilibria (which might be possible using incomplete models). Later we can consider agents that observe each other's source code, and maybe something along the lines of this can apply.
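The claim that any RL algorithm converges to one-boxing in the iterated Newcomb's paradox is easy to check in a toy simulation. The sketch below is my construction, not Vanessa's: an epsilon-greedy bandit agent faces a perfect predictor that predicts the action actually taken (exploration included), and it duly learns to one-box because one-boxing is what gets it the money.

```python
import random

def iterated_newcomb(rounds=1000, eps=0.1, seed=0):
    """Epsilon-greedy bandit agent on the iterated Newcomb problem.
    'one' = one-box, 'two' = two-box. The predictor is perfect: it
    predicts the action actually taken on each round."""
    rng = random.Random(seed)
    totals = {"one": 0.0, "two": 0.0}
    counts = {"one": 0, "two": 0}
    for _ in range(rounds):
        if counts["one"] == 0:
            act = "one"                        # try each action at least once
        elif counts["two"] == 0:
            act = "two"
        elif rng.random() < eps:
            act = rng.choice(["one", "two"])   # explore
        else:                                  # exploit: best average so far
            act = max(counts, key=lambda a: totals[a] / counts[a])
        predicted_one_box = (act == "one")     # perfect prediction
        opaque_box = 1_000_000 if predicted_one_box else 0
        reward = opaque_box if act == "one" else opaque_box + 1_000
        totals[act] += reward
        counts[act] += 1
    return max(counts, key=lambda a: totals[a] / counts[a])
```

One-boxing always yields $1M here and two-boxing always yields $1K, so the learned "causal" belief is exactly the one Vanessa describes: taking one box causes the money to be there.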

Besides the MIRI-vs-learning frame, I agree with a lot of this. I wrote a comment elsewhere making some related points about the need for a learning-theoretic approach. Some of the points also relate to my CDT=EDT sequence; I have been arguing that CDT and EDT don't behave as people broadly imagine (often lacking the bad behavior commonly attributed to them). Some of those arguments were learning-theoretic while others were not, but the conclusions were similar either way.

In any case, I think the following criterion (originally mentioned to me by Jack Gallagher) makes sense:

A decision problem should be conceived as a sequence, but the algorithm deciding what to do on a particular element of the sequence should not know/care what the whole sequence is.

Asymptotic decision theory was the first major proposal to conceive of decision problems as sequences in this way. Decision-problem-as-sequence allows decision theory to be addressed learning-theoretically; we can't expect a learning agent to necessarily do well in any particular case (because it could have a sufficiently poor prior, and so still be learning in that particular case), but we can expect it to eventually perform well (provided the problem meets some "fairness" conditions which make it learnable).

As for the second part of the criterion, requiring that the agent is ignorant of the overall sequence when deciding what to do on an instance: this captures the idea of learning from logic. Providing the agent with the sequence is cheating, because you're essentially giving the agent your interpretation of the situation.

Jack mentioned this criterion to me in a discussion of averaging decision theory (AvDT), in order to explain why AvDT was cheating.

AvDT is based on a fairly simple idea: look at the average performance of a strategy so far, rather than its expected performance on this particular problem. Unfortunately, "performance so far" requires things to be defined in terms of a training sequence (counter to the logical-induction philosophy of non-sequential learning).

I created AvDT to try and address some shortcomings of asymptotic decision theory (let's call it AsDT). Specifically, AsDT does not do well in counterlogical mugging. AvDT is capable of doing well in counterlogical mugging. However, it depends on the training sequence. Counterlogical mugging requires the agent to decide on the "probability" of Omega asking for money vs paying up, to figure out whether participation is worth it overall. AvDT solves this problem by looking at the training sequence to see how often Omega pays up. So, the problem of doing well in decision problems is "reduced" to specifying good training sequences. This (1) doesn't obviously make things easier, and (2) puts the work on the human trainers.

Jack is saying that the system should be looking through logic on its own to find analogous scenarios to generalize from. When judging whether a system gets counterlogical mugging right, we have to define counterlogical mugging as a sequence to enable learning-theoretic analysis; but the agent has to figure things out on its own.

This is a somewhat subtle point. A realistic agent experiences the world sequentially, and learns by treating its history as a training sequence of sorts. This is physical time. I have no problem with this. What I'm saying is that if an agent is also learning from analogous circumstances within logic, as I suggested sophisticated agents will do in the first part, then Jack's condition should come into play. We aren't handed, from on high, a sequence of logically defined scenarios which we can locate ourselves within. We only have regular physical time, plus a bunch of hypothetical scenarios which we can define and whose relevance we have to determine.

This gets back to my earlier intuition about agents having a reasonable chance of getting certain things right on the first try. Learning-theoretic agents don't get things right on the first try. However, agents who learn from logic have "lots of tries" before their first real try in physical time. If you can successfully determine which logical scenarios are relevantly analogous to your own, you can learn what to do just by thinking. (Of course, you still need a lot of physical-time learning to know enough about your situation to do that!)

So, getting back to Vanessa's point in the comment I quoted: can we solve MIRI-style decision problems by considering the iterated problem, rather than the single-shot version? To a large extent, I think so: in logical time, all games are iterated games. However, I don't want to have to set an agent up with a training sequence in which it encounters those specific problems many times. For example, finding good strategies in chess via self-play should come naturally from the way the agent thinks about the world, rather than being an explicit training regime which the designer has to implement. Once the rules for chess are understood, the bottleneck should be thinking time rather than (physical) training instances.


### A Critique of Functional Decision Theory

LessWrong.com News - September 13, 2019 - 22:23
Published on September 13, 2019 7:23 PM UTC

A Critique of Functional Decision Theory

NB: My writing this note was prompted by Carl Shulman, who suggested we could try a low-time-commitment way of attempting to understand the disagreement between some folks in the rationality community and academic decision theorists (including myself, though I’m not much of a decision theorist). Apologies that it’s sloppier than I’d usually aim for in a philosophy paper, and lacking in appropriate references. And, even though the paper is pretty negative about FDT, I want to emphasise that my writing this should be taken as a sign of respect for those involved in developing FDT. I’ll also caveat that I’m unlikely to have time to engage in the comments; I thought it was better to get this out there all the same, rather than delay publication further.

1. Introduction

There’s a long-running issue where many in the rationality community take functional decision theory (and its variants) very seriously, but the academic decision theory community does not. But there’s been little public discussion of FDT from academic decision theorists (one exception is here); this note attempts to partly address this gap.

So that there’s a clear object of discussion, I’m going to focus on Yudkowsky and Soares’ ‘Functional Decision Theory’ (which I’ll refer to as Y&S), though I also read a revised version of Soares and Levinstein’s Cheating Death in Damascus.

This note is structured as follows. Section II describes causal decision theory (CDT), evidential decision theory (EDT) and functional decision theory (FDT). Sections III-VI describe problems for FDT: (i) that it sometimes makes bizarre recommendations, recommending an option that is certainly lower-utility than another option; (ii) that it fails to one-box in most instances of Newcomb’s problem, even though the correctness of one-boxing is supposed to be one of the guiding motivations for the theory; (iii) that it results in implausible discontinuities, where what is rational to do can depend on arbitrarily small changes to the world; and (iv) that, because there’s no real fact of the matter about whether a particular physical process implements a particular algorithm, it’s deeply indeterminate what FDT’s implications are. In section VII I discuss the idea that FDT ‘does better at getting utility’ than EDT or CDT; I argue that Y&S’s claims to this effect are unhelpfully vague, and on any more precise way of understanding their claim, aren’t plausible. In section VIII I briefly describe a view that captures some of the motivation behind FDT, and in my view is more plausible. I conclude that FDT faces a number of deep problems and has little to say in its favour.

In what follows, I’m going to assume a reasonable amount of familiarity with the debate around Newcomb’s problem.

II. CDT, EDT and FDT

Informally: CDT, EDT and FDT differ in which non-causal correlations they care about when evaluating a decision. For CDT, what you cause to happen is all that matters; if your action correlates with some good outcome, that’s nice to know, but it’s not relevant to what you ought to do. For EDT, all correlations matter: you should pick whichever action gives you the highest expected utility conditional on your choosing it. For FDT, only some non-causal correlations matter, namely only those correlations between your action and events elsewhere in time and space that would be different in the (logically impossible) worlds in which the output of the algorithm you’re running is different. Other than for those correlations, FDT behaves in the same way as CDT.

Formally, where S represents states of nature, A, B etc. represent acts, P is a probability function, U(Si & A) represents the utility the agent gains from the outcome of choosing A given state Si, and ‘≽’ represents the ‘at least as choiceworthy as’ relation:

On EDT, A ≽ B iff ∑P(Si|A)U(Si & A) ≥ ∑P(Si|B)U(Si & B)

Where ‘|’ represents conditional probability.

On CDT, A ≽ B iff ∑P(Si\A)U(Si & A) ≥ ∑P(Si\B)U(Si & B)

Where ‘\’ is a ‘causal probability function’ that represents the decision-maker’s judgments about her ability to causally influence the events in the world by doing a particular action. Most often, this is interpreted in counterfactual terms (so P(S\A) represents something like the probability of S coming about were I to choose A), but it needn’t be.

On FDT, A ≽ B iff ∑P(Si†A)U(Si & A) ≥ ∑P(Si†B)U(Si & B)

Where I introduce the operator ‘†’ to represent the special sort of function that Yudkowsky and Soares propose, where P(S†A) represents the probability of S occurring were the output of the algorithm that the decision-maker is running, in this decision situation, to be A. (I’m not claiming that it’s clear what this means. E.g. see here, second bullet point, arguing there can be no such probability function, because any probability function requires certainty in logical facts and all their entailments. I also note that strictly speaking FDT doesn’t assess acts in the same sense that CDT assesses acts; rather, it assesses algorithmic outputs, and that Y&S have a slightly different formal set-up than the one I describe above. I don’t think this will matter for the purposes of this note, though.)
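To see concretely how the EDT and CDT formulas come apart, we can plug illustrative numbers into the standard Newcomb problem (the 0.99 predictor accuracy and the 0.5 prior are my assumptions, not from the text):

```python
# Illustrative Newcomb numbers: the opaque box holds $1M iff one-boxing
# was predicted; the transparent box always holds $1K.
ACC = 0.99  # assumed predictor accuracy (my choice, for illustration)
U = {("one", "full"): 1_000_000, ("one", "empty"): 0,
     ("two", "full"): 1_001_000, ("two", "empty"): 1_000}

# EDT uses P(Si|A): the act is evidence about the prediction.
P_edt = {("one", "full"): ACC, ("one", "empty"): 1 - ACC,
         ("two", "full"): 1 - ACC, ("two", "empty"): ACC}

# CDT uses P(Si\A): the act cannot causally affect the already-made
# prediction, so state probabilities are act-independent (0.5 prior here).
P_cdt = {(a, s): 0.5 for a in ("one", "two") for s in ("full", "empty")}

def eu(P, act):
    """Expected choiceworthiness of an act under a (act, state) table."""
    return sum(P[(act, s)] * U[(act, s)] for s in ("full", "empty"))
```

With these numbers EDT one-boxes (990,000 vs 11,000) while CDT two-boxes (501,000 vs 500,000), which is the classic divergence the formulas encode.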

With these definitions on board, we can turn to objections to FDT.

III. FDT sometimes makes bizarre recommendations

The criterion that Y&S regard as most important in assessing a decision theory is ‘amount of utility achieved’. I think that this idea is importantly underspecified (which I discuss more in section VII), but I agree with the spirit of it. FDT, however, does very poorly by that criterion, on any precisification of it.

In particular, consider the following principle:

Guaranteed Payoffs: In conditions of certainty — that is, when the decision-maker has no uncertainty about what state of nature she is in, and no uncertainty about what the utility payoff of each action would be — the decision-maker should choose the action that maximises utility.

That is: for situations where there’s no uncertainty, we don’t need to appeal to expected utility theory in any form to work out what we ought to do. You just ought to do whatever will give you the highest utility payoff. This should be a constraint on any plausible decision theory. But FDT violates that principle.

Consider the following case:

Bomb. You face two open boxes, Left and Right, and you must take one of them. In the Left box, there is a live bomb; taking this box will set off the bomb, setting you ablaze, and you certainly will burn slowly to death. The Right box is empty, but you have to pay $100 in order to be able to take it. A long-dead predictor predicted whether you would choose Left or Right, by running a simulation of you and seeing what that simulation did. If the predictor predicted that you would choose Right, then she put a bomb in Left. If the predictor predicted that you would choose Left, then she did not put a bomb in Left, and the box is empty. The predictor has a failure rate of only 1 in a trillion trillion. Helpfully, she left a note, explaining that she predicted that you would take Right, and therefore she put the bomb in Left. You are the only person left in the universe. You have a happy life, but you know that you will never meet another agent again, nor face another situation where any of your actions will have been predicted by another agent. What box should you choose?

The right action, according to FDT, is to take Left, in the full knowledge that as a result you will slowly burn to death. Why? Because, using Y&S’s counterfactuals, if your algorithm were to output ‘Left’, then it would also have outputted ‘Left’ when the predictor made the simulation of you, and there would be no bomb in the box, and you could save yourself $100 by taking Left. In contrast, the right action on CDT or EDT is to take Right.
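The arithmetic behind the two verdicts can be made explicit. The disutility assigned to burning to death below is an arbitrary placeholder of mine; FDT's verdict flips only if that disutility is made worse than roughly $100 / 10^-24 = 10^26 dollars' worth.

```python
FAIL = 1e-24        # predictor failure rate: one in a trillion trillion
V_DEATH = -1e9      # placeholder disutility for burning to death (assumed)
V_RIGHT = -100      # pay $100 to take the empty Right box

# FDT's subjunctive: "were my algorithm to output Left, the simulation
# would (almost surely) have output Left too, so Left would hold no bomb."
eu_left_fdt = FAIL * V_DEATH + (1 - FAIL) * 0.0   # roughly -1e-15

# CDT/EDT condition on what is actually known: the note says the bomb
# IS in Left, so taking Left means certain death.
eu_left_known = V_DEATH
```

So under Y&S's counterfactuals Left dominates Right by $100 to a vanishing expected loss, while under the known facts Left loses catastrophically; this is exactly the conflict with Guaranteed Payoffs.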

The recommendation is implausible enough. But if we stipulate that in this decision-situation the decision-maker is certain of the outcome that her actions would bring about, we see that FDT violates Guaranteed Payoffs.

(One might protest that no good Bayesian would ever have credence 1 in an empirical proposition. But, first, that depends on what we count as ‘evidence’ — if a proposition is part of your evidence base, you have credence 1 in it. And, second, we could construct very similar principles to Guaranteed Payoffs that don’t rely on the idea of certainty, but on approximations to certainty.)

Note that FDT’s recommendation in this case is much more implausible than even the worst of the prima facie implausible recommendations of EDT or CDT. So, if we’re going by appeal to cases, or by ‘who gets more utility’, FDT is looking very unmotivated.

IV. FDT fails to get the answer Y&S want in most instances of the core example that’s supposed to motivate it

On FDT, you consider what things would look like in the closest (logically impossible) world in which the algorithm you are running were to produce a different output than what it in fact does. Because, so the argument goes, in Newcomb problems the predictor is also running your algorithm, or a ‘sufficiently similar’ algorithm, or a representation of your algorithm, you consider the correlation between your action and the predictor’s prediction (even though you don’t consider other sorts of correlations.)

However, the predictor needn’t be running your algorithm, or have anything like a representation of that algorithm, in order to predict whether you’ll one box or two-box. Perhaps the Scots tend to one-box, whereas the English tend to two-box. Perhaps the predictor knows how you’ve acted prior to that decision. Perhaps the Predictor painted the transparent box green, and knows that’s your favourite colour and you’ll struggle not to pick it up. In none of these instances is the Predictor plausibly doing anything like running the algorithm that you’re running when you make your decision. But they are still able to predict what you’ll do. (And bear in mind that the Predictor doesn’t even need to be very reliable. As long as the Predictor is better than chance, a Newcomb problem can be created.)
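The "better than chance" point is a matter of simple arithmetic. With the standard payoffs ($1M opaque, $1K transparent), we can solve for the accuracy at which even EDT's evidential calculation starts favouring one-boxing (this threshold calculation is mine, added for illustration):

```python
def edt_prefers_one_box(p):
    """p = predictor accuracy. Payoffs: $1M opaque box, $1K transparent."""
    eu_one = p * 1_000_000
    eu_two = (1 - p) * 1_001_000 + p * 1_000
    return eu_one > eu_two

# Setting eu_one = eu_two and solving gives p = 1_001_000 / 2_000_000
# = 0.5005: a predictor only barely better than chance already turns
# the situation into a Newcomb problem.
```

And nothing in that calculation requires the predictor to be running, or representing, the decision-maker's algorithm.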

In fact, on the vast majority of ways that the Predictor could predict your behavior, she isn’t running the algorithm that you are running, or representing it. But if the Predictor isn’t running the algorithm that you are running, or representing it, then, on the most natural interpretation, FDT will treat this as ‘mere statistical correlation’, and therefore act like CDT. So, in the vast majority of Newcomb cases, FDT would recommend two-boxing. But the intuition in favour of one-boxing in Newcomb cases was exactly what was supposed to motivate FDT in the first place.

Could we instead interpret FDT such that it doesn’t have to require the Predictor to be running the exact algorithm — some similar algorithm would do? But I’m not sure how that would help: in the examples given above, the Predictor’s predictions aren’t based on anything like running your algorithm. In fact, the predictor may know very little about you, perhaps only whether you’re English or Scottish.

One could suggest that, even though the Predictor is not running a sufficiently similar algorithm to you, nonetheless the Predictor’s prediction is subjunctively dependent on your decision (in the Y&S sense of ‘subjunctive’). But, without any account of Y&S’s notion of subjunctive counterfactuals, we just have no way of assessing whether that’s true or not. Y&S note that specifying an account of their notion of counterfactuals is an ‘open problem,’ but the problem is much deeper than that.  Without such an account, it becomes completely indeterminate what follows from FDT, even in the core examples that are supposed to motivate it — and that makes FDT not a new decision theory so much as a promissory note.

Indeed, on the most plausible ways of cashing this out, it doesn’t give the conclusions that Y&S would want. If I imagine the closest world in which 6288 + 1048 = 7336 is false (Y&S’s example), I imagine a world with laws of nature radically unlike ours — because the laws of nature rely, fundamentally, on the truths of mathematics, and if one mathematical truth is false then either (i) mathematics as a whole must be radically different, or (ii) all mathematical propositions are true because it is simple to prove a contradiction and every proposition follows from a contradiction. Either way, when I imagine worlds in which FDT outputs something different than it in fact does, then I imagine valueless worlds (no atoms or electrons, etc) — and this isn’t what Y&S are wanting us to imagine.

Alternatively (as Abram Demski suggested to me in a comment), Y&S could accept that the decision-maker should two-box in the cases given above. But then, it seems to me, that FDT has lost much of its initial motivation: the case for one-boxing in Newcomb’s problem didn’t seem to stem from whether the Predictor was running a simulation of me, or just using some other way to predict what I’d do.

V. Implausible discontinuities

A related problem is as follows: FDT treats ‘mere statistical regularities’ very differently from predictions. But there’s no sharp line between the two. So it will result in implausible discontinuities. There are two ways we can see this.

First, take some physical process S (like the lesion from the Smoking Lesion) that causes a ‘mere statistical regularity’ (it’s not a Predictor). And suppose that the existence of S tends to cause both (i) one-boxing tendencies and (ii) there being money in the opaque box when decision-makers face Newcomb problems. If it’s S alone that results in the Newcomb set-up, then FDT will recommend two-boxing.

But now suppose that the pathway by which S causes there to be money in the opaque box or not is that another agent looks at S and, if the agent sees that S will cause decision-maker X to be a one-boxer, then the agent puts money in X’s opaque box. Now, because there’s an agent making predictions, the FDT adherent will presumably want to say that the right action is one-boxing. But this seems arbitrary — why should the fact that S’s causal influence on whether there’s money in the opaque box goes via another agent make such a big difference? And we can think of all sorts of spectrum cases in between the ‘mere statistical regularity’ and the full-blooded Predictor: What if the ‘predictor’ is a very unsophisticated agent that doesn’t even understand the implications of what they’re doing? What if they only partially understand the implications of what they’re doing? For FDT, there will be some point of sophistication at which the agent moves from simply being a conduit for a causal process to instantiating the right sort of algorithm, and suddenly FDT will switch from recommending two-boxing to recommending one-boxing.

Second, consider that same physical process S, and consider a sequence of Newcomb cases, each of which gradually makes S more complicated and agent-y, making it progressively more similar to a Predictor making predictions. On FDT, there will be some point in this sequence at which there’s a sharp jump: prior to that point, FDT would recommend that the decision-maker two-box; after that point, FDT would recommend that the decision-maker one-box. But it’s very implausible that there’s some S such that a tiny change in its physical makeup should affect whether one ought to one-box or two-box.

VI. FDT is deeply indeterminate

Even putting the previous issues aside, there’s a fundamental way in which FDT is indeterminate, which is that there’s no objective fact of the matter about whether two physical processes A and B are running the same algorithm or not, and therefore no objective fact of the matter of which correlations represent implementations of the same algorithm or are ‘mere correlations’ of the form that FDT wants to ignore. (Though I’ll focus on ‘same algorithm’ cases, I believe that the same problem would affect accounts of when two physical processes are running similar algorithms, or any way of explaining when the output of some physical process, which instantiates a particular algorithm, is Y&S-subjunctively dependent on the output of another physical process, which instantiates a different algorithm.)

To see this, consider two calculators. The first calculator is like calculators we are used to. The second calculator is from a foreign land: it’s identical except that the numbers it outputs always come with a negative sign (‘–’) in front of them when you’d expect there to be none, and no negative sign when you expect there to be one.  Are these calculators running the same algorithm or not? Well, perhaps on this foreign calculator the ‘–’ symbol means what we usually take it to mean — namely, that the ensuing number is negative — and therefore every time we hit the ‘=’ button on the second calculator we are asking it to run the algorithm ‘compute the sum entered, then output the negative of the answer’. If so, then the calculators are systematically running different algorithms.

But perhaps, in this foreign land, the ‘–’ symbol, in this context, means that the ensuing number is positive and the lack of a ‘–’ symbol means that the number is negative. If so, then the calculators are running exactly the same algorithms; their differences are merely notational.

Ultimately, in my view, all we have, in these two calculators, are just two physical processes. The further question of whether they are running the same algorithm or not depends on how we interpret the physical outputs of the calculator. There is no deeper fact about whether they’re ‘really’ running the same algorithm or not. And in general, it seems to me, there’s no fact of the matter about which algorithm a physical process is implementing in the absence of a particular interpretation of the inputs and outputs of that physical process.

But if that’s true, then, even in the Newcomb cases where a Predictor is simulating you, it’s a matter of choice of symbol-interpretation whether the predictor ran the same algorithm that you are now running (or a representation of that same algorithm). And the way you choose that symbol-interpretation is fundamentally arbitrary. So there’s no real fact of the matter about whether the predictor is running the same algorithm as you. It’s indeterminate how you should act, given FDT: you should one-box, given one way of interpreting the inputs and outputs of the physical process the Predictor is running, but two-box given an alternative interpretation.

Now, there’s a bunch of interesting work on concrete computation, trying to give an account of when two physical processes are performing the same computation. The best response that Y&S could make to this problem is to provide a compelling account of when two physical processes are running the same algorithm that gives them the answers they want.  But almost all accounts of computation in physical processes have the issue that very many physical processes are running very many different algorithms, all at the same time. (Because most accounts rely on there being some mapping from physical states to computational states, and there can be multiple mappings.) So you might well end up with the problem that in the closest (logically impossible) world in which FDT outputs something other than what it does output, not only do the actions of the Predictor change, but so do many other aspects of the world. For example, if the physical process underlying some aspect of the US economy just happened to be isomorphic with FDT’s algorithm, then in the logically impossible world where FDT outputs a different algorithm, not only does the predictor act differently, but so does the US economy. And that will probably change the value of the world under consideration, in a way that’s clearly irrelevant to the choice at hand.

VII. But FDT gets the most utility!

Y&S regard the most important criterion to be ‘utility achieved’, and think that FDT does better than all its rivals in this regard. Though I agree with something like the spirit of this criterion, its use by Y&S is unhelpfully ambiguous. To help explain this, I’ll go on a little detour to present some distinctions that are commonly used by academic moral philosophers and, to a lesser extent, decision theorists. (For more on these distinctions, see Toby Ord’s DPhil thesis.)

Evaluative focal points

An evaluative focal point is an object of axiological or normative evaluation. (‘Axiological’ means ‘about goodness/badness’; ‘normative’ means ‘about rightness/wrongness’. If you’re a consequentialist, x is best iff it’s right, but if you’re a non-consequentialist the two can come apart.) When doing moral philosophy or decision theory, the most common evaluative focal points are acts, but we can evaluate other things too: characters, motives, dispositions, sets of rules, beliefs, and so on.

Any axiological or normative theory needs to specify which focal point it is evaluating. The theory can evaluate a single focal point (e.g. act utilitarianism, which only evaluates acts) or many (e.g. global utilitarianism, which evaluates everything).

The theory can also differ on whether it is direct or indirect with respect to a given evaluative focal point. For example, Hooker’s rule-consequentialism is a direct theory with respect to sets of rules, and an indirect theory with respect to acts: it evaluates sets of rules on the basis of their consequences, but evaluates acts with respect to how they conform to those sets of rules. Because of this, on Hooker’s view, the right act need not maximize good consequences.

Criterion of rightness vs decision procedure

In chess, there’s a standard by which it is judged who has won the game, namely, the winner is whoever first puts their opponent’s king into checkmate. But relying solely on that standard of evaluation isn’t going to go very well if you actually want to win at chess. Instead, you should act according to some other set of rules and heuristics, such as: “if white, play e4 on the first move,” “don’t get your Queen out too early,” “rooks are worth more than bishops” etc.

A similar distinction can be made for axiological or normative theories. The criterion of rightness, for act utilitarianism, is, “The right actions are those actions which maximize the sum total of wellbeing.”  But that’s not the decision procedure one ought to follow. Instead, perhaps, you should rely on rules like ‘almost never lie’, ‘be kind to your friends and family’, ‘figure out how much you can sustainably donate to effective charities, and do that,’ and so on.

For some people, in fact, learning that utilitarianism is true will cause them to become worse utilitarians by the utilitarian’s criterion of rightness! (Perhaps you start to come across as someone who uses others as means to an end, and that hinders your ability to do good.) By the utilitarian criterion of rightness, someone could in principle act rightly in every decision, even though they have never heard of utilitarianism, and therefore never explicitly tried to follow utilitarianism.

These distinctions and FDT

From Y&S, it wasn’t clear to me whether FDT is really meant to assess acts, agents, characters, decision procedures, or outputs of decision procedures, and it wasn’t clear to me whether it is meant to be a direct or an indirect theory with respect to acts, or with respect to outputs of decision procedures. This is crucial, because it’s relevant to which decision theory ‘does best at getting utility’.

With these distinctions in hand, we can see that Y&S employ multiple distinct interpretations of their key criterion. Sometimes, for example, Y&S talk about how “FDT agents” (which I interpret as ‘agents who follow FDT to make decisions’) get more utility, e.g.:

• “Using one simple and coherent decision rule, functional decision theorists (for example) achieve more utility than CDT on Newcomb’s problem, more utility than EDT on the smoking lesion problem, and more utility than both in Parfit’s hitchhiker problem.”
• “We propose an entirely new decision theory, functional decision theory (FDT), that maximizes agents’ utility more reliably than CDT or EDT.”
• “FDT agents attain high utility in a host of decision problems that have historically proven challenging to CDT and EDT: FDT outperforms CDT in Newcomb’s problem; EDT in the smoking lesion problem; and both in Parfit’s hitchhiker problem.”
• “It should come as no surprise that an agent can outperform both CDT and EDT as measured by utility achieved; this has been known for some time (Gibbard and Harper 1978).”
• “Expanding on the final argument, proponents of EDT, CDT, and FDT can all agree that it would be great news to hear that a beloved daughter adheres to FDT, because FDT agents get more of what they want out of life. Would it not then be strange if the correct theory of rationality were some alternative to the theory that produces the best outcomes, as measured in utility? (Imagine hiding decision theory textbooks from loved ones, lest they be persuaded to adopt the “correct” theory and do worse thereby!) We consider this last argument—the argument from utility—to be the one that gives the precommitment and value-of-information arguments their teeth. If self-binding or self-blinding were important for getting more utility in certain scenarios, then we would plausibly endorse those practices. Utility has primacy, and FDT’s success on that front is the reason we believe that FDT is a more useful and general theory of rational choice.”

Sometimes Y&S talk about how different decision theories produce more utility on average if they were to face a specific dilemma repeatedly:

• “Measuring by utility achieved on average over time, CDT outperforms EDT in some well-known dilemmas (Gibbard and Harper 1978), and EDT outperforms CDT in others (Ahmed 2014b).”
• “Imagine an agent that is going to face first Newcomb’s problem, and then the smoking lesion problem. Imagine measuring them in terms of utility achieved, by which we mean measuring them by how much utility we expect them to attain, on average, if they face the dilemma repeatedly. The sort of agent that we’d expect to do best, measured in terms of utility achieved, is the sort who one-boxes in Newcomb’s problem, and smokes in the smoking lesion problem.”
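To make this ‘average utility when facing the dilemma repeatedly’ reading concrete, here’s a toy expected-utility calculation for Newcomb’s problem. The predictor accuracy and payoffs are my own illustrative choices, not figures from the Y&S paper:

```python
# Toy expected-utility calculation for a repeated Newcomb's problem
# (accuracy and payoffs are illustrative, not from the Y&S paper).
ACCURACY = 0.99          # probability the predictor guesses your choice correctly
OPAQUE_BOX = 1_000_000   # contents of the opaque box if one-boxing was predicted
CLEAR_BOX = 1_000        # contents of the transparent box

# One-boxer: receives the opaque box's million iff predicted correctly.
eu_one_box = ACCURACY * OPAQUE_BOX

# Two-boxer: always gets the clear box, plus the million iff mispredicted.
eu_two_box = CLEAR_BOX + (1 - ACCURACY) * OPAQUE_BOX

# Roughly 990,000 vs 11,000 per encounter, so the habitual one-boxer
# does far better on average over repeated plays.
assert eu_one_box > eu_two_box
```

Note that this computes the utility achieved by an agent who reliably one-boxes, averaged over encounters; it does not by itself settle the causal decision theorist’s claim about what is rational in any single encounter, which is exactly the ambiguity at issue.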

Sometimes Y&S talk about which agent will achieve more utility ‘in expectation’, though they don’t define the point at which they gain more expected utility (or what notion of ‘expected utility’ is being used):

• “One-boxing in the transparent Newcomb problem may look strange, but it works. Any predictor smart enough to carry out the arguments above can see that CDT and EDT agents two-box, while FDT agents one-box. Followers of CDT and EDT will therefore almost always see an empty box, while followers of FDT will almost always see a full one. Thus, FDT agents achieve more utility in expectation.”

Sometimes they talk about how much utility ‘decision theories tend to achieve in practice’:

• “It is for this reason that we turn to Newcomblike problems to distinguish between the three theories, and demonstrate FDT’s superiority, when measuring in terms of utility achieved.”
• “we much prefer to evaluate decision theories based on how much utility they tend to achieve in practice.”

Sometimes they talk about how well the decision theory does in a circumscribed class of cases (though they note in footnote 15 that they can’t define what this class of cases are):

• “FDT does appear to be superior to CDT and EDT in all dilemmas where the agent’s beliefs are accurate and the outcome depends only on the agent’s behavior in the dilemma at hand. Informally, we call these sorts of problems “fair problems.””
• “FDT, we claim, gets the balance right. An agent who weighs her options by imagining worlds where her decision function has a different output, but where logical, mathematical, nomic, causal, etc. constraints are otherwise respected, is an agent with the optimal predisposition for whatever fair dilemma she encounters.”

And sometimes they talk about how much utility the agent would receive in different possible worlds than the one she finds herself in:

• “When weighing actions, Fiona simply imagines hypotheticals corresponding to those actions, and takes the action that corresponds to the hypothetical with higher expected utility—even if that means imagining worlds in which her observations were different, and even if that means achieving low utility in the world corresponding to her actual observations.”

As we can see, the most common formulation of this criterion is that they are looking for the decision theory that, if run by an agent, will produce the most utility over their lifetime. That is, they’re asking what the best decision procedure is, rather than what the best criterion of rightness is, and are providing an indirect account of the rightness of acts, assessing acts in terms of how well they conform with the best decision procedure.

But, if that’s what’s going on, there are a whole bunch of issues to dissect. First, it means that FDT is not playing the same game as CDT or EDT, which are proposed as criteria of rightness, directly assessing acts. So it’s odd to have a whole paper comparing them side-by-side as if they are rivals.

Second, what decision theory does best, if run by an agent, depends crucially on what the world is like. To see this, let’s go back to the question that Y&S ask of what decision theory I’d want my child to have. This depends on a whole bunch of empirical facts: if she might have a gene that causes cancer, I’d hope that she adopts EDT; though if, for some reason, I knew whether or not she did have that gene and she didn’t, I’d hope that she adopts CDT. Similarly, if there were long-dead predictors who can no longer influence the way the world is today, then, if I didn’t know what was in the opaque boxes, I’d hope that she adopts EDT (or FDT); if I did know what was in the opaque boxes (and she didn’t) I’d hope that she adopts CDT. Or, if I’m in a world where FDT-ers are burned at the stake, I’d hope that she adopts anything other than FDT.

Third, the best decision theory to run is not going to look like any of the standard decision theories. I don’t run CDT, or EDT, or FDT, and I’m very glad of it; it would be impossible for my brain to handle the calculations of any of these decision theories every moment. Instead I almost always follow a whole bunch of rough-and-ready and much more computationally tractable heuristics; and even on the rare occasions where I do try to work out the expected value of something explicitly, I don’t consider the space of all possible actions and all states of nature that I have some credence in — doing so would take years.

So the main formulation of Y&S’s most important principle doesn’t support FDT. And I don’t think that the other formulations help much, either. Criteria of how well ‘a decision theory does on average and over time’, or ‘when a dilemma is issued repeatedly’, run into similar problems as the primary formulation of the criterion. Assessing by how well the decision-maker does in possible worlds that she isn’t in fact in doesn’t seem a compelling criterion (and EDT and CDT could both do well by that criterion, too, depending on which possible worlds one is allowed to pick).

Fourth, arguing that FDT does best in a class of ‘fair’ problems, without being able to define what that class is or why it’s interesting, is a pretty weak argument. And, even if we could define such a class of cases, claiming that FDT ‘appears to be superior’ to EDT and CDT in the classic cases in the literature is simply begging the question: CDT adherents claim that two-boxing is the right action (which gets you more expected utility!) in Newcomb’s problem; EDT adherents claim that smoking is the right action (which gets you more expected utility!) in the smoking lesion. The question is which of these accounts is the right way to understand ‘expected utility’; they’ll therefore all differ on which of them do better in terms of getting expected utility in these classic cases.

Finally, in a comment on a draft of this note, Abram Demski said that: “The notion of expected utility for which FDT is supposed to do well (at least, according to me) is expected utility with respect to the prior for the decision problem under consideration.” If that’s correct, it’s striking that this criterion isn’t mentioned in the paper. But it also doesn’t seem compelling as a principle by which to evaluate between decision theories, nor does it seem FDT even does well by it. To see both points: suppose I’m choosing between an avocado sandwich and a hummus sandwich, and my prior was that I prefer avocado, but I’ve since tasted them both and gotten evidence that I prefer hummus. The choice that does best in terms of expected utility with respect to my prior for the decision problem under consideration is the avocado sandwich (and FDT, as I understood it in the paper, would agree). But, uncontroversially, I should choose the hummus sandwich, because I prefer hummus to avocado.

VIII. An alternative approach that captures the spirit of FDT’s aims

Academic decision theorists tend to focus on which actions are rational, and talk much less about what sort of agent to become. Something that’s distinctive and good about the rationalist community’s discussion of decision theory is that there’s more of an emphasis on what sort of agent to be, and what sorts of rules to follow.

But this is an area where we can eat our cake and have it. There’s nothing to stop us assessing agents, acts and anything else we like in terms of our favourite decision theory.

Let’s define: Global expected utility theory =df for any x that is an evaluative focal point, the right x is that which maximises expected utility.

I think that Global CDT can get everything we want, without the problems that face FDT. Consider, for example, the Prisoner’s Dilemma. On the global version of CDT, we can say both that (i) the act of defecting is the right action (assuming that the other agent will use their money poorly); and that (ii) the right sort of person to be is one who cooperates in prisoner’s dilemmas.

(ii) would be true, even though (i) is true, if you will face repeated prisoner’s dilemmas, if whether or not you find yourself in opportunities to cooperate depends on whether or not you’ve cooperated in the past, if other agents can tell what sort of person you are even independently of your actions in Prisoner’s Dilemmas, and so on. Similar things can be said about blackmail cases and about Parfit’s Hitchhiker. And similar things can be said more broadly about what sort of person to be given consequentialism — if you become someone who keeps promises, doesn’t tell lies, sticks up for their friends (etc), and who doesn’t analyse these decisions in consequentialist terms, you’ll do more good than someone who tries to apply the consequentialist criterion of rightness for every decision.
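A minimal sketch of this act-vs-disposition distinction. The payoffs are the standard Prisoner’s Dilemma values; the assumption that other agents can see and mirror your disposition is my own simplification of the mechanisms listed above (repeated play, reputation, and so on):

```python
# Standard Prisoner's Dilemma payoffs; the "transparency" assumption
# (others can observe and mirror your disposition) is a simplification.
PAYOFF = {  # (my move, their move) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def lifetime_utility(disposition, rounds=100):
    """Others observe your disposition and respond in kind."""
    their_move = disposition  # transparent dispositions get mirrored
    return sum(PAYOFF[(disposition, their_move)] for _ in range(rounds))

# (ii) The cooperative disposition earns more over a lifetime (300 vs 100)...
assert lifetime_utility("C") > lifetime_utility("D")

# ...even though (i) defecting dominates within any single round,
# holding the other player's move fixed:
assert PAYOFF[("D", "C")] > PAYOFF[("C", "C")]
assert PAYOFF[("D", "D")] > PAYOFF[("C", "D")]
```

The point of the sketch is that both claims are evaluated by one and the same (causal-expected-utility) standard; they just take different evaluative focal points, the act and the disposition.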

(Sometimes behaviour like this is described as ‘rational irrationality’. But I don’t think that’s an accurate description. It’s not that one and the same thing (the act) is both rational and irrational. Instead, we continue to acknowledge that the act is the irrational one; we just also acknowledge that it results from the disposition that it is rational to have.)

There are other possible ways of capturing some of the spirit of FDT, such as a sort of rule-consequentialism, where the right set of rules to follow are those that would produce the best outcome if all agents followed those rules, and the right act is that which conforms to that set of rules. But I think that global causal decision theory is the most promising idea in this space.

IX. Conclusion

In this note, I argued that FDT faces multiple major problems. In my view, these are fatal to FDT in its current form. I think it’s possible that, with very major work, a version of FDT could be developed that could overcome some of these problems (in particular, the problems described in sections IV, V and VI, that are based, in one way or another, on the issue of when two processes are Y&S-subjunctively dependent on one another). But it’s hard to see what the motivation for doing so is: FDT in any form will violate Guaranteed Payoffs, which should be one of the most basic constraints on a decision theory; and if, instead, we want to seriously undertake the project of what decision-procedure is the best for an agent to run (or ‘what should we code into an AI?’), the answer will be far messier, and far more dependent on particular facts about the world and the computational resources of the agent in question, than any of EDT, CDT or FDT.


### Theory of Ideal Agents, or of Existing Agents?

LessWrong.com News - September 13, 2019 - 20:38
Published on September 13, 2019 5:38 PM UTC

Within the context of AI safety, why do we want a theory of agency? I see two main reasons:

• We expect AI to have agenty properties, so we want to know how to design agenty systems which e.g. perform well in a maximally wide array of environments, and which reason about themselves and self-modify while maintaining their goals. The main use-case is to design an artificial agent.
• We want AI to model humans as agenty, so we need a theory of e.g. how things made of atoms can “want” things, model the world, or have ontologies, and how to reliably back out those wants/models/ontologies from physical observables. The main use-case is to describe existing agenty systems.

The ideas involved overlap to a large extent, but I’ve noticed some major differences in what kinds of questions researchers ask, depending on which of these two goals they’re usually thinking about.

One type of difference seems particularly central: results which identify one tractable design within some class, vs. results which characterize all designs within that class.

• If the goal is to design an artificial agenty system with ideal properties, then the focus is on existence-type proofs: given some properties, find a design which satisfies them.
• If the goal is to describe agenty systems in the wild, then the focus is on characterizing all such systems: given some properties, show that any system which satisfies them must have some specific form or additional properties.

• If the goal is to design one system with the best performance we can achieve in the widest variety of environments possible, then we’ll want the strongest properties we can get. On the other hand, it’s fine if there’s lots of possible agenty things which don’t satisfy our strong properties.
• If the goal is to describe existing agents, then we’ll want to describe the widest variety of possible agenty things we can. On the other hand, it’s ok if the agents we describe don’t have very strong properties, as long as the properties they do have are realistic.

As an example, consider logical induction. Logical induction was a major step forward in designing agent systems with strong properties - e.g. eventually having sane beliefs over logic statements despite finite resources. On the other hand, for the most part it doesn’t help us describe existing agenty systems much - bacteria or cats or (more debatably) humans probably don’t have embedded logical inductors.

Diving more into different questions/subgoals:

• Logical counterfactuals, Lobian reasoning, and the like are much more central to designing artificial agents than to describing existing agents (although still relevant).
• Detecting agents in the environment, and backing out their models/goals, is much more central to describing existing agents than to designing artificial agents (although still relevant).
• The entire class of robust delegation problems is mainly relevant to designing ideal agents, and only tangentially relevant to describing existing agents.
• Questions about the extent to which agent-like behavior requires agent-like architecture are mainly relevant to describing existing agents, and only tangentially relevant to designing artificial agents.

I’ve been pointing out differences, but of course there’s a huge amount of overlap between the theoretical problems of these two use-cases. Most of the problems of embedded world-models are central to both use-cases, as is the lack of a Cartesian boundary and all the problems which stem from that.

My general impression is that most MIRI folks (at least within the embedded agents group) are more focused on the AI design angle. Personally, my interest in embedded agents originally came from wanting to describe biological organisms, neural networks, markets and other real-world systems as agents, so I’m mostly focused on describing existing agents. I suspect that a lot of the disagreements I have with e.g. Abram stem from these differing goals.

In terms of how the two use-cases “should” be prioritized, I certainly see both as necessary for best-case AI outcomes. Description of existing agents seems more directly relevant to human alignment: in order to reliably point to human values, we need a theory of how things made of atoms can coherently “want” things in a world they can’t fully model. AI design problems seem more related to “scaling up” a human-aligned AI, i.e. having it self-modify and copy itself and whatnot without wireheading or value drift.

I’d be interested to hear from some agency researchers who focus more on the design use-case if all this sounds like an accurate representation of how you’re thinking, or if I’m totally missing the target here. Also, if you think that design-type use-cases really are more central to AI safety than description-type use-cases, or that the distinction isn't useful at all, I’d be interested to hear why.


### Sequences Reading Club

Events at Kocherga - September 13, 2019 - 19:30
This Friday we will discuss whether atoms and artificial intelligence can get angry. We will recall how heat and motion are related: it is obvious now that they are one and the same, but how hard it once was for people to imagine that. And what about the structure of reality is hard for people to imagine now?

### Rationality Dojo. Getting Things Done

Events at Kocherga - September 13, 2019 - 19:30
GTD is one of the best-known systems for managing one's tasks. At the dojo we will talk about the principles of GTD and share advice on how to keep such a system from falling apart.

### Reversal Tests in Argument and Debate

LessWrong.com News - September 13, 2019 - 12:18
Published on September 13, 2019 9:18 AM UTC

One thing that I've noticed recently is that simple reversal tests can be very useful for detecting bias when it comes to evaluating policy arguments or points made in a debate.

In other words, when encountering an argument it can be useful to think "Would I accept this sort of argument if it were being made for the other side?" or perhaps "If the ideological positions here were reversed, would this sort of reasoning be acceptable?"

This can be a very easy check to determine whether there is biased thinking going on. Here are some examples of situations where one might be able to apply this:

• Someone is advocating a locally unpopular belief and being attacked for it. (Ask yourself whether the same sort of advocacy and reasoning would be mocked if it were being made towards locally popular conclusions; ask yourself whether the mockery would be accepted if it were being made against someone locally popular.)
• Someone advocates an easy dismissal of one of the perspectives in an argument. (Ask yourself whether this sort of dismissal would seem reasonable if made against one of your own points.)
• Someone makes arguments against a locally unpopular organization or belief. (Ask yourself whether these arguments would pass muster against something that wasn't already derided locally.)

Often one will find that that sort of argument or reasoning would not, in fact, fly. This can be a good way to check your biases -- people are often prone to accepting weak arguments for things that they already agree with or against things they already disagree with, and stopping to check whether that reasoning would work in the "other direction" is useful.


### Update: Predicted AI alignment event/meeting calendar

LessWrong.com News - September 13, 2019 - 12:05
Published on September 13, 2019 9:05 AM UTC

If you found the Predicted AI alignment event/meeting calendar useful, check it out again. Much new event information came out over the last month.


### Request for stories of when quantitative reasoning was practically useful for you.

LessWrong.com News - September 13, 2019 - 10:21
Published on September 13, 2019 7:21 AM UTC

I'm studying more math and CS these days than I have in the past, and I would like to seize any opportunities to generalize those mental skillsets to other domains. I think that generalization would be easier if I had concrete targets: if I knew of the specific low-level skills that have been useful for folks.

Therefore, I'm looking for anecdotes that express the value of quantitative thinking, and mathematical competency, in "real life". What does that skill set allow you to do? What concrete problems has it solved for you? etc.

Feel free to interpret "quantitative thinking" or "mathematical competency" as broadly as you want. If there's an attitude or mindset that you learned from studying biology, or from building software, and that mindset has proved practically useful for you outside of that domain, please share.


### What You See Isn't Always What You Want

LessWrong.com News - September 13, 2019 - 07:17
Published on September 13, 2019 4:17 AM UTC

I feel somewhat confused about this right now; however, the content feels important to consider. To aid communication, I’m going to append a technical rephrasing after some paragraphs.

It’s known to be hard to give non-trivial goals to reinforcement learning agents. However, I haven’t seen much discussion of the following: even ignoring wireheading, it seems impossible to specify reward functions that get what we want – at least, if the agent is farsighted, smart, and can’t see the entire world all at once, and the reward function only grades what the agent sees in the moment. If this really is impossible in our world, then the designer’s job gets way harder.

It’s well-known that perfect reward specification is hard for non-trivial tasks. However, the situation may be worse: even ignoring wireheading, it could be impossible to supply a reward function such that most optimal policies lead to desirable behavior – at least, if the agent is farsighted and able to compute the optimal policy, the environment is partially observable (which it is, for the real world), and the reward function is Markovian. If so, then the designer’s job becomes far more difficult.

I think it’s important to understand why and how the designer’s job gets harder, but first, the problem.

Let’s suppose that we magically have a reward function which, given an image from the agent’s camera, outputs what an idealized person would think of the image. That is, given an image, suppose a moral and intelligent person considers the image at length (magically avoiding issues of slowly becoming a different person over the course of reflection), figures out how good it is, and produces a scalar rating – the reward.

The problem here is that multiple world states can correspond to the same camera input. Is it good to see a fully black image? I don’t know – what else is going on? Is it bad to see people dying? I don’t know, are they real, or perfectly Photoshopped? I think this point is obvious, but I want to make it so I can move on to the interesting part: there just isn’t enough information to meaningfully grade inputs. Contrast with being able to grade universe-histories via utility functions: just assign 1 to histories that lead to better things than we have right now, and 0 elsewhere.

The problem is that the mapping from world state to images is not at all injective... in contrast, grading universe-histories directly doesn’t have this problem: simply consider an indicator function on histories leading to better worlds than the present (for some magical, philosophically valid definition of “better”).
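A minimal sketch of the non-injectivity point, with an invented two-variable world state and a toy camera model:

```python
# Two distinct world states can produce an identical observation, so a
# reward function over observations cannot distinguish them.
# (The states and the "camera" are invented for illustration.)
def render(world_state):
    """Toy camera: the agent only sees whether the lights are on."""
    return "black image" if not world_state["lights_on"] else "room image"

good_world = {"lights_on": False, "people_ok": True}
bad_world  = {"lights_on": False, "people_ok": False}

# Same pixels, very different worlds: any reward function of the image
# must assign these the same value.
assert render(good_world) == render(bad_world)

# A utility function over world-histories has no such problem:
def utility(history):
    return 1 if all(state["people_ok"] for state in history) else 0

assert utility([good_world]) != utility([bad_world])
```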

Now, this doesn’t mean we need to have systems grading world states. But what I’m trying to get at is, Markovian reward functions are fundamentally underdefined. To say the reward function will incentivize the right things, we have to consider the possibilities available to the agent: which path through time is the best?

The bad thing here is that the reward function is no longer actually grading what the agent sees, but rather trying to output the right things to shape the agent’s behavior in the right ways. For example, to consider the behavior incentivized by a reward function linear in the number of blue pixels, we have to think about how the world is set up. We have to see, oh, this doesn’t just lead to the agent looking at blue objects; rather, there exist better possibilities, like showing yourself solid blue images forever.

But maybe there don’t exist such possibilities – maybe we have in fact made it so the only way to get reward is by looking at blue objects. The only way to tell is by looking at the dynamics – at how the world changes as the agent acts. In many cases, you simply cannot make statements like “the agent is optimizing for X” without accounting for the dynamics.
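A toy sketch of this point, using the post's blue-pixel example (all names and the crude "set of reachable states" model of dynamics are my own illustrative assumptions):

```python
# A reward linear in blue pixels, as in the post's example. Whether it
# incentivizes "look at blue objects" or "tape a blue screen over the
# lens" depends entirely on which states the dynamics make reachable.

def reward(image):
    """Markovian, observation-based reward: count of blue pixels."""
    return sum(1 for px in image if px == "blue")

sky_scene = {"desc": "look at the sky", "image": ["blue", "blue", "white", "blue"]}
screen_hack = {"desc": "blue screen over the lens", "image": ["blue"] * 4}

def best_reachable(states):
    """Model the dynamics crudely as the set of reachable world states;
    the agent steers toward whichever reachable state pays the most."""
    return max(states, key=lambda s: reward(s["image"]))

benign_env = [sky_scene]                 # dynamics forbid the hack
hackable_env = [sky_scene, screen_hack]  # dynamics allow it

# Same reward function, different dynamics, different optimal behavior:
# best_reachable picks sky_scene in benign_env, screen_hack in hackable_env.
```

The reward function is held fixed across both environments; only the reachable-state set changes, which is the sense in which "the agent is optimizing for X" can't be asserted without the dynamics.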

Under this view, alignment isn’t a property of reward functions: it’s a property of a reward function in an environment. This problem is much, much harder: we now have the joint task of designing a reward function such that the best way of stringing together favorable observations lines up with what we want. This task requires thinking about how the world is structured, how the agent interacts with us, the agent’s possibilities at the beginning, how the agent’s learning algorithm affects things…

Yikes.

Qualifications

The argument seems to hold for n-step Markovian reward functions, if n isn’t ridiculously large. If the input observation space is rich, then the problem probably relaxes. The problem isn't present in fully observable environments: by force of theorem (which presently assumes determinism and a finite state space), there exist Markovian reward functions whose only optimal policy is desirable.

This doesn’t apply to e.g. Iterated Distillation and Amplification (updates based on policies), or Deep RL from Human Preferences (observation trajectories are graded). That is, you can get a wider space of optimal behaviors by updating policies on information other than a Markovian reward.

It’s quite possible (and possibly even likely) that we use an approach for which this concern just doesn’t hold. However, this “what you see” concept feels important to understand, and serves as the billionth argument against specifying Markovian observation-based reward functions.

Thanks to Rohin Shah and TheMajor for feedback.

Discuss

### Who's an unusual thinker that you recommend following?

LessWrong.com news - September 13, 2019 - 07:11
Published on September 13, 2019 4:11 AM UTC

Some (optional!) desiderata to guide your recommendation:

• They probably have not been heard of around these parts.
• They disagree with a consensus.
• They have genuinely original thoughts.
• You have updated from their views and/or been emotionally affected.
• They're actively doing some interesting thing(s) and seem effective at those things.
• Reading/listening doesn't feel like an obtuse poetic maze; they attempt to convey things relatively clearly.
• Maybe they use an unusual method of communication.

Feel free to disregard these criteria if it doesn't work for answering the question! :)

Discuss

### London Rationalish meetup (part of SSC meetups everywhere)

LessWrong.com news - September 12, 2019 - 23:32
Published on September 12, 2019 8:32 PM UTC

Discuss

### The Power to Understand "God"

LessWrong.com news - September 12, 2019 - 21:38
Published on September 12, 2019 6:38 PM UTC

This is Part VI of the Specificity Sequence

“God” is a classic case of something that people should be more specific about. Today’s average college-educated person talking about God sounds like this:

Liron: Do you believe in God?

Stephanie: Yeah, I’d say I believe in God, in some sense.

Liron: Ok, in what sense?

Stephanie: Well, no one really knows, right? Like, what’s the meaning of life? Why is there something instead of nothing? Even scientists admit that we don’t understand these things. I respect that there’s a deeper kind of truth out there.

Liron: Ok, I remember you said your parents are Christian and they used to make you go to church sometimes… are you still a Christian now or what?

Stephanie: At this point I wouldn’t be dogmatic about any one religion, but I wouldn’t call myself an atheist either. Maybe I could be considered agnostic? I feel like all the different religions have an awareness that there’s this deep force, or energy, or whatever you want to call it.

Liron: Ok, so, um… do you pray to God?

Stephanie: No, I don’t really say prayers anymore, but I prayed when I was younger and I appreciate the ritual. The Western religions believe in praying to God and being saved. The Eastern religions believe in meditation and striving toward enlightenment. There are a lot of different paths to the same spiritual truth… I just have faith that there is a truth, you know?

What did you think of Stephanie’s answers? I’m pretty sure most people would be like, “Yeah, sounds right to me. I’m just glad you didn’t ask me to explain it. I really would have struggled if you’d put me on the spot like that, but she did a pretty good job.”

If we were asking Stephanie about her fantasy sports team picks, we’d expect her to explain what she believes and why she believes it, grounding her claims in specific predictions about the outcomes of future sportsball games.

But there’s a social norm that when we ask someone about “God”, it’s okay for them to squirt an ink cloud of “truth”, “energy”, “enlightenment”, “faith”, and so on, and make their getaway.

Let’s activate our specificity powers. We’ve seen that the best way to define a term is often to ground the term, to slide it down the ladder of abstraction. How would Stephanie ground her concept of “God”? Her attempts might look like this:

• The universal force that people pray to.
• The energy that makes the universe exist.
• The destination that all religions lead to.

Ooh, do you notice how these descriptions of "God" have an aura of poetic mystery if you read them out loud? Unfortunately, any mystery they have means they’re doing a shitty job grounding the concept in concrete terms. I’m calling in dialogue-Liron.

It would also be interesting to ask Stephanie what Sam Harris asked a guest on his Making Sense podcast: “Would you believe in God if God didn’t exist?” Is there something in our external reality, beyond your own preferences for what words you were taught to say and what words you feel good saying, that can in principle be flipped one way or the other to determine whether or not you believe in God? I suspect the answer is no.

Now consider how the dialogue would have gone differently if I were talking to Bob, a bible literalist who believes that God answers his prayers:

Liron: Let’s say we’ve never talked with each other about God before, so neither of us knows yet whether the other believes in God or not. If we just go about our lives, when would we first notice that we don’t see eye to eye about the universe?

Bob: You’ll see me kneeling in prayer, and then you’ll see my prayers are more likely to get answered. Like when I know someone is in the hospital, I pray for their speedy recovery, and the folks I pray for will usually recover more speedily than the folks no one prays for.

Liron: Oh okay, cool. So we can ground your “God” as “the thing which makes people recover in the hospital faster when you pray for them”. Congratulations, that’s a nicely grounded concept whose associated phenomena extend beyond verbalizations you make.

Unlike Stephanie’s concept, Bob’s concept of “God” has earned the respectable status of being concrete. Bob’s prayer-answering God is as concrete as Helios, the chariot-pulling sun god.

I wonder if this party died when someone invented dark glasses to look at the sun with.

From a specificity standpoint, Stephanie’s concept of God is empty and weird, while Bob’s concept is completely normal and fine.

On the other hand, Bob’s concept is demonstrably wrong, and Stephanie’s isn’t. But it’s easy to be not-wrong when you’re talking about empty concepts. Here’s Stephanie being not-wrong about “spirits”:

Liron: Do you believe in spirits?

Stephanie: Yes.

Liron: Ok, how do I ground “spirit”? How do I know when to label something as a spirit?

Stephanie: They’re a special type of beings.

Liron: Do you hear them speak to you?

Stephanie: No, I don’t think so.

Liron: If tomorrow spirits suddenly didn’t exist anymore, do you think you’d notice?

Stephanie: Hm, I don’t know about that.

Ladies and gentlemen, she’s not wrong!

Of course, she’s not right either, but that’s okay with her. She doesn’t enjoy talking about “God”, and when she does she’s only playing to not-lose, not playing to win.

Equipped with the power of specificity, it’s easy for us to observe the emptiness of Stephanie’s “God”. It’s harder for us to explain why all the intelligent Stephanies of the world are choosing to utter the sentence, “I believe in [empty concept]”.

Eliezer Yudkowsky traces the historical lineage of how Bobs (God-believers) begat Stephanies (God-believers-in):

Back in the old days, people actually believed their religions instead of just believing in them. The biblical archaeologists who went in search of Noah’s Ark did not think they were wasting their time; they anticipated they might become famous. Only after failing to find confirming evidence — and finding disconfirming evidence in its place — did religionists execute what William Bartley called the retreat to commitment, “I believe because I believe.”

In Taboo Your Words, Eliezer uses the power of specificity to demolish the nice-sounding claim that religions are all paths to the same universal truth:

The illusion of unity across religions can be dispelled by making the term “God” taboo, and asking them to say what it is they believe in; or making the word “faith” taboo, and asking them why they believe it.

Though mostly they won’t be able to answer at all, because it is mostly profession in the first place, and you cannot cognitively zoom in on an audio recording.

In your own life, try to avoid the word “God”, and just discuss what you specifically want to discuss. If your conversation partner introduces the word “God”, your best move is to ground their terms. Establish what specifically they’re talking about, then have a conversation about that specific thing.

Next post: The Power to Be Emotionally Mature (coming next week)

Discuss

### The Power to Solve Climate Change

LessWrong.com news - September 12, 2019 - 21:37
Published on September 12, 2019 6:37 PM UTC

This is Part V of the Specificity Sequence

Companion post: Examples of Examples

Most people agree that climate change is a big problem we should be solving, but couldn't tell you what specifically "solving climate change" means. By the end of this post, I promise you'll know what specifically "solving climate change" means... and also what it doesn't mean.

The Climate Crisis

The first step is knowing the key specific facts of the climate crisis. I admit I couldn't recite them accurately before writing this post. Here's a quick summary from Tomorrow and NASA (both great friendly explanations worth checking out):

• Earth's average temperature has shot up by 1°C in the last 50 years.
• The causal link from greenhouse gas emissions to Earth's rising temperature has been well established.
• On a 1M-year timescale, Earth's temperature has been fluctuating plus or minus a few degrees tops, so this 1°C change is a big fluctuation.
• We're predicting it to be as high as a 6°C warming by 2100, so it's actually a huge fluctuation.
• The current rate of temperature rise is a whopping 20 times faster than the rate at which Earth's temperature historically fluctuates, so it's actually an oh, SHIT fluctuation.
Thiel's Definite vs. Indefinite Attitude

In Peter Thiel's smart and highly original book Zero to One: Notes on Startups, or How to Build the Future, the chapter called "You Are Not A Lottery Ticket" centers around the concept of definite vs. indefinite attitudes:

You can expect the future to take a definite form or you can treat it as hazily uncertain. If you treat the future as something definite, it makes sense to understand it in advance and work to shape it. But if you expect an indefinite future ruled by randomness, you’ll give up on trying to master it.

Indefinite attitudes to the future explain what’s most dysfunctional in our world today. Process trumps substance: when people lack concrete plans to carry out, they use formal rules to assemble a portfolio of various options. This describes Americans today. In middle school, we’re encouraged to start hoarding “extracurricular activities”. In high school, ambitious students compete even harder to appear omnicompetent. By the time a student gets to college, he’s spent a decade curating a bewilderingly diverse résumé to prepare for a completely unknowable future. Come what may, he’s ready—for nothing in particular.

A definite view, by contrast, favors firm convictions. Instead of pursuing many-sided mediocrity and calling it “well-roundedness,” a definite person determines the one best thing to do and then does it. [...] This is not what young people do today, because everyone around them has long since lost faith in a definite world.

When you have a definite attitude, you do stuff like:

• Claim that skill is more important than luck (or you can "make your own luck")
• Back a political candidate who has a long-term plan you believe in
• Start a company to make something people want
• Work on megaprojects
• Invest actively in companies and expect a high return

While if you have an indefinite attitude, you do stuff like:

• Claim that success "seems to stem as much from context as from personal attributes" (a quote from Malcolm Gladwell's Outliers)
• Follow election polls to see which candidate is the most popular this week
• Work on a bunch of smaller projects to become well-rounded and keep your options open
• Invest passively in a portfolio of stocks and bonds and expect low returns

You've probably guessed why Thiel's distinction is relevant to our exploration of specificity: Activating your specificity powers is basically the same thing as switching from an indefinite to a definite attitude.

When a startup lacks a specific story about how they'll create value for specific users, but they're working frantically to build a product anyway, they're revealing an attitude of what Thiel calls indefinite optimism: "The future will be better than the present, even though I can't tell you specifically how." Consistent with my post about judging startup ideas, Thiel says:

A company is the strangest place of all for an indefinite optimist: why should you expect your own business to succeed without a plan to make it happen?

(I consider it ironic that Thiel's Founders Fund has an investment in Golden, but it was a relatively small amount and unrepresentative of their normal decision-making.)

Thiel says that the western world had an era of definite optimism beginning in the 17th century and lasting through to the 1950s and '60s:

In 1914, the Panama Canal cut short the route from Atlantic to Pacific. [...] The Interstate Highway System began construction in 1956, and the first 20,000 miles of road were open for driving by 1965. [...] NASA's Apollo Program began in 1961 and put 12 men on the moon before it finished in 1972.

But, he says, the shared attitude in the US and Europe has now shifted to indefinite:

The government used to be able to coordinate complex solutions to problems like atomic weaponry and lunar exploration. But today, after 40 years of indefinite creep, the government mainly just provides insurance; our solutions to big problems are Medicare, Social Security, and a dizzying array of other transfer payment programs.

In order to solve the climate crisis, I think we desperately need a cultural shift back to definite optimism. We need specific plans for solving our problems.

Definite Climate Change Solutions

Here are all the efforts I've heard of that are aiming at definite solutions to climate change.

United Nations Framework Convention on Climate Change

The United Nations Framework Convention on Climate Change (UNFCCC) is an international treaty whose stated objective is:

To stabilize greenhouse gas concentrations in the atmosphere at a level that would prevent dangerous anthropogenic interference with the climate system.

The UNFCCC feels like it comes from an alternate universe containing an adequate civilization sane enough to coordinate top-down solutions in the face of existential threats. In fact, in the actual timeline we're in, it was ratified in 1994 by all member states of the United Nations and the European Union.

Under the UNFCCC's guidelines, all 197 parties meet every year and negotiate more specific international treaties. This is how we got the 1997 Kyoto Protocol (84 signatories) and the even-more-of-a-big-deal 2016 Paris Agreement (195 signatories).

But while the UNFCCC provides definite per-country targets for greenhouse gas reduction, it doesn't provide definite mechanisms or incentives for countries to hit the targets.

At one point, the Kyoto Protocol tried to fix the problem by defining binding commitments on 37 countries to reduce their greenhouse gas emissions by specific amounts, but only seven of them ratified that part.

The Paris Agreement doesn't even attempt anything as definite as that:

The Paris Agreement has a 'bottom up' structure in contrast to most international environmental law treaties, which are 'top down', characterized by standards and targets set internationally, for states to implement. Unlike its predecessor, the Kyoto Protocol, which sets commitment targets that have legal force, the Paris Agreement, with its emphasis on consensus-building, allows for voluntary and nationally determined targets. [Source]

And:

There will be only a "name and shame" system or as János Pásztor, the U.N. assistant secretary-general on climate change, told CBS News (US), a "name and encourage" plan. As the agreement provides no consequences if countries do not meet their commitments, consensus of this kind is fragile. A trickle of nations exiting the agreement could trigger the withdrawal of more governments, bringing about a total collapse of the agreement. [Source]

So guess what? Countries haven't been trying that hard to live up to the UNFCCC's targets:

A pair of studies in Nature have said that, as of 2017, none of the major industrialized nations were implementing the policies they had envisioned and have not met their pledged emission reduction targets, and even if they had, the sum of all member pledges (as of 2016) would not keep global temperature rise "well below 2 °C". Source

Ouch.

But on a positive note, we have this international framework-treaty, the UNFCCC, which is at least trying to define specific targets. That's what the start of a good definite solution looks like, no matter how bad the follow-through has been so far.

Y Combinator's Request for Carbon Removal Technologies

In 2018, Sam Altman and Y Combinator announced an ongoing Request for Carbon Removal Technology Startups. They're interested in funding and advising anyone working in the space, whether it's a for-profit company or nonprofit research. (Consider applying!)

While the 25-year-old UNFCCC has always been focused on renewable energy and reducing greenhouse gas emissions, YC's Request for Startups focuses entirely on removing greenhouse gases from the atmosphere, because we're now in a more advanced phase of the climate crisis than we used to be:

"Phase 1" of climate change is reversible by reducing emissions, but we are no longer in "Phase 1." We're now in "Phase 2" and stopping climate change requires both emission reduction and removing CO2 from the atmosphere. "Phase 2" is occurring faster and hotter than we thought. If we don't act soon, we'll end up in "Phase 3" and be too late for both of these strategies to work.

YC names four major categories of frontier technologies which they admit "straddle the border between very difficult and science fiction", but are nevertheless worth exploring:

In a 2012 Stanford lecture, Thiel lamented that these kinds of definite proposals are outside the Overton window:

It’s worth noting that something like geoengineering [projects to save the environment] would fall in the definite optimistic quadrant. Maybe we could scatter iron filings throughout the ocean to induce phytoplankton to absorb carbon dioxide. Potential solutions of that nature are not even remotely in the public debate. Only radically indefinite things make for acceptable discourse.

But YC's RFS announcement was met with positive headlines like "Carbon removal tech is having a moment" and got 1187 points on Hacker News. While, sadly, it didn't get much attention at all outside the tech community, it's fair to say that using frontier technologies to remove carbon from the atmosphere is now acceptable discourse.

Project Drawdown

Project Drawdown evaluates specific solutions to the climate change problem by modeling their expected impact on greenhouse gas reduction. Here's their ranking of 80 solutions.

Can you guess what's the #1 solution they recommend as highest impact in terms of total reduction in atmospheric greenhouse gases?

...

...

...

Ok here are the top five:

Refrigerant management?! I totally thought the top project was going to be nuclear, but nuclear is only #20. Apparently it's expensive and slow, and not just because of regulations, but also because of technical and economic factors. I'm still optimistic that a nuclear Manhattan Project is a good investment, but I guess it's not a slam dunk compared to the other top-20 projects.

Bret Victor's List

Bret Victor's nicely-designed What Can A Technologist Do About Climate Change? page is a thoughtful list of areas for definite action:

• Public and private investment
• Implementation details of efficient clean energy production
• Transporting energy
• Coordinating energy consumption
• Energy-efficient devices
• Tools for scientists and engineers
• Media for understanding situations
• Nuclear power
• Geoengineering
• Foundational technology

Without attempting to be comprehensive, he offers many specific ways that technologists can attack the climate change problem.

Tesla

Tesla is the only company with a $44B market cap whose core mission is closely tied to solving the climate crisis. According to its about page, Tesla's mission is to accelerate the world's transition to sustainable energy.

As we've seen with SpaceX, Elon Musk is a master of running a company according to a definite multi-decade strategy. In his 2006 post, The Secret Tesla Motors Master Plan (just between you and me), he lays it out like this: Build sports car. Use that money to build an affordable car. Use that money to build an even more affordable car. While doing above, also provide zero emission electric power generation options.

From our current perspective 13 years later, we can see that Musk accurately predicted the future by building it:

• Tesla built home solar panels and Powerwall to enable a home to run entirely on solar energy by collecting sporadic bursts of sunlight and releasing the energy later when it's needed. They also provide industrial-scale solar energy production and storage.
• Tesla built Gigafactory 1, the biggest battery factory in the world, to provide batteries for its electric vehicles and Powerwalls. Ok, it's actually bigger than all other battery factories in the world combined... ok, it's actually the biggest building in the world in terms of floor space... you get the idea.
• After building the Roadster (a $100,000 sports car) and Model S (a $75,000 "affordable car"), Tesla built the Model 3, an all-electric luxury sedan that retails at $35,000 USD whose sales completely dominated the Small + Midsize Luxury Cars category in December 2018.

I recommend this incredible WaitButWhy post for a deeper dive into Tesla.

Thiel writes:

A business with a good definite plan will always be underrated in a world where people see the future as random.

This is basically why I'm currently holding a significant amount of Tesla stock. (Plus I think that markets generally don't price in the degree of demonstrated product and engineering excellence.)

When I started thinking about how to solve climate change, the first thing that popped into mind is what I tweeted a couple months ago:

The economy is this amazing system we have for balancing all the stuff we care about without guilting anyone into righteous sacrifice. Just internalize the CO2 externality into the economy.

While Project Drawdown purposely doesn't include Cap and Trade in their list—

We do not model incentive-based policies and financial mechanisms, such as carbon and congestion pricing, because they would be guesses, not models.

—it sure seems like cap and trade would go a long way toward solving the problem: It incentivizes greenhouse gas emitters to reduce emissions, it incentivizes private investment in frontier tech to reduce atmospheric CO2, and it incentivizes a bunch of other creative stuff like private efforts to destroy HFC refrigerants.

President Obama tried to get congressional approval for a version of Cap and Trade, but he couldn't. President Trump doesn't seem to care for it either. So when it comes to an economic-policy solution for climate change, the world's largest economy is currently doing nothing.

Indefinite Climate Change Solutions

We’ve seen a bunch of definite ways people can help get closer to solving climate change:

• Work for Tesla
• Work on any of the areas in Bret Victor’s list
• Build a company or do research under Y Combinator’s Request for Carbon Removal Technologies
• Work on a political campaign to support Cap and Trade and other pro-environment policies
• Donate to Project Drawdown or any other organizations where you can understand the specific causal chain from their work to a better climate

But what does indefinite thinking about climate change look like? For most people, climate change is a very indefinite thing:

• They can’t summarize the problem like I did in the beginning of this post
• They can’t name any of the specific actions above that they can take to help solve climate change
• And to make things worse, they focus on the most useless thing: personal behavior change
Personal Behavior Change

I just Googled for "reduce carbon footprint" to grab a ridiculous example of personal behavior change advice. The top result I got is The 35 Easiest Ways to Reduce Your Carbon Footprint by Columbia University's Earth Institute.

Some of the recommendations are just good advice with no downside:

8. Wash your clothing in cold water. The enzymes in cold water detergent are designed to clean better in cold water. Doing two loads of laundry weekly in cold water instead of hot or warm water can save up to 500 pounds of carbon dioxide each year.

Okay if the enzymes in cold water detergent are designed to clean better in cold water, why not!

But most of the recommendations are asking you to make tradeoffs without acknowledging the downside:

12. If you're in the market for a new computer, opt for a laptop instead of a desktop. Laptops require less energy to charge and operate than desktops.

Wait, if you were going to buy a desktop computer, opting for the same-priced laptop instead will probably get you a noticeably slower machine. They should acknowledge these kinds of cost/benefit tradeoffs. But that's not even the biggest problem I have with personal behavior change recommendations.

The biggest problem with personal behavior change is that it naively feels like definite thinking, but it’s actually vague indefinite thinking, because your causal model of how your actions will affect greenhouse gas concentrations is missing the concept of an economic equilibrium.

Our economy, a free-market economy without Cap and Trade, is a classic tragedy of the commons. A low-greenhouse-gas atmosphere is the commons, and opting for a laptop instead of a desktop (or driving less, or buying less stuff, or eating less meat) is the equivalent of restraining your cows from grazing. It just doesn’t make game-theoretic sense.

When you unilaterally dial back your CO2 emissions a certain amount, it means other actors (people, companies and the government) can get away with that same amount of not dialing back theirs, and they’ll act on that incentive. There needs to be an actual coordination mechanism to solve the problem — you know, the kind of thing government is for.
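The equilibrium bounce-back can be made concrete with a toy linear supply-and-demand model (the numbers and functional forms here are my illustrative assumptions, not a claim about real energy markets):

```python
# Toy linear market for fossil energy: demand p = a - q, supply p = q.
# If one consumer unilaterally cuts their demand by one unit, the market
# price falls, everyone else consumes a bit more, and equilibrium
# quantity (a proxy for emissions) drops by only a fraction of the cut.

def equilibrium_quantity(demand_intercept, demand_slope=1.0, supply_slope=1.0):
    # Solve demand_intercept - demand_slope*q = supply_slope*q for q.
    return demand_intercept / (demand_slope + supply_slope)

q_before = equilibrium_quantity(100.0)       # 50.0 units consumed
q_after = equilibrium_quantity(100.0 - 1.0)  # shift demand left by 1 unit
drop = q_before - q_after                    # 0.5: half the cut leaks back
```

With these slopes, a one-unit personal cut reduces equilibrium consumption by only half a unit; the steeper the supply curve relative to demand, the more of the cut leaks back to other consumers.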

I've summarized my thoughts about personal behavior change in two tweets:

To end homelessness, should we all give more spare change? To fix the US budget deficit, should taxpayers voluntarily pay higher taxes? Individuals buying carbon offsets is the same flawed logic. The equilibrium movement of a complex system is not a sum of these local nudges.

If stealing were legal, the solution would be law change, not charity to offset theft. Similarly, our economy lets everyone emit carbon for free. Individuals voluntarily offsetting small amounts of carbon is not an effective response to this situation.

When people bring up personal behavior change as a solution to climate change, this is my reaction:

Steve: I should eat less meat because farm animals contribute to climate change!

Liron: Don't bother; for every animal you don’t eat, I’m going to eat three. By the way, did you know Tesla is hiring?

Steve doesn’t feel guilty that he’s contributing to the budget deficit by only paying the minimum taxes that he owes, so we know he understands economic equilibrium dynamics in that domain. I’m trying to do Steve a favor in this domain by tearing his focus away from personal behavior change, and pointing him toward a specific course of action (joining Tesla) that can actually add up to a complete solution even once you zoom out and factor in everyone's economic incentives.

What about "offsetting" your own impact by paying an organization to reduce a certain amount of greenhouse gas from the atmosphere on your behalf, i.e. buying carbon credits?

But I put "offsetting" in scare quotes because I don't think it's accurate to say that having an individual pay to prevent one ton of CO2 emissions in one place reduces the system equilibrium amount of greenhouse gas emission by one ton of CO2, or even necessarily half a ton. Similarly, I don't think taxpayers voluntarily paying an extra $1k/yr in taxes will actually reduce the budget deficit by$1k/yr/taxpayer. In both cases, some other actor in the system will be getting incentivized to emit more CO2 or to budget for more government spending.

When you just think about one person paying to plant a few trees, it's hard to imagine how the system equilibrium will bounce back against this perturbation. So let's break it down into two possible models: the government-oversight model and the no-government-oversight model.

The Government-Oversight Model

Consider the situation if the US Government were committed to hitting the Paris Agreement target emissions levels each year [graph source]:

In this model, private citizens voluntarily paying to reduce carbon emissions is pointless because it's like paying taxes—everyone already has to do it. Perhaps it's already part of federal discretionary spending, or maybe a Cap and Trade policy has already priced carbon emissions into the goods and services we buy.

Ah, but what if an individual US citizen desires to beat the Paris Agreement targets? Even then, it's tricky to do it by purchasing "offsets". The effect of purchasing "offsets" will depend on the specific government-oversight model of the country where you intervene. For example, if you plant an extra tree in Country X, then you may cause Country X's companies to be allowed to emit more carbon that year.

The No-Government-Oversight Model

Consider the situation where the US Government doesn't bother limiting greenhouse gas emissions, a good approximation of the current Trump administration. The action of a few citizens to offset carbon won't move the needle in solving the climate crisis, until the point where enough citizens are donating that their combined political will can change government policy. So in this model, "offsetting" also has less impact than all the other specific climate change solutions discussed in this post.

Wren is a for-profit startup that launched a few months ago as part of Y Combinator's Summer 2019 batch (not part of YC's Request for Carbon Removal Technologies).

Wren charges about $24/month for the average American to "offset" their carbon footprint. What's the definite future here? What's the specific causal link between Wren's actions and climate change getting solved? Presumably it's that if a significant fraction of Americans pay $24/month ($288/yr), then the US's net carbon emissions will be a lot lower, and then the US could meet its Paris Agreement targets. If every American adult were a member of Wren, this should be sufficient to offset the whole US's carbon emissions, and that implies a cost of only $288/yr × 250M adults = $72B/yr to offset all Americans' carbon.
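The arithmetic above is easy to check as a back-of-the-envelope calculation (the 250M figure for the number of US adults is the post's rough approximation, not official data):

```python
# Back-of-the-envelope check of the implied nationwide cost of
# "offsetting" via Wren, using the post's rough numbers.
monthly_fee = 24             # dollars per person per month (Wren's price)
adults = 250_000_000         # approximate count of US adults

annual_per_person = monthly_fee * 12      # dollars per person per year
total = annual_per_person * adults        # nationwide dollars per year

print(f"${annual_per_person}/yr per person, ${total / 1e9:.0f}B/yr total")
# → $288/yr per person, $72B/yr total
```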

But consider this cost estimate from Washington governor and former 2020 presidential candidate Jay Inslee's 100% Clean Energy for America Plan (which Massachusetts senator and still-presidential-candidate Elizabeth Warren now endorses):

Climate change cost the U.S. economy at least $240 billion per year during the past decade, and that figure is projected to rise to $360 billion per year in the coming 10 years. We cannot afford the cost of inaction.

If the US only cared about economic self-interest, not solving climate change per se, it's apparently worth paying at least $240B/yr to solve the problem. So something feels very indefinite to me about Wren's approach to the problem, and "offsetting" carbon in general. If the solution is such an obvious slam dunk, isn't it faster to build up enough political support for a $72B/yr expenditure, rather than collecting it from individuals paying $288/yr at a time?

On the plus side, I'll admit that "offsetting" your own carbon footprint is at least a better option than personal behavior change, for the same reason that lawyers help more by donating to soup kitchens than by volunteering at them. "Offsetting" does let you direct money, the unit of caring, toward solving climate change. It's just that buying individual units of carbon removal is a bad deal because the system-level effect comes out to be so much weaker than an indefinite thinker imagines.

Activating your specificity powers to solve climate change means directing your resources toward efforts that have a systemically definite model of how they actually help. So while this was largely an object-level post about how to solve climate change, I hope it was also a good meta-level demonstration of the power of specificity.

Next post: The Power to Understand "God" (coming this weekend)

Discuss

### Answering some questions about EA

Новости LessWrong.com - 12 September 2019 - 20:50
Published on September 12, 2019 5:50 PM UTC

A teacher in Vienna recently wrote to say that they had assigned an article about us as part of a Social Issues class, and asked whether I would be up for answering questions. The students asked me their questions via video snippets, and here are my answers:

Gergo: When was the first time you gave money to other people? Did it start when you were younger, or when you were in college?

The first donation I remember was a pair of political donations leading up to the 2008 election.
I gave$50 each to two candidates who were very unlikely to win: Mike Gravel and Ron Paul. I was hoping to expand the range of ideas people were discussing in the primaries.

The next time I donated was \$500 two years later, after deciding to give half of what I earned to charity starting in 2009. I think I decided to make my first donation in 2008 instead of 2009 because I misunderstood how income taxes worked and thought it was better to spread donations across years (in the US it's actually the opposite).

Emilie: How did you get into this to start with and how do you keep up with it? How did you get into the mindset of "we can give so much and that's what we want to do", without flaking out?

When I started my first real job and was suddenly making much more money than I'd ever had before, the contrast between what I had and what I really needed weighed on me. Based on my upbringing and temperament I would have continued living simply and saved the money, but after a lot of discussions with Julia I realized that I wasn't ok having so much while others had so little, and couldn't keep it.

This contrast hasn't gone away, and the level of need in the world remains unconscionably high, so I can't see doing something else.

It also helps that I've written a lot about this publicly, and it would be pretty embarrassing to go back and say "you know what, I'm done with this 'charity' thing, saving up to buy a yacht is just too important to me."

Danilla: At 22 you were broke and even though you had no job you still donated all of your savings to the charity. Why you did that and how did you live at that time if you didn't have any money?

This question is phrased as if it's intended for Julia, but I can give my perspective. While Julia hadn't found a job yet I was working full time, and so I think she felt secure enough to donate her savings. In retrospect I probably should have suggested keeping some, in case I lost my job, but I don't think she would have listened to me anyway.

Danilla: In 2011 I noticed that you didn't make any donations? Did something happen, or did you change your mind?

It was a combination of two things. Julia had decided to go back to school to become a social worker. She went to school full time in order to finish school sooner, which meant she didn't have any income to donate. I had joined a startup which meant that part of my compensation was stock options instead of being entirely cash. I decided that if my options ended up being worth something I would donate them, which made sense to me as a way to handle me being risk-averse for myself but risk-neutral for charity. I no longer think this was a good decision, primarily because I overestimated the value of the options.

Jin: Why do you donate only internationally, even though America currently has a lot of problems? For example, something like 9/11 could happen again, or there could be a natural disaster like an earthquake. Why do you donate globally to help people outside of your country?

While the US does have a lot of problems it's also a very rich country in a strong position to address those problems. Both 9/11 and earthquakes are examples of the kind of thing the US government takes very seriously and has highly-funded agencies to handle. On the other hand, many other countries are much poorer and don't have the same kind of resources to address their problems. For example, we've donated to the Against Malaria Foundation to fund bednet distribution in countries like Malawi, helping people protect themselves from malaria. Malawi is a very poor country, and the US is about 175x richer, per-capita, which means our money is much more able to help there.

This doesn't mean that Americans don't have problems or don't deserve help, but with the current levels of inequality, problems in the US aren't at the top of my list.

Emelie: How long are you continuing this, and what are you planning for the future now that you have a child?

I'm planning to continue trying to use a substantial portion of my time/money to help others as best as I can at least until I retire. Having children (we now have two, 5y and 3y) has been wonderful, but hasn't led me to change my goals or values.

Olivia: Where do you see yourself in ten years? What is your ultimate goal or what you intend to achieve? In the article that I read it said that as your salary began to increase you began to give away more, but where do you see yourself ultimately reaching? What are your limitations and aspirations as you go into your future years?

If I started making substantially more money I could see trying to donate more than 50%, but for now 50% feels like a good place. I might also, at some point, switch from trying to have a positive impact via donations to something similar via directly doing valuable things. I tried this in 2017 and would do it again for the right opportunity.

Neja: What brings you joy in donating money?

This varies a lot between people, but for me it's not really about joy. It's about seeing how wrong things are and trying to make it better. It's about wanting to do the right thing.

I didn't used to feel any emotional connection to giving, but that changed when I had kids. As a parent of small children, looking at how malaria nets mean fewer parents have to bury their children makes me choke up. The world is so incredibly unfair right now, and some people have to go through so much pain, but we can all help make things better. So while I wouldn't call it joy it definitely makes me determined to help.

Aviva: Do you ever have second thoughts about donating? For example, if one month you're on a tight budget do you think, like, "maybe this month we shouldn't donate". Or, does it come naturally for you?

Sofia: Do you ever find yourself in a situation where you could have used the money you gave away to charity, and do you have any regrets about that? Do you have any money saved up for those situations? Do you put money aside for emergency situations or do you try to manage with the money that you have left after donating?

We're fine financially, mostly because we're lucky enough to have lucrative skills, and also because we're reasonably good at living within our means. We've been incredibly fortunate, and I'm very happy with our lives.

On the other hand, when I look at the money we've donated and think about how our lives would be different if we had saved it instead it does make me a bit sad. We could have gone the FIRE route, in which case I think we would have been retiring around now, in our early 30s, with more time for kids and hobbies. Being able to afford to play music full time or create new musical instruments would be really fun. But then I look at how many people are living in poverty, dealing with sickness, hunger, disasters, war, unemployment, and it's very clear to me that this would have been the wrong choice.

Comment via: facebook, the EA Forum

Discuss

### Nonviolent Communication: Practice Session

Events at Kocherga - 12 September 2019 - 19:30
How can you have fewer conflicts without giving up your own interests? Nonviolent communication is a set of skills for reaching mutual understanding with people. Come to our practice sessions to develop these skills and communicate more sensitively and effectively.

### Predictable Identities - Midpoint Review

Новости LessWrong.com - 12 September 2019 - 17:39
Published on September 12, 2019 2:39 PM UTC

Predictable Identities is a series that I’m writing on Ribbonfarm; you can also track its chronological progress on Putanumonit with short summaries of each post. The 17th post in the series is below, summarizing the story so far and providing links to all the preceding posts in a narrative, rather than chronological order.

Predictable Identities is 16 posts in, which is going much better than I would’ve predicted. It’s time to review what we covered so far: the principles of predictive processing and how we apply them to other people.

Our brains constantly predict sensory inputs using a hierarchy of models. Learning new and better models improves our predictive ability in the long term but can be so painful in the short term that we will fight against updating, and often fight the people who force us to update. It’s important to take all models with a grain of salt and resist the lure of all-explaining ideologies.

We predict the world to exploit and act on it, and the same applies to other people. We need to know how to get people to do nice things for us, using stereotypes for strangers we don't know and more detailed mind-simulations for people we do. We don't need detailed models for people unlikely to be nice no matter what, and we're creeped out by people who don't fit our models at all, like those who blow their non-conformity budget. We encourage those around us to conform to our narratives and predictions of them, which means that changing one's opinions and behavior takes great effort in the face of the expectations of one's social surroundings.

Finally, our predictions of ourselves interplay in complex ways with how others see us, since our minds are neither transparent to us nor opaque to those around us. For example, to be treated nicely we must honestly believe that we are nice, even if that is self-deceiving. Our models, predictions, and beliefs about our selves form our identities. This will be the topic for the second half of this blogchain.

Thanks for joining me on this journey, I predict exciting things ahead.

Discuss

### Toy model piece #5: combining partial preferences

Новости LessWrong.com - 12 September 2019 - 06:31
Published on September 12, 2019 3:31 AM UTC

My previous approach to combining preferences went like this: from the partial preferences Pi, create a normalised utility function ˆUi that is defined over all worlds (and which is indifferent to the information that didn't appear in the partial model). Then simply add these utilities, weighted according to the weight/strength of the preference.

But this method fails. Consider for example the following partial preferences, all weighted with the same weight of 1:

• P1: A > C.
• P2: A > B.
• P3: B > C.
• P4: B > D.
• P5: B > E.

If we follow the standard normalisation approach, then the normalised utility ˆU1 will be defined[1] as:

• ˆU1(A)=0.5, ˆU1(C)=−0.5, and otherwise ˆU1(−)=0.

Then adding together all five utility functions would give:

• U(A) = 1, U(B) = 1, U(C) = −1, U(D) = −0.5, U(E) = −0.5.

There are several problems with this utility. Firstly, the utility of A and the utility of B are the same, even though in the only case where there is a direct comparison between them, A is ranked higher. We might say that we are missing the comparisons between A and D and E, and could elicit these preferences using one-step hypotheticals. But what if comparing A to D is a complex preference, and all that happens is that the agent combines A > B and B > D? If we added another partial preference that said B > F, then B would end up ranked above A!

Another, more subtle point is that the difference between A and C is too large. Simply having A > B and B > C would give U(A) − U(C) = 1. Adding in A > C moves this difference to 2. But note that A > C is already implicit in A > B and B > C, so adding it shouldn't make the difference larger.

In fact, if the difference in utility between A and C were larger than 1, adding in A > C should make the difference between U(A) and U(C) smaller: because having A > C weighted at 1 means that the agent's preference of A over C is not that strong.

Energy minimising between utilities

So, how should we combine these preferences otherwise? Well, if I have a preference Pi, of weight wi, that ranks outcome G below outcome H (write this as G <i H), then, if these outcomes appear nowhere else in any partial preference, U(H) − U(G) will be wi.

So in a sense, that partial preference is trying to set the distance between those two outcomes to wi. Call this the energy-minimising condition for Pi.

Then for a utility function U, we can define the energy of U, as compared with the (partially defined) normalised utility ˆUi corresponding to Pi. It is:

• $\sum_{G<_iH}\big(w_i(\hat{U}_i(H)-\hat{U}_i(G))-(U(H)-U(G))\big)^2$.

This is the difference between the weighted distance between the outcomes that wiˆUi gives, and the one that U actually gives.

Because different partial preferences have different numbers of elements to compare, we can compute the average energy of U:

• $E(U,P_i)=\frac{\sum_{G<_iH}\big(w_i(\hat{U}_i(H)-\hat{U}_i(G))-(U(H)-U(G))\big)^2}{\sum_{G<_iH}1}$.

Global energy minimising condition

But weights have another role to play here; they measure not only how much H is preferred to G, but how important it is to reach that preference. So, for humans, "G<H with weight ϵ" means both:

• H is not much preferred to G.
• The human isn't too fussed about the ordering of G and H.

For general agents, these two could be separate phenomena; but for humans, they generally seem to be the same thing. So we can reuse the weights to compute the global energy for U as compared to all partial preferences, which is just the weighted sum of its average energy for each partial preference:

• $E(U,\{P_i\})=\sum_{P_i}w_iE(U,P_i)=\sum_{P_i}w_i\frac{\sum_{G<_iH}\big(w_i(\hat{U}_i(H)-\hat{U}_i(G))-(U(H)-U(G))\big)^2}{\sum_{G<_iH}1}$.

Then the actual ideal U is defined to be the U that minimises this energy term.

Solutions

Now, it's clear this expression is convex. But it need not be strictly convex (which would imply a single solution): for example, if P1 (A > C) and P4 (B > D) were the only partial preferences, then there would be no conditions on the relative utilities of {A,C}, {B,D} and {E}.

Say that H is linked to G, by defining a link as "there exists a Pi with G≤iH or H≤iG", and then making this definition transitive and reflexive (it's automatically symmetric). In the example above, with Pi, 1≤i≤5, all of {A,B,C,D,E} are linked.

Being linked is an equivalence relation. And within a class of linked worlds, if we fix the utility of one world, then the energy minimisation equation becomes strictly convex (and hence has a single solution). Thus, within a class of linked worlds, the energy minimisation equation has a single solution, up to translation.

So if we want a single U, translate the solution for each linked class so that the average utility in that class is equal to the average of every other linked class. And this would then define U uniquely (up to translation).

For example, if we only had P1 (A > C) and P4 (B > D), this could set U to be:

• U(A) = 0.5, U(B) = 0.5, U(C) = −0.5, U(D) = −0.5, U(E) = 0.

Here, the average utility in each linked class ({A,C}, {B,D} and {E}) is 0.
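The linking-and-translation step above can be sketched in code. This is a toy sketch, not the post's implementation; `linked_classes` and `centre_per_class` are hypothetical helper names, and each partial preference is represented just by the pairs of worlds it compares:

```python
from collections import defaultdict

def linked_classes(worlds, compared_pairs):
    # Union-find over worlds: two worlds end up linked when some
    # partial preference compares them; closure makes "linked" an
    # equivalence relation, as in the post.
    parent = {w: w for w in worlds}
    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w
    for a, b in compared_pairs:
        parent[find(a)] = find(b)
    classes = defaultdict(set)
    for w in worlds:
        classes[find(w)].add(w)
    return list(classes.values())

def centre_per_class(U, classes):
    # Translate utilities within each linked class so that every
    # class has average utility 0.
    V = dict(U)
    for cls in classes:
        mean = sum(U[w] for w in cls) / len(cls)
        for w in cls:
            V[w] = U[w] - mean
    return V

# The P1 (A > C) and P4 (B > D) example: three linked classes.
worlds = list("ABCDE")
classes = linked_classes(worlds, [("A", "C"), ("B", "D")])
U = {"A": 0.5, "B": 0.5, "C": -0.5, "D": -0.5, "E": 0.0}
print(sorted(sorted(c) for c in classes))  # [['A', 'C'], ['B', 'D'], ['E']]
print(centre_per_class(U, classes))        # already centred: unchanged
```

Running this on the example reproduces the three classes {A,C}, {B,D}, {E}, each already at average utility 0.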

Applying this to the example

So, applying this approach to the full set of the Pi, 1≤i≤5 above (and fixing U(B)=0), we'd get:

• U(A) = 2/3, U(B) = 0, U(C) = −2/3, U(D) = −1, U(E) = −1.

Here B is in the middle of A and C, as it should be, while the utilities of D and E are defined by their distance from B only. The distance between A and C is 4/3 ≈ 1.33. This is between 2 (which would be given by A > B and B > C only) and 1 (which would be given by A > C only).
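Since the energy is a sum of squared differences, the minimising U for this example can be recovered numerically as a least-squares problem. A sketch, under the assumption that each partial preference contributes a single comparison of weight 1, with U(B) pinned to 0 as above:

```python
import numpy as np

# Each partial preference (better, worse, weight) asks the utility
# difference U(better) - U(worse) to equal its weight; minimising the
# energy is then an ordinary least-squares problem.
prefs = [("A", "C", 1.0), ("A", "B", 1.0), ("B", "C", 1.0),
         ("B", "D", 1.0), ("B", "E", 1.0)]
free = ["A", "C", "D", "E"]            # U(B) is pinned to 0
col = {w: i for i, w in enumerate(free)}

M = np.zeros((len(prefs), len(free)))
t = np.array([w for _, _, w in prefs])
for row, (better, worse, _) in enumerate(prefs):
    if better in col:
        M[row, col[better]] += 1.0
    if worse in col:
        M[row, col[worse]] -= 1.0

x, *_ = np.linalg.lstsq(M, t, rcond=None)
U = {w: float(x[col[w]]) for w in free}
U["B"] = 0.0
print({w: round(U[w], 3) for w in "ABCDE"})  # A ≈ 2/3, C ≈ -2/3, D = E = -1
```

This reproduces the (2/3, 0, −2/3, −1, −1) solution above.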

1. I've divided the normalisation from that post by 2, to fit better with the methods of this post. Dividing everything in a sum by the same constant gives the same equivalence class of utility functions. ↩︎

Discuss

### Toy model piece #4: partial preferences, re-re-visited

Новости LessWrong.com - 12 September 2019 - 06:31
Published on September 12, 2019 3:31 AM UTC

I initially defined partial preferences in terms of foreground variables Y and background variables Z.

Then a partial preference would be defined by y+ and y− in Y, such that, for any z∈Z, the world described by (y+,z) would be better than the world described by (y−,z). The idea being that, everything else being equal (ie the same z), a world with y+ was better than a world with y−. The other assumption is that, within mental models, human preferences can be phrased as one or many binary comparisons. So if we have a partial preference like P1: "I prefer a chocolate ice-cream to getting kicked in the groin", then (y+,z) and (y−,z) are otherwise identical worlds with a chocolate ice-cream and a groin-kick, respectively.

Note that in this formalism, there are two subsets of the set of worlds, y+×Z and y−×Z, and a map l between them (which just sends (y+,z) to (y−,z)).

In a later post, I realised that such a formalism can't capture seemingly simple preferences, such as P2: "n+1 people is better than n people". The problem is that preferences like that don't talk about just two subsets of worlds, but many more.

Thus a partial preference was defined as a preorder. Now, a preorder is certainly rich enough to include preferences like P2, but it allows for far too many different types of structures, needing a complicated energy-minimisation procedure to turn a preorder into a utility function.

This post presents another formalism for partial preferences, that keeps the initial intuition but can capture preferences like P2.

The formalism

Let W be the (finite) set of all worlds, seen as universes with their whole history.

Let X be a subset of W, and let l be an injective (one-to-one) map from X to W. Define Y=l(X), the image of l, and l^−1: Y→X as the inverse.

Then the preference is determined by:

• For all x ∈ X, x > l(x).

If X and Y are disjoint, this just reproduces the original definition, with X=y+×Z and Y=y−×Z.

But it also allows preferences like P2, defining l(x) as something like "the same world as x, but with one less person". In that case, l maps some parts of X to itself.

Then for any element x∈X, we can construct its upwards and downwards chain:

• …, l^−3(x), l^−2(x), l^−1(x), x, l(x), l^2(x), l^3(x), …

These chains end when they cycle: so there is an n and an m so that l^−n(x) = l^m(x) (equivalently, l^(m+n)(x) = x).

If they don't cycle, the upwards chain ends when there is an l^−n(x) which is not an element of Y (hence l^−1 is not defined on it), and the downward chain ends when there is an l^n(x) which is not in X (and hence l is not defined on it).

So, for example, for P1, all the chains contain two elements only: x and l(x). For P2, there are no cycles, and the lower chain ends when the population hits zero, while the upper chain ends when the population hits some maximal value.
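As a toy sketch of the chain construction (assuming l is given as a Python dict on a finite set of worlds, with worlds outside the dict's keys having no image; `chain_of` is a hypothetical helper name):

```python
def chain_of(x, l):
    """Follow x's chain under the injective partial map l (a dict).

    Returns (chain, is_cycle): walks down via l until l is undefined,
    and up via l^-1 until l^-1 is undefined; reports a cycle if
    iterating l from x returns to x.
    """
    inv = {v: k for k, v in l.items()}  # well defined since l is injective
    # Downward: x, l(x), l^2(x), ... until l is undefined (or we cycle).
    down, cur = [x], x
    while cur in l:
        cur = l[cur]
        if cur == x:
            return down, True           # cycled back to x
        down.append(cur)
    # Upward: l^-1(x), l^-2(x), ... until l^-1 is undefined.
    up, cur = [], x
    while cur in inv:
        cur = inv[cur]
        up.append(cur)
    return list(reversed(up)) + down, False

# P2-style preference "n+1 people > n people": l removes one person,
# so the chain through any population 0..3 is 3 > 2 > 1 > 0.
l = {3: 2, 2: 1, 1: 0}
print(chain_of(1, l))        # ([3, 2, 1, 0], False)
print(chain_of(0, {0: 0}))   # ([0], True) -- a one-element cycle
```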

Utilities difference between clearly comparable worlds

Since the worlds of X∪Y decompose either into chains or cycles via l, there is no need for the full machinery for utilities constructed in this post.

One thing we can define unambiguously is the relative utility between two elements of the same chain/cycle:

• If x and y = l^n(x) are in the same cycle, then Ul(x) = Ul(y).
• Otherwise, if x and y = l^n(x) are in the same chain, then Ul(x) − Ul(y) = n.

Now, let's normalise these relative utilities to ˆUl by normalising each chain individually; note that if every world in the chain is reachable, this is the same as the mean-max normalisation on each chain:

• If x and y = l^n(x) are in the same cycle, then ˆUl(x) = ˆUl(y).
• Otherwise, if x and y = l^n(x) are in the same chain with m total elements in the chain, then ˆUl(x) − ˆUl(y) = n/(m−1).
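A minimal sketch of this per-chain normalisation (the chain is assumed to be listed best world first, as produced by following l; `chain_utilities` is a hypothetical helper name, and since only differences are pinned down, the worst world is arbitrarily placed at 0):

```python
def chain_utilities(chain):
    """Normalised utilities along one chain (best world listed first).

    Adjacent worlds differ by 1/(m-1), so U(x) - U(l^n(x)) = n/(m-1);
    the utilities are only defined up to translation, so we arbitrarily
    put the worst world at 0 and the best at 1.
    """
    m = len(chain)
    if m == 1:
        return {chain[0]: 0.0}
    return {w: (m - 1 - i) / (m - 1) for i, w in enumerate(chain)}

# The population chain 3 > 2 > 1 > 0 (m = 4): steps of 1/3.
print(chain_utilities([3, 2, 1, 0]))
```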

We could try to extend ˆUl to a global utility function which compares different chains and compares values in chains with values outside of X∪Y. But as we shall see in the next post, this doesn't work when combining different partial preferences.

Interpretation of l

The interpretation of l is something like "this is the key difference in features that causes the difference in world-rankings". So, for P1, the l switches out a chocolate ice-cream and substitutes a groin-kick. While for P2, the l simply removes one person from the world.

This means that, locally, we can express X∪Y in the same Y×Z formalism as in the first post. Here the Z are the background variables, while Y is a discrete variable that l operates on.

We cannot necessarily express this Y×Z product globally. Consider, for P2, a situation where z0 is an idyllic village, z1 is an Earthbound human population, and z2 a star-spanning civilization with extensive use of human uploads.

And if Y denotes the number of people in each world, it's clear that Y hits a low maximum for z0 (thousands?), can rise much higher for z1 (trillions?), and even higher for z2 (need to use scientific notation). So though (10^20, z2) makes sense, (10^20, z0) is nonsense. So there is no global decomposition of these worlds as Y×Z.

Discuss

### Conversation with Paul Christiano

Новости LessWrong.com - 12 September 2019 - 02:25
Published on September 11, 2019 11:20 PM UTC

AI Impacts talked to AI safety researcher Paul Christiano about his views on AI risk. With his permission, we have transcribed this interview.

Participants
Summary

We spoke with Paul Christiano on August 13, 2019. Here is a brief summary of that conversation:

• AI safety is worth working on because AI poses a large risk, and AI safety is neglected and tractable.
• Christiano is more optimistic about the likely social consequences of advanced AI than some others in AI safety, in particular researchers at the Machine Intelligence Research Institute (MIRI), for the following reasons:
• The prior on any given problem reducing the expected value of the future by 10% should be low.
• There are several ‘saving throws’: ways in which, even if one thing turns out badly, something else can turn out well, such that AI is not catastrophic.
• Many algorithmic problems are either solvable within 100 years, or provably impossible; this inclines Christiano to think that AI safety problems are reasonably likely to be easy.
• MIRI thinks success is guaranteeing that unaligned intelligences are never created, whereas Christiano just wants to leave the next generation of intelligences in at least as good of a place as humans were when building them.
• ‘Prosaic AI’ that looks like current AI systems will be less hard to align than MIRI thinks:
• Christiano thinks there’s at least a one-in-three chance that we’ll be able to solve AI safety on paper in advance.
• A common view within ML is that we’ll successfully solve problems as they come up.
• Christiano has relatively less confidence in several inside view arguments for high levels of risk:
• Building safe AI requires hitting a small target in the space of programs, but building any AI also requires hitting a small target.
• Because Christiano thinks that the state of evidence is less clear-cut than MIRI does (and doesn’t think it warrants a lot of pessimism), he also has a higher probability that people will become more worried in the future.
• Just because we haven’t solved many problems in AI safety yet doesn’t mean they’re intractably hard; many technical problems feel this way and then get solved in 10 years of effort.
• Evolution is often used as an analogy to argue that general intelligence (humans with their own goals) becomes dangerously unaligned with the goals of the outer optimizer (evolution selecting for reproductive fitness). But this analogy doesn’t make Christiano feel so pessimistic, e.g. he thinks that if we tried, we could breed animals that are somewhat smarter than humans and are also friendly and docile.
• Christiano is optimistic about verification, interpretability, and adversarial training for inner alignment, whereas MIRI is pessimistic.
• MIRI thinks the outer alignment approaches Christiano proposes are just obscuring the core difficulties of alignment, while Christiano is not yet convinced there is a deep core difficulty.
• Christiano thinks there are several things that could change his mind and optimism levels, including:
• Learning about institutions and observing how they solve problems analogous to AI safety.
• Seeing whether AIs become deceptive and how they respond to simple oversight.
• Seeing how much progress we make on AI alignment over the coming years.
• Christiano is relatively optimistic about his iterated amplification approach:
• Christiano cares more about making aligned AIs that are competitive with unaligned AIs, whereas MIRI is more willing to settle for an AI with very narrow capabilities.
• Iterated amplification is largely based on learning-based AI systems, though it may work in other cases.
• Even if iterated amplification isn’t the answer to AI safety, it’s likely to have subproblems in common with problems that are important in the future.
• There are still many disagreements between Christiano and the Machine Intelligence Research Institute (MIRI) that are messy and haven’t been made precise.

This transcript has been lightly edited for concision and clarity.

Transcript

Asya Bergal: Okay. We are recording. I’m going to ask you a bunch of questions related to something like AI optimism.

I guess the proposition that we’re looking at is something like ‘is it valuable for people to be spending significant effort doing work that purports to reduce the risk from advanced artificial intelligence’? The first question would be to give a short-ish version of the reasoning around that.

Paul Christiano: Around why it’s overall valuable?

Asya Bergal: Yeah. Or the extent to which you think it’s valuable.

Paul Christiano: I don’t know, this seems complicated. I’m acting from some longtermist perspective, I’m like, what can make the world irreversibly worse? There aren’t that many things, we go extinct. It’s hard to go extinct, doesn’t seem that likely.

Robert Long: We keep forgetting to say this, but we are focusing less on ethical considerations that might affect that. We’ll grant…yeah, with all that in the background….

Paul Christiano: Granting long-termism, but then it seems like it depends a lot on what’s the probability? What fraction of our expected future do we lose by virtue of messing up alignment * what’s the elasticity of that to effort / how much effort?

Robert Long: That’s the stuff we’re curious to see what people think about.

Paul Christiano: They probably did. I don’t remember exactly what’s in there, but it was a lot of words.

I don’t know. I’m like, it’s a lot of doom probability. Like maybe I think AI alignment per se is like 10% doominess. That’s a lot. Then it seems like if we understood everything in advance really well, or just having a bunch of people working on now understanding what’s up, could easily reduce that by a big chunk.

Ronny Fernandez: Sorry, what do you mean by 10% doominess?

Paul Christiano: I don’t know, the future is 10% worse than it would otherwise be in expectation by virtue of our failure to align AI. I made up 10%, it’s kind of a random number. I don’t know, it’s less than 50%. It’s more than 10% conditioned on AI soon I think.

Ronny Fernandez: And that’s change in expected value.

Paul Christiano: Yeah. Anyway, so 10% is a lot. Then I’m like, maybe if we sorted all our shit out and had a bunch of people who knew what was up, and had a good theoretical picture of what was up, and had more info available about whether it was a real problem. Maybe really nailing all that could cut that risk from 10% to 5% and maybe like, you know, there aren’t that many people who work on it, it seems like a marginal person can easily do a thousandth of that 5% change. Now you’re looking at one in 20,000 or something, which is a good deal.

Asya Bergal: I think my impression is that that 10% is lower than some large set of people. I don’t know if other people agree with that.

Paul Christiano: Certainly, 10% is lower than lots of people who care about AI risk. I mean it’s worth saying, that I have this slightly narrow conception of what is the alignment problem. I’m not including all AI risk in the 10%. I’m not including in some sense most of the things people normally worry about and just including the like ‘we tried to build an AI that was doing what we want but then it wasn’t even trying to do what we want’. I think it’s lower now or even after that caveat, than pessimistic people. It’s going to be lower than all the MIRI folks, it’s going to be higher than almost everyone in the world at large, especially after specializing in this problem, which is a problem almost no one cares about, which is precisely how a thousand full time people for 20 years can reduce the whole risk by half or something.

Asya Bergal: I’m curious for your statement as to why you think your number is slightly lower than other people.

Paul Christiano: Yeah, I don’t know if I have a particularly crisp answer. Seems like it’s a more reactive thing of like, what are the arguments that it’s very doomy? A priori you might’ve been like, well, if you’re going to build some AI, you’re probably going to build the AI so it’s trying to do what you want it to do. Probably that’s that. Plus, most things can’t destroy the expected value of the future by 10%. You just can’t have that many things, otherwise there’s not going to be any value left in the end. In particular, if you had 100 such things, then you’d be down to like 1/1000th of your values. 1/10 hundred thousandth? I don’t know, I’m not good at arithmetic.

Anyway, that's a priori: there just aren't that many things that are that bad, and it seems like people would try and make AI that's trying to do what they want. Then you're like, okay, we get to be pessimistic because of some other argument about like, well, we don't currently know how to build an AI which will do what we want. We're like, there's some extrapolation of current techniques on which we're concerned that we wouldn't be able to. Or maybe some more conceptual or intuitive argument about why AI is a scary kind of thing, and AIs tend to want to do random shit.

Then like, I don’t know, now we get into, how strong is that argument for doominess? Then a major thing that drives it is I am like, reasonable chance there is no problem in fact. Reasonable chance, if there is a problem we can cope with it just by trying. Reasonable chance, even if it will be hard to cope with, we can sort shit out well enough on paper that we really nail it and understand how to resolve it. Reasonable chance, if we don’t solve it the people will just not build AIs that destroy everything they value.

It’s lots of saving throws, you know? And you multiply the saving throws together and things look better. And they interact better than that because– well, in one way worse because it’s correlated: If you’re incompetent, you’re more likely to fail to solve the problem and more likely to fail to coordinate not to destroy the world. In some other sense, it’s better than interacting multiplicatively because weakness in one area compensates for strength in the other. I think there are a bunch of saving throws that could independently make things good, but then in reality you have to have a little bit here and a little bit here and a little bit here, if that makes sense. We have some reasonable understanding on paper that makes the problem easier. The problem wasn’t that bad. We wing it reasonably well and we do a bunch of work and in fact people are just like, ‘Okay, we’re not going to destroy the world given the choice.’ I guess I have this somewhat distinctive last saving throw where I’m like, ‘Even if you have unaligned AI, it’s probably not that bad.’
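As a sketch of how independent "saving throws" multiply (the individual failure probabilities below are invented for illustration; they are not numbers Paul gives):

```python
# Probability that each saving throw FAILS -- made-up numbers for illustration.
fail_probs = {
    "there is a real problem at all": 0.6,
    "we can't cope with it just by trying": 0.5,
    "we can't sort it out on paper in advance": 0.7,
    "people build the dangerous AI anyway": 0.5,
}

# Doom requires every saving throw to fail; under independence the
# failure probabilities simply multiply.
p_doom = 1.0
for p in fail_probs.values():
    p_doom *= p

print(p_doom)   # ~0.105, much lower than any single failure probability
```

In reality, as Paul notes, the throws are correlated (competence helps on several at once), so this independence assumption is only a first approximation.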

That doesn’t do much of the work, but you know you add a bunch of shit like that together.

Asya Bergal: That’s a lot of probability mass on a lot of different things. I do feel like my impression is that, on the first step of whether by default things are likely to be okay or things are likely to be good, people make arguments of the form, ‘You have a thing with a goal and it’s so hard to specify. By default, you should assume that the space of possible goals to specify is big, and the one right goal is hard to specify, hard to find.’ Obviously, this is modeling the thing as an agent, which is already an assumption.

Paul Christiano: Yeah. I mean it’s hard to run or have much confidence in arguments of that form. I think it’s possible to run tight versions of that argument that are suggestive. It’s hard to have much confidence in part because you’re like, look, the space of all programs is very broad, and the space that do your taxes is quite small, and we in fact are doing a lot of selecting from the vast space of programs to find one that does your taxes– so like, you’ve already done a lot of that.

And then you have to be getting into more detailed arguments about exactly how hard it is to select. I think there's two kinds of arguments you can make that are different, or which I separate. One is the inner alignment, treacherous-turn argument, where like, we can't tell the difference between AIs that are doing the right and wrong thing, even if you know what's right, because blah blah blah. The other is, well, you don't have this test for 'was it right' and so you can't be selecting for 'does the right thing'.

This is a place where the concern is disjunctive, you have like two different things, they’re both sitting in your alignment problem. They can again interact badly. But like, I don’t know, I don’t think you’re going to get to high probabilities from this. I think I would kind of be at like, well I don’t know. Maybe I think it’s more likely than not that there’s a real problem but not like 90%, you know? Like maybe I’m like two to one that there exists a non-trivial problem or something like that. All of the numbers I’m going to give are very made up though. If you asked me a second time you’ll get all different numbers.
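For reference, "two to one" odds convert to a probability of about 67% (a generic odds-to-probability sketch; the helper name here is illustrative, not anything Paul states):

```python
def odds_to_prob(odds_for: float, odds_against: float) -> float:
    """Convert odds of odds_for : odds_against into a probability."""
    return odds_for / (odds_for + odds_against)

# "maybe I'm like two to one that there exists a non-trivial problem"
print(round(odds_to_prob(2, 1), 3))   # -> 0.667
```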

Asya Bergal: That’s good to know.

Paul Christiano: Sometimes I anchor on past things I’ve said though, unfortunately.

Asya Bergal: Okay. Maybe I should give you some fake past Paul numbers.

Paul Christiano: You could be like, ‘In that interview, you said that it was 85%’. I’d be like, ‘I think it’s really probably 82%’.

Asya Bergal: I guess a related question is, is there plausible concrete evidence that you think could be gotten that would update you in one direction or the other significantly?

Paul Christiano: Yeah. I mean certainly, evidence will roll in once we have more powerful AI systems.

One can learn… I don’t know very much about any of the relevant institutions, I may know a little bit. So you can imagine easily learning a bunch about them by observing how well they solve analogous problems or learning about their structure, or just learning better about the views of people. That’s the second category.

We’re going to learn a bunch of shit as we continue thinking about this problem on paper to see like, does it look like we’re going to solve it or not? That kind of thing. It seems like there’s lots of sorts of evidence on lots of fronts, my views are shifting all over the place. That said, the inconsistency between one day and the next is relatively large compared to the actual changes in views from one day to the next.

Robert Long: Could you say a little bit more about evidence from once more advanced AI starts coming in? Like what sort things you’re looking for that would change your mind on things?

Paul Christiano: Well you get to see things like, on inner alignment you get to see to what extent do you have the kind of crazy shit that people are concerned about? The first time you observe some crazy shit where your AI is like, ‘I’m going to be nice in order to assure that you think I’m nice so I can stab you in the back later.’ You’re like, ‘Well, I guess that really does happen despite modest effort to prevent it.’ That’s a thing you get. You get to learn in general about how models generalize, like to what extent they tend to do– this is sort of similar to what I just said, but maybe a little bit broader– to what extent are they doing crazy-ish stuff as they generalize?

You get to learn about how reasonable simple oversight is and to what extent do ML systems acquire knowledge that simple overseers don’t have that then get exploited as they optimize in order to produce outcomes that are actually bad. I don’t have a really concise description, but sort of like, to the extent that all these arguments depend on some empirical claims about AI, you get to see those claims tested increasingly.

Ronny Fernandez: So the impression I get from talking to other people who know you, and from reading some of your blog posts, but mostly from others, is that you’re somewhat more optimistic than most people that work in AI alignment. It seems like some people who work on AI alignment think something like, ‘We’ve got to solve some really big problems that we don’t understand at all or there are a bunch of unknown unknowns that we need to figure out.’ Maybe that’s because they have a broader conception of what solving AI alignment is like than you do?

Paul Christiano: That seems like it’s likely to be part of it. It does seem like I’m more optimistic than people in general, than people who work in alignment in general. I don’t really know… I don’t understand others’ views that well and I don’t know if they’re that– like, my views aren’t that internally coherent. My suspicion is others’ views are even less internally coherent. Yeah, a lot of it is going to be done by having a narrower conception of the problem.

Then a lot of it is going to be done by me just being… in terms of do we need a lot of work to be done, a lot of it is going to be me being like, I don't know man, maybe. I don't really understand how people get to a high probability of doom. I don't see the arguments that are like, definitely there's a lot of crazy stuff to go down. It seems like we really just don't know. I do also think problems tend to be easier. I have more of that prior, especially for problems that make sense on paper. I think they tend to either be kind of easy or impossible– if they're possible, they tend to be kind of easy. There aren't that many really hard theorems.

Robert Long: Can you say a little bit more of what you mean by that? That’s not a very good follow-up question, I don’t really know what it would take for me to understand what you mean by that better.

Paul Christiano: Like most of the time, if I’m like, ‘here’s an algorithms problem’, you can like– if you just generate some random algorithms problems, a lot of them are going to be impossible. Then amongst the ones that are possible, a lot of them are going to be soluble in a year of effort and amongst the rest, a lot of them are going to be soluble in 10 or a hundred years of effort. It’s just kind of rare that you find a problem that’s soluble– by soluble, I don’t just mean soluble by human civilization, I mean like, they are not provably impossible– that takes a huge amount of effort.

It normally… it’s less likely to happen the cleaner the problem is. There just aren’t many very clean algorithmic problems where our society worked on it for 10 years and then we’re like, ‘Oh geez, this still seems really hard.’ Examples are kind of like… factoring is an example of a problem we’ve worked a really long time on. It kind of has the shape, and this is the tendency on these sorts of problems, where there’s just a whole bunch of solutions and we hack away and we’re a bit better and a bit better and a bit better. It’s a very messy landscape, rather than jumping from having no solution to having a solution. It’s even rarer to have things where going from no solution to some solution is really possible but incredibly hard. There were some examples.

Robert Long: And you think that the problems we face are sufficiently similar?

Paul Christiano: I mean, I think this is going more into the like, 'I don't know man', but what I think when I say 'I don't know man' isn't like, 'Therefore, there's an 80% chance that it's going to be an incredibly difficult problem', because that's not what my prior is like. I'm like, reasonable chance it's not that hard. Some chance it's really hard. Probably more chance that– if it's really hard, I think it's more likely to be because all the clean statements of the problem are impossible. I think as statements get messier it becomes more plausible that it just takes a lot of effort. The more messy a thing is, the less likely it is to be impossible sometimes, but also the more likely it's just a bunch of stuff you have to do.

Ronny Fernandez: It seems like one disagreement that you have with MIRI folks is that you think prosaic AGI will be easier to align than they do. Does that perception seem right to you?

Paul Christiano: I think so. I think they're probably just like, 'that seems probably impossible'. That's related to the previous point.

Ronny Fernandez: If you had found out that prosaic AGI is nearly impossible to align or is impossible to align, how much would that change your-

Paul Christiano: It depends exactly what you found out, exactly how you found it out, et cetera. One thing you could be told is that there’s no perfectly scalable mechanism where you can throw in your arbitrarily sophisticated AI and turn the crank and get out an arbitrarily sophisticated aligned AI. That’s a possible outcome. That’s not necessarily that damning because now you’re like okay, fine, you can almost do it basically all the time and whatever.

That’s a big class of worlds and that would definitely be a thing I would be interested in understanding– how large is that gap actually, if the nice problem was totally impossible? If at the other extreme you just told me, ‘Actually, nothing like this is at all going to work, and it’s definitely going to kill everyone if you build an AI using anything like an extrapolation of existing techniques’, then I’m like, ‘Sounds pretty bad.’ I’m still not as pessimistic as MIRI people.

I’m like, maybe people just won’t destroy the world, you know, it’s hard to say. It’s hard to say what they’ll do. It also depends on the nature of how you came to know this thing. If you came to know it in a way that’s convincing to a reasonably broad group of people, that’s better than if you came to know it and your epistemic state was similar to– I think MIRI people feel more like, it’s already known to be hard, and therefore you can tell if you can’t convince people it’s hard. Whereas I’m like, I’m not yet convinced it’s hard, so I’m not so surprised that you can’t convince people it’s hard.

Then there’s more probability, if it was known to be hard, that we can convince people, and therefore I’m optimistic about outcomes conditioned on knowing it to be hard. I might become almost as pessimistic as MIRI if I thought that the problem was insolubly hard, just going to take forever or whatever, huge gaps aligning prosaic AI, and there would be no better evidence of that than currently exists. Like there’s no way to explain it better to people than MIRI currently can. If you take those two things, I’m maybe getting closer to MIRI’s levels of doom probability. I might still not be quite as doomy as them.

Ronny Fernandez: Why does the ability to explain it matter so much?

Paul Christiano: Well, a big part of why you don’t expect people to build unaligned AI is they’re like, they don’t want to. The clearer it is and the stronger the case, the more people can potentially do something. In particular, you might get into a regime where you’re doing a bunch of shit by trial and error and trying to wing it. And if you have some really good argument that the winging it is not going to work, then that’s a very different state than if you’re like, ‘Well, winging it doesn’t seem that good. Maybe it’ll fail.’ It’s different to be like, ‘Oh no, here’s an argument. You just can’t… It’s just not going to work.’

I don’t think we’ll really be in that state, but there’s like a whole spectrum from where we’re at now to that state and I expect to be further along it, if in fact we’re doomed. For example, if I personally would be like, ‘Well, I at least tried the thing that seemed obvious to me to try and now we know that doesn’t work.’ I sort of expect very directly from trying that to learn something about why that failed and what parts of the problem seem difficult.

Ronny Fernandez: Do you have a sense of why MIRI thinks aligning prosaic AI is so hard?

Paul Christiano: We haven't gotten a huge amount of traction on this when we've debated it. I think part of their position, especially on the winging it thing, is they're like– man, doing things right generally seems a lot harder than doing them. I guess building an AI that behaves in a way that's good, for some arbitrary notion of good, will probably be a lot harder than just building an AI at all.

There's a theme that comes up frequently trying to hash this out, and it's not so much about a theoretical argument, it's just like, look, the theoretical argument establishes that there's something a little bit hard here. And once you have something a little bit hard, now you have some giant organization, people doing the random shit they're going to do, and all that chaos, and like, getting things to work has all these steps, and getting this harder thing to work is going to have some extra steps, and everyone's going to be doing it. They're more pessimistic based on those kinds of arguments.

That’s the thing that comes up a lot. I think probably most of the disagreement is still in the, you know, theoretically, how much– certainly we disagree about like, can this problem just be solved on paper in advance? Where I’m like, reasonable chance, you know? At least a third chance, they’ll just on paper be like, ‘We have nailed it.’ There’s really no tension, no additional engineering effort required. And they’re like, that’s like zero. I don’t know what they think it is. More than zero, but low.

Ronny Fernandez: Do you guys think you’re talking about the same problem exactly?

Paul Christiano: I think there we are probably. At that step we are. Just like, is your AI trying to destroy everything? Yes. No. The main place there’s some bleed over–  the main thing that MIRI maybe considers in scope and I don’t is like, if you build an AI, it may someday have to build another AI. And what if the AI it builds wants to destroy everything? Is that our fault or is that the AI’s fault? And I’m more on like, that’s the AI’s fault. That’s not my job. MIRI’s maybe more like not distinguishing those super cleanly, but they would say that’s their job. The distinction is a little bit subtle in general, but-

Ronny Fernandez: I guess I’m not sure why you cashed out in terms of fault.

Paul Christiano: I think for me it’s mostly like: there’s a problem we can hope to resolve. I think there’s two big things. One is like, suppose you don’t resolve that problem. How likely is it that someone else will solve it? Saying it’s someone else’s fault is in part just saying like, ‘Look, there’s this other person who had a reasonable opportunity to solve it and it was a lot smarter than us.’ So the work we do is less likely to make the difference between it being soluble or not. Because there’s this other smarter person.

And then the other thing is like, what should you be aiming for? To the extent there's a clean problem here which one could hope to solve, or one should bite off as a chunk, what fits in conceptually the same problem versus what's like– you know, an analogy I sometimes make is, if you build an AI that's doing important stuff, it might mess up in all sorts of ways. But when you're asking, 'Is my AI going to mess up when building a nuclear reactor?' It's a thing worth reasoning about as an AI person, but also like it's worth splitting into like– part of that's an AI problem, and part of that's a problem about understanding and managing nuclear waste. Part of that should be done by people reasoning about nuclear waste and part of it should be done by people reasoning about AI.

This is a little subtle because both of the problems have to do with AI. I would say my relationship with that is similar to like, suppose you told me that some future point, some smart people might make an AI. There’s just a meta and object level on which you could hope to help with the problem.

I’m hoping to help with the problem on the object level in the sense that we are going to do research which helps people align AI, and in particular, will help the future AI align the next AI. Because it’s like people. It’s at that level, rather than being like, ‘We’re going to construct a constitution of that AI such that when it builds future AI it will always definitely work’. This is related to like– there’s this old argument about recursive self-improvement. It’s historically figured a lot in people’s discussion of why the problem is hard, but on a naive perspective it’s not obvious why it should, because you do only a small number of large modifications before your systems are sufficiently intelligent relative to you that it seems like your work should be obsolete. Plus like, them having a bunch of detailed knowledge on the ground about what’s going down.

It seems unclear to me how– yeah, this is related to our disagreement– how much you’re happy just deferring to the future people and being like, ‘Hope that they’ll cope’. Maybe they won’t even cope by solving the problem in the same way, they might cope by, the crazy AIs that we built reach the kind of agreement that allows them to not build even crazier AIs in the same way that we might do that. I think there’s some general frame of, I’m just taking responsibility for less, and more saying, can we leave the future people in a situation that is roughly as good as our situation? And by future people, I mean mostly AIs.

Ronny Fernandez: Right. The two things that you think might explain your relative optimism are something like: Maybe we can leave the problem to smarter agents that are humans. Maybe we can leave the problem to smarter agents that are not humans.

Paul Christiano: Also a lot of disagreement about the problem. Those are certainly two drivers. They’re not exhaustive in the sense that there’s also a huge amount of disagreement about like, ‘How hard is this problem?’ Which is some combination of like, ‘How much do we know about it?’ Where they’re more like, ‘Yeah, we’ve thought about it a bunch and have some views.’ And I’m like, ‘I don’t know, I don’t think I really know shit.’ Then part of it is concretely there’s a bunch of– on the object level, there’s a bunch of arguments about why it would be hard or easy so we don’t reach agreement. We consistently disagree on lots of those points.

Ronny Fernandez: Do you think the goal state for you guys is the same though? If I gave you guys a bunch of AGIs, would you guys agree about which ones are aligned and which ones are not? If you could know all of their behaviors?

Paul Christiano: I think at that level we’d probably agree. We don’t agree more broadly about what constitutes a win state or something. They have this more expansive conception– or I guess it’s narrower– that the win state is supposed to do more. They are imagining more that you’ve resolved this whole list of future challenges. I’m more not counting that.

We’ve had this… yeah, I guess I now mostly use intent alignment to refer to this problem where there’s risk of ambiguity… the problem that I used to call AI alignment. There was a long obnoxious back and forth about what the alignment problem should be called. MIRI does use aligned AI to be like, ‘an AI that produces good outcomes when you run it’. Which I really object to as a definition of aligned AI a lot. So if they’re using that as their definition of aligned AI, we would probably disagree.

Ronny Fernandez: Shifting terms or whatever… one thing that they're trying to work on is making an AGI that has a property that is also the property you're trying to make sure an AGI has.

Paul Christiano: Yeah, we’re all trying to build an AI that’s trying to do the right thing.

Ronny Fernandez: I guess I’m thinking more specifically, for instance, I’ve heard people at MIRI say something like, they want to build an AGI that I can tell it, ‘Hey, figure out how to copy a strawberry, and don’t mess anything else up too badly.’ Does that seem like the same problem that you’re working on?

Paul Christiano: I mean it seems like in particular, you should be able to do that. I think it’s not clear whether that captures all the complexity of the problem. That’s just sort of a question about what solutions end up looking like, whether that turns out to have the same difficulty.

The other things you might think are involved that are difficult are… well, I guess one problem is just how you capture competitiveness. Competitiveness for me is a key desideratum. And it’s maybe easy to elide in that setting, because it just makes a strawberry. Whereas I am like, if you make a strawberry literally as well as anyone else can make a strawberry, it’s just a little weird to talk about. And it’s a little weird to even formalize what competitiveness means in that setting. I think you probably can, but whether or not you do that’s not the most natural or salient aspect of the situation.

So I probably disagree with them about– I’m like, there are probably lots of ways to have agents that make strawberries and are very smart. That’s just another disagreement that’s another function of the same basic, ’How hard is the problem’ disagreement. I would guess relative to me, in part because of being more pessimistic about the problem, MIRI is more willing to settle for an AI that does one thing. And I care more about competitiveness.

Asya Bergal: Say you just learn that prosaic AI is just not going to be the way we get to AGI. How does that make you feel about the IDA approach versus the MIRI approach?

Paul Christiano: So my overall stance when I think about alignment is, there’s a bunch of possible algorithms that you could use. And the game is understanding how to align those algorithms. And it’s kind of a different game. There’s a lot of common subproblems in between different algorithms you might want to align, it’s potentially a different game for different algorithms. That’s an important part of the answer. I’m mostly focusing on the ‘align this particular’– I’ll call it learning, but it’s a little bit more specific than learning– where you search over policies to find a policy that works well in practice. If we’re not doing that, then maybe that solution is totally useless, maybe it has common subproblems with the solution you actually need. That’s one part of the answer.

Another big difference is going to be, timelines views will shift a lot if you’re handed that information. So it will depend exactly on the nature of the update. I don’t have a strong view about whether it makes my timelines shorter or longer overall. Maybe you should bracket that though.

In terms of returning to the first one of trying to align particular algorithms, I don’t know. I think I probably share some of the MIRI persp– well, no. It feels to me like there’s a lot of common subproblems. Aligning expert systems seems like it would involve a lot of the same reasoning as aligning learners. To the extent that’s true, probably future stuff also will involve a lot of the same subproblems, but I doubt the algorithm will look the same. I also doubt the actual algorithm will look anything like a particular pseudocode we might write down for iterated amplification now.

Asya Bergal: Does iterated amplification in your mind rely on this thing that searches through policies for the best policy? The way I understand it, it doesn’t feel like it necessarily does.

Paul Christiano: So, you use this distillation step. And the reason you want to do amplification, or this short-hop, expensive amplification, is because you interleave it with this distillation step. And I normally imagine the distillation step as being, learn a thing which works well in practice on a reward function defined by the overseer. You could imagine other things that also needed to have this framework, but it’s not obvious whether you need this step if you didn’t somehow get granted something like the–

Asya Bergal: That you could do the distillation step somehow.

Paul Christiano: Yeah. It’s unclear what else would– so another example of a thing that could fit in, and this maybe makes it seem more general, is if you had an agent that was just incentivized to make lots of money. Then you could just have your distillation step be like, ‘I randomly check the work of this person, and compensate them based on the work I checked’. That’s a suggestion of how this framework could end up being more general.

But I mostly do think about it in the context of learning in particular. I think it’s relatively likely to change if you’re not in that setting. Well, I don’t know. I don’t have a strong view. I’m mostly just working in that setting, mostly because it seems reasonably likely, seems reasonably likely to have a bunch in common, learning is reasonably likely to appear even if other techniques appear. That is, learning is likely to play a part in powerful AI even if other techniques also play a part.

Asya Bergal: Are there other people or resources that you think would be good for us to look at if we were looking at the optimism view?

Paul Christiano: Before we get to resources or people, I think one of the basic questions is, there’s this perspective which is fairly common in ML, which is like, ‘We’re kind of just going to do a bunch of stuff, and it’ll probably work out’. That’s probably the basic thing to be getting at. How right is that?

This is the view of safety conditioned on prosaic AI– I feel like prosaic AI is in some sense the worst case, about as bad as things could have gotten in terms of alignment. Where, I don't know, you try a bunch of shit, just a ton of stuff, a ton of trial and error, which seems pretty bad. Anyway, this is a random aside maybe more related to the previous point. But yeah, this is just with alignment. There's this view in ML that's relatively common that's like, we'll try a bunch of stuff to get the AI to do what we want, it'll probably work out. Some problems will come up. We'll probably solve them. I think that's probably the most important thing on the optimism vs pessimism side.

And I don’t know, I mean this has been a project that like, it’s a hard project. I think the current state of affairs is like, the MIRI folk have strong intuitions about things being hard. Essentially no one in… very few people in ML agree with those, or even understand where they’re coming from. And even people in the EA community who have tried a bunch to understand where they’re coming from mostly don’t. Mostly people either end up understanding one side or the other and don’t really feel like they’re able to connect everything. So it’s an intimidating project in that sense. I think the MIRI people are the main proponents of the everything is doomed, the people to talk to on that side. And then in some sense there’s a lot of people on the other side who you can talk to, and the question is just, who can articulate the view most clearly? Or who has most engaged with the MIRI view such that they can speak to it?

Ronny Fernandez: Those are people I would be particularly interested in. If there are people that understand all the MIRI arguments but still have broadly the perspective you’re describing, like some problems will come up, probably we’ll fix them.

Paul Christiano: I don't know good– I don't have good examples of people for you. I think most people just find the MIRI view kind of incomprehensible, or like, it's a really complicated thing, even if the MIRI view makes sense on its face. I don't think people have gotten enough into the weeds. It really rests a lot right now on this fairly complicated cluster of intuitions. I guess on the object level, I think I've just engaged a lot more with the MIRI view than most people who are– who mostly take the 'everything will be okay' perspective. So I'm happy to talk on the object level, and speaking more to arguments. I think it's a hard thing to get into, but it's going to be even harder to find other people in ML who have engaged with the view that much.

They might be able to make other general criticisms of like, here’s why I haven’t really… like it doesn’t seem like a promising kind of view to think about. I think you could find more people who have engaged at that level. I don’t know who I would recommend exactly, but I could think about it. Probably a big question will be who is excited to talk to you about it.

Asya Bergal: I am curious about your response to MIRI’s object level arguments. Is there a place that exists somewhere?

Paul Christiano: There’s some back and forth on the internet. I don’t know if it’s great. There’s some LessWrong posts. Eliezer for example wrote this post about why things were doomed, why I in particular was doomed. I don’t know if you read that post.

Asya Bergal: I can also ask you about it now, I just don’t want to take too much of your time if it’s a huge body of things.

Paul Christiano: The basic argument would be like, 1) On paper I don't think we yet have a good reason to feel doomy. And I think there's some basic research intuition about how much a problem– suppose you poke at a problem a few times, and you're like 'Agh, seems hard to make progress'. How much do you infer that the problem's really hard? And I'm like, not much. As a person who's poked at a bunch of problems, let me tell you, that often doesn't work and then you solve it in like 10 years of effort.

So that’s one thing. That’s a point where I have relatively little sympathy for the MIRI way. That’s one set of arguments: is there a good way to get traction on this problem? Are there clever algorithms? I’m like, I don’t know, I don’t feel like the kind of evidence we’ve seen is the kind of evidence that should be persuasive. As some evidence in that direction, I’d be like, I have not been thinking about this that long. I feel like there have often been things that felt like, or that MIRI would have defended as like, here’s a hard obstruction. Then you think about it and you’re actually like, ‘Here are some things you can do.’ And it may still be an obstruction, but it’s no longer quite so obvious where it is, and there were avenues of attack.

That’s one thing. The second thing is like, a metaphor that makes me feel good– MIRI talks a lot about the evolution analogy. If I imagine the evolution problem– so if I’m a person, and I’m breeding some animals, I’m breeding some superintelligence. Suppose I wanted to breed an animal modestly smarter than humans that is really docile and friendly. I’m like, I don’t know man, that seems like it might work. That’s where I’m at. I think they are… it’s been a little bit hard to track down this disagreement, and I think this is maybe in a fresher, rawer state than the other stuff, where we haven’t had enough back and forth.

But I’m like, it doesn’t sound necessarily that hard. I just don’t know. I think their position, when they’ve written something, has been a little bit more like, ‘But you couldn’t breed a thing that, after undergoing radical changes in intelligence or situation, would remain friendly’. But then I’m normally like, but it’s not clear why that’s needed? I would really just like to create something slightly superhuman, and it’s going to work with me to breed something that’s slightly smarter still that is friendly.

We haven’t really been able to get traction on that. I think they have an intuition that maybe there’s some kind of invariance and things become gradually more unraveled as you go on. Whereas I have more intuition that it’s plausible. After this generation, there’s just smarter and smarter people thinking about how to keep everything on the rails. It’s very hard to know.

That’s the second thing. I have found that really… that feels like it gets to the heart of some intuitions that are very different, and I don’t understand what’s up there. There’s a third category which is like, on the object level, there’s a lot of directions that I’m enthusiastic about where they’re like, ‘That seems obviously doomed’. So you could divide those up into the two problems. There’s the family of problems that are more like the inner alignment problem, and then outer alignment stuff.

That’s one possible approach, where the other would be something more like interpretability, where you say like, ‘Here’s what the model is doing. In addition to its behavior we get this other signal, that it was depending on this fact, which it shouldn’t have been dependent on.’ The question is, can either of those yield good behavior? I’m like, I don’t know, man. It seems plausible. And they’re like ‘Definitely not.’ And I’m like, ‘Why definitely not?’ And they’re like ‘Well, that’s not getting at the real essence of the problem.’ And I’m like ‘Okay, great, but how did you substantiate this notion of the real essence of the problem? Where is that coming from? Is that coming from a whole bunch of other solutions that look plausible that failed?’ And their take is kind of like, yes, and I’m like, ‘But none of those– there weren’t actually even any candidate solutions there really that failed yet. You’ve got maybe one thing, or like, you showed there exists a problem in some minimal sense.’ This comes back to the first of the three things I listed. But it’s a little bit different in that I think you can just stare at particular things and they’ll be like, ‘Here’s how that particular thing is going to fail.’ And I’m like ‘I don’t know, it seems plausible.’

That’s on inner alignment. And there’s maybe some on outer alignment. I feel like they’ve given a lot of ground in the last four years on how doomy things seem on outer alignment. I think they still have some– if we’re talking about amplification, I think the position would still be, ‘Man, why would that agent be aligned? It doesn’t at all seem like it would be aligned.’ That has also been a little bit surprisingly tricky to make progress on. I think it’s similar, where I’m like, yeah, I grant the existence of some problem or some thing which needs to be established, but I don’t grant– I think their position would be like, this hasn’t made progress or has just pushed around the core difficulty. I’m like, I don’t grant the conception of the core difficulty in which this has just pushed around the core difficulty. I think the disagreement is substantially in that kind of thing: being like, here’s an approach that seems plausible, we don’t have a clear obstruction, but I think that it is doomed for these deep reasons. I have maybe a higher bar for what kind of support the deep reasons need.

I also just think on the merits, they have not really engaged with– and this is partly my responsibility for not having articulated the arguments in a clear enough way– although I think they have not engaged with even the clearest articulation as of two years ago of what the hope was. But that’s probably on me for not having an even clearer articulation than that, and it’s also definitely not up to them to engage with everything. To the extent it’s a moving target, it’s not up to them to engage with the most recent version. Where, by most recent version– the proposal doesn’t really change that much, or like, the case for optimism has changed a little bit. But it’s mostly just the state of argument concerning it, rather than the version of the scheme.


### Scott Alexander visits DC

LessWrong.com news - September 12, 2019 - 01:31
Published on September 11, 2019 10:31 PM UTC

Scott is visiting Washington, DC, so the meetup group is having an event from 5-9pm on Tuesday, September 24. (Note that the date has changed from Monday the 23rd.) We'll be meeting at a coffee shop called Teaism, located at 400 8th Street NW, Washington, DC.
