# LessWrong.com News

A community blog devoted to refining the art of rationality
Updated: 13 minutes 21 seconds ago

### Is Science Slowing Down?

November 27, 2018 - 06:30
Published on Tue Nov 27 2018 03:30:01 GMT+0000 (UTC)

[This post was up a few weeks ago before getting taken down for complicated reasons. They have been sorted out and I’m trying again.]

Is scientific progress slowing down? I recently got a chance to attend a conference on this topic, centered around a paper by Bloom, Jones, Reenen & Webb (2018).

BJRW identify areas where technological progress is easy to measure – for example, the number of transistors on a chip. They measure the rate of progress over the past century or so, and the number of researchers in the field over the same period. For example, here’s the transistor data:

This is the standard presentation of Moore’s Law – the number of transistors you can fit on a chip doubles about every two years (eg grows by 35% per year). This is usually presented as an amazing example of modern science getting things right, and no wonder – it means you can go from a few thousand transistors per chip in 1971 to many millions today, with the corresponding increase in computing power.

But BJRW have a pessimistic take. There are eighteen times more people involved in transistor-related research today than in 1971. So if in 1971 it took 1000 scientists to increase transistor density 35% per year, today it takes 18,000 scientists to do the same task. So apparently the average transistor scientist is eighteen times less productive today than fifty years ago. That should be surprising and scary.
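The implied arithmetic is simple; here is a minimal sketch of BJRW's per-researcher accounting, using the headcounts above:

```python
researchers_1971 = 1_000    # scientists needed for 35%/year density growth in 1971
researchers_today = 18_000  # scientists needed for the same 35%/year growth today

# Same output (35% annual improvement), 18x the headcount:
# the productivity decline per researcher is just the headcount ratio.
decline_factor = researchers_today / researchers_1971
print(f"{decline_factor:.0f}x less productive per researcher")  # 18x less productive per researcher
```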

But isn’t it unfair to compare percent increase in transistors with absolute increase in transistor scientists? That is, a graph comparing absolute number of transistors per chip vs. absolute number of transistor scientists would show two similar exponential trends. Or a graph comparing percent change in transistors per year vs. percent change in number of transistor scientists per year would show two similar linear trends. Either way, there would be no problem and productivity would appear constant since 1971. Isn’t that a better way to do things?

A lot of people asked paper author Michael Webb this at the conference, and his answer was no. He thinks that intuitively, each “discovery” should decrease transistor size by a certain amount. For example, if you discover a new material that allows transistors to be 5% smaller along one dimension, then you can fit 5% more transistors on your chip whether there were a hundred there before or a million. Since the relevant factor is discoveries per researcher, and each discovery is represented as a percent change in transistor size, it makes sense to compare percent change in transistor size with absolute number of researchers.

Anyway, most other measurable fields show the same pattern of constant progress in the face of exponentially increasing number of researchers. Here’s BJRW’s data on crop yield:

The solid and dashed lines are two different measures of crop-related research. Even though the crop-related research increases by a factor of 6-24x (depending on how it’s measured), crop yields grow at a relatively constant 1% rate for soybeans, and an apparently declining 3%-ish rate for corn.

BJRW go on to prove the same is true for whatever other scientific fields they care to measure. Measuring scientific progress is inherently difficult, but their finding of constant or log-constant progress in most areas accords with Nintil’s overview of the same topic, which gives us graphs like

…and dozens more like it. And even when we use data that are easy to measure and hard to fake, like number of chemical elements discovered, we get the same linearity:

Meanwhile, the increase in researchers is obvious. Not only is the population increasing (by a factor of about 2.5x in the US since 1930), but the percent of people with college degrees has quintupled over the same period. The exact numbers differ from field to field, but orders of magnitude increases are the norm. For example, the number of people publishing astronomy papers seems to have dectupled over the past fifty years or so.

BJRW put all of this together into total number of researchers vs. total factor productivity of the economy, and find…

…about the same as with transistors, soybeans, and everything else. So if you take their methodology seriously, over the past ninety years, each researcher has become about 25x less productive in making discoveries that translate into economic growth.

Participants at the conference had some explanations for this, of which the ones I remember best are:

1. Only the best researchers in a field actually make progress, and the best researchers are already in a field, and probably couldn’t be kept out of the field with barbed wire and attack dogs. If you expand a field, you will get a bunch of merely competent careerists who treat it as a 9-to-5 job. A field of 5 truly inspired geniuses and 5 competent careerists will make X progress. A field of 5 truly inspired geniuses and 500,000 competent careerists will make the same X progress. Adding further competent careerists is useless for doing anything except making graphs look more exponential, and we should stop doing it. See also Price’s Law Of Scientific Contributions.

2. Certain features of the modern academic system, like underpaid PhDs, interminably long postdocs, endless grant-writing drudgery, and clueless funders have lowered productivity. The 1930s academic system was indeed 25x more effective at getting researchers to actually do good research.

3. All the low-hanging fruit has already been picked. For example, element 117 was discovered by an international collaboration that got an unstable isotope of berkelium from the single accelerator in Tennessee capable of synthesizing it, shipped it to a nuclear reactor in Russia where it was attached to a titanium film, brought it to a particle accelerator in a different Russian city where it was bombarded with a custom-made exotic isotope of calcium, sent the resulting data to a global team of theorists, and eventually found a signature indicating that element 117 had existed for a few milliseconds. Meanwhile, the first modern element discovery, that of phosphorus in the 1670s, came from a guy looking at his own piss. We should not be surprised that discovering element 117 needed more people than discovering phosphorus.

Needless to say, my sympathies lean towards explanation number 3. But I worry even this isn’t dismissive enough. My real objection is that constant progress in science in response to exponential increases in inputs ought to be our null hypothesis, and that it’s almost inconceivable that it could ever be otherwise.

Consider a case in which we extend these graphs back to the beginning of a field. For example, psychology started with Wilhelm Wundt and a few of his friends playing around with stimulus perception. Let’s say there were ten of them working for one generation, and they discovered ten revolutionary insights worthy of their own page in Intro Psychology textbooks. Okay. But now there are about a hundred thousand experimental psychologists. Should we expect them to discover a hundred thousand revolutionary insights per generation?

Or: the economic growth rate in 1930 was 2% or so. If it scaled with number of researchers, it ought to be about 50% per year today with our 25x increase in researcher number. That kind of growth would mean that the average person who made $30,000 a year in 2000 should make $50 million a year in 2018.
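This compounding is easy to verify; a quick sketch of the arithmetic, taking the 50% rate and the 2000–2018 window at face value:

```python
income_2000 = 30_000   # dollars per year
growth_rate = 0.50     # 2% growth scaled by the 25x increase in researchers
years = 2018 - 2000

# 18 years of 50% annual compounding
income_2018 = income_2000 * (1 + growth_rate) ** years
print(f"${income_2018:,.0f}")  # about $44 million -- the "roughly $50 million" above
```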

Or: in 1930, life expectancy at 65 was increasing by about two years per decade. But if that scaled with number of biomedicine researchers, that should have increased to ten years per decade by about 1955, which would mean everyone would have become immortal starting sometime during the Baby Boom, and we would currently be ruled by a deathless God-Emperor Eisenhower.
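A toy check of this scaling argument (the five-fold figure for researcher growth by 1955 is my illustrative assumption; the text only states the conclusion):

```python
gain_1930 = 2 / 10     # years of life expectancy gained per calendar year, in 1930
researcher_scale = 5   # assumed growth in biomedical researchers by ~1955 (not from the text)

# If gains scale linearly with researcher count:
gain_1955 = gain_1930 * researcher_scale
# 1.0 -> life expectancy at 65 rises a full year for every year that passes,
# i.e. the "everyone becomes immortal" threshold described above
print(gain_1955)
```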

Or: the ancient Greek world had about 1% the population of the current Western world, so if the average Greek was only 10% as likely to be a scientist as the average modern, there were only 1/1000th as many Greek scientists as modern ones. But the Greeks made such great discoveries as the size of the Earth, the distance of the Earth to the sun, the prediction of eclipses, the heliocentric theory, Euclid’s geometry, the nervous system, the cardiovascular system, etc, and brought technology up from the Bronze Age to the Antikythera mechanism. Even adjusting for the long time scale to which “ancient Greece” refers, are we sure that we’re producing 1000x as many great discoveries as they did? If we extended BJRW’s graph all the way back to Ancient Greece, adjusting for the change in researchers as civilizations rise and fall, wouldn’t it keep the same shape as it does for this century? Isn’t the real question not “Why isn’t Dwight Eisenhower immortal god-emperor of Earth?” but “Why isn’t Marcus Aurelius immortal god-emperor of Earth?”

Or: what about human excellence in other fields? Shakespearean England had 1% of the population of the modern Anglosphere, and presumably even fewer than 1% of the artists. Yet it gave us Shakespeare. Are there a hundred Shakespeare-equivalents around today? This is a harder problem than it seems – Shakespeare has become so venerable with historical hindsight that maybe nobody would acknowledge a Shakespeare-level master today even if they existed – but still, a hundred Shakespeares? If we look at some measure of great works of art per era, we find past eras giving us far more than we would predict from their population relative to our own. This is very hard to judge, and I would hate to be the guy who has to decide whether Harry Potter is better or worse than the Aeneid. But still? A hundred Shakespeares?

Or: what about sports? Here are marathon records for the past hundred years or so:

In 1900, there were only two local marathons (eg the Boston Marathon) in the world. Today there are over 800. Also, the world population has increased by a factor of five (more than that in the East African countries that give us literally 100% of top male marathoners). Despite that, progress in marathon records has been steady or declining. Most other Olympics sports show the same pattern.

All of these lines of evidence lead me to the same conclusion: constant growth rates in response to exponentially increasing inputs is the null hypothesis. If it wasn’t, we should be expecting 50% year-on-year GDP growth, easily-discovered-immortality, and the like. Nobody expected that before reading BJRW, so we shouldn’t be surprised when BJRW provide a data-driven model showing it isn’t happening. I realize this in itself isn’t an explanation; it doesn’t tell us why researchers can’t maintain a constant level of output as measured in discoveries. It sounds a little like “God wouldn’t design the universe that way”, which is a kind of suspicious line of argument, especially for atheists. But it at least shifts us from a lens where we view the problem as “What three tweaks should we make to the graduate education system to fix this problem right now?” to one where we view it as “Why isn’t Marcus Aurelius immortal?”

And through such a lens, only the “low-hanging fruits” explanation makes sense. Explanation 1 – that progress depends only on a few geniuses – isn’t enough. After all, the Greece-today difference is partly based on population growth, and population growth should have produced proportionately more geniuses. Explanation 2 – that PhD programs have gotten worse – isn’t enough. There would have to be a worldwide monotonic decline in every field (including sports and art) from Athens to the present day. Only Explanation 3 holds water.

I brought this up at the conference, and somebody reasonably objected – doesn’t that mean science will stagnate soon? After all, we can’t keep feeding it an exponentially increasing number of researchers forever. If nothing else stops us, then at some point, 100% (or the highest plausible amount) of the human population will be researchers, we can only increase as fast as population growth, and then the scientific enterprise collapses.

I answered that the Gods Of Straight Lines are more powerful than the Gods Of The Copybook Headings, so if you try to use common sense on this problem you will fail.

Imagine being a futurist in 1970 presented with Moore’s Law. You scoff: “If this were to continue only 20 more years, it would mean a million transistors on a single chip! You would be able to fit an entire supercomputer in a shoebox!” But common sense was wrong and the trendline was right.

“If this were to continue only 40 more years, it would mean ten billion transistors per chip! You would need more transistors on a single chip than there are humans in the world! You could have computers more powerful than any today, that are too small to even see with the naked eye! You would have transistors with like a double-digit number of atoms!” But common sense was wrong and the trendline was right.

Or imagine being a futurist in ancient Greece presented with world GDP doubling time. Take the trend seriously, and in two thousand years, the future would be fifty thousand times richer. Every man would live better than the Shah of Persia! There would have to be so many people in the world you would need to tile entire countries with cityscape, or build structures higher than the hills just to house all of them. Just to sustain itself, the world would need transportation networks orders of magnitude faster than the fastest horse. But common sense was wrong and the trendline was right.

I’m not saying that no trendline has ever changed. Moore’s Law seems to be legitimately slowing down these days. The Dark Ages shifted every macrohistorical indicator for the worse, and the Industrial Revolution shifted every macrohistorical indicator for the better. Any of these sorts of things could happen again, easily. I’m just saying that “Oh, that exponential trend can’t possibly continue” has a really bad track record. I do not understand the Gods Of Straight Lines, and honestly they creep me out. But I would not want to bet against them.

Grace et al’s survey of AI researchers shows that they expect AIs to start being able to do science in about thirty years, and to exceed the productivity of human researchers in every field shortly afterwards. Suddenly “there aren’t enough humans in the entire world to do the amount of research necessary to continue this trend line” stops sounding so compelling.

At the end of the conference, the moderator asked how many people thought that it was possible for a concerted effort by ourselves and our institutions to “fix” the “problem” indicated by BJRW’s trends. Almost the entire room raised their hands. Everyone there was smarter and more prestigious than I was (also richer, and in many cases way more attractive), but with all due respect I worry they are insane. This is kind of how I imagine their worldview looking:

I realize I’m being fatalistic here. Doesn’t my position imply that the scientists at Intel should give up and let the Gods Of Straight Lines do the work? Or at least that the head of the National Academy of Sciences should do something like that? That Francis Bacon was wasting his time by inventing the scientific method, and Fred Terman was wasting his time by organizing Silicon Valley? Or perhaps that the Gods Of Straight Lines were acting through Bacon and Terman, and they had no choice in their actions? How do we know that the Gods aren’t acting through our conference? Or that our studying these things isn’t the only thing that keeps the straight lines going?

I don’t know. I can think of some interesting models – one made up of a thousand random coin flips a year has some nice qualities – but I don’t know.

I do know you should be careful what you wish for. If you “solved” this “problem” in classical Athens, Attila the Hun would have had nukes. Remember Yudkowsky’s Law of Mad Science: “Every eighteen months, the minimum IQ necessary to destroy the world drops by one point.” Do you really want to make that number ten points? A hundred? I am kind of okay with the function mapping number of researchers to output that we have right now, thank you very much.

The conference was organized by Patrick Collison and Michael Nielsen; they have written up some of their thoughts here.


### Bodega Bay: workshop

November 27, 2018 - 06:20
Published on Tue Nov 27 2018 03:20:01 GMT+0000 (UTC)

What was the workshop like? Well there was really amazing carpet. I don’t know how relevant this is, but a lot of people commented on it.

Another thing that was salient to me—and many of the participants—was these things called ‘back jacks’. They are like adjustable chairs, without arms or legs.

Some part of me believes that with the right carpet and back jack, my life could be at least twenty percent better. Here is a picture from Amazon, which is on good terms with this part of me:

There were also lots of sessions that I went to, about things like how to be curious, or trade tastes with other people. Then they eventually started talking about AI, after which I started avoiding the classes. There is something nice about avoiding classes in a beautiful place, among people actively and idealistically doing things.

Before the AI bits, I went to a class called something like ‘seeking IPC’, which they said was not well named, and that if I could name it better, that would be helpful. So I paid attention to what it was about, to this end. As far as I can tell it was about how you should be curious all the time. For instance, if instead of being annoyed, one is curious, then one might get to find out the fascinating answer to the question of what the hell is going on inside the other person’s head.

(I spent the next day feeling annoyed with everything. I considered being curious about this, but found that I definitely did not want to. I did have a lot of second level curiosity about why I was so incurious.)

I found the continuing line of thought on curiosity to be fascinating and radically perspective-changing, but struggle to explain my thoughts in a way where they don’t seem obvious. This might be because they are obvious. I claim that this doesn’t undermine the importance or difficulty of realizing them. The problem with obvious things often is that you don’t realize that you have to realize them, because you assume that you already know them because they are obvious. Anyway, I’m going to keep the actual point for another time when I figure out how to explain really obvious things to people. But indescribable revelation is right up there with nice carpet and back jacks for making a good workshop, I think.

I went to some classes on double crux. Double crux is basically the idea that when you are having an argument with someone, you should both try to figure out what it would take to change your mind, and then attend to whether any of those things are true. Internal double crux is when you do that, but the argument is between conflicting parts of yourself. Constructive internal double crux is when you want to do that, but you don’t have conflicting parts of yourself, so you try to coax at least one new combatant into internal existence. Counterfactual constructive internal double crux is when you don’t want to do that, so you create a version of you who does want to do it. That last one is fake, but all the others are real.

Every night after dinner they asked us what we wanted to do. Every night I wanted to go in the hot tub. Finally on the third night there was enough water in the hot tub, and my proposal for hot tub constructive internal double crux got several takers, and all was pretty idyllic (except for dire warnings to avoid breathing outside).

(This post is weird because it was constructed by pair blogging.)


### Alignment Newsletter

November 27, 2018 - 02:10
Published on Mon Nov 26 2018 23:10:03 GMT+0000 (UTC)

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through the database of all summaries.

Highlights

Scalable agent alignment via reward modeling (Jan Leike): This blog post and the associated paper outline a research direction that DeepMind's AGI safety team is pursuing. The key idea is to learn behavior by learning a reward and a policy simultaneously, from human evaluations of outcomes, which can scale to superhuman performance in tasks where evaluation is easier than demonstration. However, in many cases it is hard for humans to evaluate outcomes: in this case, we can train simpler agents using reward modeling that can assist the human in evaluating outcomes for the harder task, a technique the authors call recursive reward modeling. For example, if you want to train an agent to write a fantasy novel, it would be quite expensive to have a human evaluate outcomes, i.e. rate how good the produced fantasy novels are. We could instead use reward modeling to train agents that can produce plot summaries, assess prose quality and character development, etc. which allows a human to assess the fantasy novels. There are several research challenges, such as what kind of feedback to get, making it sufficiently sample efficient, preventing reward hacking and unacceptable outcomes, and closing the reward-result gap. They outline several promising approaches to solving these problems.

Rohin's opinion: The proposal sounds to me like a specific flavor of narrow value learning, where you learn reward functions to accomplish particular tasks, rather than trying to figure out the "true human utility function". The recursive aspect is similar to iterated amplification and debate. Iterated amplification and debate can be thought of as operating on a tree of arguments, where each node is the result of considering many child nodes (the considerations that go into the argument). Importantly, the child nodes are themselves arguments that can be decomposed into smaller considerations. Iterated amplification works by learning how to compose and decompose nodes from children, while debate works by having humans evaluate a particular path in the argument tree. Recursive reward modeling instead uses reward modeling to train agents that can help evaluate outcomes on the task of interest. This seems less recursive to me, since the subagents are used to evaluate outcomes, which would typically be a different-in-kind task than the task of interest. This also still requires the tasks to be fast -- it is not clear how to use recursive reward modeling to eg. train an agent that can teach math to children, since it takes days or months of real time to even produce outcomes to evaluate. These considerations make me a bit less optimistic about recursive reward modeling, but I look forward to seeing future work that proves me wrong.

The post also talks about how reward modeling allows us to separate what to do (reward) from how to do it (policy). I think it is an open question whether this is desirable. Past work found that the reward generalized somewhat (whereas policies typically don't generalize at all), but this seems relatively minor. For example, rewards inferred using deep variants of inverse reinforcement learning often don't generalize. Another possibility is that the particular structure of "policy that optimizes a reward" provides a useful inductive bias that makes things easier to learn. It would probably also be easier to inspect a specification of "what to do" than to inspect learned behavior. However, these advantages are fairly speculative and it remains to be seen whether they pan out. There are also practical advantages: any advances in deep RL can immediately be leveraged, and reward functions can often be learned much more sample efficiently than behavior, reducing requirements on human labor. On the other hand, this design "locks in" that the specification of behavior must be a reward function. I'm not a fan of reward functions because they're so unintuitive for humans to work with -- if we could have agents that work with natural language, I suspect I do not want the natural language to be translated into a reward function that is then optimized.

Technical AI alignment

Iterated amplification sequence

Prosaic AI alignment (Paul Christiano): It is plausible that we can build "prosaic" AGI soon, that is, we are able to build generally intelligent systems that can outcompete humans without qualitatively new ideas about intelligence. It seems likely that this would use some variant of RL to train a neural net architecture (other approaches don't have a clear way to scale beyond human level). We could write the code for such an approach right now (see An unaligned benchmark from AN #33), and it's at least plausible that with enough compute and tuning this could lead to AGI. However, this is likely to be bad if implemented as stated due to the standard issues of reward gaming and Goodhart's Law. We do have some approaches to alignment such as IRL and executing natural language instructions, but neither of these are at the point where we can write down code that would plausibly lead to an aligned AI. This suggests that we should focus on figuring out how to align prosaic AI.

There are several reasons to focus on prosaic AI. First, since we know the general shape of the AI system under consideration, it is easier to think about how to align it (while ignoring details like architecture, variance reduction tricks, etc. which don't seem very relevant currently). Second, it's important, both because we may actually build prosaic AGI, and because even if we don't the insights gained will likely transfer. In addition, worlds with short AGI timelines are higher leverage, and in those worlds prosaic AI seems much more likely. The main counterargument is that aligning prosaic AGI is probably infeasible, since we need a deep understanding of intelligence to build aligned AI. However, it seems unreasonable to be confident in this, and even if it is infeasible, it is worth getting strong evidence of this fact in order to change priorities around AI development, and coordinate on not building an AGI that is too powerful.

Rohin's opinion: I don't really have much to say here, except that I agree with this post quite strongly.

Approval-directed agents: overview and Approval-directed agents: details (Paul Christiano): These two posts introduce the idea of approval-directed agents, which are agents that choose actions that they believe their operator Hugh the human would most approve of, if he reflected on it for a long time. This is in contrast to the traditional approach of goal-directed agents, which are defined by the outcomes of their actions.

Since the agent Arthur is no longer reasoning about how to achieve outcomes, it can no longer outperform Hugh at any given task. (If you take the move in chess that Hugh most approves of, you probably still lose to Garry Kasparov.) This is still better than Hugh performing every action himself, because Hugh can provide an expensive learning signal which is then distilled into a fast policy that Arthur executes. For example, Hugh could deliberate for a long time whenever he is asked to evaluate an action, or he could evaluate very low-level decisions that Arthur makes billions of times. We can also still achieve superhuman performance by bootstrapping (see the next summary).

The main advantage of approval-directed agents is that we avoid locking in a particular goal, decision theory, prior, etc. Arthur should be able to change any of these, as long as Hugh approves it. In essence, approval-direction allows us to delegate these hard decisions to future overseers, who will be more informed and better able to make these decisions. In addition, any misspecifications seem to cause graceful failures -- you end up with a system that is not very good at doing what Hugh wants, rather than one that works at cross purposes to him.

We might worry that internally Arthur still uses goal-directed behavior in order to choose actions, and this internal goal-directed part of Arthur might become unaligned. However, we could even have internal decision-making about cognition be approval-based. Of course, eventually we reach a point where decisions are simply made -- Arthur doesn't "choose" to execute the next line of code. These sorts of things can be thought of as heuristics that have led to choosing good actions in the past, that could be changed if necessary (eg. by rewriting the code).

How might we write code that defines approval? If our agents can understand natural language, we could try defining "approval" in natural language. If they are able to reason about formally specified models, then we could try to define a process of deliberation with a simulated human. Even in the case where Arthur learns from examples, if we train Arthur to predict approval from observations and take the action with the highest approval, it seems possible that Arthur would not manipulate approval judgments (unlike AIXI).

There are also important details on how Hugh should rate -- in particular, we have to be careful to distinguish between Hugh's beliefs/information and Arthur's. For example, if Arthur thinks there's a 1% chance of a bridge collapsing if we drive over it, then Arthur shouldn't drive over it. However, if Hugh always assigns approval 1 to the optimal action and approval 0 to all other actions, and Arthur believes that Hugh knows whether the bridge will collapse, then the maximum expected approval action is to drive over the bridge.

The main issues with approval-directed agents are that it's not clear how to define them (especially from examples), whether they can be as useful as goal-directed agents, and whether approval-directed agents will have internal goal-seeking behavior that brings with it all of the problems that approval was meant to solve. It may also be a problem if some other Hugh-level intelligence gets control of the data that defines approval.

Rohin's opinion: Goal-directed behavior requires an extremely intelligent overseer in order to ensure that the agent is pointed at the correct goal (as opposed to one the overseer thinks is correct but is actually slightly wrong). I think of approval-directed agents as providing the intuition that we may only require an overseer that is slightly smarter than the agent in order to be aligned. This is because the overseer can simply "tell" the agent what actions to take, and if the agent makes a mistake, or tries to optimize a heuristic too hard, the overseer can notice and correct it interactively. (This is assuming that we solve the informed oversight problem so that the agent doesn't have information that is hidden from the overseer, so "intelligence" is the main thing that matters.) Only needing a slightly smarter overseer opens up a new space of solutions where we start with a human overseer and subhuman AI system, and scale both the overseer and the AI at the same time while preserving alignment at each step.

Approval-directed bootstrapping (Paul Christiano): To get a very smart overseer, we can use the idea of bootstrapping. Given a weak agent, we can define a stronger agent that results from letting the weak agent think for a long time. This strong agent can be used to oversee a slightly weaker agent that is still stronger than the original weak agent. Iterating this process allows us to reach very intelligent agents. In approval-directed agents, we can simply have Arthur ask Hugh to evaluate approval for actions, and in the process of evaluation Hugh can consult Arthur. Here, the weak agent Hugh is being amplified into a stronger agent by giving him the ability to consult Arthur -- and this becomes stronger over time as Arthur becomes more capable.

Rohin's opinion: This complements the idea of approval from the previous posts nicely: while approval tells us how to build an aligned agent from a slightly smarter overseer, bootstrapping tells us how to improve the capabilities of the overseer and the agent.

You could also combine this with particular ML algorithms in an attempt to define versions of those algorithms aligned with Hugh's enlightened judgment. For example, for RL algorithm A, we could define max-HCH_A to be A's chosen action when maximizing Hugh's approval after consulting max-HCH_A.

Rohin's opinion: This has the same nice recursive structure of bootstrapping, but without the presence of the agent. This probably makes it more amenable to formal analysis, but I think that the interactive nature of bootstrapping (and iterated amplification more generally) is quite important for ensuring good outcomes: it seems way easier to control an AI system if you can constantly provide input and feedback.
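To make the recursion in bootstrapping concrete, here is a toy sketch of the overseer/agent loop. Everything here is an illustrative invention (the scoring rule, the "consult the agent" bonus, the action strings), not Paul's actual proposal; it only shows the shape of the recursion, where each round's overseer is amplified by the previous round's agent.

```python
# Hedged sketch of approval-directed bootstrapping. The overseer ("Hugh")
# scores candidate actions, optionally consulting the current agent
# ("Arthur"); the agent then picks the highest-approval action.

def make_agent(overseer):
    """An agent that picks whichever action the overseer approves of most."""
    def agent(actions):
        return max(actions, key=overseer)
    return agent

def make_overseer(base_judgment, agent=None):
    """An overseer applying its own judgment, amplified (if possible) by
    consulting an agent -- the consultation rule here is purely illustrative."""
    def overseer(action):
        score = base_judgment(action)
        if agent is not None:
            # Toy consultation: a small bonus if the agent would also pick
            # this action over a fixed baseline.
            score += 0.1 if agent([action, "default"]) == action else 0.0
        return score
    return overseer

# Bootstrapping loop: each round, the overseer is amplified by the previous
# agent, and a new agent is defined against that overseer.
judgment = lambda a: len(a)   # stand-in for Hugh's raw evaluation
agent = None
for _ in range(3):
    overseer = make_overseer(judgment, agent)
    agent = make_agent(overseer)

print(agent(["act", "action", "a"]))  # → "action" (highest approval)
```

The important structural point is that `make_overseer` closes over the previous `agent`, so the overseer's effective capability can grow with each iteration while the base judgment stays fixed.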

Fixed point sequence

Fixed Point Discussion (Scott Garrabrant): This post discusses the various fixed point theorems from a mathematical perspective, without commenting on their importance for AI alignment.

Technical agendas and prioritization

Integrative Biological Simulation, Neuropsychology, and AI Safety (Gopal P. Sarma et al): See Import AI and this comment.

Learning human intent

Scalable agent alignment via reward modeling (Jan Leike): Summarized in the highlights!

A Geometric Perspective on the Transferability of Adversarial Directions (Zachary Charles et al)

AI strategy and policy

MIRI 2018 Update: Our New Research Directions (Nate Soares): This post gives a high-level overview of the new research directions that MIRI is pursuing with the goal of deconfusion, a discussion of why deconfusion is so important to them, an explanation of why MIRI is now planning to leave research unpublished by default, and a case for software engineers to join their team.

Rohin's opinion: There aren't enough details on the technical research for me to say anything useful about it. I'm broadly in support of deconfusion but am either less optimistic on the tractability of deconfusion, or more optimistic on the possibility of success with our current notions (probably both). Keeping research unpublished-by-default seems reasonable to me given the MIRI viewpoint for the reasons they talk about, though I haven't thought about it much. See also Import AI.

Other progress in AI

Reinforcement learning

Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search (Lars Buesing et al) (summarized by Richard): This paper aims to alleviate the data inefficiency of RL by using a model to synthesise data. However, even when environment dynamics can be modeled accurately, it can be difficult to generate data which matches the true distribution. To solve this problem, the authors use a Structured Causal Model trained to predict the outcomes which would have occurred if different actions had been taken from previous states. Data is then synthesised by rolling out from previously-seen states. The authors test performance in a partially-observable version of SOKOBAN, in which their system outperforms other methods of generating data.

Richard's opinion: This is an interesting approach which I can imagine becoming useful. It would be nice to see more experimental work in more stochastic environments, though.

Natural Environment Benchmarks for Reinforcement Learning (Amy Zhang et al) (summarized by Richard): This paper notes that RL performance tends to be measured in simple artificial environments - unlike other areas of ML in which using real-world data such as images or text is common. The authors propose three new benchmarks to address this disparity. In the first two, an agent is assigned to a random location in an image, and can only observe parts of the image near it. At every time step, it is able to move in one of the cardinal directions, unmasking new sections of the image, until it can classify the image correctly (task 1) or locate a given object (task 2). The third type of benchmark is adding natural video as background to existing Mujoco or Atari tasks. In testing this third category of benchmark, they find that PPO and A2C fall into a local optimum where they ignore the observed state when deciding the next action.

Richard's opinion: While I agree with some of the concerns laid out in this paper, I'm not sure that these benchmarks are the best way to address them. The third task in particular is mainly testing for ability to ignore the "natural data" used, which doesn't seem very useful. I think a better alternative would be to replace Atari with tasks in procedurally-generated environments with realistic physics engines. However, this paper's benchmarks do benefit from being much easier to produce and less computationally demanding.
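As a rough illustration of the paper's first two benchmark types, here is a minimal reconstruction of such an environment. This is our own sketch, not the authors' code: the window size, action encoding, and the choice to zero out masked pixels are all arbitrary assumptions.

```python
import numpy as np

# Toy masked-image environment: the agent starts at a random location,
# observes only the unmasked portion of the image, and moving in a
# cardinal direction unmasks a new window around its position.

class MaskedImageEnv:
    def __init__(self, image, window=3, seed=0):
        self.image = image
        self.window = window
        rng = np.random.default_rng(seed)
        h, w = image.shape[:2]
        self.pos = [rng.integers(h), rng.integers(w)]
        self.mask = np.zeros(image.shape[:2], dtype=bool)
        self._unmask()

    def _unmask(self):
        r, c = self.pos
        k = self.window // 2
        self.mask[max(0, r - k):r + k + 1, max(0, c - k):c + k + 1] = True

    def step(self, action):  # 0=up, 1=down, 2=left, 3=right
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        h, w = self.image.shape[:2]
        self.pos[0] = min(max(self.pos[0] + dr, 0), h - 1)
        self.pos[1] = min(max(self.pos[1] + dc, 0), w - 1)
        self._unmask()
        return np.where(self.mask, self.image, 0)  # masked observation

env = MaskedImageEnv(np.arange(100).reshape(10, 10))
obs = env.step(3)  # move right, unmasking new pixels
```

A classification (task 1) or localization (task 2) head would then consume `obs` at each step; the episode ends when the prediction is correct.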

Deep learning

Do Better ImageNet Models Transfer Better? (Simon Kornblith et al) (summarized by Dan H)

Dan H's opinion: This paper shows a strong correlation between a model's ImageNet accuracy and its accuracy on transfer learning tasks. In turn, better ImageNet models learn stronger features. This is evidence against the assertion that researchers are simply overfitting ImageNet. Other evidence is that the architectures themselves work better on different vision tasks. Further evidence against overfitting ImageNet is that many architectures which are designed for CIFAR-10, when trained on ImageNet, can be highly competitive on ImageNet.

Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks (Jie Hu, Li Shen, Samuel Albanie et al) (summarized by Dan H)

Read more: This method uses spatial summarization for increasing convnet accuracy and was discovered around the same time as this similar work. Papers with independent rediscoveries tend to be worth taking more seriously.

Improving Generalization for Abstract Reasoning Tasks Using Disentangled Feature Representations (Xander Steenbrugge et al)

Discuss

### Boltzmann Brains, Simulations and Self-Refuting Hypotheses

26 November 2018 - 22:09
Published on Mon Nov 26 2018 19:09:42 GMT+0000 (UTC)

Let's suppose, for the purposes of this post, that our best model of dark energy is such that an exponentially vast number of Boltzmann brains will exist in the far future. The idea that we may be in an ancestor simulation is similar in its self-refuting nature but slightly vaguer, as it depends on the likely goals of future societies.

What do I mean when I say that these arguments are self refuting? I mean that accepting the conclusion seems to give a good reason to reject the premise. Once you actually accept that you are a Boltzmann brain, all your reasoning about the nature of dark energy becomes random noise. There is no reason to think that you have the slightest clue about how the universe works. We seem to be getting evidence that all our evidence is nonsense, including the evidence that told us that. The same holds for the simulation hypothesis, unless you conjecture that all civilizations make ancestor simulations almost exclusively.

What's actually going on here? We have three hypotheses.

1) No Boltzmann brains, the magic dark energy fairy stops them being created somehow. (Universe A)

2) Boltzmann brains exist, and I am not one. (Universe B)

3) I am a Boltzmann brain. (Universe B)

As all these hypotheses fit the data, we have to tell them apart using priors and anthropic decision theory; the confusion comes from not having settled on an anthropic theory to use, but ad-libbing one with intuition.

SIA selects from all possible observers, and so tells you that 3) is by far the most likely.

SSA, with an Occamian prior, says that Universe B is slightly more likely, because it takes fewer bits to specify. However, most of the observers in Universe B are Boltzmann brains seeing random gibberish. The observation of any kind of pattern gives an overwhelming update towards option 1).
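The SSA argument can be phrased as an ordinary Bayesian update. The numbers below are made up purely for illustration; only the qualitative shape (B slightly favoured on priors, pattern-observation vanishingly rare among B's observers) comes from the post.

```python
# Toy Bayesian version of the SSA argument: Universe B gets a slightly
# higher prior, but a randomly sampled observer in B almost surely sees
# gibberish, so observing an ordered world updates hard towards Universe A.

prior_A, prior_B = 0.49, 0.51       # B is slightly simpler to specify
p_pattern_given_A = 1.0             # evolved observers see ordered worlds
p_pattern_given_B = 1e-12           # illustrative: almost all of B's
                                    # observers are Boltzmann brains

posterior_A = (prior_A * p_pattern_given_A) / (
    prior_A * p_pattern_given_A + prior_B * p_pattern_given_B
)
print(posterior_A)  # overwhelmingly close to 1
```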

Discuss

### Quantum Mechanics, Nothing to do with Consciousness

26 November 2018 - 21:59
Published on Mon Nov 26 2018 18:59:19 GMT+0000 (UTC)

Epistemic status: A quick rejection of the quantum consciousness woo. If you have already read the sequences, there's nothing new in here. If you're new to the site, or need a single page to point people to, here it is.

Real quantum mechanics looks like pages of abstract maths, after which you have deduced the results of a physics experiment. Given how hard the maths is, most of the systems that we use quantum mechanics to predict are quite simple. One common experiment is to take a glass tube full of a particular element and run lots of electricity through it. The element will produce coloured light, like sodium producing orange, or neon producing red. So take a prism and split that light to see what colours are being produced. Quantum physicists will do lots of tricky maths about how the electrons move between energy levels to work out what colour different elements will produce.

There have been no quantum mechanics experiments that show consciousness to have any relevance to particle physics. The laws of physics do not say what is or is not conscious, in much the same way that they don't say what is or is not a work of art.

For the writers among you, think of a word processor feature that takes some text, and turns it into ALL CAPS. You can put a great novel into this feature if you want. The point is that the rule itself acts the same way whether or not it's given great literature. You can't use the rule to tell what is great literature, you have to read it and decide yourself. Consciousness, like literature, is a high level view that's hard to pin down precisely, and is largely a matter of how we choose to define it. Quantum mechanics is a simple, mechanistic rule.

Yes, I know that some of you are thinking of the double slit experiment. You make a screen with two slits, shine light through, and get an interference pattern. Put a detector at one slit, attach a dial to the detector, and have a scientist watch the dial so they can see which slit the photon went through, and the interference pattern disappears. Clearly, thought some idiot, consciousness causes the quantum wave function to collapse; the universe doesn't like us knowing which slit the photon goes through.

However, let's do a few more experiments. Repeat the previous one, except that the scientist is sleeping in front of the dial. No interference pattern. Turn the dial to face the wall, remove the scientist entirely. Still no interference pattern. Unplug the dial from the detector, so electrical impulses run up the wire and then can't go anywhere. Again, no interference. Whatever is stopping interference patterns, it looks like detectors, not consciousness.

It turns out that any interaction with any other particles, such that the position the other particle ends up in depends on which slit a photon went through, creates entanglement between the photon and the other particle, which destroys interference. And the atoms in the dial, the electrons in the wire, and the particles in the detector itself all have their positions depend on where the photon went.
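A toy amplitude calculation makes this concrete: the interference (cross) term is weighted by the overlap of the "detector" states correlated with each slit, so fully distinguishing (orthogonal) detector states kill the fringes regardless of whether anyone looks. This is a simplified sketch, not a full quantum treatment; the normalization and overlap parameter are illustrative.

```python
import numpy as np

# Two-slit intensity at a point on the screen, with a relative phase
# between the paths and a detector whose states for the two paths have
# a given overlap (1 = indistinguishable, 0 = orthogonal).

def intensity(phase, detector_overlap):
    a1 = 1 / np.sqrt(2)                      # amplitude through slit 1
    a2 = np.exp(1j * phase) / np.sqrt(2)     # amplitude through slit 2
    # Tracing out the detector leaves the cross term scaled by the
    # overlap of the two detector states.
    cross = 2 * (a1.conjugate() * a2).real * detector_overlap
    return abs(a1) ** 2 + abs(a2) ** 2 + cross

# Identical detector states: full fringes (bright at phase 0, dark at pi).
print(intensity(0.0, 1.0), intensity(np.pi, 1.0))   # → 2.0, ~0.0
# Orthogonal detector states: flat intensity, no interference.
print(intensity(0.0, 0.0), intensity(np.pi, 0.0))   # → 1.0, 1.0
```

Nothing in the calculation refers to consciousness; the only thing that matters is whether any degree of freedom, dial or scientist or stray electron, ends up correlated with the path.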

In general, the way to get rid of mysteries is to break them up into smaller mysteries, until you're left with loads of tiny mysteries. How life worked used to be one big mystery. But thanks to modern biology, we now have thousands of tiny mysteries about how yeast metabolism can tolerate high levels of alcohol, or how protozoa DNA doesn't get tangled when they replicate. And these are surrounded by large amounts of well understood science. (I'm not a biologist, so these particular things might be solved by now, but you get the idea.) Big mysteries get broken down into a pile of facts, and several smaller ones.

Gluing the "mystery" of quantum mechanics, to the "mystery" of consciousness to make a bigger and more mysterious mystery, would be a mistake even if both of these things were actually mysteries to humanity. Mystery is a blank textbook, not a feature of the world, and in this case, there is a clear picture of quantum mechanics, and a rough sketch of consciousness in the textbooks.

Discuss

### Status model

26 November 2018 - 19:44
Published on Mon Nov 26 2018 16:44:36 GMT+0000 (UTC)

[Epistemic status: My best guess.]

Following a conversation on a previous post, I decided to do some research into the community’s thoughts on status. The community talks about status a lot and I spent a few happy hours sifting through everything I could find.

I came up with a status model based on the posts and comments which I read:

Essentially, status-related mental adaptations are executed which lead to certain status behaviours. These behaviours determine our social standing which, at least in the ancestral environment, tended to affect our fitness.

I don’t think there’s anything groundbreaking here (a similar model would probably apply to any adaptation-execution vs. fitness-maximising effect), but I haven’t seen it sketched out specifically for status before.

The word "status" gets used to refer to items on all 4 levels and this can lead to confusion where two people are referring to different levels.

For instance, what is status and how can I measure it? One way is to look directly at row 2 (status rating): who respects whom and how much? Maybe I could give everyone a questionnaire to rate each other’s status. Or I can look at row 3 (status behaviours): who acts like they have high status? Or row 1 (status benefits): who has the most social control etc.? Whoever gets the most benefits probably has the most status.

Each option has its advantages and disadvantages (e.g. accuracy, ease of assessment), but it is important to know which level is being referred to.

Probably the most commonly debated issue on this topic is whether status is zero-sum.

If we consider status in row 2 then status is probably going to be relatively zero-sum, although you can maybe get around this a bit by splitting into smaller sub-cultures.

If we consider row 1, insofar as the status rating determines the benefits, they are close to zero-sum. However, the benefits are also controlled by things other than status (how good are we at getting food, how well coordinated are we as a group?) and so are able to be positive sum.

Row 4 is where it gets really interesting - our adaptations which implement status behaviour are not zero-sum. We can feel more self-esteem without increasing our status rating (see That Other Kind Of Status). These mental adaptations are the things which we care about on a gut level and give plenty of scope for positive sum behaviours (e.g. give praise).

I don't pretend that this answers the zero-sum question completely but I think it does put it in a helpful frame.

The model is incomplete in a number of ways.

The “status rating” row is a massive simplification. In reality there are all the different ways in which humans judge status, how status changes depending on group and circumstances, and the effects of social allies. I only listed two kinds of status to simplify visually.

The status benefits and status adaptations listed are also only a subset of the actual benefits and adaptations.

The relationships between the rows are leaky. The status adaptations lead to behaviours which aren’t necessarily related to status and the status benefits can be affected by things other than status rating.

Despite the model's limitations, I hope it is a useful simplification.

Discuss

### MIRI's 2017 Fundraiser

26 November 2018 - 16:45
Published on Mon Nov 26 2018 13:45:32 GMT+0000 (UTC)

MIRI’s 2017 fundraiser is live through the end of December! Our progress so far (updated live):

Donate Now

MIRI is a research nonprofit based in Berkeley, California with a mission of ensuring that smarter-than-human AI technology has a positive impact on the world. You can learn more about our work at “Why AI Safety?” or via MIRI Executive Director Nate Soares’ Google talk on AI alignment.

In 2015, we discussed our interest in potentially branching out to explore multiple research programs simultaneously once we could support a larger team. Following recent changes to our overall picture of the strategic landscape, we’re now moving ahead on that goal and starting to explore new research directions while also continuing to push on our agent foundations agenda. For more on our new views, see “There’s No Fire Alarm for Artificial General Intelligence” and our 2017 strategic update. We plan to expand on our relevant strategic thinking more in the coming weeks.

Our expanded research focus means that our research team can potentially grow big, and grow fast. Our current goal is to hire around ten new research staff over the next two years, mostly software engineers. If we succeed, our point estimate is that our 2018 budget will be $2.8M and our 2019 budget will be $3.5M, up from roughly $1.9M in 2017.¹

We’ve set our fundraiser targets by estimating how quickly we could grow while maintaining a 1.5-year runway, on the simplifying assumption that about 1/3 of the donations we receive between now and the beginning of 2019 will come during our current fundraiser.² Hitting Target 1 ($625k) then lets us act on our growth plans in 2018 (but not in 2019); Target 2 ($850k) lets us act on our full two-year growth plan; and in the case where our hiring goes better than expected, Target 3 ($1.25M) would allow us to add new members to our team about twice as quickly, or pay higher salaries for new research staff as needed.

We discuss more details below, both in terms of our current organizational activities and how we see our work fitting into the larger strategy space.

What’s new at MIRI

New developments this year have included:

Thanks in part to this major support, we’re currently in a position to scale up the research team quickly if we can find suitable hires. We intend to explore a variety of new research avenues going forward, including making a stronger push to experiment and explore some ideas in implementation.⁴ This means that we’re currently interested in hiring exceptional software engineers, particularly ones with machine learning experience.

The two primary things we’re looking for in software engineers are programming ability and value alignment. Since we’re a nonprofit, it’s also worth noting explicitly that we’re generally happy to pay excellent research team applicants with the relevant skills whatever salary they would need to work at MIRI. If you think you’d like to work with us, apply here!

In that vein, I’m pleased to announce that we’ve made our first round of hires for our engineer positions, including:

Jesse Liptrap, who previously worked on the Knowledge Graph at Google for four years, and as a bioinformatician at UC Berkeley. Jesse holds a PhD in mathematics from UC Santa Barbara, where he studied category-theoretic underpinnings of topological quantum computing.

Nick Tarleton, former lead architect at the search startup Quixey. He previously studied computer science and decision science at Carnegie Mellon University, and worked with us at the first iteration of our summer fellows program, studying consequences of proposed AI goal systems.

On the whole, our initial hiring efforts have gone quite well, and I’ve been very impressed with the high caliber of our hires and of our pool of candidates.

On the research side, our recent work has focused heavily on open problems in decision theory, and on other questions related to naturalized agency. Scott Garrabrant divides our recent work on the agent foundations agenda into four categories, tackling different AI alignment subproblems:

Decision theory — Traditional models of decision-making assume a sharp Cartesian boundary between agents and their environment. In a naturalized setting in which agents are embedded in their environment, however, traditional approaches break down, forcing us to formalize concepts like “counterfactuals” that can be left implicit in AIXI-like frameworks.

Recent focus areas:

Naturalized world-models — Similar issues arise for formalizing how systems model the world in the absence of a sharp agent/environment boundary. Traditional models leave implicit aspects of “good reasoning” such as causal and multi-level world-modeling, reasoning under deductive limitations, and agents modeling themselves.

Recent focus areas:

• Kakutani’s fixed-point theorem and reflective oracles: “Hyperreal Brouwer.”
• Transparency and merging of opinions in logical inductors.
• Ontology merging, a possible approach to reasoning about ontological crises and transparency.
• Attempting to devise a variant of logical induction that is “Bayesian” in the sense that its belief states can be readily understood as conditionalized prior probability distributions.

Subagent avoidance — A key reason that agent/environment boundaries are unhelpful for thinking about AGI is that a given AGI system may consist of many different subprocesses optimizing many different goals or subgoals. The boundary between different “agents” may be ill-defined, and a given optimization process is likely to construct subprocesses that pursue many different goals. Addressing this risk requires limiting the ways in which new optimization subprocesses arise in the system.

Recent focus areas:

Robust delegation — In cases where it’s desirable to delegate to another agent (e.g. an AI system or a successor), it’s critical that the agent be well-aligned and trusted to perform specified tasks. The value learning problem and most of the AAMLS agenda fall in this category. Recent focus areas:

Additionally, we ran several research workshops, including one focused on Paul Christiano’s research agenda.

Fundraising goals

To a first approximation, we view our ability to make productive use of additional dollars in the near future as linear in research personnel additions. We don’t expect to run out of additional top-priority work we can assign to highly motivated and skilled researchers and engineers. This represents an important shift from our past budget and team size goals.⁵

Growing our team as much as we hope to is by no means an easy hiring problem, but it’s made significantly easier by the fact that we’re now looking for top software engineers who can help implement experiments we want to run, and not just productive pure researchers who can work with a high degree of independence. (In whom we are, of course, still very interested!) We therefore think we can expand relatively quickly over the next two years (productively!), funds allowing.

In our mainline growth scenario, our reserves plus next year’s $1.25M installment of the Open Philanthropy Project’s 3-year grant would leave us with around 9 months of runway going into 2019. However, we have substantial uncertainty about exactly how quickly we’ll be able to hire additional researchers and engineers, and therefore about our 2018–2019 budgets. Our 2018 budget breakdown in the mainline success case looks roughly like this:

[Chart: 2018 Budget Estimate (Mainline Growth)]

To determine our fundraising targets this year, we estimated the support levels (above the Open Philanthropy Project’s support) that would make us reasonably confident that we can maintain a 1.5-year runway going into 2019 in different growth scenarios, assuming that our 2017 fundraiser looks similar to next year’s fundraiser and that our off-fundraiser donor support looks similar to our on-fundraiser support:

Basic target — $625,000. At this funding level, we’ll be in a good position to pursue our mainline hiring goal in 2018, although we will likely need to halt or slow our growth in 2019.

Mainline-growth target — $850,000. At this level, we’ll be on track to fully fund our planned expansion over the next few years, allowing us to roughly double the number of research staff over the course of 2018 and 2019.

Rapid-growth target — $1,250,000. At this funding level, we will be on track to maintain a 1.5-year runway even if our hiring proceeds a fair amount faster than our mainline prediction. We’ll also have greater freedom to pay higher salaries to top-tier candidates as needed.
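For intuition, the runway arithmetic behind these targets is simple division. The budget and grant figures below are taken from the post; the starting reserve figure is a hypothetical placeholder, since the post doesn't state it exactly.

```python
# Rough sketch of the runway calculation: years of operation covered by
# reserves plus expected inflows, at a given annual budget.

budget_2018 = 2.8e6       # point estimate from the post
open_phil_2018 = 1.25e6   # next year's grant installment (from the post)
reserves = 2.0e6          # hypothetical placeholder, not a stated figure

def runway_years(reserves, inflows, annual_budget):
    """Years the organization can operate on reserves plus inflows."""
    return (reserves + inflows) / annual_budget

# With only the grant, how much runway does the mainline 2018 budget leave?
print(round(runway_years(reserves, open_phil_2018, budget_2018), 2))
```

The fundraiser targets are then the extra donations needed to push this number up to the desired 1.5 years under each growth scenario.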

Beyond these growth targets: if we saw an order-of-magnitude increase in MIRI’s funding in the near future, we have several ways we believe we can significantly accelerate our recruitment efforts to grow the team faster. These include competitively paid trial periods and increased hiring outreach across venues and communities where we expect to find high-caliber candidates. Funding increases beyond the point where we could usefully use the money to hire faster would likely cause us to spin off new initiatives to address the problem of AI x-risk from other angles; we wouldn’t expect them to go to MIRI’s current programs.

On the whole, we’re in a very good position to continue expanding, and we’re enormously grateful for the generous support we’ve already received this year. Relative to our present size, MIRI’s reserves are much more solid than they have been in the past, putting us in a strong position going into 2018.

Given our longer runway, this may be a better year than usual for long-time MIRI supporters to consider supporting other projects that have been waiting in the wings. That said, we don’t personally know of marginal places to put additional dollars that we currently view as higher-value than MIRI, and we do expect our fundraiser performance to affect our growth over the next two years, particularly if we succeed in growing the MIRI team as fast as we’re hoping to.

Donate Now

Strategic background

Taking a step back from our immediate organizational plans: how does MIRI see the work we’re doing as tying into positive long-term, large-scale outcomes?

A lot of our thinking on these issues hasn’t yet been written up in any detail, and many of the issues involved are topics of active discussion among people working on existential risk from AGI. In very broad terms, however, our approach to global risk mitigation is to think in terms of desired outcomes, and to ask: “What is the likeliest way that the outcome in question might occur?” We then repeat this process until we backchain to interventions that actors can take today.

Ignoring a large number of subtleties, our view of the world’s strategic situation currently breaks down as follows:

1. Long-run good outcomes. Ultimately, we want humanity to figure out the best possible long-run future and enact that kind of future, factoring in good outcomes for all sentient beings. However, there is currently very little we can say with confidence about what desirable long-term outcomes look like, or how best to achieve them; and if someone rushes to lock in a particular conception of “the best possible long-run future,” they’re likely to make catastrophic mistakes both in how they envision that goal and in how they implement it.

In order to avoid making critical decisions in haste and locking in flawed conclusions, humanity needs:

2. A stable period during which relevant actors can accumulate whatever capabilities and knowledge are required to reach robustly good conclusions about long-run outcomes. This might involve decisionmakers developing better judgment, insight, and reasoning skills in the future, solving the full alignment problem for fully autonomous AGI systems, and so on.

Given the difficulty of the task, we expect a successful stable period to require:

3. A preceding end to the acute risk period. If AGI carries a significant chance of causing an existential catastrophe over the next few decades, this forces a response under time pressure; but if actors attempt to make irreversible decisions about the long-term future under strong time pressure, we expect the result to be catastrophically bad. Conditioning on good outcomes, we therefore expect a two-step process where addressing acute existential risks takes temporal priority.

To end the acute risk period, we expect it to be necessary for actors to make use of:

4. A risk-mitigating technology. On our current view of the technological landscape, there are a number of plausible future technologies that could be leveraged to end the acute risk period.

We believe that the likeliest way to achieve a technology in this category sufficiently soon is through:

5. AGI-empowered technological development carried out by task-directed AGI systems. Depending on early AGI systems’ level of capital-intensiveness, on whether AGI is a late-paradigm or early-paradigm invention, and on a number of other factors, AGI might be developed by anything from a small Silicon Valley startup to a large-scale multinational collaboration. Regardless, we expect AGI to be developed before any other (meta)technology that can be employed to end the acute risk period, and if early AGI systems can be used safely at all, then we expect it to be possible for an AI-empowered project to safely automate a reasonably small set of concrete science and engineering tasks that are sufficient for ending the risk period. This requires:

6. Construction of minimal aligned AGI. We specify “minimal” because we consider success much more likely if developers attempt to build systems with the bare minimum of capabilities for ending the acute risk period. We expect AGI alignment to be highly difficult, and we expect additional capabilities to add substantially to this difficulty.

If an aligned system of this kind were developed, we would expect two factors to be responsible:

7a. A technological edge in AGI by a strategically adequate project. By “strategically adequate” we mean a project with strong opsec, research closure, trustworthy command, a commitment to the common good, security mindset, requisite resource levels, and heavy prioritization of alignment work. A project like this needs to have a large enough lead to be able to afford to spend a substantial amount of time on safety measures, as discussed at FLI’s Asilomar conference.

7b. A strong white-boxed system understanding on the part of the strategically adequate project during late AGI development. By this we mean that developers go into building AGI systems with a good understanding of how their systems decompose and solve particular cognitive problems, of the kinds of problems different parts of the system are working on, and of how all of the parts of the system interact.

On our current understanding of the alignment problem, developers need to be able to give a reasonable account of how all of the AGI-grade computation in their system is being allocated, similar to how secure software systems are built to allow security professionals to give a simple accounting of why the system has no unforeseen vulnerabilities. See “Security Mindset and Ordinary Paranoia” for more details.

Developers must be able to explicitly state and check all of the basic assumptions required for their account of the system’s alignment and effectiveness to hold. Additionally, they need to design and modify AGI systems only in ways that preserve understandability — that is, only allow system modifications that preserve developers’ ability to generate full accounts of what cognitive problems any given slice of the system is solving, and why the interaction of all of the system’s parts is both safe and effective.

Our view is that this kind of system understandability will in turn require:

8. Steering toward alignment-conducive AGI approaches. Leading AGI researchers and developers need to deliberately direct research efforts toward ensuring that the earliest AGI designs are relatively easy to understand and align.

We expect this to be a critical step, as we do not expect most approaches to AGI to be alignable after the fact without long, multi-year delays.

We plan to say more in the future about the criteria for strategically adequate projects in 7a. We do not believe that any project meeting all of these conditions currently exists, though we see various ways that projects could reach this threshold.

The above breakdown only discusses what we view as the “mainline” success scenario.⁶ If we condition on good long-run outcomes, the most plausible explanation we can come up with cites a strategically adequate AI-empowered project ending the acute risk period, and appeals to the fact that those future AGI developers maintained a strong understanding of their system’s problem-solving work over the course of development, made use of advance knowledge about which AGI approaches conduce to that kind of understanding, and filtered on those approaches.

For that reason, MIRI does research to intervene on 8 from various angles, such as by examining holes and anomalies in the field’s current understanding of real-world reasoning and decision-making. We hope to thereby reduce our own confusion about alignment-conducive AGI approaches and ultimately help make it feasible for developers to construct adequate “safety-stories” in an alignment setting. As we improve our understanding of the alignment problem, our aim is to share new insights and techniques with leading or up-and-coming developer groups, who we’re generally on good terms with.

A number of the points above require further explanation and motivation, and we’ll be providing more details on our view of the strategic landscape in the near future.

Further questions are always welcome at contact@intelligence.org, regarding our current organizational activities and plans as well as the long-term role we hope to play in giving AGI developers an easier and clearer shot at making the first AGI systems robust and safe. For more details on our fundraiser, including corporate matching, see our Donate page.

1 Note that this $1.9M is significantly below the $2.1–2.5M we predicted for the year in April. Personnel costs are MIRI’s most significant expense, and higher research staff turnover in 2017 meant that we had fewer net additions to the team this year than we’d budgeted for. We went under budget by a relatively small margin in 2016, spending $1.73M versus a predicted $1.83M.

Our 2018–2019 budget estimates are highly uncertain, with most of the uncertainty coming from substantial uncertainty about how quickly we’ll be able to take on new research staff.

2 This is roughly in line with our experience in previous years, when excluding expected grants and large surprise one-time donations. We’ve accounted for the former in our targets but not the latter, since we think it unwise to bank on unpredictable windfalls.

Note that in previous years, we’ve set targets based on maintaining a 1-year runway. Given the increase in our size, I now think that a 1.5-year runway is more appropriate.

3 Including the $1.01 million donation and the first $1.25 million from the Open Philanthropy Project, we have so far raised around $3.16 million this year, overshooting the $3 million goal we set earlier this year.

4 We emphasize that, as always, “experiment” means “most things tried don’t work.” We’d like to avoid setting expectations of immediate success for this exploratory push.

5 Our previous goal was to slowly ramp up to the $3–4 million level and then hold steady with around 13–17 research staff. We now expect to be able to reach (and surpass) that level much more quickly.

6 There are other paths to good outcomes that we view as lower-probability, but still sufficiently high-probability that the global community should allocate marginal resources to their pursuit.

Discuss

### Humans Consulting HCH

26 November, 2018 - 02:18
Published on Sun Nov 25 2018 23:18:55 GMT+0000 (UTC)

That is, Hugh is able to consult a copy of Hugh, who is able to consult a copy of Hugh, who is able to consult a copy of Hugh…

Let’s call this process HCH, for “Humans Consulting HCH.”

I’ve talked about many variants of this process before, but I find it easier to think about with a nice handle. (Credit to Eliezer for proposing using a recursive acronym.)

HCH is easy to specify very precisely. For now, I think that HCH is our best way to precisely specify “a human’s enlightened judgment.” It’s got plenty of problems, but for now I don’t know anything better.

Elaborations

We can define realizable variants of this inaccessible ideal:

• For a particular prediction algorithm P, define HCHᴾ as:
“P’s prediction of what a human would say after consulting HCHᴾ”
• For a reinforcement learning algorithm A, define max-HCHᴬ as:
“A’s output when maximizing the evaluation of a human after consulting max-HCHᴬ”
• For a given market structure and participants, define HCHᵐᵃʳᵏᵉᵗ as:
“the market’s prediction of what a human will say after consulting HCHᵐᵃʳᵏᵉᵗ”

Note that e.g. HCHᴾ is totally different from “P’s prediction of HCH.” HCHᴾ will generally make worse predictions, but it is easier to implement.
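The recursive structure here can be made concrete with a toy sketch. This is illustrative only: a hand-coded "human" policy that decomposes arithmetic questions stands in for both the human and the predictor, and the depth cutoff is an artifact of the toy (the ideal HCH has no such cutoff).

```python
# Toy sketch of the HCH recursion (illustrative; not from the post).
def hch(question: str, depth: int = 10) -> str:
    """Answer `question` as the "human" would, with access to copies of hch."""
    if depth == 0:
        return "give up"          # the ideal HCH has no such cutoff
    return human(question, consult=lambda q: hch(q, depth - 1))

def human(question: str, consult) -> str:
    # Toy "human": sums a list of numbers by splitting the question in two
    # and consulting sub-copies of HCH on each half.
    nums = question.split()
    if len(nums) == 1:
        return nums[0]
    mid = len(nums) // 2
    left = consult(" ".join(nums[:mid]))
    right = consult(" ".join(nums[mid:]))
    return str(int(left) + int(right))

print(hch("1 2 3 4"))  # 10
```

The point of the sketch is the shape of the definition: `hch` appears inside its own definition, just as HCHᴾ appears inside the quoted definition of HCHᴾ.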

Hope

The best case is that HCHᴾ, max-HCHᴬ, and HCHᵐᵃʳᵏᵉᵗ are:

• As capable as the underlying predictor, reinforcement learner, or market participants.
• Aligned with the enlightened judgment of the human, e.g. as evaluated by HCH.

(At least when the human is suitably prudent and wise.)

It is clear from the definitions that these systems can’t be any more capable than the underlying predictor/learner/market. I honestly don’t know whether we should expect them to match the underlying capabilities. My intuition is that max-HCHᴬ probably can, but that HCHᴾ and HCHᵐᵃʳᵏᵉᵗ probably can’t.

It is similarly unclear whether the system continues to reflect the human’s judgment. In some sense this is in tension with the desire to be capable — the more guarded the human, the less capable the system but the more likely it is to reflect their interests. The question is whether a prudent human can achieve both goals.

This was originally posted here.

Tomorrow's AI Alignment Forum sequences will take a break, and tomorrow's post will be Issue #34 of the Alignment Newsletter.

The next post in this sequence is 'Corrigibility' by Paul Christiano, which will be published on Tuesday 27th November.

Discuss

### Approval-directed bootstrapping

26 November, 2018 - 02:18
Published on Sun Nov 25 2018 23:18:47 GMT+0000 (UTC)

Approval-directed behavior works best when the overseer is very smart. Where can we find a smart overseer?

One approach is bootstrapping. By thinking for a long time, a weak agent can oversee an agent (slightly) smarter than itself. Now we have a slightly smarter agent, who can oversee an agent which is (slightly) smarter still. This process can go on, until the intelligence of the resulting agent is limited by technology rather than by the capability of the overseer. At this point we have reached the limits of our technology.

This may sound exotic, but we can implement it in a surprisingly straightforward way.

Suppose that we evaluate Hugh’s approval by predicting what Hugh would say if we asked him; the rating of action a is what Hugh would say if, instead of taking action a, we asked Hugh, “How do you rate action a?”

Now we get bootstrapping almost for free. In the process of evaluating a proposed action, Hugh can consult Arthur. This new instance of Arthur will, in turn, be overseen by Hugh—and in this new role Hugh can, in turn, be assisted by Arthur. In principle we have defined the entire infinite regress before Arthur takes his first action.

We can even learn this function by examples — no elaborate definitions necessary. Each time Arthur proposes an action, we actually ask Hugh to evaluate the action with some probability, and we use our observations to train a model for Hugh’s judgments.

In practice, Arthur might not be such a useful assistant until he has acquired some training data. As Arthur acquires training data, the Hugh+Arthur system becomes more intelligent, and so Arthur acquires training data from a more intelligent overseer. The bootstrapping unfolds over time as Arthur adjusts to increasingly powerful overseers.
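The loop described above can be sketched schematically. All names here (`ask_hugh`, `step`, the toy rating function) are my own illustrative stand-ins, not anything from the post; in particular, `ask_hugh` abbreviates the expensive process in which Hugh, possibly assisted by Arthur, evaluates an action.

```python
# Schematic sketch of the approval-directed training loop (illustrative).
import random

ratings = {}                          # training data: action -> Hugh's rating

def ask_hugh(action: str) -> float:
    # Stand-in for Hugh's considered judgment; in the scheme above, Hugh
    # could consult Arthur while producing this rating.
    return -abs(len(action) - 5)      # toy: Hugh prefers 5-character actions

def model_rating(action: str) -> float:
    # Arthur's learned model of Hugh's judgment; a pessimistic prior for
    # actions that have never been rated.
    return ratings.get(action, -10.0)

def step(candidates, query_prob=0.3):
    action = max(candidates, key=model_rating)   # act on the current model
    if random.random() < query_prob:             # occasionally query Hugh...
        ratings[action] = ask_hugh(action)       # ...and learn from the answer
    return action
```

As the `ratings` table fills in, the model's choices track Hugh's judgment more closely, which is the sense in which the Hugh+Arthur system improves over time.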

This was originally posted here.

Tomorrow's AI Alignment Forum sequences will take a break, and tomorrow's post will be Issue #34 of the Alignment Newsletter.

The next post in this sequence is 'Humans consulting HCH', also released today.

Discuss

### How rapidly are GPUs improving in price performance?

25 November, 2018 - 22:54
Published on Sun Nov 25 2018 19:54:10 GMT+0000 (UTC)

Discuss

### Values Weren't Complex, Once.

25 November, 2018 - 12:17
Published on Sun Nov 25 2018 09:17:02 GMT+0000 (UTC)

The central argument of this post is that human values are only complex because all the obvious constraints and goals are easily fulfilled. The resulting post-optimization world is deeply confusing, and leads to noise as the primary driver of human values. This has worrying implications for any kind of world-optimizing. (This isn't a particularly new idea, but I am taking it a bit farther and/or in a different direction than this post by Scott Alexander, and I think it is worth making clear, given the previously noted connection to value alignment and effective altruism.)

First, it seems clear that formerly simple human values are now complex. "Help and protect relatives, babies, and friends" as a way to ensure group fitness and survival is mostly accomplished, so we find complex ethical dilemmas about the relative values of different behaviors. "Don't hurt other people" as a tool for ensuring reciprocity has turned into compassion for humanity, animals, and perhaps other forms of suffering. These are more complex than anything that could have been expressed in the ancestral environment, given its restricted resources. It's worth looking at what changed, and how.

In the ancestral environment, humans had three basic desires: food, fighting, and fornication. Food is now relatively abundant, leading to people's complex preferences about exactly which flavors they like most. These differ because the base drive for food is overoptimizing. Fighting was competition between people for resources, and since we all have plenty, it turns into status-seeking in ways that aren't particularly meaningful outside of human social competition. The varieties of signalling and counter-signalling are the result. And fornication was originally for procreation, but we're adaptation-executers, not fitness-maximizers, so we've short-circuited that with birth control and pornography, leading to an explosion in seeking sexual variety and individual kinks.

Past the point where maximizing the function has a meaningful impact on the intended result, we see the tails come apart. The goal-seeking of human nature, however, still needs some direction in which to push the optimization process. The implication is that humanity finds diverging goals because we are past the point where the basic desires run out. As Randall Munroe points out in an xkcd comic, this leads to increasingly complex and divergent preferences for ever less meaningful results. That comic would be funny if it weren't a huge problem for aligning group decision making and avoiding longer-term problems.

If this is correct, the key takeaway is that as humans find ever fewer things to need, they inevitably find ever more things to disagree about. Even though we expect convergent goals related to dominating resources, which narrowly implies that we want to increase the pool of resources to reduce conflict, human values might diverge as the pool of such resources grows.

Discuss

### The Post-Singularity Social Contract and Bostrom's "Vulnerable World Hypothesis"

25 November, 2018 - 04:34
Published on Sun Nov 25 2018 01:34:23 GMT+0000 (UTC)

I thought it might be worth outlining a few interesting (at least in my view!) parallels between Nick Bostrom’s recent working draft paper on the “vulnerable world hypothesis” (VWH) and my slightly less recent book chapter “Superintelligence and the Future of Governance: On Prioritizing the Control Problem at the End of History.”

In brief, my chapter offers an explicit (and highly speculative) proposal for extricating ourselves from the “semi-anarchic default condition” in which we currently find ourselves, although I don’t use that term. But I do more or less describe the three features of this condition that Bostrom identifies:

(1) The phenomenon of “agential risks”—which Bostrom refers to as the “apocalyptic residual,” a phrase that I think will ultimately introduce confusion and thus ought to be eschewed!—entails that, once the destructive technological means become available, the world will almost certainly face global-scale disaster.

(2) I argue that sousveillance and the “transparent society” model (of David Brin) are both inadequate to prevent global-scale attacks from risky agents with 100 percent reliability. Furthermore, I contend that asymmetrical invasive global surveillance systems will almost certainly be misused and abused by those ("mere humans") in charge, thus threatening another type of existential hazard: totalitarianism.

(3) Finally, I suggest that a human-controlled singleton (or global governing system) is unlikely to take shape on the relevant timescales; i.e., yes, there appears to be some general momentum toward a unipolar configuration (see Bostrom’s “singleton hypothesis,” which seems likely true to me), but unprecedented destructive capabilities will likely be widely distributed among nonstate actors before this happens.

I then argue for a form of algocracy: One way to avoid almost certain doom is to design a superintelligent algorithm for the purpose of coordinating planetary affairs and essentially spying on all citizens for the purpose of preemptively obviating global-scale attacks, some of which could have irreversible consequences. Ultimately, my conclusion is that, given the exponential development of dual-use emerging technologies, we may need to actually accelerate work on both (a) the philosophical control problem, and (b) the technical problem of creating an artificial superintelligence. As I write:

First, everything hangs on our ability to solve the control problem and create a friendly superintelligence capable of wise governance. This challenge is formidable enough given that many AI experts anticipate a human-level AI within this century—meaning that there appears to be a deadline—but the trends outlined in Figure 2 [just below] open up the possibility that we may have even less time to figure out what our “human values” are and how they can be encoded in “the AI’s programming language, and ultimately in primitives such as mathematical operators and addresses pointing to the contents of individual memory registers” (Bostrom 2014). Thus, the present paper offers a novel reason for allocating large amounts of resources for projects that focus on solving the control problem: not only will continued progress in computer science make the control problem probably unavoidable, but the convergence of state and nonstate power could require new forms of global governance—namely, a friendly supersingleton—within the coming decades.

This being said, here are some differences between the two papers:

(i) Bostrom also emphasizes phenomena like non-omnicidal actors who are locked in situations that, for structural reasons, lead them to pursue actions that cause disaster.

(ii) Whereas Bostrom focuses on the possibility of extracting a “black ball”—i.e., a “technology that invariably or by default destroys the civilization that invents it”—out of the tubular urn of innovation, I am essentially suggesting that we may have already extracted such a ball. For example, synthetic biology (more specifically: de-skilling plus the Internet plus lowering costs of lab equipment) will almost certainly place unprecedented destructive power within arm’s reach of a large number of terrorist groups or even single individuals. As I have elsewhere demonstrated—in two articles on agential risks—there are plenty of violent individuals with omnicidal inclinations looming in the shadows of society! This being said, some pretty simple calculations reveal that the probability of a successful global-scale attack by any one group or individual need be only negligible for annihilation to be more or less guaranteed over the course of decades or centuries, due to the fact that probability accumulates across space and time. Both John Sotos (in an article titled “Biotechnology and the Lifetime of Technical Civilizations”) and I (in an article titled “Facing Disaster: The Great Challenges Framework,” although I had similar calculations in my 2017 book) have crunched the numbers to show that, quoting from “Facing Disaster”:

For the sake of illustration, let’s posit that there are 1,000 terror agents in a population of 10 billion and that the probability per decade of any one of these individuals gaining access to world-destroying weapons … is only 1 percent. What overall level of existential risk would this expose the entire population to? It turns out that, given these assumptions, the probability of a doomsday attack per decade would be a staggering 99.995 percent. One gets the same result if the number of terror agents is 10,000 and the probability of access is 0.1 percent, or if the number is 10 million and the probability is 0.000001. Now consider that the probability of access may become far greater than 0.000001—or even 1—percent, given the trend of [the radical democratization of science and technology], and that the number of terror agents could exceed 10 million, which is a mere 0.1 percent of 10 billion. It appears that an existential strike could be more or less inescapable.
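The quoted arithmetic is easy to check, under the article's assumption that attempts are independent: with n would-be attackers each succeeding with per-decade probability p, the chance of at least one success per decade is 1 − (1 − p)ⁿ. A minimal sketch:

```python
# Reproduces the back-of-the-envelope numbers quoted above
# (independence of attempts assumed, as in the original calculation).
def p_catastrophe(n_agents: int, p_access: float) -> float:
    """Probability that at least one of n independent agents succeeds."""
    return 1 - (1 - p_access) ** n_agents

for n, p in [(1_000, 0.01), (10_000, 0.001), (10_000_000, 0.000001)]:
    print(f"{n:>10,} agents, p = {p}: {p_catastrophe(n, p):.4%}")
```

All three parameter settings give a per-decade probability above 99.99%, matching the figure quoted in the excerpt.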

This suggests that a scenario as extreme as the “easy nuke” one that Bostrom outlines in his paper need not be the case for the conclusion of the VWH to obtain. Recall that this hypothesis states:

If technological development continues then a set of capabilities will at some point be attained that make the devastation of civilization extremely likely, unless civilization sufficiently exits the semi-anarchic default condition.

My sense is, again, that these capabilities are already with us today—that is, they are emerging (rapidly) from the exponentially growing field of synthetic biology, although it’s possible that they could arise from atomically-precise manufacturing as well. Indeed, this is a premise of my argument in “Superintelligence and the Future of Governance,” which I refer to as “The Threat of Universal Unilateralism.” Indeed, I summarize my argument in section 1 of the paper as follows:

(i) The Threat of Universal Unilateralism: Emerging technologies are enabling a rapidly growing number of nonstate actors to unilaterally inflict unprecedented harm on the global village; this trend of mass empowerment is significantly increasing the probability of an existential catastrophe—and could even constitute a Great Filter (Sotos 2017).

(ii) The Preemption Principle: If we wish to obviate an existential catastrophe, then societies will need a way to preemptively avert not just most but all possible attacks with existential consequences, since the consequences of an existential catastrophe are by definition irreversible.

(iii) The Need for a Singleton: The most effective way to preemptively avert attacks is through some regime of mass surveillance that enables governing bodies to monitor the actions, and perhaps even the brain states, of citizens; ultimately, this will require the formation of a singleton.

(iv) The Threat of State Dissolution: The trend of (i) will severely undercut the capacity of governing bodies to effectively monitor their citizens, because the capacity of states to provide security depends upon a sufficiently large “power differential” between themselves and their citizens.

(v) The Limits of Security: If states are unable to effectively monitor their citizens, they will be unable to neutralize the threat posed by (i), thus resulting in a high probability of an existential catastrophe.

There seem, at least to my eyes, to be some significant overlaps here with Bostrom's (fascinating) paper, which suggests a possible convergence of scholars on a single conception of humanity's (near-)future predicament on spaceship Earth. Perhaps the main difference is, once more, that Bostrom’s thesis hinges upon the notion of a not-yet-realized-but-maybe-possible “black ball” technology, whereas the message of my analysis is far more urgent: “Weapons of total destruction” (WTDs) already exist, are in their adolescence (so to speak), and will likely mature in the coming years and decades. Put differently, humanity will need to escape the semi-anarchic default condition in which we currently reside quite soon, or else face almost certain annihilation. My solution is algocratic, with everything depending upon success on the control problem.

This post is, admittedly, quite hastily written. Please don’t hesitate to let me know if aspects of it are opaque and thus require clarification. I would also—as always—welcome comments of any sort!!

Discuss

### A culture of exploitation?

25 November, 2018 - 01:00
Published on Sat Nov 24 2018 22:00:09 GMT+0000 (UTC)

Thoughts in the car:

I just had a discussion with a coworker who bemoans the lack of respect and responsibility of the newer generations. While I do think these concerns are typically overstated due to availability bias, there could be some truth to them.

I hypothesize that a significant contributing factor is our ever-expanding population; our increasing ability to discard one social group and easily form a new one. One of our primary motivations for being kind to others is the hope that others will be kind to us. (see the thought experiment 'Prisoner's Dilemma') Yes, most of us have empathy as well, but this can vary by degree and situation; the most reliable motivations are usually the selfish ones. If we are able to exploit one social group, then move on to a new group, thereby avoiding the consequences of our actions, this encourages exploitation ("exploitation" is defined here as reaping the benefits of a social group while not respecting others or taking responsibility for your part in the group). Side note: A lack of consequences for our actions is also an explanation for why internet communications can so easily turn toxic.
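The reciprocity point can be made concrete with a toy iterated Prisoner's Dilemma: against a partner who reciprocates, defection wins a one-shot encounter (you can move on before consequences arrive), but loses over repeated play. The payoff values and strategies below are the standard textbook ones, used purely for illustration.

```python
# Toy model (illustrative, not from the post): defection pays when you can
# leave the group; cooperation pays when you must face the same partner again.
T, R, P, S = 5, 3, 1, 0   # temptation, mutual reward, punishment, sucker

def total_payoff(my_move: str, rounds: int) -> int:
    """My total payoff playing `my_move` every round against tit-for-tat."""
    payoff, partner = 0, "C"            # tit-for-tat opens with cooperation
    for _ in range(rounds):
        if my_move == "C":
            payoff += R if partner == "C" else S
        else:
            payoff += T if partner == "C" else P
        partner = my_move               # partner copies my last move
    return payoff

print(total_payoff("D", 1), total_payoff("C", 1))    # 5 3  (one-shot: defect wins)
print(total_payoff("D", 10), total_payoff("C", 10))  # 14 30 (repeated: cooperate wins)
```

The "ever-expanding pool of social groups" in the hypothesis above effectively converts repeated games back into one-shot games, which is exactly the regime where exploitation is the winning move.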

Before we can falsify this, we would first have to establish quantifiable data to represent our culture of exploitation. Then we could look for a similar environment (a large choice of social groups) where such a culture of exploitation (among relatively equal members) doesn't occur. Or we could find a culture where there ISN'T a large choice of social groups, yet where exploitation (among relatively equal members) occurs regularly. Finally, we could identify a different factor that better explains the data.

Discuss

### Fixed Point Discussion

24 November, 2018 - 23:53
Published on Sat Nov 24 2018 20:53:39 GMT+0000 (UTC)

Warning: This post contains some important spoilers for Topological Fixed Point Exercises, Diagonalization Fixed Point Exercises, and Iteration Fixed Point Exercises. If you plan to even try the exercises, reading this post will significantly reduce the value you can get from doing them.

Core Ideas

Fixed point theorems come in three flavors: Topological, Diagonal, and Iterative. (I sometimes refer to them by central examples as Brouwer, Lawvere, and Banach, respectively.)

Topological fixed points are non-constructive. If f is continuous, f(0)>0, and f(1)<1, then we know f must have some fixed point between 0 and 1, since f(x) must somewhere transition from being greater than x to being less than x. This does not tell us where the transition happens. This can be especially troublesome when there are multiple fixed points, and there is no principled way to choose between them.
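A minimal illustration (my example): for a continuous f with f(0) > 0 and f(1) < 1, bisection on g(x) = f(x) − x locates a fixed point. Note this works only because we can evaluate f and exploit a sign change in one dimension; the theorem itself asserts existence without providing such a procedure in general.

```python
# Locate a fixed point guaranteed by the intermediate value theorem,
# via bisection on g(x) = f(x) - x (one-dimensional case only).
import math

def fixed_point_bisect(f, lo=0.0, hi=1.0, tol=1e-9):
    g = lambda x: f(x) - x          # g(lo) > 0 and g(hi) < 0 bracket a root
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x = fixed_point_bisect(math.cos)    # cos(0) = 1 > 0, cos(1) ≈ 0.54 < 1
print(round(x, 6))                  # 0.739085, the unique fixed point of cos
```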

Diagonal fixed points are constructed with a weird trick where you feed a code for a function into that function itself. Given a function f, if you can construct a function g, which on input x, interprets x as a function, runs x on itself, and then runs f on the result (i.e. g(x):=f(x(x)).), then g(g) is a fixed point of f because g(g)=f(g(g)). This is not just an example; everything in the cluster looks like this. It is a weird trick, but it is actually very important.
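The same g(x) := f(x(x)) trick can be run in Python (my sketch: this is the classic fixed-point-combinator construction, with a thunk added because Python evaluates eagerly, and "fixed point" here means equal as functions):

```python
# Sketch of the diagonal construction: g(g) is a fixed point of f because
# g(g) = f(g(g)). The thunk delays the self-application so that eager
# evaluation doesn't loop forever (this is the Z combinator).
def fix(f):
    g = lambda x: f(lambda *args: x(x)(*args))
    return g(g)

# Example: factorial as the fixed point of a one-step "unroller".
fact = fix(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))
print(fact(5))  # 120
```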

Iterative fixed points can be found through iteration. For example, if f(x)=−x/2, then starting with any x value, iterating f forever will converge to the unique fixed point x=0.

(There is a fourth cluster in number theory discussed here, but I am leaving it out, since it does not seem relevant to AI, and because I am not sure whether to put it by itself or to tack it onto the topological cluster.)

Topological Fixed Points

Examples of topological fixed point theorems include Sperner's lemma, the Brouwer fixed point theorem, the Kakutani fixed point theorem, the intermediate value theorem, and the Poincaré-Miranda theorem.

• Sperner's Lemma is a discrete analogue which is used in one proof of Brouwer.
• Kakutani is a strengthening of Brouwer to degenerate set valued functions, that look almost like continuous functions.
• Poincaré-Miranda is an alternate formulation of Brouwer, which is about finding zeros rather than fixed points.
• The Intermediate Value Theorem is a special case of Poincaré-Miranda. To a first approximation, you can think of all of these theorems as one big theorem.

Topological fixed point theorems also have some very large applications. The Kakutani fixed point theorem is used in game theory to show that Nash equilibria exist, and to show that markets have equilibrium prices! Sperner's lemma is also used in some envy-free fair division results. Brouwer is also used to show the existence of solutions to some differential equations.

In MIRI's agent foundations work, Kakutani is used to construct probabilistic truth predicates and reflective oracles, and Brouwer is used to construct logical inductors.

These applications all use topological fixed points very directly, and so carry with them most of the philosophical baggage of topological fixed points. For example, while Nash equilibria exist, they are not unique, are computationally hard to find, and feel non-constructive and arbitrary.

Diagonal Fixed Points

This pattern is used in many places.

In MIRI's agent foundations work, this shows up in the Löbian obstacle to self-trust, Löbian handshakes in Modal Combat and Bounded Open Source Prisoner's Dilemma, as well as providing a basic foundation for why an agent reasoning about itself might make sense at all through Quines.

Iterative Fixed Points

Iterative fixed point theorems are less of a single cluster than the others, so I will factor them into two sub-clusters, centered around the Banach fixed point theorem and the Tarski fixed point theorem (each roughly the same size as the original).

The Tarski cluster is about fixed points of monotone functions on (partially) ordered sets, found by iteration. Tarski's fixed point theorem states that any order-preserving function on a complete lattice has a fixed point (and further, the set of fixed points forms a complete lattice). The least fixed point can be found by starting with the least element and iterating the function transfinitely. This, for example, implies that every monotone function from [0,1] to itself has a fixed point, even if it is not continuous. Kleene's fixed point theorem strengthens the assumptions of Tarski by adding a form of continuity (and also removes some irrelevant assumptions), which gives us that the least fixed point can be found by iterating the function only ω times. The fixed point lemma for normal functions is similar to Kleene, but with ordinals rather than partial orders. It states that any strictly increasing continuous function on ordinals has arbitrarily large fixed points.
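A small illustration of this style of iteration (my example, not from the post): reachability in a finite graph is the least fixed point of a monotone operator on the lattice of node sets, reached after finitely many steps from the bottom element ∅.

```python
# Least fixed point of a monotone operator on a finite lattice, found by
# iterating from the bottom element until nothing changes.
def least_fixed_point(F, bottom):
    current = bottom
    while (nxt := F(current)) != current:
        current = nxt
    return current

edges = {0: [1], 1: [2], 2: [2], 3: [0]}        # toy graph; node 3 is unreachable
# F(S) = {start node} ∪ successors(S); monotone in S.
F = lambda S: frozenset({0}) | frozenset(v for u in S for v in edges[u])
print(sorted(least_fixed_point(F, frozenset())))  # [0, 1, 2]
```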

The Banach cluster is about fixed points of contractive functions on metric spaces, found by iteration. A contractive function is a function that sends points closer together: a function f is contractive if there exists an ϵ>0 such that for all x≠y, d(f(x),f(y)) ≤ (1−ϵ)d(x,y). Banach's fixed point theorem states that any contractive function has a unique fixed point. This fixed point is the limit of fⁿ(x) as n→∞, for any starting point x. Applying this to linear functions shows that any ergodic stationary Markov chain has a stationary distribution (a fixed point of the transition map), which is converged to via iteration. This is also used in showing that correlated equilibria exist and can be found quickly. Banach can also be used to show that gradient descent converges exponentially quickly on a strongly convex function.
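Banach iteration is easy to demonstrate numerically; a minimal sketch with my own example function:

```python
# f(x) = x/2 + 1 is contractive with factor 1/2, so iterating from any
# starting point converges to its unique fixed point x = 2, halving the
# error each step.
def iterate_to_fixed_point(f, x, steps=60):
    for _ in range(steps):
        x = f(x)
    return x

f = lambda x: x / 2 + 1
print(iterate_to_fixed_point(f, 100.0))   # 2.0
print(iterate_to_fixed_point(f, -7.0))    # 2.0
```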

Interdisciplinary Nature

I think of Pure Mathematics as divided at the top into 5 subfields: Algebra, Analysis, Topology, Logic, and Combinatorics.

The mapping of the key fixed point theorems discussed in the exercises into these categories is surjective:

• Lawvere's fixed point theorem is Algebra
• Banach's fixed point theorem is Analysis
• Brouwer fixed point theorem is Topology
• Gödel's first incompleteness theorem is Logic
• Sperner's lemma is Combinatorics.

On top of that, major applications of fixed point theorems show up in Differential Equations, CS theory, Machine Learning, Game Theory, and Economics.

Tomorrow's AI Alignment Forum sequences post will be two short posts 'Approval-directed bootstrapping' and 'Humans consulting HCH' by Paul Christiano in the sequence Iterated Amplification.

The next posting in this sequence will be four posts of Agent Foundations research that use fixed point theorems, on Wednesday 28th November. These will be re-posts of content from the now-defunct Agent Foundations forum, all of whose content is now findable on the AI Alignment Forum (and all old links will soon be re-directed to the AI Alignment Forum).

Discuss

### Four factors which moderate the intensity of emotions

24 November, 2018 - 23:40
Published on Sat Nov 24 2018 20:40:12 GMT+0000 (UTC)

If you’re short on time or just skimming, I suggest skipping to the description of Closeness-of-the-counterfactual. It’s the factor I think is least recognized and the one I most wanted to write about.

What makes some emotions stronger than others? Why are some the faintest whispers, easily missed and others roaring, crashing storms which threaten to consume us?

The obvious answer is that emotions vary in intensity in proportion to the magnitude of what they’re about. Things which are a little bit good or a little bit bad evoke weak pleasant or aversive feelings, while things which are amazingly good or terribly bad provoke strong feeling. However, I assert that this magnitude is only one factor among several and is insufficient on its own to explain what causes strong or weak emotions.

In this post, I list the factors which are salient to me: magnitude plus three others. I do not think that they are especially surprising or profound, but I claim that paying attention to them allows us to better appreciate the mechanistic and lawful operation of emotions. This appreciation is practical in that it lets us better recognize and remedy common emotional pathologies. [See Section 2: Problematic manipulations of the factors].

Section 1: The Factors
• Magnitude [of the stimulus]
• Attention
• Closeness-of-the-counterfactual
• Actionability

Magnitude [of the stimulus]

This factor is the most obvious and least profound of all of the factors. However, it is illuminating to note just how insufficient it is to drive an emotion in the absence of the other factors.

Taking it as an assumption that all emotions are about something in the world [1], the strength of the emotion generally scales with the magnitude of the “goodness” or “badness” which caused it. One feels stronger grief when their house burns down than when they drop their cookie in the dirt. Making thousands from Bitcoin feels better than finding a twenty dollar bill on the street.

Attention

Obvious and yet still underappreciated.

Any given person is aware of thousands of situations, circumstances, and facts that could evoke just about any feeling. Meditate on your good fortune and you might feel happiness; think about those starving and diseased and you’ll feel sad; remember that unfair thing your teacher did in third grade and you’ll feel mad, and so on.

Emotions are usually about reality, but the emotions we experience are not about all the realities of which we are knowledgeable. No human could ever contain that much emotion. Instead, our emotions tend to be about whatever happens to be in our attention (broadly defined).

It’s like humans have this “attention slot” where you can put something, i.e. think about it, and then that’s what you’ll have emotions about. [I am using attention in a broad and loose sense here. What I’m gesturing at isn’t fully under one’s control and extends beyond conscious awareness. Think how strong grief can stay present somewhere in your mind even while you try to do other things.]

That we have emotions only about things in the “attention slot” solves the problem of humans not being able to simultaneously have emotions about all the realities of which they are aware (or could imagine), but more importantly it serves the adaptive purpose of emotions. Emotions are meant to guide behavior. It makes sense that emotions should be driven by the immediate, contextual things we’re currently dealing with, i.e. those things we’re paying attention to. It’s not valuable to be feeling happy about something good which happened a month ago if right now you’re in a bad situation you should get out of. Your emotions will be about the current situation - assuming, that is, that you’re paying attention to your present.

Even if you’re ruminating about your past, it should be [2] so that you can learn from mistakes and successes in a healthy way so as to succeed going forward. If you are fantasizing about possible futures, it should be to motivate you to work towards them.

That attention moderates emotion gives rise to a number of common observations:

• We’re less upset by things over time; they’re no longer events we’re paying attention to and they’re less relevant.
• People attempt to feel better by distracting themselves from upsetting circumstances [see below Section 2: Problematic manipulation of these factors].
• Gratitude journaling makes people happier.
• We can evoke emotions just by thinking about things, i.e. placing our attention on them, regardless of whether they’re real or imaginary, good or bad.

Closeness-of-the-counterfactual

Despite its supreme importance, this is the most under-recognized moderator of emotion intensity. As far as I’m aware (though I haven’t scoured the psych literature), there isn’t a common name for it.

For the most part, it is a lot more frustrating to have missed your flight by five minutes than to miss it by five hours. It is a lot more frustrating to miss your flight when it seems you could have changed one or two small actions to have made it than when success was simply out of your control. For example, it is more frustrating to miss the flight if the cause was that you spent too long on Facebook than if your car was improbably stolen right out of your garage.

It is more disappointing to not get a job when you thought you would, and more exhilarating to win a competition if you weren’t sure you’d win.

In general, emotions which relate to a counterfactual (i.e., how things might have been different, i.e., pretty much all emotions) scale in intensity in proportion to how easy it is to imagine [3] the counterfactual having been true. It’s easier to imagine having made your flight when it would have taken a small decision on your part to make the difference, and somewhat harder if something out of your control would have needed to happen differently or it would have taken an improbably large amount of effort on your part.

I call the “how easy it is to imagine the counterfactual” property closeness-of-the-counterfactual, or counterfactual-closeness for short.

As usual, counterfactual-closeness as a moderator of emotion intensity makes sense if emotions are supposed to be adaptive. It’s more adaptive to have strong emotions about counterfactual realities you could have nearly reached - if only you’d done a few things differently - than about realities, no matter how pleasant, that never seemed in reach. It just doesn’t get me anywhere to be dreadfully sad all the time that I wasn’t born able to fly.

It’s a simple principle, yet is consistent with many observations beyond the above.

• We are more upset by things that people around us have than things no one has. I am sad that I don’t have a swimming pool when my neighbors do and I could technically afford it, but not sad that I don’t have a spaceship because that just seems unrealistic - unless I’m Musk or Bezos, I have no reasonable expectation that I would have one.
• Unexpected good fortune feels a lot more rewarding than expected good fortune.
• Sometimes people try to feel better about a failure by saying “it was always hopeless.” In effect, they are trying to create counterfactual-distance so there is less pain.
• This is related to a “sour grapes” response.
• It is consistent with Eliezer’s observation that most people can’t find motivation to do things they think are less than 70% likely to succeed. Perhaps you need to assign 70% probability of success to have enough counterfactual-closeness to evoke emotion.
• Vivid descriptions of things, images, and videos cause us to have stronger emotions. The representations make them easier to imagine (and also help load them into attention).

Probably related: Book Review: Surfing Uncertainty

Actionability

Continuing the theme that emotions ought to be adaptive, it makes a lot more sense to have emotions about situations where you can do something than ones where you can’t, or at least about situations where you could have done something different.

Even in cases where an emotion might seem inert, the emotion itself is probably trying to affect the world.

However, I’m not as sure of my understanding of this moderator as of the others. It might actually bring about more of a qualitative difference in emotions than a quantitative one. The observation I’m drawing on here is that the emotions related to unresolved conflict with a colleague feel different from those attending grief about something which is done and dusted. In the former, there’s a “pulling” from the emotion as though it wants something, while in the latter the emotional tone is clear and pure, just signalling to my mind that something bad happened and I should do things differently in the future.

Section 2: Problematic manipulations of the factors

Humans are crafty creatures with awareness of their own minds and the ability to manipulate the inputs they feed into their own minds to game the system. In short, we have some ability to wirehead. Or even if we’re not wireheading, these factors are pieces of the system which can be vulnerable to specific attacks or their own failure modes.

Attention and counterfactual-closeness are clear examples.

Manipulating attention: distraction to avoid unpleasant emotions

It’s easy to see that many people distract themselves from unpleasant realities to escape unpleasant emotions. I think it’s underappreciated just how absurdly widespread the behavior is.

The pernicious part of avoidance-behaviors is that many behaviors used for distraction, i.e. almost anything pleasurable, could be done purely for the sake of pleasure and in many cases are perfectly healthy and good. It’s easy to claim that you’re eating cake simply because cake is tasty, unrelated to anything else. Yet people are often compulsively and habitually looking for something stimulating to keep attention off the painful [4].

Compulsion is likely the differentiator of whether pleasure-seeking behavior is driven by distraction and avoidance. Is a person reading a novel because they really feel like it and it’s a good time, or is there something it is helping them avoid, e.g. an assignment? The test I apply in this case is to ask whether a particular behavior optimizes my life as a whole or whether it’s just this experience in the moment being optimized. Over what timescale does this behavior improve my life? Manipulation of attention to avoid pain (wireheading) will often be to the detriment of one’s life overall - pleasant in the short run, worse in the long run.

Manipulating counterfactual-closeness

There is a temptation to artificially increase counterfactual-closeness to realities which are pleasant. Someone might cling to a dream of becoming an Olympic runner, even as evidence mounts against them. They focus solely on the positive signs and they repeatedly and obsessively enumerate the pathways through which it all might work out. They’ve distorted their view of the world to see the desired world as much closer to reality than it is, because it feels good. They even become attached to the fantasy they’ve constructed. To maintain it, they have to twist the evidence and twist their epistemics, i.e. a one-sided counting of all evidence in one direction while ignoring all contrary data. This behavior is common in romantic contexts too. I assert that, overall, people behaving this way would be better served by good epistemics and accurate assessments of counterfactual-closeness.

There is equally a motivation to decrease counterfactual-closeness, i.e., increase counterfactual-distance. Things are easier to accept when they seem necessary and unavoidable, so believing them so is a way to avoid pain. Most people do not appear to be pained that millions of people are dying, millions starving, millions diseased. They’re not distressed by their own imminent and assured death. Partly I think the pain is avoided by avoiding placing attention on these topics, but I also think there’s motivated cognition to believe that it is impossible to do anything. Merely believing that there is something which could be done, i.e. having greater counterfactual-closeness, means experiencing pain that the something hasn’t been done yet.

This explains why people are resistant when you try to tell them things they could do. Believing something could be done would require them to move counter to a hedonic gradient, out of a local optimum. Believing something could be done (but hasn’t been) hurts more than believing nothing could be done, even though the former is how you get the best state of all - where something has been done successfully.

Manipulating magnitude and actionability

I imagine that people will readily recognize the behavior of people protesting through tears that something is “no big deal” in an attempt to minimize their feelings - they are minimizing magnitude. And the behavior of insisting “nothing can be done” to quieten any nagging sense of responsibility, external or internal - they are minimizing actionability.

Endnotes

[1] Arguably, some emotions are not about anything, e.g. in emotion dysregulation disorders such as depression. My counter is that even if some emotions are detached from reality, that is a breakdown of a system whose design and purpose is to guide behavior within reality. Any “healthy” emotion will be “about” something.

[2] “Should” from the perspective of what emotions are “designed for”, namely that they are trying to drive adaptive action.

[3] The relevant kind of “imagination” here is an S1, instinctive, gut-level expectation about what will happen or could have happened. It’s more than an S2, abstract picturing of a scenario in your mind.

[4] When I say “painful”, I mean anything at all slightly aversive. If I’m shy and dislike phone calls, I might put off calling the bank about the mistaken charge for weeks to avoid my slight discomfort. We humans are sensitive to even the gentlest hedonic gradients.

Discuss

### deluks917 on Online Weirdos

November 24, 2018 - 20:03
Published on Sat Nov 24 2018 17:03:20 GMT+0000 (UTC)

Cross-posted from Putanumonit.

This is part 1 of an interview with deluks917. Stay tuned for part 2 in which we talk about Buddhism and peanut butter, and part 3 which covers archetypes, superego poisoning, and Moana.

Since I have a face for radio and a voice for blogging, our conversation is transcribed for your pleasure and edited for readability.  This is my second transcribed interview; suggestions for other people to chat with are welcome!

Jacob: I met you through the New York rationalist meetup, which now often takes place in your apartment. How did you even become part of this world?

deluks917: I had sporadically attended the New York meetups in the past, and then I was away in Pennsylvania for three years. I had been reading SlateStarCodex and commented under a couple of different names there. Later I made the SlateStarCodex Discord, and it took off around the time I came back to New York. The stars aligned. There was a room in a rationalist house that needed to be filled and no one else was going to take it.

For me, rationality filled a hole that had always existed in my life. Being nerdy and a systematizing thinker I was always reading psychology books and the like, but nothing seemed quite right until I read the Sequences.

I tried reading the Sequences a couple of times, starting seriously about eight years ago. I found them pretty confusing. But then I came back later and they made more sense. It took some time for things to marinate and for my mood to change.

I have always been part of nerdy communities, either online or IRL. I feel like my intellectual interests change pretty regularly and I happened to come back to rationality at the right time. By 2013 or so ideas about AI suddenly seemed more profound. I don’t know if I was wise or foolish not to be convinced about AI risk in 2010, maybe I was just too young. But by 2013 I was pretty convinced.

You created the SSC Discord and you do a lot of moderation of online communities. What does that involve day to day? And why do you do it?

When I first started it, it took a lot of my time. I wanted to get things on the right track. It has progressively taken less and less time. One thing that still takes time is the rationality feed: a daily digest of recommended rationality articles. In addition to vetting the articles, I include a short review or summary quote. Originally I started the feed to encourage people to check the SSC Discord. This takes a lot of time, even though I do get something out of reading the articles themselves.

So what do you get out of the Discord? I tried it a couple of times, but I have several rationalist friends in New York I can talk to in person, and I can read articles. I don’t feel like spending a lot of time just chatting with people online. You seem to find a lot more in it than I do.

I like talking to people in real life, but the Discord is also a community. People spend a lot of time there; they find friends. There are certain channels in the SSC Discord that don’t hold my attention at all but other people find valuable. For example, the “relationships” channel which is very friendly but not very interesting to me.

People like talking to people online. I have online friends that I talk about all sorts of things with. People want to share ideas and get feedback quickly.

What percent of your communication with friends happens online?

75%, maybe.

There’s not a ton that’s systematically different. A lot of it has to do with the medium. If you’re a relatively introverted person, the experience of talking to someone online and IRL is completely different. I think of myself as a relatively introverted person. If I’m talking to you online I can play any music I want, at any moment I can go take a walk for a few minutes and there’s no expectation that I’ll respond to your PM immediately. I can be watching YouTube videos or reading SSC and tabbing over. So for someone who’s introverted, it doesn’t feel as mentally draining.

Also if you’re someone who requires conscious effort to have good body language and eye contact, none of this applies online. That’s also freeing. So those are the biggest differences. And perhaps you just don’t know many people you want to talk to IRL.

I like to think of myself as someone with a high tolerance for weirdos. I try to maintain diversity in my social circles. I occasionally hang out with anti-Semites, I’m friends with extreme rationalists who only want to talk about AI and also people who hate rationalists and think that we’re some weird cult.

A sex cult.

A sex cult without even that much sex! Or maybe the sex is all in Berkeley.

So I look for weirdos, but I feel like I’m not doing a good enough job of it. There are still strong filters that seem to apply to everyone I hang out with, filters on things like IQ and interests. I feel like you’re doing a much better job. The diversity of minds of the people you interact with, and their backgrounds, seems huge. Do you cultivate this on purpose? How do you know so many weirdos?

This is hard to answer while respecting privacy, but I do know some really weird people.

One answer is that spaces that attract a lot of weirdos will push away “normies”, or less weird people. This is a fundamental problem for moderators to tackle.

There’s a Discord I spent a lot of time in that started out weird, but not super weird. It was based around a certain video game. Over time it got weirder and weirder, to the point where it became pretty out there.

People were posting about crazy politics. A person who posted there a decent amount used to identify as an incel and still identifies as a violent communist revolutionary. People openly chatted about sexual fetishes. They didn’t post porn but they openly talked about their sexual fetishes. There are a lot of running jokes, like #KillAllBoomers. That kind of thing.

I’m surprised that’s not a mainstream meme yet, #KillAllBoomers.

People started saying that because of some NIMBY politics that Boomers are blamed for. But if you want to attract less weird people, you don’t just casually post #KillAllBoomers.

Many spaces where weird people aren’t filtered out have these sorts of memes. It explains the affinity of some very weird people for very crude right-wing politics. For example, open support for Trump and MAGA hats posted by people who don’t seem like normal Trump supporters - demographically, they are very far from the median Trump voter.

You think they just hate normies and all the normies they know are mainstream Democrats?

No. I just think less weird people won’t tolerate a chat that contains pictures of anime girls wearing MAGA hats. Right-wing politics claim that space in a sort of memetic battle.

In mainstream, high-IQ intellectual culture, far-right politics are very taboo. Really weird people are less likely to feel revulsion at anime girls wearing MAGA hats, even if they are on the left. If the chat were authentically right-wing in the sense of Breitbart media then they would leave. But they don’t really care about the memes.

We live among educated, middle-class New Yorkers, so we’re deep in the blue tribe. If you’re conservative or libertarian, or if you’re aligned with Trump on some esoteric issue that’s orthogonal to politics, you need to escape online to talk about it. I wonder if there’s a nerdy kid in, say, Alabama, who needs a safe online space to talk about socialism.

I would guess that the Chapo Trap House community is probably pretty weird. Tankie communities of communist apologists are also very weird.

I don’t mean to say that the only spaces where you see really weird stuff are right-wing. There are online spaces of rationalists that have sexually explicit channels, where people post naked pictures of themselves. That norm is not going to attract low openness-to-experience people. I think you see the same effects in parts of Berkeley where there’s sort of an implicit acceptance of BDSM. I think that the Berkeley community is certainly weirder than the New York one.

If you’re going 85 on the highway when there are people going 100 you’re not that afraid of getting a speeding ticket.

All these communities probably share some things in common, they’re all nerdy, they’re into games and computers. Do you feel like it’s one tribe that has something in common, even if it’s made up of Trump supporters talking to tankies on an anime Discord?

I wouldn’t say it’s the defining focus, but one thing I really appreciate about this group of people is what I refer to as “the will to think”. The will to figure things out yourself.

I contrast this to taking the outside view, or trusting in experts. Though explicit outside view thinking can also get weird. In these communities you see a lot of people who are willing to just come to their own conclusions. I find this deeply attractive.

Scott Alexander wrote a story sort of about this. There’s an earring you put on and it gives you amazing advice. The first thing it says is “it’s better for you if you take me off”, and after that, it never warns you again. Eventually, this earring is always telling you what to do, and it’s always giving you great advice – not perfect but better than what you would have come up with. And by the end, the earring is fully interlaced with your nervous system. Your brain rots because you don’t need it anymore.

Let’s say you and I are having a conversation about the ongoing world chess championship, even if we’re not betting any money. We could talk about chess and the two players and it would be very intellectually stimulating. Or, we could just try to find fair betting odds and say that the market has already integrated anything we could talk about. Doing the latter spiritually feels like putting on the earring.

You feel like the earring represents expert knowledge and the efficient market? I would have thought that the usual earring is social pressure. Just doing what is the median course of action among your friend group and the celebrities you follow on Twitter, the five people you spend the most time with. Outsourcing decisions and opinions to your social network.

That’s one sort of earring, but something like the efficient market is my earring. I’m not saying that most people are looking up betting odds for everything. But a lot of people are saying to just trust the experts.

Here is a different example. Say you are looking for a TV show to watch. People often ask for recommendations. Robin Hanson would ask: why are they doing that? Major genres, like anime or western prestige TV, have rating sites. You can just take the highest-rated shows, or you can try to figure out how to do a regression on these ratings and your preferences. That seems like it’s more effective than taking recommendations.

When I tell people I’m going to Paris and they tell me I just have to check out this or that restaurant, I just ignore it. We have Yelp now, and it consistently recommends good restaurants for me.

Indeed it does, and I agree with you about restaurants. But doing that for intellectual topics would feel like jabbing a dagger into my own soul. Something would be lost that’s not coming back.

In the online communities, what percent of people are non-American? I instinctively assume that everyone I meet online in rationalist spaces is an American male in his thirties even though that’s a stupid assumption.

In the SSC Discord you see many people from around the world, and we have a reasonable number of women. There’s probably a decent number of women who have reasons not to state that they’re women online. Even if a space is sufficiently non-sexist, being a woman draws a lot of attention. You don’t necessarily want to be marked by gender when chatting about rationality.

There are definitely more women than is obvious in spaces that are nerdy.

Almost every real-life space that I’m in is also, on some level, a mating market. It’s very hard to avoid. And some communities that I joined turned out to be nothing besides a mating market.

When I was in Chapel Hill I went to Chabad, where they ask you to join them for Jewish prayer and then you get a Friday night meal. I’m an atheist, and in Israel I avoid the Chabad people who chase me with a tefillin on Fridays, but I won’t say no to a free meal! It turned out that Chabad is 95% a dating service for Jews, 5% food and prayer. I showed up with a Chinese girlfriend to a Chabad event and we got the stink-eye.

I feel like in social contexts some chunk of my mind is always preoccupied with executing sexual strategies, and this is worrying. I don’t want to be doing that. How often do you notice this happening in online spaces? Is this another reason to prefer online to IRL?

Even in those spaces, there is some level of sexuality. If there are people who are looking to date, even if they live across oceans – we invented planes, so there’s always a possibility. Banishing sexual market dynamics completely is very challenging.

But yes, it’s a much lower percentage. When I’m chatting with people on the Discord I don’t consciously feel like it’s a mating market for me.

Does it bother you how much processing power everybody seems to be spending on mating strategies in in-person communities?

Yes. I have a ‘galaxy brain’ hammer I like to spam on these sort of problems.

Say you have a parameter that often increases more than you want. Perhaps the elephant is pushing that parameter up. My strategy as the rider is to push it down. So at normal rationalist events I try to suppress elephant-approved status or sexual strategies. At parties, I think that sexier norms are ok.

My theory is that if the elephant will push me to do something more than I want, the rider has to push back. This is also true for material standards of living. The elephant wants a porcelain, elephant-sized bathtub. You as the rider have to go the other way.

I just wrote a post about mandatory obsessions, about the value of not having the same priorities as everyone around you. You certainly have priorities that are far from mainstream even in the rationality community, in politics for example. What are some of them?

People in other countries. There are all sorts of policies that Americans don’t seem to care that much about, where the body count is high. If the body counts were American, even if they were a third as high, people would not defend the same policies.

So right now we have Yemen, the Uyghurs in China, trouble in the Central African Republic. We just had to suffer months of midterm coverage, and these were basically never mentioned.

The Uyghurs are a great example. As far as I can tell, the majority of true racist concentration camp style bullshit seems to be going on in China, leaving North Korea aside. And people just don’t seem to care. This is truly shocking to me.

It’s funny that we’re talking about a million Uyghurs in China. 250 years ago Adam Smith wrote in The Theory of Moral Sentiments that if someone finds out that a million Chinese perished in an earthquake they would say “Oh! That’s terrible!” and not lose any sleep over it. But if their little finger is to be amputated tomorrow they wouldn’t be able to sleep at all.

So there could be a million people in China in concentration camps. But for the last 250 years at least, and probably for the entire history of our species, there’s just no way to get people to care about geographically remote suffering. Even Effective Altruists are remarkably uninterested in this.

The EA community certainly doesn’t seem overly interested. I would maybe say that preventing certain wars doesn’t seem all that tractable, at least using the obvious methods. But the EA community as a whole is involved in things that seem at least as intractable and lower impact.

There is interest in trying to influence US elections. This isn’t very tractable: there is already so much money in most elections, even House elections, that it’s hard to change the balance. In addition, it is unclear how much money in politics even helps. So trying to influence elections via money seems both intractable and not particularly high-impact.

The best steelman is that the US international order is pretty good, and you gotta break some eggs to make the US omelet. I think that’s people’s felt sense even if they wouldn’t articulate it that way.

Really? Take somebody who doesn’t spend one second of their day thinking about Yemen. If you tell them about it they’ll agree that it’s terrible. And then if you say that it’s actually fine because overall this US-led world order is pretty good they’ll ask what the fuck you’re talking about, it’s horrible!

I agree, it’s horrible!

It’s like the galaxy brain meme. Small brain is not knowing about Yemen, large brain is feeling bad for Yemen, and galaxy brain is “you’ve got to break some eggs”.

I’ve definitely run into the explicit version of the galaxy brain take. For some smart altruistic people maybe the rider is in the small brain but the elephant is galaxy brain. That’s my model anyway.

Discuss

### Approval-directed agents: "implementation" details

November 24, 2018 - 02:26
Published on Fri Nov 23 2018 23:26:08 GMT+0000 (UTC)

Follow-up to approval-directed agents: overview.

So far I’ve talked about approval-direction imprecisely. Maybe I’m talking about something incoherent, which has desirable properties only in the same sense as a four-sided triangle—vacuously. I won’t really be able to dispel this concern here, but I’ll at least take some steps.

How do you define approval?

Eventually you would have to actually write code implementing approval-directed behavior. What might that code look like? I want to set aside the problem “what does a sophisticated AI look like?” since I obviously don’t know. So let’s suppose we had some black box that did all of the hard work. I’ll consider a few cases for what the black box does, ranging from “easy to work with” to “very hard to work with.”

(Note: I now believe that we can target AI systems trained (nearly) end-to-end with gradient descent, which is most similar to “learning from examples.”)

Natural language

As an easy case, suppose we have a natural language question-answering system, which can assign a probability to any natural language sentence. In this case, we ask the question:

“Suppose that Hugh understood the current situation, was asked ‘on a scale from 0 to 1, how good is the action a?’ and was given a few hours to determine his answer. What would his answer be?”

We then loop over each action a and take the action with the highest expected answer.

In this framework, it is easy to replace Hugh by a more powerful overseer—all you have to do is specify the replacement in natural language.
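The loop described above can be sketched in code. This is a minimal, hypothetical illustration, not the post's actual proposal: `qa_expected_value` stands in for the assumed natural-language black box, and is a name I've invented for the sketch.

```python
# Hypothetical sketch of approval-directed action selection on top of a
# natural-language question-answering black box. `qa_expected_value` is an
# assumed interface: it returns the expected numeric answer (in [0, 1])
# to a natural-language question.

def approval_question(action_description: str) -> str:
    # Build the question the post proposes asking the QA system.
    return (
        "Suppose that Hugh understood the current situation, was asked "
        f"'on a scale from 0 to 1, how good is the action {action_description}?' "
        "and was given a few hours to determine his answer. "
        "What would his answer be?"
    )

def choose_action(actions, qa_expected_value):
    # Loop over candidate actions and take the one with the highest
    # expected approval, as judged by the QA system.
    return max(actions, key=lambda a: qa_expected_value(approval_question(a)))
```

Swapping in a more powerful overseer then really is just a matter of editing the string returned by `approval_question`.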

“Math intuition module”

At an opposite extreme, suppose we have a “math intuition module,” a system which can assign probabilities only to perfectly precise statements—perhaps of the form “algorithm A returns output y on input x.”

I’ve written about defining “approval upon reflection” algorithmically (see here, here). These definitions can be used to define approval-directed behavior completely precisely. I’m pretty hesitant about these definitions, but I do think it is promising that we can get traction even in such an extreme case.

In reality, I expect the situation to be somewhere in between the simple case of natural language and the hard case of mathematical rigor. Natural language is the case where we share all of our concepts with our machines, while mathematics is the case where we share only the most primitive concepts. In reality, I expect we will share some but not all of our concepts, with varying degrees of robustness. To the extent that approval-directed decisions are robust to imprecision, we can safely use some more complicated concepts, rather than trying to define what we care about in terms of logical primitives.

Learning from examples

In an even harder case, suppose we have a function learner which can take some labelled examples f(x) = y and then predict a new value f(x’). In this case we have to define “Hugh’s approval” directly via examples. I feel less comfortable with this case, but I’ll take a shot anyway.

In this case, our approval-directed agent Arthur maintains a probabilistic model over sequences observation[T] and approval[T](a). At each step T, Arthur selects the action a maximizing approval[T](a). Then the timer T is incremented, and Arthur records observation[T+1] from his sensors. Optionally, Hugh might specify a value approval[t](a') for any time t and any action a'. Then Arthur updates his models, and the process continues.

Like AIXI, if Arthur is clever enough he eventually learns that approval[T](a) refers to whatever Hugh will retroactively input. But unlike AIXI, Arthur will make no effort to manipulate these judgments. Instead he takes the action maximizing his expectation of approval[T] — i.e., his prediction about what Hugh will say in the future, if Hugh says anything at all. (This depends on his self-predictions, since what Hugh does in the future depends on what Arthur does now.)

At any rate, this is quite a lot better than AIXI, and it might turn out fine if you exercise appropriate caution. I wouldn’t want to use it in a high-stakes situation, but I think that it is a promising idea and that there are many natural directions for improvement. For example, we could provide further facts about approval (beyond example values), interpolating continuously between learning from examples and using an explicit definition of the approval function. More ambitiously, we could implement “approval-directed learning,” preventing it from learning complicated undesired concepts.

How should Hugh rate?

So far I’ve been very vague about what Hugh should actually do when rating an action. But the approval-directed behavior depends on how Hugh decides to administer approval. How should Hugh decide?

If Hugh expects action a to yield better consequences than action b, then he should give action a a higher rating than action b. In simple environments he can simply pick the best action, give it a rating of 1, and give the other options a rating of 0.

If Arthur is so much smarter than Hugh that he knows exactly what Hugh will say, then we might as well stop here. In this case, approval-direction amounts to Arthur doing exactly what Hugh instructs: “the minimum of Arthur’s capabilities and Hugh’s capabilities” is equal to “Hugh’s capabilities.”

But most of the time, Arthur won’t be able to tell exactly what Hugh will say. The numerical scale between 0 and 1 exists to accommodate Arthur’s uncertainty.

To illustrate the possible problems, suppose that Arthur is considering whether to drive across a bridge that may or may not collapse. Arthur thinks the bridge will collapse with 1% probability. But Arthur also thinks that Hugh knows for sure whether or not the bridge will collapse. If Hugh always assigned the optimal action a rating of 1 and every other action a rating of 0, then Arthur would take the action that was most likely to be optimal — driving across the bridge.

Hugh should have done one of two things:

• Give a bad rating for risky behavior. Hugh should give Arthur a high rating only if he drives across the bridge and knows that it is safe. In general, give a rating of 1 to the best action ex ante.
• Assign a very bad rating to incorrectly driving across the bridge, and only a small penalty for being too cautious. In general, give ratings that reflect the utilities of possible outcomes—to the extent you know them.

Probably Hugh should do both. This is easier if Hugh understands what Arthur is thinking and why, and what range of possibilities Arthur is considering.
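The bridge example can be made concrete by computing Arthur's expected ratings under the two schemes. The numbers below are illustrative (and scheme B uses an unrescaled utility in place of a bad rating, for clarity); this is a sketch of the argument, not a prescribed rating rule.

```python
# Expected ratings for the bridge example (numbers are illustrative).
p_collapse = 0.01

# Scheme A: Hugh rates the ex-post optimal action 1, everything else 0.
# Driving is optimal whenever the bridge holds (99% of the time).
expected_A = {
    "drive": (1 - p_collapse) * 1.0 + p_collapse * 0.0,  # 0.99
    "stay":  (1 - p_collapse) * 0.0 + p_collapse * 1.0,  # 0.01
}

# Scheme B: ratings reflect outcome utilities -- a collapse is catastrophic,
# while excess caution costs only a small penalty.
expected_B = {
    "drive": (1 - p_collapse) * 1.0 + p_collapse * (-100.0),
    "stay":  0.9,  # mild penalty for caution, regardless of outcome
}

best_A = max(expected_A, key=expected_A.get)  # Arthur gambles on the bridge
best_B = max(expected_B, key=expected_B.get)  # the 1% risk is now priced in
```

Under scheme A, Arthur drives because 0.99 beats 0.01; under scheme B, the small chance of a catastrophic rating makes driving worse in expectation than staying put.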

Other details

I am leaving out many other important details in the interest of brevity. For example:

• In order to make these evaluations Hugh might want to understand what Arthur is thinking and why. This might be accomplished by giving Hugh enough time and resources to understand Arthur’s thoughts; or by letting different instances of Hugh “communicate” to keep track of what is going on as Arthur’s thoughts evolve; or by ensuring that Arthur’s thoughts remains comprehensible to Hugh (perhaps by using approval-directed behavior at a lower level, and only approving of internal changes that can be rendered comprehensible).
• It is best if Hugh optimizes his ratings to ensure the system remains robust. For example, in high stakes settings, Hugh should sometimes make Arthur consult the real Hugh to decide how to proceed—even if Arthur correctly knows what Hugh wants. This ensures that Arthur will seek guidance when he incorrectly believes that he knows what Hugh wants.

…and so on. The details I have included should be considered illustrative at best. (I don’t want anyone to come away with a false sense of precision.)

Problems

It would be sloppy to end the post without a sampling of possible pitfalls. For the most part these problems have more severe analogs for goal-directed agents, but it’s still wise to keep them in mind when thinking about approval-directed agents in the context of AI safety.

My biggest concerns

I have three big concerns with approval-directed agents, which are my priorities for follow-up research:

• Is an approval-directed agent generally as useful as a goal-directed agent, or does this require the overseer to be (extremely) powerful? Based on the ideas in this post, I am cautiously optimistic.
• Can we actually define approval-directed agents by examples, or do they already need a shared vocabulary with their programmers? I am again cautiously optimistic.
• Is it realistic to build an intelligent approval-directed agent without introducing goal-directed behavior internally? I think this is probably the most important follow-up question. I would guess that the answer will be “it depends on how AI plays out,” but we can at least get insight by addressing the question in a variety of concrete scenarios.

Motivational changes for the overseer

“What would I say if I thought for a very long time?” might have a surprising answer. The very process of thinking harder, or of finding myself in a thought experiment, might alter my priorities. I may care less about the real world, or may become convinced that I am living in a simulation.

This is a particularly severe problem for my proposed implementation of indirect normativity, which involves a truly outlandish process of reflection. It’s still a possible problem for defining approval-direction, but I think it is much less severe.

“What I would say after a few hours,” is close enough to real life that I wouldn’t expect my thought process to diverge too far from reality, either in values or beliefs. Short time periods are much easier to predict, and give less time to explore completely unanticipated lines of thought. In practice, I suspect we can also define something like “what I would say after a few hours of sitting at my desk under completely normal conditions,” which looks particularly innocuous.

Over time we will build more powerful AIs with more powerful (and perhaps more exotic) overseers, but making these changes gradually is much easier than making them all at once: small changes are more predictable, and each successive change can be made with the help of increasingly powerful assistants.

Treacherous turn

If Hugh inadvertently specifies the wrong overseer, then the resulting agent might be motivated to deceive him. Any rational overseer will be motivated to approve of actions that look reasonable to Hugh. If they don’t, Hugh will notice the problem and fix the bug, and the original overseer will lose their influence over the world.

This doesn’t seem like a big deal—a failed attempt to specify “Hugh” probably won’t inadvertently specify a different Hugh-level intelligence, it will probably fail innocuously.

There are some possible exceptions, which mostly seem quite obscure but may be worth having in mind. The learning-from-examples protocol seems particularly likely to have problems. For example:

• Someone other than Hugh might be able to enter training data for approval[T](a). Depending on how Arthur is defined, these examples might influence Arthur’s behavior as soon as Arthur expects them to appear. In the most pathological case, these changes in Arthur’s behavior might have been the very reason that someone had the opportunity to enter fraudulent training data.
• Arthur could accept the motivated simulation argument, believing himself to be in a simulation at the whim of a simulator attempting to manipulate his behavior.
• The simplest explanation for Hugh’s judgments may be a simple program motivated to “mimic” the series approval[T] and observation[T] in order to influence Arthur.

Ignorance

An approval-directed agent may not be able to figure out what I approve of.

I’m skeptical that this is a serious problem. It falls under the range of predictive problems I’d expect a sophisticated AI to be good at. So it’s a standard objective for AI research, and AIs that can’t make such predictions probably have significantly sub-human ability to act in the world. Moreover, even a fairly weak reasoner can learn generalizations like “actions that lead to Hugh getting candy tend to be approved of” or “actions that take control away from Hugh tend to be disapproved of.”

If there is a problem, it doesn’t seem like a serious one. Straightforward misunderstandings will lead to an agent that is inert rather than actively malicious (see the “Fail gracefully” section). And deep misunderstandings can be avoided, by Hugh approving of the decision “consult Hugh.”

Conclusion

Making decisions by asking “what action would your owner most approve of?” may be more robust than asking “what outcome would your owner most approve of?” Choosing actions directly has limitations, but these might be overcome by a careful implementation.

More generally, the focus on achieving safe goal-directed behavior may have partially obscured the larger purpose of the AI safety community, which should be achieving safe and useful behavior. It may turn out that goal-directed behavior really is inevitable or irreplaceable, but the case has not yet been settled.

This post was originally posted here.

Tomorrow's AI Alignment Forum sequences post will be 'Fixed Point Discussion' by Scott Garrabrant, in the sequence 'Fixed Points'.

The next posts in this sequence will be 'Approval directed bootstrapping' and 'Humans consulting HCH', two short posts which will come out on Sunday 25th November.

Discuss

### What if people simply forecasted your future choices?

23 November 2018 - 13:52
Published on Fri Nov 23 2018 10:52:25 GMT+0000 (UTC)

tldr: If you could have a team of smart forecasters predicting your future decisions & actions, they would likely improve them in accordance with your epistemology. This is a very broad method that's less ideal than more reductionist approaches for specific things, but possibly simpler to implement and likelier to be accepted by decision makers with complex motivations.

Background

The standard way of finding questions to forecast involves a lot of work. As Zvi noted, questions should be very well-defined, and coming up with interesting yet specific questions takes considerable consideration.

One overarching question is how predictions can be used to drive decision making. One recommendation (one version called "Decision Markets") often comes down to estimating future parameters, conditional on each of a set of choices. Another option is to have expert evaluators probabilistically evaluate each option, and have predictors predict their evaluations (Prediction-Augmented Evaluations.)

Proposal

One prediction proposal I suggest is to have predictors simply predict the future actions & decisions of agents. I temporarily call this an "action prediction system." The evaluation process (the choosing process) would need to happen anyway, and the question becomes very simple. This may seem too basic to be useful, but I think it may be a lot better than at least I initially expected.

Say I'm trying to decide what laptop I should purchase. I could have some predictors predicting which one I'll decide on. In the beginning, the prediction aggregation shows that I have a 90% chance of choosing one option. While I really would like to be the kind of person who purchases a Lenovo with Linux, I'll probably wind up buying another Macbook. The predictors may realize that I typically check Amazon reviews and the Wirecutter for research, and they have a decent idea of what I'll find when I eventually do.

It's not clear to me how to best focus predictors on specific uncertain actions I may take. It seems like I would want to ask them mostly about specific decisions I am uncertain of.

One important aspect is that I should have a line of communication to the predictors. This means that some clever ones may eventually catch on to practices such as the following:

A forecaster-sales strategy

1. Find good decision options that have been overlooked

2. Make forecasts or bets on them succeeding

3. Provide really good arguments and research as to why they are overlooked

If I, the laptop purchaser, am skeptical, I could ignore the prediction feedback. But if I repeat the process for other decisions, I should eventually develop a sense of trust in the aggregation accuracy, and then in the predictors' ability to understand my desires. I may also be very interested in what that community has to say, as they have developed a model of what my preferences are. If I'm generally a reasonable and intelligent person, I could learn how to best rely on these predictors to speed up and improve my future decisions.
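One concrete way to develop that sense of trust is simply to score forecasters' past predictions of my decisions, e.g. with the Brier score, and weight the current aggregation accordingly. The forecaster names and all the numbers below are invented for illustration; this is a minimal sketch, not a proposed scoring rule.

```python
# Score each forecaster's past predictions of my decisions with the Brier
# score, then weight today's aggregation by demonstrated accuracy.
# Forecaster names and data are hypothetical.
history = {
    # forecaster: list of (predicted probability I choose X, did I choose X)
    "alice": [(0.9, True), (0.8, True), (0.3, False)],
    "bob":   [(0.5, True), (0.5, False), (0.5, True)],
}

def brier(preds):
    # Mean squared error between stated probability and the 0/1 outcome.
    return sum((p - float(outcome)) ** 2 for p, outcome in preds) / len(preds)

# Lower Brier score = better calibration; convert to a simple weight.
weights = {name: 1.0 - brier(preds) for name, preds in history.items()}

current = {"alice": 0.85, "bob": 0.6}  # today's P(I buy the MacBook)
aggregate = (sum(weights[n] * current[n] for n in current)
             / sum(weights.values()))
```

Here Alice's sharper, better-calibrated track record earns her more weight, so the aggregate lands closer to her forecast than to Bob's.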

In a way, this solution doesn't solve the problem of "how to decide the best option;" it just moves it into what may be a more manageable place. Over time I imagine that new strategies may emerge for what generally constitutes "good arguments", and those will be adopted. In the meantime, agents will be encouraged to quickly choose options they would generally want, using reasoning techniques they generally prefer. If one agent were really convinced by a decision market, then perhaps some forecasters would set one up in order to prove their point.

Failure Modes

There are a few obvious failure modes to such a setup. I think that it could dilute signal quality, but I am not as worried about some of the other obvious ones.

Weak Signals

I think it's fair to say that if one wanted to optimize for expected value, asking forecasters to predict actions instead could lead to weaker signals. Forecasters would be estimating a few things at once (how good an option is, and how likely the agent is to choose it.) If the agent isn't really intent on optimizing for specific things, and even if they are, it may be difficult to provide enough signal in their probabilities of chosen decisions for them to be useful. I think this would have to be empirically tested under different conditions.

There could also be complex feedback loops, especially for naive agents. An agent may trust its predictors too much. If the predictors believe the agent is too trusting or trusts the wrong signals, they could amplify those signals and find "easy stable points." I'm really unsure of how this would look or how much competence the agent or predictors would need to have net-beneficial outcomes. I'd be interested in testing and paying attention to this failure mode.

That said, the reference class of groups considering paying for "action predictions" rather than "decision markets" or similar is a very small one, and one that I expect would be convinced only by pretty good arguments. So pragmatically, in the rare cases where the question "would our organization be wise enough to benefit from action predictions?" is asked, I'd expect the answer to lean positive. I wouldn't expect obviously sleazy sales strategies to convince GiveWell of a new top cause area, for example.

Inevitable Failures

Say the predictors realized that a MacBook wouldn't make any sense for me, but that I was still 90% likely to choose it, even after I heard all of the best arguments. It would be somewhat of an "inevitable failure." The amount of utility I get from each item could be very uncorrelated with my chances of choosing that item, even after hearing about that difference.

While this may be unfortunate, it's not obvious what would work in these conditions. The goal of predictions shouldn't be to predict the future accurately, but instead to help agents make better decisions. If there were a different system that did a great job outlining the negative effect of a bad decision to my life, but I predictably ignored the system, then it just wouldn't be useful, despite being accurate. Value of information would be low. It's really tough for a system of information to be so good as to be useful even when ignored.

I'd also argue that the kinds of agents that would make predictably poor decisions would be ones that really aren't interested in getting accurate and honest information. It could seem pretty brutal to them; basically, it would involve them paying for a system that continuously tells them that they are making mistakes.

The discussion so far has assumed that the agents making the decisions are the same ones paying for the forecasting. This is not always the case, but in the counterexamples, setting up other proposals could easily be seen as hostile. If I set up a system to start evaluating the expected total value of all the actions of my friend George, knowing that George would systematically ignore the main ones, I could imagine George may not be very happy with his subsidized evaluations.

Principal-agent Problems

I think "action predictions" would help agents fulfill their actual goals, while other forecasting systems would more help them fulfill their stated goals. This has obvious costs and benefits.

Let's consider a situation with a CEO who wants their company to be as big as possible, and shareholders who instead want the company to be as profitable as possible.

Say the CEO commits to "maximizing shareholder revenue," and commits to making decisions that do so. If there were a decision market set up to estimate how much shareholder value each of a set of options would produce (as opposed to an action prediction system), and that information were public to shareholders, then it would be obvious to them when and how often the CEO disobeys that advice. This would be a very transparent setup that would allow the shareholders to police the CEO. It would take away a lot of flexibility and authority from the CEO and place it in the hands of the decision system.

By contrast, say the CEO instead shares a transparent action prediction system. Predictor participants would, in this case, try to understand the specific motivations of the CEO and optimize their arguments accordingly. Even if they were being policed by shareholders, they could know this and disguise their arguments accordingly. If discussing and correctly predicting the net impact to shareholders would be net harmful in terms of predicting the CEO's actions and convincing them as such, they could simply ignore it, or better yet find convincing arguments not to take that action. I expect that an action prediction system would essentially act to amplify the abilities of the decider, even if at the cost of other caring third parties.

Salesperson Melees

One argument against this is the gut reaction that it sounds very "salesy," so probably won't work. While I agree there are some cases where it may not work too well (stated above in the weak-signals section), I think that smart people should be positively augmented by good salesmanship under reasonable incentives.

In many circumstances, salespeople really are useful. The industry is huge, and I'm under the impression that at least a significant fraction (>10%) of it is net-beneficial. Specific kinds of technical and corporate sales come to mind, where the "sales" professionals are among the most useful people to discuss technical questions with. There simply aren't other services willing to have lengthy discussions about some topics.

Externalities

Predictions used in this way would help the goals of the agents using them, but these agents may be self-interested, leading to additional negative externalities on others. I think this prediction process doesn't at all help in making people more altruistic. It simply would help agents better satisfy their own preferences. This is a common aspect to almost all intelligence-amplification proposals. I think it's important to consider, but I'm really recommending this proposal more as a "possible powerful tool", and not as a "tool that is expected to be highly globally beneficial if used." That would be a very separate discussion.

Discuss

### Oversight of Unsafe Systems via Dynamic Safety Envelopes

23 November 2018 - 11:37
Published on Fri Nov 23 2018 08:37:30 GMT+0000 (UTC)

Idea

I had an idea for short-term, non-superhuman AI safety that I recently wrote up and will be posting on Arxiv. This post serves to introduce the idea, and request feedback from a more safety-oriented group than those that I would otherwise present the ideas to.

In short, the paper tries to adapt a paradigm that Mobileye has presented for autonomous vehicle safety to a much more general setting. The paradigm is to have a "safety envelope" dictated by an algorithm separate from the policy algorithm for driving, setting speed and distance limits for the vehicle based on the positions of vehicles around it.

For self-driving cars, this works well because there is a physics-based model of the system that can be used to find an algorithmic envelope. In arbitrary other systems, it works less well, because we don't have good fundamental models of what safe behavior means. For example, in financial markets there are "circuit breakers" that function as an opportunity for the system to take a break when something unexpected happens. The values for the circuit breakers are set via a simple heuristic that doesn't relate to the dynamics of the system in question. I propose taking a middle path - dynamically learning a safety envelope.
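A dynamically learned envelope might look something like the sketch below: a module, fully separate from the policy, that learns bounds from recently observed normal operation and clamps whatever the (untrusted) policy proposes. Every detail here — the windowing, the margin, the min/max bounds — is a hypothetical stand-in for whatever envelope-learning rule a real system would use.

```python
# Sketch of a dynamically learned safety envelope, separate from the policy.
# All parameters and the learning rule are hypothetical illustrations.
class DynamicEnvelope:
    def __init__(self, margin=1.5, window=1000):
        self.history = []      # values seen during normal operation
        self.margin = margin   # how far beyond observed behavior to allow
        self.window = window   # adapt slowly: only recent history matters

    def observe(self, value):
        # Called by the (slow, human-auditable) envelope-learning process.
        self.history.append(value)
        self.history = self.history[-self.window:]

    def clamp(self, proposed):
        # Called on every output of the fast, untrusted policy engine.
        if not self.history:
            return proposed
        lo, hi = min(self.history), max(self.history)
        center = (lo + hi) / 2
        half = (hi - lo) / 2 * self.margin
        return max(center - half, min(center + half, proposed))

envelope = DynamicEnvelope()
for v in [10.0, 12.0, 11.0, 9.5]:   # normal operating data
    envelope.observe(v)

policy_output = 40.0                 # policy proposes something extreme
safe_output = envelope.clamp(policy_output)  # pulled back inside the envelope
```

The point of the split is that the envelope changes on a slow, human-reviewable timescale (via observe), while clamp runs at the policy's speed, so a human or regulator can stay meaningfully in the loop without slowing the system down.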

In building separate models for safety and for policy, I think the system can address a different problem being discussed in military and other AI contexts, which is that "Human-in-the-Loop" is impossible for normal ML systems, since it slows the reaction time down to the level of human reactions. The proposed paradigm of a safety-envelope learning system can be meaningfully controlled by humans, because the adaptive time needed for the system can be slower than the policy system that makes the lower level decisions.

Quick Q&A

1) How do we build heuristic safety envelopes in practice?

This depends on the system in question. I would be very interested in identifying domains where this class of solution could be implemented, either in toy models, or in full systems.

2) Why is this better than a system that optimizes for safety?

The issues with balancing optimization for goals versus optimization for safety can lead to perverse effects. If the system optimizing for safety is segregated, and the policy-engine is not given access to it, this should not occur.

This also allows the safety system to be built and monitored by a regulator, instead of by the owners of the system. In the case of Mobileye's proposed system, a self-driving car could have the parameters of the safety envelope dictated by traffic authorities, instead of needing to rely on the car manufacturers to implement systems that drive safely as determined by those manufacturers.

3) Are there any obvious shortcomings to this approach?

Yes. This does not scale to human- or superhuman- general intelligence, because a system aware of the constraints can attempt to design policies for avoiding them. It is primarily intended to serve as a stop-gap measure to marginally improve the safety of near-term Machine Learning systems.

Discuss

### 2018 strategy update from MIRI

23 November 2018 - 02:42