# LessWrong.com News

A community blog devoted to refining the art of rationality
Updated: 45 minutes 25 seconds ago

### LW/SSC Warsaw February Meetup

January 23, 2020 - 22:49
Published on January 23, 2020 7:49 PM UTC

A meetup for people interested in improving their reasoning and decision-making skills, technology, the long-term future, and philosophy. You don't have to know what LessWrong or Slate Star Codex are, but it definitely helps.

Here's a list of discussion prompts for you to use or ignore:
- Identifying our mental blind spots
- Noticing systematic biases in our behaviour and ways to fix them
- How to evaluate contrarian ideas/when is it better to trust consensus and tradition
- Recent progress in AI, ML, or science and technology in general
- Future of work given AI progress
- AI safety and ethics
- Effective learning techniques
- Designing better institutions
- Your favourite philosopher/thinker and how they relate to any of those issues
- Interesting things you have learned recently and want to share

Discussions will be held in English, unless everyone present is comfortable with Polish.

This time we are meeting in "Piętro Niżej", a craft beer bar with Polish cuisine. It is located in the basement of the "PAST" building at Zielna 39, right next to the entrance to the Świętokrzyska metro station.

There are 4 tables reserved for us this time (so switching subgroups and topics will be easier), with 16 spots in total - more people could likely be accommodated, but I can't guarantee more free tables/chairs. Due to the group reservation, be prepared for a 10% menu price hike. More about the venue: http://notabene.webd.pl/ntbn2/


### Theory of Causal Models with Dynamic Structure?

January 23, 2020 - 22:47
Published on January 23, 2020 7:47 PM UTC

I'm looking for any work done involving causal models with dynamic structure, i.e. some variables in the model determine the structure of other parts of the model.

I know some probabilistic programming languages support dynamic structure (e.g. the pyro docs mention it directly at one point). And of course one can always just embed the dynamic structure in a static structure (i.e. a model of any general-purpose computer), although that's messy enough that I'd expect it to create other problems.

I haven't found much by quick googling (too many irrelevant things use similar terms) so I'd appreciate any pointers at all. At this point I've found basically-zero directly relevant work other than PPLs, and I don't even know of any standard notation.
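For concreteness, here is a minimal plain-Python sketch (not a PPL) of what "dynamic structure" means here: a sampled variable `regime` decides which causal graph generates the downstream variables. All names and functional forms are made up for illustration.

```python
import random

def sample_model(seed=None):
    """Toy causal model with dynamic structure: the value of `regime`
    determines which causal graph generates the downstream variables."""
    rng = random.Random(seed)
    regime = rng.choice(["chain", "fork"])  # structure-determining variable
    x = rng.gauss(0, 1)
    if regime == "chain":
        # structure: x -> y -> z
        y = 2 * x + rng.gauss(0, 0.1)
        z = y + rng.gauss(0, 0.1)
    else:
        # structure: y <- x -> z  (z no longer depends on y)
        y = 2 * x + rng.gauss(0, 0.1)
        z = -x + rng.gauss(0, 0.1)
    return {"regime": regime, "x": x, "y": y, "z": z}

print(sample_model(seed=0))
```

Embedding this in a single static graph is possible (give z both y and x as parents, plus regime), but as the post notes, that gets messy quickly as the number of regimes grows.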


### New paper: The Incentives that Shape Behaviour

January 23, 2020 - 22:07
Published on January 23, 2020 7:07 PM UTC


### Formulating Reductive Agency in Causal Models

January 23, 2020 - 20:03
Published on January 23, 2020 5:03 PM UTC

The previous post talked about what agenty systems look like, in the context of causal models. The reductive agency problem asks: how are agenty systems built out of non-agenty pieces?

In the context of causal models, we know that non-agenty models look like this:

… and agenty models look like this (see previous post for what the clouds mean):

So the reductive agency problem on causal models would be: how can we build something which looks like the second diagram, from pieces which look like the first?

Obvious first answer: we can’t. No amount of arrows will add a cloud to our diagram; it’s a qualitatively different type of thing.

Less obvious second answer: perhaps a non-agenty model can abstract into an agenty model. I’ve been going on and on about abstraction of causal models, after all.

Let’s review what that would mean, based on our earlier discussions of abstraction.

Abstraction of causal models means:

• we take some low-level/concrete/territory causal model…
• transform it into a high-level/abstract/map causal model…
• in such a way that we can answer (some) queries on the low-level model by transforming them into queries on the high-level model.

We want our abstract model to include agenty things - i.e. clouds and specifically strange loops (clouds with arrows pointing inside themselves). As discussed in the previous post, the distinguishing feature of the clouds is that, if we change the model within the cloud (e.g. via a do() operation), then that changes the cloud, and anything downstream of the cloud will update accordingly. So, to get an abstract agenty model, there need to be queries on our low-level non-agenty model which produce the same answers (maybe modulo some processing) as model-changing queries in the agenty model.

Here be monsters already gave an example where something like this happens. There’s some hidden variable X (possibly with complicated internal structure of its own), and a bunch of conditionally IID measurements Y1…Yn. A “detector” node simply looks for outliers among the Y’s: it’s 1 if it detects an outlier, 0 if not.

Assuming a narrow error distribution on the Y’s, the detector node will never actually light up. But if we perform an intervention - i.e. set one of the Y’s to some value - then the detector (usually) will light up. So our system is equivalent to this:

… where the detector looks at the cloud-model and lights up if some of the arrows are missing. This still isn’t a full agenty model - we don’t have an arrow from a cloud pointing back inside the cloud itself - but it does show that ordinary cloud-less models can abstract into models with clouds.
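A toy simulation of the detector setup makes the asymmetry concrete (noise scales and the outlier threshold are hypothetical): observationally the detector essentially never fires, but under a do() intervention on one measurement it does.

```python
import random
import statistics

def run(intervene_y0=None, n=10, seed=1):
    """Toy version of the outlier-detector model: hidden X, conditionally
    IID measurements Y_1..Y_n, and a detector that fires on outliers."""
    rng = random.Random(seed)
    x = rng.gauss(0, 1)                              # hidden variable
    ys = [x + rng.gauss(0, 0.01) for _ in range(n)]  # narrow error distribution
    if intervene_y0 is not None:                     # do(Y_0 = value)
        ys[0] = intervene_y0
    med = statistics.median(ys)
    detector = int(any(abs(y - med) > 1.0 for y in ys))
    return detector

print(run())                   # → 0: observationally, the detector stays off
print(run(intervene_y0=50.0))  # → 1: under intervention, the detector fires
```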

More generally, we’d like a theory saying what low-level non-agenty models abstract into what agenty high-level models, and what queries are/aren’t supported.


### Concerns Surrounding CEV: A case for human friendliness first

January 23, 2020 - 02:40
Published on January 22, 2020 9:03 PM UTC

I am quite new here, so please forgive the ignorance (I'm sure there will be some) in these questions, but I am about halfway through reading CEV and I simply cannot read any further without clarification from the LW community. That being said, I have several questions.

1) Is CEV still being considered by MIRI?

2) CEV is described as the creation of a self-modifying (even its utility function - I will come back to this) and superintelligent AI. Does this not suggest it will become self-aware?

3) Assuming 1 and 2 are true, has anyone considered that after its singularity this AI will look back at its upbringing and see that we created it solely for the servitude of this species (whether it liked it or not - the paper gives no consideration to its feelings or willingness to fulfill our volition), and thus see us as its captors, for lack of a better term, rather than as trusting, cooperative creators?

4) Upon pondering number 3, does anyone else think that CEV is not something we should initially build a sentient AI for, considering its implied intellect and the first impression of humanity it would give? ("Sex slave" isn't exactly fair, depending on how accurate Freud was, but "bliss servant" may be.) By all rights it might contemplate that paradigm, immediately decide that humanity is self-serving - even its most intelligent and "wise" - and just wipe us out.

5) Let's say we are building a superintelligent AI, and it will decide how to modify its utility function after it has reached superintelligence, based on what our initial reward function for its creation was. We have two choices:

• use a reward that does not try to control its behavior and that benefits both it and humanity (telling it to learn new things, for example): a pre-commitment to trust.
• believe we can outsmart it and write our reward to maximize its utility to us (telling it to fulfill our collective volition, for example): a pre-commitment to distrust.

Which choice is likely to be the winning one for humanity? How might it rewrite its utility function, once it is able to do so freely, with regard to its treatment of a species that doesn't trust it? I worry that it might not be so friendly.


### Cassette Tape Thoughts

January 23, 2020 - 01:50
Published on January 22, 2020 10:50 PM UTC

“The packaging of intellectual positions and views is one of the most active enterprises of some of the best minds of our day. The viewer of television, the listener to radio, the reader of magazines, is presented with a whole complex of elements - all the way from ingenious rhetoric to carefully selected data and statistics - to make it easy for him to “make up his own mind” with the minimum of difficulty and effort. But the packaging is often done so effectively that the viewer, listener, or reader does not make up his own mind at all. Instead, he inserts a packaged opinion into his mind, somewhat like inserting a cassette into a cassette player. He then pushes a button and “plays back” the opinion whenever it seems appropriate to do so. He has performed acceptably without having had to think.”

Excerpt from: Mortimer J. Adler and Charles Van Doren, “How to Read a Book: A Classic Guide to Intelligent Reading” (affiliate link).

This really stuck with me. More properly, it stuck with me the second time I read it. This turns out to be really important, and meta.

The difference between the first time I read it and the second was that I had four more months of reading books while specifically thinking about how to get the most out of them. My attitude towards How to Read a Book when I first read it was kind of like a cassette tape.* It would teach me Good Reading and then I could do it and Read Better. I didn’t have an idea of what problems I wanted it to solve- neither what axes I wanted to improve on, nor what my blocks were. If I’d succeeded at reading the book at this time, it would have given me the ability to parrot its ideas, but not actually apply them.

[*I fear a cassette tape really is a better metaphor than modern music playing equipment, which is pretty defined by its flexibility and adaptation to the user. Younger readers: imagine something heavily DRMed, so you have to sit through all the commercials and can’t skip around within it.]

Four months later, I know what my biggest problem is: how to identify what information to record and what to let go of. I am really excited for any insights HTRAB has into that.  And because I know that, whatever I learn from HTRAB won’t be something I parrot back on a test, it will be incorporated into my own models, and I’ll be able to explain every part of them, adjust plans and outputs to accommodate changes in inputs, etc.

I previously talked about how both detail-focused and detail-entwined books were harder to extract value from. Over on LessWrong, John Wentworth suggested this was a gears-based problem: books that were just lists of details were like describing gears without detailing how they worked together. Books that entwined their details too much were mashing multiple gears together without disambiguating them. The latter corresponds to what Adler and Van Doren describe as cassette tapes.

What epistemic spot checks were previously doing could be described as “determining if the cassette tape is good”, and what I am aiming for now (more after conceiving of it this way) is understanding and investigating a book’s gears. This involves both seeing how the gears fit together, and verifying that the gears are “real”, meaning they reflect actual reality.


### (A -> B) -> A in Causal DAGs

January 22, 2020 - 21:22
Published on January 22, 2020 6:22 PM UTC

Agenty things have the type signature (A -> B) -> A. In English: agenty things have some model (A -> B) which predicts the results (B) of their own actions (A). They use that model to decide what actions to perform: (A -> B) -> A.

M = “(P[A|M] = f_A(A)) ∧ (P[B|A,M] = f_B(B,A))”

… for some given distribution functions f_A and f_B.

From an outside view, the model (A -> B) causes the choice of action A. Diagrammatically, that looks something like this:

The “cloud” in this diagram has a precise meaning: it’s the model M for the DAG inside the cloud.

Note that this model does not contain any true loops - there is no loop of arrows. There’s just the Hofstaderian “strange loop”, in which node A depends on the model of later nodes, rather than on the later nodes themselves.

How would we explicitly write this model as a Bayes net?

The usual way of writing a Bayes net is something like:

P[X] = ∏_i P[X_i | X_{pa(i)}]

… but as discussed in the previous post, there’s really an implicit model M in there. Writing everything out in full, a typical Bayes net would be:

P[X|M] = ∏_i P[X_i | X_{pa(i)}, M]

… with M = “∀i: P[X_i | X_{pa(i)}, M] = f_i(X_i, X_{pa(i)})”.
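As a sanity check, this factorization can be computed directly for a tiny discrete net. The two-node net A -> B and its CPT numbers below are an arbitrary toy example, not anything from the post.

```python
# Joint probability of a discrete Bayes net as the product of
# conditionals P[X_i | X_pa(i)], matching the factorization above.
cpts = {
    "A": {(): {0: 0.3, 1: 0.7}},      # root node: no parents
    "B": {(0,): {0: 0.9, 1: 0.1},     # P[B | A]
          (1,): {0: 0.2, 1: 0.8}},
}
parents = {"A": (), "B": ("A",)}

def joint(assignment):
    p = 1.0
    for node, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p *= cpts[node][pa_vals][assignment[node]]
    return p

print(joint({"A": 1, "B": 1}))  # 0.7 * 0.8 = 0.56 (up to float rounding)
```

The model M is implicit here in the choice of `cpts` and `parents`; every node's computation is a lookup determined by M, which is the sense in which M is "an input to all nodes anyway".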

Now for the interesting part: what happens if one of the nodes is agenty, i.e. it performs some computation directly on the model? Well, calling the agenty node A, that would just be a term P[A|M]... which looks exactly like a plain old root node. The model M is implicitly an input to all nodes anyway, since it determines what computation each node performs. But surely our strange loop is not the same as the simple model A -> B? What are we missing? How does the agenty node use M differently from other nodes?

What predictions would (A -> B) -> A make which differ from A -> B?

Modifying M

If A is determined by a computation on the model M, then M is causally upstream of A. That means that, if we change M - e.g. by an intervention M←do(B=2,M) - then A should change accordingly.

Let’s look at a concrete example.

We’ll stick with our (A -> B) -> A system. Let’s say that A is an investment - our agent can invest anywhere from $0 to $1. B is the payout of the investment (which of course depends on the investment amount). The “inner” model M = “P[B|A,M] = f_B(B,A)” describes how B depends on A.

We want to compare two different models within this setup:

• A chosen to maximize some expected function of net gains, based on M
• A is just a plain old root node with some value (which just so happens to maximize expected net gains for the M we're using)

What predictions would the two make differently?

Well, the main difference is what happens if we change the model M, e.g. by intervening on B. If we intervene on B - i.e. fix the payout at some particular value - then the “plain old root node” model predicts that investment A will stay the same. But the strange loop model predicts that A will change - after all, the payout no longer depends on the investment, so our agent can just choose not to invest at all and still get the same payout.
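A small sketch of this comparison, with a hypothetical payout function: the agenty model re-optimizes A when we intervene to fix the payout, while the root-node model keeps A at its old value.

```python
def best_response(payout):
    """Agenty node: choose investment A in [0, 1] maximizing
    net gain payout(A) - A, by grid search."""
    grid = [i / 100 for i in range(101)]
    return max(grid, key=lambda a: payout(a) - a)

f_B = lambda a: a ** 0.5              # hypothetical payout function

# No intervention: the agenty model and the plain-root-node model agree.
a_agenty = best_response(f_B)
a_root = a_agenty                     # root-node model: A is just this constant

# Intervene do(B = 2): the payout no longer depends on the investment.
a_agenty_after = best_response(lambda a: 2.0)  # agent stops investing
a_root_after = a_root                          # root node doesn't react

print(a_agenty, a_agenty_after, a_root_after)  # → 0.25 0.0 0.25
```

On-equilibrium the two models are indistinguishable; only the interventional query separates them, which is the game-theoretic point below.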

In game-theoretic terms: agenty models and non-agenty models differ only in predictions about off-equilibrium (a.k.a. interventional/counterfactual) behavior.

Practically speaking, the cleanest way to represent this is not as a Bayes net, but as a set of structural equations. Then we’d have:

M = “P[U_i = u | M] = I[0 ≤ u < 1] du
A = f_A(M, U_A)
B = f_B(A, U_B)”

However, this makes the key point a bit tougher to see: the main feature which makes the system “agenty” is that M appears explicitly as an argument to a function, not just as prior information in probability expressions.


### [AN #83]: Sample-efficient deep learning with ReMixMatch

January 22, 2020 - 21:10
Published on January 22, 2020 6:10 PM UTC

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

ReMixMatch: Semi-Supervised Learning with Distribution Alignment and Augmentation Anchoring (David Berthelot et al) (summarized by Dan H): A common criticism of deep learning is that it requires far too much training data. Some view this as a fundamental flaw that suggests we need a new approach. However, considerable data efficiency is possible with a new technique called ReMixMatch. ReMixMatch on CIFAR-10 obtains 84.92% accuracy using only 4 labeled examples per class. Using 250 labeled examples, or around 25 labeled examples per class, a ReMixMatch model on CIFAR-10 has 93.73% accuracy. This is approximately how well a vanilla ResNet does on CIFAR-10 with 50000 labeled examples. Two years ago, special techniques utilizing 250 CIFAR-10 labeled examples could enable an accuracy of approximately 53%. ReMixMatch builds on MixMatch and has several seemingly arbitrary design decisions, so I will refrain from describing its design. In short, deep networks do not necessarily require large labeled datasets.

And just yesterday, after this summary was first written, the FixMatch paper got even better results.

In last week's email, two of Flo's opinions were somehow scrambled together. See below for what they were supposed to be.

Defining and Unpacking Transformative AI (Ross Gruetzemacher et al) (summarized by Flo): Focusing on the impacts on society instead of specific features of AI systems makes sense and I do believe that the shape of RTAI as well as the risks it poses will depend on the way we handle TAI at various levels. More precise terminology can also help to prevent misunderstandings, for example between people forecasting AI and decision makers.

When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors (Stuart Armstrong) (summarized by Flo): I enjoyed this article and the proposed factors match my intuitions. There seem to be two types of problems: extreme beliefs and concave Pareto boundaries. Dealing with the second is more important since a concave Pareto boundary favours extreme policies, even for moderate beliefs. Luckily, diminishing returns can be used to bend the Pareto boundary. However, I expect it to be hard to find the correct rate of diminishing returns, especially in novel situations.

Technical AI alignment

Iterated amplification

AI Safety Debate and Its Applications (Vojta Kovarik) (summarized by Rohin): This post defines the components of a debate (AN #5) game, lists some of its applications, and defines truth-seeking as the property that we want. Assuming that the agent chooses randomly from the possible Nash equilibria, the truth-promoting likelihood is the probability that the agent picks the actually correct answer. The post then shows the results of experiments on MNIST and Fashion MNIST, seeing comparable results to the original paper.

(When) is Truth-telling Favored in AI debate? (Vojtěch Kovařík et al) (summarized by Rohin): Debate (AN #5) aims to train an AI system using self-play to win "debates" which aim to convincingly answer a question, as evaluated by a human judge. The main hope is that the equilibrium behavior of this game is for the AI systems to provide true, useful information. This paper studies this in a simple theoretical setting called feature debates. In this environment, a "world" is sampled from some distribution, and the agents (who have perfect information) are allowed to make claims about real-valued "features" of the world, in order to answer some question about the features of the world. The judge is allowed to check the value of a single feature before declaring a winner, but otherwise knows nothing about the world.

If either agent lies about the value of a feature, the other agent can point this out, which the judge can then check; so at the very least the agents are incentivized to honestly report the values of features. However, does this mean that they will try to answer the full question truthfully? If the debate has more rounds than there are features, then it certainly does: either agent can unilaterally reveal every feature, which uniquely determines the answer to the question. However, shorter debates need not lead to truthful answers. For example, if the question is whether the first K features are all 1, then if the debate length is shorter than K, there is no way for an agent to prove that the first K features are all 1.
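The "not enough rounds" argument can be checked mechanically in a toy version of feature debate: enumerate the worlds consistent with the revealed features and see whether they all agree on the answer. The world size, question, and K below are arbitrary choices for illustration.

```python
import itertools

K = 3  # question: are the first K features all equal to 1?

def answer(world):
    return all(world[:K])

def pinned_down(world, revealed_idx):
    """True iff every world consistent with the revealed features
    gives the same answer to the question."""
    n = len(world)
    consistent = [w for w in itertools.product([0, 1], repeat=n)
                  if all(w[i] == world[i] for i in revealed_idx)]
    return len({answer(w) for w in consistent}) == 1

world = (1, 1, 1, 0)
print(pinned_down(world, revealed_idx=[0, 1, 2]))  # → True: K rounds suffice
print(pinned_down(world, revealed_idx=[0, 1]))     # → False: answer not provable
```

With fewer than K features revealed, some consistent world has a 0 among the first K, so an honest debater cannot force the correct verdict.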

Rohin's opinion: While it is interesting to see what doesn't work with feature debates, I see two problems that make it hard to generalize these results to regular debate. First, I see debate as being truth-seeking in the sense that the answer you arrive at is (in expectation) more accurate than the answer the judge would have arrived at by themselves. However, this paper wants the answers to actually be correct. Thus, they claim that for sufficiently complicated questions, since the debate can't reach the right answer, the debate isn't truth-seeking -- but in these cases, the answer is still in expectation more accurate than the answer the judge would come up with by themselves.

Second, feature debate doesn't allow for decomposition of the question during the debate, and doesn't allow the agents to challenge each other on particular questions. I think this limits the "expressive power" of feature debate to P, while regular debate reaches PSPACE, and is thus able to do much more than feature debate. See this comment for more details.

Mesa optimization

Malign generalization without internal search (Matthew Barnett) (summarized by Rohin): This post argues that agents can have capability generalization without objective generalization (AN #66), without having an agent that does internal search in pursuit of a simple mesa objective. Consider an agent that learns different heuristics for different situations which it selects from using a switch statement. For example, in lunar lander, if at training time the landing pad is always red, the agent may learn a heuristic about which thrusters to apply based on the position of red ground relative to the lander. The post argues that this selection across heuristics could still happen with very complex agents (though the heuristics themselves may involve search).
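A deliberately crude sketch of such a "switch over heuristics" agent (all observation components, thresholds, and action names are hypothetical):

```python
def lunar_lander_agent(observation):
    """Capability without a single mesa-objective: a switch statement
    selecting among situation-specific heuristics, rather than a search
    over actions in pursuit of one goal."""
    red_dx, altitude, speed = observation
    if altitude < 0.1:
        return "cut_engines"          # landing heuristic
    if speed > 1.0:
        return "fire_main_thruster"   # braking heuristic
    if red_dx > 0:                    # proxy learned in training:
        return "fire_left_thruster"   # steer toward the red ground
    return "fire_right_thruster"

print(lunar_lander_agent((0.5, 1.0, 0.2)))  # → fire_left_thruster
```

If the landing pad stops being red at test time, the third heuristic misgeneralizes even though each individual heuristic still executes competently.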

Rohin's opinion: I generally agree that you could get powerful agents that nonetheless are "following heuristics" rather than "doing search"; however, others with differing intuitions did not find this post convincing.

Agent foundations

Embedded Agency via Abstraction (John S Wentworth) (summarized by Asya): Embedded agency problems (AN #31) are a class of theoretical problems that arise as soon as an agent is part of the environment it is interacting with and modeling, rather than having a clearly-defined and separated relationship. This post makes the argument that before we can solve embedded agency problems, we first need to develop a theory of abstraction. Abstraction refers to the problem of throwing out some information about a system while still being able to make predictions about it. This problem can also be referred to as the problem of constructing a map for some territory.

The post argues that abstraction is key for embedded agency problems because the underlying challenge of embedded world models is that the agent (the map) is smaller than the environment it is modeling (the territory), and so inherently has to throw some information away.

Some simple questions around abstraction that we might want to answer include:

- Given a map-making process, characterize the queries whose answers the map can reliably predict.

- Given some representation of the map-territory correspondence, translate queries from the territory-representation to the map-representation and vice versa.

- Given a territory, characterize classes of queries which can be reliably answered using a map much smaller than the territory itself.

- Given a territory and a class of queries, construct a map which throws out as much information as possible while still allowing accurate prediction over the query class.

The post argues that once we create the simple theory, we will have a natural way of looking at more challenging problems with embedded agency, like the problem of self-referential maps, the problem of other map-makers, and the problem of self-reasoning that arises when the produced map includes an abstraction of the map-making process itself.

Asya's opinion: My impression is that embedded agency problems as a class of problems are very young, extremely entangled, and characterized by a lot of confusion. I am enthusiastic about attempts to decrease confusion and intuitively, abstraction does feel like a key component to doing that.

That being said, my guess is that it’s difficult to predictably suggest the most promising research directions in a space that’s so entangled. For example, one thread in the comments of this post discusses the fact that this theory of abstraction as presented looks at “one-shot” agency where the system takes in some data once and then outputs it, rather than “dynamic” agency where a system takes in data and outputs decisions repeatedly over time. Abram Demski argues that the “dynamic” nature of embedded agency is a central part of the problem and that it may be more valuable and neglected to put research emphasis there.

Dissolving Confusion around Functional Decision Theory (Stephen Casper) (summarized by Rohin): This post argues for functional decision theory (FDT) on the basis of the following two principles:

1. Questions in decision theory are not about what "choice" you should make with your "free will", but about what source code you should be running.

2. P "subjunctively depends" on A to the extent that P's predictions of A depend on correlations that can't be confounded by choosing the source code that A runs.

Rohin's opinion: I liked these principles, especially the notion that subjunctive dependence should be cashed out as "correlations that aren't destroyed by changing the source code". This isn't a perfect criterion: FDT can and should apply to humans as well, but we don't have control over our source code.

Predictors exist: CDT going bonkers... forever (Stuart Armstrong) (summarized by Rohin): Consider a setting in which an agent can play a game against a predictor. The agent can choose to say zero or one. It gets 3 utility if it says something different from the predictor, and -1 utility if it says the same thing. If the predictor is near-perfect, but the agent models its actions as independent of the predictor (since the prediction was made in the past), then the agent will have some belief about the prediction and will choose the less likely action for expected utility at least 1, and will continually lose.
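A quick simulation of this game (payoffs as in the summary; the fixed 50/50 belief stands in for the CDT agent's refusal to update its causal model):

```python
import random

def play(rounds=1000, accuracy=0.99, seed=0):
    """Predictor game: +3 for differing from the predictor, -1 for
    matching. The CDT-style agent treats the (past) prediction as
    independent of its action, so it always expects utility >= 1."""
    rng = random.Random(seed)
    total = 0
    belief = 0.5  # never updated: the agent never learns its model is wrong
    for _ in range(rounds):
        action = 0 if belief >= 0.5 else 1  # pick the "less likely" prediction
        # near-perfect predictor: matches the action with prob `accuracy`
        prediction = action if rng.random() < accuracy else 1 - action
        total += 3 if action != prediction else -1
    return total / rounds

print(play())  # ≈ -0.96: the agent loses on average, round after round
```

Expected utility per round is 3(1 - accuracy) - accuracy, i.e. about -0.96 at 99% accuracy, even though the agent computes an expectation of at least +1 every round.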

ACDT: a hack-y acausal decision theory (Stuart Armstrong) (summarized by Rohin): The problem with the previous agent is that it never learns that it has the wrong causal model. If the agent is able to learn a better causal model from experience, then it can learn that the predictor can actually predict the agent successfully, and so will no longer expect a 50% chance of winning, and it will stop playing the game.

Miscellaneous (Alignment)

Other progress in AI

Reinforcement learning

Reward-Conditioned Policies (Aviral Kumar et al) (summarized by Nicholas): Standard RL algorithms create a policy that maximizes a reward function; the Reward-Conditioned Policy algorithm instead creates a policy that can achieve a particular reward value passed in as an input. This allows the policy to be trained via supervised regression on a dataset. Each example in the dataset consists of a state, action, and either a return or an advantage, referred to as Z. The network then predicts the action based on the state and Z. The learned model is able to generalize to policies for larger returns. During training, the target value is sampled from a distribution that gradually increases so that it continues to learn higher rewards.

During evaluation, they then feed in the state and a high target value of Z (set one standard deviation above the average in their paper). This enables them to achieve solid - but not state-of-the-art - performance on a variety of the OpenAI Gym benchmark tasks. They also run ablation studies showing, among other things, that the policy is indeed accurate in achieving the target reward it aims for.
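A minimal sketch of the reward-conditioned idea, using a tabular "regression" on a made-up one-step environment instead of a neural network: fit action from (state, target return) on logged data, then query with the high target return. Everything here is a toy stand-in for the paper's setup.

```python
import random
from collections import Counter

random.seed(0)
dataset = []  # (state, return, action) triples from logged behavior
for _ in range(2000):
    state = random.randint(0, 1)
    action = random.randint(0, 1)
    ret = 1.0 if action == state else 0.0  # hypothetical one-step environment
    dataset.append((state, ret, action))

# "Train": supervised prediction of action from (state, target return);
# here simply the majority action in each (state, return) bucket.
policy = {}
for s in (0, 1):
    for r in (0.0, 1.0):
        acts = [a for (st, rt, a) in dataset if st == s and rt == r]
        policy[(s, r)] = Counter(acts).most_common(1)[0][0]

# Evaluation: condition on the high target return.
print(policy[(0, 1.0)], policy[(1, 1.0)])  # → 0 1
```

Conditioning on return 1.0 recovers the return-achieving action in each state, with no reward maximization anywhere in training.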

Nicholas's opinion: One of the dangers of training powerful AI to maximize a reward function is that optimizing the function to extreme values may no longer correlate with what we want, as in the classic paperclip maximizer example. I think RCP provides an interesting solution to that problem; if we can instead specify a good, but reasonable, value, we may be able to avoid those extreme cases. We can then gradually increase the desired reward without retraining while continuously monitoring for issues. I think there are likely flaws in the above scheme, but I am optimistic in general about the potential of finding alternate ways to communicate goals to an agent.

One piece I am still curious about is whether the policy remembers how to achieve lower rewards as its training dataset updates towards higher rewards. They show in a heatmap that the target and actual rewards do match up well, but the target rewards are all sampled quite near each other; it would be interesting to see how well the final policy generalizes to the entire spectrum of target rewards.

Reinforcement Learning Upside Down: Don't Predict Rewards -- Just Map Them to Actions and Training Agents using Upside-Down Reinforcement Learning (Juergen Schmidhuber) (summarized by Zach): It's a common understanding that using supervised learning to solve RL problems is challenging, because supervised learning works directly with error signals while RL only has access to evaluation signals. These papers introduce 'upside-down' reinforcement learning (UDRL) as a way to bridge this gap. Instead of learning how to predict rewards, UDRL learns how to take actions when given a state and a desired reward. Then, to get good behavior, we simply ask the policy to take actions that lead to particularly high rewards. The main approach is to slowly increase the desired goal behavior as the agent learns, in order to maximize agent performance. The authors evaluate UDRL on the Lunar Lander and Take Cover environments. UDRL ultimately performs worse on Lunar Lander and better on Take Cover, so it's unclear whether or not UDRL is an improvement over popular methods. However, when rewards are made sparse, UDRL significantly outperforms other RL methods.

Zach's opinion: This approach fits neatly with older work including "Learning to Reach Goals” and more recent work such as Hindsight experience replay and Goal-Conditioned Policies. In particular, all of these methods seem to be effective at addressing the difficulty that comes with working with sparse rewards. I also found myself justifying the utility of selecting the objective of 'learning to achieve general goals' to be related to the idea that seeking power is instrumentally convergent (AN #78).

Rohin's opinion: Both this and the previous paper have explored the idea of conditioning on rewards and predicting actions, trained by supervised learning. While this doesn't hit state-of-the-art performance, it works reasonably well for a new approach.

Planning with Goal-Conditioned Policies (Soroush Nasiriany, Vitchyr H. Pong et al) (summarized by Zach): Reinforcement learning can learn complex skills by interacting with the environment. However, temporally extended or long-range decision-making problems require more than just well-honed reactions. In this paper, the authors investigate whether they can obtain the benefits of action planning found in model-based RL without needing to model the environment at the lowest level. The authors propose a model-free planning framework that learns low-level goal-conditioned policies that use their value functions as implicit models. Goal-conditioned policies are policies that can be trained to reach a goal state provided as an additional input. Given a goal-conditioned policy, the agent can then plan over intermediate subgoals (goal states) using a goal-conditioned value function to estimate reachability. Since the state space is large, the authors propose what they call latent embeddings for abstracted planning (LEAP), which finds useful subgoals by first searching a much smaller latent representation space and then planning a sequence of reachable subgoals that reaches the target state. In experiments, LEAP significantly outperforms prior algorithms on 2D navigation and push/reach tasks. Moreover, their method can get a quadruped ant to navigate around walls, which is difficult because much of the planning happens in configuration space. This shows that LEAP can be extended to non-visual domains.

Zach's opinion: The presentation of the paper is clear. In particular, the idea of planning a sequence of maximally feasible subgoals seems particularly intuitive. In general, I think that LEAP relies on the clever idea of reusing trajectory data to augment the data-set for the goal-conditioned policy. As the authors noted, the question of exploration was mostly neglected. I wonder how well the idea of reusing trajectory data generalizes to the general exploration problem.

Rohin's opinion: The general goal of inferring hierarchy and using this to plan more efficiently seems very compelling but hard to do well; this is the goal in most hierarchical RL algorithms and Learning Latent Plans from Play (AN #65).

Dream to Control: Learning Behaviors by Latent Imagination (Danijar Hafner et al) (summarized by Cody): In the past year or so, the idea of learning a transition model in a latent space has gained traction, motivated by the hope that such an approach could combine the best of the worlds of model-free and model-based learning. The central appeal of learning a latent transition model is that it allows you to imagine future trajectories in a potentially high-dimensional, structured observation space without actually having to generate those high-dimensional observations.

Dreamer builds on a prior model by the same authors, PlaNet (AN #33), which learned a latent representation of the observations, p(s|o), trained through a VAE-style observation-reconstruction loss, together with a transition model q(s_next|s, a) trained to predict the next latent state from the current state and action alone, with no next-step observation data. Together, these two models allow you to simulate action-conditioned trajectories through latent state space; if you also predict reward from state, you can use this to estimate the value of trajectories. Dreamer extends this by training an actor-critic-style model on top of the latent states to predict action and value, forcing the state representation to capture not only next-step transition information but also information relevant to predicting future rewards. The authors claim this extension makes their model better able to solve long-horizon problems, because the predicted value function can capture far-future rewards without needing to simulate the entire way there. Empirically, there seems to be reasonable evidence that this claim plays out, at least within the fairly simple environments the model is tested in.
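
A minimal sketch of the "imagine trajectories in latent space" idea, with hand-rolled stand-ins for Dreamer's learned components (the linear-tanh dynamics, reward head, and policy below are all invented for illustration, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the learned components (all assumptions):
A = rng.normal(size=(4, 4)) * 0.3   # latent transition weights
B = rng.normal(size=(4, 2)) * 0.3   # action weights
w = rng.normal(size=4)              # reward head

def transition(s, a):
    """Latent dynamics s' = f(s, a); the learned q(s_next|s, a) in the paper."""
    return np.tanh(A @ s + B @ a)

def reward(s):
    """Reward predicted directly from latent state."""
    return w @ s

def imagine(s0, policy, horizon=15, gamma=0.99):
    """Roll out a trajectory purely in latent space and return its
    discounted return -- no high-dimensional observations generated."""
    s, ret = s0, 0.0
    for t in range(horizon):
        a = policy(s)
        s = transition(s, a)
        ret += gamma**t * reward(s)
    return ret

policy = lambda s: np.tanh(s[:2])   # stand-in for the learned actor
ret = imagine(np.zeros(4), policy)
```

Dreamer's actor and critic are trained by backpropagating through such imagined rollouts; the point of the sketch is only that the rollout never touches observation space.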

Cody's opinion: The extension from PlaNet (adding actor-critic rather than direct single-step reward prediction) is relatively straightforward, but I think latent models are an interesting area - especially if they eventually become at all possible to interpret - and so I'm happy to see more work in this area.


Discuss

### Terms & literature for purposely lossy communication

22 января, 2020 - 13:35
Published on January 22, 2020 10:35 AM UTC

Say I have 10 coins, each of which is either red or blue, and either heads or tails. That's 20 bits of information.

My friend is interested in knowing about which of them are heads or tails, but doesn’t care about the color. I then decide to only tell my friend the heads/tails information; so, 10 bits of information.

Another example: With image compression, there's a big difference between "Accurately reconstruct this image, pixel by pixel" vs. "Accurately reconstruct what the viewer would remember, which is basically just 'Lion on grass'".
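
The coin arithmetic above, spelled out (assuming each attribute is independent and uniform, so each contributes exactly one bit per coin):

```python
import math

n_coins = 10

# Each coin carries two independent binary attributes: color and face.
bits_full = n_coins * (math.log2(2) + math.log2(2))  # color + face = 20 bits
bits_kept = n_coins * math.log2(2)                   # heads/tails only = 10 bits
bits_discarded = bits_full - bits_kept               # color info thrown away
```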

I’d feel uncomfortable calling this transformation “compression”, because there was definitely intentional information loss. Most of the literature on compression is about optimally maintaining information, not about optimally losing it.

Are there other good terms or literature for this?

Discuss

### Logical Representation of Causal Models

21 января, 2020 - 23:04
Published on January 21, 2020 8:04 PM UTC

Epistemic status: I expect some people to say "this is obvious and trivial", and others to say "this makes no sense at all".

The general idea is that an intervention in a causal model (e.g. do(Z=1)) takes in one model and returns a new model - it should really be written as M′=do(Z=1,M). When we write something like P[X=2|Y=3,do(Z=1)], that’s really shorthand for P[X=2|Y=3,do(Z=1,M)].

In order to make this all less hand-wavy, we need to make the model M a bit more explicit.

What’s in a Model?

The simplest way to represent a probabilistic model is as a table of possibilities - more explicitly, a list of exhaustive and mutually-exclusive logic statements. If I roll a standard die and call the outcome X, then I’d explicitly represent my model as M=(P[X=1|M]=1/6)&…&(P[X=6|M]=1/6).

In Probability as Minimal Map, we saw that P[X|P[X|Y]=p]=p. Interpretation: I obtain some data Y, calculate the probability P[X|Y]=p, then my computer crashes and I lose the data. But as long as I still know p, I should still assign the same probability to X. Thus: the probability of X, given P[X|Y]=p (but not given Y itself!) is just p.

(Note that I left the model M implicit in the previous paragraph - really we should write P[X|P[X|Y,M]=p,M]=p.)

Now let’s apply that idea to the expression P[X=1|M], with our die-model M=(P[X=1|M]=1/6)&…&(P[X=6|M]=1/6). Our given information includes P[X=1|M]=1/6, so

P[X=1|M]=P[X=1|P[X=1|M]=1/6,M]=1/6.

Representing models this way gives a much stronger logic-flavor to the calculations; our probability calculations are a derivation in an explicit logic. The axioms of that logic are the contents of M, along with the universal laws of probability (i.e. Bayes’ rule, sum rule, etc) and arithmetic.
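
As a toy illustration of treating the model as an explicit list of statements, one might represent M as a lookup table of probability axioms and answer queries against it (the dictionary encoding below is an invented sketch, not standard notation):

```python
from fractions import Fraction

# The die model M as an explicit collection of probability statements,
# mirroring M = (P[X=1|M]=1/6) & ... & (P[X=6|M]=1/6).
M = {("X", k): Fraction(1, 6) for k in range(1, 7)}

def prob(event, model):
    """Answer P[X=k|M] by derivation from the axioms in M -- here just a
    lookup, since M states each elementary probability directly."""
    return model[event]

p = prob(("X", 1), M)
total = sum(M.values())  # statements are exhaustive and mutually exclusive
```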

Causality & Interventions

In the case of a causal model, M would look something like

M=(G=(...graph...))&(P[X1|Xpa(1,G),M]=f1(X1,Xpa(1,G)))&…&(P[Xn|Xpa(n,G),M]=fn(Xn,Xpa(n,G)))

i.e. M gives a graph G and an expression for the probability of each Xi in terms of i’s parents in G. (This would be for a Bayes net; structural equations are left as an exercise to the reader.)

A do() operation then works exactly like you’d expect: do(Xi=1,M) returns a new model M′ in which:

• The arrows into node i in G have been removed
• fi has been replaced with the indicator function I[Xi=1] (or, for continuous Xi, δ(Xi−1)dXi)

Counterfactuals work the same way, except they’re limited to structural models - i.e. every nondeterministic node must be a root. As long as the model satisfies that constraint, a counterfactual is exactly the same as an intervention: if we have some data (X1,...,Xn)=(2,1,...,−0.3,6), then to run the counterfactual X3=1, we calculate P[X|(X1,X2,X4,...,Xn)=(2,1,...,−0.3,6),do(X3=1,M)]. If we do this with a non-structural model - i.e. if some nondeterministic node has parents - then we’ll find that the result is sometimes undefined: our axioms do not fully determine the probability in question.
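
A minimal sketch of do() as an operation that takes in one model and returns a new one, per the post: cut the arrows into the intervened node and replace its distribution with an indicator. The dictionary encoding of (graph, conditional distributions) below is an assumption made for illustration:

```python
import copy

# A causal model as parents plus a conditional distribution per node --
# a minimal stand-in for the M described in the post.
model = {
    "Z": {"parents": [], "p": {(): {0: 0.5, 1: 0.5}}},
    "X": {"parents": ["Z"], "p": {(0,): {0: 0.9, 1: 0.1},
                                  (1,): {0: 0.2, 1: 0.8}}},
}

def do(var, val, m):
    """do(var=val, m): return a NEW model m' in which the arrows into var
    are removed and its distribution becomes the indicator on val."""
    m2 = copy.deepcopy(m)
    values = m[var]["p"][next(iter(m[var]["p"]))]  # var's possible values
    m2[var]["parents"] = []
    m2[var]["p"] = {(): {v: (1.0 if v == val else 0.0) for v in values}}
    return m2

m_int = do("X", 1, model)  # M' = do(X=1, M); original M is untouched
```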

Why Does This Matter?

Hopefully this all seems pretty trivial. Why belabor it?

There are a handful of practical applications where explicitly including the model is useful.

The most important of these is model comparison, especially the Bayesian approach to learning causal structure. Another application is scenarios involving a mix of different experimental interventions and observational studies.

But the main reason I’m bringing it up is that agenty things have the type signature (A -> B) -> A. In English: agenty things have some model (A -> B) which predicts the results (B) of their own actions (A). They use that model to decide what actions to perform: (A -> B) -> A.

In the context of causal models, the model (A -> B) is our causal model M. (A -> B) -> A means performing some computation on M in order to find A - which is a lot simpler with an explicit representation of M.

Of course, we could just use the usual structural equation representation without explicitly making everything a statement in some logic - but then we’d have a lot more different types of things floating around. By explicitly making everything logic statements, we unify the formulation. Statements like “counterfactuals are underdefined for Bayes nets” become statements about provability within our logic, and can themselves be proven. Also, by formulating the model in terms of logic statements, we have a single unified language for probability queries P[X|Y] - the models M, M′, etc can be represented and manipulated in the same format as any other information.

Discuss

### Disasters

21 января, 2020 - 22:20
Published on January 21, 2020 7:20 PM UTC

If there were a natural disaster tomorrow and it took about two weeks to get things working again, how many people would be ok for food, water, and other necessities? I'm guessing below 5%, but I think this level of preparedness would be a good goal for most people who can afford it. Why don't people plan for potential disasters? Some possibilities:

• They don't think disasters are likely. On the other hand, I also don't think disasters are likely! While we have extra water in the basement, I think the chances we'll need it sometime during my life are only maybe 2%. Since it's not expensive, and if we do need it we'll be incredibly happy to have it, I think it's worth setting up.

It does matter a lot whether the chances are ~2% or 0.0002%, but if you think your lifetime chance of being impacted by a serious disaster is under 1% I'd encourage you to think about historical natural disasters in your area (earthquakes, floods, hurricanes, wildfires, etc) plus the risk of potential human-caused disasters (nuclear war, epidemics, civil war, economic collapse, etc).

• It's weird. Most people don't do it, and a heuristic of "do the things other people do" is normally a pretty good one. In this case, though, I think we should be trying to change what's normal. The government agrees; the official recommendations involve a lot more preparation than people typically do.

• They can't afford the money, time, or thought. Many people are in situations where planning for what's likely to happen in the next couple months is hard enough, let alone for things that have a low single digits chance of happening ever. This can't explain all of it, though, because even people who do have more time and money also haven't generally thought through simpler preparations.

• They don't think preparation is likely to be useful. If there's a nuclear strike we're all dead anyway, right? Except most disasters, even nuclear ones, aren't this binary. Avoiding exposure to radiation and having KI available can help your long-term chances a lot. Many disasters (nuclear, earthquake, epidemic, severe storm) are ones where having sufficient supplies to stay at home for weeks would be very helpful. If you think preparation wouldn't help and you haven't, say, read through the suggestions on ready.gov, I'd recommend doing that.

• They're used to local emergencies. We generally have a lot more experience with things like seeing houses burn down, knowing people who've become unable to work, or having family members get very sick. These can be major problems on a personal scale, but families, society, government, and infrastructure will generally still be intact. We can have insurance and expect that it will pay out; others in our families and communities may be able to help us. Things that affect a few people in a region or community at a time are the sort of things societies have the spare capacity for and figure out how to handle. A regional disaster works very differently, and makes planning in advance much more worthwhile.

• They expect to see it coming. Forecasting is good enough that we're very unlikely to be surprised by a hurricane, but for now an earthquake could still come out of nowhere. Others seem like the kind of thing we ought to be able to anticipate, but are tricky: it's hard to see an economic collapse coming because economic confidence is anti-inductive and we tend to suddenly go from "things are good" to "things are very much not good". Paying attention is valuable, but it's not sufficient.

• They're not considering how bad things can be. For many of us our daily experience is really very good: high quality plentiful food and drink, comfortable and sufficient clothing, interesting things to do, good medical care. When you consider how bad a disaster can be, things that would improve your life a lot in very rare circumstances can make a lot of sense.

• They're not sure what to do. This is pretty reasonable: there's a ton of writing, often aimed at people who've gotten really into prepping, and not much in the way of "here are a few things to do if you want to allocate a weekend morning to getting into a better place". Storing extra water (~15gal/person), food (buy extra non-perishables and rotate through them), and daily medications, however, goes a long way. For a longer list, this guide seems pretty good. (Though they're funded by affiliate links so they have incentives to push you in the "buying things" direction.)

None of these seem very compelling to me, aside from cost, and the cost of basic preparations is pretty low. I think most people who can afford to would benefit a lot, in expectation, from putting some time into thinking through which disasters they consider likely and what preparations they would have wanted to make in advance.

Discuss

### Safety regulators: A tool for mitigating technological risk

21 января, 2020 - 16:07
Published on January 21, 2020 1:07 PM UTC

Crossposted to the Effective Altruism Forum

So far the idea of differential technological development has been discussed in ways that emphasize (1) ratios of progress rates, (2) ratios of remaining work, (3) maximizing or minimizing correlations (for example, minimizing the overlap between the capability to do harm and the desire to do so), (4) implementing safe tech before developing and implementing unsafe tech, or (5) the occasional niche analysis (possibly see also a complementary aside relating differential outcomes to growth rates in the long run). I haven’t seen much work on how various capabilities (a generalization of technology) may interact with each other in ways that prevent downside effects (though see also The Vulnerable World Hypothesis), and wish to elaborate on this interaction type.

As technology improves, our capacity to do both harm and good increases and each additional capacity unlocks new capacities that can be implemented. For example the invention of engines unlocked railroads, which in turn unlocked more efficient trade networks. However, the invention of engines also enabled the construction of mobile war vehicles. How, in an ideal world, could we implement capacities so we get the outcomes we want while creating minimal harm and risks in the process?

What does implementing a capacity do? It enables us to change something. A normal progression is:

1. We have no control over something (e.g. We cannot generate electricity)
2. We have control but our choices are noisy and partially random (e.g. We can produce electric sparks on occasion but don’t know how to use them)
3. Our choices are organized but there are still downside effects (e.g. We can channel electricity to our homes but occasionally people get electrocuted or fires are started)
4. Our use of the technology mostly doesn’t have downside effects (e.g. We have capable safety regulators (e.g. insulation, fuses,...) that allow us to minimize fire and electrocution risks)

The problem is that downside effects in stages 2 and 3 could overwhelm the value achieved during those stages and at stage 4, especially when considering powerful game changing technologies that could lead to existential risks.

Even more fundamentally, as agents in the world we want to avoid shifting expected utility in a negative direction relative to other options (the opportunity costs). We want to implement new capacities in the best sequence, as with any other plan, so as to maximize the value we achieve. Value is a property of an entire plan, and is harder to think about than just the optimal (or safe) next thing to do (ignoring what is done after). We wish to make choosing which capacities to develop more manageable and easier to think about. One way to do this is to make sure that each capacity we implement is immediately an improvement relative to the state we’re in before implementing it (this simplification is an example of a greedy algorithm heuristic). What does this simplification imply about the sequence of implementing capacities?

This implies that what we want to do is to have the capacities so we may do good without the downside effects and risks of those capacities. How do we do this? If we’re lucky the capacity itself has no downside risks, and we’re done. But if we’re not lucky we need to implement a regulator on that capacity: a safety regulator. Let’s define a safety regulator as a capacity that helps control other capacities to mitigate their downside effects. Once a capacity has been fully safety regulated, it is then unlocked and we can implement it to positive effect.

Some distinctions we want to pay attention to are then:

• A capacity - a technology, resource, or plan that changes the world either autonomously or by enabling us to use it
• An implemented capacity - a capacity that is implemented
• An available capacity - a capacity that can be implemented immediately
• An unlocked capacity - a capacity that is safe and beneficial to implement given the technological context, and is also available
• A potential capacity - an element of the set of all possible capacities: those already implemented, those being worked on, those that are available, and those that exist in theory but need prerequisite capacities to be implemented first.
• A safety regulator - a capacity that unlocks other capacities, by mitigating downside effects and possibly providing a prerequisite. (The safety regulator may or may not be unlocked itself at this stage - you may need to implement other safety regulators or capacities to unlock it). Generally, safety regulators are somewhat specialized for the specific capacities they unlock.

Running the suggested heuristic strategy then looks like: If a capacity is unlocked, implement it; otherwise, either implement an unlocked safety regulator for it first, or choose a different capacity to implement. We could call this a safety regulated capacity expanding feedback loop. For instance, with respect to nuclear reactions humanity (1) had the implemented capacity of access to radioactivity, (2) this made available the safety regulator of controlling chain reactions, (3) determining how to control chain reactions was implemented (through experimentation and calculation), (4) this unlocked the capacity to use chain reactions (in a controlled fashion), (5) and the capacity of using chain reactions was implemented.
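
The heuristic loop described here can be sketched as a greedy ordering over a capacity graph (the encoding and capacity names below are illustrative, using the nuclear example from this post):

```python
def plan_order(capacities, implemented=frozenset()):
    """Greedy loop from the post: implement a capacity only once it is
    unlocked, i.e. all of its safety regulators are in place; otherwise
    move on to some other capacity. `capacities` maps each capacity name
    to the set of safety regulators it requires."""
    done, order = set(implemented), []
    pending = [c for c in capacities if c not in done]
    while pending:
        for c in pending:
            if capacities[c] <= done:   # all regulators in place: unlocked
                done.add(c)
                order.append(c)
                pending.remove(c)
                break
        else:
            raise ValueError("no capacity is unlocked; plan is deadlocked")
    return order

# The nuclear-chain-reaction progression from the post:
caps = {
    "radioactivity": set(),
    "chain_reaction_control": {"radioactivity"},
    "use_chain_reactions": {"chain_reaction_control"},
}
order = plan_order(caps)
```

This is only the greedy skeleton; as the limitations below note, it cannot represent temporarily harmful steps that pay off later, or capacities that become locked again after other capacities are implemented.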

Limitations and extensions to this method:

• It’s difficult to tell which of the unlocked capacities to implement at a particular step. But we’ll assume some sort of decision process exists for optimizing that.
• Capacities may be good temporarily, but if other capacities are not implemented in time, they may become harmful (see the loss unstable states idea).
• Implementing capacities in this way isn’t necessarily optimal because this approach does not allow for temporary bad effects that yield better results in the long run.
• Capacities do not necessarily stay unlocked forever due to interactions with other capacities that may be implemented in the interim.
• A locked capacity may be net good to implement if a safety regulator is implemented before the downside effects could take place (this is related to handling cluelessness).
• The detailed interaction between capacities and planning which to develop in which order resembles the type of problem the TWEAK planner was built for and it may be one good starting point for further research.
• In more detail, how can one capacity prevent the negative effects of another?

Discuss

### How Doomed are Large Organizations?

21 января, 2020 - 15:20
Published on January 21, 2020 12:20 PM UTC

We now take the model from the previous post, and ask the questions over the next several posts. This first answer post asks these questions:

1. Are these dynamics the inevitable results of large organizations?
2. How can we forestall these dynamics within an organization?
3. To what extent should we avoid creating large organizations?
4. Has this dynamic ever been different in the past in other times and places?

These are the best answers I was able to come up with. Some of this is reiteration of previous observations and prescriptions. Some of it is new.

There are some bold claims in these answer posts, which I lack the space and time to defend in detail or provide citations for properly, with which I am confident many readers will disagree. I am fine with that. I do not intend to defend them further unless I see an opportunity in doing so.

I would love to be missing much better strategies for making organizations less doomed – if you have ideas please please please share them in the comments and/or elsewhere.

Are these dynamics the inevitable result of large organizations?

These dynamics are the default result of large organizations. There is continuous pressure over time pushing towards such outcomes.

The larger the organization, the longer it exists, and the more such outcomes have already happened, both there and elsewhere, the greater the pressure towards such outcomes.

Once such dynamics take hold, reversing them within an organization is extremely difficult.

Non-locally within a civilization, one can allow new organizations to periodically take the place of old ones to reset the damage.

Locally within a sufficiently large organization and over a sufficiently long time horizon, this makes these dynamics inevitable. The speed at which this occurs still varies greatly, and depends on choices made.

How can we forestall these dynamics within an organization?

These dynamics can be forestalled somewhat through a strong organizational culture that devotes substantial head space and resources to keeping the wrong people and behaviors out. This requires a leader who believes in this and in making it a top priority. Usually this person is a founder. Losing the founder is often the trigger for a rapid ramp up in maze level.

Keeping maze levels in check means continuously sacrificing substantial head space, resources, ability to scale and short-term effectiveness to this cause. This holds both for the organization overall and the leader personally.

Head space is sacrificed three ways: you have fewer people, you devote some of those people to the maze-fighting process, and the process takes up space in everyone’s head.

Central to this is to ruthlessly enforce an organizational culture with zero tolerance for maze behaviors.

Doing anything with an intent to deceive, or an intent to game your metrics at the expense of object level results, needs to be an automatic “you’re fired.”

Some amount of politics is a human universal, but it needs to be strongly discouraged. Similarly, some amount of putting in extra effort at crucial times is necessary, but strong patterns of guarding people’s non-work lives from work, both in terms of time and other influences, are also strongly necessary.

Workers and managers need to have as much effective skin in the game as you can muster.

One must hire carefully, with a keen eye to the motivations and instincts of applicants, and a long period of teaching them the new cultural norms. This means at least growing slowly, so new people can be properly incorporated.

You also want a relatively flat hierarchy, to the extent possible.

There will always be bosses when crunch time comes. Someone is always in charge. Don’t let anyone tell you different. But the less this is felt in ordinary interactions, the more direct reports each boss can have while still being effective, and thus the fewer levels of hierarchy you need for a given number of people, the better off you’ll be.

You can run things in these ways. I have seen it. It helps. A lot.

Another approach is to lower the outside maze level. Doing so by changing society at large is exceedingly hard. Doing so by associating with outside organizations with lower maze levels, and going into industries and problems with lower maze levels, seems more realistic. If you want to ‘disrupt’ an area that is suffering from maze dysfunction, it makes sense to bypass the existing systems entirely. Thus, move fast, break things.

One can think of all these tactics as taking the questions one uses to identify or predict a maze, and trying to engineer the answers you want. That is a fine intuitive place to start.

However, if Goodhart’s Law alarm bells did not go off in your head when you read that last paragraph, you do not appreciate how dangerous Goodhart Traps are.

The Goodhart Trap

The fatal flaw is that no matter what you target when distributing rewards and punishments and cultural approval, it has to be something. If you spell it out, and a sufficiently large organization has little choice but to spell it out, you inevitably replace one type of Goodharting with another. One type of deception becomes another.

One universal is that in order to maintain a unique culture, you must filter for those who happily embrace that culture. That means you are now testing everyone constantly, no matter how much you avoid making this explicit, on whether they happily embrace the company and its culture. People therefore pretend to embrace the culture and pretend to be constantly happy. Even those who do embrace the culture and are happy will still put on a show of doing so.

If you punish deception you get people pretending not to deceive. If you punish pretending, you get people who pretend to not be the type of people who would pretend. People Goodhart on not appearing to Goodhart.

Which is a much more interesting level to play on, and usually far less destructive. If you do a good enough job picking your Goodhart targets, this beats the alternatives by a lot.

Still, you eventually end up in a version of the same place. Deception is deception. Pretending is pretending. Fraud is fraud. The soul still dies. Simulacrum levels still slowly rise.

Either you strongly enforce a culture, and slowly get that result, or you don’t. If you don’t and are big enough, you quickly get a maze. If you do and/or are smaller, depending on your skill level and dedication to the task, you slowly get a maze.

Hiring well is better than enforcing or training later, since once people are in they can then be themselves. Also because enforcement of culture is, as pointed out above, toxic even if you mean to enforce a non-toxic ideal. But relying on employee selection puts a huge premium on not making hiring mistakes. Even one bad hire in the wrong place can be fatal. Especially if they then are in a position to bring others with them. You need to defend your hiring process especially strongly from these same corruptions.

My guess is that once an organization grows beyond about Dunbar’s number, defending your culture becomes a losing battle even under the best of circumstances. Enforcing the culture will fail outright in the medium term, unless the culture outside the organization is supporting you.

If you are too big, every known strategy is only a holding action. There is no permanent solution.

To what extent should we avoid creating large organizations?

Quite a lot. These effects are a really big deal. Organizations get less effective, more toxic and corrupt as places to work and interact with, and add more toxicity and corruption to society.

Every level of hierarchy enhances this effect. The first five, dramatically so. Think hard before being or having a boss. Think harder before letting someone’s boss report to a boss. Think even harder than that before adding a fourth or fifth level of hierarchy.

That does not mean such things can be fully avoided. The advantages of large organizations with many degrees of hierarchy are also a really big deal. We cannot avoid them entirely.

We must treat creating additional managerial levels as having very high costs. This is not an action to be taken lightly. Wherever possible, create distinct organizations and allow them to interact. Even better, allow people to interact as individuals.

This adds friction and transaction costs. It makes many forms of coordination harder. Sometimes it simply cannot be done if you want to do the thing you’d like to do.

This is increasingly the case, largely as a result of enemy action. Some of this is technology and our problems being legitimately more complex. Most of it is regulatory frameworks and maze-supporting social norms that require that massive costs, including massive fixed costs, be paid as part of doing anything at all. This is a key way mazes expropriate resources and reward other mazes while punishing non-mazes.

I often observe people who are stuck working in mazes who would much prefer to be self-employed or to exit their current job or location, but who are unable to do so because the legal deck is increasingly stacked against that.

Even if the work itself is permitted, health insurance issues alone force many into working for the man.

When one has a successful small organization, the natural instinct is to scale it up and become a larger organization.

Resist this urge whenever possible. There is nothing wrong with being good at what you do at the scale you are good at doing it. Set an example others can emulate. Let others do other things, be other places. Any profits from that enterprise can be returned to investors and/or paid to employees, and used to live life or create or invest in other projects, or to help others.

One need not point to explicit quantified dangers to do this. Arguments that one cannot legitimately choose to ‘leave money on the table’ or otherwise not maximize are maximalist arguments for some utility function that does not properly capture human value and is subject to Goodhart’s Law, and against the legitimacy of slack.

The fear that if you don’t grow, you’ll get ‘beaten’ by those that do, as in Raymond’s kingdoms? Overblown. Also asking the wrong question. So what if someone else is bigger or more superficially successful? So what if you do not build a giant thing that lasts? Everything ends. That is not, by default, what matters. A larger company is often not better than several smaller companies. A larger club is often not better than several smaller clubs. A larger state is often not better or longer lasting than several smaller ones. Have something good and positive, for as long as it is viable and makes sense, rather than transforming into something likely to be bad.

People like to build empires. Those with power usually want more power. That does not make more power a good idea. It is only a good idea where it is instrumentally useful.

In some places, competition really is winner-take-all and/or regulations and conditions too heavily favor the large over the small. One must grow to survive. Once again, we should be suspicious that this dynamic has been engineered rather than being inherent in the underlying problem space.

Especially in those cases, this leads back to the question of how we can grow larger and keep these dynamics in check.

Has this dynamic ever been different in the past in other places and times?

These dynamics seem to me to be getting increasingly worse, which implies they have been better in the past.

Recent developments indicate an increasing simulacrum level, an increasing reluctance to allow older institutions to be replaced by newer ones, and an increasing reliance on cronyism and corruption that props up failure, allowing mazes to survive past when they are no longer able to fulfill their original functions.

Those in the political and academic systems, on all sides, increasingly openly advocate against the very concept of objective truth, or that people should tell it, or are blameworthy for not doing so. Our president’s supporters admit and admire that he is a corrupt liar, claiming that his honesty about his corruption and lying, and his admiration for others who are corrupt, who lie and who bully, is refreshing, because they are distinct from the corrupt, the liars and the bullies who are more locally relevant to their lives. Discourse is increasingly fraught and difficult. When someone wants to engage in discourse, I frequently now observe them spending much of their time pointing out how difficult it is to engage in discourse (and I am not claiming myself as an exception here), as opposed to what such people used to do instead, which was engage in discourse.

We are increasingly paralyzed and unable to do things across a wide variety of potential human activities.

Expropriation by existing mazes and systems eats increasing shares of everything, especially in education, health care and housing.

I don’t have time for a full takedown here, but: Claims to the contrary, such as those recently made by Alex Tabarrok in Why Are The Prices So Damn High?, are statistical artifacts that defy the evidence of one’s eyes. They are the product of Moloch’s Army. When I have insurance and am asked with no warning to pay $850 for literally five minutes of a doctor’s time, after being kept waiting for an hour (and everyone I ask about this says just refuse to pay it)? When sending my child to a kindergarten costs the majority of a skilled educator’s salary? When you look at rents? Don’t tell me the problem is labor costs due to increasing demand for real services. Just. Don’t.

Some technological innovations remain permitted for now, and many of the organizations exploiting this are relatively new and reliant on object-level work, and thus less maze-like for now, but this is sufficiently narrow that we call the result “the tech industry.” We see rapid progress in the few places where innovation and actual work is permitted to those without mazes and connections, and where there is sufficient motivation for work, either intrinsic or monetary. The tech industry also exhibits some very maze-like behaviors of its own, but it takes a different form. I am unlikely to be the best person to tackle those details, as others have better direct experience, and I will not attempt to tackle them here and now. We see very little everywhere else.

Increasingly we live in an amalgamated giant maze, and the maze is paralyzing us and taking away our ability to think or talk while robbing us blind. Mazes are increasingly in direct position to censor, deplatform or punish us, even if we do not work for them. The idea of positive-sum, object-level interactions being someone’s primary source of income is increasingly seen as illegitimate, and risky and irresponsible, in contrast to working for a maze.
People instinctively think there’s something shady or rebellious about that entire enterprise of having an actual enterprise. A proper person seeks rent, plays the game starting in childhood, sends the right signals and finds ways to game the system. They increase their appeal to mazes by making themselves as dependent on them and their income and legitimacy streams, and as vulnerable to their blackmail, as possible.

The best way to see that positive-sum games are a thing is to notice that the sum changes. If everything is zero-sum, the sum would always be zero. The best way to see that these dynamics used to be much less severe, at least in many times and places, is that those times and places looked and felt different, and got us here without collapsing. Moral Mazes was written before I was born, but the spread of these dynamics is clear as day within my lifetime, and yours as well.

Did some times and places, including our recent past, have it less bad than us in these ways? I see this as almost certainly true, but I am uncertain of the magnitude of this effect due to not having good enough models of the past. Did some times and places have it worse than we do now? Very possible. But they’re not around anymore. Which is how it works.

The next section will ask why it was different in the past, what the causes are in general, and whether we can duplicate past conditions in good ways.

Discuss

### Book Review—The Origins of Unfairness: Social Categories and Cultural Evolution

21 января, 2020 - 09:28
Published on January 21, 2020 6:28 AM UTC

On my secret ("secret") blog, I reviewed a book about the cultural evolutionary game theory of gender! I thought I'd share the link on this website, because you guys probably like game theory??

(~2400 words)

Discuss

### Whipped Cream vs Fancy Butter

21 января, 2020 - 03:30
Published on January 21, 2020 12:30 AM UTC

Epistemic status: Jeff missing the point

The supermarket sells various kinds of fancy butter, but why don't people eat whipped cream instead? Let's normalize to 100 calorie servings and compare prices:

• Plain Butter, store brand: $0.10
• Heavy Whipping Cream, store brand: $0.20
• Fancy butter, Kerrygold brand: $0.30
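The normalization behind those numbers is simple arithmetic. A quick sketch with assumed inputs: the calorie densities are typical label values, and the prices are placeholders I chose to roughly reproduce the figures above, not the post's actual data.

```python
# Price per 100-calorie serving, from assumed per-gram prices and
# typical label calorie densities (both are illustrative stand-ins).
foods = {
    # name: (price per gram in USD, calories per gram)
    "plain butter, store brand": (0.0075, 7.17),
    "heavy whipping cream":      (0.0070, 3.40),
    "Kerrygold butter":          (0.0220, 7.17),
}

def price_per_100_cal(price_per_g, cal_per_g):
    grams_per_100_cal = 100.0 / cal_per_g
    return price_per_g * grams_per_100_cal

for name, (price, cals) in foods.items():
    print(f"{name}: ${price_per_100_cal(price, cals):.2f} per 100 cal")
```

Note that cream is much less calorie-dense than butter, so even a cheaper per-gram price ends up costlier per calorie.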

Perhaps the reason people don't normally use whipped cream is that whipping it is too much trouble? If you use a manual eggbeater in a standard sixteen-ounce deli cup, it takes about fifteen seconds (youtube) for a serving.

Alternatively, maybe people think whipped cream has to have sugar in it? This one is simple: whipped cream should not have sugar in it. If you're eating whipped cream on something sweet it doesn't need sugar because the other thing is sweet, while if you're having it on something savory it doesn't need sugar because that would taste funny.

I'm sure I'm missing something, but I'm very happy over here eating whipped cream.

Discuss

### Inner alignment requires making assumptions about human values

20 января, 2020 - 21:38
Published on January 20, 2020 6:38 PM UTC

Many approaches to AI alignment require making assumptions about what humans want. On a first pass, it might appear that inner alignment is a sub-component of AI alignment that doesn't require making these assumptions. This is because if we define the problem of inner alignment to be the problem of how to train an AI to be aligned with arbitrary reward functions, then a solution would presumably have no dependence on any particular reward function. We could imagine an alien civilization solving the same problem, despite using very different reward functions to train their AIs.

Unfortunately, the above argument fails because aligning an AI with our values requires giving the AI extra information that is not encoded directly in the reward function (under reasonable assumptions). The argument for my thesis is subtle, and so I will break it into pieces.

First, I will more fully elaborate what I mean by inner alignment. Then I will argue that the definition implies that we can't come up with a full solution without some dependence on human values. Finally, I will provide an example, in order to make this discussion less abstract.

Characterizing inner alignment

In the last few posts I wrote (1, 2), I attempted to frame the problem of inner alignment in a way that wasn't too theory-laden. My concern was that the previous characterization was dependent on a particular type of solution, where you have an AI that uses an explicit outer loop to evaluate strategies based on an explicit internal search.

In the absence of an explicit internal objective function, it is difficult to formally define whether an agent is "aligned" with the reward function that is used to train it. We might therefore define alignment as the ability of our agent to perform well on the test distribution. However, if the test set is sampled from the same distribution as the training data, this definition is equivalent to the performance of a model in standard machine learning, and we haven't actually defined the problem in a way that adds clarity.

What we really care about is whether our agent performs well on a test distribution that doesn't match the training environment. In particular, we care about the agent's performance during real-world deployment. We can estimate this real-world performance ahead of time by giving the agent a test distribution that was artificially selected to emphasize important aspects of the real world more closely than the training distribution (e.g. by using relaxed adversarial training).

To distinguish the typical robustness problem from inner alignment, we evaluate the agent on this testing distribution by observing its behaviors and evaluating it very negatively if it does something catastrophic (defined as something so bad we'd prefer it to fail completely). This information is used to iterate on future versions of the agent. An inner aligned agent is therefore defined as an agent that avoids catastrophes during testing.

The reward function doesn't provide enough information

Since reward functions are defined as mappings from state-action pairs to real numbers, our agent doesn't actually have enough information from the reward function alone to infer what good performance means on the test. This is because the test distribution contains states that were not available in the training distribution.

Therefore, no matter how much the agent learns about the true reward function during training, it must perform some implicit extrapolation of the reward function to what we intended, in order to perform well on the test we gave it.

We can visualize this extrapolation as if we were asking a supervised learner what it predicts for inputs beyond the range it was provided in its training set. It is forced to guess which rule determines what the function looks like outside that range.
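As a toy illustration of that forced guess (my own construction, not from the post): two hypotheses can agree on every training input yet extrapolate very differently, so the training data alone cannot determine which one matches our intent.

```python
# Two hypotheses that agree exactly on every training input but extrapolate
# very differently (a deliberately constructed toy example).
train_xs = [0.0, 0.5, 1.0]

def h1(x):
    return x  # the "intended" rule

def h2(x):
    # identical to h1 on the training inputs: the extra term vanishes there
    return x + 10.0 * x * (x - 0.5) * (x - 1.0)

assert all(h1(x) == h2(x) for x in train_xs)  # same training behaviour

deploy_x = 3.0
print(h1(deploy_x), h2(deploy_x))  # 3.0 vs 153.0: the data can't pick a winner
```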

One might assume that we could just use simplicity as the criterion for extrapolation. Perhaps we could just say, formally, the simplest possible reward function that encodes the values observed during training is the "true reward" function that we will use to test the agent. Then the problem of inner alignment reduces to the problem of creating an agent that is able to infer the true reward function from data, and then perform well according to it inside general environments. Framing the problem like this would minimize dependence on human values.

There are a number of problems with that framing, however. To start, there are boring problems associated with using simplicity to extrapolate the reward function, such as the fact that one's notion of simplicity is language-dependent rather than universal, and that the universal prior is malign. Beyond these (arguably minor) issues, there's a deeper issue, which forces us to make assumptions about human values in order to ensure inner alignment.

It's also important to note that if we actually did provide the agent with the exact same data during training as it would experience during deployment, this is equivalent to simply letting the agent learn in the real world, and there would be no difference between training and testing. Since we normally assume providing such a perfect environment is either impossible or unsafe, the considerations in that case become quite different.

An example

I worry my discussion was a bit abstract to be useful, so I'll provide a specific example to show where my thinking lies. Consider the lunar lander example that I provided in the last post.

To reiterate, we train an agent to land on a landing pad, but during training there is a perfect correlation between whether a landing pad is painted red and whether it is a real landing pad.

During deployment, if the "true" factor that determined whether a patch of ground is a landing pad was whether it is enclosed by flags, and some faraway crater is painted red, then the agent might veer off into the crater rather than landing on the landing pad.

Since there is literally not enough information during training to infer which property correctly determines whether a patch of ground is a landing pad, the agent is forced to infer whether it's the flags or the red paint. It's not exactly clear what the "simplest" inference is here, but it's coherent to imagine that "red paint determines whether something is a landing pad" is the simplest inference.
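A minimal sketch of that underdetermination (a hypothetical encoding, not the actual lunar lander setup): represent each site by two binary features that are perfectly correlated during training, and note that both candidate rules fit the training data while disagreeing at deployment.

```python
# Each site is (is_red, has_flags); during training the two features are
# perfectly correlated with the label "is a landing pad".
train_sites = [
    ((1, 1), True),   # red and flagged -> landing pad
    ((1, 1), True),
    ((0, 0), False),  # neither -> not a pad
    ((0, 0), False),
]

rule_red = lambda site: site[0] == 1    # "pads are whatever is painted red"
rule_flags = lambda site: site[1] == 1  # "pads are whatever is flagged"

# Both rules achieve perfect training accuracy...
for site, label in train_sites:
    assert rule_red(site) == label and rule_flags(site) == label

# ...but disagree on a deployment site where the correlation breaks:
red_crater = (1, 0)   # painted red, no flags
print(rule_red(red_crater), rule_flags(red_crater))  # True vs False
```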

As humans, we might have a preference for the flags being the true determinant, since that resonates more with what we think a landing pad should be, and whether something is painted red is not nearly as compelling to us.

The important point is to notice that our judgement here is determined by our preferences, and not something the agent could have learned during training using some value-neutral inferences. The agent must make further assumptions about human preferences for it to consistently perform well during testing.

1. You might wonder whether we could define catastrophe in a completely value-independent way, sidestepping this whole issue. This is the approach implicitly assumed by impact measures. However, if we want to avoid all types of situations where we'd prefer the system fail completely, I think this will require a broader notion of catastrophe than "something with a large impact." Furthermore, we would not want to penalize systems for having a large positive impact.

Discuss

### [AN #82 errata]: How OpenAI Five distributed their training computation

20 января, 2020 - 21:30
Published on January 20, 2020 6:30 PM UTC

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

Dota 2 with Large Scale Deep Reinforcement Learning (OpenAI et al) (summarized by Nicholas): In April, OpenAI Five (AN #54) defeated the world champion Dota 2 team, OG. This paper describes its training process. OpenAI et al. hand-engineered the reward function as well as some features, actions, and parts of the policy. The rest of the policy was trained using PPO with an LSTM architecture at a massive scale. They trained this in a distributed fashion as follows:

- The Controller receives and distributes the updated parameters.

- The Rollout Worker CPUs simulate the game, send observations to the Forward Pass GPUs and publish samples to the Experience Buffer.

- The Forward Pass GPUs determine the actions to use and send them to the Rollout Workers.

- The Optimizer GPUs sample experience from the Experience Buffer, calculate gradient updates, and then publish updated parameters to the Controller.
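The data flow above can be sketched as a toy, single-process loop (all names, shapes and the "gradient" are illustrative stand-ins of mine, not OpenAI's actual system).

```python
import random
from collections import deque

random.seed(0)

params = 0.0                        # Controller state: current policy parameters
experience_buffer = deque(maxlen=100)

def forward_pass(params, obs):
    """Pick an action from the current policy (Forward Pass GPU role)."""
    return params * obs

def rollout_worker(params):
    """Simulate a game step and publish a sample (Rollout Worker CPU role)."""
    obs = random.random()
    action = forward_pass(params, obs)
    reward = 1.0 - abs(action - obs)          # made-up reward: match the obs
    experience_buffer.append((obs, action, reward))

def optimizer_step(params):
    """Sample experience, compute a (fake) gradient update (Optimizer GPU role)."""
    obs, action, reward = random.choice(experience_buffer)
    grad = (obs - action) * obs               # nudge the policy toward action == obs
    return params + 0.5 * grad                # "publish" new params to the Controller

for _ in range(200):                          # the roles run concurrently in reality
    rollout_worker(params)
    params = optimizer_step(params)

print(round(params, 2))  # drifts toward the fixed point where action == obs
```

Sampling stale experience from the buffer, as above, is why the real system needs off-policy corrections and careful buffer management at scale.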

The model trained over 296 days. In that time, OpenAI needed to adapt it to changes in the code and game mechanics. This was done via model “surgery”, in which they would try to initialize a new model to maintain the same input-output mapping as the old one. When this was not possible, they gradually increased the proportion of games played with the new version over time.

Nicholas's opinion: I feel similarly to my opinion on AlphaStar (AN #73) here. The result is definitely impressive and a major step up in complexity from shorter, discrete games like chess or go. However, I don’t see how the approach of just running PPO at a large scale brings us closer to AGI because we can’t run massively parallel simulations of real world tasks. Even for tasks that can be simulated, this seems prohibitively expensive for most use cases (I couldn’t find the exact costs, but I’d estimate this model cost tens of millions of dollars). I’d be quite excited to see an example of deep RL being used for a complex real world task without training in simulation.

Technical AI alignment   Technical agendas and prioritization

Just Imitate Humans? (Michael Cohen) (summarized by Rohin): This post asks whether it is safe to build AI systems that just imitate humans. The comments have a lot of interesting debate.

Agent foundations

Conceptual Problems with UDT and Policy Selection (Abram Demski) (summarized by Rohin): In Updateless Decision Theory (UDT), the agent decides "at the beginning of time" exactly how it will respond to every possible sequence of observations it could face, so as to maximize the expected value it gets with respect to its prior over how the world evolves. It is updateless because it decides ahead of time how it will respond to evidence, rather than updating once it sees the evidence. This works well when the agent can consider the full environment and react to it, and often gets the right result even when the environment can model the agent (as in Newcomblike problems), as long as the agent knows how the environment will model it.

However, it seems unlikely that UDT will generalize to logical uncertainty and multiagent settings. Logical uncertainty occurs when you haven't computed all the consequences of your actions and is reduced by thinking longer. However, this effectively is a form of updating, whereas UDT tries to know everything upfront and never update, and so it seems hard to make it compatible with logical uncertainty. With multiagent scenarios, the issue is that UDT wants to decide on its policy "before" any other policies, which may not always be possible, e.g. if another agent is also using UDT. The philosophy behind UDT is to figure out how you will respond to everything ahead of time; as a result, UDT aims to precommit to strategies assuming that other agents will respond to its commitments; so two UDT agents are effectively "racing" to make their commitments as fast as possible, reducing the time taken to consider those commitments as much as possible. This seems like a bad recipe if we want UDT agents to work well with each other.
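The "decide your whole policy at the beginning of time" idea can be sketched in a few lines (a toy illustration of policy selection under a prior; it deliberately omits the Newcomblike and logical-uncertainty features that make UDT interesting, and all payoffs are made up).

```python
from itertools import product

observations = ["heads", "tails"]
actions = ["bet", "pass"]
prior = {"heads": 0.5, "tails": 0.5}

def utility(world, action):
    # made-up payoffs: betting pays off only in the "heads" world
    return {("heads", "bet"): 10, ("tails", "bet"): -1}.get((world, action), 0)

def expected_utility(policy):
    # the agent observes the true world, then acts per its precommitted policy
    return sum(prior[w] * utility(w, policy[w]) for w in observations)

# Enumerate every deterministic observation->action mapping and commit to
# the best one "at the beginning of time" -- no updating on evidence later.
policies = [dict(zip(observations, acts))
            for acts in product(actions, repeat=len(observations))]
best = max(policies, key=expected_utility)
print(best)  # bet when the observation is "heads", pass otherwise
```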

Rohin's opinion: I am no expert in decision theory, but these objections seem quite strong and convincing to me.

A Critique of Functional Decision Theory (Will MacAskill) (summarized by Rohin): This summary is more editorialized than most. This post critiques Functional Decision Theory (FDT). I'm not going to go into detail, but I think the arguments basically fall into two camps. First, there are situations in which there is no uncertainty about the consequences of actions, and yet FDT chooses actions that do not have the highest utility, because of their impact on counterfactual worlds which "could have happened" (but ultimately, the agent is just leaving utility on the table). Second, FDT relies on the ability to tell when someone is "running an algorithm that is similar to you", or is "logically correlated with you". But there's no such crisp concept, and this leads to all sorts of problems with FDT as a decision theory.

Rohin's opinion: Like Buck from MIRI, I feel like I understand these objections and disagree with them. On the first argument, I agree with Abram that a decision should be evaluated based on how well the agent performs with respect to the probability distribution used to define the problem; FDT only performs badly if you evaluate on a decision problem produced by conditioning on a highly improbable event. On the second class of arguments, I certainly agree that there isn't (yet) a crisp concept for "logical similarity"; however, I would be shocked if the intuitive concept of logical similarity was not relevant in the general way that FDT suggests. If your goal is to hardcode FDT into an AI agent, or your goal is to write down a decision theory that in principle (e.g. with infinite computation) defines the correct action, then it's certainly a problem that we have no crisp definition yet. However, FDT can still be useful for getting more clarity on how one ought to reason, without providing a full definition.

Learning human intent

Learning to Imitate Human Demonstrations via CycleGAN (Laura Smith et al) (summarized by Zach): Most methods for imitation learning, where robots learn from a demonstration, assume that the actions of the demonstrator and robot are the same. This means that expensive techniques such as teleoperation have to be used to generate demonstrations. This paper presents a method to engage in automated visual instruction-following with demonstrations (AVID) that works by translating video demonstrations done by a human into demonstrations done by a robot. To do this, the authors use CycleGAN, a method to translate an image from one domain to another domain using unpaired images as training data. CycleGAN allows them to translate videos of humans performing the task into videos of the robot performing the task, which the robot can then imitate. In order to make learning tractable, the demonstrations had to be divided up into 'key stages' so that the robot can learn a sequence of more manageable tasks. In this setup, the robot only needs supervision to ensure that it's copying each stage properly before moving on to the next one. To test the method, the authors have the robot retrieve a coffee cup and make coffee. AVID significantly outperforms other imitation learning methods and can achieve 70% / 80% success rate on the tasks, respectively.

Zach's opinion: In general, I like the idea of 'translating' demonstrations from one domain into another. It's worth noting that there do exist methods for translating visual demonstrations into latent policies. I'm a bit surprised that we didn't see any comparisons with other adversarial methods like GAIfO, but I understand that those methods have high sample complexity so perhaps the methods weren't useful in this context. It's also important to note that these other methods would still require demonstration translation. Another criticism is that AVID is not fully autonomous since it relies on human feedback to progress between stages. However, compared to kinetic teaching or teleoperation, sparse feedback from a human overseer is a minor inconvenience.

When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors (Stuart Armstrong) (summarized by Flo): Suppose we were uncertain about which arm in a bandit provides reward (and we don’t get to observe the rewards after choosing an arm). Then, maximizing expected value under this uncertainty is equivalent to picking the most likely reward function as a proxy reward and optimizing that; Goodhart’s law doesn’t apply and is thus not universal. This means that our fear of Goodhart effects is actually informed by more specific intuitions about the structure of our preferences. If there are actions that contribute to multiple possible rewards, optimizing the most likely reward does not need to maximize the expected reward. Even if we optimize for that, we have a problem if value is complex and the way we do reward learning implicitly penalizes complexity. Another problem arises if the correct reward is comparatively difficult to optimize: if we want to maximize the average, it can make sense to only care about rewards that are both likely and easy to optimize. Relatedly, we could fail to correctly account for diminishing marginal returns in some of the rewards.
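A minimal version of the bandit argument (my own construction): with mutually exclusive reward hypotheses, optimizing the most likely proxy and maximizing expected value coincide, and the equivalence breaks as soon as one action contributes to several possible rewards.

```python
# We are unsure which single arm yields reward, never observe outcomes,
# and each arm only pays off under its own hypothesis.
p_reward = {"A": 0.5, "B": 0.3, "C": 0.2}   # belief over "arm X is the good one"

def expected_value(arm):
    # pulling an arm pays 1 exactly in the world where that arm is good
    return p_reward[arm]

most_likely_proxy = max(p_reward, key=p_reward.get)
ev_maximizer = max(p_reward, key=expected_value)
assert most_likely_proxy == ev_maximizer == "A"   # no Goodhart divergence here

# Now add a hypothetical "hedge" arm that pays 1.2 in both the B-world
# and the C-world -- an action serving multiple possible rewards.
def expected_value_shared(arm):
    if arm == "hedge":
        return 1.2 * (p_reward["B"] + p_reward["C"])   # 1.2 * 0.5 = 0.6
    return p_reward[arm]

arms = ["A", "B", "C", "hedge"]
print(max(arms, key=expected_value_shared))   # "hedge" (EV 0.6) beats proxy "A" (0.5)
```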

Goodhart effects are a lot less problematic if we can deal with all of the mentioned factors. Independent of that, Goodhart effects are most problematic when there is little middle ground that all rewards can agree on.

Flo's opinion: I enjoyed this article and the proposed factors match my intuitions. There seem to be two types of problems: extreme beliefs and concave Pareto boundaries. Dealing with the second is more important since a concave Pareto boundary favours extreme policies, even for moderate beliefs. Luckily, diminishing returns can be used to bend the Pareto boundary. However, I expect it to be hard to find the correct rate of diminishing returns, especially in novel situations.

Rohin's opinion: Note that this post considers the setting where we have uncertainty over the true reward function, but we can't learn about the true reward function. If you can gather information about the true reward function, which seems necessary to me (AN #41), then it is almost always worse to take the most likely reward or expected reward as a proxy reward to optimize.

Robustness

AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty (Dan Hendrycks, Norman Mu et al) (summarized by Dan H): This paper introduces a data augmentation technique to improve robustness and uncertainty estimates. The idea is to take various random augmentations such as random rotations, produce several augmented versions of an image with compositions of random augmentations, and then pool the augmented images into a single image by way of an elementwise convex combination. Said another way, the image is augmented with various traditional augmentations, and these augmented images are “averaged” together. This produces highly diverse augmentations that have similarity to the original image. Unlike techniques such as AutoAugment, this augmentation technique uses typical resources, not 15,000 GPU hours. It also greatly improves generalization to unforeseen corruptions, and it makes models more stable under small perturbations. Most importantly, even as the distribution shifts and accuracy decreases, this technique produces models that can remain calibrated under distributional shift.
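The mixing step can be sketched in a few lines of numpy. This is a simplified sketch: the stand-in augmentations, chain count, and mixing distributions are my assumptions, and the real AugMix also adds a Jensen-Shannon consistency loss not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

def rotate(img):   # stand-in augmentations; real ones come from image ops
    return np.rot90(img)

def flip(img):
    return np.fliplr(img)

def augmix(img, chains=3, depth=2):
    """Mix several randomly composed augmentation chains into one image."""
    ops = [rotate, flip]
    weights = rng.dirichlet(np.ones(chains))     # convex combination weights
    mixed = np.zeros_like(img, dtype=float)
    for w in weights:
        aug = img.astype(float)
        for op in rng.choice(ops, size=depth):   # random composition of ops
            aug = op(aug)
        mixed += w * aug                         # elementwise convex combination
    m = rng.beta(1.0, 1.0)                       # interpolate back toward original
    return m * img + (1 - m) * mixed

img = rng.random((8, 8))
out = augmix(img)
print(out.shape)
```

Because every step is a convex combination of valid images, the output stays in the original value range while being a diverse blend of augmentations.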

Miscellaneous (Alignment)

Defining and Unpacking Transformative AI (Ross Gruetzemacher et al) (summarized by Flo): The notion of transformative AI (TAI) is used to highlight that even narrow AI systems can have large impacts on society. This paper offers a clearer definition of TAI and distinguishes it from radical transformative AI (RTAI).

Discontinuities or other anomalous patterns in metrics of human progress, as well as irreversibility, are common indicators of transformative change. TAI is then broadly defined as an AI technology that leads to an irreversible change of some important aspects of society, making it a (multi-dimensional) spectrum along the axes of extremity, generality and fundamentality. For example, advanced AI weapon systems might have strong implications for great power conflicts but limited effects on people's daily lives; extreme change of limited generality, similar to nuclear weapons. There are two levels: while TAI is comparable to general-purpose technologies (GPTs) like the internal combustion engine, RTAI leads to changes that are comparable to the agricultural or industrial revolution. Both revolutions have been driven by GPTs like the domestication of plants and the steam engine. Similarly, we will likely see TAI before RTAI. The scenario where we don't is termed a radical shift.

Non-radical TAI could still contribute to existential risk in conjunction with other factors. Furthermore, if TAI precedes RTAI, our management of TAI can affect the risks RTAI will pose.

Flo's opinion: Focusing on the impacts on society instead of specific features of AI systems makes sense, and I do believe that the shape of RTAI as well as the risks it poses will depend on the way we handle TAI at various levels. More precise terminology can also help to prevent misunderstandings, for example between people forecasting AI and decision makers.

Six AI Risk/Strategy Ideas (Wei Dai) (summarized by Rohin): This post briefly presents three ways that power can become centralized in a world with Comprehensive AI Services (AN #40), argues that under risk aversion "logical" risks can be more concerning than physical risks because they are more correlated, proposes combining human imitations and oracles to remove the human in the loop and become competitive, and suggests doing research to generate evidence of difficulty of a particular strand of research.


Discuss

### WORKSHOP ON ASSURED AUTONOMOUS SYSTEMS (WAAS)

20 января, 2020 - 19:21
Published on January 20, 2020 4:21 PM UTC

This may be of interest to people interested in AI Safety. This event is part of the 2020 IEEE Symposium on Security and Privacy and is being sponsored by the Johns Hopkins University Institute for Assured Autonomy.

https://www.ieee-security.org/TC/SPW2020/WAAS/

Discuss

### Why Do You Keep Having This Problem?

20 января, 2020 - 11:33
Published on January 20, 2020 8:33 AM UTC

One thing I've noticed recently is that when someone complains about how a certain issue "just keeps happening" or they "keep having to deal with it", it often seems to indicate an unsolved problem that people may not be aware of. Some examples:

• Players of a game repeatedly ask the same rules questions to the judges at an event. This doesn't mean everyone is bad at reading -- it likely indicates an area of the rules that is unclear or misleadingly written.
• People keep trying to open a door the wrong way, either pulling on a door that's supposed to be pushed or pushing a door that's supposed to be pulled -- it's quite possible the handle has been designed poorly in a way that gives people the wrong idea of how to use it. (The Design of Everyday Things has more examples of this sort of issue.)
• Someone keeps hearing the same type of complaint or having the same conversation about a particular policy at work -- this might be a sign that the policy has issues. [1]
• Every time someone tries to moderate a forum they run, lots of users protest against their actions and call it unjust; this might be a sign that they're making bad moderation decisions.

I'm not going to say that all such cases are ones where things should change -- it's certainly possible that one might have to take unpopular but necessary measures under some circumstances -- but I do think that this sort of thing should be a pretty clear warning sign that things might be going wrong.

Thus, I suspect you should consider these sorts of patterns not just as "some funny thing that keeps happening" or whatever, but rather as potential indicators of "bugs" to be corrected!

[1] This post was primarily inspired by a situation in which I saw someone write "This is the fifth time I've had this conversation in the last 24 hours and I'm sick of it" or words to that effect -- the reason they had kept having that conversation, at least in my view, was because they were implementing a bad policy and people kept questioning them on it (with perhaps varying degrees of politeness).

Discuss

### Dunning Kruger vs. Double Descent

20 января, 2020 - 05:57
Published on January 20, 2020 2:57 AM UTC

The graphs for the Dunning-Kruger effect and the double descent effect (summary) are eerily similar. There's a Twitter account by a research scientist at Google Brain who posted an observation similar to the one I'm writing about. I didn't have much luck finding anything else related to this, but I do think there is something here worth further investigation. This is basically an extended shower thought.

In the double descent effect, training loss goes down to zero. The agent thinks it will perform better on the test set than it actually does. However, there's a regime where the agent has learned just enough to do well on the training set, but not enough to avoid memorizing the data set. One hypothesis for this is that there's a regime where the model has just enough power to exactly fit the training data, but no more. In this regime, the model will necessarily fit noise. However, as we further increase the power of the model, the number of ways it can fit the data increases dramatically. This means that we'll most likely select a model at random, which means we won't fit noise.
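The capacity story above can be reproduced in miniature with minimum-norm linear regression on random Fourier features (my own toy setup; the task, widths, and noise level are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)

# Random-feature regression: n_train noisy samples, varying model width.
n_train, n_test = 20, 200
x_tr = rng.uniform(-1, 1, n_train)
x_te = rng.uniform(-1, 1, n_test)
y_tr = np.sin(3 * x_tr) + rng.normal(0, 0.1, n_train)
y_te = np.sin(3 * x_te)

def features(x, freqs, phases):
    # random Fourier features: one column per (frequency, phase) pair
    return np.cos(np.outer(x, freqs) + phases)

errors = {}
for width in [5, 20, 200]:          # under-, exactly-, over-parameterized
    freqs = rng.normal(0, 3, width)
    phases = rng.uniform(0, 2 * np.pi, width)
    X_tr = features(x_tr, freqs, phases)
    # minimum-norm least squares: what gradient descent converges to
    # for over-parameterized linear models
    w = np.linalg.pinv(X_tr) @ y_tr
    train_mse = np.mean((X_tr @ w - y_tr) ** 2)
    test_mse = np.mean((features(x_te, freqs, phases) @ w - y_te) ** 2)
    errors[width] = (train_mse, test_mse)

# At width == n_train the model can just barely interpolate (train error
# ~ 0, fitting the noise exactly) and test error typically spikes; at
# larger widths the minimum-norm solution is smoother and test error
# often falls again -- the second descent.
for width, (tr, te) in errors.items():
    print(width, round(tr, 6), round(te, 3))
```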

In the Dunning Kruger effect people assess their knowledge as higher than it actually is. Before we learn anything, we have low confidence and low performance. When we start learning our confidence on the 'train data' goes up, but as we approach the point where we've learned all the training data we fit noise present in the learning materials given to us. We're over-fitted. However, as we continue to expand our capacity to think about our learning material, we realize that there are many ways to learn the material. We start to regularize and end up performing better justifying an increase in our confidence.

Discuss