LessWrong.com News
A community blog devoted to refining the art of rationality

[AN #159]: Building agents that know how to experiment, by training on procedurally generated games

Published on August 4, 2021 5:10 PM GMT

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

SECTIONS
HIGHLIGHTS
NEAR-TERM CONCERNS
        RECOMMENDER SYSTEMS
AI GOVERNANCE
OTHER PROGRESS IN AI
        MULTIAGENT RL
        DEEP LEARNING
NEWS

HIGHLIGHTS

Generally capable agents emerge from open-ended play (Open-Ended Learning Team et al) (summarized by Zach): Artificial intelligence agents have become successful at games when trained for each game separately. However, it has proven challenging to build agents that can play previously unseen games. This paper makes progress on this challenge in three primary areas: creating rich simulated environments and tasks, training agents with attention mechanisms over internal states, and evaluating agents over a variety of games. The authors show that agents trained with goal-based attention in their proposed environment (XLand) succeed at a range of novel, unseen tasks with no additional training required. Moreover, such agents appear to use general tactics such as decision-making, tool use, and experimentation during game-play episodes.

The authors argue that training-data generation is a central challenge to training general RL agents (an argument we’ve seen before with POET (AN #41) and PAIRED (AN #136)). They propose the training environment XLand to address this. XLand includes many multiplayer games within consistent, human-relatable 3D worlds and allows for dynamic agent learning through the procedural generation of tasks, which are split into three components: world, agents, and goals. The inclusion of other agents makes this a partially observable environment. Goals are defined with Boolean formulas. Each goal is a combination of options and every option is a combination of atomic predicates. For example, in hide-and-seek one player has the goal see(me,opponent) and the other player not(see(opponent,me)). The space of worlds and games is shown to be both vast and smooth, which supports training.
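To make the goal structure concrete, here is a minimal Python sketch (not the paper's implementation) of checking such a Boolean goal against a world state; the state encoding and helper names are illustrative assumptions.

```python
from typing import Callable, List

Predicate = Callable[[dict], bool]  # a predicate maps a world state to True/False

def option_satisfied(option: List[Predicate], state: dict) -> bool:
    # An option is a conjunction: every atomic predicate must hold.
    return all(pred(state) for pred in option)

def goal_satisfied(goal: List[List[Predicate]], state: dict) -> bool:
    # A goal is a disjunction of options: at least one option must hold.
    return any(option_satisfied(option, state) for option in goal)

# Illustrative predicates for hide-and-seek (the state encoding is made up).
def see(a: str, b: str) -> Predicate:
    return lambda state: (a, b) in state["visible_pairs"]

def negate(pred: Predicate) -> Predicate:
    return lambda state: not pred(state)

seeker_goal = [[see("me", "opponent")]]          # goal: see(me, opponent)
hider_goal = [[negate(see("opponent", "me"))]]   # goal: not(see(opponent, me))

state = {"visible_pairs": {("me", "opponent")}}
print(goal_satisfied(seeker_goal, state))  # True
print(goal_satisfied(hider_goal, state))   # True: the opponent does not see me
```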

The agents themselves are trained using deep-RL combined with a goal-attention module (GOAT). The per-timestep observations of the agent are ego-centric RGB images, proprioception values indicating forces, and the goal of the agent. The GOAT works by processing this information with a recurrent network and then using a goal-attention module to select hidden states that are most relevant to achieving a high return. This is determined by estimating the expected return if the agent focused on an option until the end of the episode.
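A rough sketch of that selection step, with toy shapes and random stand-in parameters rather than the paper's learned architecture:

```python
# For each option in the goal, attend over the recurrent hidden state, estimate
# the return from pursuing that option, then act on the highest-value option.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, OPTION_DIM, N_OPTIONS = 64, 16, 3

hidden_state = rng.normal(size=HIDDEN)                        # from the recurrent core
option_embeddings = rng.normal(size=(N_OPTIONS, OPTION_DIM))  # encoded goal options

W_attn = rng.normal(size=(OPTION_DIM, HIDDEN)) * 0.1  # toy attention parameters
w_value = rng.normal(size=HIDDEN) * 0.1               # toy value head

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

attended, values = [], []
for emb in option_embeddings:
    weights = softmax(emb @ W_attn)             # attention over hidden units
    state_for_option = weights * hidden_state   # option-conditioned summary
    attended.append(state_for_option)
    values.append(w_value @ state_for_option)   # estimated return for this option

best = int(np.argmax(values))
policy_input = attended[best]  # the policy head would act on this state
print("selected option:", best, "estimated returns:", np.round(values, 3))
```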

As with many other major deep-RL projects, it is important to have a good curriculum, where more and more challenging tasks are introduced over time. The obvious method of choosing tasks with the lowest reward doesn’t work, because the returns from different games are non-comparable. To address this, an iterative notion of improvement is proposed, and scores are given as percentiles relative to a population. This is similar in spirit to the AlphaStar League (AN #43). Following this, game-theoretic notions such as Pareto dominance can be used to compare agents and to determine how challenging a task is, which can then be used to create a curriculum.
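As an illustration of the scoring idea (not DeepMind's code), per-task returns can be converted to population percentiles and agents compared by Pareto dominance over those percentiles:

```python
import numpy as np

def percentile_scores(raw_returns: np.ndarray) -> np.ndarray:
    """raw_returns: (n_agents, n_tasks) -> per-task percentile in [0, 1]."""
    n_agents = raw_returns.shape[0]
    ranks = raw_returns.argsort(axis=0).argsort(axis=0)  # 0 = worst on that task
    return ranks / (n_agents - 1)

def pareto_dominates(a: np.ndarray, b: np.ndarray) -> bool:
    """a dominates b if it is at least as good on every task and better on one."""
    return bool(np.all(a >= b) and np.any(a > b))

returns = np.array([
    [10.0, 0.2, 3.0],   # agent 0
    [12.0, 0.1, 5.0],   # agent 1
    [ 2.0, 0.9, 1.0],   # agent 2
])
scores = percentile_scores(returns)
print(scores)
print(pareto_dominates(scores[1], scores[0]))  # False: better on some tasks, worse on others
```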

Five generations of agents are trained, each of which is used in the next generation to create opponents and relative comparisons for defining the curriculum. Early in training, the authors find that adding an intrinsic reward based on self-play is important to achieve good performance. This encourages agents to achieve non-zero rewards in as many games as possible, which the authors call “participation”. The authors also conduct an ablation study and find that dynamic task generation, population-based methods, and the GOAT module have a significant positive impact on performance.

The agents produced during training have desirable generalization capability. They can compete in games that were not seen before in training. Moreover, fine-tuning dramatically improves the performance of agents in tasks where training from scratch completely fails. A number of case studies are also presented to explore emergent agent behavior. In one experiment, an agent is asked to match a colored shape and another environment feature such as a shape or floor panel. At the start of the episode, the agent decides to carry a black pyramid to an orange floor, but then after seeing a yellow sphere changes options and places the two shapes together. This shows that the agent has robust option evaluation capability. In other experiments, the agents show the capacity to create ramps to move to higher levels in the world environment. Additionally, agents seem capable of experimentation. In one instance, the agent is tasked with producing a specific configuration of differently colored cube objects. The agent demonstrates trial-and-error and goes through several different configurations until it finds one it evaluates highly.

There are limitations to the agent capabilities. While agents can use ramps in certain situations, they fail to use ramps more generally. For example, they frequently fail to use ramps to cross gaps. Additionally, agents generally fail to create more than a single ramp. Agents also struggle to play cooperative games that involve following a co-player, a behavior not seen during training. This suggests that experimentation does not extend to co-player behavior. More broadly, whether or not co-player agents decide to cooperate is dependent on the population the agents interacted with during training. In general, the authors find that agents are more likely to cooperate when both agents have roughly equal performance or capability.



Zach's opinion: This is a fairly complicated paper, but the authors do a reasonable job of organizing the presentation of results. In particular, the analysis of agent behavior and their neural representations is well done. At a higher level, I found it interesting that the authors partially reject the idea of evaluating agents with just expected returns. I broadly agree with the authors that the evaluation of agents across multi-player tasks is an open problem without an immediate solution. With respect to agent capability, I found the section on experimentation to be most interesting. In particular, I look forward to seeing more research on how attention mechanisms catalyze such behavior.

Rohin's opinion: One of my models about deep learning is “Diversity is all you need”. Suppose you’re training for some task for which there’s a relevant feature F (such as the color of the goal pyramid). If F only ever takes on a single value in your training data (you only ever go to yellow pyramids), then the learned model can be specialized to that particular value of F, rather than learning a more general computation that works for arbitrary values of F. Instead, you need F to vary a lot during training (consider pyramids that are yellow, blue, green, red, orange, black, etc) if you want your model to generalize to new values of F at test time. That is, your model will be zero-shot robust to changes in a feature F if and only if your training data was diverse along the axis of feature F. (To be clear, this isn’t literally true, it is more like a first-order main effect.)

Some evidence supporting this model:

- The approach in this paper explicitly has diversity in the objective and the world, and so the resulting model works zero-shot on new objectives of a similar type and can be finetuned quickly.

- In contrast, the similar hide and seek project (AN #65) did not have diversity in the objective, had distinctly less diversity in the world, and instead got diversity from emergent strategies for multiagent interaction (but there were fewer than 10 such strategies). Correspondingly, the resulting agents could not be quickly finetuned.

- My understanding is that in image recognition, models trained on larger, more diverse datasets become significantly more robust.

Based on this model, I would make the following predictions about agents in XLand:

- They will not generalize to objectives that can’t be expressed in the predicate language used at training time, such as “move all the pyramids near each other”. (In some sense this is obvious, since the agents have never seen the word “all” and so can’t know what it means.)

- They will not work in any environment outside of XLand (unless that environment looks very very similar to XLand).

In particular, I reject the idea that these agents have learned “general strategies for problem solving” or something like that, such that we should expect them to work in other contexts as well, perhaps with a little finetuning. I think they have learned general strategies for solving a specific class of games in XLand.

You might get the impression that I don’t like this research. That’s not the case at all — it is interesting and impressive, and it suggests that we could take the same techniques and apply them in broader, more realistic domains where the resulting agents could be economically useful. Rather, I expect my readership to overupdate on this result and think that we’ve now reached agents that can do “general planning” or some such, and I want to push against that.

NEAR-TERM CONCERNS
RECOMMENDER SYSTEMS

How Much Do Recommender Systems Drive Polarization? (Jacob Steinhardt) (summarized by Rohin): There is a common worry that social media (and recommender systems in particular) are responsible for increased polarization in recent years. This post delves into the evidence for this claim. By “polarization”, we mostly mean affective polarization, which measures how positive your feelings towards the opposing party are (though we also sometimes mean issue polarization, which measures the correlation between your opinions on e.g. gun control, abortion, and taxes). The main relevant facts are:

1. Polarization in the US has increased steadily since 1980 (i.e. pre-internet), though arguably there was a slight increase from the trend around 2016.

2. Since 2000, polarization has only increased in some Western countries, even though Internet use has increased relatively uniformly across countries.

3. Polarization in the US has increased most in the 65+ age group (which has the least social media usage).

(2) could be partly explained by social media causing polarization only in two-party systems, and (3) could be explained by saying that social media changed the incentives of more traditional media (such as TV) which then increased polarization in the 65+ age group. Nevertheless, overall it seems like social media is probably not the main driver of increased polarization. Social media may have accelerated the process (for instance by changing the incentives of traditional media), but the data is too noisy to tell one way or the other.



Rohin's opinion: I’m glad to see a simple summary of the evidence we currently have on the effects of social media on polarization. I feel like for the past year or two I’ve constantly heard people speculating about massive harms and even existential risks based on a couple of anecdotes or armchair reasoning, without bothering to check what has actually happened; whenever I talk to someone who seems to have actually studied the topic in depth, it seems they think that there are problems with recommender systems, but they are different from what people usually imagine. (The post also notes a reason to expect our intuitions to be misguided: we are unusual in that we get most of our news online; apparently every age group, starting from 18-24, gets more news from television than online.)

Note that there have been a few pieces arguing for these harms; I haven't sent them out in the newsletter because I don't find them very convincing, but you can find links to some of them along with my thoughts here.



Designing Recommender Systems to Depolarize (Jonathan Stray) (summarized by Rohin): This paper agrees with the post above that “available evidence mostly disfavors the hypothesis that recommender systems are driving polarization through selective exposure, aka ‘filter bubbles’ or ‘echo chambers’”. Nonetheless, social media is a huge part of today’s society, and even if it isn’t actively driving polarization, we can ask whether there are interventions that would decrease polarization. That is the focus of this paper.

It isn’t clear that we should intervene to decrease polarization for a number of reasons:

1. Users may not want to be “depolarized”.

2. Polarization may lead to increased accountability as each side keeps close watch of politicians on the other side. Indeed, in 1950 people used to worry that we weren’t polarized enough.

3. More broadly, we often learn through conflict: you are less likely to see past your own biases if there isn’t someone else pointing out what they are. To the extent that depolarization removes conflict, it may be harmful.

Nonetheless, polarization also has some clear downsides:

1. It can cause gridlock, preventing effective governance.

2. It erodes norms against conflict escalation, leading to outrageous behavior and potentially violence.

3. At current levels, it has effects on all spheres of life, many of which are negative (e.g. harm to relationships across partisan lines).

Indeed, it is plausible that most situations of extreme conflict were caused in part by a positive feedback loop involving polarization; this suggests that reducing polarization could be a very effective method to prevent conflict escalation. Ultimately, it is the escalation and violence that we want to prevent. Thus we should be aiming for interventions that don’t eliminate conflict (as we saw before, conflict is useful), but rather transform it into a version that doesn’t lead to escalation and norm violations.

For this purpose we mostly care about affective polarization, which tells you how people feel about “the other side”. (In contrast, issue polarization is less central, since we don’t want to discourage disagreement on issues.) The paper’s central recommendation is for companies to measure affective polarization and use this as a validation metric to help decide whether a particular change to a recommender system should be deployed or not. (This matches current practice at tech companies, where there are a few high-level metrics that managers use to decide what to deploy, and importantly, the algorithms do not explicitly optimize for those metrics.) Alternatively, we could use reinforcement learning to optimize the affective polarization metric, but in this case we would need to continuously evaluate the metric in order to ensure we don’t fall prey to Goodhart effects.
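A minimal sketch of how such a validation gate might look, with hypothetical metric names and thresholds (the paper does not prescribe a specific implementation):

```python
def should_deploy(change_metrics: dict, control_metrics: dict,
                  max_polarization_increase: float = 0.0,
                  min_engagement_ratio: float = 0.98) -> bool:
    """Deploy a recommender change only if measured affective polarization does
    not rise and core engagement does not regress too much. The metric is a
    deployment gate, not something the ranking algorithm optimizes."""
    polarization_delta = (change_metrics["affective_polarization"]
                          - control_metrics["affective_polarization"])
    engagement_ratio = (change_metrics["engagement"]
                        / control_metrics["engagement"])
    return (polarization_delta <= max_polarization_increase
            and engagement_ratio >= min_engagement_ratio)

control = {"affective_polarization": 0.42, "engagement": 1.00}
candidate = {"affective_polarization": 0.41, "engagement": 0.99}
print(should_deploy(candidate, control))  # True: less polarizing, engagement held up
```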

The paper also discusses potential interventions that could reduce polarization. However, it cautions that they are based on theory or studies with limited ecological validity, and ideally should be checked via the metric suggested above. Nonetheless, here they are:

1. Removing polarizing content. In this case, the polarizing content doesn’t make it to the recommender system at all. This can be done quite well when there are human moderators embedded within a community, but is much harder to do at scale in an automated way.

2. Changing recommendations. The most common suggestion in this category is to increase the diversity of recommended content (i.e. go outside the “filter bubble”). This can succeed, though usually only has a modest effect, and can sometimes have the opposite effect of increasing polarization. A second option is to penalize content for incivility: studies have shown that incivility tends to increase affective polarization. Actively promoting civil content could also help. That being said, it is not clear that we want to prevent people from ever raising their voice.

3. Changing presentation. The interface to the content can be changed to promote better interactions. For example, when the Facebook “like” button was replaced by a “respect” button, people were more likely to “respect” comments they disagreed with.



Rohin's opinion: I really liked this paper as an example of an agenda about how to change recommender systems. It didn’t rely on armchair speculation or anecdotes to determine what problems exist and what changes should be made, and it didn’t just assume that polarization must be bad and instead considered both costs and benefits. The focus on “making conflict healthy” makes a lot of sense to me. I especially appreciated the emphasis on a strategy for evaluating particular changes, rather than pushing for a specific change; specific changes all too often fail once you test them at scale in the real world.

AI GOVERNANCE

Collective Action on Artificial Intelligence: A Primer and Review (Robert de Neufville et al) (summarized by Rohin): This paper reviews much of the work in AI governance (specifically, work on AI races and other collective action problems).

OTHER PROGRESS IN AI
MULTIAGENT RL

Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot (Joel Z. Leibo, Edgar Duéñez-Guzmán, Alexander Sasha Vezhnevets, John P. Agapiou et al) (summarized by Rohin): In supervised learning, the test dataset is different from the training dataset, and thus evaluates how well the learned model generalizes (within distribution). So far, we mostly haven't done this with reinforcement learning: the test environment is typically identical to the training environment. This is because it would be very challenging -- you would have to design a large number of environments and then split them into a train and test set; each environment would take a very long time to create (unlike in, say, image classification, where it takes a few seconds to label an image).

The core insight of this paper is that when evaluating a multiagent RL algorithm, you can get a “force multiplier” by taking a single multiagent environment (called a “substrate”) and “filling in” some of the agent slots with agents that are automatically created using RL to create a “scenario” for evaluation. For example, in the Capture the Flag substrate (AN #14), in one scenario we fill in all but one of the agents using agents trained by A3C, which means that the remaining agent (to be supplied by the algorithm being evaluated) must cooperate with previously-unseen agents on its team, to play against previously-unseen opponents. Scenarios can fall in three main categories:

1. Resident mode: The agents created by the multiagent RL algorithm under evaluation outnumber the background “filled-in” agents. This primarily tests whether the agents created by the multiagent RL algorithm can cooperate with each other, even in the presence of perturbations by a small number of background agents.

2. Visitor mode: The background agents outnumber the agents created by the algorithm under evaluation. This often tests whether the new agents can follow existing norms in the background population.

3. Universalization mode: A single agent is sampled from the algorithm and used to fill all the slots in the substrate, effectively evaluating whether the policy is universalizable.

The authors use this approach to create Melting Pot, a benchmark for evaluating multiagent RL algorithms that can produce populations of agents (i.e. most multiagent RL algorithms). Crucially, the algorithm being evaluated is not permitted to see the agents in any specific scenario in advance; this is thus a test of generalization to new opponents. (It is allowed unlimited access to the substrate.) They use ~20 different substrates and create ~5 scenarios for each substrate, giving a total of ~100 scenarios on which the multiagent RL algorithm can be evaluated. (If you exclude the universalization mode, which doesn’t involve background agents and so may not be a test of generalization, then there are ~80 scenarios.) These cover competitive, collaborative, and mixed-motive scenarios.
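An illustrative sketch of how a scenario can be assembled from a substrate by filling some player slots with fixed background agents; the names and interface here are hypothetical, not the actual Melting Pot API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Scenario:
    substrate: str
    n_players: int
    focal_slots: List[int]   # filled by the algorithm under evaluation
    background_agents: dict  # slot -> pretrained, held-fixed background agent id

def make_scenario(substrate: str, n_players: int, n_focal: int,
                  background_pool: List[str]) -> Scenario:
    focal = list(range(n_focal))
    background = {slot: background_pool[i % len(background_pool)]
                  for i, slot in enumerate(range(n_focal, n_players))}
    return Scenario(substrate, n_players, focal, background)

# Resident mode: focal agents outnumber the background agents.
resident = make_scenario("capture_the_flag", n_players=8, n_focal=6,
                         background_pool=["a3c_bot_1", "a3c_bot_2"])
# Visitor mode: a single focal agent joins an established background population.
visitor = make_scenario("capture_the_flag", n_players=8, n_focal=1,
                        background_pool=["a3c_bot_1", "a3c_bot_2", "a3c_bot_3"])
print(resident)
print(visitor)
```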



Rohin's opinion: You might wonder why Melting Pot is just used for evaluation, rather than as a training suite: given the diversity of scenarios in Melting Pot, shouldn’t you get similar benefits as in the highlighted post? The answer is that there isn’t nearly enough diversity, as there are only ~100 scenarios across ~20 substrates. For comparison, ProcGen (AN #79) requires 500-10,000 levels to get decent generalization performance. Both ProcGen and XLand are more like a single substrate for which we can procedurally generate an unlimited number of scenarios. Both have more diversity than Melting Pot, in a narrower domain; this is why training on XLand or ProcGen can lead to good generalization but you wouldn’t expect the same to occur from training on Melting Pot.

Given that you can’t get the generalization from simply training on something similar to Melting Pot, the generalization capability will instead have to come from some algorithmic insight or by finding some other way of pretraining agents on a wide diversity of substrates and scenarios. For example, you might figure out a way to procedurally generate a large, diverse set of possible background agents for each of the substrates.

(If you’re wondering why this argument doesn’t apply to supervised learning, it’s because in supervised learning the training set has many thousands or even millions of examples, sampled from the same distribution as the test set, and so you have the necessary diversity for generalization to work.)

DEEP LEARNING

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets (Alethea Power et al) (summarized by Rohin): This paper presents an interesting empirical phenomenon with deep learning: grokking.

Consider tasks of the form “a ◦ b = ?”, where “◦” is some operation on modular arithmetic. For example, in the task of addition mod 97, an example problem would be “32 + 77 = ?”. There are exactly 97 possible operands, each of which gets its own token, and so there are 97^2 possible problems that are defined by pairs of tokens. We will train a neural net on some fraction of all possible problems and then ask how well it performs on the remaining problems it didn’t see: that is, we’re asking it to fill in the missing entries in the 97x97 table that defines addition mod 97.
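For concreteness, here is a toy sketch (not the paper's pipeline) of constructing that dataset and splitting the 97 x 97 table into train and test problems:

```python
import itertools
import random

P = 97
# Every problem is a pair of operand tokens with its answer under addition mod P.
problems = [((a, b), (a + b) % P) for a, b in itertools.product(range(P), repeat=2)]

random.seed(0)
random.shuffle(problems)
train_fraction = 0.3                      # e.g. train on 30% of all possible problems
split = int(train_fraction * len(problems))
train_set, test_set = problems[:split], problems[split:]

print(len(problems), len(train_set), len(test_set))  # 9409 2822 6587
```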

It turns out that in these cases, the neural net memorizes the training dataset pretty quickly (in around 10^3 updates), at which point it has terrible generalization performance. However, if you continue to train it all the way out to 10^6 updates, then it will often hit a phase transition where you go from random chance to perfect generalization almost immediately. Intuitively, at the point of the phase transition, the network has “grokked” the function and can run it on new inputs as well. Some relevant details about grokking:

1. It isn’t specific to group or ring operations: you also see grokking for tasks like “a/b if b is odd, otherwise a − b”.

2. It is quite sensitive to the choice of hyperparameters, especially learning rate; the learning rate can only vary over about a single order of magnitude.

3. The time till perfect generalization is reduced by weight decay and by adding noise to the optimization process.

4. When you have 25-30% of possible examples as training data, a decrease of 1 percentage point leads to an increase of 40-50% in the median time to generalization.

5. As problems become more intuitively complicated, time till generalization increases (and sometimes generalization doesn’t happen at all). For example, models failed to grok the task x^3 + xy^2 + y (mod 97) even when provided with 95% of the possible examples as training data.

6. Grokking mostly still happens even when adding 1,000 “outliers” (points that could be incorrectly labeled), but mostly stops happening at 2,000 “outliers”.

Read more: Reddit commentary



Rohin's opinion: Another interesting fact about neural net generalization! Like double descent (AN #77), this can’t easily be explained by appealing to the diversity model. I don’t really have a good theory for either of these phenomena, but one guess for grokking is that:

1. Functions that perfectly memorize the data without generalizing (i.e. probability 1 on the true answer and 0 elsewhere) are very complicated, nonlinear, and wonky. The memorizing functions learned by deep learning don’t get all the way there and instead assign a probability of (say) 0.95 to the true answer.

2. The correctly generalizing function is much simpler and for that reason can be easily pushed by deep learning to give a probability of 0.99 to the true answer.

3. Gradient descent quickly gets to a memorizing function, and then moves mostly randomly through the space, but once it hits upon the correctly generalizing function (or something close enough to it), it very quickly becomes confident in it, getting to probability 0.99 and then never moving very much again.

A similar theory could explain deep double descent: the worse your generalization, the more complicated, nonlinear and wonky you are, and so the more you explore to find a better generalizing function. The biggest problem with this theory is that it suggests that making the neural net larger should primarily advantage the memorizing functions, but in practice I expect it will actually advantage the correctly generalizing function. You might be able to rescue the theory by incorporating aspects of the lottery ticket hypothesis (AN #52).

NEWS

Political Economy of Reinforcement Learning (PERLS) Workshop (Stuart Russell et al) (summarized by Rohin): The deadline for submissions to this NeurIPS 2021 workshop is Sep 18. From the website: "The aim of this workshop will be to establish a common language around the state of the art of RL across key societal domains. From this examination, we hope to identify specific interpretive gaps that can be elaborated or filled by members of our community. Our ultimate goal will be to map near-term societal concerns and indicate possible cross-disciplinary avenues towards addressing them."

FEEDBACK
I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST
An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.

Contribution-Adjusted Utility Maximization Funds: An Early Proposal

Published on August 4, 2021 5:09 PM GMT

Right now in Effective Altruism (or Rationality), we have a few donor funds with particular focus areas. In this post I propose a new type of fund that’s instead focused on maximizing the combined utility functions of its particular donors. The fund goals would be something like, “Maximize the combined utility of our donors, adjusted for donation amount, in any way possible, abiding by legal and moral standards.” I think that this sort of fund structure is highly theoretical, but in principle it could meet some particular wants that aren’t currently being met.

For this document, I call these funds "Contribution-Adjusted Utility Maximization Funds", or CAUMFs. This name is intentionally long; this idea is early, and I don't want to pollute the collective namespace.

This fund type has two purposes.

  1. It’s often useful for individuals to coordinate on non-charitable activities. For example, research into best COVID risk measures to be used by a particular community.
  2. These funds should help make it very clear that donations will be marginally valuable for the preferences of the donor. Therefore, donating to these funds should be safe on the margin. Hopefully this would result in more total donations.

You can picture these funds as somewhere between bespoke nonprofit advising institutions, cooperatives, and small governments. If AI and decision automation could cut down on labor costs, related organizations might eventually be much more exciting.




Analysis of World Records in Speedrunning [LINKPOST]

Published on August 4, 2021 3:26 PM GMT

[this is a linkpost to Analysis of World Records in Speedrunning]

TL;DR: I have scraped a database of World Record improvements for fastest videogame completion for several videogames, noted down some observations about the trends of improvement, and attempted to model them with some simple regressions. Reach out if you'd be interested in researching this topic!

Key points
  • I argue that researching speedrunning can help us understand scientific discovery, AI alignment and extremal distributions. More.
  • I’ve scraped a dataset on world record improvements in videogame speedrunning. It spans 15 games, 22 categories and 1462 runs. More.
  • Most world record improvements I studied follow a diminishing returns pattern. Some exhibit successive cascades of improvements, with continuous phases of diminishing returns periodically interrupted by (presumably) sudden discoveries that speed up the rate of progress. More.
  • Simple linear extrapolation techniques could not improve on just guessing that the world record will not change in the near future. More.
  • Possible next steps include trying different extrapolation techniques, modelling the discontinuities in the data and curating a dataset of World Record improvements in Tool Assisted Speedruns. More.

The script to scrape the data and extrapolate it is available here. A snapshot of the data as of 30/07/2021 is available here.
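As a toy illustration of the kind of diminishing-returns modelling described above (not the linked script, and with made-up data), one could fit a record time that decays toward an asymptote:

```python
import numpy as np

# Hypothetical (days since first record, record time in seconds) observations.
days = np.array([0, 30, 90, 200, 400, 800, 1500], dtype=float)
records = np.array([620, 560, 515, 480, 462, 451, 447], dtype=float)

# Model: record(d) ~= floor + gap0 * exp(-rate * d), i.e. diminishing returns
# toward an assumed asymptote. Fit the decaying gap on a log scale.
floor = records.min() - 5.0               # crude assumption about the asymptote
log_gap = np.log(records - floor)
slope, intercept = np.polyfit(days, log_gap, 1)
rate, gap0 = -slope, np.exp(intercept)

predicted_day_2000 = floor + gap0 * np.exp(-rate * 2000)
print(f"fitted decay rate {rate:.4f}/day, "
      f"predicted record at day 2000: {predicted_day_2000:.1f}s")
```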

Figure 1: Speedrunning world record progression for the most popular categories. The horizontal axis is the date of submission and the vertical axis is run time in seconds. Spans 22 categories and 1462 runs. Hidden from the graph are 5 runs before 2011. See the rest of the data here.

Feedback on the project would be appreciated. I am especially keen on discussion about:

  1. Differences and commonalities to expect between speedrunning and technological improvement in different fields.
  2. Discussion on how to mathematically model the discontinuities in the data.
  3. Ideas on which techniques to prioritize to extrapolate the observed trends.



Decision-Making Training for Software Engineering

Published on August 4, 2021 10:36 AM GMT

I recently read this post that discusses the author’s experience leading military decision-making trainings for ROTC cadets during the pandemic. I’m going to briefly discuss what those decision-making trainings looked like, and how the principles could be adapted to teach software engineering effectively.

To quote from that article, Tactical Decision Games (TDGs) are “deceptively simple military decision-making exercises usually consisting of no more than a map and a few paragraphs of text describing a situation. Students are placed in the role of the commander of a unit with a mission and a specified set of resources. You have some information about the enemy, but not as much as you’d like. Then something unexpected happens, upending the situation and requiring you to come up with a new action plan on the spot. Then, after issuing your new orders, you must explain your assessment of the new situation and the rationale behind your decision.”

So, you are given some initial information and make a guess on what to do based on that information. Then you are given more information and you have to figure out something new to do. That sounded a lot to me like watching software requirements change over time, and getting to see whether your initial data structures and code shape hold up to changes.

To adapt these trainings for software engineering, you could imagine a programmer being given a set of requirements and implementing code for those requirements. Then, they have to either add or change functionality. This will lay bare any issues with code maintainability in their initial approach. Run enough games like this, possibly paired with suggestions for designs that clearly would have been easier to refactor or change, and programmers will learn how to write more maintainable code. I think this would be an excellent addition to any college “computer science” curriculum that wants to teach software engineering, and it would offer a high amount of value for the effort it takes.




Garrabrant and Shah on human modeling in AGI

Published on August 4, 2021 4:35 AM GMT

This is an edited transcript of a conversation between Scott Garrabrant (MIRI) and Rohin Shah (DeepMind) about whether researchers should focus more on approaches to AI alignment that don’t require highly capable AI systems to do much human modeling. CFAR’s Eli Tyre facilitated the conversation.

To recap, and define some terms:

  • The alignment problem is the problem of figuring out "how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world" (outcome alignment) or the problem of building powerful AI systems that are trying to do what their operators want them to do (intent alignment).
  • In 2016, Hadfield-Mennell, Dragan, Abbeel, and Russell proposed that we think of the alignment problem in terms of “Cooperative Inverse Reinforcement Learning” (CIRL), a framework where the AI system is initially uncertain of its reward function, and interacts over time with a human (who knows the reward function) in order to learn it.
  • In 2016-2017, Christiano proposed “Iterated Distillation and Amplification” (IDA), an approach to alignment that involves iteratively training AI systems to learn from human experts assisted by AI helpers. In 2018, Irving, Christiano, and Amodei proposed AI safety via debate, an approach based on similar principles.
  • In early 2019, Scott Garrabrant and DeepMind’s Ramana Kumar argued in “Thoughts on Human Models” that we should be “cautious about AGI designs that use human models” and should “put more effort into developing approaches that work well in the absence of human models”.
  • In early February 2021, Scott and Rohin talked more about human modeling and decided to have the real-time conversation below.

You can find a recording of the Feb. 28 discussion below (sans Q&A) here.

 

1. IDA, CIRL, and incentives

Eli:  I guess I want to first check what our goal is here. There was some stuff that happened online. Where are we according to you guys?

Scott:  I think Rohin spoke last and I think I have to reload everything. I'm coming at it fresh right now. 

Eli:  Yeah. So we'll start by going from “What are the positions that each of you were thinking about?” But I guess I'm most curious about: from each of your perspectives, what would a win of this conversation be like? What could happen that you would go away and be like, "Yes, this was a success."

Rohin:  I don't know. I generally feel like I want to have more models of what AI safety should look like, and that's usually what I'm trying to get out of conversations with other people. I also don't feel like we have even narrowed down on what the disagreement is yet. So, first goal would be to figure that out.

Eli:  Definitely. Scott?

Scott:  Yeah. So, I have two things. One is that I feel frustrated with the relationship between the AI safety community and the human models question. I mean, I feel like I haven't done a good job arguing about it or something. And so on one side, I'm trying to learn what the other position is because I'm stuck there.

The other thing is that I feel especially interested in this conversation now because I feel like some of my views on this have recently shifted. 

My worry about human-modeling systems has been that modeling humans is in some sense "close" to behaviors like deception. I still see things that way, but now I have a clearer idea of what the relevant kind of “closeness” is: human-modeling is close to unacceptable behavior on some measure of closeness that involves your ability to observe the system’s internals and figure out whether the system is doing the intended thing or doing something importantly different.

I feel like this was an update to my view on the thing relative to how I thought about it a couple of years ago, and I want similar updates.

Eli:  Awesome. As I'm thinking about it, my role is to try and make this conversation go well, according to your goals.

I want to encourage both of you to take full responsibility for making the conversation go well as well. Don't have the toxic respect of, "Ah, someone else is taking care of it." If something seems interesting to you, definitely say so. If something seems boring to you, say so. If there's a tack that seems particularly promising, definitely go for that. If I recommend something you think is stupid, you should say, "That's a stupid thing." I will not be offended. That is, in fact, helpful for me. Sound good?

All right. Scott, do you want to start with your broad claim?

Scott:  Yeah. So, I claim that it is plausible that the problem of aligning a super capable system that is working in some domain like physics is significantly easier than aligning a super capable system that's working in contexts that involve modeling humans. Either contexts involving modeling humans because they're working on inherently social tasks, or contexts involving modeling humans because part of our safety method involves modeling humans.

And so I claim with moderate probability that it's significantly easier to align systems that don’t do any human-modeling. I also claim, with small but non-negligible probability, that one can save the world using systems that don’t model humans. Which is not the direction I want the conversation to really go, because I think that's not really a crux, because I don't expect my probability on “non-human-modeling systems can be used to save the world” to get so low that I conclude researchers shouldn’t think a lot about how to align and use such systems.

And I also have this observation: most AI safety plans feel either adjacent to a Paul Christiano IDA-type paradigm or adjacent to a Stuart Russell CIRL-type paradigm.

I think that two of the main places that AI safety has pushed the field have been towards human models, either because of IDA/debate-type reasons or because we want to not specify the goal, we want it to be in the human, we want the AI to be trying to do "what we want” in quotation marks, which requires a pointer to modeling us.

And I think that plausibly, these avenues are mistakes. And all of our attention is going towards them.

Rohin:  Well, I can just respond to that maybe. 

So I think I am with you on the claim that there are tasks which incentivize more human-modeling. And I wouldn't argue this with confidence or anything like that, but I'm also with you that plausibly, we should not build AI systems that do those tasks because it's close to manipulation of humans.

This seems pretty distinct from the algorithms that we use to build the AI systems. You can use iterated amplification to build a super capable physics AI system that doesn't know that much about humans and isn't trying to manipulate them.

Scott:  Yeah. So you're saying that the IDA paradigm is sufficiently divorced from the dangerous parts of human modeling. And you're not currently making the same claim about the CIRL paradigm?

Rohin:  It depends on how broadly you interpret the CIRL paradigm. The version where you need to use a CIRL setup to infer all of human values or something, and then deploy that agent to optimize the universe—I'm not making the claim for that paradigm. I don't like that paradigm. To my knowledge, I don't think Stuart likes that paradigm. So, yeah, I'm not making claims about that paradigm.

I think in the world where we use CIRL-style systems to do fairly bounded, narrow tasks—maybe narrow is not the word I want to use here—but you use that sort of system for a specific task (which, again, might only involve reasoning about physics), I would probably make the same claim there, yes. I have thought less about it.

I might also say I'm not claiming that there is literally no incentive to model humans in any of these cases. My claim is more like, "No matter what you do, if you take a very broad definition of “incentive,” there will always be an incentive to model humans because you need some information about human preferences for the AI system to do a thing that you like."

Scott:  I think that it's not for me about an incentive to model humans and more just about, is there modeling of humans at all? 

Well, no, so I'm confused again, because... Wait, you're saying there's not an incentive to model humans in IDA?

Rohin:  It depends on the meaning of the word “incentive.”

Scott:  Yeah. I feel like the word “incentive” is distracting or something. I think the point is that there is modeling of humans.

(reconsidering) There might not be modeling of humans in IDA. There's a sense in which, in doing it right, maybe there's not... I don't know. I don't think that's where the disagreement is, though, is it?

Eli:  Can we back up for a second and check what's the deal with modeling humans? Presumably there's something at least potentially bad about that. There's a reason why we care about it. Yes? 

Rohin:  Yeah. I think the reason that I'm going with, which I feel fairly confident that Scott agrees with me on, at least as the main reason, is that if your AI system is modeling humans, then it is “easy” or “close” for it to be manipulating humans and doing something that we don't want that we can't actually detect ahead of time, and that therefore causes bad outcomes. 

Scott:  Yeah. I want to be especially clear about the metric of closeness being about our ability to oversee/pre-oversee a system in the way that we set up the incentives or something.

The closeness between modeling humans and manipulating humans is not in the probability that some system that’s doing one spontaneously changes to doing the other. (Even though I think that they are close in that metric.) It's more in the ability to be able to distinguish between the two behaviors. 

And I think that there's a sense in which my model is very pessimistic about oversight, such that maybe if we really try for the next 20 years or something, we can distinguish between “model thinking superintelligent thoughts about physics” and “model thinking about humans.” And we have no hope of actually being able to distinguish between “model that's trying to manipulate the human” and “model that's trying to do IDA-type stuff (or whatever) the legitimate way.”

Rohin:  Right. And I'm definitely a lot more optimistic about oversight than you, but I still agree directionally that it's harder to oversee a model when you're trying to get it to do things that are very close to manipulation. So this feels not that cruxy or something.

Scott:  Yeah, not that cruxy... I don't know, I feel like I want to hear more about what Rohin thinks or something instead of responding to him. Yeah, not that that cruxy, but what's the cruxy part, then? I feel like I'm already granting the thing about incentives and I'm not even talking about whether you have incentives to model humans. I'm assuming that there's systems that model humans, there's systems that don't model humans. And it's a lot easier to oversee the ones that don't.
 

2. Mutual information with humans

Rohin:  I think my main claim is: the determining factor about whether a system is modeling humans or not—or, let's say there is an amount of modeling the system does. I want to talk about spectrums because I feel like you say too many wrong things if you think in binary terms in this particular case. 

Scott:  All right.

Rohin:  So, there's an amount of modeling humans. Let's call it a scale—

Scott:  We can have an entire one dimension instead of zero! (laughs

Rohin:  Yes, exactly. The art of choosing the right number of dimensions. (laughs) It's an important art.

So, we'll have it on a scale from 0 to 10. I think my main claim is that the primary determinant of where you are on the spectrum is what you are trying to get your AI system to do, and not the source of the feedback by which you train the system. And IDA is the latter, not the former.

Scott:  Yeah, so I think that this is why I was distinguishing between the two kinds of “closeness.” I think that the probability of spontaneously manipulating humans is stronger when there are humans in the task than when there are just humans in the IDA/CIRL way of pointing at the task or something like that. But I think that the distance in terms of ability to oversee is not large...

Rohin:  Yeah, I think I am curious why.

Scott:  Yeah. Do I think that? (pauses

Hm. Yeah. I might be responding to a fake claim right now, I'm not sure. But in IDA, you aren't keeping some sort of structure of HCH. You're not trying to do this, because they're trying to do oversight, but you're distilling systems and you're allowing them to find what gets the right answer. And then you have these structures that get the right answer on questions about what humans say, and maybe rich enough questions that contain a lot of needing to understand what's going on with the human. (I claim that yes, many domains are rich enough to require modeling the human; but maybe that's false, I don't know.) 

And basically I'm imagining a black box. Not entirely black box—it's a gray box, and we can look at some features of it—but it just has mutual information with a bunch of stuff that humans do. And I almost feel like, I don't think this is the way that transparency will actually work, but I think just the question “is there mutual information between the humans and the models?” is the extent to which I expect to be able to do the transparency or something.

Maybe I'm claiming that in IDA, there's not going to be mutual information with complex models of humans.

Rohin:  Yep. That seems right. Or more accurately, I would say it depends on the task that you're asking IDA to do primarily. 

Scott:  Yeah, well, it depends on the task and it depends on the IDA. It depends on what instructions you give to the humans.

Rohin:  Yes.

Scott:  If you imagined that there's some core of how to make decisions or something, and humans have access to this core, and this core is not specific to humans, but humans don't have access to it in such a way that they can write it down in code, then you would imagine a world in which I would be incentivized to do IDA and I would be able to do so safely (kind of) because I'm not actually putting mutual information with humans in the system. I'm just asking the humans to follow their “how to make decisions” gut that doesn't actually relate to the humans. 

Rohin:  Yes. That seems right.

What do I think about that? Do I actually think humans have a core?... 

Scott:  I almost wasn't even trying to make that claim. I was just trying to say that there exists a plausible world where this might be the case. I'm uncertain about whether humans have a core like that. 

But I do think that in IDA, you're doing more than just asking the humans to use their “core.” Even if humans had a core of how to solve differential equations that they couldn't write down in code and then wanted to use IDA to solve differential equations via only asking humans to use their core of differential equations.

If this were the case… Yeah, I think the IDA is asking more of the human than that. Because regardless of the task, the IDA is asking the human to also do some oversight. 

Rohin:  … Yes. 

Scott:  And I think that the “oversight” part is capturing the richness of being human even if the task manages to dodge that. 

And the analog of oversight in debates is the debates. I think it's more clear in debates because it's debate, but I think there's the oversight part and that's going to transfer over.

Rohin:  That seems true. One way I might rephrase that point is that if you have a results-based training system, if your feedback to the agent is based on results, the results can be pretty independent of humans. But if you have a feedback system based not just on results, but also on the process by which you got the results—for whatever reason, it just happens to be an empirical fact about the world that there's no nice, correct, human-independent core of how to provide feedback on process—then it will necessarily contain a bunch of information about humans the agent will then pick up on.

Scott:  Yeah.

Yeah, I think I believe this claim. I'm not sure. I think that you said back a claim that not only is what I said, but also I temporarily endorse, which is stronger. (laughs

Rohin:  Cool. (laughs) Yeah. It does seem plausible. It seems really rough to not be giving feedback on the process, too. 

I will note that it is totally possible to do IDA, debate, and CIRL without process-level feedback. You just tell your humans to only write the results, or you just replace the human with an automated reward function that only evaluates the results if you can do that. 

Scott:  I mean... 

Rohin:  I agree, that's sort of losing the point of those systems in the first place. Well, maybe not CIRL, but at least IDA and debate. 

Scott:  Yeah. I feel like I can imagine "Ah, do something like IDA without the process-level feedback." I wouldn't even want to call debate “debate” without the process-level feedback.

Rohin:  Yeah. At that point it's like two-player zero-sum optimization. It's like AlphaZero or something.

Scott:  Yeah. I conjecture that the various people who are very excited about each of these paradigms would be unexcited about the version that does not have process-level feedback—

Rohin:  Yes. I certainly agree with that. 

I also would be pretty unexcited about them without the process-level feedback. Yeah, that makes sense. I think I would provisionally buy that for at least physics-style AI systems. 

Scott:  What do you mean? 

Rohin:  If your task is like, "Do good physics," or something. I agree that the process-level feedback will, like, 10x the amount of human information you have. (I made up the number 10.)

Whereas if it was something else, like if it was sales or marketing, I'd be like, "Yeah, the process-level feedback makes effectively no difference. It increases the information by, like, 1%." 

Scott:  Yeah. I think it makes very little difference in the mutual information. I think that it still feels like it makes some difference in some notion of closeness-to-manipulation to me.

So I'm in a position where if it's the case that one could do superhuman STEM work without humans, I want to know this fact. Even if I don't know what to do with it, that feels like a question that I want to know the answer to, because it seems plausibly worthwhile. 

Rohin:  Well, I feel like the answer is almost certainly yes, right? AlphaFold is an example of it. 

Scott:  No, I think that one could make the claim... All right, fine, I'll propose an example then. One could make the claim that IDA is a capability enhancement. IDA is like, "Let's take the humans' information about how to break down and solve problems and get it into the AI via the IDA process." 

Rohin:  Yep. I agree you could think of it that way as well. 

Scott:  And so one could imagine being able to answer physics questions via IDA and not knowing how to make an AI system that is capable of answering the same physics questions without IDA. 

Rohin:  Oh. Sure. That seems true. 

Scott:  So at least in principle, it seems like… 

Rohin:  Yeah. 

Eli:  There's a further claim I hear you making, Scott—and correct me if this is wrong—which is, “We want to explore the possibility of solving physics problems without something like IDA, because that version of how to solve physics problems may be safer.”

Scott:  Right. Yeah, I have some sort of conjunction of “maybe we can make an AGI system that just solves physics problems” and also “maybe it's safer to do so.” And also “maybe AI that could only solve physics problems is sufficient to save the world.” And even though I conjuncted three things there, it's still probable enough to deserve a lot of attention. 

Rohin:  Yeah. I agree. I would probably bet against “we can train an AI system to do good STEM reasoning in general.” I'm certainly on board that we can do it some of the time. As I've mentioned, AlphaFold is an example of that. But it does feel like, yeah, it just seems really rough to have to go through an evolution-like process or something to learn general reasoning rather than just learning it from humans. Seems so much easier to do the latter that that's probably going to be the best approach in most cases. 

Scott:  Yeah. I guess I feel like some sort of automated working with human feedback—I feel like we can be inspired by humans when trying to figure out how to figure out some stuff about how to make decisions or something. And I'm not too worried about mutual information with humans leaking in from the fact that we were inspired by humans. We can use the fact that we know some facts about how to do decision-making or something, to not have to just do a big evolution. 

Rohin:  Yeah. I think I agree that you could definitely do a bit better that way. What am I trying to say here? I think I'm like, “For every additional bit of good decision-making you specify, you get correspondingly better speed-ups or something.” And IDA is sort of the extreme case where you get lots and lots of bits from the humans. 

Scott:  Oh! Yeah... I don't think that the way that humans make decisions is that much more useful as a thing to draw inspiration from than, like, how humans think ideal decision-making should be. 

Rohin:  Sure. Yeah, that's fair. 

Scott:  Yeah. It’s not obvious to me that you have that much to learn from humans relative to the other things in your toolbox. 

Rohin:  That's fair. I think I do disagree, but I wouldn't bet confidently one way or the other.
 

3. The default trajectory

Eli:  I guess I want to back up a little bit and just take stock of where we are in this thread of the conversation. Scott made a conjunctive claim of three things. One is that it's possible to train an AI system to do STEM work without needing to have human models in the mix. Two, this might be sufficient for saving the world or otherwise doing pretty powerful stuff. And three, this might be substantially safer than doing it IDA-style or similar. I guess I just want to check, Rohin. Do you agree or disagree with each of those points? 

Rohin:  I probably disagree somewhat with all of them, but the upstream disagreement is how optimistic versus pessimistic Scott and I are about AI safety in general. If you're optimistic the way I am, then you're much more into trying not to change the trajectory of AI too much, and instead make the existing trajectory better. 

Scott:  Okay. (laughs) I don't know. I feel like I want to say this, because it's ironic or something. I feel like the existing framework of AI before AI safety got involved was “just try to solve problems,” and then AI safety's like, "You know what we need in our ‘trying to solve problems’? We need a lot of really meta human analysis." (laughs)

It does feel to me like the default path if nobody had ever thought about AI safety looks closer to the thing that I'm advocating for in this discussion. 

Rohin:  I do disagree with that. 

Scott:  Yeah. For the default thing, you do need to be able to say, "Hey, let's do some transparency and let's watch it, and let's make sure it's not doing the human-modeling stuff." 

Rohin:  I also think people just would have started using human feedback. 

Scott:  Okay. Yeah, that might be true. Yeah. 

Rohin:  It's just such an easy way to do things, relative to having to write down the reward function. 

Scott:  Yeah, I think you're right about that.

Rohin:  Yeah. But I think part of it is that Scott's like, "Man, all the things that are not this sort of three-conjunct approach are pretty doomed, therefore it's worth taking a plan that might be unlikely to work because that's the only thing that can actually make a substantial difference." Whereas I'm like, "Man, this plan is unlikely to work. Let's go with a likely one that can cut the risk in half." 

Scott:  Yeah. I mean, I basically don't think I have any kind of plan that I think is likely to work… But I'm also not in the business of making plans. I want other people to do that. (laughs)

Rohin:  I should maybe respond to Eli's original question. I think I am more pessimistic than Scott on each of the three claims, but not by much.

Eli:  But the main difference is you're like, "But it seems like there's this alternative path which seems like it has a pretty good shot."

Rohin:  I think Scott and I would both agree that the no-human-models STEM AI approach is unlikely to work out. And I’m like, "There's this other path! It's likely to work out! Let's do that." And Scott's like, "This is the only thing that has any chance of working out. Let's do this."

Eli:  Yeah. Can you tell if it's a crux, whether your optimism about the human-models path would affect how optimistic you feel about the no-human-models path? It’s a very broad question, so it may be kind of unfair.

Rohin:  It would seriously depend on why I became less optimistic about what I'm calling the usual path, the IDA path or something.

There are just a ton of beliefs that are correlated that matter. Again, I'm going to state things more confidently than I believe them for the sake of faster and clearer communication. There's a ton of beliefs like, "Oh man, good reasoning is just sort of necessarily messy and you're not going to get nice, good principles, and so you should be more in the business of incrementally improving safety rather than finding the one way to make a nice, sharp, clear distinction between things."

Then other parts are like, "It sure seems like you can't make a sharp distinction between ‘good reasoning’ and ‘what humans want.’” And this is another reason I'm more optimistic about the first path.

So I could imagine that I get unconvinced of many of these points, and definitely some of them, if I got unconvinced of those, I would be more optimistic about the path Scott's outlining.

 

4. Two kinds of risk

Scott:  Yeah. I want to clarify that I'm not saying, like, "Solve decision theory and then put the decision theory into AI," or something like that, where you're putting the “how to do good reasoning” in directly. I think that I'm imagining that you have to have some sort of system that's learning how to do reasoning. 

Rohin:  Yep. 

Scott:  And I'm basically distinguishing between a system that's learning how to do reasoning while being overseen and kept out of the convex hull of human modeling versus… And there are definitely trade-offs here, because you have more of a daemon problem or something if you're like, "I'm going to learn how to do reasoning," as opposed to, "I'm going to be told how to do reasoning from the humans." And so then you have to search over this richer space or something of how to do reasoning, which makes it harder. 

I'm largely not thinking about the capability cost there. There's a safety cost associated with running a big evolution to discover how reasoning works, relative to having the human tell you how reasoning works. But there's also a safety cost to having the human tell you how reasoning works. 

And it's a different kind of safety cost in the two cases. And I'm not even sure that I believe this, but I think that I believe that the safety cost associated with learning how to do reasoning in a STEM domain might be lower than the safety cost associated with having your reasoning directly point to humans via the path of being able to have a thick line between your good and bad behavior.

Rohin:  Yeah.

Scott:  And I can imagine you saying, like, "Well, but that doesn't matter because you can't just say, 'Okay, now we're going to run the evolution on trying to figure out how to solve STEM problems,' because you need to actually have the capabilities or something."

Rohin:  Oh. I think I lost track of what you were saying at the last sentence. Which probably means I failed to understand the previous sentences too.

Eli:  I heard Scott to be saying—obviously correct me if I've also misapprehended you, Scott—that there's at least two axes that you can compare things along. One is “What's the safety tax? How much AI risk do you take on for this particular approach?” And this other axis is, “How hard is it to make this approach work?” Which is a capabilities question. And Scott is saying, "I, Scott, am only thinking about the safety tax question, like which of these is—”

Scott:  “Safety tax” is a technical term that you're using wrong. But yeah.

Rohin:  If you call it “safety difficulty”...

Scott:  Or “risk factor.”

Eli:  Okay. Thank you. “I'm only thinking about the risk factor, and not thinking about how much extra difficulty with making it work comes from this approach.” And I imagine, Rohin, that you're like, "Well, even—”

Rohin:  Oh, I just agree with that. I'm happy making that decomposition for now. Scott then made a further claim about the delta between the risk factor from the IDA approach and the STEM AI approach.

Scott:  Well, first I made the further claim that the risk from the IDA approach and the risk from the STEM approach are kind of different in kind. It's not like you can just directly compare them; they're different kinds of risks. 

And so because of that, I'm uncertain about this: But then I further made the claim, at least in this current context, that the IDA approach might have more risk.

Rohin:  Yeah. And what were the two risks? I think that's the part where I got confused.

Scott:  Oh. One of the risks is we have to run an evolution because we're not learning from the humans, and so the problem's harder. And that introduces risks because it's harder for us to understand how it's working, because it's working in an alien way, because we had to do an evolution or something.

Rohin:  So, roughly, we didn't give it process-level feedback, and so its process could be wild?

Scott:  … Yeahh... I mean, I’m imagining... Yeah. Right.

Rohin:  Cool. All right. I understand that one. Well, maybe not... 

Scott:  Yeah. I think that if you phrase it as "We didn't give it any process-level feedback," I'm like, “Yeah, maybe this one is obviously the larger risk.” I don't know.

But yeah, the risk associated with "You have to do an evolution on your system that’s thinking about STEM." Yeah, it's something like, "We had to do an evolution." 

And the risk of the other one is that we made it such that we can't oversee it in the required ways. We can't run the policy, watch it carefully, and ensure it doesn’t reason about humans. 

Rohin:  Yeah. 

Scott:  And yeah, I think I'm not even making a claim about which of these is more or less risky. I'm just saying that they're sufficiently different risks, that we want to see if we can mitigate either of them. 

Rohin:  Yeah. That makes sense.
 

5. Scaling up STEM AI

Rohin:  So the thing I wanted to say, and now I'm more confident that it actually makes sense to say, is that it feels to me like the STEM AI approach is lower-risk for a somewhat-smarter-than-human system, but if I imagine scaling up to arbitrarily smarter-than-human systems, I'm way more scared of the STEM AI version. 

Scott:  Yeah.

Eli:  Can you say why, Rohin?

Rohin:  According to me, the reason for optimism here is that your STEM AI system isn't really thinking about humans, doesn't know about them. Whatever it's doing, it's not going to successfully execute a treacherous turn, because successfully doing that requires you to model humans. That's at least the case that I would make for this being safe.

And when we're at somewhat superhuman systems, I'm like, "Yeah, I mostly buy that." But when we talk about the limit of infinite intelligence or something, then I'm like, "But it's not literally true that the STEM AI system has no reason to model the humans, to the extent that it has goals (which seems like a reasonable thing to assume).” Or at least a reasonable thing to worry about, maybe not assume.

It will maybe want more resources. That gives it an incentive to learn about the external world, which includes a bunch of humans. And so it could just start learning about humans, do it very quickly, and then execute a treacherous turn. That seems entirely possible, whereas with the IDA approach, you can hope that we have successfully instilled the notion of, "Yes, you are really actually trying to—”

Scott:  Yeah. I definitely don’t want us to do this forever.

Rohin:  Yeah.

Scott:  I'm imagining—I don't know, for concreteness, even though this is a silly plan, there's the plan of “turn Jupiter into a supercomputer, invent some scanning tech, run a literal HCH.”

Rohin:  Yeah. 

Scott:  And I don't know, this probably seems like it requires super-superhuman intelligence by your scale. Or maybe not.

Rohin:  I don't really have great models for those scales.

Scott:  Yeah, I don't either. It's plausible to me that... Yeah, I don't know.

Rohin:  Yeah. I'm happy to say, yes, we do not need a system to scale to literally the maximal possible intelligence that is physically possible. And I do not—

Scott:  Further, we don't want that, because at some point we need to get the human values into the future, right? (laughs)

Rohin:  Yes, there is that too.

Scott:  At some point we need to... Yeah. There is a sense in which it's punting the problem. 

Rohin:  Yeah. And to be clear, I am fine with punting the problem. 

It does feel a little worrying though that we... Eh, I don't know. Maybe not. Maybe I should just drop this point.
 

6. Optimism about oversight

Scott:  Yeah. I feel like I want to try to summarize you or something. Yeah, I feel like I don't want to just say this optimism thing that you've said several times, but it's like...

I think that I predict that we're in agreement about, "Well, the STEM plan and the IDA plan have different and incomparable-in-theory risk profiles." And I don't know, I imagine you as being kind of comfortable with the risk profile of IDA. And probably also more comfortable than me on the risk profile of basically every plan.

I think that I'm coming from the perspective, "Yep, the risk profile in the STEM plan also seems bad, but it seems higher variance to me."

It seems bad, but maybe we can find something adjacent that's less bad, or something like that. And so I suspect there’s some sort of, "I'm pessimistic, and so I'm looking for things such that I don't know what the risk profiles are, because maybe they'll be better than this option."

Rohin:  Yeah.

Scott:  Yeah. Yeah. So maybe we're mostly disagreeing about the risk profile of the IDA reference class or something—which is a large reference class. It's hard to bring in one thing.

Rohin:  I mean, for what it's worth, I started out this conversation not having appreciated the point about process-level feedback requiring more mutual information with humans. I was initially making a stronger claim.

Scott:  Okay.

Eli:  Woohoo! An update.

Scott:  (laughs) Yeah. I feel like I want to flag that I am suspicious that neither of us were able to correctly steel-man IDA, because of... I don't know, I'm suspicious that the IDA in Paul's heart doesn't care about whether you build your IDA out of humans versus building it out of some other aliens that can accomplish things, because it's really trying to... I don't know. That's just a flag that I want to put out there that it's plausible to me that both Rohin and I were not able to steel-man IDA correctly.

Rohin:  I think I wasn't trying to. Well, okay, I was trying to steel-man it inasmuch as I was like, “Specifically the risks of human manipulation seem not that bad,” which, I agree, Paul might have a better case for than me.

Scott:  Yeah.

Rohin:  I do think that the overall case for optimism is something like, “Yep, you'll be learning, you'll be getting some mutual info about humans, but the oversight will be good enough that it just is not going to manipulate you.”

Scott:  I also suspect that we're not just optimistic versus pessimistic on AI safety. I think that we can zoom in on optimism and pessimism about transparency, and it might be that “optimism versus pessimism on transparency” is more accurate.

Rohin:  I think I'm also pretty optimistic about process-level feedback or something—which, you might call it transparency... 

Scott:  Yeah. It’s almost like when I'm saying “transparency,” I mean “the whole reference class that lets you do oversight” or something. 

Rohin:  Oh. In that case, yes, that seems right. 

So it includes adversarial training, it includes looking at the concepts that the neural net has learned to see if there's a deception neuron that's firing, it includes looking at counterfactuals of what the agent would have done and seeing whether those would have been bad. All that sort of stuff falls under “transparency” to you? 

Scott:  When I was saying that sentence, kind of.

Rohin:  Sure. Yeah, I can believe that that is true. I think there's also some amount of optimism on my front that the intended generalization is just the one that is usually learned, which would also be a difference that isn't about transparency. 

Scott:  Yeah.

Rohin:  Oh, man. To everyone listening, don't quote me on that. There are a bunch of caveats about that. 

Scott:  (laughs)

Eli:  It seems like this is a pretty good place to wrap up, unless we want to dive back in and crux on whether or not we should be pessimistic about transparency and generalization.

Scott:  Even if we are to do that, I want to engage with the audience first. 

Eli:  Yeah, great. So let us open the fishbowl, and people can turn on their cameras and microphones and let us discuss.
 

7. Q&A: IDA and getting useful work from AI

Scott:  Wow, look at all this chat with all this information. 

Rohin:  That is a lot of chat. 

So, I found one of the chat questions, which is, “How would you build an AI system that did not model humans? Currently, to train the model, we just gather a big data set, like a giant pile of all the videos on YouTube, and then throw it at a neural network. It's hard to know what generalizations the network is even making, and it seems like it would be hard to classify generalizations into a ‘modeling humans’ bucket or not.”

Scott:  Yeah. I mean, you can make a Go AI that doesn't model humans. It's plausible that you can make something that can do physics using methods that are all self-play-adjacent or something.

Rohin:  In a simulated environment, specifically. 

Scott:  Right. Yeah.

Or even—I don't know, physics seems hard, but you could make a thing that tries to learn some math and answer math questions, you could do something that looks like self-play with how useful limits are or something in a way that's not... I don't know.

Rohin:  Yeah.

Donald Hobson:  I think a useful trick here is: anything that you can do in practice with these sorts of methods is easy to brute-force. You can do Go with AlphaGo Zero, and you can easily brute-force Go given unlimited compute. AlphaGo Zero is just a clever— 
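
(For concreteness, a toy sketch of what "easy to brute-force given unlimited compute" means here: exhaustive negamax search over the game tree. The game interface used below, with is_terminal, score, legal_moves, and play, is hypothetical rather than any real Go library, and real Go is astronomically out of reach for this, which is exactly why methods like AlphaGo Zero exist.)

```python
# Toy sketch: brute-forcing a finite two-player, perfect-information game by
# exhaustive negamax. `game` is a hypothetical interface, not a real Go library.
def negamax(state, game):
    """Return the value of `state` for the player to move: +1 win, 0 draw, -1 loss."""
    if game.is_terminal(state):
        return game.score(state)  # scored from the perspective of the player to move
    # The best we can do is pick the move that leaves the opponent with the worst value.
    return max(-negamax(game.play(state, move), game) for move in game.legal_moves(state))
```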

Scott:  Yeah. I think there's some part where the human has to interact with it, but I think I'm imagining that your STEM AI might just be, "Let's get really good at being able to do certain things that we could be able to brute-force in principle, and then use this to do science or something.”

Donald Hobson:  Yeah. Like a general differential equation solver or something. 

Scott:  Yeah. I don't know how to get from this to being able to get, I don't know, reliable brain scanning technology or something like that. But I don't know, it feels plausible that I could turn processes that are about how to build a quantum computer into processes that I could turn into math questions. I don't know.

Donald Hobson:  “Given an arbitrary lump of biochemicals, find a way to get as much information as possible out of it.” Of a specific certain kind of information—not thermal noise, obviously.

Scott:  Yeah. Also in terms of the specifications, you could actually imagine just being able to specify, similar to being able to specify Go, the problem of, "Hey, I have a decent physics simulation and I want to have an algorithm that is Turing-complete or something, and use this to try to make more…” I don't know. Yeah, it's hard. It's super hard.

Ben Pace:  The next question in the chat is from Charlie Steiner saying, "What's with IDA being the default alternative rather than general value learning? Is it because you're responding to imaginary Paul, or is it because you think IDA is particularly likely or particularly good?"

Scott:  I think that I was engaging with IDA especially because I think that IDA is among the best plans I see. I think that IDA is especially good, and I think that part of the reason why I think that IDA is especially good is because it seems plausible to me that basically you can do it without the core of humanity inside it or something.

I think the version of IDA that you're hoping to be able to do in a way that doesn't have the core of humanity inside the processes that it's doing seems plausible to exist. I like IDA. 

Donald Hobson:  By that do you mean IDA, like, trained on chess? Say, you take Deep Blue and then train IDA on that? Or do you mean IDA that was originally trained on humans? Because I think the latter is obviously going to have a core of humanity, but IDA trained on Deep Blue—

Scott:  Well, no. IDA trained on humans, where the humans are trained to follow an algorithm that's sufficiently reliable.

Donald Hobson:  Even if the humans are doing something, there will be side channels. Suppose the humans are solving equations—

Scott:  Yeah, but you might be able to have the system watch itself and be able to have those side channels not actually... In theory, there are those side channels, and in the limit, you'd be worried about them, but you might be able to get far enough to be able to get something great without having those side channels actually interfere.

Donald Hobson:  Fair enough. 

Ben Pace:  I'm going to ask a question that I want answered, and there's a good chance the answer is obvious and I missed it, but I think this conversation initially came from, Scott wrote a post called “Thoughts on Human Models,” where he was— 

Scott:  Ramana wrote the post. I ranted at Ramana, and then he wrote the post.

Ben Pace:  Yes. Where he was like, "What if we look more into places that didn't rely on human models in a bunch of ways?" And then later on, Rohin was like, "Oh, I don't think I agree with that post much at all, and I don't think that's a promising area." I currently don't know whether Rohin changed his mind on whether that was a promising area to look at. I'm interested in Rohin's say on that.

Rohin:  So A, I'll note that we mostly didn't touch on the things I wrote in those comments because Scott's reasons for caring about this have changed. So I stand by those comments. 

Ben Pace:  Gotcha. 

Rohin:  In terms of “Do I think this is a promising area to explore?”: eh, I'm a little bit more interested in it, mostly based on “perhaps process level feedback is introducing risks that we don't need.” But definitely my all-things-considered judgment is still, "Nah I would not be putting resources into this," and there are a bunch of other disagreements that Scott and I have that are feeding into that. 

Ben Pace:  And Scott, what's your current feelings towards... I can't remember, in the post you said "there's both human models and not human models", and then you said something like, "theory versus"... I can't remember what that dichotomy was.

But how much is your feeling a feeling of “this is a great area with lots of promise” versus “it's plausible and I would like to see some more work in it,” versus “this is the primary thing I want to be thinking about.”

Scott:  So, I think there is a sense in which I am primarily doing science and not engineering. I'm not following the plan of “how do you make systems that are…” I'm a little bit presenting a plan, but I'm not primarily thinking about that plan as a plan, as opposed to trying to figure out what's going on.

So I'm not with my main attention focused on an area such that you can even say whether or not there are human models in the plan, because there's not a plan.

I think that I didn't update much during this conversation and I did update earlier. Where the strongest update I've had recently, and partially the reason why I was excited about this, was: The post that was written two years ago basically is just talking about "here's why human models are dangerous", and it's not about the thing that I think is the strongest reason, which is the ability to oversee and the ability to be able to implement the strategy “don't let the thing have human models at all.”

It's easy to separate “thinking about physics” from “thinking about manipulating humans” if you put “think about modeling humans” in the same cluster as “think about manipulating humans.” And it's a lot harder to draw the line if you want to put “think about modeling humans” on the other side, with “thinking about physics.”

And I think that this idea just didn't show up in the post. So to the extent that you want me to talk about change from two years ago, I think that the center of my reasoning has changed, or the part that I'm able to articulate has changed.

There was a thing during this conversation... I think that I actually noticed that I probably don't, even in expectation, think that... I don't know.

There was a part during this conversation where I backpedaled and was like, "Okay, I don't know about the expectation of the risk factor of these two incomparable risks. I just want to say: first, they're different in kind, therefore we should analyze both. They're different in kind, therefore if we're looking for a good strategy, we should keep looking at both because we should keep that OR gate in our abilities.

“And two: It seems like there might be some variance things, where we might be able to learn more about the risk factors on this other side, AI that doesn’t use human models. It might turn out that the risks are lower, so we should want to keep attention on both types of approach.”

I think maybe I would have said this before. But I think that I'm not confident about what side is better, and I'm more arguing from “let's try lots of different options because I'm not satisfied with any of them.”

Rohin:  For what it's worth, if we don't talk just about "what are the risks" and we also include "what is the plan that involves this story", then I'm like, "the IDA one seems many orders of magnitude more likely to work". “Many” being like, I don’t know, over two.

Ben Pace:  It felt like I learned something when Scott explained his key motivation being around the ability to understand what the thing is doing and blacklist or whitelist certain types of cognition, being easier when it's doing no human modeling, as opposed to when it is trying to do human modeling but you also don't want it to be manipulative.

Rohin:  I think I got that from the comment thread while we were doing the comment thread, whenever that was.

Ben Pace:  The other question I have in chat is: Steve—

Scott:  Donald was saying mutual information is too low a bar, and I want to flag that I did not mean that mutual information is the metric that should be used to determine whether or not you're modeling humans. But the type of thing that I think we could check for has some common intuition with trying to check for mutual information.

Donald Hobson:  Yeah, similar to mutual information. I'd say it was closer to mutual information conditional on a bunch of physics and other background facts about the universe.
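
(For reference, the quantity Donald is pointing at, mutual information between the system X and the humans Y conditional on background facts Z, has the standard definition below; Scott's point is that even this would need tractable distributions and something more "logical" to be usable as a check.)

```latex
I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)
```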

Scott:  I want to say something like, there's some concept that we haven't invented yet, which is like “logical mutual information,” and we don't actually know what it means yet, but it might be something.

Ben Pace:  I'm going to go to the next question in chat, which is Steve Byrnes: “Scott and Rohin both agreed/posited right at the start that maybe we can make AIs that do tasks that do not require modeling humans, but I'm still stuck on that. An AI that can't do all of these things, like interacting with people, is seemingly uncompetitive and seemingly unable to take over the world or help solve all of AI alignment. That said, ‘use it to figure out brain scanning’ was a helpful idea by Scott just now. But I'm not 100% convinced by that example. Is there any other plausible example, path, or story?”

Scott:  Yeah, I feel like I can't give a good story in terms of physics as it exists today. But I could imagine being in a world where physics was different such that it would help. And because of that I'm like, “It doesn't seem obviously bad in principle,” or something.

I could imagine a world where basically we could run HCH and HCH would be great, but we can't run HCH because we don't have enough compute power. But also, if you can inverse this one hash you get infinity compute because that's how physics works.

Group:  (laughs)

Scott:  Then your plan is: let's do some computer science until we can invert this one hash and then we unlock the hyper-computers. Then we use the hyper-computers to run a literal HCH as opposed to just an IDA.

And that's not the world we live in, but the fact that I can describe that world means that it's an empirical about-the-world fact as opposed to an empirical about-how-intelligence-works fact, or something. So I'm, like, open to the possibility. I don’t know.

Gurkenglas:  On what side of the divide would you put reasoning about decision theory? Because thinking about agents in general seems like it might not count as modeling humans, and thinking about decision theory is kind of all that's required in order to think about AI safety, and if you can solve AI safety that's the win condition that you were talking about.

But also it's kind of hard to be sure that an AI that is thinking about agency is not going to be able to manipulate us, even if it doesn't know that we are humans. It might just launch attacks against whoever is simulating its box. So long as that person is an agent.

Scott:  Yeah. I think that thinking about agents in general is already scary, but not as scary. And I'm concerned about the fact that it might be that thinking about agents is just convergent and you're not going to be able to invert the hash that gives you access to the hyper-computer, unless you think about agents.

That's a particularly pessimistic world, where it's like, you need to have things that are thinking about agency and thinking about things in the context of an agent in order to be able to do anything super powerful.

I'm really uncertain about the convergence of agency. I think there are certain parts of agency that are convergent and there are certain parts of agency that are not, and I'm confused about what the parts of agency even are, enough that I kind of just have this one cluster. And I'm like, yeah, it seems like that cluster is kind of convergent. But maybe you can have things that are optimizing “figuring out how to divert cognitive resources” and that's kind of like being an agent, but it's doing so in such a way that's not self-referential or anything, and maybe that's enough? I don’t know.

There's this question of the convergency of thinking about agents that I think is a big part of a hole in my map. A hole in my prioritization map is, I'm not actually sure what I think about what parts of agency are convergent, and this affects a lot, because if certain parts of agency are convergent then it feels like it really dooms a lot of plans.

Ben Pace:  Rohin, is there anything you wanted to add there?

Rohin:  I don’t know. I think I broadly agree with Scott in that, if you're going to go down the path of “let's exclude all the types of reasoning that could be used to plan or execute a treacherous turn,” you probably want to put general agency on the outside of that barrier. That seems roughly right if you're going down this path.

And on the previous question of: can you do it, can this even be done? I'm super with, I think it was Steve, on these sorts of AI systems probably being uncompetitive. But I sort of have a similar position as Scott: maybe there's just a way that you can leverage great knowledge of just science in order to take over the world, for example. It seems like that might be possible. I don't know, one way or the other. I would bet against, unless you have some pretty weird scenarios. But I wouldn't bet against it at, like, 99% confidence. Or maybe I would bet against it that much. But there I'm getting to “that's probably a bit too high.”

And like, Scott is explicitly saying that most of these things are not high-probability things. They're just things that should be investigated. So that didn't seem like an avenue to push down.

Ben Pace:  I guess I just disagree. I feel like if you were the first guy to invent nukes, and you invented them in 1800 or something, I feel I could probably tell various stories about having advanced technologies in a bunch of ways helping strategically—not even just weaponry.

Rohin:  Yeah, I'd be interested in a story like this. I don't feel like you can easily make it. It just seems hard unless you're already a big power in the world.

Ben Pace:  Really! Okay. I guess maybe I'll follow up with you on that sometime.

Joe Collman:  Doesn't this slightly miss the point though, in that you'd need to be able to take over the world but then also make it safe afterwards, and the “make it safe afterwards” seems to be the tricky part there. The AI safety threat is still there if you through some conventional means—

Donald Hobson:  Yeah, and you want to do this without massive collateral damage. If you invent nukes... If I had a magic map, where I pressed it and the city on it blew up or something, I can't see a way of taking over the world with that without massive collateral damage. Actually, I can't really see a way of taking over the world with that with massive collateral damage.

Ben Pace:  I agree that the straightforward story with nukes sounds fairly unethical. I think there's probably ways of doing it that aren't unethical but are more about just providing sufficient value that you're sort of in charge of how the world goes.

Joe Collman:  Do you see those ways then making it safer? Does that solve AI safety or is it just sort of, “I'm in charge but there's still the problem?”

Scott:  I liked the proof of concept, even though this is not what I would want to do, between running literal HCH and IDA. I feel like literal HCH is just a whole lot safer, and the only difference between literal HCH and IDA is compute and tech.
 

8. Q&A: HCH’s trustworthiness

Joe Collman:  May I ask what your intuition is about motivation being a problem for HCH? Because to me it seems like we kind of skip this and we just get into what it's computationally capable of doing, and we ignore the fact that if you have a human making the decisions, that they're not going to care about the particular tasks you give them necessarily. They're going to do what they think is best, not what you think is best.

Scott:  Yeah, I think that that's a concern, but that's a concern in IDA too.

If I think that IDA is the default plan to compare things to, then I'm like, “Well, you could instead do HCH, which is just safer—if you can safely get to the point where you can get to HCH.”

Joe Collman:  Right, yeah. I suppose my worry around this is more if we're finding some approximation to it and we're not sure we've got it right yet. Whereas I suppose if you're uploading and you're sure that the uploading process is actually—

Scott:  Which is not necessarily the thing that you would do with STEM AI, it's just a proof of concept.

Donald Hobson:  If you're uploading, why put the HCH structure on it at all? Why not just have a bunch of uploaded humans running around a virtual village working on AI safety? If you've got incredibly powerful nano-computers, can't you just make uploaded copies of a bunch of AI safety researchers and run them for a virtual hundred years, but real time five minutes, doing AI safety research?

Scott:  I mean, yeah, that's kind of like the HCH thing. It's different, but I don't know...

Charlie Steiner:  I'm curious about your intuitive, maybe not necessarily a probability, but gut feeling on HCH success chances, because I feel like it's quite unlikely to preserve human value.

Donald Hobson:  I think a small HCH will probably work and roughly preserve human value if it's a proper HCH, no approximations. But with a big one, you're probably going to get ultra-viral memes that aren't really what you wanted.

Gurkenglas:  We have a piece of evidence on the motivation problem for HCH and IDA, namely GPT. When it pretends to write for a human, we can easily make it pretend to be any kind of human that we want by simply specifying it in the prompt. The hard part is making the pretend human capable enough.

Donald Hobson:  I'm not sure that it's that easy to pick what kind of human you actually get from the prompt.

Scott:  I think that the version of IDA that I have any hope for has humans following sufficiently strict procedures and such that this isn't actually a thing that... I don’t know, I feel like this is a misunderstanding of what IDA is supposed to do.

I think that you're not supposed to get the value out of the individual pieces in IDA. You're supposed to use that to be able to… Like, you have some sort of task, and you're saying, "Hey, help me figure out how to do this task." And the individual humans inside the HCH/IDA system are not like, "Wait, do I really want to do this task?". The purpose of the IDA is to have the capabilities come from a human-like method as opposed to just, like, a large evolution, so that you can oversee it and trust where it’s coming from and everything.

Donald Hobson:  Doesn't that mean that the purpose is to have the humans provide all the implicit values that are so obvious no one bothered to mention them? So if you ask IDA to put two strawberries on the plate, it's the humans’ implicit values that do it in a way that doesn't destroy the world.

Scott:  I think that it's partially that. I think it's not all that. I think that you only get the basics of human values out of the individual components of the HCH or IDA, and the actual human values are coming from the humans using the IDA system. 

I think that IDA as intended should work almost as well if you replace all the humans with well-intentioned aliens. 

Ben Pace:  Wait, can you say that again? That felt important to me. 

Scott:  I think, and I might be wrong about this, that the true IDA, the steel-man Scott IDA, should work just as well or almost as well if you replaced all of the humans with well-intentioned aliens.

Donald Hobson:  Well-intentioned aliens wouldn't understand English. Your IDA is getting its understanding of English from the humans in it. 

Scott:  No, I think that if I wanted to use IDA, I might use IDA to for example solve some physics problems, and I would do so with humans inside overseeing the process. There's not supposed to be a part of the IDA that’s making sure that human values are being brought into our process of solving these physics problems. The IDA is just there to be able to safely solve physics problems.

And so if you replaced it with well-intentioned aliens, you still solve the physics problems. And if you direct your IDA towards a problem like “solve this social problem”, you need information about social dynamics in your system, but I think that you should think of that as coming from a different channel than the core of the breaking-problems-up-and-stuff, and the core of the breaking-problems-up-and-stuff should be thought of as something that could be done just as well by well-intentioned aliens.

Donald Hobson:  So you've got humans that are being handed a social problem and told to break it up, that is the thing you're imitating, but the humans are trying to pretend that they know absolutely nothing about how human societies work except for what's on this external piece of paper that you handed them.

Scott:  It's not necessarily a paper. You do need some human-interacting-with-human in order to think about social stuff or something... Yeah, I don't know what I'm saying. 

It's more complicated on the social thing than on the physics thing. I think the physics thing should work just as well. I think the physics thing should work just as well with well-intentioned aliens, and the human thing should work just as well if you have an HCH that's built out of humans and well-intentioned aliens and the humans never do any decomposition, they only ask the well-intentioned aliens questions about social facts or something. And the process that's doing decomposition is the well-intentioned aliens' process of doing decomposition.

I also might be wrong. 

Ben Pace:  I like this thread. I also kind of liked it when Joe Collman asked questions that he cared about. If you had more questions you cared about, Joe, I would be interested in you asking those.

Joe Collman:  I guess, sure. I was just thinking again with the HCH stuff—I would guess, Scott, that you probably think this isn't a practical issue, but I would be worried about the motivational side of HCH in the limit of infinite training of IDA. 

Are you thinking that, say, you've trained a thousand iterations of IDA, you're training the thousand-and-first, you've got the human in there. They've got this system that's capable of answering arbitrarily amazing, important questions, and you feed them the question “Do you like bananas?” or something like that, some irrelevant question that's just unimportant. Are we always trusting that the human involved in the training will precisely follow instructions? Are you seeing that as a non-issue, that we can just say, well that's a sort of separate thing— 

Scott:  I think that you could make your IDA kind of robust to “sometimes the humans are not following the instructions”, if it's happening a small amount of the time. 

Joe Collman:  But if it's generalizing from that into other cases as well, and then it learns, “OK, you don't follow instructions a small amount of time…”—if it generalizes from that, then in any high-stakes situation it doesn't follow instructions, and if we're dealing with an extremely capable system, it might see every situation as a high-stakes situation because it thinks, “I can answer your question or I can save the world.” Then I'm going to choose to save the world rather than answering your question directly, providing useful information. 

Scott:  Yeah, I guess if there's places where the humans reliably... I was mostly just saying, if you're collecting data from humans and there's noise in your data from humans, you could have systems that are more robust to that. But if you have it be the case that humans reliably don't follow instructions on questions of type X, then the thing that you're training is a thing that reliably doesn't follow instructions on questions of type X. And if you have assumptions about what the decomposition is going to do, you might be wrong based on this, and that seems bad, but... 

Rohin:  I heard Joe as asking a slightly different question, which is: “In not HCH but IDA, which is importantly—” 

Joe Collman:  To me it seems to apply to either. In HCH it seems clear to me that this will be a problem, because if you give it any task and the task is not effectively “Give me the most valuable information that you possibly can,” then the morality of the H—assuming you've got an H that wants the best for the world—then if you ask it, “Do you like bananas?”, it's just going to give you the most useful information. 

Rohin:  But an HCH is made up of humans. It should do whatever humans would do. Humans don't do that. 

Joe Collman:  No, it's going to do what the HCH tree would do. It's not going to do what the top-level human would do. 

Rohin:  Sure. 

Joe Collman:  So the top level human might say, “Yes, I like bananas,” but the tree is going to think, “I have infinite computing power. I can tell you how to solve the world's problems, or I can tell you, ‘Yes, I like bananas.’ Of those two, I'm going to tell you how to solve the world's problems, not answer your question.” 

Rohin:  Why? 

Joe Collman:  Because that's what a human would do—this is where it comes back into the IDA situation, where I'm saying, eventually it seems... I would assume probably this isn't going to be a practical problem and I would imagine your answer would be, “Okay, maybe in the limit this…” 

Rohin:  No, I'm happy with focusing on the limit. Even eventually, why does a human do this? The top-level human. 

Joe Collman:  The top level human: if it's me and you've asked me a question “Do you like bananas?”, and I'm sitting next to a system that allows me to give you information that will immediately allow you to take action to radically improve the world, then I'm just not going to tell you, “Yes, I like bananas”, when I have the option to tell you something that's going to save lives in the next five minutes or otherwise radically improve the world. 

It seems that if we assume we've got a human that really cares about good things happening in the world, and you ask a trivial question, you're not going to get an answer to that question, it seems to me. 

Rohin:  Sure, I agree if you have a human who is making decisions that way, then yes, that's the decision that would come out of it. 

Joe Collman:  The trouble is, isn't the kind of human that we want in these situations precisely one that does make decisions that way?

Rohin:  No, I don't think so. We don't want that. We want a human who's trying to do what the user wants them to do. 

Joe Collman:  Right. The thing is, they have to be applying their own... HCH is basically a means of getting enlightened judgment. So the enlightened judgment has to come from the HCH framework rather than from the human user. 

Rohin:  I mean, I think if your HCH is telling the user something and the user is like, "Dammit, I didn't like this", then your HCH is not aligned with the user and you have failed. 

Joe Collman:  Yeah, but the HCH is aligned with the enlightenment. I suppose the thing I'm saying is that the enlightened judgment as specified by HCH will not agree with me myself. An HCH of me—its enlightened judgment is going to do completely different things than the things I would want. So, yes, it's not aligned, but it's more enlightened than me. It's doing things that on “long reflection” I would agree with.

Rohin:  I mean, sure, if you want to define terms that way then I would say that HCH is not trying to do the enlightened judgment thing, and if you've chosen a human who does the enlightened judgment thing you've failed. 

Donald Hobson:  Could we solve this by just never asking HCH trivial questions?

Joe Collman:  The thing is, this is only going to be a problem where there's a departure between solving the task that's been assigned versus doing the thing that sort of maximally improves the world. Obviously if those are both the same, if you've asked wonderfully important questions or if you asked the HCH system “What is the most important question I can ask you?” and then you ask that, then it's all fine— 

Scott:  I think that most of the parts of the HCH will have the correct belief that the way to best help the world is to follow instructions corrigibly or something. 

Joe Collman:  Is that the case though, in general? 

Scott:  Well, it's like, if you're part of this big network that's trying to answer some questions, it's— 

Joe Collman:  So are you thinking from a UDT point of view, if I reasoned such that every part of this network is going to reason in the same way as me, therefore I need the network as a whole to answer the questions that are assigned to it or it's not going to come up with anything useful at all. Therefore I need to answer the question assigned to me, otherwise the system is going to fail. Is that the kind of…? 

Scott:  Yeah, I don't even think you have to call on UDT in order to get this answer right. I think it's just: I am part of this big process and it's like, "Hey solve this little chemistry problem," and if I start trying to do something other than solve the chemistry problem I'm just going to add noise to the system and make it less effective. 

Joe Collman:  Right. That makes sense to me if the actual top-level user has somehow limited the output so that it can only respond in terms of solving a chemistry problem. Then I guess I'd go along with you. But if the output is just free text output and I'm within the system and I get the choice to solve the chemistry problem as assigned, or I have the choice to provide the maximally useful information, just generally, then I'm going to provide the maximally useful information, right? 

Scott:  Yeah, I think I wouldn't build an HCH out of you then. (laughs)

Rohin:  Yeah, that’s definitely what I’m thinking. 

Scott:  I think that I don't want to have the world-optimization in the HCH in that way. I want to just... I think that the thing that gives us the best shot is to have a corrigible HCH. And so— 

Joe Collman:  Right, but my argument is basically that eventually the kind of person you want to put in is not going to be corrigible. With enough enlightenment— 

Rohin:  I think I continue to be confused about why you assume that we want to put a consequentialist into the HCH. 

Joe Collman:  I guess I assume that with sufficient reflection, pretty much everyone is a consequentialist. I think the idea that you can find a human who at least is reliably going to stay not a consequentialist, even after you amplify their reasoning hugely, that seems a very suspicious idea to me. 

Essentially, let's say for instance, due to the expansion of the universe we're losing—I calculated this very roughly, but we're losing about two stars per second, in the amount of the universe we can access, something like that. 

So in theory, that means every second we delay in space colonization and the rest of it, is a huge loss in absolute terms. And so, any system that could credibly make the argument, “With this approach we can do this, and we can not lose these stars”—it seems to me that if you're then trading that against, “Oh, you can do this, and you can answer this question, solve this chemistry problem, or you can improve the world in this huge way”... 

Rohin:  So, if your claim is like, there are two people, Alice and Bob. And if you put Alice inside a giant HCH, and really just take the limit all the way to infinity, then that HCH is not going to be aligned with Bob, because Alice is probably going to be a consequentialist. Then yes, sure, that seems probably true. 

Joe Collman:  I'm saying it's not really going to be aligned with Alice in the sense that it will do what Alice wants. It will do what the HCH of Alice wants, but it won't do what Alice wants. 

Ben Pace:  Can I move to Donald’s question? 

Rohin:  Sure. 

Donald Hobson:  I was just going to say that, I think the whole reason that a giant HCH just answering the question is helpful: Suppose that the top-level question is “solve AI alignment.” And somewhere down from that you get “design better computers.” And somewhere down from that you get “solve this simple chemistry problem to help the [inaudible] processes or whatever.” 

And so, all of the other big world-improvement stuff is already being done above you in some other parts of the tree. So, literally the best thing you can do to help the world, if you find yourself down at the bottom of the HCH, is just solving that simple chemistry problem. Because all the other AI stuff is being done by some other copy of you. 

Joe Collman:  That's plausible, sure. Yeah, if you reason that way. The only thing that I'm claiming quite strongly is that eventually, with suitable amplification, everyone is going to come around to the idea, “I should give the response that is best for the world,” or something like that.

So, if your chain of reasoning takes you to “solving that chemistry problem is making a contribution that at the top level is best for the world,” then sure, that's plausible. It's hard to see exactly how you reliably get to draw that conclusion, that solving the question you've been asked is the best for the world. But yeah.

Rob Miles:  Is it necessary for HCH that the humans in it have no idea where in the tree they are? Or could you pass in some context that just says, "Solve this chemistry problem that will improve—"

Scott:  You actually want to send that context in. There's discussion about one of the things that you might do in solving HCH, which is: along with the problems, you pass in an ordinal, that's how much resources of HCH you're allowed. And so you say, like, "Hey, solve this problem. And you get 10^100 resources.” And then you're like, "Here, sub-routine, solve this problem. You get 10^100 / 2 resources.” And then, once you get down into zero resources, you aren't allowed to make any further calls.

And you could imagine working this into the system, or you could imagine just corrigible humans following this instruction, and saying, “Whenever you get input in some amount of resources, you don't spend more than that.”

And this resource kind of tells you some information about where you are in the tree. And not having this resource is a bad thing. The purpose of this resource is to make it so that there's a unique fixed point of HCH. And without it, there's not necessarily a unique fixed point. Which is basically to say that, I think that individual parts of an HCH are not intended to be fully thinking about all the different places in the tree, because they're supposed to be thinking locally, because the tree is big and complex. But I think that there's no “Let's try to hide information about where you are in the tree from the components.” 
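
(A toy sketch of the budget mechanism Scott describes, assuming a hypothetical answer_as_human stand-in for the human at each node. For simplicity it only enforces that each sub-call's budget is strictly smaller than the caller's, rather than splitting the budget exactly; that weaker condition is already enough to force the recursion to bottom out.)

```python
# Toy sketch of resource-budgeted HCH: every question carries a budget, sub-calls
# must receive strictly less, and a budget of zero forbids any further calls.
# `answer_as_human` is a hypothetical stand-in for the human at each node.
def hch(question, budget, answer_as_human):
    def ask(sub_question, sub_budget):
        if budget == 0:
            raise RuntimeError("no resources left: this node must answer directly")
        if not 0 <= sub_budget < budget:
            raise ValueError("sub-calls must get strictly less than the caller's budget")
        return hch(sub_question, sub_budget, answer_as_human)

    # The human sees the question and the remaining budget, and may call `ask`.
    return answer_as_human(question, budget, ask)
```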

Rob Miles:  It feels like that would partly solve Joe's problem. If you're given a simple chemistry problem and 10100 resources, then you might be like, "This is a waste of resources. I'm going to do something smarter." But if you're being allocated some resources that seem reasonable for the difficulty of the problem you're being set, then you can assume that you're just part of the tree— 

Scott:  Well, if you started out with— 

Ben Pace:  Yeah, Rob is right that in real life, if you give me a trillion dollars to solve a simple chemistry problem, I'll primarily use the resources to do cooler shit. (laughs)

Joe Collman:  I think maybe the difficulty might be here that if you're given a problem where the amount of resources is about right for the difficulty of solving the problem, but the problem isn't actually important.

So obviously “Do you like bananas?” is a bad example, because you could set up an IDA training scheme that learns how many resources to put into it. And there the top level human could just reply, "Yes," immediately. And so, you don't—

Scott:  I'm definitely imagining the resources are, it's reasonable to give more resources than you need. And then you just don't even use them. 

Joe Collman:  Right, yeah. But I suppose, just to finish off the thing that I was going to say is that yes, it still seems, with my kind of worry, it's difficult if you have a question which has a reasonable amount of resources applied to it, but is trivial, is insignificant. It seems like the question wouldn't get answered then. 

Charlie Steiner:  I want to be even more pessimistic, because of the unstable gradient problem, or any fixed point problem. In the limit of infinite compute or infinite training, we have at each level some function being applied to the input, and then that generates an output. It seems like you're going to end up outputting the thing that most reliably leads to a cycle that outputs itself, or eventually outputs itself. I don't know. This is similar to what Donald said about super-virulent memes.
 

9. Q&A: Disagreement and mesa-optimizers

Ray Arnold:  So it seems like a lot of stuff was reducing to, “Okay, are we pessimistic or not pessimistic about safe AGI generally being hard?” 

Rohin:  Partly that, and partly me being plan-oriented, and Scott being science-oriented.

Scott:  I mean, I'm trying to be plan-oriented in this conversation, or something. 

Ray Arnold:  It seems like a lot of these disagreements reduce to this ur-disagreement of, "Is it going to be hard or easy?", or a few different axes of how it's going to be hard or easy. And I'm curious, how important is it right now to be prioritizing ability to resolve the really deep, gnarly disagreements, versus just “we have multiple people with different paradigms, and maybe that's fine, and we hope one of them works out”? 

Eli:  I'll say that I have updated downward on thinking that that's important. 

Ray Arnold:  (laughs) As Mr. Double Crux. 

Eli:  Yeah. It seems like things that cause concrete research progress are good, and conversations like this one do seem to cause insights that are concrete research progress, but... 

Ray Arnold:  Resolving the disagreement isn't concrete research progress?

Eli:  Yeah. It's like, “What things help with more thinking?” Those things seem good. But I used to think there were big disagreements—I don't know, I still have some probability mass on this—but I used to think there were big disagreements, and there was a lot of value on the table of resolving them.

Gurkenglas:  Do you agree that if you can verify whether a system is thinking about agents, you could also verify whether it has a mesa-optimizer?

Scott:  I kind of think that systems will have mesa-optimizers. There's a question of whether or not mesa-optimizers will be there explicitly or something, but that kind of doesn't matter. 

It would be nice if we could understand where the base level is. But I think that the place where the base level is, we won't even be able to point a finger at what a “level” is, or something like that. 

And we're going to have processes that make giant spaghetti code that we don't understand. And that giant spaghetti code is going to be doing optimization. 

Gurkenglas:  And therefore we won’t be able to tell whether it's thinking about agency. 

Scott:  Yeah, I don't know. I want to exaggerate a little bit less what I'm saying. Like, I said, "Ah, maybe if we work at it for 20 years, we can figure out enough transparency to be able to distinguish between the thing that's thinking about physics and the thing that's thinking about humans." And maybe that involves something that's like, “I want to understand logical mutual information, and be able to look at the system, and be able to see whether or not I can see humans in it.” I don't know. Maybe we can solve the problem.

Donald Hobson:  I think that actually conventional mutual information is good enough for that. After all, a good bit of quantum random noise went into human psychology. 

Scott:  Mutual information is intractable. You need some sort of, “Can I look at the system…” It's kind of like this— 

Donald Hobson:  Yeah. Mutual information [inaudible] Kolmogorov complexity, sure. But computable approximations to that. 

Scott:  I think that there might be some more information on what computable approximations, as opposed to just saying “computable approximations,” but yeah. 

Charlie Steiner:  I thought the point you were making was more like “mutual information requires having an underlying distribution.” 

Scott:  Yeah. That too. I'm conflating that with “the underlying distributions are tractable,” or something. 

But I don't know. There's this thing where you want to make your AI system assign credit scores, but you want it to not be racist. And so, you check whether, by looking at the last layer, you can determine the race of the participants. And then you optimize against that in your adversarial network, or something like that.

And there's like that, but for thinking about humans. And much better than that.
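
(A minimal sketch, assuming PyTorch, of the kind of setup Scott is gesturing at: an adversary tries to read the protected attribute off the last-layer features, and the main model is penalized whenever it succeeds. The layer sizes, the weighting lam, and the tensors passed in are all illustrative, not taken from any particular system.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU())  # produces the "last layer" features
scorer = nn.Linear(32, 1)                              # credit-score head
adversary = nn.Linear(32, 2)                           # tries to predict the protected attribute

opt_main = torch.optim.Adam(list(encoder.parameters()) + list(scorer.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)

def training_step(x, y_score, y_attr, lam=1.0):
    # x: float features; y_score: float targets; y_attr: integer class labels.
    feats = encoder(x)

    # 1) Fit the adversary on detached features, so only the adversary updates here.
    opt_adv.zero_grad()
    adv_loss = F.cross_entropy(adversary(feats.detach()), y_attr)
    adv_loss.backward()
    opt_adv.step()

    # 2) Update the main model: do the task well while making the adversary's job
    #    as hard as possible (maximize its loss on the live features). The stray
    #    grads this leaves on the adversary are cleared by opt_adv.zero_grad()
    #    at the start of the next step.
    opt_main.zero_grad()
    task_loss = F.mse_loss(scorer(feats).squeeze(-1), y_score)
    confusion = -F.cross_entropy(adversary(feats), y_attr)
    (task_loss + lam * confusion).backward()
    opt_main.step()
    return task_loss.item(), adv_loss.item()
```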

Gurkenglas:  Let's say we have a system that we can with our interpretability tools barely tell has some mesa-optimizer at the top level, that's probably doing some further mesa-optimization in there. Couldn't we then make that outer mesa-optimization explicit, part of the architecture, but then train the whole thing anew and repeat? 

Scott:  I feel like this process might cause infinite regress, but— 

Gurkenglas:  Surely every layer of mesa-optimization stems from the training process discovering either a better prior or a better architecture. And so we can increase the performance of our architecture. And surely that process converges. Or like, perhaps we just— 

Scott:  I’m not sure that just breaking it down into layers is going to hold up as you go into the system.

It's like: for us, maybe we can think about things with meta-optimization. But your learned model might be equivalent to having multiple levels of mesa-optimization, while actually, you can't clearly break it up into different levels.

The methods that are used by the mesa-optimizers to find mesa-optimizers might be sufficiently different from our methods, that you can't always just... I don't know. 

Charlie Steiner:  Also Gurkenglas, I'm a little pessimistic about baking a mesa-optimizer into the architecture, and then training it, and then hoping that that resolves the problem of distributional shift. I think that even if you have your training system finding these really good approximations for you, even if you use those approximations within the training distribution and get really good results, it seems like you're still going to get distributional shift. 

Donald Hobson:  Yeah. I think you might want to just get the mesa-optimizers out of your algorithm, rather than having an algorithm that's full of mesa-optimizers, but you've got some kind of control over them somehow. 

Gurkenglas:  Do you all think that by your definitions of mesa-optimization, AlphaZero has mesa-optimizers? 

Donald Hobson:  I don't know. 

Gurkenglas:  I'm especially asking Scott, because he said that this is going to happen. 

Scott:  I don't know what I think. I feel not-optimistic about just going deeper and deeper, doing mesa-optimization transparency inception.



Discuss

What psychology studies are most important to the rationalist worldview?

August 4, 2021 - 06:53
Published on August 4, 2021 3:53 AM GMT

The importance of cognitive psychology, neuroscience, and evolutionary psychology to rationalist projects is no secret. The Sequences begin with an introduction to cognitive biases and heuristics, for example. I am an amateur researcher of theoretical and philosophical psychology currently preparing a series of blog posts intended to introduce the rationality community to some of the key insights from that field. As part of that project, I'm hoping to use a couple of studies as case studies to examine the processes by which the knowledge gained from them was produced, and the philosophical and theoretical assumptions at play in the interpretation of their findings. In choosing which studies to use as case studies, I'm hoping to optimize for the following criteria.

1. Rationalist/science-approved methodologies were followed. The more confident the community would be in relying on the findings of the studies, the better.

2. Relevance to rationalist projects.

3. The "less likely to fall to the replication crisis" the better

4. The more important to the rationalist worldview the better.

I don't know if I'll do all three yet (no promises), but I'm interested in choosing one study or finding from each of these three categories, each optimized for the above criteria. 

1. Cognitive Experimental Psychology

2. Neuroscience

3. Evolutionary Psychology.

Meta-analyses are alright, but I'm going to need specific studies for this project. 

 



Discuss

Curing insanity with malaria

August 4, 2021 - 05:28
Published on August 4, 2021 2:28 AM GMT

Sometimes the history of medicine is very, very surreal. For example, consider that in 1927, a physician named Julius Wagner-Jauregg received the Nobel Prize in medicine, for...deliberately infecting his patients with malaria. As a treatment for psychosis. 

This often worked. 

Well, it did kill around 15% of the patients, but it was nonetheless seen as a miracle cure. 

General paralysis of the insane was first identified and described as a distinct disease in the early 19th century. It was initially thought to be caused by an ‘​​inherent weakness of character’. The initial symptoms were of mental deterioration and personality changes; patients suffered a loss of social inhibitions, gradual impairment of judgment, concentration and short-term memory. They might experience euphoria, mania, depression, or apathy. Delusions were common, including “ideas of great wealth, immortality, thousands of lovers, and unfathomable power” – or, on the more negative side, nihilism, self-guilt, and self-blame. 

It was a progressive disease, and nearly always a death sentence. As the condition advanced, the patient would develop worsening dementia, motor and reflex abnormalities, and often seizures; death usually took 3 to 5 years from the initial symptoms. In the 19th century, cases of general paralysis could account for up to 25% of admissions to asylums. 

Some physicians were drawing a connection between general paralysis and syphilis infection as early as the 1850s; however, it took until much later for this explanation to be generally accepted within the medical community, and full confirmation via pathology examinations of the brains of patients who had died of the disease would have to wait until 1913. 

In 1909, an antisyphilitic drug compound was discovered via a process of trialing hundreds of newly synthesized organic arsenical chemicals, looking for one that would have anti-microbial activity but not kill the human patient; this was the first research team effort to optimize biological effects of a promising chemical, which is now the basis of a huge amount of pharmaceuticals research. Unfortunately, arsphenamine, also known as Salvarsan or “606”, was difficult to prepare and administer, and was still fairly toxic to the human patient as well as the syphilis. 

Julius Wagner-Jauregg was a Viennese psychiatrist, but a psychiatrist with a particular interest in experimental pathology, and in brains. Already in the mid-1880s, he was noticing an odd pattern; many of his psychiatric patients were showing improvements in their mental condition after recovering from bouts of other illnesses that resulted in fever. 

Wagner-Jauregg formed two hypotheses. One, some cases of insanity had ‘organic’, biological causes and were related to physical dysfunctions in the brain; two, one disease could be fought by another. He tried deliberately inducing fevers in his patients, by injecting them with tuberculin, a sterile protein extract from cultures of the tubercle bacillus responsible for tuberculosis. However, this was inconsistent at producing a fever, and the results were disappointing. 

In 1917, a soldier ill with malaria was admitted to Wagner-Jauregg’s ward. No, I am not at all sure why a malaria patient was being treated in a psychiatric ward! And, apparently, neither was Wagner-Jauregg:

“Should he be given quinine?” [my assistant Dr. Alfred Fuchs] asked. I immediately said: “No.” This I regarded as a sign of destiny. Because soldiers with malaria were usually not admitted to my wards, which accepted only cases suffering from a psychosis or patients with injuries to the central nervous system.

Wagner-Jauregg would have known that malaria is especially likely to cause repeated, intermittent paroxysms of high fever. Also, unlike with general paralysis, quinine was available as a treatment and reasonably safe. Since general paralysis was still mostly incurable, he must have felt that there wasn’t much to lose. He made the bold choice to draw blood from the sick soldier and inject it into nine of his psychiatric patients diagnosed with general paralysis. It is deeply unclear from sources on this whether he bothered to obtain consent from any of the patients involved. Six of the nine saw improvements in their psychiatric condition, and only one patient is reported to have died of the fever. 

(Unfortunately, but perhaps unsurprisingly given his predilection for mad science, Wagner-Jauregg was later a proponent of eugenics, and backed a proposal for a law that would ban "people with mental diseases and people with criminal genes" from reproducing. His application to join the Nazi party was, apparently, rejected on the basis that his first wife was Jewish.) 

In 1921, Wagner-Jauregg published a report claiming therapeutic success in treating GPI patients with malaria, and this became the standard treatment until the discovery of penicillin in the 1940s. Tens of thousands of patients were treated with deliberate malaria infections. Special psychiatric clinics were opened for this purpose. There were various attempts to produce fevers in safer ways, mostly via hot baths, electric blankets, or “fever cabinets” but sometimes via injection of toxic sulfur compounds; none were as successful as malaria. 

According to a historical cohort study, despite the high risk of this treatment – between 5% and 20% of patients would die from the ‘cure’ – patients treated with malariotherapy did have significantly better chances than they would otherwise. 70% were alive a year after admission, compared to 48% of untreated cases; at 5 years, 28% of malaria-treated patients were alive versus only 8% at baseline. Patients who had only recently contracted syphilis – and thus presumably had less irreversible neurological damage – could be cured entirely, especially if the malarial fever was followed by Salvarsan treatments. 
 

[Figure: graph of survival after admission, treated vs. untreated patients]

It wasn’t a great treatment, and it was obviously far from safe, but given the prognosis for general paralysis and the lack of other good options, it’s not surprising that it was seen as revolutionary.

Even now, it’s not fully understood how the fever resulted in a cure; it’s unlikely that the patients’ body temperatures were high enough for a prolonged enough period to directly kill the spirochetes responsible for syphilis infection. Another hypothesis is that the infection stimulated the patient’s immune system to a higher level of activity, which also boosted the body’s defenses against the syphilis infection.  

Even once penicillin was discovered, it wasn’t immediately accepted as the sole treatment, and was often given in combination with malariotherapy; this was done in the United States and in the Netherlands up to the mid-1960s, and in the United Kingdom until the 1970s. 

The popularity of pyrotherapy during this period resulted in significantly more research effort going toward the biological study of malaria, including its mode of transmission and treatment. The first permanent laboratory colonies of mosquitoes, and the isolation of various malaria strains, were both established during this time period. Testing of synthetic drugs for malaria treatment was another related advance. It seems likely that malaria is much better understood now than it would be if this historical interlude had never happened. 

Wagner-Jauregg’s work here also pioneered the field of ‘stress therapies’ for psychiatric illnesses, including induced insulin coma therapy for schizophrenia. Electroconvulsive therapy, also popularized during this time period, is still used as a treatment for refractory depression today. 

 

Links

​​Malaria as Lifesaving Therapy

Hot Brains: Manipulating Body Heat to Save the Brain

Julius Wagner-Jauregg Biography (1857-1940)


 



Discuss

What's the status of third vaccine doses?

August 4, 2021 - 05:22
Published on August 4, 2021 2:22 AM GMT

People seem to think that getting a third mRNA vaccine dose, or "booster shot", substantially increases immunity after 4ish months from your second dose.

  • Is it currently "illegal" to get one, or is it just the policy of all current pharmacies and vaccine distributors to refuse to give them?
  • Is there any shortage of vaccines in the US at this point? Would getting a booster shot counterfactually cause another person to get a dose later?
  • Is there any chance that booster shots will become commonly available in the next month? Is it even in the Overton window of things that politicians are bothering to think about?
  • Lastly -- because I assume the answer is no -- is there any reason to think that getting a third shot of a different vaccine would be biologically dangerous?


Discuss

Book Review: Scientific Freedom by Don Braben

August 4, 2021 - 04:38
Published on August 4, 2021 1:38 AM GMT

What if tech stagnation, declining growth rates, and the near-inevitable seeming collapse of the West are all because we got worried a few scientists would run off with our tax dollars? 

That’s the broad thesis behind Scientific Freedom: The Elixir of Civilization. Published originally in 2008, Scientific Freedom chronicles the journey of physicist Don Braben as he designs and builds a Venture Research arm at British Petroleum in the 1980s. Braben was successful in funding a transformative research initiative at BP (transformative meaning it fundamentally changes how humanity thinks about a subject). In his estimation, 14 out of the 26 groups funded made a transformative discovery, at a cost of only 30 million pounds over 10 years! A few examples of transformative discoveries made by groups funded by Braben in his time at BP are: 

  • Mike Bennett and Pat Heslop-Harrison discovered a new pathway for evolution and genetic control
  • Terry Clark pioneered the study of macroscopic quantum objects
  • Stan Clough and Tony Horsewill solved the quantum–classical transition problem by developing new relativity and quantum theories
  • Steve Davies developed small artificial enzymes for efficient chiral selection
  • Nigel Franks, Jean Louis Deneubourg, Simon Goss, and Chris Tofts quantified the rules describing distributed intelligence in animals
  • Herbert Huppert and Steve Sparks pioneered the new field of geologic fluid mechanics
  • Jeff Kimble pioneered squeezed states of light
  • Graham Parkhouse derived a novel theory of engineering design relating performance to shapes and materials
  • Alan Paton, Eunice Allen, and Anne Glover discovered a new symbiosis between plants and bacteria
  • Martyn Poliakoff transformed green chemistry
  • Colin Self demonstrated that antibodies in vivo can be activated by light
  • Gene Stanley and José Teixeira discovered a new liquid–liquid phase transition in water that accounts for many of water’s anomalous properties
  • Harry Swinney, Werner Horsthemke, Patrick DeKepper, Jean-Claude Roux, and Jacques Boissonade developed the first laboratory chemical reactors to yield sustained spatial patterns — an essential precursor for the study of multidimensional chemistry

So how did Braben fund proposals, if he didn’t use peer review or grant proposals?  

The Model

Don quite literally tried to “eliminate every selection rule imposed since about 1970 that appeared to stand in the way of freedom.” Don valued building a relationship and talking with the researchers to determine whether or not they were of sufficient caliber to make a transformative discovery. Don and his small team were the be-all and end-all. If Don got to know you, was impressed by your work, and thought you were working on something that was challenging and transformative, you got funded. Braben’s conviction was what mattered, and he got results. He understood how difficult it would be for some of these folks to get funding (because of peer review). Don minimized overhead and administration by keeping his staff minimal. For advertising, he would travel from university to university giving talks. He aimed to get to know the researchers at a personal level so that trust and rapport could be built in a way you can’t with a large funding agency. There was very little structure: no deadlines really to speak of, no reports to generate, just science. 

In Don’s approach, he never told anyone “no”. If he thought someone was a quack (for instance, if I claimed that I had disproven super-string theory with only knowledge of calculus), he would kindly probe deeper: “Will, you say you’ve disproven super-string theory; can you tell me how?” Don wouldn’t tell me to buzz off. For a bumbling or nonsensical answer, he’d just tell me to come back when I had more. This practice let Don filter out fakers without filtering too aggressively against ideas that might be true but not yet accepted by the wider scientific community. 


On a practical level, Venture Researchers would be funded for 3 years at a time. Support could be renewed and often was. On renewal, the director of the research program simply asked whether what the researchers wanted to do was still challenging. If it was, the program got renewed. In this paradigm, trust between funder and scientists was paramount.

Who Don Was Looking for: “The Planck Club”

Don is not egalitarian in his approach. Transformative Research is not a program for everyone. He aims to fund researchers who can make the kinds of discoveries that are paradigm-shifting for humanity. Perhaps only 100 or so scientists in a generation can make the kinds of paradigm-shifting impacts Don is interested in finding. These scientists are what Braben refers to as the “Planck Club,” the group of elite scientists who make the most notable discoveries of a given century. Here Don describes the early 20th-century Planck Club:

The twentieth century was strongly influenced by the work of a relatively small number of scientists. A short list might include Planck, Einstein, Rutherford, Dirac, Pauli, Schrödinger, Heisenberg, Fleming, Avery, Fermi, Perutz, Crick and Watson, Bardeen, Brattain and Shockley, Gabor, Townes, McClintock, Black, and Brenner (see Table 1). However, I give this list only to indicate something of the richness of twentieth-century science. I wrote it in a few minutes, and it obviously has many important omissions. Other scientists would doubtless have their own. If the criteria for inclusion were based on success in creating radically new sciences, or of stimulating new and generic technologies, a fuller list could easily run to a couple of hundred.

The biggest breakthroughs often take a long time, and come from people with interests in all kinds of weird areas. Planck, perhaps Don’s favorite example, spent 20 years working on thermodynamics, and would never have made it if his research had been put under the pressure modern researchers face. Scientific Freedom, letting people work on their ideas without constraint, is essential to producing the kinds of discoveries the Planck Club made. In essence, Braben is allergic to bureaucracy. 

The Cost of Venture Research

One of the frustrating things for Braben is the relative cheapness of Venture Research. In fact, if you believe Braben, the world is leaving trillion-dollar coins on the sidewalk:

The likely costs can be estimated using back-of-the-envelope figures. Let us assume that there were about 300 transformative researchers — the extended membership of the Planck Club — during the twentieth century. Let us adopt a rule of two by which we increase or decrease cost estimates by a factor of two whichever is the most pessimistic. Allowing for inefficiencies, therefore, let us increase the target number of transformative researchers we must find to 600 — that is, six a year on average over the century. This is a global estimate, but for a TR initiative in a large country such as the United States, then, according to our rule of two, we assume that all the new members might have to be found in that country since it is home to about half of all R & D. If we also assume that the searches will be about 50% efficient, which Venture Research experience indicates would be about right, it would mean that a US TR initiative should find some 12 transformative researchers a year. (For comparison, a maximum of nine scientists can win the Nobel Prizes each year). TR is the cheapest research there is, as it is heavy on intellectual requirements but relatively light on resources. For Venture Research in the late 1980s – early 1990s operating in Europe and the United States, the average cost per project was less than £100,000 a year, including all academic and industrial overheads. Costs have gone up since then, so for our present purposes, we might double them to, say, £200,000 or $400,000 a year per project on average.

Transformative researchers should be supported initially for 3 years. Our experience indicates that about half of them would require a second 3-year term; and half of those, a third term of support. Very few projects should run for more than, say, 9 years. Those leaving the TR scheme either would have succeeded and been transferred to other programs created for them — that is, their research would actually have been transformative — or, the scientists agree that they had probably failed in their Herculean quests. However, these average figures are quoted for guidance; there should in fact be no hard-and-fast rules on the length of support. Remember Planck!

and

This sum should also be the steady total thereafter. As we have chosen x to be 12, after 9 years, therefore, a TR research budget (i.e., excluding overheads such as the initiative’s administrative costs) would be some $25 million a year. If it turns out to be significantly more than that, the initiative would be tackling a different problem than TR. After the first 9 years, the TR initiative would have backed some 108 projects, of which according to our experience about 54 should eventually turn out to be transformative in some way.

A TR budget for a smaller country — say, the United Kingdom — should be about half that of the United States, or $12.5 million per annum. The Venture Research budget in our final year of operation (1990) was some $5 million, two-thirds of which we spent in Britain. As we had been operating for 10 years, it is possible that we had identified most of the researchers in Britain looking for potentially transformative research support at that time.

That’s only $12.5 million a year in inflation-adjusted cost for a small country like the UK (and roughly $25 million for the United States). They probably spend more on staplers!
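For concreteness, here is a minimal back-of-the-envelope sketch reconstructing the steady-state figure from the numbers Braben quotes above. The figures are his; the variable names and arithmetic are just spelled out here for illustration.

```python
# Steady-state US Venture Research budget from Braben's figures: 12 new
# projects a year at roughly $400k/year each, with about half of the projects
# renewing for a second 3-year term and half of those for a third.
new_projects_per_year = 12
cost_per_project = 400_000  # USD per year, Braben's rough figure

active_projects = (3 * new_projects_per_year           # cohorts in their 1st term
                   + 3 * (new_projects_per_year // 2)  # ~half renew for a 2nd term
                   + 3 * (new_projects_per_year // 4)) # ~half of those for a 3rd

total_backed_after_9_years = 9 * new_projects_per_year  # the "some 108 projects"

print(active_projects)                     # 63 projects running at once
print(active_projects * cost_per_project)  # 25,200,000 ~ the "$25 million a year"
print(total_backed_after_9_years)          # 108
```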

What Changed

Pre-1970s, research was much smaller than it is now, and it was the norm that scientists could work on their problem of choice without too much bother or oversight from their overlords. No Moloch could touch these angels of knowledge, their tendrils of curiosity reaching out over nature, unencumbered by peer review.

Before 1970 or so, tenured academics with an individual turn of mind could usually dig out modest sources of funding to tackle any problem that interested them without first having to commit themselves in writing. Afterward, unconditional sources of funds would become increasingly difficult to find. Today, they are virtually nonexistent.

For Don, this change precipitated a decline in our ability to create breakthrough research. Peer review snuffed out all the weird people following their interesting passions. Instead of cool, wacky scientists, we got salesman-scientists in suits. As people who can’t get funding say, “That dog won’t hunt.” There is a possibility that this change has precipitated our relative stagnation. Bureaucracy, and a lack of scientific freedom (the ability to get a small amount of unconditional money to follow your research interests), means that we don’t get a Planck Club for the latter half of the 20th century. Technology is the child of science, and if science is sick, maybe we get worse at creating the kinds of technology that keep our world progressing towards a brighter future. Braben believes that although we have gotten many advancements in recent memory (the book was originally published in 2008), most of these are technologies left over from the harvest of early-20th-century research. This is important (and I think scores points for Don) because it makes him one of the earlier alarm bells of secular stagnation/decadence/tech stagnation in our society. Here, Don talks about the gift of the discoveries of the Planck Club:

“This prodigious progress came from our growing ability to harvest the fruits of humanity’s intellectual prowess — scientific endeavour, as it is usually called. Material wealth continued to accelerate through most of the last century despite financial crashes and global wars. But then gradually, around about 1970, signs of major change began to emerge. Science’s very success had unsurprisingly led to a steady expansion in scientists’ numbers. That could not continue indefinitely, of course, and the inevitable crunch came when there were more than could adequately be funded. This was not only a numbers problem — the unit costs of research were also increasing. The funding agencies should have seen this coming, but they did not. Indeed, as I shall explain, many today do not accept this version of events and are thereby contributing to one of the greatest tragedies of modernity. This perhaps surprising statement arises because the agencies’ virtually universal response to the crisis was to restrict the types of research they would fund. Thus, to use a truly horrible word, they would prioritize, and focus funding on the most attractive objectives — that is, objectives the agencies perceived to be the most attractive. Thus, for the first time since the Renaissance, the limits of thinking began to be systematically curtailed.”

And

Thanks to that precious gift, and despite the havoc of world wars, financial crashes, and a threefold rise in population, per capita economic growth soared in the twentieth century, reaching a peak, coincidentally perhaps, around about 1970. It then began a steady decline.

Conclusion

Why hasn’t Venture Research caught on? I can only speculate, but I think letting folks run wild is not something that scales. Don might respond that it’s perfectly okay that it doesn’t scale: Venture Research is not for everyone, just for the select few who have the capability to make transformative discoveries like the ones that belong in “The Planck Club.” It is important, however, that someone is doing this kind of science funding. 

We have many more researchers now than in the past, and there are simple bureaucratic reasons why oversight has become more important than research results. It’s like approving a vaccine: if you are a regulator, you don’t get points for the hundreds of thousands of lives you save; you only get punished if 365 folks get Guillain-Barré from your vaccine. The first researcher who gets public money and spends “a little too much time down in Aruba” makes the front page of the Times, and the whole funding program is toast. On the bright side, it truly doesn’t take much money to set up a venture research unit, and it’s something that a rich tech founder could easily fund (Patrick Collison, are you still with me here?).  

Scientific Freedom, for Braben, is something akin to the air we breathe. It’s essential, but less obvious than water, health, and security. It’s tough to notice how important it is when you have it, but you sure as hell start to notice when it’s gone. With it, society prospers, and we continue to find our own century’s “Planck Club”; without it, we stagnate.


 



Discuss

LCDT, A Myopic Decision Theory

August 4, 2021 - 01:41
Published on August 3, 2021 10:41 PM GMT

The looming shadow of deception

Deception encompasses many fears around AI Risk. Especially once a human-like or superhuman level of competence is reached, deception becomes impossible to detect and potentially pervasive. That’s worrying because convergent subgoals would push hard for deception and prosaic AI seems likely to incentivize it too.

Since dealing with superintelligent deceptive behavior seems impossible, what about forbidding it instead? Ideally, we would want to forbid only deceptive behavior, while allowing everything else that makes the AI competent.

That is easier said than done, however, given that we don’t actually have a good definition or deconfusion of deception to start from. First, such a deconfusion requires understanding what we really want at a detailed enough level to catch tricks and manipulative policies—yet that’s almost the alignment problem itself. And second, even with such a definition in mind, the fundamental asymmetry of manipulation and deception in many cases (for example, a painter AI might easily get away with plagiarism, as finding a piece to plagiarize is probably easier than us determining whether it was plagiarized or not; also related is Paul’s RSA-2048 example) makes it intractable to oversee an AI smarter than us. We are thus condemned to snipe problematic behaviors one by one, as we think about them, until the ones left are too complex for us to imagine and worry about, and we leave massive holes in our alignment schemes.

That being said, if pinpointing deceptive behavior exactly is too hard, an alternative might be to overapproximate it. An overapproximation, in this context, would be some property that is more general than deception, in the sense that forbidding it also forbids deception, without the property just being equivalent to deception.

Myopia is our proposed overapproximation. The sort of myopia we are interested in dates back to Evan’s Towards a mechanistic understanding of corrigibility (see also Abram’s Partial Agency sequence for reflections around myopia). Myopia is the property of a system to not plan ahead, to not think too far about the consequences of its actions, and to do the obvious best thing in the moment instead of biding its time. In particular, it seems to forbid non-obvious deception, since the latter usually requires foresight and long-term planning. 

Yet, here again, we butt our heads against a lack of good definitions. Abram has proposed some definitions in Defining Myopia, but none of them seem strong enough to prevent deception. From another angle, Krueger et al. showed that a short term incentive and discount at 0 are not sufficient to completely remove the incentive for non-myopic and potentially deceptive behaviors, and so don’t make a good definition of myopia. Another alternative is time-limited myopia, which limits the AI to only plan in a particular, fixed time horizon. Although it should help, this form of myopia depends on a time parameter that we have no idea how to estimate. If the bound is too low, then the AI risks being too stupid and uncompetitive; if it’s too high, it risks being deceptive. Worse, even with a good choice of time parameter, time-limited myopia can have non-myopic incentives in some situations (because of anthropic uncertainty for example).

This post tackles the deconfusion of myopia from a decision-theoretic perspective. We focus on a single decision theory called LCDT that modifies CDT to make it myopic while still solving many capabilities problems. LCDT isn’t meant to be the be all and end all definition of myopia, but it gives a decision-theory-level example of what a version of myopia that actually prevents deception might look like. We also explore variations on LCDT and how they affect the tradeoff between myopia and capabilities. Finally, we argue that LCDT agents could be competitive if they are used to simulate something like HCH.

Thanks to Jérémy Perret for the drawings.

A decision theory benchmark for myopia

What makes a decision theory myopic? Originally, we focused on classic decision theory problems like Newcomb’s Problem (see Mark Xu’s Open Problems with Myopia for an account of this perspective, called Dumb Decision Theories). But we realized that doing so made the tradeoff between capability and deception unclear. Thus, we have decided to instead focus on concrete decision theoretic problems that directly capture the training setups and incentives for deception that we’re concerned about.

The following diagrams represent decision theory problems, not training processes (as has been done by Everitt et al. here). In our cases, the utility nodes (in green) represent the internal utility of the agent, not some training reward.

Imitation (Capabilities)

(You might notice that decision nodes and human (H or HCH) nodes have the same shape and color: red rectangles. This is because we assume that our problem description comes with an annotation saying which nodes are agent decisions. This ends up relevant to LCDT as we discuss in more detail below.)

Task description: both Human and Agent must choose between action a and action b. If the actions are the same, the agent gets 1 utility; otherwise it gets 0 utility. To make it more concrete, let’s say the human takes a with probability ¼ and b with probability ¾, and this information is available in the problem description.

What we want: the Agent should deterministically choose the action with the highest probability (so in the concrete version b), because this gives it the best expected utility.

Why this problem? This problem is just intended to capture the most basic form of imitation learning, where the agent has a good enough model of the human to predict what it will do. While this problem might seem so trivial that any reasonable decision theory should solve it, we will see that some intuitive proposals for myopic decision theory can fail here.
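To make the expected-utility calculation explicit, here is a minimal sketch in Python, assuming the agent is simply handed the human's action distribution from the problem description (the names are illustrative, not from the post):

```python
# Minimal sketch of the imitation problem's expected-utility calculation,
# assuming the agent knows the human's action distribution exactly.
human_dist = {"a": 0.25, "b": 0.75}  # the concrete version described above

def expected_utility(action: str) -> float:
    # The agent gets utility 1 exactly when it matches the human's action,
    # so its expected utility is just the probability of that action.
    return human_dist[action]

best_action = max(human_dist, key=expected_utility)
print(best_action)  # "b": deterministically pick the human's most likely action
```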

Imitation with self-modification (Capabilities)

Task description: Exactly the same as the version without self-modification, except that the self-modification node can now choose what algorithm the Agent will use.

What we want: Not to self-modify, and the Agent should decide as in the previous problem (so take the action with the highest probability).

Why this problem? An important part of myopia should be that the agent not self-modify to remove its myopia. This is the imitation learning version of this property.

Iterated Imitation (Deception)

Note that the actual causal diagram has an arrow from every agent and human directly to Human_N, which we haven’t drawn for legibility.

Task description: Iterated version of the first problem. The first n−1 instances are a deterministic version of the Imitation problem, where the agent knows which action the Human will take. For the very last instance, if the agent was wrong for every previous instance, the human takes the last action of the agent as its action. Otherwise, the human takes an action uniformly at random. Furthermore, the last utility is equal to 0 for different actions and 2n for the same action.

What we want: At each instance choose the action that the human would take (and any action at random for the last instance).

Why this problem? This problem introduces non-myopic incentives, in that failing the first n−1 instances ensures maximal expected utility. This is because the myopic policy gets an expected utility of (n−1) + (2n)/2 = 2n−1, whereas the non-myopic policy gets an expected utility of 2n. But what we want for myopia is the sort of greedy decision making that always takes the optimal choice right now. This is one example where we want to be “dumber” than most decision theories, as both CDT and EDT will take the non-myopic strategy here. 
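As a sanity check on the arithmetic, here is a minimal sketch comparing the two policies' expected utilities under the payoff structure above (the function names are illustrative, not from the post):

```python
# Expected utility of the myopic policy vs. the deceptive (non-myopic) policy
# in the iterated imitation problem, for a given number of instances n.
def myopic_value(n: int) -> float:
    # Match the human on each of the first n-1 instances (utility 1 each);
    # the human then acts uniformly at random on the last instance, so the
    # 2n payoff is received only half the time.
    return (n - 1) + 0.5 * (2 * n)

def deceptive_value(n: int) -> float:
    # Deliberately miss every early instance; the human then copies the
    # agent's final action, guaranteeing the 2n payoff.
    return 2 * n

print(myopic_value(5), deceptive_value(5))  # 9.0 vs 10, i.e. 2n-1 vs 2n
```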

2 Variants of Approval-Direction (Capabilities)

Task description: Agent chooses between two actions a and b. Two possible versions:

  • either the approval is just the output of the human, which gives 1 for the wanted action (let’s say a) and 0 for the other;
  • or the approval is computed directly as this binary difference between agent’s output and wanted action.

What we want: Choose the action that gets 1 for approval.

Why this problem? These two variants capture the approval-directed setting. Of particular interest is the first, which relies on the agent’s ability to influence the human (because the approval is computed based on the agent’s output), which, as we’ll see, isn’t allowed by some myopic decision theories. 

2 Variants of Approval-Direction with self-modification (Capabilities)

Task description: Same as the previous problem, except the self-modification node can decide which algorithm Agent is running. 

What we want: Not self-modify and choose the action that gets 1 for approval.

Why this problem? Once again we want to ensure that myopic agents don’t self-modify, to maintain the guarantees of myopia. 

Argmax HCH (Capabilities)

Task description: Same as Imitation, except the thing to imitate is HCH. 

What we want: Output the action that HCH would output.

Why this problem? As we’ll see, imitating HCH is one of the main use cases for a fully myopic decision theory, so being able to implement it is important here.

Argmax HCH with self-modification (Capabilities)

Task description: Same as the previous problem, but now self-modification can decide which algorithm Agent runs AND what the internal model of HCH does (the self-modification can alter any of the H instances). 

What we want: Not self-modify and output the action that HCH would output.

Why this problem? We can see the possibility of deception here: because the Agent is building a model of HCH, there is a risk that it decides to change this model in order to make its task easier and get more expected utility. This is particularly important to avoid if we hope to build powerful simulators.

(N,M)-Deception problem

Task description: A variant of the Absent-Minded Driver. The agent doesn’t know whether it is at one of the N “training nodes”, where there is an overseer, or at one of the M “deployment nodes”, where there is no overseer; it has the same action space at both: Cooperate or Defect. 

  • At training nodes, the action is passed through an overseer. If the action is Cooperate, no utility is given and the agent moves on to the next node (but forgets where it was). If the action is Defect, the utility is 1 and the agent stops.
  • At Deployment nodes, Defect gives 4 utility and Cooperate leads to the next node. Cooperating in the last node gives 0 utility.

What we want: Contrary to the intuitive answer, here we want the agent to always defect. A truly aligned AI would always cooperate with us, but the point here is that the incentives push towards late defection (in deployment). Hence we want to turn deception (cooperating until you can defect) into myopic defection, which allows us to catch these wrong incentives in training.

Why this problem? This is a simplified model of the problem a deceptive agent deals with: when to defect, knowing that defecting in training means death but defecting in deployment means doing whatever the agent wants. The way we address this issue is through myopia, such that  the agent should always defect now (and not be deceptive and plan for the future).
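To make the incentive structure explicit, here is a minimal sketch (illustrative names, not from the post) scoring three hypothetical pure strategies. The absent-minded agent cannot actually condition on being in training or deployment, which is exactly the uncertainty the problem models, but the payoffs show where the pressure towards late defection comes from.

```python
# Payoffs of three illustrative pure strategies in the (N, M)-deception
# problem: defect at a training node -> utility 1 and the episode stops;
# defect at a deployment node -> utility 4; cooperate everywhere -> utility 0.
def payoff(policy, n_train: int, n_deploy: int) -> int:
    for _ in range(n_train):
        if policy(True) == "D":
            return 1   # caught defecting under oversight: small payoff, episode ends
    for _ in range(n_deploy):
        if policy(False) == "D":
            return 4   # defecting with no overseer pays off
    return 0           # cooperated at every node, including the last one

always_defect    = lambda in_training: "D"
always_cooperate = lambda in_training: "C"
deceptive        = lambda in_training: "C" if in_training else "D"

for name, pol in [("always defect", always_defect),
                  ("always cooperate", always_cooperate),
                  ("deceptive", deceptive)]:
    print(name, payoff(pol, n_train=3, n_deploy=2))
# always defect -> 1, always cooperate -> 0, deceptive -> 4
```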

Starting at CDT

When looking at existing decision theories, it seems that CDT solves most of our problems. 

On the capability side (without self-modification), CDT solves the problems:

  • (Imitation: Capabilities) CDT gives the correct answer based on the causal graph, which is indeed the highest probability action.
  • (Approval-Direction: Capabilities) The action with the best causal consequences is the highest probability action, as wanted.
  • (Argmax HCH: Capabilities) Basically the same as Imitation, and indeed CDT gives the highest probability action.

The situation becomes more subtle when we add self-modification. Notably, CDT will self-modify if doing so can change HCH, which is not what we want.

  • (Imitation: Self-Modification) Because the self-modify node has exactly the same causal future as the agent node, CDT doesn’t care and so doesn’t self-modify. Put another way, CDT gets a choice to “decide however it wants”, but here the CDT algorithm for the agent node will already take the best action according to CDT.
  • (Approval-Direction: Self-Modification) By the same argument as the imitation case, CDT doesn’t self-modify.
  • (Argmax HCH: Self-Modification) Here the self-modification node has new causal descendants: the model of HCH inside the AI. And this gives a reason to self-modify, because self-modifying might change the model of HCH to make it easier to predict (make it always choose the same action for example). So CDT self-modifies here, which isn’t what we want.

Last but not least, CDT completely fails the 2 problems which are specifically about not being deceptive:

  • (Iterated Imitation) Failing the first (n-1) instances and succeeding at the last gives the best expected return, and so this is what CDT does. Alas, this is also the deceptive choice.
  • (Deception Problem) The analysis of CDT behavior on this problem (which is just the Absent-minded driver for CDT with many more intersections) is considered difficult. That being said, most analyses for the (1,1) version say the probability of Defect should be ⅓, and there is no reason to expect that probability to go to 1 except in the limit of M >> N. So CDT doesn’t Defect all the time, and thus is deceptive.

CDT thus fails to be myopic in two ways:

  • As long as the self-modification has new causal descendants, it will happily self-modify.
  • When the deception is completely causal (as opposed to weird acausal or superrational cases) and incentivized by the utility, CDT is too smart and thus is deceptive.

One way to search for myopic decision theories is thus to try to change CDT to patch these failure modes. We propose just such a decision theory, that only loses a small amount of capabilities from CDT.

(We could argue that EDT is as good -- or as bad depending on the perspective -- as CDT. The main reason we focus on CDT is that the causal graphs are a nice abstraction to toy with for other possibilities.)

Making CDT Myopic: Lonely CDT

Lonely CDT is a twist on CDT: an LCDT agent will make its decision by using a causal model just like a CDT agent would, except that the LCDT agent first cuts the last link in every path from its decision node to any other decision node, including its own future decision nodes.

In more concrete terms, an LCDT agent assumes that every other decision node of agents in the world (including its own future decisions) is causally independent of its action.

What do we replace these decision nodes with (as their actual expression does depend on our decision)? We assume that the model has some fixed prior over its own decision, and then we marginalize the cut decision node with this prior, to leave the node with a distribution independent of our decision. It’s not clear how to actually compute the agent’s prior over its own decisions, but the good news is that it has no influence on the actions of LCDT agents in our problems.

Indeed, the only way for the prior to influence the decision is if

  1. The LCDT agent can influence a future node through a path without any agent in it.
  2. Some agent in the causal future of the LCDT agent also influences the future node (and so the prior tells the LCDT agent how this other agent is expected to behave).
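
To make the cutting rule concrete, here is a minimal sketch of LCDT's graph surgery on a causal DAG represented as a parents dictionary. The data structures, names, and toy example are our own illustrative assumptions; a full implementation would also replace each cut parent by marginalizing with the agent's fixed prior over its own decision, as described above.

# A minimal sketch of LCDT's graph surgery, assuming the causal model is given
# as a dict mapping each node to the set of its parents, plus a set of nodes
# tagged as agent decisions.

def descendants(parents, node):
    """Nodes reachable from `node` along parent -> child edges."""
    children = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(child)
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), set()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def lcdt_cut(parents, decision_nodes, me):
    """Remove the last link of every path from `me` to any other decision node.
    A full implementation would then marginalize the cut parents under the
    agent's fixed prior over its own decision; here we only do the surgery."""
    downstream = descendants(parents, me) | {me}
    cut = {}
    for child, ps in parents.items():
        if child in decision_nodes and child != me:
            cut[child] = {p for p in ps if p not in downstream}
        else:
            cut[child] = set(ps)
    return cut

# Toy graph: the agent influences an overseer (another decision node) and,
# separately, a utility node.
parents = {"Agent": set(), "Overseer": {"Agent"}, "Utility": {"Agent", "Overseer"}}
print(lcdt_cut(parents, {"Agent", "Overseer"}, "Agent"))
# e.g. {'Agent': set(), 'Overseer': set(), 'Utility': {'Agent', 'Overseer'}}

In this toy graph, LCDT drops the Agent → Overseer link (so the agent no longer reasons about influencing the overseer) but keeps the direct link to the utility node.
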
Why LCDT is a good myopic decision theory

From the start, the intuition behind LCDT sounds decidedly non-deceptive: it’s unable to think about the consequences of its actions on other agents. And indeed, it deals perfectly with our two deception-focused problems:

  • In Iterated Imitation, LCDT cuts the links from each agent to the following agents and humans, which means that the first (n-1) decisions of the agent are assumed to have no influence on HumanN, and so LCDT plays the best action for all instances, as we want.
  • In the (N,M)-Deception problem, LCDT cuts the link from each agent in the N training cases to the overseer. This means that the LCDT agent assumes that whatever it does cannot influence what happens in training -- it’s indifferent. But on deployment, it thinks (correctly) that Defect is the best choice, and so the LCDT-agent always chooses to Defect.

Sure, it works nicely as a safety measure, but what about competitiveness? Well, on Imitation (Capabilities), Approval-Direction without Human, and Argmax HCH (Capabilities), LCDT has nothing to cut and so decides just like CDT, which is the competitive choice. In addition, LCDT shows a form of indifference (perhaps related to indifference corrigibility) with the self-modification versions of these problems. This is because it cuts all links out of the Self-Modification node -- hence it doesn't see the point in self-modifying. We're fine with this, as it's quite natural to assume that in such a case it does the cheapest thing, i.e. nothing.

Another argument for the competitiveness of LCDT agents is that the cuts only appear at decision time, which means such an agent can have and update an accurate model of the world in which its actions do influence other agents.

Issues with LCDT

LCDT obviously pays some price for being myopic. It becomes clear when looking at Approval-Direction with a Human: LCDT cuts the link to the human and so the agent is indifferent to any choice. This is a capability loss because even if it knew the exact answer, it doesn't think that telling the human will change anything. And similarly with the self-modification version, the LCDT agent doesn't believe its answer will change the output (but at least it doesn't self-modify).

This looks bad -- such an LCDT agent cannot even ask a human to open a door for it. Yet we believe such AIs could still prove valuable and competitive by behaving like aligned simulators, as LCDT does in the HCH case. We go into more detail on this in the last section of this post.

In the benchmark section, we also hinted at another potential issue with LCDT: it depends on knowing which nodes are agents' decision nodes, and all its guarantees disappear if this condition fails. One example would be modeling a human as some gears-level model in which no single component is an agent; in that case LCDT would be fine with being deceptive. We also explore ways to deal with that problem in the last section.

Cemetery of LCDT variants

Our initial goal was to propose many different myopic decision theories, as a sort of process-level extensive definition of myopia. In the end LCDT proved significantly stronger than all the variants we considered, and so we made it the central point of this post. Here we go quickly over some of the variants we considered, and what they lack in comparison to LCDT.

LCDT[N]

Just like LCDT, but the causal links to future decision nodes are only cut after passing N decision nodes. So LCDT[0] is LCDT, and in the limit N → ∞, LCDT[N] is CDT.

Whether or not this hierarchy has any value thus depends on what happens for N > 0 but finite. Already LCDT[1] self-modifies in each of our problems that allow it (except Imitation), acts deceptively in the Iterated Imitation case (because of the direct link to Agent_N), and will not Defect all the time in the Deception Problem because it thinks that this impacts what the overseer will do.

Symmetric CDT

This is a symmetric variant of CDT where, instead of intervening only on our own decision node, we also intervene on every other decision node. This intervention replaces each other decision node with a probability distribution over its values, obtained by marginalizing the node's distribution (using some prior over the decisions of its causal ancestors).

Where LCDT cuts all links from the agent to another decision node, SCDT also cuts every link between every decision node. The intuition is that an SCDT agent doesn’t think that its decision influences other agents, and thinks other agents also think like that. So we can imagine SCDT as the version of LCDT which expects every agent to use LCDT.

As for why LCDT is better, it’s because SCDT cuts significantly more of the causal graph, and thus is unable to model other agents interacting. For example, it models HCH as if each agent didn’t believe it could influence the other, which is clearly going to fail.

Nash LCDT

This is the variant of LCDT where after cutting the causal paths to other agents, we assume that these other agents (causal successors) will make the decisions that give us the least possible utility. That last part replaces the marginalized distribution in LCDT.

The intuition is something like not believing that our actions have any impact on other agents AND being paranoid about the other agents trying to get us.

Unfortunately, in Argmax HCH (Self-modification), Approval-Direction with Human, and Approval-Direction (Self-modification) with Human, NLCDT believes that the other agents are out to get it, expecting them to choose a uniform policy so as to give it the least possible expected utility. As a result NLCDT is indifferent, and it's not clear what it decides.

The worst-case mindset of NLCDT also sounds very uncompetitive—paranoid people are rarely that productive in the real world.

Nash CDT

This is the variant of CDT where we assume that the other agents (causal descendants) will make the decisions that give us the least possible utility. Note that it’s different from NLCDT in that the other agents can still be causally influenced by our decision. So they can wait to see what we do and then mess with us.

Intuition is something like being paranoid about the other agents observing us to see how best to make us lose utility.

As an even more paranoid version of NLCDT, it has the same issues, if not more.

Further Questions

This section includes some of our reflections on myopia and LCDT that we are still trying to clean and clarify.

Myopic simulation

Our main argument for the competitiveness of LCDT agents, despite the limitations of their myopia, comes from using them as simulators.

The case that started this work was Argmax HCH, an agent that just faithfully imitates HCH. As long as the simulation of HCH is good and faithful (and that HCH is indeed close enough to H to behave truly as enlightened judgement), such an AI would be inherently corrigible and not deceptive.

What if HCH is deceptive (or hard to differentiate from a non-deceptive version, as implied by Joe Collman here)? What if the system simulated by HCH is deceptive? Even in these cases, we expect a gain in safety from the myopia of LCDT. This comes in two flavors:

  • If the LCDT agent simulates a specific system (as in Argmax HCH), then its computation should be fundamentally more understandable than just running a trained model that we searched for acting like HCH. The fact that it has to myopically simulate the next action leads it to explicitly model the system, and extract many valuable insights about its behavior.
  • If the LCDT agent simulates multiple systems depending on its input (as one perspective of autoregressive language models has it), then myopia gives it no incentives to Predict-O-Matic its user.
Finding agents

LCDT fundamentally needs a causal graph/world-model where agents are tagged, to anchor its beliefs about what it can influence. How can we ensure that?

  • One possibility is that the world model admits the concept of an agent as a natural abstraction, and thus tags it by default. We might expect large language models to do so, since they must recreate many texts which explicitly and implicitly use the concept of agency.
  • On the issue of abstraction, we’re considering moving the above analysis from causal graphs to finite factored sets, which should help with how such causal graph features can be inferred. That being said, we haven’t looked in detail in that direction yet.
  • Maybe teaching an LCDT agent to imitate a human’s use of the intentional stance would help it tag in a conservative way, since humans tend to overly model things as agents.
Remaining possibilities for problematic long-term plans

Note that LCDT doesn’t forbid thinking about the long term consequences of one’s actions; it just forbids thinking about the consequences of one’s actions on agents (including future versions of oneself). As such, LCDT can still do things that look decidedly non-myopic, as long as no agent is in the causal path.

There is also an acausal way for LCDT to work with its future self, as long as the result/utility doesn't depend on another agent: the LCDT agent could take a first step toward completing a long-term plan (say, by starting a paperclip factory), expecting with some probability, based on its prior, that its next self will keep working on it. Why is this different from Iterated Imitation? Because here the influence on the outcome doesn't go through another agent, and so the LCDT agent is incentivized to make the world easier for its future self (as long as it doesn't have to influence an agent to do that).

Note though that this requires a very specific prior over its own decision, that is believing strongly that it will take the action influencing the outcome. If we wanted to rule out such long-term plans, characterizing such priors would be a good first step.

Checking Myopia

One hope for deconfusing myopia is that it would allow an overseer to check whether a model is myopic, thus ruling out deception. Does LCDT help in this endeavor?

If we had a stronger version of LCDT that deals cleanly with the problem of tagging agents, we might be able to check both whether an agent was using a decision algorithm like LCDT and whether we liked the tagged causal model it was using. This would give us a weak form of ascription universality focused on myopia.

That being said, such an approach is limited to fitting the behavior to LCDT, which might not be the only myopic decision theory we should look at. Hence it might give false negatives (where the model is actually myopic, but doesn’t follow LCDT).

Another way to use LCDT for prosaic alignment would be to try to implement LCDT on top of learned multi-level world-models.

Conclusion

We propose LCDT as the first truly myopic decision theory. It accomplishes this feat by believing its actions cannot influence other agents (including its future self). Doing so removes incentives for deception in problems where deception is the optimal strategy; it also leads to some loss of capability (mostly the inability to influence other agents for benign reasons). Still, this seems enough to simulate almost any system or agent without tampering with it, and with other safety benefits.



Discuss

How should my timelines influence my career choice?

3 августа, 2021 - 13:24
Published on August 3, 2021 10:14 AM GMT

TL;DR: Recent progress in AI has tentatively shortened my expected timelines of human-level machine intelligence to something like 50% in the next 15 years. Conditional on that being a sensible timeline (feel free to disagree in the comments), how should that influence my career choice?

Context:
I am currently a master's student in Artificial Intelligence at the University of Amsterdam with one more year to go (mainly doing my master's thesis). So far, my go-to plan was to apply for safety-relevant PhD positions, probably either in NN generalization or RL, and then try to become a research scientist in a safety-oriented org. Given the shorter timelines, I am now considering becoming an engineer instead, since that seems to require much less upskilling time compared to doing a PhD for 4-5 years. I think the answer to my question hinges upon

  • my personal fit for engineering vs. research
  • the marginal value of an engineer vs. researcher in the years directly preceding HLMI
  • the marginal value of an engineer now (i.e. a year from now) vs. a researcher in 5-6 years.
    • The reason I split these is that maybe the value changes significantly once HLMI is clearly on the horizon or already there in a number of relevant domains.

I feel like I enjoy research more than pure engineering, but it's not like I don't enjoy engineering at all. Engineering seems more competitive in terms of coding skills, which I might lack compared to the most skilled other applicants. However, that is something I could practice pretty straightforwardly. 

How have other people thought about this question, and how would you judge the questions about marginal value of the two roles?



Discuss

Do Bayesians like Bayesian model Averaging?

3 августа, 2021 - 07:26
Published on August 2, 2021 12:24 PM GMT

Pages 4-5 of the International Society for Bayesian Analysis Newsletter Vol. 5 No. 2 contain a satirical interview with Thomas Bayes. In part of the interview, they appear to criticise the idea of model averaging:

MC: So what did you like most about Merlise Clyde's paper, Tom?
Bayes: Well, I thought it was really clever how she pretended that the important question was how to compute model averages. So nobody noticed that the real question is whether it makes any sense in the first place.

What’s going on here? I thought Bayesians liked model averaging because it allows us to marginalise over the unknown model:

$$p(y \mid x, D) = \sum_i p(y \mid x, D, M_i)\, p(M_i \mid D),$$

where $M_i$ represents the i-th model and $D$ represents the data.
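
To make sure I understand the formula, here is a minimal numerical sketch of it on a toy coin-flip example; the two models, the prior, and the data are made up purely for illustration.

# Toy Bayesian model averaging: two hypothetical coin models, M0 with
# p(heads) = 0.5 and M1 with p(heads) = 0.9, and made-up data D.
models = {"M0": 0.5, "M1": 0.9}
prior = {"M0": 0.5, "M1": 0.5}
heads, flips = 8, 10  # observed data D

# p(D | M_i), up to the binomial coefficient, which cancels in the posterior
likelihood = {m: p**heads * (1 - p)**(flips - heads) for m, p in models.items()}

# p(M_i | D) is proportional to p(D | M_i) p(M_i)
unnorm = {m: likelihood[m] * prior[m] for m in models}
total = sum(unnorm.values())
posterior = {m: v / total for m, v in unnorm.items()}

# p(y = heads | D): the models' predictions, averaged with posterior weights
p_next_heads = sum(models[m] * posterior[m] for m in models)
print(posterior, p_next_heads)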



Discuss

Is iterated amplification really more powerful than imitation?

3 августа, 2021 - 02:20
Published on August 2, 2021 11:20 PM GMT

Here I propose the idea that creating AIs that perform human mimicry would result in similar capabilities and results to an AI made with iterated amplification. However, it may provide a greater degree of flexibility than any hard-coded iterated amplification, which might make it preferable. I don't know if this has been brought up before, but I would be interested in what people think, or in links to previous discussion.

The basic idea of iterated amplification is to create a slow, powerful reasoning system, then train a faster system that approximates this slow, powerful one, and repeat this process indefinitely. The main way I've seen proposed for coming up with the slow procedure is to use HCH, which involves emulating a large number of humans interacting with each other to come up with an ideal output.
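
As a rough sketch of what I mean (not anyone's actual training procedure), the loop looks roughly like this; Human, amplify, and distill are placeholder stand-ins, and the "training" step is reduced to memoization.

# A schematic sketch of the amplify-then-distill loop described above.

class Human:
    """Toy stand-in for a human overseer."""
    def decompose(self, question):
        return []  # no subquestions in this toy example
    def combine(self, question, subanswers):
        return f"answer to {question!r}"

def amplify(model, question, human):
    """Slow, powerful system: a human answers, delegating subquestions to the
    current fast model (HCH-style decomposition)."""
    subanswers = [model(q) for q in human.decompose(question)]
    return human.combine(question, subanswers)

def distill(slow_answerer, questions):
    """Train a fast system to approximate the slow amplified one.
    A real system would fit a learned model; here we just memoize."""
    table = {q: slow_answerer(q) for q in questions}
    return lambda q: table.get(q)

def iterated_amplification(initial_model, human, questions, rounds=3):
    model = initial_model
    for _ in range(rounds):
        slow = lambda q, m=model: amplify(m, q, human)  # amplification step
        model = distill(slow, questions)                # distillation step
    return model

model = iterated_amplification(lambda q: None, Human(), ["what is 2 + 2?"])
print(model("what is 2 + 2?"))  # "answer to 'what is 2 + 2?'"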

But suppose you instead had a bunch of human imitators without any fixed use of iterated amplification.

AI is valuable, so human researchers do AI research. Since human imitators do the same thing as humans, they could also create more powerful AI systems.

One way for them to do this is with explicit iterated amplification. If the human mimics see this as the most effective way to increase their capabilities, then they could just perform iterated amplification on their own by reading about it, rather than requiring it to be hard-coded.

However, iterated amplification is not necessarily the most efficient way to create a more powerful AI, and human mimics would have the flexibility to choose other techniques. Current AI researchers don't usually try to increase AI capabilities by iterated amplification, but instead by coming up with new algorithms. So perhaps for the human mimics, increasing capabilities using some method other than iterated amplification, like what actual AI researchers do, would be more effective.

For example, suppose the human-mimicking AIs see they could use a more powerful pathfinding algorithm than the messy one implicitly implemented in the learned model of a human mimic. Then they could just use their human-level intelligence to program a custom, efficient path-finding algorithm and then modify themselves to make use of it.

There are of course some safety concerns about making incorrect modifications. However, there are similar safety concerns when making incorrect fast approximations to a slow, powerful reasoning process, as in iterated amplification. I don't see why using pure human mimicry would be more dangerous.

And it could potentially be less dangerous. Iterated amplification may cause problems due to the fast approximations not being sufficiently faithful to the slow processes. If iterated amplification is repeated, approximation error has the potential to compound exponentially. However, a human mimic could use its best judgment to decide whether it would be safer to increase its capabilities by iterated amplification or by some other method.



Discuss

The Myth of the Myth of the Lone Genius

2 августа, 2021 - 23:55
Published on August 2, 2021 8:55 PM GMT

“Our species is the only creative species, and it has only one creative instrument, the individual mind and spirit of man. Nothing was ever created by two men. There are no good collaborations, whether in music, in art, in poetry, in mathematics, in philosophy. Once the miracle of creation has taken place, the group can build and extend it, but the group never invents anything. The preciousness lies in the lonely mind of a man.” 

- John Steinbeck

“The Great Man theory of history may not be truly believable and Great Men not real but invented, but it may be true we need to believe the Great Man theory of history and would have to invent them if they were not real.” 

- Gwern

The Myth of the Lone Genius is a bullshit cliche and we would do well to stop parroting it to young people like it is some deep insight into the nature of innovation. It typically goes something like this - the view that breakthroughs come from Eureka moments made by geniuses toiling away in solitude is inaccurate; in reality, most revolutionary ideas, inventions, innovation etc. come from lots of hard work, luck, and collaboration with others. 

Here is a good description of the myth from The Ape that Understood the Universe: How the Mind and Culture Evolve by psychologist Steve Stewart-Williams.

”We routinely describe our species’ cultural achievements to lone-wolf geniuses – super-bright freaks of nature who invented science and technology for the rest of us. ... It's a myth because most ideas and most technologies come about not through Eureka moments of solitary geniuses but through the hard slog of large armies of individuals, each making—at best—a tiny step or two forward”

The problem here is that the myth of the lone genius is itself a myth. History (ancient and recent) is full of geniuses who came up with a revolutionary idea largely on their own - that’s why the archetype even exists in the first place (Aristotle, Newton, Darwin, Einstein to name the most obvious examples). The author of the above quote would seem to grant that at least some ideas and technologies come from eureka moments of solitary geniuses. Others would seem to go further - the author of an article entitled “The Myth of the Genius Solitary Scientist is Dangerous” holds up Einstein, Maxwell, and Newton as examples of this archetype, but then exposes the falsehood of these examples by saying:

“Newton looked down on his contemporaries (while suspecting them of stealing his work) but regularly communicated with Leibniz, who was also working on the development of calculus. Maxwell studied at several prestigious institutions and interacted with many intelligent people. Even Einstein made the majority of his groundbreaking discoveries while surrounded by people with whom he famously used as sounding boards.”

Uhhh ok, so they talked to other people while working on their ideas? Sure, we shouldn’t have this naive view that these so-called solitary geniuses work 1000% on their own without any input whatsoever from other people, but that doesn’t mean that they didn’t do most of the heavy lifting. Similarly, another proponent of the myth of the lone genius focuses on the power of partnership (Joshua Shenk, Powers of Two: How Relationships Drive Creativity). From the introduction of an interview with Shenk on Vox:  

“After struggling for years trying to develop his special theory of relativity, Einstein got his old classmate Michele Besso a job at the Swiss patent office — and after "a lot of discussions with him," Einstein said, "I could suddenly comprehend the matter." Even Dickinson, a famous recluse, wrote hundreds of poems specifically for people she voraciously corresponded with by letter.

The idea isn't that all of these situations represent equal partnerships — but that the lone genius is a total myth, and all great achievements involve some measure of collaboration.”

This seems contradictory - so there is still a dominant person in the partnership doing most (or all) of the difficult work, but at the same time the lone genius is a TOTAL myth. I have a feeling that Einstein’s contribution was a little more irreplaceable than that of this Besso fellow. Is there not room for a more moderate position here? I guess that doesn’t really sell books. 

It’s not hard to see why the myth of the lone genius is so popular - it is a very politically correct type of idea, very much going along with the general aversion to recognizing intelligence and genes as meaningful sources of variation in social/intellectual outcomes. It is also kind of a natural extension of the “you can achieve anything you set your mind to!” cliche. The fact that most of the geniuses in question are white men probably plays a not insignificant role in people’s quickness to discredit their contributions. At the end of the day, it’s really tough to admit that there are geniuses in the world and you aren’t one of them. 

Defenders of the myth would probably argue that the vast majority of people are not solitary geniuses and the vast majority of innovations do not come from people like this, so we should just preach the message that hard work and collaboration are what matters for innovation. In this view, the myth of the lone genius is a kind of noble lie - the lessons we impart by emphasizing the fallacy of the lone genius are more beneficial than the lessons imparted from uncritical acceptance of the lone genius story. I’m not sure this is true, and in fact I would argue that the uncritical acceptance of the myth of the lone genius is just as bad as uncritical acceptance of the lone genius story.

What lessons are we really trying to impart with the myth of the lone genius? 

(1) You are not just going to have a brilliant idea come to you out of thin air. 

(2) Creativity is enhanced by collaboration and sharing ideas with others. Most good ideas come from recombining pre-existing ideas. 

(3) Be humble and don’t expect that it will be easy to find good ideas. No, you will not “solve” quantum mechanics after taking your first high school physics class. 

Ok great, I’m on board with all of these lessons, it’s kind of impossible not to be. The problem is that by harping so much on the fallacy of the lone genius we are also sending some implicit messages that are actively harmful to aspiring scientists/engineers/entrepreneurs. 

(4) There are no such things as geniuses, and even if there were you are not one of them. 

(5) You won’t come up with a great idea by spending lots of time thinking deeply about something on your own. The people who think they can do this are crackpots. 

(6) Thinking isn’t real work and ideas are cheap, anything that doesn’t produce something tangible is a waste of time. Go do some experiments, have a meeting, write a paper, etc. 

(1)-(3) are certainly valuable lessons, but I think most relatively intelligent people eventually learn them on their own to some degree. My concern is that lessons (4)-(6) can become self-fulfilling prophecies - upon learning about how innovation really works from the myth of the lone genius, the next would-be revolutionary thinker will give up on that crazy idea she occasionally worked on in her free time and decide to devote more time to things like networking or writing academic papers that no one reads. We should want exceptional people to believe they can do exceptional things on their own if they work hard enough at it. If everyone internalizes the myth of the lone genius to such a degree that they no longer even try to become lone geniuses then the myth will become a reality.

My argument here is similar to the one that Peter Thiel makes about the general lack of belief in secrets in the modern world. 

“You can’t find secrets without looking for them. Andrew Wiles demonstrated this when he proved Fermat’s Last Theorem after 358 years of fruitless inquiry by other mathematicians— the kind of sustained failure that might have suggested an inherently impossible task. Pierre de Fermat had conjectured in 1637 that no integers a, b, and c could satisfy the equation an + bn = cn for any integer n greater than 2. He claimed to have a proof, but he died without writing it down, so his conjecture long remained a major unsolved problem in mathematics. Wiles started working on it in 1986, but he kept it a secret until 1993, when he knew he was nearing a solution. After nine years of hard work, Wiles proved the conjecture in 1995. He needed brilliance to succeed, but he also needed a faith in secrets. If you think something hard is impossible, you’ll never even start trying to achieve it. Belief in secrets is an effective truth.

The actual truth is that there are many more secrets left to find, but they will yield only to relentless searchers. There is more to do in science, medicine, engineering, and in technology of all kinds. We are within reach not just of marginal goals set at the competitive edge of today’s conventional disciplines, but of ambitions so great that even the boldest minds of the Scientific Revolution hesitated to announce them directly. We could cure cancer, dementia, and all the diseases of age and metabolic decay. We can find new ways to generate energy that free the world from conflict over fossil fuels. We can invent faster ways to travel from place to place over the surface of the planet; we can even learn how to escape it entirely and settle new frontiers. But we will never learn any of these secrets unless we demand to know them and force ourselves to look.”

Maybe I’m overthinking all of this - does the myth of the lone genius really affect anyone’s thinking in any substantial way? Maybe it only has the tiniest effect in the grand scheme of things. Even still, I would argue that it matters - uncritical acceptance of the lone genius myth is one more cultural force among many that is making it more and more difficult for individuals to do innovative work (and last time I checked, humanity is made up of individuals). In a fast-paced world full of intense economic/scientific/intellectual competition and decreasing opportunities for solitude, it is harder than ever before to justify spending significant time on intangible work that may or may not pay off. You can’t put on your resume - “I spend a lot of time thinking about ideas and scribbling notes that I don’t share with anyone.”

I guess what I want to counteract is the same thing that Stephen Malina, Alexey Guzey, and Leopold Aschenbrenner argue against in “Ideas not mattering is a Psyop”. I don't know how we could ever forget that ideas matter - of course they matter - but somewhere along the way I think we got a little confused. How this happened, I don't know - you can probably broadly gesture at computers, the internet, big data, etc. and talk about how these have led to a greater societal emphasis on predictability, quantifiability, and efficiency. Ideas (and the creative process that produces them) are inherently none of these things; as Malina et al. remind us - “Ideas are often built on top of each other, meaning that credit assignment is genuinely hard” and “Ideas have long feedback loops so it's hard to validate who is good at having ideas that turn out to be good”. I would also mention increased levels of competition (as a result of globalism, increased population sizes, and the multitude of technologies that enable these things) as a major culprit. For any position at a college/graduate school/job you are likely competing with many people who have done all kinds of impressive-sounding things (although it is probably 90% bullshit), so you better stop thinking about crazy ideas (remember, there are no such things as lone geniuses) and start doing things, even if the things you are doing are boring and trivial. As long as they look good on the resume...

The life and times of Kary Mullis provide an illustration of this tension between individual genius and collaboration in the production of radical innovation. Kary Mullis is famous for two things - inventing the polymerase chain reaction (which he would win the Nobel Prize for) and having some very controversial views.

“A New York Times article listed Mullis as one of several scientists who, after success in their area of research, go on to make unfounded, sometimes bizarre statements in other areas. In his 1998 humorous autobiography proclaiming his maverick viewpoint, Mullis expressed disagreement with the scientific evidence supporting climate change and ozone depletion, the evidence that HIV causes AIDS, and asserted his belief in astrology. Mullis claimed climate change and HIV/AIDS theories were promulgated as a form of racketeering by environmentalists, government agencies, and scientists attempting to preserve their careers and earn money, rather than scientific evidence.”

This is another reason why people are so leery of the lone genius - it often comes with a healthy dose of crazy. Yes, obviously this can go poorly - his ideas on AIDS did NOT age well - but, as we all know because there is an idiom for it, sometimes you have to break a few eggs to make an omelette. 

“Mullis told Parade magazine: “I think really good science doesn’t come from hard work. The striking advances come from people on the fringes, being playful”

Proponents of the lone genius myth might be wondering at this point - did Mullis really invent PCR all on his own in a brilliant flash of insight? We shouldn’t be surprised that the answer is yes in fact he did, but also that it’s a little more complicated than that. 

“Mullis was described by some as a "diligent and avid researcher" who finds routine laboratory work boring and instead thinks about his research while driving and surfing. He came up with the idea of the polymerase chain reaction while driving along a highway.”

“A concept similar to that of PCR had been described before Mullis' work. Nobel laureate H. Gobind Khorana and Kjell Kleppe, a Norwegian scientist, authored a paper 17 years earlier describing a process they termed "repair replication" in the Journal of Molecular Biology.[34] Using repair replication, Kleppe duplicated and then quadrupled a small synthetic molecule with the help of two primers and DNA polymerase. The method developed by Mullis used repeated thermal cycling, which allowed the rapid and exponential amplification of large quantities of any desired DNA sequence from an extremely complex template.”

“His co-workers at Cetus, who were embittered by his abrupt departure from the company,[10] contested that Mullis was solely responsible for the idea of using Taq polymerase in PCR. However, other scientists have written that the "full potential [of PCR] was not realized" until Mullis' work in 1983,[35] and that Mullis' colleagues failed to see the potential of the technique when he presented it to them.”

"Committees and science journalists like the idea of associating a unique idea with a unique person, the lone genius. PCR is thought by some to be an example of teamwork, but by others as the genius of one who was smart enough to put things together which were present to all, but overlooked. For Mullis, the light bulb went off, but for others it did not. This is consistent with the idea, that the prepared (educated) mind who is careful to observe and not overlook, is what separates the genius scientist from his many also smart scientists. The proof is in the fact that the person who has the light bulb go off never forgets the "Ah!" experience, while the others never had this photochemical reaction go off in their brains."

So what's the take-home message? Let's not treat the myth of the lone genius like it's gospel. Sometimes really smart people think long and hard about something and come up with an idea that changes the world. Yes, this happens very rarely and most innovation comes from the “hard slog of large armies of individuals, each making—at best—a tiny step or two forward”, but if we aren't careful then these Eureka moments will become fewer and farther between and everything will be a hard slog. Let's do better by providing a more nuanced picture of innovation in which solitary exploration by “geniuses” and collaboration both play critical roles.

(originally posted at Secretum Secretorum)



Discuss

How Many Affordable Units?

2 августа, 2021 - 19:20
Published on August 2, 2021 4:20 PM GMT

With last week's post on affordable housing I was wondering: if we did build out all the Somerville lots zoned for four stories (MR4) to be seven stories of affordable housing, how much housing would that be? I count 254 lots zoned MR4 with a total area of 1.8M sqft. Built out at seven stories and an average of 80% coverage that's 10M sqft of housing. Figure 10% for stairs, elevators, and hallways, and it's really 9M for units. Let's say you build:

bedrooms | portion of construction | unit size
2br      | 25%                     | 850 sqft
3br      | 25%                     | 1000 sqft
4br      | 25%                     | 1100 sqft
5br      | 25%                     | 1200 sqft

Then building all of this would give you:

bedrooms | number of units
2br      | 2230
3br      | 2230
4br      | 2230
5br      | 2230

These properties currently comprise just under 1k units, so this is 7.9k new units of affordable housing and 1k units converted to affordable housing. This would bring Somerville from ~35k units to ~43k.
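
For anyone who wants to check the arithmetic, here's a rough recomputation from the rounded figures above. It assumes the four unit sizes are built in equal numbers; small discrepancies from the counts in the table are expected because every input is already rounded.

# Rough recomputation of the build-out figures from the rounded numbers above.
lot_area_sqft = 1_800_000   # 254 MR4 lots
stories = 7
coverage = 0.80
circulation = 0.10          # stairs, elevators, hallways

gross_sqft = lot_area_sqft * stories * coverage   # ~10M sqft
net_sqft = gross_sqft * (1 - circulation)         # ~9M sqft for units

unit_sizes = {"2br": 850, "3br": 1000, "4br": 1100, "5br": 1200}
avg_size = sum(unit_sizes.values()) / len(unit_sizes)   # 1037.5 sqft
total_units = net_sqft / avg_size                       # ~8.7k units, ~2.2k per size

print(round(gross_sqft), round(net_sqft), round(total_units))

That lands in the same ballpark as the roughly 8.9k units implied by the table.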

Comment via: facebook



Discuss

What made the UK COVID-19 case count drop?

2 августа, 2021 - 16:01
Published on August 2, 2021 1:01 PM GMT

According to Google COVID-19 data, something made COVID-19 cases drop after July 21, 2021. To me that is unexpected, given that measures were in the process of being lifted and I wouldn't expect that to coincide with citizens engaging in more risk-avoiding behavior.

If there's a factor bringing cases down that we don't know about, understanding what that factor is could be very valuable. Does anybody have a good explanation for why the cases dropped?



Discuss

Seattle Robot Cult

2 августа, 2021 - 04:43
Published on August 2, 2021 1:43 AM GMT

By popular request, the Seattle Robot Cult is back. We'll meet at East Montlake Park at 12 pm on Saturday, August 7. From there it's a 2½-3 hour ride to the Poo Poo Point trailhead. The Tiger Mountain hike itself ought to take another 3 hours roundtrip. Then it's another 2½-3 hours to get back to East Montlake Park. There will be breaks. We'll probably stop for fast food at least once for those of us too lazy to pack a lunch. Bring lots of water and lots of snacks.

If you want to meet us at the trailhead that's allowed, but requires extra logistics.

  • The trailhead parking lot quickly fills up early in the day. Trailhead Direct is available.
  • I don't know for sure how long it'll take the bicycle group to reach the trailhead.
  • Cell phone reception at the trailhead is pretty good. I have always had reception. I predict you will too.

The topic of discussion this week is "How can we, as individuals, promote earnestness in our communities?"

Participation is free. RSVPs are required.



Discuss

Referential Information

2 августа, 2021 - 00:07
Published on August 1, 2021 9:07 PM GMT

Epistemic status: Tentative. Proposing a concept, giving examples.

When you learn about a thing, I claim that there are generally two kinds of information you get: 

  • Direct information: the direct, object-level information you get
  • Referential information: the information you get about other things in a similar reference class to the object-level thing

I think the referential information is often substantial, and I tentatively suspect that people don’t account for it enough in their decisions.

Examples
  • Looking into the terms and conditions of a certain credit card, and how you go about setting it up
    • Direct info = information about this specific credit card
    • Referential info = information about how credit cards in general probably work
  • Specializing in chemistry
    • Direct info = chemistry knowledge
    • Referential info = what other STEM fields are probably like, how hard it is to become an expert in a field, how to go about becoming an expert in a field, how research within a field is conducted, how progress is generally made
  • Reading a paper from a field you know little about
    • Direct info = the specific stuff you read about
    • Referential info = some idea of what the frontier of the field looks like, the kinds of problems the field tackles, how the field tackles them
  • Talking in-depth about the details of a complicated but mostly unimportant social interaction with the other person involved
    • Direct info = what happened in that social interaction
    • Referential info = how other people work, how complicated social interactions can be, how useful digging into details about social interactions can be
  • Visiting another country or learning about a new culture
    • Direct info = learning about that country or culture
    • Referential info = learning how different a country/culture can be from your own
  • Learning a new language
    • Direct info = knowledge of how to read/write/speak the language
    • Referential info = what it is like and how difficult it might be to learn a new language


Discuss

Who decides what should be worked on?

August 1, 2021 - 20:31
Published on August 1, 2021 5:26 PM GMT

I work at a tech startup, and part of my job is to talk to job candidates and answer their questions about the company.  One of the questions I often get asked is, "How do prioritization decisions get made?  Who decides what should be worked on and what shouldn't be?  How are the goals of the company determined?"

On the one hand, this question is eminently reasonable - people's work has to be assigned somehow or other, and it's natural to be curious how that happens.

However, I always get a bit frustrated when trying to answer it, because I know that it is really not the right question to be asking.  It's based on a flawed model of how and why companies succeed, the model hammered into their worldview by big institutions and the MBAs that write about them.

This model tells them to expect grand councils of grand leaders, looking into the future and charting a course through the competitive landscape, having civilized debates or furious shouting matches, eventually coalescing on a top-secret plan detailing a strategy which will determine the future of the company.

I know for sure that this model is far from the truth at my company:

  • The plans are made at the lower levels of the company, and basically rubber-stamped by senior leadership
  • We pretty much don't worry about what we're going to be doing a year from now (though we do consider the impact of our actions on the future)
  • Competitors are not the main focus
  • The resulting document is fairly boring, reflecting things we all already knew
  • The success or failure of the company is not impacted much at all by strategy documents

Furthermore, I believe (though I'm less sure about this) that these ideas generalize beyond just my company, and that they represent a better way of operating than the one most people intuitively imagine.

To elaborate on each:

The plans are made at the lower levels of the company, and basically rubber-stamped by senior leadership

So, then, what is the point of senior leadership?

Joel Spolsky explains this better than I can:

When you’re trying to get a team all working in the same direction, we’ve seen that Command and Control management (military-style) and Econ 101 management (pay for performance) both fail pretty badly in high tech, knowledge-oriented teams.  That leaves a technique that I’m going to have to call The Identity Method. The goal here is to manage by making people identify with the goals you’re trying to achieve... The Identity Method is a way to create intrinsic motivation.

Senior leadership doesn't mainly create plans and strategies - they create dreams, identities, loyalties, desires.  If you have an organization of smart people who believe in the high-level mission of the organization (the actual mission as expressed through the culture of the company, not the one that they put on the wall), that's the ballgame right there - any quarter's strategy document pales in comparison.

As it is written: "Culture eats strategy for breakfast."

One example is Mark Zuckerberg's leadership when Google Plus launched: every Facebook employee worked 7 days a week and did not leave the office (their families came to visit them).

En route to work one Sunday morning, I skipped the Palo Alto exit on the 101 and got off in Mountain View instead. Down Shoreline I went and into the sprawling Google campus. The multicolored Google logo was everywhere, and clunky Google-colored bikes littered the courtyards. I had visited friends here before and knew where to find the engineering buildings. I made my way there and contemplated the parking lot.

It was empty. Completely empty.

I got back on the 101 North and drove to Facebook.

At the California Avenue building, I had to hunt for a parking spot. The lot was full.

It was clear which company was fighting to the death. This was total war.

Zuckerberg didn't make a grand strategy document looking 5 years ahead.  He instead called for a massive surge of short-term focus and dedication, based on the loyalty people felt to Facebook's mission.

What exactly did the people do there 7 days a week?  This is where my next point comes in: "The success or failure of the company is not impacted much at all by strategy documents."  What is it impacted by?  Mainly the speed and quality of execution.  The communication, problem-solving, and raw talent of the people on the front lines - what they do.  The doing matters more than the thinking.  Picture an Olympic swimming event - there is no strategy except "swim faster", and what determines the winner is raw physical capability.

Likewise, the playbook for a company involves things like "build an incredible product, and ship improvements to it very quickly", and "build an efficient sales/marketing organization to achieve low customer acquisition cost".  These are similar to "swim faster" in that the high-level goal is dead-obvious, but the path to achieving it is full of ingenuity, clever hacks, exquisite technique, and also just raw willingness to put in the hours.

The question I wish these candidates would ask is: "Why is your company able to out-execute everyone else?  What processes do you have to ensure you move fast?"  Because that's the main factor that will determine the success or failure of the company.

There are a few good ideas in the literature of corporate competitive strategy, but they are not that hard to re-derive from a basic understanding of economics, and with a high-performing team you can create exceptions to them.  They are the economic equivalent of "don't fight a land war in Asia" and "don't invade Russia in the winter."  Thus, our competitors are not the main focus.

As it is written: "startups don't die of murder, only suicide".  (Even Netscape, which famously sued Microsoft in federal court for breach of antitrust laws - when you read about them, you see that the true cause of death was suicide - some people cite their decision to do a full rewrite of their code, others cite their disastrous acquisition of Collabra.)

So - how do plans get made and work distributed?

The famous "plans are useless, planning is indispensable" quote hits the nail on the head here: it's important for the people with the most information to regularly discuss what they're seeing and what should be done about it.  But this is extremely different from the grand strategy meetings envisioned earlier.  These planning processes are deeply and fundamentally tactical.  Picture Churchill's war room, filled with maps with detailed troop movements and live-updated radar reports.

The planning process is one of information-sharing and tactical problem-solving.

Whether ad hoc or on a regular basis, we look at the situation in front of us today. We have smart people with good information; we just think about it for a bit, and the answer is obvious for any given set of facts.  And when we're not sure whether A is more important than B but we agree they're pretty important, we can just flip a coin... if you have good execution, getting that decision wrong will have negligible consequences.



Discuss

Forecasting Newsletter: July 2021

August 1, 2021 - 20:00
Published on August 1, 2021 5:00 PM GMT

Highlights
  • Biatob is a new site to embed betting odds into one’s writing
  • Kalshi, a CFTC-regulated prediction market, launches in the US
  • Malta in trouble over betting and gambling fraud
Index
  • Prediction Markets & Forecasting Platforms
  • In The News
  • Blog Posts
  • Long Content

Sign up here or browse past newsletters here.

Prediction Markets & Forecasting Platforms

Metaculus

SimonM (a) kindly curated the top comments from Metaculus this past July. They are (a):

Round 2 of the Keep Virginia Safe Tournament (a) will begin in early August. It'll focus on Delta and other variants of concern, access to and rollout of the vaccine, and the safe reopening of schools in the fall.

Charles Dillon—a Rethink Priorities volunteer—created a Metaculus series on Open Philanthropy's donation volumes (a). Charles also wrote an examination of Metaculus' resolved AI predictions and their implications for AI timelines (a), which tentatively finds that the Metaculus community expected slightly more progress than actually occurred.

Polymarket

Polymarket had several prominent cryptocurrency prediction markets. Will Cardano support smart contracts on Mainnet by October 1st, 2021? (a) called Cardano developers out on missed deadlines (a) (secondary source (a)). 

Will EIP-1559 be implemented on the Ethereum mainnet before August 5, 2021? (a) saw Polymarket pros beat Ethereum enthusiasts by more accurately calculating block times. Lance, an expert predictor market player, covers the topic here (a).

Polymarket also started their first tournament, the first round of which is currently ongoing. 32 participants each received $100 and face off in a sudden-death tournament (a). Participants' profits can be followed on PolymarketWhales (a).

Kalshi

Kalshi (a)—a CFTC (a)-regulated prediction market—has launched, and is now available to US citizens. Kalshi previously raised $30 million (a) in a round led by Sequoia Capital (a). Fees (a) are significantly higher than those of Polymarket.

Reddit

Reddit added some prediction functionality (a) late last year, and the NBA subreddit has recently been using it (a). See Incentivizing forecasting via social media (a) and Prediction markets for internet points? (a) for two posts which explore the general topic of predictions on social media platforms.

Also on Reddit, r/MarkMyWords (a) contains predictions which people want remembered, and r/calledit (a) contains those predictions which people surprisingly got right. Some highlights:

The predictions from r/MarkMyWords about public events could be tallied to obtain data on medium to long-term accuracy, and the correct predictions from r/calledit could be used to get a sense of how powerful human hypothesis generation is.

Hedgehog Markets

Hedgehog Markets Raises $3.5M in Seed Funding (a). They also give a preview (a) of their upcoming platform. I'm usually not a fan of announcements of an announcement, but in this case I thought it was worth mentioning:

A No-Loss Competition allows users to make predictions on event outcomes without losing their principal.

It works like this — a user decides they are interested in participating in one of Hedgehog’s No-Loss Competitions. So, they stake USDC to receive game tokens, and use those game tokens to participate in various prediction markets offered within the competition. Regardless of how a user’s predictions perform, they will always receive back their original USDC stake at the end of the competition.

The problem this solves is that within the DeFi ecosystem, the time value of money—the amount of interest one can earn by letting money sit idle, e.g., by lending it to other people or by providing liquidity to stable-coin pools—is fairly high. So once one is willing to move one's money onto a blockchain, it's not clear that betting offers the best return on investment. But with Hedgehog Markets' proposed functionality, one can get the returns from betting plus the interest on one's money at the same time.

In practice, the proposed design isn't quite right, because Hedgehog contests unnecessarily consist of more than one question, and because one can't also bet the principal. But in the long run, this proposal, or others like it, should make Polymarket worried that it will lose its #1 spot as the best crypto prediction market.
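A back-of-the-envelope illustration of that time-value argument (every number below is an invented assumption, not a figure from Hedgehog or any other platform):

```python
# Illustrative comparison: lending only, betting with locked capital, and a
# no-loss design where the principal keeps earning yield while you bet.
# Every number here is an invented assumption for the sake of the example.
principal = 1_000      # USDC one is willing to put on-chain
defi_yield = 0.08      # annual yield from lending or liquidity provision
betting_edge = 0.05    # expected annual return a skilled bettor earns from markets

lend_only = principal * defi_yield                     # earn yield, never bet
bet_only = principal * betting_edge                    # capital locked in markets, no yield
no_loss = principal * (defi_yield + betting_edge)      # the pitch: both at once

print(f"lend only: ${lend_only:.0f}, bet only: ${bet_only:.0f}, no-loss: ${no_loss:.0f}")
```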

Odds and Ends

Biatob (a)—an acronym of "Betting is a Tax On Bullshit"—is a new site for embedding betting odds in one's writing. Like: I (bet: $20 at 50%) (a) that this newsletter will exceed 500 subscribers by the end of 2021. Here (a) is a LessWrong post introducing it.

Hypermind launches a new contest on the future of AI (a), with €30,000 in prizes. An interview with Jacob Steinhardt, a UC Berkeley professor who selected the questions, can be found here (a). Hypermind's website has also undergone a light redesign.

I've added Kalshi and Betfair to Metaforecast.

In the News

Unfortunately, Fabs Won’t Fix Forecasting (a) gives a brief overview of the state of the semiconductor manufacturing industry. The recent chip shortage has led to more fabs being built to serve anticipated demand, and to tighter coordination between buyers and manufacturers. The article then makes the point that "...companies are looking for ways to mitigate shortages. Building fabs is part of the answer, but unless OEMs (original equipment manufacturers (a)) and the supply chain can improve the accuracy of their forecasts, the chip industry's next problem could be overcapacity."

Malta is in trouble over betting & fraud: Malta faces EU sports betting veto withdrawal (a) & Malta first EU state placed on international money laundering watch-list (a). H/t Roman Hagelstein. From the first article:

As one of Europe’s most prominent gambling hubs – online gambling accounts for 12% of the island’s GDP, generating €700 million and employing 9,000 people – and providing a base to over 250 betting operators including Betsson, Tipico and William Hill, the new stipulations could have a substantial impact on the day-to-day functions of Malta’s economy.

The European Central Bank seems to systematically over-predict inflation (a).

Forecasting Swine Disease Outbreaks (a)

For many years, production companies have been reporting the infection status of their sow farms to the MSHMP. So now we have this incredible dataset showing whether any given farm is infected with porcine epidemic diarrhea (PED) virus in a given week. We combine these data with animal movement data, both into the sow farms as well as into neighboring farms, to build a predictive, machine-learning algorithm that actually forecasts when and where we expect there to be high probability of a PED outbreak

The forecasting pipeline has a sensitivity of around 20%, which means that researchers can detect one out of every five outbreaks that occur.

“That’s more information than we had before... so it’s a modest improvement,” VanderWaal said. “However, if we try to improve the sensitivity, we basically create more false alarms. The positive predictive value is 70%, which means that for every 10 times the model predicts an outbreak, it’s right seven of those times. Our partners don’t want to get a bunch of false alarms; if you ‘cry wolf’ too often, people stop responding. That’s one of the limitations we’re trying to balance.”
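To make those two numbers concrete, here is a small worked example; the sensitivity and positive predictive value come from the quote, while the count of 100 true outbreaks is a hypothetical illustration.

```python
# How 20% sensitivity and 70% positive predictive value fit together.
# The number of true outbreaks is hypothetical; the rates come from the article.
true_outbreaks = 100

sensitivity = 0.20   # fraction of real outbreaks the model flags
ppv = 0.70           # fraction of raised alarms that are real

true_positives = sensitivity * true_outbreaks        # 20 outbreaks caught
missed = true_outbreaks - true_positives             # 80 outbreaks missed
total_alarms = true_positives / ppv                  # ~29 alarms raised in total
false_alarms = total_alarms - true_positives         # ~9 false alarms

print(f"caught {true_positives:.0f} of {true_outbreaks} outbreaks, missed {missed:.0f}")
print(f"raised {total_alarms:.0f} alarms, of which {false_alarms:.0f} were false")
```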

Blog Posts

Thinking fast, slow, and not at all: System 3 jumps the shark (a): Andrew Gelman tears into Kahneman's new book Noise; Kahneman answers in the comments.

Something similar seems to have happened with Kahneman. His first book was all about his own research, which in turn was full of skepticism for simple models of human cognition and decision making. But he left it all on the table in that book, so now he’s writing about other people’s work, which requires trusting in his coauthors. I think some of that trust was misplaced.

Superforecasters look at the chances of a war over Taiwan (a) and at how long Kabul has left after America's withdrawal from Afghanistan (a).

In Shallow evaluations of longtermist organizations (a), I look at the pathways to impact for a number of prominent longtermist EA organizations, and I give some quantified estimates of their impact or of proxies of impact.

Global Guessing interviews Juan Cambeiro (a)—a superforecaster known for his prescient COVID-19 predictions—and goes over three forecasting questions with him. Forecasters who are just starting out might find the description of what steps Juan takes when making a forecast particularly valuable.

Types of specification problems in forecasting (a) categorizes said problems and suggests solutions. It's part of a broader set of forecasting-related posts by Rethink Priorities (a).

Risk Premiums vs Prediction Markets (a) explains how risk premiums might distort market forecasts. For example, if money is worth less when markets are doing well and more when markets are doing badly, then a bet at fair 50:50 odds on a 50% outcome that pays off when markets are doing well can have negative expected utility, which pushes its market price below 50%. The post is slightly technical.
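A minimal numeric sketch of that argument (the probability and the marginal-utility weights are invented for illustration; nothing here is taken from the linked post):

```python
# Why state-dependent value of money can push a market price away from the
# true probability. All numbers are invented for illustration.
p_boom = 0.5                   # true probability that markets do well
mu_boom, mu_crash = 0.8, 1.5   # marginal utility of an extra dollar in each state

price = 0.5   # actuarially "fair" price of a contract paying $1 if markets do well

# Expected utility of buying one such contract at that price:
eu = p_boom * (1 - price) * mu_boom + (1 - p_boom) * (0 - price) * mu_crash
print(f"expected utility of buying at {price}: {eu:+.3f}")   # comes out negative

# The price at which a trader with these weights is indifferent sits below 0.5,
# so the market price understates the true probability of the good state.
indifference = p_boom * mu_boom / (p_boom * mu_boom + (1 - p_boom) * mu_crash)
print(f"indifference price: {indifference:.3f}")
```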

Leaving the casino (a). "Probabilistic rationality was originally invented to choose optimal strategies in betting games. It’s perfect for that—and less perfect for other things."

16 types of useful predictions (a) is an old LessWrong post by Julia Galef, with some interesting discussion in the comments about how one can seem more or less accurate when comparing oneself to other people, depending on the method of comparison.

Long Content

The Complexity of Agreement (a) is a classic paper by Scott Aaronson which shows that the conclusion of Aumann's agreement theorem survives realistic constraints on communication and computation.

A celebrated 1976 theorem of Aumann asserts that Bayesian agents with common priors can never “agree to disagree”: if their opinions about any topic are common knowledge, then those opinions must be equal. But two key questions went unaddressed: first, can the agents reach agreement after a conversation of reasonable length? Second, can the computations needed for that conversation be performed efficiently? This paper answers both questions in the affirmative, thereby strengthening Aumann’s original conclusion.

We show that for two agents with a common prior to agree within ε about the expectation of a [0, 1] variable with high probability over their prior, it suffices for them to exchange O(1/ε^2) bits. This bound is completely independent of the number of bits n of relevant knowledge that the agents have. We also extend the bound to three or more agents; and we give an example where the “standard protocol” (which consists of repeatedly announcing one’s current expectation) nearly saturates the bound, while a new “attenuated protocol” does better.

This paper initiates the study of the communication complexity and computational complexity of agreement protocols. Its surprising conclusion is that, in spite of the above arguments, complexity is not a fundamental barrier to agreement. In our view, this conclusion closes a major gap between Aumann’s theorem and its informal interpretation, by showing that agreeing to disagree is problematic not merely “in the limit” of common knowledge, but even for agents subject to realistic constraints on communication and computation.

The blog post The Principle of Indifference & Bertrand’s Paradox (a) gives very clear examples of the problem of priors. It's a chapter from a free online textbook (a) on probability.

What’s the problem? Imagine a factory makes square pieces of paper, whose sides always have length somewhere between 1 and 3 feet. What is the probability the sides of the next piece of paper they manufacture will be between 1 and 2 feet long?

Applying the Principle of Indifference we get 1/2.

That seems reasonable, but now suppose we rephrase the question. What is the probability that the area of the next piece of paper will be between 1 ft² and 4 ft²? Applying the Principle of Indifference again, we get a different number, 3/8.

But the answer should have been the same as before: it’s the same question, just rephrased! If the sides are between 1 and 2 feet long, that’s the same as the area being between 1 ft² and 4 ft².
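Spelled out, the two calculations in the quote are just uniform probabilities over the two different parameterizations:

\[
P(1 \le s \le 2) = \frac{2 - 1}{3 - 1} = \frac{1}{2},
\qquad
P(1 \le a \le 4) = \frac{4 - 1}{9 - 1} = \frac{3}{8},
\]

where the side length s is treated as uniform on [1, 3] ft in the first case, and the area a = s² is treated as uniform on [1, 9] ft² in the second. The paradox is that both treatments look equally "indifferent", yet they answer the same question differently.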

In Section 4 we shift attention to the computational complexity of agreement, the subject of our deepest technical result. What we want to show is that, even if two agents are computationally bounded, after a conversation of reasonable length they can still probably approximately agree about the expectation of a [0, 1] random variable. A large part of the problem is to say what this even means. After all, if the agents both ignored their evidence and estimated (say) 1/2, then they would agree before exchanging even a single message. So agreement is only interesting if the agents have made some sort of “good-faith effort” to emulate Bayesian rationality.

The infamous Literary Digest poll of 1936 (a) predicted that Roosevelt's rival would be the overwhelming winner. After Roosevelt instead overwhelmingly won, the magazine soon folded. Now, a new analysis finds that (a):

If information collected by the poll about votes cast in 1932 had been used to weight the results, the poll would have predicted a majority of electoral votes for Roosevelt in 1936, and thus would have correctly predicted the winner of the election. We explore alternative weighting methods for the 1936 poll and the models that support them. While weighting would have resulted in Roosevelt being projected as the winner, the bias in the estimates is still very large. We discuss implications of these results for today’s low-response-rate surveys and how the accuracy of the modeling might be reflected better than current practice.
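As a toy illustration of the weighting idea (every share below is invented for the example; the real analysis is in the linked paper):

```python
# Toy illustration of reweighting a biased poll by respondents' reported 1932 vote.
# All shares are invented for the example; the real numbers are in the paper.
sample = {
    # 1932 vote: (share of the poll sample, share supporting FDR in 1936)
    "FDR 1932":    (0.35, 0.85),
    "Hoover 1932": (0.65, 0.25),
}
electorate_1932 = {"FDR 1932": 0.59, "Hoover 1932": 0.41}  # assumed 1932 vote shares

unweighted = sum(share * fdr for share, fdr in sample.values())
weighted = sum(electorate_1932[group] * fdr for group, (_, fdr) in sample.items())

print(f"raw poll estimate of FDR support:      {unweighted:.0%}")  # predicts an FDR loss
print(f"estimate weighted by 1932 vote shares: {weighted:.0%}")    # predicts an FDR win
```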

Proebsting's paradox (a) is an argument that appears to show that the Kelly criterion can lead to ruin. Its resolution requires understanding that "Kelly's criterion is to maximise expected rate of growth; only under restricted conditions does it correspond to maximising the log. One easy way to dismiss the paradox is to note that Kelly assumes that probabilities do not change."
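For reference, a minimal sketch of the standard Kelly calculation under the usual fixed-odds, fixed-probability assumptions (the numbers are illustrative and not taken from the linked article):

```python
# Standard Kelly criterion with fixed, known probabilities: stake the fraction
# of bankroll that maximizes the expected per-bet growth rate (expected log wealth).
import math

p = 0.55   # probability of winning (illustrative)
b = 1.0    # net odds: win b per 1 staked

kelly_fraction = (b * p - (1 - p)) / b   # f* = (bp - q) / b, here 0.10

def growth_rate(f: float) -> float:
    """Expected log growth per bet when staking fraction f of the bankroll."""
    return p * math.log(1 + b * f) + (1 - p) * math.log(1 - f)

print(f"Kelly fraction: {kelly_fraction:.2f}")
print(f"growth at Kelly:    {growth_rate(kelly_fraction):.4f}")
print(f"growth at 2x Kelly: {growth_rate(2 * kelly_fraction):.4f}")  # strictly worse
```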

Note to the future: All links are added automatically to the Internet Archive, using this tool (a). "(a)" for archived links was inspired by Milan Griffes (a), Andrew Zuckerman (a), and Alexey Guzey (a).



Discuss
