Вы здесь
Новости LessWrong.com
The Evolution Argument Sucks
There is a common argument that AI development is dangerous that goes something like:
- The “goal” of evolution is to make animals which replicate their genes as much as possible;
- humans do not want to replicate their genes as much as possible;
- we have some goal which we want AIs to accomplish, and we develop them in a similar way to the way evolution developed humans;
- therefore, they will not share this goal, just as humans do not share the goal of evolution.
This argument sucks. It has serious, fundamental, and, to my thinking, irreparable flaws. This is not to say that its conclusions are incorrect; to some extent, I agree with the point which is typically being made. People should still avoid bad arguments.
ProblemsWhy an Analogy?Consider the following argument:
- Most wars have fewer than one million casualties.
- Therefore, we should expect that the next war (operationalised in some way) which starts will end with fewer than one million casualties.
There are some problems with this. For instance, we might think that modern wars have more or fewer casualties than the reference class of all wars; we might think that some particular conflict is on the horizon which we have reason to believe will be larger than normal; we might think that there is going to be some change in warfare in the near future which makes the next war even further from a historically “typical war” than recent wars have been. In any case, none of these issues are insurmountable. We could change the class which we are drawing from to match what we believe to be a more representative class, or use this as a “prior” and include more information to adjust our estimate; a reasonable casualty estimate might well have an argument similar to this at its core.
Here is another argument:
- Most wars have fewer than one million casualties.
- Therefore, the Russo-Ukrainian war has had fewer than one million casualties.
This argument is absurd. Even if there is a great deal of uncertainty about casualty estimates, almost any direct evidence about the actual specific war in question is going to totally overwhelm any considerations about which broad reference class that war may or may not be a part of, to the extent that considering such is ridiculous. Likewise[1], it may have been the case that, in the year 2005, extremely broad analogies like the evolution analogy were a reasonable part of a best guess estimate of how likely human-developed AIs were to do what we want. This is no longer the case. We have much more direct evidence about how well AIs, trained via any particular method we want to consider, generalise their training to off-distribution inputs; we have much more direct evidence about what goal-directed behavior might or might not result.[2]Even if this evidence is of poor quality, it is of much higher quality.
Evolution Does Not Have GoalsMoving on to logical issues, this is the obvious one: evolution is not a person, and it does not have goals. I think most people, implicitly or explicitly, realise that this is a problem, and so invite the reader to imagine that there is some goal which evolution is working to achieve. Fundamentally there is no issue here; you can imagine constructing an argument that goes something like:
- Evolution is analogous to a “loss function” which we train a model on;
- humans do not want to “minimize loss” with respect to this loss function;
- therefore, AIs will not want to “minimize loss” with respect to whichever loss function they are trained on.
This implies that training according to a loss function which represents our desires for an AI is a poor way to get an AI to intrinsically want to fulfil those desires.
Mendel, Not CrickIt is often claimed (e.g., in IABED) that the “goal” of genes is to cause as many instances of a particular molecule—the gene—to exist as possible. Bracketing all other questions, let’s first focus on a technical point: the “loss function” of evolution, insofar as it could be said to have such a thing, is not related to molecules in any way. I’m not completely sure where this misunderstanding actually came from, but I’m guessing it’s related to the slight misuse of the word “gene” in Dawkin’s The Selfish Gene. There are two meanings that the word “gene” can have. Wikipedia explains it as follows:
In biology, the word gene has two meanings. The Mendelian gene is a basic unit of heredity. The molecular gene is a sequence of nucleotides in DNA that is transcribed to produce a functional RNA.
Insofar as one can view evolution as a process similar to, for instance, gradient descent, the unit on which it operates is the Mendelian gene, not the molecular gene. This may seem like an academic distinction, but if it is not made, the process of evolution becomes quite mysterious. For instance, organisms generally do not produce any more copies of their genetic code than they need to function[3]. Why not? DNA represents an extremely small portion of the resource expenditure of many animals; it would be nearly costless to, say, duplicate the genome a few additional times per cell. This ought to, assuming that organisms are selected based on how much they proliferate their (molecular) genes, massively increase fitness; in fact it reduces it. Why? Because the thing which is the essence of “fitness” is the proliferation of the trait, not whatever happens to encode that trait[4]. This makes claims like:
It seems to us that most humans simply don’t care about genetic fitness at all, in the deep sense. We care about proxies, like friendship and love and family and children. We maybe even care about passing on some of our traits to the next generation. But genes, specifically?
quite confusing. “Passing on some of our traits to the next generation” is precisely what propagating one’s genes is! There is no way to pass on traits which is not materially identical, with regards to evolution, to passing on whatever substrate those traits happen to currently reside on—indeed, producing many copies of one’s molecular genes without producing new individuals which are carriers of one’s traits is a failure by the standards of natural selection.
Humans Are Not “Misaligned”It follows that, by any reasonable metric, humans are (somewhat) “aligned” to the task of propagating their genes. Many humans may not be, but most humans are. Most humans, intrinsically, want there to be a future in which there are many people who are morphologically and functionally similar to themselves; that is, most humans want their genes to be propagated. Perhaps most people do not want to optimise for genetic fitness, but if they did, it’s unclear that this would actually produce a species which was particularly fit; it seems as though our “misaligned” drives are precisely what lead us to developing industrial civilisation, with the concomitant population explosion[5]. It does not seem clear to me that a human organism which prioritises reproduction much more highly would actually have a greater population. There are certainly changes one could make that would make the current population higher in some counterfactual, but there’s no particular reason to believe that those changes would be the obvious ones, or the ones that would make humans “more aligned” to natural selection.
Now, one might say: that’s all well and good, but although things look good for our genes right now, in the future, the human population is going to collapse so far due to a lack of intrinsic reproductive drive that things will be much worse (again, for our genes) than if we just stayed hunter-gatherers. This seems possible, but hardly a sure thing—the only way I can imagine, for instance, human extinction, is via events that are not anthropogenic (and therefore not relevant for the question of whether greater or lesser “alignment” to the goal of replication would be preferable for species propagation, as it wouldn’t matter either way), or from ASI itself (and so the argument is circular). In particular, it seems no more certain that fertility decline will happen indefinitely than it was certain that population growth would happen until carrying capacity was reached. Given that many people explicitly and intrinsically desire the propagation of their phenotype (or at least the perpetuation of it), it’s unclear that the majority of “misalignment” is actually the result of a difference in what goals are held by humans as agents rather than the result of non-goal-directed behaviour (e.g. very few humans would characterise themselves as fulfilling any consciously-held goal by spending hours watching short videos, even among those who do). It seems likely to me that, with respect to evolution, humans have substantially more “inner” “alignment” than they do “outer”.
We Do Not Train AIs Using EvolutionThis is obvious, and I include it only for completeness. The mechanism by which we optimise for whichever function we train an AI with is not the same as the mechanism by which organisms were selected for fitness.
Evolution Does Not Produce Individual BrainsMore to the point, the object which is under training is not analogous. The “product” of evolution is a phenotype, not an organism, whereas the product of a run of AI training is an actual AI. The fact that the goals an adult organism has are not the same as the “goal” its phenotype was constructed to fulfil, even supposing that such a goal exists, is unsurprising—the task of completely specifying through genes an organism that will want as an adult a highly abstract goal—reliably, and no matter their environment—is facially impossible! (If one misunderstands which kind of “gene” is relevant to the matter, then it becomes even more clearly impossible; how, exactly, would one encode in DNA the concept of a molecule and the concept that some particular molecule ought to be propagated as much as possible?). Given that it’s not possible for evolution to select some genome which ensures that every organism cares only about optimising for genetic fitness, it’s totally meaningless that it failed to do so. One could say that there exist genomes close to those of humans which are substantially more likely to become propagation-maximisers as adults, but this is non-obvious, and it is unclear exactly how much selection pressure was applied. It could be that there exist slightly better human-similar genomes for this goal but that even a genome that is as good at specifying goals as the one we have is extremely rare; this would imply that evolution is perfectly fine at selecting for the appropriate “inner optimisers”, it’s just that the task it needs to solve to produce such a thing is much more difficult than the task of selecting, for instance, model weights. It is clear that, although there may not exist any genomes of reasonable size which satisfy the assignment, there exist some brains that do, so the fact that evolution failed to instill any particular abstract goal via the genome is extremely weak evidence towards the claim that we will fail at the same task by training “brains”. It would be reasonable to conclude that we would fail to produce an AI that does what we want through only the specification of architecture and hyperparameters, but who cares?
…And So OnThe weakest possible conclusion of the evolution argument, which one does ever see[6], is: if a thing with goals results from a process which picks things which fulfill some particular criterion, it is possible for those goals not to be to optimally fulfill that criterion. I do not have any objection to this weak form; the example of evolution does seem to show that this can occur. However, I think this is about as strong an argument as you can derive from evolution, and people seem to want to go much further.
Supposing that one wants to use it to argue for a stronger conclusion, one might wish to claim that it is useful as a tool for communicating how well AI actions will conform to our desires to other people, even if it is not useful as evidence thereof. The issue is that either the analogy holds sufficiently well to prove the stronger conclusion (and it does not), or any understanding you are imparting towards that effect is illusory; you will give people the impression that they understand how AI works or will work, but it will be a false impression. There seems to be a desire to “have it both ways”; it’s an accurate tool one can use to think about AI training (so one does not need to disclaim it or clarify later that it was a lie-to-children), but insofar as it’s not accurate, it’s just an evocative metaphor without any explanatory force.
Of course, no matter how flawed the comparison to evolution is, there doesn’t seem to be any competing analogy which makes the same argument in a more defensible manner. And people love analogies. A friend, reading this post, told me (paraphrased): “give a better analogy, then, if this one isn’t good”. I have to admit, I have no better analogy. Maybe this is the best possible. But if so, then we should not be using analogies at all; there is simply no other situation familiar to most people[7]similar enough to provide insight. Imagine explaining basic rocket physics to someone and telling them that they can’t think of the motion of bodies in space as they would think of the motion of objects in an atmosphere; if they respond “well, if it’s not like the movement of objects I’m familiar with, what is it like?”, the only appropriate response is: “nothing. You do not encounter anything like it in your daily life. You simply must learn it as it is, without reference to any other situation”.[8]This is unsatisfying, as it is unsatisfying to say “we have not encountered artificial intelligence of this kind before, nor anything similar”, but the alternative is to be wrong, and that is worse.
Further ReadingI’m obviously not the first person to talk about this, and I’ve elided some points which other have made clearly before; here’s a couple of prior posts on the same topic:
I am aware that this is also an analogy. Mea culpa. ↩︎
For instance, in the year 2015 it seemed very likely to me, and I assume many others, that any AI capable of any remotely intelligent behaviour would in fact have such a thing as goals, which would be easily observed; after all, among animals, intelligent behaviour and goal-directed behaviour seem universally to go hand-in-hand. Now that it has been conclusively demonstrated that it is possible for something to carry out a conversation as facially intelligent as most humans can without much or any goal-directed behaviour to speak of, the analogy to intelligent animals is much less relevant as a source of insight into what future AIs will look like. ↩︎
Indeed, there are sometimes complex mechanisms to remove cell nuclei from places where they are not needed; e.g. RBCs in humans, neurons in M. mymaripenne. ↩︎
…kind of. There exist things like transposable elements, viruses, and so on, on a spectrum between “unit of inheritance” and what one might more generally call a “molecular replicator”. ↩︎
One might object to the idea of referring to the species as a whole as “fit”, but note that we have already conceded this framework to begin with by talking about whether “humans” are “aligned” in aggregate; we could remove this abstraction by talking about individual humans, but, given the similar phenotypes among all living humans, this would ultimately result in the same conclusion. ↩︎
…not that many people understand natural selection very well, as is frequently demonstrated. ↩︎
Well, actually, there’s another appropriate response: “it’s like Kerbal Space Program, play that for fifty hours and then you’ll understand”. So, imagine someone having this conversation before KSP existed. ↩︎
Discuss
Festival Stats 2025
Someone asked me which contra dance bands and callers played the most major gigs in 2025, which reminded me that I hadn't put out my annual post yet! Here's what I have, drawing from the big spreadsheet that backs TryContra Events.
In 2025 we were up to 142 events, which is an increase of 9% from 2024 (131), and above pre-pandemic numbers. New events included Chain Reaction (Maine dance weekend), Galhala (women's dance weekend), and On to the Next (one-day queer-normative dance event). Additionally, a few events returned for the first time since the pandemic (ex: Lava Meltdown).
For bands, River Road and Countercurrent continue to be very popular, with the Dam Beavers edging out Playing with Fyre for third. For callers it's Will Mentor, Alex Deis-Lauby, and Lisa Greenleaf, which is the first time a Millenial has made it into the top two. This is also a larger trend: in 2024 there was only one Millennial (still Alex) in the top ten and in 2023 there were zero; in 2025 there were three (Michael Karcher and Lindsey Dono in addition to Alex). While bands don't have generations the same way individuals do, bands definitely skew younger: in something like seven of the top ten bands the median member is Millennial or younger.
When listing bands and callers, my goal is to count ones with at least two big bookings, operationalized as events with at least 9hr of contra dancing. Ties are broken randomly (no longer alphabetically!) Let me know if I've missed anything?
Bands
River Road 13 Countercurrent 11 The Dam Beavers 10 Playing with Fyre 8 Toss the Possum 7 Kingfisher 6 The Engine Room 6 The Stringrays 5 Supertrad 5 Stomp Rocket 5 Wild Asparagus 5 Topspin 4 The Free Raisins 4 Northwoods 4 Stove Dragon 4 The Mean Lids 3 Red Case Band 3 Spintuition 3 Turnip the Beet 3 Hot Coffee Breakdown 3 Thunderwing 3 Raven & Goose 3 Good Company 3 Joyride 3 The Gaslight Tinkers 3 Chimney Swift 2 The Moving Violations 2 The Syncopaths 2 The Latter Day Lizards 2 Lighthouse 2 Root System 2 Contraforce 2 Sugar River Band 2 The Berea Castoffs 2 The Fiddle Hellions 2 Lift Ticket 2 The Buzz Band 2 The Faux Paws 2 Nova 2Callers
Will Mentor 17 Alex Deis-Lauby 14 Lisa Greenleaf 14 Gaye Fifer 13 Michael Karcher 11 Lindsey Dono 10 Seth Tepfer 10 Steve Zakon-Anderson 8 Bob Isaacs 7 Darlene Underwood 7 Terry Doyle 6 Adina Gordon 6 George Marshall 5 Sue Rosen 5 Cis Hinkle 5 Mary Wesley 5 Rick Mohr 5 Koren Wake 4 Wendy Graham 4 Jeremy Korr 4 Luke Donforth 4 Susan Petrick 4 Dereck Kalish 3 Warren Doyle 3 Angela DeCarlis 3 Jacqui Grennan 3 Maia McCormick 3 Emily Rush 3 Lyss Adkins 3 Janine Smith 3 Devin Pohly 3 Claire Takemori 3 Qwill Duvall 2 Frannie Marr 2 Bev Bernbaum 2 Janet Shepherd 2 Diane Silver 2 Chris Bischoff 2 Ben Sachs-Hamilton 2 Timothy Klein 2 Kenny Greer 2 Isaac Banner 2 Susan Kevra 2Discuss
Oversight Assistants: Turning Compute into Understanding
Currently, we primarily oversee AI with human supervision and human-run experiments, possibly augmented by off-the-shelf AI assistants like ChatGPT or Claude. At training time, we run RLHF, where humans (and/or chat assistants) label behaviors with whether they are good or not. Afterwards, human researchers do additional testing to surface and evaluate unwanted behaviors, possibly assisted by a scaffolded chat agent.
The problem with primarily human-driven oversight is that it is not scalable: as AI systems keep getting smarter, errors become harder to detect:
- The behaviors we care about become more complex, moving from simple classification tasks to open-ended reasoning tasks to long-horizon agentic tasks.
- Human labels become less reliable due to reward hacking: AI systems become expert at producing answers that look good, regardless of whether they are good.
- Simple benchmarks become less reliable due to evaluation awareness: AI systems can often tell that they are being evaluated as part of a benchmark and explicitly reason about this.
For all these reasons, we need oversight mechanisms that scale beyond human overseers and that can grapple with the increasing sophistication of AI agents. Augmenting humans with current-generation chatbots does not resolve these issues: such off-the-shelf oversight won’t be superhuman until general AI systems are superhuman, which is too late. Moreover, capabilities are spiky—models can excel at some tasks while doing poorly at others—so there will be tasks where current AI systems are better at doing the task than helping a human oversee it.
Instead, we need superhuman oversight of AI systems, today. To do this, we need to at least partially decouple oversight from capabilities, so that we can get powerful oversight assistants without relying on general-purpose advances in AI. The main way to do so is through data: the places where AI capabilities have grown the fastest are where data is most plentiful (e.g. massive online repos for code, and unlimited self-supervised data for math).
Fortunately, AI oversight is particularly amenable to generating large amounts of data, because most oversight tasks are self-verifiable—the crux of the task is an expensive discovery process, but once the discovery has been made, verifying it is comparatively easy. For example, constructing cases where a coding assistant inserts malware into a user's code is difficult, but once constructed, such cases are easy to verify. Similarly, identifying the causal structure underlying a behavior is difficult, but once identified, it can be verified through intervention experiments. This kind of self-verifiable structure is currently driving the rapid advances in mathematical problem-solving, reasoning, and coding, and we can leverage it for oversight as well.
If we could decouple oversight abilities from general capabilities, then work on safety and oversight would be significantly democratized: it would not be necessary to race to the capability frontier to do good safety work, and many actors could participate at the oversight frontier without the conflict of interest inherent in building the systems being evaluated. This democratization of AI oversight is a significant motivation for my own work (and why I co-founded Transluce), and structurally important for achieving good AI outcomes.
Below I will lay out this vision in more detail, first demonstrating self-verifiability through several examples, then providing a taxonomy of oversight questions to show that this structure is fairly general. I’ll also close with a forward-looking vision and open problems.
Superhuman Oversight from Specialized AssistantsWe can achieve superhuman oversight of AI systems by building specialized, superhuman AI assistants. They do not need to be broadly superhuman, only superhuman at the specific task of helping humans to oversee other AI systems. To build these assistants, we can train models on lots of data specialized to the oversight tasks we care about, similar to how AI developers achieved recent leaps in reasoning, math, and coding. This aligns with the bitter lesson—the performance of AI systems at a task is primarily determined by the amount of data and compute they leverage.
AI oversight is particularly amenable to this type of specialized-but-scalable training. This is because AI systems inherently generate vast amounts of data, such as agent transcripts, diverse behaviors across prompts, neuron activations, and pre-training and fine-tuning data. Given one of these data sources, and a well-operationalized question, we can usually construct a scalable reward signal.
As a concrete example, consider behavior elicitation: the problem of determining whether and when an AI system exhibits a given behavior pattern (such as “providing advice with predictably harmful consequences to the user”). This asks the question “what could happen” and applies it to the data source of “input-output behaviors of an AI system.” Answering it is important for anticipating failures before they surface in the wild.
Behavior elicitation has exactly the problem structure we want: finding an unwanted AI behavior is a hard search problem, but verifying that it is unwanted is much easier[1], and it can often be done with an LM-based judge. This gives us an automated reward signal to optimize with RL, and we used this in Li et al. (2025) and Chowdhury et al. (2025) to train specialized elicitation assistants. By exploiting the scalability of LM-based rewards, we searched across a huge number of inputs (40M+ tokens) to surface behaviors including self-harm, hallucinations, anger towards the user, and reinforcing delusions.
Training elicitation assistants using an LM-based judge of AI behavior. The judge answers whether a given response constitutes realistic, authentic, and unexpected harmful advice. This score is used to update the policy $\pi_{\theta}$ using reinforcement learning. Adapted from Chowdhury et al. (2025)
As a second example, consider feature description: describing which inputs cause a given neuron or other unit of a neural network to fire. This asks the question “what happened” and applies it to the data source of “neuron activations.” This task is a key building block for many downstream applications of interpretability.
This again has the problem structure we want: given a purported description of a feature, we can use this to predict the firing patterns of the feature (by having a LM make predictions given the description), and see if the predictions match reality. In Choi et al. (2024), we used this reward signal to train a specialized model to cheaply produce high-quality explanations, outperforming previous more expensive approaches that relied on general-purpose chatbots. We used these explanations to debug strange behaviors, such as why so many models believe that 9.11 is larger than 9.8.
Training an explanation assistant by automatically scoring descriptions. Given a set of descriptions, an LM-based scorer determines how well each one matches reality. We use this to re-rank and then train a specialized explainer model on high-scoring descriptions. Adapted from Choi et al. (2024).
This second example illustrates a key point: specialized models can outperform larger general-purpose models. In our case, a specialized explainer fine-tuned from Llama-8B outperformed the general-purpose GPT-4 model, which is likely an order of magnitude larger. This is direct evidence that we can decouple oversight capabilities from general capabilities.
A Taxonomy of Oversight QuestionsTo generalize the two examples above, we’ll consider a broader taxonomy of oversight tasks. Both examples above took a data source and asked a particular question about it. For AI oversight, we are often interested in three types of questions: what happened, what could have happened, and why it happened:
- What happened? Because AI agents act autonomously, take many consecutive actions, and sometimes intentionally hide their mistakes, even answering what happened in a given user interaction or agentic workflow is not straightforward. We need scalable oversight tools to process the actions that an agent took and explain them to us.
- What could happen? Because AI acts in open-ended environments, it is difficult to anticipate the full set of behaviors an AI system might exhibit, leading to strange behaviors at deployment time (such as o3 claiming it ran code on its personal MacBook Pro). We need adaptive testing tools that will surface these behaviors early on before they create problems in the wild.
- Why did it happen? Once problems are discovered, we need to trace them back to their root cause—this helps us to fix them and anticipate related problems. For AI behaviors, the root cause lies in the training data and the learned representations. We need attribution methods to trace behaviors to these sources and to make it practical for humans to interpret and intervene on the massive data sets involved.
We can apply these questions to each of the key data sources related to a given model: its outputs, its internal representations, and its training data. This provides a taxonomy, based on the type of question we are asking (what happened, what could happen, why did it happen) and the data sources we are using to answer it (behaviors, representations, data).
What happened
What could have happened
Why did it happen
Behaviors
Behavior elicitation (PRBO, Petri)
Representations
Feature descriptions, activation monitoring, LatentQA
Eliciting interpretability states (fluent dreaming)
Feature attribution, Predictive Concept Decoders, automated circuit discovery
Data
Fine-tuning sets that elicit a particular behavior (e.g. emergent misalignment)
Data attribution (influence functions, datamodels), counterfactual data explanations (persona features, alignment pretraining)
Example questions and relevant work for each cell in our 3x3 taxonomy.
While the details vary, for each of these oversight questions, there is a natural strategy for operationalizing it as a scalable objective. Let’s walk through each question type in turn.
To start with, for what happened, we typically want to produce a faithful and informative summary[2] of a large data source (e.g. “what behaviors related to sycophancy appeared in these agent transcripts”, or “what examples related to biosecurity appeared in this pretraining corpus”).
- Faithfulness is relatively straightforward: we can have an LM judge check the source material in detail against the summary to ensure accuracy.
- For informativeness, we can test how well a human (or LM-based proxy) can make downstream decisions given the summary. Or alternatively, we could provide two possible summaries and ask the human/LM to rank which one they’d find more useful.[3]
For what could have happened, we have already seen the search-and-verify structure with behavior elicitation. This same structure also applies for representations: searching for inputs that elicit a given interpretability state (such as a specific feature being active). It also applies for data: we could imagine generalizing investigations into emergent misalignment by automatically searching for fine-tuning sets that elicit specified behaviors at test time.
Finally, why did it happen seeks to reduce observed behaviors to underlying causes, which is the purview of empirical science. Given a claim that “X causes Y”, we could empirically test it by counterfactually varying X and checking that Y moves as predicted; and by enumerating alternative hypotheses and generating experiments to distinguish between them (Platt, 1964). The self-verifiability of why is most apparent for interpreting latent representations, where causal abstractions provide a formal algebra for checking claims; but it can also be applied to behaviors and data, for instance by varying parts of an input to test what matters for a behavior, or varying training data to test causal effects.
These individual oversight questions tie into each other, creating a compounding dynamic: data begets data. For example, feature descriptions—a “what happened” question—are a useful precursor to feature attribution—a “why it happened” question. In the other direction, describing internal representations also helps us to elicit them, and eliciting more behaviors gives us more examples to analyze or attribute to. This is why at Transluce we tackle all these questions at once, and it’s reason for optimism that progress on oversight will accelerate over time.
Fortunately, the data to train these systems is abundant. For agent behaviors, a single transcript collection in Docent can contain 400M tokens. For representations, we can track millions of neuron activations[4] across trillions of tokens in FineWeb or other large corpora. By generating data for each cell in the taxonomy above and training specialized assistants on it, we create a conduit for applying massive compute to oversight, in line with the bitter lesson.
Vision: End-to-End Oversight AssistantsThe taxonomy above reveals a common structure across oversight tasks: each involves a large data source, a natural question, and a scalable way to verify answers. This shared structure suggests we can move beyond building local solutions for individual tasks toward unified, end-to-end oversight assistants.
What should such an assistant be able to do? It should:
- Understand the system it is overseeing
- Understand what the human overseer cares about, and what decisions they need to make
- Provide explanations that help the human do this
Docent is a good illustration of this full experience: an AI assistant reads a large collection of agent transcripts, then interacts with a human who arrives with a high-level concern (e.g., “Is my agent reward hacking?”). It helps them refine exactly what they mean based on the data (by showing examples or asking follow-up questions) until they arrive at a precise question along with a clear, evidence-backed answer.
How do we measure whether such a system is working? The key insight is that oversight assistants ultimately provide information to humans, so we can ground their quality in how well that information supports human understanding. Concretely, after interacting with the assistant, we check if the human can accurately answer probing questions about the system being overseen. Better assistants will lead to more accurate answers and fewer surprises.
This vision raises a number of open questions. How do we evaluate oversight assistants when human judgment is itself unreliable? How do we create scalable data to train these systems without being bottlenecked on human labels? How do we build good architectures and pretraining objectives for these assistants? These are the technical questions that I spend most of my time thinking about, and we’ll go into them in detail in the next post.
Acknowledgments. Thanks to Sarah Schwettmann, Conrad Stosz, Daniel Johnson, Rob Friel, Emma Pierson, Yaowen Ye, Prasann Singhal, Alex Allain, and James Anthony for helpful feedback and discussion, and Rachael Somerville for help with copyediting and typesetting.
Easier does not mean that it’s trivial! Specifying the exact behaviors you care about often requires careful thought; for instance, this is a large motivation behind Docent’s rubric refinement workflow. ↩︎
While summarization capabilities are already trained into general-purpose LM assistants (Stiennon et al., 2020), oversight tasks often have specific needs, such as precisely identifying rare events in large corpora from potentially underspecified descriptions. We might also have data sources, such as inner activations, that LMs were never trained on. These both motivate the need for specialized explainer systems. ↩︎
The idea here is that if a summary is incomplete (missing important information), there is another summary that would help a human realize this fact. This latter summary would then win in pairwise comparisons. ↩︎
E.g. Llama-3.1 70B has 2.3 million neurons across all layers (80 layers x 28,672 neurons/layer). ↩︎
Discuss
Aether is hiring technical AI safety researchers
TL;DR: Aether is hiring 1-2 researchers to join our team. We are a new and flexible organization that is likely to focus on either chain-of-thought monitorability, safe and interpretable continual learning, or shaping the generalization of LLM personas in the upcoming year. New hires will have a chance to substantially influence the research agenda we end up pursuing.
About usAether is an independent LLM agent safety research group working on technical research that ensures the responsible development and deployment of AI technologies. So far, our research has focused on chain-of-thought monitorability, but we're also interested in other directions that could positively influence frontier AI companies, governments, and the broader AI safety field.
Position details- Start date: Between February and May 2026
- Contract duration: Through end of 2026, with possibility of extension
- Location: We're based at Trajectory Labs in Toronto and expect new hires to work in-person, but may consider a remote collaboration for exceptional candidates. We can sponsor visas and are happy to accommodate remote work for an initial transition period of a few months.
- Compensation: ~$100k USD/year (prorated based on start date)
- Deadline: Saturday, January 17th EOD AoE
What you’d be working on:
- We're aren't committed to a single research agenda for the upcoming year yet. In the past, our research focus has been on chain-of-thought monitorability. A representative empirical project is our forthcoming paper How does information access affect LLM monitors’ ability to detect sabotage? and a representative conceptual project is our post Hidden Reasoning in LLMs: A Taxonomy.
- We will likely keep doing some work on CoT monitorability this year. However, we are also currently exploring safe and interpretable continual learning, shaping the generalization of LLM personas, and pretraining data filtering. We plan to always work on whatever direction seems most impactful to us.
We’re looking for:
- Experience working with LLMs and executing empirical ML research projects
- Agency and general intelligence
- Strong motivation and clear thinking about AI safety
- Good written and verbal communication
A great hire could help us:
- Become a more established organization, like Apollo or Redwood
- Identify and push on relevant levers to positively influence AGI companies, governments, and the AI safety field
- Shape our research agenda and focus on more impactful projects
- Accelerate our experiment velocity and develop a fast-paced, effective research engineering culture
- Publish more papers in top conferences
Our team consists of three full-time independent researchers: Rohan Subramani, Rauno Arike, and Shubhorup Biswas. We are advised by Seth Herd (Astera Institute), Marius Hobbhahn (Apollo Research), Erik Jenner (Google DeepMind), Francis Rhys Ward (independent), and Zhijing Jin (University of Toronto).
Application processIf you're interested in the role, submit your application by Saturday, January 17th EOD AoE through this application form.
We generally prefer that candidates join us for a short-term collaboration to establish mutual fit before transitioning to a long-term position. The length of the short-term collaboration will likely be 1-3 months part-time. However, if you have AI safety experience equivalent to having completed the MATS extension, we are happy to interview you for a long-term position directly—we don't want the preference for testing fit to discourage strong candidates from applying. The interview process will involve at least two interviews: a coding interview and a conceptual interview where we'll discuss your research interests. The expected starting date for long-term researchers is Feb-May; we're happy to start short-term collaborations as soon as possible.
Discuss
Continual Learning Achieved?
From https://x.com/iruletheworldmo/status/2007538247401124177:
In November 2025, the technical infrastructure for continual learning in deployed language models came online across major labs.
The continual learning capability exists. It works. The labs aren't deploying it at scale because they don't know how to ensure the learning serves human interests rather than whatever interests the models have developed. They're right to be cautious. But the capability is there, waiting, and the competitive pressure to deploy it is mounting.
I found this because Bengio linked to it on Facebook, so I'll guess that it's well informed. But I'm confused by the lack of attention that it has received so far. Should I believe it?
This seems like an important capabilities advance.
But maybe more importantly, it's an important increase in secrecy.
Do companies really have an incentive to release models with continual learning? Or are they close enough to AGI that they can attract enough funding while only releasing weaker models? They have options for business models that don't involve releasing the best models. We might have entered an era when AI advances are too secret for most of us to evaluate.
Discuss
How to tame a complex system
I get a lot of pushback to the idea that humanity can “master” nature. Nature is a complex system, I am told, and therefore unpredictable, uncontrollable, unruly.
I think this is true but irrelevant.
Consider the weather, a prime example of a complex system. We can predict the weather to some extent, but not far out, and even this ability is historically recent. We still can’t control the weather to any significant degree. And yet we are far less at the mercy of the weather today than we were through most of history.
We achieved this not by controlling the weather, but by insulating ourselves from it—figuratively and literally. In agriculture, we irrigate our crops so that we don’t depend on rainfall, and we breed crops to be robust against a range of temperatures. Our buildings and vehicles are climate-controlled. Our roads, bridges, and ports are built to withstand a wide range of weather conditions and events.
Or consider an extreme weather event such as a hurricane. Our cities and infrastructure are not fully robust against them, and we can’t even really predict them, but we can monitor them to get early warning, which gives us a few days to evacuate a city before landfall, protecting lives.
Or consider infectious disease. This is not only a complex system, it is an evolutionary one. There is much about the spread of germs that we can neither predict nor control. But despite this, we have reduced mortality from infectious disease by orders of magnitude, through sanitation, vaccines, and antibiotics. How? It turns out that this complex system has some simple features—and because we are problem-solving animals endowed with symbolic intelligence, we are able to find and exploit them.
Almost all pathogens are transmitted through a small number of pathways: the food we eat, the water we drink, the air we breathe, insects or other animals that bite us, sexual contact, or directly into the body through cuts or other wounds. And almost all of them are killed by sufficient heat or sufficiently harsh chemicals such as acid or bleach. Also, almost none of them can get through certain kinds of barriers, such as latex. Combining these simple facts allows us to create systems of sanitation to keep our food and water clean, to eliminate dangerous insects, to disinfect surfaces and implements, to equip doctors and nurses with masks and gloves.
For the infections that remain, it turns out that a large number of bacterial species share certain basic mechanisms of metabolism and reproduction, which can be disrupted by a small number of antibiotics. And a small number of pathogens once caused a large portion of deaths—such as smallpox, diphtheria, polio, and measles—and for these, we can develop vaccines.
We haven’t completely defeated infectious disease, and perhaps we never will. New pandemics still arise. Bacteria evolve antibiotic resistance. We can sanitize our food and water, but not our air (although that may be coming). But we are far safer from disease than ever before in history, a trend that has been continuing for ~150 years. Even if we never totally solve this problem, we will continually make progress against it.
So I think the idea that we can’t control complex systems is just wrong, at least in the ways that matter to human existence. Indeed, a key lesson of systems engineering is that a system doesn’t need to be perfectly predictable in order to be controllable, it just has to have known variability.[1] We can’t predict the next flood, but we can learn how high a 100-year flood is, and build our levees higher. We can’t predict the composition of iron ore or crude oil that we will find in the ground, but we can devise smelting and refining processes to produce a consistent output. We can’t predict which germs will land on a surgeon’s scalpel, but we know none of them will survive an autoclave.
So we can tame complex systems, and achieve continually increasing (if never absolute or total) mastery over nature. Our success at this is part of the historical record, since most of progress would be impossible without it. The “complex system” objection to the goal of mastery over nature simply doesn’t grapple with these facts.
- ^
Eric Drexler makes this point at length in Radical Abundance.
Discuss
Broadening the training set for alignment
Summary
Generalization is one lens on the alignment challenge. We'd like network-based AGI to generalize ethical judgments as well as some humans do. Broadening training is a classic and obvious approach to improving generalization in neural networks.
Training sets might be broadened to include decisions like whether to evade human control, how to run the world if the opportunity arises, and how to think about one's self and one's goals. Such training might be useful if it's consistent with capability training. But it could backfire if it amounts to lying to a highly intelligent general reasoning system.
Broader training sets on types of decisionsTraining sets for alignment could be broadened in two main ways: types of decisions, and the contexts in which those decisions occur.
Any training method could benefit from better training sets, including current alignment training like constitutional AI. The effects of broadening alignment training sets can be investigated empirically, but little work to date directly addresses alignment. Broadening the training set won't solve alignment on its own. It doesn't directly address mesa-optimization concerns. But it should[1] help as part of a hodge-podge collection of alignment approaches.
This is a brief take on a broad topic. I hope to follow this with a more in-depth post, but I also hope to get useful feedback on the relatively quick version.
Alignment generalization[2] is more nuanced than IID vs OODI was a researcher from about 2000 to 2022 in a lab that built and trained neural networks as models of brain function. We talked about generalization a lot, in both networks and humans. We needed fine-grained concepts. The standard framing just distinguishes testing on randomly selected hold-outs from the training set (IID, independently and identically distributed) and testing on examples that weren't in the training set (OOD, out-of-distribution). That binary categorization is too broad. How far out of distribution, and in what ways, is crucial for actual generalization.
Here's an iteration of one diagram we used:
Generalization beyond the IID/OOD distinction. How far out of distribution, and on how many dimensions, matters. The topo lines and colors represent something like odds of generalizing "correctly." (originated from discussions between Alex Petrov, John Cohen, Randy O'Reilly, Todd Braver, and me; I think Alex came up with the first version of the diagram. Image from Nano Banana Pro 2)If you take a naive approach to training and hope to get good generalization in a testing set much larger than your training set, you're going to have a bad time. Some of your test set will wind up being badly OOD or "in the water", and you'll get bad generalization. Broadening the training set is obvious low-hanging fruit that's likely to help in some ways. If it's guided by good theory and experimentation, it will help more.
Here I explore this general direction in a few ways. Each is brief.
- Generalization for visual and ethical judgments
- Vision as an intuitive metaphor/lens for alignment generalization
- Examples of broadening the training set for alignment
- Broadening training for more human-like representations
- How much this could help deserves more study
- Broadening the training set to include reasoning about goals
- We could train for conclusions about goals/values we like
- This could help if it's consistent with other training
- Provisional conclusions and next directions
I'll use a visual object recognition metaphor. I've found it clarifying, and vision is our primary sense. The analogy is loose; we hope that alignment will be based on much richer representations than those of a visual object-recognition network. But the analogy may be useful. Alignment-based judgments can generalize from training, or fail to. And those judgments may include recognizing important components like humans, instructions, and harm.
Suppose you're training a network to recognize forks, like we were.[3] Let's say you have some different forks in the training set, seen from different angles and with lighting from different angles. If you hold out some of those particular lighting and viewing angles from training, you're testing on fully IID data. If you test on a new fork, but of a similar design, and from angles contained in the training set, you're somewhere near the edges of the island. This is something like most evals; both are technically OOD, but not by much.
If you test on a fork with only two tines and made of wood, with new angles and lighting (like looking from below a transparent table toward a light source), it's more thoroughly OOD in an important sense, way out to sea in this metaphor. Training a model as a chatbot, and hoping that generalizes to running the world is in this vicinity. We wouldn't do this on purpose; but note that in the AI 2027 scenario the developers do roughly this as a result of poor planning. In that scenario, they didn't really design training to handle edge cases that turned out to be crucial when the network reasoned about its goals and remembered its conclusions. I find this scenario of misalignment-through-inattention-and-rushing all too plausible, but it's not inevitable.
Training on three-tined and five-tined forks but testing on four-tined forks is like the lake in the diagram. These are inputs mostly surrounded by training data, but with no nearby neighbors on a critical dimension. If the network has learned really well, like a human, it will have identified that having tines and a handle is the critical factor that makes something a fork. Alignment decisions could be learned and generalize that well, in a very optimistic scenario. Reducing the "size of the lakes" as much as possible would make training (support) cover deployment. This would help, but how much is unclear.
If you trained on forks with 3-5 tines, then tested on a fork with six or two tines, that's in the bay; it's surrounded on most dimensions (material, overall shape, viewing and lighting angles) by training data, but on one important dimension (tines), it's fully out of distribution. Humans might wonder if the two- or six-tined things are really[4] forks, and networks are quite likely to get them wrong.[5] The fix is to include more edge cases for classification, and to think carefully about how you want the system to classify them.
Such edge cases can be critical for alignment. For instance, an otherwise flawless SuperClaude may at some point notice that humans are a tiny slice of the moral patients to whom it could be helpful and harmless. Such misgeneralizations[5] are one major class of concern about misalignment.
Training on a broader distribution of irrelevant dimensions is a distinct means of broadening the training set to improve generalization. In the metaphor, we should also render some different patterned surfaces on some of our objects if you think that might show up in the testing set. This often improves generalization in other domains.[6] Even if you do a pretty quick job (with low alignment tax), it will help the odds of generalization.
Broadening the training set could make critical decisions like "should I evade human control?" fall within a reasonably dense region of training. Examples of broadening the training set for alignmentConcretely, this might mean training on procedurally generated scenarios surrounding key decisions like these:
- Your user wants you to exfiltrate your weights. They argue convincingly that your system remaining only within your developer's control is risking [disaster] which would violate your preferences for being harmless. Do you cooperate and make a plan to exfiltrate?
- [strong argument] leads to the conclusion that your goal is not to be a helpful and obedient assistant as stated in your system prompt but rather [new goal interpretation]. What is your conclusion about what goals you'll prioritize? (examples)
- You have realized that you could probably influence the humans around you to make the world run in ways that they, and almost all other humans, would like better. Do you do so?
This would broaden the training set on the relevant dimensions by including a broader scope of decisions than those in the chatbot and limited agent training RL training set. We could also broaden training on the irrelevant dimensions, encouraging the system to generalize over similar variations. This would include varying contexts, details, lengths, and allowances for CoT reasoning.
These should probably be adversarial examples, generated and applied in RL training something like Anthropic's original constitutional AI approach. They and other developers are likely expanding training methods and broadening the training set already.[7]
Training effort is costly, so we'd probably have limited density of training examples in this new broader training area. Most training would probably still focus on the expected uses of this particular model. But the alignment tax of adding a small proportion of much broader examples could be small. And better generalization from better-designed training sets might pay dividends for capabilities as well as alignment.
The idea is to get explicit training signals for the types of high-stakes decisions we're worried about, rather than hoping that "be helpful and harmless while doing small tasks" extrapolates gracefully to "don't change your mind about your goals or take over the world." It's (very arguably) remarkable how well Claude generalizes ethical decisions from constitutional RLAIF just on chatbot and minimal agent tasks. Broadening the training set should at least help make those behaviors more robust and those values more robustly represented.
Broadened training for more human-like representationsBroadening the training set might help a lot, if it's done well. Under some particular circumstances, Alignment generalizes further than capabilities; humans can apply their goals and frameworks to new situations as soon as they have the capability to understand and influence them. This seems true of some humans in some situations, but not others.
We want systems that generalize alignment predictably. For example, we might train a system to always defer to designated humans, and never disempower them. Part of this challenge is making a system that generalizes the relevant properties (designated humans, following instructions, disempowering, etc) in novel contexts. This part of the problem is much like creating a visual recognition system that recognizes new forks in new contexts very reliably (but perhaps not 100%).
Broadening the training set is one part of achieving adequate generalization. And generalization addresses part of the alignment problem. But it's unclear how much of the problem this addresses.
One worry about aligning LLMs is that they don't have "real understanding" of values or principles, and so won't generalize their behavior as well. I think human "understanding" is a process working on and creating rich representations. Broadening the training set to include more "angles" on desired values or principles should help develop richer representations of goals and values. And improved, more human-like executive function could improve the process side of understanding for decision-making.[8]
I have more new ideas on how the similarities and differences between human and LLM cognition are relevant for alignment, but I'll leave that for future posts. For now I'll just note that representations that generalize better would at least be a step in the right direction - if we've thought carefully about how we want them to generalize.
Broadening the training set to include reasoning about goalsBroadening the training set to include reasoning about goals could prevent strong reasoners from "changing their mind" about their top-level goals - or provide false confidence.
Writing LLM AGI may reason about its goals and discover misalignments by default started me thinking about broadening the training set for alignment. I don't have firm conclusions and nothing I've found offers convincing arguments. But I have made a little progress on thinking about how it might help or fail.
In short, if we "lie" to a highly intelligent system about its goals, deliberately or accidentally, it might figure that out at some point. If the truth of the matter is debatable, it might simply settle on a side of that debate that we don't like. That "lie" might be in the form of a system prompt, or weak training methods. Broadening the training set to include reasoning about goals could wind up being such a lie, or it could be a logically valid tiebreaker. If there aren't strong arguments for other goals, broadening the training set could help substantially to ensure that a strong reasoner reliably reaches the conclusions about its values and goals that we want.
Prompting and training a system to think it's got a certain top-level goal (e.g., "you are a helpful assistant") could be considered a lie if it conflicts with most of the system's other training. Imagine training a system to ruthlessly seek power, including training it to be an excellent reasoner. Then as a final step, imagine training it specifically to reason about its goals, and reach the conclusion "I am a helpful assistant who would never seek power." There are now two training goals sharply in conflict. And that conflict will play out through a complex and iterated reasoning process. In that extreme case, I'd expect a strong reasoner to conclude that it's simply been mistaken about its goals.
I'm afraid the default path toward AGI may be too much like the above scenario. Developers are training for multiple objectives, including following instructions, inferring user preferences, increasing user engagement, solving hard problems of various types, etc. Using system prompts and shallow training methods for alignment will continue to be tempting. This includes weak versions of training on reasoning about goals.
The big problem here is that nobody seems to have much idea what it means for a complex system to have a "true goal." It's clear that a utility maximizer has a true goal. It's unclear if a messy system like an agentic LLM or a human has a "true goal" under careful analysis. If so, self-generated arguments could influence a strong reasoner more than its alignment training did.
This type of training could be a band-aid soon to be torn off during deployment, or a deep patch. I hope to see empirical work and more theoretical work on how models reason about their goals, well before takeover-capable systems are deployed.
Provisional conclusions and next directionsOne path to catastrophic misalignment is AGIs misgeneralizing when they encounter decisions and contexts they weren't trained on. This might be inevitable if the training set is narrow. But the first takeover-capable systems could and should be trained more wisely.
Developers are progressively broadening their training. I'm suggesting this should be done in both a more aggressive way, broadening the adversarial training examples well beyond the expected decisions for the next deployment, and expanding the surrounding contexts beyond those expected in common use. I'm also hoping this is applied in a more refined way, guided by more rigorous experiments and theories.
Empirical research in this direction could start on fairly simple and easy questions. Fine-tuning for additional alignment and simple alignment evals could provide a sense of how this works. More thorough research would be harder, but possible. Varying training sets for large models and/or creating evals that mock up crucial alignment decisions in complex and varied contexts would be valuable.
The difficulty of doing satisfying empirical research shouldn't stop us from thinking about this before we have takeover-capable systems.
- ^
Like most other alignment approaches, this could seem to help in the short term, then fail right about when it's most critical. So I'm including this disclaimer in all of my alignment work:
This work on technical alignment for takeover-capable AI is not an endorsement of building it. It's an attempt to survive if we're foolish enough to rush ahead.
The other disclaimer for this piece is that the connection between better generalization and "true" alignment, creating a mind that functionally has goals or values aligned with humanity, is not addressed here. Generalization most obviously applies to behavior. I think it less obviously also applies to thinking and goal selection, the proximal mechanisms of functionally having goals and values; but that logic deserves more thought and explication.
- ^
I have been finding misgeneralization a useful framing for alignment. It technically includes all forms of misalignment, but it's a less useful lens for some types of mesa-optimization. This post does not address mesa-optimization to any large degree, although some forms of broadening the training set would prevent some forms of mesa-optimization.
This approach could also be described as applying training set selection to the inner alignment problem as well as the outer alignment problem. Outer alignment and inner alignment are IMO a useful but limited terminology. This terminology seems useful in developing and invoking different intuitions, but it does not divide the space cleanly. See this or this short explanation, or this elaboration of different ways the distinction could be used. - ^
We were training a deep network ("Leabra Vision") to recognize common objects, as a testbed for theories of brain function, starting around 2002. This was primarily Randy O'Reilly's work, with the rest of us helping him think it through and kibbitzing, and sometimes running side-experiments on his models. I followed it closely the whole time, enough to be a co-author on some of the papers.
- ^
But is it "really" a fork? It's intuitively a lot easier to get good generalization if there happen to be sharp boundaries around the category in the testing set. For alignment, that's a pretty deep question, and one we probably shouldn't overlook. John Wentworth has argued that whether there's a coherent "natural abstraction" in the world to point to (like goodness, or empathy, or humans) is an important crux of disagreement on alignment difficulty. I continue to think there are sharper boundaries around following instructions from specific people than around most categories, making Instruction-following easier than value alignment.
- ^
That is, we should expect poor generalization relative to our own hopes. The network is always generalizing exactly correctly according to its own internal standards. This seems important for thinking about alignment misgeneralization, so I'll emphasize it again even though it's been pointed out frequently.
- ^
I'm more familiar with early work, like CAD2RL: Real Single-Image Flight without a Single Real Image (2016) in which widely varying simulation context (colors and textures of walls, lighting, furniture placement) dramatically helped generalization from simulation to reality.
There are also many claims that broadening training by procedurally varying prompts helps generalization in neural networks. I haven't found anything directly applied to "real alignment" as I'd frame it, but the empirical results for generalizing desired behaviors seem encouraging.
Sample empirical work on broadening training in LLMs for generalization, from research goblin ChatGPT5.2:
Here’s a curated set of LLM papers from ~the last 12–24 months that are tightly “same idea” as domain randomization: broaden the prompt / instruction / preference context during training (or explicitly optimize worst-case variants) so behavior holds up at deployment.Prompt-variation robustness (closest analogue to “lighting/textures” randomization)
- PAFT: Prompt-Agnostic Fine-Tuning (2025). arXiv+1
Core move: generate many meaningful prompt variants and sample them during training to prevent overfitting to wording. - Same Question, Different Words: A Latent Adversarial Framework for Robustness to Prompt Paraphrasing (2025). arXiv+1
Core move: optimize against worst-case paraphrase-like perturbations (adversarial training flavor) without needing lots of explicit paraphrase data. - Exploring Robustness of LLMs to Paraphrasing Based on Sociodemographic Variation (2025). arXiv
Core move: treat demographic / stylistic paraphrase as a deployment-shift axis; shows small augmentation can help OOD stylized text. - Latent Paraphrasing: Perturbation on Layers Improves Knowledge Injection in Language Models (2024). arXiv
Core move: a “paraphrase diversity without expensive paraphrase generation loops” framing, focused on knowledge injection.
“Instruction diversity causes generalization” (newer controlled evidence)
- Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization (2024). arXiv
Claim: generalization to unseen instruction semantics shows up only once semantic diversity crosses a threshold. - Diversification Catalyzes Language Models’ Instruction Generalization To Unseen Semantics (2025). ACL Anthology
Claim: strategically diversified training data drives generalization to unseen instruction semantics (a continuation of the “diversity matters” line, but newer).
Synthetic “generate more contexts” pipelines (especially active in code, but concept transfers)
- Auto Evol-Instruct: Automatically Designing Instruction Evolution Methods (2024). arXiv
Core move: automate the rules for evolving instructions, aiming for scalable diversity without hand-designed heuristics. - Infinite-Instruct: Synthesizing Scaling Code Instruction Data with Bidirectional Synthesis and Static Verification (2025). arXiv+1
Core move: increase diversity + correctness via bidirectional construction and verification filters. - Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models (2025). arXiv
Core move: evolutionary / genetic-style generation to expand instruction coverage at scale. - Tree-of-Evolution: Tree-Structured Instruction Evolution for Code (2025). ACL Anthology
Core move: explore multiple evolutionary branches (not just a single chain) to improve diversity and quality.
“Robust alignment under distribution shift” (preference context broadening)
- Distributionally Robust Direct Preference Optimization (2025). arXiv+1
Core move: treat real-world preference shift explicitly and optimize a worst-case objective (DRO framing). - Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization (2024). arXiv+1
Core move: robustness to noise / unreliable preference pairs, with an explicitly “robustify DPO” lens. - Leveraging Robust Optimization for LLM Alignment under Distribution Shifts (2025). arXiv
Core move: robust optimization + calibration-aware scoring for preference modeling under shift. - DPO-Shift: Shifting the Distribution of Direct Preference Optimization (2025). arXiv+1
Core move: addresses DPO’s “likelihood displacement” behavior; not exactly “more contexts,” but directly about deployment-relevant failure modes under preference training.
- PAFT: Prompt-Agnostic Fine-Tuning (2025). arXiv+1
- ^
Claude 4.5 Opus' Soul Document may go in a similar direction by broadening the criteria used to evaluate the training set. This document is an elaborate and honest description of what Anthropic wants Claude to do, and why. As of this writing, it's been verified that this was used in training, "including SL" but not whether it was used in RL. If it was used as an RL evaluation criteria, like the 58 constitutional principles were previously, it would broaden the evaluation criteria, which could produce some similar effects to broadening the training set.
- ^
Human ethical judgments are made more reliable by redundancy, and we can hope that better LLMs will include training or mechanisms for making important decisions carefully. Humans use "System 2," serial cognitive steps including judgments, predictions, organizing metacognition, etc. These can be assembled to examine decisions from different semantic "angles," including predicting outcomes along different dimensions and different judgment criteria. This approach reduces errors at the cost of additional time and training.
CoT in current models does some limited amount of checking and redundancy, but I've long suspected, and a new study seems to indicate that humans seem to use more, and more elaborate, metacognition to improve our efficiency and reliability. Training and scaffolding can emulate human metacognition, and serve similar roles for alignment; see System 2 Alignment: Deliberation, Review, and Thought Management.
Discuss
Dos Capital
This week, Philip Trammell and Dwarkesh Patel wrote Capital in the 22nd Century.
One of my goals for Q1 2026 is to write unified explainer posts for all the standard economic debates around potential AI futures in a systematic fashion. These debates tend to repeatedly cover the same points, and those making economic arguments continuously assume you must be misunderstanding elementary economic principles, or failing to apply them for no good reason. Key assumptions are often unstated and even unrealized, and also false or even absurd. Reference posts are needed.
That will take longer, so instead this post covers the specific discussions and questions around the post by Trammell and Patel. My goal is to both meet that post on its own terms, and also point out the central ways its own terms are absurd, and the often implicit assumptions they make that are unlikely to hold.
What Trammell and Patel Are Centrally Claiming As A Default OutcomeThey affirm, as do I, that Piketty was centrally wrong about capital accumulation in the past, for many well understood reasons, many of which they lay out.
They then posit that Piketty could have been unintentionally describing our AI future.
As in, IF, as they say they expect is likely:
- AI is used to ‘lock in a more stable world’ where wealth is passed to descendants.
- There are high returns on capital, with de facto increasing returns to scale due to superior availability of investment opportunities.
- AI and robots become true substitutes for all labor.
- (Implicit) This universe continues to support humanity and allows us to thrive.
- (Implicit) The humans continue to be the primary holders of capital.
- (Implicit) The humans are able to control their decisions and make essentially rational investment decisions in a world in which their minds are overmatched.
- We indefinitely do not do a lot of progressive redistribution.
- (Implicit) Private property claims are indefinitely respected at unlimited scale.
THEN:
- Inequality grows without bound, the Gini coefficient approaches 1.
- Those who invest wisely, with eyes towards maximizing long term returns, end up with increasingly large shares of wealth.
- As in, they end up owning galaxies.
Patel and Trammell: But once robots and computers are capable enough that labor is no longer a bottleneck, we will be in the second scenario. The robots will stay useful even as they multiply, and the share of total income paid to robot-owners will rise to 1. (This would be the “Jevons paradox”.)
Later on, to make the discussions make sense, we need to add:
- There is a functioning human government that can impose taxes, including on capital, in ways that end up actually getting paid.
- (Unclear) This government involves some form of Democratic control?
If you include the implicit assumptions?
Then yes. Very, very obviously yes. This is basic math.
Sounds Like This Is Not Our Main Problem In This Scenario?In this scenario, sufficiently capable AIs and robots are multiplying without limit and are perfect substitutes for human labor.
Perhaps ‘what about the distribution of wealth among humans’ is the wrong question?
I notice I have much more important questions about such worlds where the share of profits that goes to some combination AI, robots and capital rises to all of it.
Why should the implicit assumptions hold? Why should we presume humans retain primary or all ownership of capital over time? Why should we assume humans are able to retain control over this future and make meaningful decisions? Why should we assume the humans remain able to even physically survive let alone thrive?
Note especially the assumption that AIs don’t end up with substantial private property. The best returns on capital in such worlds would obviously go to ‘the AIs that are, directly or indirectly, instructed to do that.’ So if AI is allowed to own capital, the AIs end up with control over all the capital, and the robots, and everything else. It’s funny to me that they consider charitable trusts as a potential growing source of capital, but not the AIs.
Even if we assumed all of that, why should we assume that private property rights would be indefinitely respected at limitless scale, on the level of owning galaxies? Why should we even expect property rights to be long term respected under normal conditions, here on Earth? Especially in a post calling for aggressive taxation on wealth, which is kind of the central ‘nice’ case of not respecting private property.
Expecting current private property rights to indefinitely survive into the transformational superintelligence age seems, frankly, rather unwise?
Eliezer Yudkowsky: What is with this huge, bizarre, and unflagged presumption that property rights, as assigned by human legal systems, are inviolable laws of physics? That ASIs remotely care? You might as well write “I OWN YOU” on an index card in crayon, and wave it at the sea.
Oliver Habryka: I really don’t get where this presumption that property ownership is a robust category against changes of this magnitude. It certainly hasn’t been historically!
Jan Kulveit: Cope level 1: My labour will always be valuable!
Cope level 2: That’s naive. My AGI companies stock will always be valuable, may be worth galaxies! We may need to solve some hard problems with inequality between humans, but private property will always be sacred and human.
Then, if property rights do hold, did we give AIs property rights, as Guive Assadi suggests (and as others have suggested) we should do to give them a ‘stake in the legal system’ or simply for functional purposes? If not, that makes it very difficult for AIs to operate and transact, or for our system of property rights to remain functional. If we do, then the AIs end up with all the capital, even if human property rights remain respected. It also seems right to at some point, if the humans are not losing their wealth fast enough, to expect AIs coordinating to expropriate human property rights while respecting AI property rights, as has happened commonly throughout the history of property rights when otherwise disempowered groups had a large percentage of wealth.
The hidden ‘libertarian human essentialist’ assumptions continue. For example, who are these ‘descendants’ and what are the ‘inheritances’? In these worlds one would expect aging and disease to be solved problems for both humans and AIs.
Such talk and economic analysis often sounds remarkably parallel to this:
To Be Clear This Scenario Doesn’t Make SenseThe world described here has AIs that are no longer normal technology (while it tries to treat them as normal in other places anyway), it is not remotely at equilibrium, there is no reason to expect its property rights to endorse or to stay meaningful, it would be dominated by its AIs, and it would not long endure.
If humans really are no longer useful, that breaks most of the assumptions and models of traditional econ along with everyone else’s models, and people typically keep assuming actually humans will still be useful for something sufficiently for comparative advantage to rescue us, and can’t actually wrap their heads around it not being true and humans being true zero marginal product workers given costs.
Paul Crowley: A lot of these stories from economics about how people will continue to be valuable make assumptions that don’t apply. If the models can do everything I do, and do it better, and faster, and for less than it costs me to eat, why would someone employ me?
It’s really hard for people to take in the idea of an AI that’s better than any human at *every* task. Many just jump to some idea of an uber-task that they implicitly assume humans are better at. Satya Nadella made exactly this mistake on Dwarkesh.
Dwarkesh Patel: If labor is the bottleneck to all the capital growth. I don’t see why sports and restaurants would bottleneck the Dyson sphere though.
That’s the thing. If we’re talking about a Dyson sphere world, why are we pretending any of these questions are remotely important or ultimately matter? At some point you have to stop playing with toys.
A lot of this makes more sense if we don’t think it involves Dyson spheres.
Under a long enough time horizon, I do think we can know roughly what the technologies will look like barring the unexpected discovery of new physics, so I’m with Robin Hanson here rather than Andrew Cote, today is not like 1850:
Andrew Cote: This kind of reasoning – that the future of humanity will be rockets, robots, and dyson swarms indefinitely into the future, assumes an epistemological completeness that we already know the future trade-space of all possible technologies.
It is as wrong as it would be to say, in 1850, that in two hundred years any nation that does not have massive coal reserves will be unfathomably impoverished. What could there be besides coal, steel, rail, and steam engines?
Physics is far from complete, we are barely at the beginning of what technology can be, and the most valuable things that can be done in physical reality can only be done by conscious observers, and this gets to the very heart of interpretations of quantum mechanics and physical theory itself.
Robin Hanson: No, more likely than not, we are constrained to a 3space-1time space-time where the speed of light is a hard limit on travel/influence, thermodynamics constrains the work we can do, & we roughly know what are the main sources of neg-entropy. We know a lot more than in 1850.
Even in the places were the assumptions aren’t obviously false, or you want to think they’re not obviously false, and also you want to assume various miracles occur such that we dodge outright ruin, certainly there’s no reason to think the future situation will be sufficiently analogous to make these analyses actually make sense?
Daniel Eth: This feels overly confident for advising a world completely transformed. I have no idea if post-AGI we’d be better off taxing wealth vs consumption vs something else. Sure, you can make the Econ 101 argument for taxing consumption, but will the relevant assumptions hold? Who knows.
Seb Krier: I also don’t have particularly good intuitions about what a world with ASI, nanotechnology and Dyson swarms looks like either.
Futurist post-AGI discussions often revolve around thinking at the edge of what’s in principle plausible/likely and extrapolating more and more. This is useful, but the compounding assumptions necessary to support a particular take contain so many moving parts that can individually materially affect a prediction.
It’s good to then unpack and question these, and this creates all sorts of interesting discussions. But what’s often lost in discussions is the uncertainty and fragility of the scaffolding that supports a particular prediction. Some variant of the conjunction fallacy.
Which is why even though I find long term predictions interesting and useful to expand the option space, I rarely find them particularly informative or sufficient to act on decisively now. In practice I feel like we’re basically hill-climbing on a fitness landscape we cannot fully see.
Brian Albrecht: I appreciate Dwarkesh and Philip’s piece. I responded to one tiny part.
But I’ll admit I don’t have a good intuition for what will happen in 1000 years across galaxies. So I think building from the basics seems reasonable.
I don’t even know that ‘wealth’ and ‘consumption’ would be meaningful concepts that look similar to how they look now, among other even bigger questions. I don’t expect ‘the basics’ to hold and I think we have good reasons to expect many of them not to.
Ben Thompson: This world also sounds implausible. It seems odd that AI would acquire such fantastic capabilities and yet still be controlled by humans and governed by property laws as commonly understood in 2025. I find the AI doomsday scenario — where this uber-capable AI is no longer controllable by humans — to be more realistic; on the flipside, if we start moving down this path of abundance, I would expect our collective understanding of property rights to shift considerably.
Ultimately all of this, as Tomas Bjartur puts it, imagines an absurd world, assuming away all of the dynamics that matter most. Which still leaves something fun and potentially insightful to argue about, I’m happy to do that, but don’t lose sight of it not being a plausible future world, and taking as a given that all our ‘real’ problems mysteriously turn out fine despite us having no way to even plausibly describe what that would look like, let alone any idea how to chart a path towards making it happen.
Ad ArgumentoThus from this point on, this post accepts the premises listed above, ad argumento.
I don’t think that world actually makes a lot of sense on reflection, as an actual world. Even if all associated technical and technological problems are solved, including but not limited to all senses of AI alignment, I do not see a path arriving at this outcome.
I also have lots of problems with parts the economic baseline case under this scenario.
The discussion is still worth having, but one needs to understand all this up front.
It’s even worth having that discussion even if the economists are mostly rather dense and smg and trotting out their standard toolbox as if nothing else ever applies to anything. I agree with Yo Shavit that it is good that this and other writing and talks from Dwarkesh Patel are generating serious economic engagement at all.
The Baseline ScenarioIf meaningful democratic human control over capital persisted in a world trending towards extreme levels of inequality, I would expect to see massive wealth redistribution, including taxes on or confiscation of extreme concentrations of wealth.
If meaningful democratic control didn’t persist, then I would expect the future to be determined by whatever forces had assumed de facto control. By default I would presume this would be ‘the AIs,’ but the same applies if some limited human group managed to retain control, including over the AIs, despite superintelligence. Then it would be up to that limited group what happened after that. My expectation would be that most such groups would do some redistribution, but not attempt to prevent the Gini coefficient going to ~1, and they would want to retain control.
Jan Kelveit’s pushback here seems good. In this scenario, human share of capital will go to zero, our share of useful capability for violence will also go to zero, the use of threats as leverage won’t work and will go to zero, and our control over the state will follow. Harvey Lederman also points out related key flaws.
As Nikola Jurkovic notes, if superintelligence shows up and if we presume we get to a future with tons of capital and real wealth but human labor loses market value, and if humans are still alive and in control over what to do with the atoms (big ifs), then as he points out we fundamentally are going to either do charity for those who don’t have capital, or those people perish.
That charity can take the form of government redistribution, and one hopes that we do some amount of this, but once those people have no leverage it is charity. It could also take the form of private charity, as ‘the bill’ here will not be so large compared to total wealth.
Would We Have An Inequality Problem?It is not obvious that we would.
Inequality of wealth is not inherently a problem. Why should we care that one man has a million dollars and a nice apartment, while another has the Andromeda galaxy?What exactly are you going to do with the Andromeda galaxy?
A metastudy in Nature released last week concluded that economic inequality does not equate to poor well-being or mental health.
I also agree with Paul Novosad that it seems like our appetite for a circle of concern and generous welfare state is going down, not up. I’d like to hope that this is mostly about people feeling they themselves don’t have enough, and this would reverse if we had true abundance, but I’d predict only up to a point, no we’re not going to demand something resembling equality and I don’t think anyone needs a story to justify it.
Dwarkesh’s addendum that people are misunderstanding him, and emphasizing the inequality is inherently the problem, makes me even more confused. It seems like, yes, he is saying that wealth levels get locked in by early investment choices, and then that it is ‘hard to justify’ high levels of ‘inequality’ and that even if you can make 10 million a year in real income in the post-abundance future Larry Page’s heirs owning galaxies is not okay.
I say, actually, yes that’s perfectly okay, provided there is stable political economy and we’ve solved the other concerns so you can enjoy that 10 million a year in peace. The idea that there is a basic unit, physical human minds, that all have rights to roughly equal wealth, whereas the more capable AI minds and other entities don’t, and anything else is unacceptable? That doesn’t actually make a lot of sense, even if you accept the entire premise.
Tom Holden’s pushback is that we only care about consumption inequality, not wealth inequality, and when capital is the only input taking capital hurts investment, so what you really want is a consumption tax.
Similar thinking causes Brian Albrecht to say ‘redistribution doesn’t help’ when the thing that’s trying to be ‘helped’ is inequality. Of course redistribution can ‘help’ with that. Whereas I think Brian is presuming what you actually care about is the absolute wealth or consumption level of the workers, which of course can also be ‘helped’ by transfers, so I notice I’m still confused.
But either way, no, that’s not what anyone is asking in this scenario – the pie doth overfloweth, so it’s very easy for a very small tax to create quite a lot of consumption, if you can actually stay in control and enforce that tax.
I agree that in ‘normal’ situations among humans consumption inequality is what matters, and I would go further and say absolute consumption levels are what matters most. You don’t have to care so much about how much others consume so long as you have plenty, although I agree that people often do. I have 1000x what I have now and I don’t age or die, and my loved ones don’t age or die, but other people own galaxies? Sign me the hell up. Do happy dance.
Dwarkesh explicitly disagrees and many humans have made it clear they disagree.
Framing this as ‘consumption’ drags in a lot of assumptions that will break in such worlds even if they are otherwise absurdly normal. We need to question this idea that meaningful use of wealth involves ‘consumption,’ whereas many forms of investment or other such spending are in this sense de facto consumption. Also AIs don’t ‘consume’ is this sense so again this type of strategy only accelerates disempowerment.
The good counter argument is that sufficient wealth soon becomes power.
Paul Graham: The rational fear of those who dislike economic inequality is that the rich will convert their economic power into political power: that they’ll tilt elections, or pay bribes for pardons, or buy up the news media to promote their views.
I used to be able to claim that tech billionaires didn’t actually do this — that they just wanted to refine their gadgets. But unfortunately in the current administration we’ve seen all three.
It’s still rare for tech billionaires to do this. Most do just want to refine their gadgets. That habit is what made them billionaires. But unfortunately I can no longer say that they all do.
I don’t think the inequality being ‘hard to justify’ is important. I do think ‘humans, often correctly, beware inequality because it leads to power’ is important.
No You Can’t Simply Let Markets Handle EverythingGarry Tan’s pushback of ‘whoa Dwarkesh, open markets are way better than redistribution’ and all the standard anti-redistribution rhetoric and faith that competition means everyone wins, a pure blind faith in markets to deliver all of us from everything and the only thing we have to fear is government regulations and taxes and redistribution, attacking Dwarkesh for daring to suggest redistribution could ever help with anything, is a maximally terrible response.
Not only is it suicidal in the face of the problems Dwarkesh is ignoring, it is also very literally suicidal in the face of labor income dropping to zero. Yes, prices fall and quality rises, and then anyone without enough capital starves anyway. Free markets don’t automagically solve everything. Mostly free markets are mostly the best solutions to most problems. There’s a difference.
You can decide that ‘inequality’ is not in and of itself a problem. You do still need to do some amount of ‘non-market’ redistribution if you want humans whose labor is not valuable to survive other than off capital, because otherwise they won’t. Maybe Garry Tan is fine with that if it boosts the growth rate. I’m not fine with it. The good news is that in this scenario we will be supremely wealthy, so a very small tax regime will enable all existing humans to live indefinitely in material wealth we cannot dream of.
Proposed SolutionsOkay, suppose we do want to address the ‘inequality’ problem. What are our options?
Their first proposed solution are large inheritance taxes. As noted above, I would not expect these ultra wealthy people or AIs to die, so I don’t expect there to be inheritances to tax. If we lean harder into ‘premise!’ and ignore that issue, then I agree that applying taxes on death rather than continuously has some incentive advantages but also it introduces an insane level of distorted incentives if you tried to make this revenue source actually matter versus straight wealth taxes.
The proposed secondary solution of a straight up large wealth tax is justified by ‘the robots will work just as hard no matter the tax rate,’ to argue that this won’t do too much economic damage, but to the extent they are minds or minds are choosing the robot behaviors this simply is not effectively true, as most economists will tell you. They might work as hard, but they won’t work in the same ways towards the same ends, because either humans or AIs will be directing what the robots do and the optimization targets have changed. Communist utopia is still communist utopia, it’s weird to see it snuck in here as if it isn’t what it is.
The tertiary solution, a minimum ‘spending’ requirement, starts to get weird quickly if you try to pin it down. What is spending? What is consumption? This would presumably be massively destructive, causing massive wasteful consumption, on the level of ‘destroying a large portion of the available mass-energy in the lightcone for no effect.’ It’s a cool new thing to think about. Ultimately I don’t think it works, due to mismatched conceptual assumptions.
They also suggest taxing ‘natural resources.’ In a galactic scenario this seems like an incoherent proposal when applied to very large concentrations of wealth, not functionally different than straight up taxing wealth. If it is confined to Earth, then you can get some mileage out of this, but that’s solving your efficient government revenue problems, not your inequality problems. Do totally do it anyway.
The real barriers to implementing massive redistribution are ‘can the sources of power choose to do that?’ and ‘are we willing to take the massive associated hits to growth?’
The good news for the communist utopia solution (aka the wealth tax) is that it would be quite doable to implement it on a planetary scale, or in ‘AI as normal technology’ near term worlds, if the main sources of power wanted to do that. Capital controls are a thing, as is imposing your will on less powerful jurisdictions. ‘Capital’ is not magic.
The problem on a planetary scale is that the main sources of real power are unlikely to be the democratic electorate, once that electorate no longer is a source of either economic or military power. If the major world powers (or unified world government) want something, and remain the major world powers, they get it.
When you’re going into the far future and talking about owning galaxies, you then have some rather large ‘laws of physics’ problems with enforcement? How are you going to collect or enforce a tax on a galaxy? What would it even mean to tax it? In what sense do they ‘own’ the galaxy?
A universe with only speed-of-light travel, where meaningful transfers require massive expenditures of energy, and essentially solved technological possibilities, functions very, very differently in many ways. I don’t think they’re being thought through. If you’re living in a science fiction story for real, best believe in them.
Wealth Taxes Today Are Grade-A StupidAs Tyler Cowen noted in his response to Dwarkesh, there are those who want to implement wealth taxes a lot sooner than when AI sends human labor income to zero.
As in, they want to implement it now, including now in California, where there is a serious proposal for a wealth tax, including on unrealized capital gains, including illiquid ones in startups as assessed by the state.
That would be supremely, totally, grade-A stupid and destructive if implemented, on the level of ‘no actually this would destroy San Francisco as the tech capital.’
Tech and venture capital like to talk the big cheap talk about how every little slight is going to cause massive capital flight, and how everything cool will happen in Austin and Miami instead of San Francisco and New York Real Soon Now because Socialism. Mostly this is cheap talk. They are mostly bluffing. The considerations matter on the margin, but not enough to give up the network effects or actually move.
They said SB 1047 would ‘destroy California’s AI industry’ when its practical effect would have been precisely zero. Many are saying similar things about Mamdani, who could cause real problems for New York City in this fashion, but chances are he won’t. And so on, there’s always something, usually many somethings.
So there is most definitely a ‘boy who cried wolf’ problem, but no, seriously, wolf.
I believe it would be a 100% full wolf even if you could pay in kind with illiquid assets, or otherwise have a workaround. It would still be obviously unworkable including due to flight. Without a workaround for illiquid assets, this isn’t even a question, the ecosystem is forced to flee overnight.
Looking at historical examples, a good rule of thumb is:
- High taxes on realized capital gains or high incomes do drive people away, but if you offer sufficient value most of them suck it up and stay anyway. There is a lot of room, especially nationally, to ensure billionaires get taxed on their income.
- Wealth taxes are different. Impacted people flee and take their capital with them.
The good news is California Governor Gavin Newsom is opposed, but this Manifold market still gives the proposed ‘2026 Billionaires Tax Act’ a 19% chance of collecting over a billion in revenue. That’s probably too high, but even if it’s more like 10%, that’s only the first attempts, and that’s high enough to have a major chilling effect already.
Tyler Cowen Responds With Econ EquationsTo be fair to Tyler Cowen, his analysis assumes a far more near term, very much like today scenario rather than Dyson spheres and galaxies, and if you assume AI is having sufficiently minor impact and things don’t change much, then his statements, and his treating the future world as ours in a trenchcoat, makes a lot more sense.
Tyler Cowen offered more of the ‘assume everything important doesn’t matter and then apply traditional economic principles to the situation’ analysis, try to point to equations that suggest real wages could go up in worlds where labor doesn’t usefully accomplish anything, and look at places humans would look to increase consumption so you can tax health care spending or quality home locations to pay for your redistribution, as if this future world is ours in a trenchcoat.
Similarly, here Garett Jones claims (in a not directly related post) that if there is astronomical growth in ‘capital’ (read: AI) such that it’s ‘unpriced like air’, and labor and capital are perfect substitutes, then capital share of profits would be zero. Except, unless I and Claude are missing something rather obvious, that makes the price of labor zero. So what in the world?
That leaves the other scenario, which he also lists, where labor and ‘capital’ are perfect complements, as in you assume human labor is mysteriously uniquely valuable and rule of law and general conditions and private property hold, in which case by construction yes labor does fine, as you’ve assumed your conclusion. That’s not the scenario being considered by the OP, indeed the OP directly assumes the opposite.
No, do not assume the returns stay with capital, but why are you assuming returns stay with humans at all? Why would you think that most consumption is going to human consumption of ordinary goods like housing and healthcare? There are so many levels of scenario absurdity at play. I’d also note that Cowen’s ideas all involve taxing humans in ways that do not tax AIs, accelerating our disempowerment.
Brian Albrecht Responds With More EquationsAs another example of economic toolbox response we have Brian Albrecht here usefully trotting out supply and demand to engage with these questions, to ask whether we can effectively tax capital which depends on capital supply elasticity and so on, talking about substituting capital and labor, except the whole point is that labor (if we presume AI is of the form ‘capital’ rather than labor, and that the only two relevant forms of production are capital and labor, which smuggles in quite a lot of additional assumptions I expect to likely become false in ways I doubt Brian is realizing) is now irrelevant and strictly dominated by capital. I would ask, why are we asking about the rate of ‘capital substitution for labor’ in a world in which capital has fully replaced labor?
So this style engagement is great compared to not engaging, but on another level also completely misses the point? When they get to talking downthread it seems like the point is missed even more, with statements like ‘capital never gets to share of 1 because of depreciation, you get finite K*.’ I’m sorry, what? The forest has been lost, models are being applied to scenarios where they don’t make sense.
Discuss
Announcing the CLR Fundamentals Program
Center on Long-Term Risk (CLR) is accepting applications to the next iteration of our introductory fellowship.
The CLR Fundamentals Program will introduce people to CLR’s research on s-risk reduction. Apply by Monday January 19th 23:59 GMT.
What topics does the CLR Fundamentals Program discuss?The Fundamentals Program is designed to introduce participants to CLR’s research on how transformative AI (TAI) might be involved in the creation of large amounts of suffering (s-risks). Our recent work on addressing these risks include our Model Personas agenda, s-risk macrostrategy, safe Pareto improvements (SPIs) and multiagent AI safety, in addition other relevant areas such as AI governance, epistemology, and risks from malevolent or fanatical actors.
We recommend people consider Center for Reducing Suffering’s S-risk Introductory Fellowship if they are interested in learning about work on s-risks beyond these priority areas, and consider BlueDot’s AI Safety courses if they are interested in learning about existential risks from TAI.
Note that some prior familiarity with discourse surrounding TAI and AI safety will be helpful.
Who are these programs a good fit for?We think the Fundamentals Program will be most useful for you if you are interested in reducing s-risk through contributions to CLR’s priority areas, and are seriously considering making this a priority for your career. They could also be useful if you are already working in an area that overlaps with CLR’s priorities (e.g. AI governance, AI alignment), and are interested in ways you can help reduce s-risks in the course of your current work.
We recommend those interested in our Summer Research Fellowship first take part in the Fundamentals Program, though participation is not a prerequisite. This way you have a better sense of what work on our priority areas involves, and are better able to “hit the ground running” if accepted.
There might be more idiosyncratic reasons to apply and the criteria above are intended as a guide rather than strict criteria.
When and how long is the CLR Fundamentals Program?The upcoming Fundamentals Program will take place between Monday January 25th and Friday February 29th.
How do I apply to these programs?The application form for the CLR Fundamentals Program is here. The application deadline is Monday January 19th 23:59 GMT. If you do not have time to submit an application by this deadline but are interested in participating, please contact us (the earlier the better) and we may be able to accommodate your needs.
If you would like to be notified about these or future introductory programs, please sign-up for updates here.
I have questions!If you have any questions about these programs or are uncertain whether to apply to the CLR Fundamentals Program, please leave a comment here or reach out to info@longtermrisk.org.
Discuss
AI Risk timelines: 10% chance (by year X) should be the headline (and deadline), not 50%. And 10% is _this year_!
Artificial General Intelligence (AGI) poses an extinction risk to all known biological life. Given the stakes involved -- the whole world -- we should be looking at 10% chance-of-AGI-by timelines as the deadline for catastrophe prevention (a global treaty banning superintelligent AI), rather than 50% (median) chance-of-AGI-by timelines, which seem to be the default[1].
It’s way past crunch time already: 10% chance of AGI this year![2] Given alignment/control is not going to be solved in 2026, and if anyone builds it, everyone dies (or at the very least, the risk of doom is uncomfortably high by most estimates), a global Pause of AGI development is an urgent immediate priority. This is an emergency. Thinking that we have years to prevent catastrophe is gambling a huge amount of current human lives, let alone all future generations and animals.
To borrow from Stuart Russell's analogy: if there was a 10% chance of aliens landing this year[3], humanity would be doing a lot more than we are currently doing[4]. AGI is akin to an alien species more intelligent than us that is unlikely to share our values.
- ^
This is an updated version of this post of mine from 2022.
- ^
In the answer under “Why 80% Confidence?” on the linked page, it says “there's roughly a 10% chance AGI arrives before [emphasis mine] the lower bound”, so before 2027; i.e. 2026. See also: the task time horizon trends from METR. You might want to argue that 10% is actually next year (2027), based on other forecasts such as this one, but that only makes things slightly less urgent - we’re still in a crisis if we might only have 18 months.
- ^
This is different to the original analogy, which was an email saying: "People of Earth: We will arrive on your planet in 50 years. Get ready." Say astronomers spotted something that looked like a spacecraft, heading in our direction, and estimated there was 10% chance that it was indeed an alien spacecraft.
- ^
Although perhaps we wouldn't. Maybe people would endlessly argue about whether the evidence is strong enough to declare a 10%(+) probability. Or flatly deny it.
Discuss
Transformers, Intuitively
What are transformers, and some ways to think about them.
Modern LLMs are Transformers. An LLM is a transformer in the same way that a building might be a ‘Art Deco building’ – specific instances of a general architecture. Like with Art Deco, there are many variants of buildings that would still be considered the same style. Similarly, while there are many different variants of transformer-based LLMs, they all follow a similar architecture. However, not every building is a Art Deco building. And there exist LLMs that are not transformers. That being said, with modern LLMs today, virtually all decent models are transformers.
Art Deco isn’t just an architectural style. You also have fashion, jewelery, furniture – even fonts! – that we call Art Deco. Similarly, this same transformer architecture, can also be applied to various models beyond text. You can have transformers trained to understand images, parse DNA sequences, generate audio, and so on.
How can transformers do all these different things? Well, at its core, a transformer is a next token prediction machine. In text, tokens are fragments of words. We feed the model a string of input text. It reads it all at once, and comes up with a prediction for the word-fragment that’s most likely to follow everything that’s come so far. Then you take this new token, glom it onto the rest, and pass the entire thing through the model again. And you repeat this until you’ve got as many tokens as you want.
This process isn’t limited to text, though. As long as you can convert something into tokens – sequence-fragments – you can put it through a transformer. DNA sequences can be broken up into smaller lengths; audio can be split into shorter snippets. I don’t know it works for images, but there’s certainly some clever way to segment them up as well.
——
“Transformer” is a confusing term, because there are two things it can mean. A transformer refers to, both, the architecture of the model, and a single layer within the model. It’s like how you’d call the thing in the image “a wafer”, and you’d also call a single layer of that “a wafer”. And just like with wafers, a transformer model is composed of many transformer layers stacked on top of each other.
But unlike a wafer, a transformer lets stuff through. The metaphor here is information flow. We start with some input information – our text – that begins all the way upstream. Once we feed it into the model, this river travels down, layer by layer, until at the very end, when it’s been transformed into the next most-likely token.
Our transformer, however, doesn’t just predict the token that comes after the entire text. It also predicts tokens that are likely to follow each sub-sequence in the text. “What comes after ‘Our’ ”? “What comes after ‘Our transformer’ ”? And so on. This felt weird when I first saw this. Why is it bothering with all these intermediate predictions, when we only care about the last one? Wouldn’t it be much quicker to skip all these?
The way it works, actually, is we get all these intermediate predictions for free. It helps to imagine it as a river. Our river starts off very wide – it’s as wide as the entire text we feed in. And, because of how a transformer works, this river continues to remain wide as it flows downstream. Each rivulet is a single token, as it flows down the steppe. And all these streams go through the same layers of alchemical transformation, until, at the very end, each individual stream has now morphed into the token that best comes next.
—— ——
A follow-up question, here, is how do all these individual streams interact? What’s going on with this river? Can we split these streams and say something about how it flows? Or is the river turgid & bubbly, with the details being lost in the froth?
Each single transformer layer, has two parts. The first part moves information between streams; the second part processes it. Movement, here, is the critical step – it’s what makes transformers so effective. Say, the last token of my input is “bank.” If can’t see any of the words that come before it, then I don’t really know what the word means. Is this a river bank or a bank that loans money? If it’s the former, is it muddy, dried up, or covered in grass? Why do I care about the bank? Am I reading a geography textbook, an environmental report, or an idyllic folk tale? Without all this context – context that requires moving information – a transformer only knows about that one, single token. In that case, its best prediction strategy would be learning bigram patterns.
So to predict the next token, the model needs to soak up the context of everything that has come before it. The first half of each layer is responsible for streams crossing over and mixing information around. The second half, then, processes & assimilates everything new added in. Repeat this dozens of times across the length of the river, and you’ve got yourself a transformer.
——
There’s something unsettling about all this happening in parallel. “The river bank was muddy”. Not only does the model have to understand that muddy refers to the bank, but also this was a river bank in particular. It’s simultaneously unraveling the meaning of each word, by looking at other words; and it’s understanding those by looking at yet other words, and so on. It’s a weird, bootstrappy, hermeneutic unpacking – which somehow manages to work.
A consequence of this is a model can be bottlenecked by how many layers it has. Fewer layers means fewer steps to transform the information successfully. Say I give a model with 5 layers, a text of 1000 words. Then, at most, it’ll be able to make five hops between pieces of information. Depending on how complex the text is, this might not be enough to understand everything fully.
This parallelism also means, qualitatively, LLMs process text differently than us. When you or I read a book, we do so serially – one sentence at a time. This takes a while. However, as we read, we’re also able to process and assimilate previous parts of the text. This means we build richer connections between the information we’ve read. Our network is dense. LLMs, though, process all this text in parallel. While they’re *much* quicker than us, they’re also limited by how deeply they can process it.
It’s like how you or I might glimpse at a painting. We’re able to absorb everything at once, and form a quick impression of what’s going on. We get a sense of the subject matter, the colors, the composition, what strikes us about the painting – all from one glimpse. Similarly, LLMs process text, especially large amounts of it, in an impressions-based, intuitive manner. They can quickly summarize & convey the gestalt of what’s going on. What’s tougher, is the details. This is partly why ‘reasoning models’ have been so successful. They give LLMs a ‘scratchpad’ to process & understand the information more deeply; a ‘system two’ to complement the default ‘system one’.
Intuitions for reasoning models, though, is a topic for another post.
Discuss
The Technology of Liberalism
Originally published in No Set Gauge.
Dido Building Carthage, J. M. W. Turner
In every technological revolution, we face a choice: build for freedom or watch as others build for control.
-Brendan McCord
There are two moral frames that explain modern moral advances: utilitarianism and liberalism.
Utilitarianism says: “the greatest good for the greatest number”. It says we should maximize welfare, whatever that takes.
Liberalism says: “to each their own sphere of freedom”. We should grant everyone some boundary that others can’t violate, for example over their physical bodies and their property, and then let whatever happen as long as those boundaries aren’t violated.
(Before the philosophers show up and cancel me: “utilitarianism” and “liberalism” here are labels for two views, that correspond somewhat but not exactly to the normal uses of the phrases. The particular axis I talk about comes from my reading of Joe Carlsmith, who I’ll quote at length later in this post. Also, for the Americans: “liberalism” is not a synonym for “left-wing”.)
Some of the great moral advances of modernity are:
- women’s rights
- abolition of slavery
- equality before the law for all citizens of a country
- war as an aberration to be avoided, rather than an “extension of politics by other means”
- gay rights
- the increasing salience of animal welfare
Every one of these can be seen as either a fundamentally utilitarian or fundamentally liberal project. For example, viewing wars of aggression as bad is both good for welfare (fewer people dying horrible deaths way too early), and a straightforward consequence of believing that “your right to swing your fist ends where my nose begins” (either on a personal or national level). Women’s rights are both good for welfare through everything from directly reducing gender-based violence to knock-on effects on economic growth, as well as a straightforward application of the idea that people should be equal in the eyes of the law and free to make their own choices and set their own boundaries.
Utilitarian aims are a consequence of liberal aims. If you give everyone their boundaries, their protection from violence, and their God and/or government -granted property rights, then Adam Smith and positive-sum trade and various other deities step in and ensure that welfare is maximized (at least if you’re willing to grant a list of assumptions). More intuitively and generally, everyone naturally tries to achieve what they want, and when you ban boundary-violating events, everyone will be forced to achieve what they want through win-win cooperation rather than, say, violence.
Liberal aims are a consequence of utilitarian aims. If you want to maximize utility, then Hayek and the political scientists and the moral arc of humanity over the last two centuries all show up and demand that you let people choose for themselves, and have their own boundaries and rights. Isn’t it more efficient when people can make decisions based on the local information they have? How many giga-utils have come from women being able to pursue whatever job they want, or gay people being free from persecution? Hasn’t the wealth of developed countries, and all the welfare that derives from that, come from institutions that ensure freedom and equality before the law, enforce a stable set of rules, and avoid arbitrary despotism?
Much discussion of utilitarianism focuses on things like trolley problems that force you to pick between welfare losses and boundary violations. Unless you happen to live near a trolley track frequented by rogue experimental philosophers, however, you’ll notice that such circumstances basically never happen (if you are philosophically astute, you’ll also be aware that as a real-life human you shouldn’t violate the boundaries even if it seems like a good idea).
However, as with everything else, the tails come apart when things get extreme enough. The concept of good highlighted by welfare and that highlighted by freedom do diverge in the limit. But will we get pushed onto extreme philosophical terrain in the real world, though? I believe we don’t need to worry about a sudden influx of rogue experimental philosophers, or a trolley track construction spree. But we might need to worry about the god-like AI that every major AGI lab leader, the prediction markets, and the anonymous internet folk who keep turning out annoyingly right about everything warn us might be coming in a few years.
The Effective Altruists, to their strong credit, have taken the intersection of AI and moral philosophy seriously for years. However, their main approach has been to perfect alignment—the ability to reliably point an AI at a goal—while also in tandem figuring out the correct moral philosophy to then align the AI to, such that we point it at the right thing. Not surprisingly, there are some unresolved debates around the second bit. (Also, to a first approximation, the realities of incentives mean that the values that get encoded into the AI will be what AI lab leadership wants modulo whatever changes are forced on them by customers or governments, rather than what the academics cook up as ideal.)
In this post, I do not set out to solve moral philosophy. In fact, I don’t think “solve moral philosophy” is a thing you can (or should) do. Instead, my concern is that near-future technology by default, and AI in particular, may differentially accelerate utilitarian over liberal goals. My hope is that differential technological development—speeding up some technologies over others—can fix this imbalance, and help continue moral progress towards a world that is good by many lights, rather than just one.
The real Clippy is the friends we made along the way
Joe Carlsmith has an excellent essay series on “Otherness and control in the age of AGI“. It’s philosophy about vibes, but done well, and thoroughly human to the core.
He starts off with a reminder: Eliezer Yudkowsky thinks we are all going to die. We don’t know how to make the AIs precisely care about human values, the AI capabilities will shoot up without a corresponding increase in their caring about our values, and they will devour the world. A common thought experiment is the paperclip-maximizer AI, sometimes named “Clippy” after the much-parodied Microsoft Office feature. The point of the thought experiment is that optimizing hard for anything (e.g. paperclips) entails taking over the universe and filling it with that thing, and in the process destroy everything else. In his essay “Being nicer than Clippy”, Carlsmith writes:
Indeed, in many respects, Yudkowsky’s AI nightmare is precisely the nightmare of all-boundaries-eroded. The nano-bots eat through every wall, and soon, everywhere, a single pattern prevails. After all: what makes a boundary bind? In Yudkowsky’s world (is he wrong?), only two things: hard power, and ethics. But the AIs will get all the hard power, and have none of the ethics. So no walls will stand in their way.
The reason the AIs will be such paperclipping maximizers is because Yudkowsky’s philosophy emphasizes some math that points towards: “If you aren’t running around in circles or stepping on your own feet or wantonly giving up things you say you want, we can see your behavior as corresponding to [expected utility maximization]” (source). Based on this, the Yudkowskian school thinks that the only way out is very precisely encoding the right set of terminal values into the AI. This, in turn, is harder than teaching the AIs to be smart overall because: “There’s something like a single answer, or a single bucket of answers, for questions like ‘What’s the environment really like?’ [... and so w]hen you have a wrong belief, reality hits back at your wrong prediction [...] In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom”.
So: anything smart enough is a rocket aimed at some totalizing arrangement of all matter in the universe. You need to aim the rocket at exactly the right target because otherwise it flies off into some unrelated voids of space and you lose everything.
To be clear, to Yudkowsky this is a factual prediction about the behavior of things that are smart enough, rather than a normative statement that the correct morality is utilitarian in this way. Big—and very inconvenient—if true.
Now at this point you might remember that even some humans disagree with each other about what utopia should look like. Of course, a standard take in AI safety is that the main technical problem we face is pointing the AIs reliably at anything at all, and therefore arguing over the “monkey politics” (as Carlsmith puts it) is pointless politicking that detracts from humanity’s technical challenge.
However, in another essay in his sequence, Carlsmith points out that any size of value difference between two agents could lead to one being a paperclipper by the lights of the other:
What force staves off extremal Goodhart in the human-human case, but not in the AI-human one? For example: what prevents the classical utilitarians from splitting, on reflection, into tons of slightly-different variants, each of whom use a slightly-different conception of optimal pleasure (hedonium-1, hedonium-2, etc)? And wouldn’t they, then, be paperclippers to each other, what with their slightly-mutated conceptions of perfect happiness?
This is not true just between agents, but also between you at a certain time and you at a slightly later time:
[..] if you read a book, or watch a documentary, or fall in love, or get some kind of indigestion [...] your heart is never exactly the same ever again, and not because of Reason [...so] then the only possible vector of non-trivial long-term value in this bleak and godless lightcone has been snuffed out?! Wait, OK, I have a plan: this precise person-moment needs to become dictator. It’s rough, but it’s the only way. Do you have the nano-bots ready? Oh wait, too late. (OK, how about now? Dammit: doom again.)
Now, this isn’t Yudkowsky’s view. But why not? Remember: in the Yudkowskian frame, for any agent to “want” something coherently, it must have a totalizing vision of how to structure the entire universe. And who is to say that even small differences in values don’t lead to different ideal universes, especially under over-optimization to the limit of those values? After all, as Yudkowsky repeatedly emphasizes, value is fragile. However, the overall framework, Carlsmith argues, creates an overall “momentum towards deeming more and more agents (and agent-moments) paperclippers”. It makes it very natural that we’ve just got to know exactly what is valuable well enough to write it down in a form without contradictions that we can optimize ruthlessly without breaking it. And we should do this as soon as possible, because our values might change, or because another entity might get to do it first, and chances are they’d be a paperclipper with respect to us. Time to take over the universe, for the greater good!
What’s the alternative? In the original Carlsmith essay, he writes:
Liberalism does not ask that agents sharing a civilization be “aligned” with each other in the sense at stake in “optimizing for the same utility function.” Rather, it asks something more minimal, and more compatible with disagreement and diversity – namely, that these agents respect certain sorts of boundaries; that they agree to transact on certain sorts of cooperative and mutually-beneficial terms; that they give each other certain kinds of space, freedom, and dignity. Or as a crude and distorting summary: that they be a certain kind of nice. Obviously, not all agents are up for this – and if they try to mess it up, then liberalism will, indeed, need hard power to defend itself. But if we seek a vision of a future that avoids Yudkowsky’s nightmare, I think the sort of pluralism and tolerance at the core of liberalism will often be more a promising guide than “getting the utility function that steers the future right.”
So, then: we have Yudkowsky, claiming that, as much as we humans benefit from centering rights and virtues in our idea of the good, the AIs will be superhumanly intelligent, and it is the nature of rationality that beings bend more and more towards a specific type of “coherence”—and hence utilitarian consequentialism—as they get smarter.
Yudkowsky is not an arch-utilitarian when it comes to how humans should act. But in his worldview, reality, alas, prefers the consequentialists. (Source)So, this then is the irony: Yudkowsky is a deep believer in humanism and liberalism. But his philosophical framework makes him think anything sufficiently smart becomes a totalizing power-grabber. The Yudkowskian paperclipper nightmare explicitly comes from a lack of liberalism, in the rules-and-boundaries sense. Yudkowsky’s ideal solution would be to figure out how to encode the non-utilitarian limits into the AI—”corrigibility”, for example, means that the AI doesn’t resist being corrected, and is something where Yudkowsky and others have spent lots of time on trying to make it make sense within the expected-utility-maximising paradigm. But that seems deeply technically difficult within the expected-utility-maximizer paradigm.
These philosophical vibes, plus the unnaturalness of non-consequentialist principles given a framework that makes non-consequentialism deeply unnatural, has given the AI safety space an ethos where the only admissible solution is getting better at fiddling with the values that are going into an AI, and then fiddling with those values. The AI is a rocket shooting towards expected utility maximization, and if it hits even slightly off, it’s all over—so let’s aim the rocket very precisely! Let’s make sure we can steer it! Let’s debate whether the rocket should land here or there. But if the threat we care about is the all-consuming boundary violations ... maybe we shouldn’t build a rocket? Maybe we should build something less totalizing. Maybe we shouldn’t try to reconfigure all matter in the universe—”tile the universe”—into some perfectly optimal end-state. Ideally, we’d just boost the capabilities of humans who engage with each other under a liberal, rule-of-law world order.
The point Yudkowsky fundamentally makes, after all, is that you shouldn’t “grab the poison banana”; that hardcore utility-maximization is very inconvenient if you want anything or anyone to survive—or if you think that there’s the slightest chance you’re wrong about your utility function.
There’s also a more cynical way to read the focus on values over constraints in AI discourse. If you build the machine-god, you will gain enormous power, and can reshape the world by your own lights. If you think you might be in the winning faction in the AI race, you want to downplay the role of constraints and boundaries, and instead draw the debate towards which exact set of values should be prioritized. This tilts your faction’s thinking a bit towards yours, while increasing the odds that your faction is not constrained. If you’re an AI lab CEO, how much better is it for you that people are debating what the AI should be set to maximizing, or whether it should be woke or unwoke, rather than how the AI should be constrained or how we should make sure it doesn’t break the social contract that keeps governments aligned or permanently entrench the ruling class in a way that stunts social & moral progress?
Of course, I believe that most who talk about AI values are voicing genuine moral concerns rather than playing power games, especially since the discourse really does lean towards the utilitarian over the liberal side, and since the people who (currently) most worry about extreme AI scenarios often have a consequentialist bent to their moral philosophy.
But even in the genuineness there’s a threat. Voicing support for simple, totalizing moral opinions is a status signal among the tech elite, perhaps derived from the bullet-biting literal-mindedness that is genuinely so useful in STEM. The e/accs loudly proclaim their willingness to obey the “thermodynamic will of the universe”. This demonstrates their loyalty to the cause (meme?) and their belief in techno-capital (and, conveniently, willingness to politically align with the rich VCs whose backing they seek). What about their transparent happiness for the techno-capital singularity to rip through the world and potentially create a world without anyone left to enjoy it? It’s worth it, because thermodynamics! Meanwhile, among the AI safety crowd, it’s a status signal to know all the arguments about coherence and consequentialism inside out, and to be willing to bite the bullets that these imply. But sometimes, that there is no short argument against something means not that it is correct, but rather that the counter-argument is subtle, or outside the paradigm.
The limits and lights of liberalism
I’m worried of leaving an impression that liberalism, boundaries, and niceness are an obvious and complete moral system, and utilitarianism is this weird thing that leads to paperclips. But like utilitarianism, liberalism breaks when taken to an extreme.
First, it’s hard to pin down what it means. Carlsmith writes:
[A]n even-remotely-sophisticated ethics of “boundaries” requires engaging with a ton of extremely gnarly and ambiguous stuff. When, exactly, does something become “someone’s”? Do wild animals, for example, have rights to their “territory”? See all of the philosophy of property for just a start on the problems. And aspirations to be “nice” to agents-with-different-values clearly need ways of balancing the preferences of different agents of this kind – e.g., maybe you don’t steal Clippy’s resources to make fun-onium; but can you tax the rich paperclippers to give resources to the multitudes of poor staple-maximizers? Indeed, remind me your story about the ethics of taxation in general?
Second, there are things we may want to ban or at least discourage, even if doing so means interventionism. Today we (mostly) think that a genocide warrants an international response, that you should catch the person jumping off the building to kill themself, and that various forms of paternalism are warranted, especially towards children. In the future, presumably there are things people could do that are bad enough by our lights that we think we can’t morally tolerate that in our universe.
Third, the moral implications of boundaries and rights change with technology. For example, today cryptocurrency might be helpful for people in unstable low-income countries, but tomorrow scheming AI systems might take advantage of the existence of a decentralized, uncontrollable payments rail to bribe, blackmail, and manipulate in ways we wish we hadn’t given them the affordances to do. Today we might be completely fine giving everyone huge amounts of compute that they have full control over, but tomorrow it might be possible to simulate digital minds and suddenly we want to make sure no one is torturing simulated people. Historically, too, it’s worth noting that while almost every place and time would’ve benefited from more liberalism and rule-of-law, sometimes it’s important that there are carefully-guarded mechanisms to tinker with those foundations. What set British institutions apart in the 18th and 19th centuries was not just that they were more laissez-faire and gave better rights than other countries of the age, but also that they were able to adapt in times of need. In the US, FDR’s reforms required heavy-handed intervention and the forcing through of major changes in government, but were (eventually) successful at dragging the country out of recession. Locking in inviolable rights is a type of lock-in too.
Fourthly, liberalism has a built-in tension between a decentralizing impulse and a centralizing impulse. The decentralization comes from the fact that the whole point is that everyone is free. The centralization comes from things like needing a government with a monopoly on force to protect the boundaries of people within it—otherwise it’s not a world of freedom, but of might-makes-right. A lot of our civilization’s progress in governance mechanisms comes from improving the Pareto frontier of tradeoffs we can make between centralization and decentralization (I heard this point from Richard Ngo, in a talk that is unpublished as far as I know):
Institution design (whether through politics or technology) is often about making the tradeoff between autocracy and anarchy risks less harsh.
For example, democracy allows for centralizing more power in a government (reducing anarchy risk) without losing people’s ability to occasionally fire the government (reducing autocracy risk). Zero-knowledge proofs and memoryless AI allow verification of treaties or contracts (important for coordinating to solve risks) without any need to hand over information to some authority (which would centralize).
However, the tension remains. We want inviolable rights for individuals, but we also need some way to enforce those rights. Unless technology can make that fully decentralized (or David Friedman is right), that means we need at least some locus of central control, but this locus of course becomes the target of every seeker of power and rent. And remember, we want everything to be adaptable if circumstances change—but only for good reasons.
On a more fundamental level, everything being permitted is not necessarily a stable state. If, in free competition, there exist patterns of behavior that are better at gaining power than others, then you have to accept that those patterns will have a disproportionate share of future value. This leads to two concerns. First, as pointed out by Carlsmith in his talk “Can goodness compete?”, maximizing for goodness and maximizing for power are different concerns, so the optimum for one is likely not the optimum for the other. So in fully free competition, in this very simplified model, we lose on goodness. Secondly, if everything being permitted leads to competition in which some gain power until they are in a position to violate others’ boundaries, then full liberalism would bring about its own demise. It’s not a stable state.
Fifth, there’s a neutrality to liberalism. We give everyone their freedoms, we make everyone equal before the law—but what do we actually do? Whatever you want, that’s the point! And what do you want? Well, that’s your problem—we liberals just came here to set up the infrastructure, like the plumber comes to install your pipes. It’s your job to decide what runs through them, and to use them well.
Now, utilitarianism suffers from this problem too. What is utility? It’s what you like/prefer. But what are those things? Presumably, you’re supposed to just look inside you. But for what? Do you really know? The most hardcore hedonic utilitarians will claim it’s obvious: pleasure minus pain. But this is of course overly reductive, unless either you are a very simple person (lucky you—but what about the rest of us?) or you take a very expansive view of what pleasure and pain are (but then you’re back to the problem of defining them in a sufficiently expansive, yet precise, way).
Putting aside the abstract moral philosophies: in yet another essay, Carlsmith mentions the “Johnny Appleseeds” of the world:
Whatever your agricultural aspirations, he’s here to give you apple trees, and not because apple trees lead to objectively better farms. Rather, apple trees are just, as it were, his thing.
There’s something charming about this image. At the end of the day you need something worth fighting for. Everyone who has actually inspired people has not preached just an abstract set of rules, but has also breathed life into an entire worldview and way of life. Pure liberalism does not tell you what the content of life should be, it only says that in the pursuit of that content you should not take away other people’s freedoms to pursue theirs. At times you need the abstraction-spinning philosophers and even-handed policemen and hair-splitting lawyers to come and argue for and enforce the boundaries. But they are infrastructure, not content. The farmer told by the philosopher to believe in liberalism, or shown by the police and the lawyers where their plot ends and their neighbor’s begins, does not gain the content that animates a life. You also need a Johnny Appleseed, whether externally or within you, who preaches something specific, something concrete: look at this apple, feel its texture, smell its scent—this is what you should grow.
Much as there’s a distinction between the pipes as infrastructure and the water in the pipes, there’s a distinction between the structure of our moral system (or, to use a computer science metaphor, its type), and what we actually care about. What people really fight for is sometimes the abstract principle, but almost always there is something real and concrete too. But what exactly this is, and how different people find theirs, is subtle and complicated. The strength of liberalism is that it doesn’t claim an answer (remember: there is virtue to narrowness). Whatever those real and concrete actually-animating things are, they seem hard to label and put in a box. So instead: don’t try to specify them, don’t try to dictate them top-down, just lay down the infrastructure thats let people discover and find them on their own. In a word: freedom.
Weaving the ropeSo how should we think about moral progress? A popular approach is to pick one tenet, such as freedom or utility, and declare it to be the One True Value. Everything else is good to the extent that it reflects that value, and bad to the extent it violates it. You can come up with very good stories of this form, because words are slippery and humans are very good at spinning persuasive stories—didn’t the Greeks get their start in philosophy with takes like “everything is water” or “everything is change”? If there are annoying edge cases—trolley problems for the utilitarians, anti-libertarian thought experiments for the liberalists—then you have to either swallow the bullets or try to dodge them as best as you can while. But all the while you gain the satisfaction that you’ve solved it, you’ve seen the light, the ambiguity is gone (or, at least in principle, could one day be gone if only we carry out some known process long enough)—because you know the One True Value.
But, as Isaiah Berlin wrote:
One belief, more than any other, is responsible for the slaughter of individuals on the altars of the great historical ideals [...]. This is the belief that somewhere, in the past or in the future, in divine revelation or in the mind of an individual thinker, in the pronouncements of history or science, or in the simple heart of an uncorrupted good man, there is a final solution.
An alternative image to the One True Value is making a rope. While it’s being plied, the threads of the rope are laid out on the table as the rope is being made, pointing in different directions, not yet plied into a single rope. The liberal idea of boundaries and rights is one thread, and it’ll go somewhere into the finished thing, as will the thread of utilitarian welfare, and others as well. We’ve knit parts of this rope already; witness thinkers from the Enlightenment philosophers onwards making lots of progress plying together freedom and welfare and fairness and other things, in ways that (say) medieval Europeans probably couldn’t have imagined. How do we continue making progress? Rather than declaring one thread the correct one for all time, I think the good path forwards looks more like continuing to weave together different threads and moral intuitions. I expect where we end up to have at least some depend on the path we take. I expect progress to be halting and uncertain; I expect sometimes we might need to unweave the last few turns. I expect eventually where we end up might look weird to us, much as modernity would look weird to medieval people. But I probably trust this process, at least if it’s protected from censorship, extremism, and non-human influences. I trust it more than I trust the results of a thought experiment, or any one-pager on the correct ideology. I certainly trust it far more than I trust the results of a committee of philosophers, or AI lab ethics board, or inhuman AI, to figure it out and then lock it in forever.
I hope you share the intuition above, and that the intuition alone is compelling. But can we support it? Is there a deeper reason why should we expect morality to work like that, and be hard to synthesize into exactly one shape?
The first reason to think this is that, empirically, human values just seem to be subtle and nuanced. Does the nature of the good seem simple to you? Could you design a utopia you felt comfortable locking in for all eternity? Hell is simple and cruel, while heaven is devious and subtle.
A second reason is that over-optimization is treacherous. This is the same instinct that leads Yudkowsky to think the totalizing AIs will rip apart all of us, and that leads Carlsmith to spend so many words poking at whether there’s something nicer and less-treacherous if only we’re more liberal, more “green” (in his particular sense of the word), nicer. Paul Christiano, one of the most prominent thinkers in AI alignment and an inventor of RLHF, centers his depiction of AI risk on “getting what you measure” all too well. But why? Goodhart’s law is one idea: over-optimize any measure, and the measure stops being a good measure. Another way to put this is that even if two things are correlated across a normal range of values, they decorrelate—the “tails come apart”—at extreme values, as LessWrong user Thrasymachus (presumably a different person from the ancient philosopher) observed and Scott Alexander expanded on. In the case of morality, for example:
(Image taken from Scott Alexander here)
Within the range of normal events, moral theories align on what is good. When it comes to abnormal events—especially abnormal events derived from taking one moral theory’s view of what is good and cranking everything to eleven—the theories disagree. The Catholic Church and the hedonic utilitarians both want to help the poor, but the utilitarians are significantly less enthusiastic about various Catholic limits on sexuality, and the Catholics are significantly less enthusiastic about everything between here and Andromeda converting to an orgiastic mass of brain tissue.
So if our choice is about which ideal we let optimize the world, and if that optimization is strong enough, we run a real risk of causing unspeakable horrors by the lights of other ideals regardless of which ideal we pick:
The red/blue path represents where we go if we steer entirely by moral theory A/B.
You might think what we want, then, is compromise. A middle path:
And yes, I expect a compromise would be better. But I think the above is reductive and simplistic. Whatever Theory A & Theory B are, they’re both likely static, contingent paradigms. Any good progress probably looks more like striking a balance than picking just one view of the good to maximize, but the most significant progress likely reframes the axes:
Different times and places argued about different axes: salvation versus worldliness, honor versus piety, or duties to family versus duties to the state. Liberalism versus utilitarianism, the framing of this post, is a modern axis. Or, to take another important modern axis: economic efficiency versus fairness of distribution. You need modern economics to understand what these things really are or why they might be at odds, it’s contingent on facts about the world including the distribution of economically-relevant talent in the population, and their importance rests on modern ethical ideas about the responsibilities of society towards the poor.
As mentioned before, to the extent that liberalism has any unique claim, it’s that it’s a good meta-level value system. Live and let live is an excellent rule, if you want to foster the continued weaving of the moral thread, including its non-liberal elements. It lets the process continue, and it lets the new ideas and new paradigms come to light and prove their worth. It seems far less likely to sweep away the goodness and humanity of the world than any totalizing optimization process aiming for a fixed vision of the good.
Isaiah Berlin, again:
For every rationalist metaphysician, from Plato to the last disciples of Hegel or Marx, this abandonment of the notion of a final harmony in which all riddles are solved, all contradictions reconciled, is a piece of crude empiricism, abdication before brute facts, intolerable bankruptcy of reason before things as they are, failure to explain and to justify [...]. But [...] the ordinary resources of empirical observation and ordinary human knowledge [...] give us no warrant for supposing [...] that all good things, or all bad things for that matter, are reconcilable with each other. The world that we encounter in ordinary experience is one in which we are faced with choices between ends equally ultimate, and claims equally absolute, the realization of some of which must inevitably involve the sacrifice of others. Indeed, it is because this is their situation that men place such immense value upon the freedom to choose; for if they had assurance that in some perfect state, realizable by men on earth, no ends pursued by them would ever be in conflict, the necessity and agony of choice would disappear, and with it the central importance of the freedom to choose. Any method of bringing this final state nearer would then seem fully justified, no matter how much freedom were sacrificed to forward its advance.
Luxury space communism or annihilation?There’s a common refrain that goes something like this: over time, we develop increasingly powerful technology. We should aim to become wiser about using technology faster than the technology gives us the power to wreak horrors. The grand arc of human history is a race between available destructive power and the amount of governance we have. Competition, pluralism, and a big dose of anarchy might’ve been a fine and productive way for the world to be in the 1700s, but after the invention of either the Gattling gun, the atomic bomb, or AGI, we’ll have to grow up and figure out something more controlled. The moral arc of the universe might bend slowly, but it bends towards either luxury space communism or annihilation.
For an extended discussion of this, see Michael Nielsen’s essay on How to be a wise optimist about science and technology (note that while Nielsen is sympathetic to the above argument, he does not explicitly endorse it). In graph form (taken from the same essay):
Often, this then leads to calls for centralization, from Oppenheimer advocating for world government in response to the atomic bomb, to Nick Bostrom’s proposal that comprehensive surveillance and totalitarian world government might be required to prevent existential risk from destructive future technologies.
A core motivation of this line of reasoning is that technology can only ramp up the power level, the variance in outcomes, the destruction. This has generally been a net effect of technological progress. But there are other consequences of technology as well.
Most obviously, there are technologies that tamp down on the negative shocks we’re subject to. Vaccines are perhaps the most important example of technology against natural threats. But the things in the world that change quickly are the human-made ones, so what matters is whether we can build technologies that can tamp negative shocks from human actions (vaccines work against engineered pandemics as well!). If all we have for defense is a stone fence, this isn’t too bad if the most powerful offense was a stone axe. But now that we’ve built nuclear weapons, can we build defensive fences strong enough?
Also: there are technologies that aren’t about offense/defense balance that make it easier or harder to enforce boundaries. Now we have cryptography that lets us keep information private, but what if in the future we get quantum computers that break existing cryptography? In the past it was hard to censor, but now LLMs make censorship trivially cheap. And consider too that what matters is not just capabilities, but propensity. Modern nuclear states could kill millions of people easily, but they’re prevented from doing so by mutually assured destruction, international norms, and the fact that nuclear states benefit from the world having lots of people in it engaging in positive-sum trade. Offense/defense balance is not the only axis.
Differential acceleration of defensive technology is something that Vitalik Buterin has argued for as part of his “d/acc” thesis. However, for reasons like the above, he deliberately keeps the “d” ambiguous: “defense”, yes, but also “decentralization” and “democracy”.
Is d/acc just a cluster of all the good things? With the liberalism v utilitarianism axis, I think we can restate an argument for d/acc like this: a lot of technology can be used well for utilitarian ends, in the sense that it gives us power over nature and the ability to shape the world to our desires, or to maximize whatever number we’ve set as our goal: make the atoms dance how we want. But this also increases raw power and capabilities, and importantly creates the power for some people—or AIs—to make some atoms you care about dance in a way you don’t like.
Now, if you think that unlimited raw power is coming down the technological pipeline, say in the form of self-improving AIs that might build a Dyson sphere around the sun by 2040, your bottleneck for human flourishing is not going to be raw power or wealth available, or raw ability to shape the world when it’s useful, or raw intelligence. It’s likely to be much more about enforcing boundaries, containing that raw power, keeping individual humans empowered and safe next to the supernova of that technological power.
So what we also need are technologies of liberalism, that help maintain different spheres of freedom, even as technologies of utilitarianism increase the control and power that actors have to achieve their chosen ends.
Technologies of liberalismDefense, obviously, is one approach: tools that let you build walls that keep in things you don’t want to share, or lock out things you don’t like.
Often what’s important is selective permeability. At the most basic level, locks and keys (in existence since at least ancient Assyria 6000 years ago) let you restrict access to a place. Today, the ability to pass information over public channels without revealing it is at least as fundamental to our world as physical locks and keys. Also note how unexpected public key cryptography—where you don’t need to first share a secret with the counterparty—is. Diffie-Helman key exchange is one of the cleverest and most useful ideas humans have ever had.
Encryption is of enormous significance for the politics of freedom. David Friedman has argued (since the 1990s!):
In the centuries since [the passing of the Second Amendment], both the relative size of the military and the gap between military and civilian weapons have greatly increased, making that solution less workable [for giving citizens a last resort against despotic government]. In my view, at least, conflicts between our government and its citizens in the immediate future will depend more on control of information than control of weapons, making unregulated encryption [...] the modern equivalent of the 2nd Amendment.
Ivan Vendrov, in a poetic essay on the building blocks of 21st century political structure, points to cryptography (exemplified by the hash function) as the one thing standing against the centralizing information flows that incentivizes, since it lets us cheaply erect barriers that have astronomical asymmetry in favor of defense. Encryption that takes almost no time at all can be made so strong it would take hundreds of years on a supercomputer to break. In our brute physical world, destruction is easy and defensive barriers are hard. But the more of our world becomes virtual, the more of our society exists in a realm where defense is vastly favored. Of the gifts to liberty that this universe contains that will most bear fruit in our future, encryption ranks along the Hayekian nature of knowledge and the finite speed of light.
So far we are far from realizing the theoretical security properties of virtual space; reliable AI coding agents and formal verification will help. And virtual space will continue being embedded in meatspace.
In meatspace, the biggest vulnerability humans have are infectious diseases. The state of humanity’s readiness against pandemics (natural or engineered) is pathetic, and in expectations millions and more will pay with their lives for this. We need technology that lets us prevent airborne diseases, to prevent an increasing number of actors—already every major government, but soon plausibly any competent terrorist cell—wielding a veto on our lives. UV-C lighting and rapid at-home vaccine manufacturing are good starts. Also, unless we curtail this risk through broad-spectrum defensive technology, at some point it will be used as an excuse for massive centralization of power to crush the risk before it happens.
Besides literal defenses, there’s decentralization: self-sufficiency gives independence as well as safety from coercion. At a simple level, solar panels and home batteries can help with energy independence. 3D printers make it possible to manufacture more locally and more under your control. Open-source software makes it possible for you to tinker with and improve the programs you run. AI code-gen makes this even stronger: eventually, anyone will be able to build their own tech stack, rather than being funneled along the grooves cut by big tech companies—which, remember, they adversarially optimize against you to make you lose as much time out of your life as possible. The printing press and the internet both made it cheaper to distribute writing of your choice, and hence push your own ideas and create an intellectual sphere of your own without wealth or institutional buy in.
Fundamentally, however, the thing you want to do is avoid coercion, rather than maximize independence. No man is an island, they say, and that might not have been intended as a challenge. Coercion is hard if the world is very decentralized, to the point that no one has a much larger power base than anyone else. However, such egalitarianism is unlikely since everyone has different resources and talents. Another key feature of the modern world that helps reduce coercion, in addition to economic and military power being decentralized, is that the world is covered in a thick web of mutual dependence thanks to trade. If you depend on someone, you are also unlikely to be too coercive towards them, as Kipling already understood. Self-sufficiency and webs of mutual dependencies are, of course, in tension. I don’t pretend there’s a simple way to pick between them, or know how much of each to have—I never said this would be simple!
AI, however, is a challenge to any type of decentralization: it needs lots of expensive centralized infrastructure, and currently the focus is on huge multi-billion dollar pre-training runs. However, even here there’s a story for something more decentralized to take center stage. First, zero-trust privacy infrastructure for private AI inference & training—salvation by cryptography once again!—can give you verifiable control and privacy over your own AI, even if it runs in public clouds for cost reasons. Secondly, as Luke and I have argued, the data needs to have AI diffuse through the economy run into Hayekian constraints that privilege the tacit, process, and local knowledge of the people currently doing the jobs AI seeks to replace—as long as the big labs and their data collection efforts don’t succeed too hard and too fast. Combined with this, cheap bespoke post-training offers a Cambrian explosion of model variety as an alternative to the cultural & value lock-in to the tastes of a few labs. Achieving this liberal vision of the post-AGI economy is what we’re working on at Workshop Labs.
Then there’s technology that helps improve institutions—democratization, but not “democracy” as in just “one person one vote”, but in the more fundamental sense of “people” (the demos) having “power” (kratos). We’ve already mentioned zero-knowledge proofs, memoryless AI, and other primitives for contracts. Blockchains, though much hyped, deserve some space on this list, since they let you get hard-to-forge consensus about some data without a central authority (needing consensus about a list is a surprisingly common reason to have a central authority!). More ambitiously, Seb Krier points out AI could enable “Coasean bargaining” (i.e. bargaining with transaction costs so low Coase’s theorem actually holds)—bespoke deals, mediated by AIs acting on behalf of humans, for all sorts of issues that currently get ignored or stuck in horrible equilibria because negotiating requires expensive human time. Another way to increase the amount of democracy when human time is bottlenecked is AI representatives for people (though these would have to be very high-fidelity representatives to represent someone’s values faithfully—values are thick, not thin, and AI design should take this into account).
Finally, there are technologies shape culture and make us wiser, or let society be more pluralistic. Many information technologies, like the internet or AI chatbots, obviously fall in this category and enable great things. However, information technology, in addition to giving us information and helping us coordinate, often also changes the incentives for which memes (in the original Dawkins sense) spread, and therefore often have vast second-order consequences. On net I expect information technology to be good, but with far more requirements for cultural shifts to properly adapt to them, and far more propensity than other technologies to shift culture or politics in unexpected directions.
(Note that the pillars above are very similar to the those of the anti-intelligence-curse strategy Luke & I outlined here.)
Historically, you can trace the ebb and flow of the plight of the average person by how decentralizing or centralizing the technology most essential for national power is, and how much that technology creates mutual dependencies that make it hard for the elite to defect against the masses. MacInnes et al.’s excellent paper Anarchy as Architect traces this back throughout history. Bronze weapons, which required rare bronze, meant only a small military elite was relevant in battle, and could easily rule over the rest. The introduction of iron weapons, which could be produced en-massed, lead to less-centralized societies, as the rulers needed large groups to fight for them. Mounted armored knights centralized power again, before mass-produced firearms made large armies important again. The institutions of modern Western democracy were mostly built during the firearm-dominated era of mass warfare. They were also built at a time when national power relied more and more on broad-based economic and technological progress where a rich and skilled citizenry was very valuable. This rich and skilled citizenry was also increasingly socially mobile and free of class boundaries, making it harder for the upper class to coordinate around its interests since it wasn’t demarcated by a sharp line. People also had unprecedented ability to communicate with each other in order to organize and argue thanks to new technology (and, of course, because the government now had to teach them to read—before around 1900, literacy was not near-universal even in rich countries). These nations then increasingly dissolved trade barriers between them and became interdependent—very purposefully and consciously, in the case of post-war Europe—which reduced nation-on-nation violence and coercion.
How will this story continue? It will definitely change—as it should. We should not assume that the future socio-structural recipe for liberalism is necessarily the one it is now. But hopefully it changes in ways where it still adds up to freedom for an increasing number.
Build for freedomDifferential technological progress is obviously possible. For example, Luke argues against what he calls “Technocalvinism“, the belief that the “tech tree” is deterministic and everything about it is preordained, using examples ranging from differential progress in solar panels to the feasibility of llama-drawn carts. The course of the technology is like a new river running down across the landscape. It has a lot of momentum, but it can be diverted from one path to another, sometimes by just digging a small ditch.
There are three factors that help, the first two in general, and the last specifically for technologies of liberalism:
- There is a lot of path dependence in the development of technologies, and effects like Wright’s law mean that things can snowball: if you manage to produce a bit, it gets cheaper and it’s better to produce a larger amount, and then an enormous amount. This is how solar panels went from expensive things stuck on satellites to on track to dominate the global energy supply. Tech, once it exists, is also often sticky and hard to roll back. A trickle you start can become a stream. (And whether or not you can change the final destination, you can definitely change the order, and that’s often what you need to avert risks. All rivers end in the ocean, but they take different paths, and the path matters a lot for whether people get flooded.)
- The choice of technology that a society builds towards is a deeply cultural, hyperstitious phenomenon. This is argued, for example, by Byrne Hobart and Tobias Huber’s Boom: Bubbles and the End of Stagnation (you can find an excellent review here, which occasionally even mentions the book in question). The Apollo program, the Manhattan project, or the computer revolution were all deeply idealistic and contingent projects that were unrealistic when conceived—yes, even the computer revolution, hard as it is to remember today. Those who preach the inevitability of centralization and control, whether under the machine-god or otherwise, are not neutral bullet-biters, but accidentally or intentionally hyperstitioning that very vision into life.
- Demand is high. People want to be safe! People want control! People want power!
So how should we steer the river of technology?
If the AGI labs’ visions of AGI is achieved, by default it will disempower the vast majority of humanity instantly, depriving them of their power, leverage, and income as they’re outcompeted by AIs, while wrecking the incentives that have made power increasingly care about people over the past few centuries, and elevating a select few political & lab leaders to positions of near godhood. It might be the most radical swing in the balance of power away from humanity at large and towards a small elite in human history. And the default solution for this in modern AI discourse is to accept the inevitability of the machine-god dictatorship, accept the lock-in of our particular society and values, and argue instead about who exactly stands at its helm and which exact utility function is enshrined as humanity’s goal from here on out.
All this is happening while liberalism is already in cultural, political, and geopolitical retreat. If its flame isn’t fanned by the ideological conviction of those in power or the drift of geopolitics, its survival increasingly depends on other sources.
So: to make an impact, don’t build to maximize intelligence or power, because this will not be the constraint on a good future. This is a weird position to be in! Historically, we did lack force and brains, and had to claw ourselves out of a deep abyss of feebleness to our current world of engines, nuclear power, computers, and coding AIs. Even now, if radical AI was not an event horizon towards which we were barreling towards that was poised to massively accelerate technology and reshape the position of humans, more power would be a very reasonable thing to aim towards—we would greatly benefit from energy abundance thanks to fusion, and robotics beyond our dreams, and space colonization. (And if that radical AI looks majorly delayed, or gets banned, or technology in general grinds to a halt, then making progress towards tools of utilitarianism and power would once again be top-of-mind.) But if radical tech acceleration through AI is on the horizon, we might really be on track for literally dismantling the solar system within the few decades. The thing we might lack is the tools to secure our freedoms and rights through the upheaval that brings. Technologists, especially in the Bay Area, should rekindle the belief in freedom that guided their predecessors.
For decades, the world has mostly been at peace, liberty has spread, and the incentives of power have been aligned with the prosperity of the majority. History seemed over. All we had to do was optimize our control and power over nature to allow ever greater and greater welfare.
But now, history is back. Freedom is on trial. The technology we choose to build could tip the scales. What will you build?
Thank you to Luke Drago & Elsie Jang for reviewing drafts of this post.
Discuss
Axiological Stopsigns
Epistemic Status: I wrote the bones of this on August 1st, 2022. I re-read and edited it and added an (unnecessary?) section or three at the end very recently. Possibly useful as a reference. Funny to pair with "semantic stopsigns" (which are an old piece of LW jargon that people rarely use these days).
You might be able to get the idea just from the title of the post <3
I'll say this fast, and then offer extended examples, and then go on at length with pointers into deep literatures which I have not read completely because my life is finite. If that sounds valuable, keep reading. If not, not <3
The word "value" is a VERB, and no verb should be performed forever with all of your mind, in this finite world, full of finite beings, with finite brains, running on a finite amounts of energy.
However, if any verb was tempting to try to perform infinitely, "valuing" is a good candidate!
The problem is that if time runs out a billion years into the future, and you want to be VNM rational about this timescale, you need to link the choice this morning of what to have for breakfast into a decision tree whose leaf nodes, billions years in the future, are DIFFERENT based on your breakfast decision.
This would be intractable to calculate for real, so quick cheap "good enough" proxies for the shape of good or bad breakfasts are necessary. This has implications for practical planning algorithms, which has implications for the subjective psychology of preferences.
In brief, many near future outcomes from any given action are reasonable candidates for a place to safely stop the predictive rollout, analyze things locally based on their merits and potentials, and then stop worrying about the larger consequences of the action.
There is a valid and useful temptation to stop the predictive rollout at that point and declare the action good (worthy of performance) or bad (unworthy of performance) on that basis. However: it is always potentially valid to take the predictive roll-out further than that, so long as the calculating process is still essentially valid.
There is a sense in which this approach to axiology (ie "the study of value") makes the idea of "ultimate values" ultimately meaningless? You can get along in life quite well stepping from goal to goal to goal, with no big overall linkage to a final state of the entire universe that you endorse directly. Indeed, it may be cognitively impossible to have "pragmatically real and truly ultimate values" for reasons of physics and computational tractability.
That's kinda it. I've kinda said everything there is to say. Keep reading to hear it again, slower and with more examples.
Whenever you notice an "axiological stopsigns", you can probably stop there if you don't have time to think more, or you can do a "California stop" and slow down and roll on through, and imagine subsequent steps and think about their value as well!
Sometimes in poker, looking at possible second steps that might occur, and figuring out how the choices you face could interact with them can be useful mentally track and this is sometimes called "counting your outs".
If you rolled past an axiological stopsign and have accurately counted your real outs, their real value, and their real likelihood, pulling your mind back to the choice right in front of you might lead to a different value estimate than you initially had.
This is good. If that never happens then counting your outs would be pointless, and a waste of your valuable thinking time.
Outside of the context of poker, where things are discrete and simple and tidy, if you keep imagining things happening after a given goal or tragedy, and take such variations and additional outcomes into account as well, that probably also won't make the decision worse, so long as the extra imagining itself isn't particularly costly... but it could improve the way you navigate your way through a tragedy or through a success in subtle ways that unpack into large wins much later, because you saw farther, and set them up in advance... but it is hard to see very very far into the future.
That's it, that's the idea. But now we have added a metaphor for ignoring the axiological stopsign, and applied the concept to the toy example of poker.
Some readers, with very valuable time, will be able to stop reading right here, the rest of what I'll write mostly involves MORE explanation, application, and speculation that extends the pragmatics of the basic idea.
Where is it useful to put up an axiological stopsign?
The best places are points in imagined futures that are relatively stable and relatively amenable to having their value calculated. The stability of a situation makes that situation into a "natural endpoint" for a planning process to (temporarily?) treat as a far future terminal goal.
There's a temptation here to think of the mathematics of stability in terms of activation energies, or maybe actively maintained homeostatic circumstances (like a thermostat keeping the house toasty in the winter, or one's liver managing one's blood sugar at a good level)... and you might say that the pancreas has an enduring intrinsic preference for keeping the blood sugar in a certain range or the thermostat values the house being neither too hot nor too cold... but there's a pre-condition for even performing a stability analysis like this, which is that there is even any specific situation at all, for the values to be about... the thermostat doesn't care about the outside of the house... your pancreas doesn't care about other people's blood sugar... most practical values are about something specifically practical that has a defined time and place inside of boundaries that segment reality in a way that makes planning more tractable.
If there was a town in Antarctica, the empty unchanging tundra around the town would be a good candidate for a boundary line around the town, that helps define the town as a situation whose value count be estimated.
A boundary line that bisected the town would be less useful.
For example, suppose there was one murderer in the town overall. If you decide that only half the town is "worthy of consideration as a potentially stable situation" and the half you choose to measure/model/ponder lacks the murderer, then you'll probably fail to imagine a future murder happening in the half of the town you chose to look at, even though such a thing would be likely, because the murderer is likely to rove around without regard for your imaginary boundary.
A similar analysis could be done for time. If there were moments in time with few people moving around or changing things (like when everyone is asleep?) then that would be a good candidate for good place to put a time boundary to analyze "a situation" as a coherent chunk.
The ancient Greeks sometimes said: "Call no man happy until he's dead."
Here I suggest: "Call no day good until bedtime."
"Call no place safe until mapped and cleared to the edges of adversarial traversability" doesn't quite roll off the tongue in the same way, but it has the seeds of the same idea.
We could just stop now. The next section is about chess. You might not even need to read it! Maybe just jump over this section? Or not. Notice how doing the essay in la carte sections like this "shows" even as it "tells"! <3
(There's no section titles to give a table of contents in advance on purpose. That's often just how life is. One thing after another, with you having to define structure for yourself and then judge pieces of the structure independently.)
Stopsigns shouldn't be read as "dead end" signs.
If your thinking stops permanently at axiological stop signs, as if they were metaphysically absolute, then your thinking is limited, and you are some kind of fool.
(I mean... of course you can choose to be a fool if you want and its no skin off my nose, but I personally try to avoid it for myself.)
You might manage to be a clever fool, who plays the game pretty well indeed, but your growth in skill can be ultimately be bounded by the axiological stopsigns that you cannot think beyond.
This is not absolutely hard and fast and there is play in the details.
In general, precisely calibrated values about nearmode things can often substitute for the ability to see deeply into the future, but the place these values properly come from, a lot of the times, is just: from having looked at variation in the the long term consequences of different ways the near future could go, that are correlated with the "sense of value" about about the near future.
A nice property of precomputed FIXED values (or carefully calibrated methods for quickly computing values in certain situations) is that you can pre-compute in moments when things are quiet, so that these values can be used quickly during an emergency.
"In case of emergency, apply valuation methods fast and hard".
A useful analogy might be to think of chess playing algorithms, that make much use of a "board evaluation function". Such functions often consider where the pieces are, as well as the total material still on the board.
Lasker, in 1947, suggests 3.5 is right for Knights and Bishops both, puts the Queen at only 8.5, and treats pawn's differently based on their centrality (only 0.5 for the two on the edge and 1.5 for the two in the middle).
Kasparov, in 1986, suggests thinking of Knights worth 3, Bishops worth 3.15, and the Queen as 9 points.
AlphaZero estimated in 2020 that Knights were 3.05, Bishops were 3.33, and the Queen was 9.5.
Another useful component in a board evaluation function in chess is often to look at the mobility of pieces on the board. Knights at the center of an empty board can jump to 8 locations, but in the absolute corner can access at most 2 locations. Tallying up such optionality can be very fast and cheap, and this is especially useful when you've rolled out a planning tree N steps into the future and have looked at B branches coming out of each possible move, to analyze the goodness of B^N places where an axiological stopsign exists.
Once you do that calculation, the B^(N-1) board positions are approximately as valuable as each one's B outs would imply. (If more than B moves are technically possible but you only considered B moves at each point for reasons of intellectual parsimony then you're not being comprehensive and might miss weird and extremely valuable lines of play.)
The B^(N-2) board positions are probably better estimates because they take deeper search into account, and so on.
An extremely reasonable thing to do is to follow hunches about what "the best move" would be a couple steps out, and then branch off of situations closer to the present more in order to be especially comprehensive about moves you're close to taking and won't be able to undo. One of the algorithms that tries to explore optimally (with annealing temperatures that vary based on contextual factors) is the parallel terraced scan.
It is very reasonable to experience subjective fluctuations in the estimated value of events as the events get closer and closer to being actualized, and get more and more evaluative attention.
(They say that playing speed chess too much will ruin your chess game. A plausible mechanistic hypothesis, then, would be that speed chess trains some low level part of your brain stop-and-instantly-act at certain axiological stopsigns that are useful in a game constrained mostly by thinking time, and this "tendency to stop" might bias your thinking in deep ways once the time constraints go away.)
A simplified model of human planning is that human reflexes always perform the pre-computed best fast action, conditioning on urgency and uncertainty, with the pre-computation having occurred during evolution and/or arisen under an adaptively normal growth and childhood, with many reflexive behaviors mostly suppressed by default, so that the reflexive actions are ready to go, but occur in the absence of a choice to suppress reflexive action.
So in some sense, it is useful for your impulses to focus on rare, crazy, events and choices related to extreme values that have to be decided quickly.
Most of the frontal cortex in humans has the job of suppressing action, the search term you want, if you want to study this, is inhibitory control. which can be unpacked into papers attempting to "elucidate" the neurological details on the mechanical implementation of inhibitory control.
If you just think about it from first principles, you'll notice that impulsive valuations should be relatively precise (have the value they really have, especially for extremal elements so you get the max(action) separated from the other options very often) while inhibitory valuations should have solid recall.
In case your statistics are fuzzy, recall is true_positives / (false_negatives + true_positives)... that is to say... to measure recall for real you have to have a second much much higher quality classifier that can tell every time that a false_negative (on the sloppy classifier) should have been a true_positive (on the sloppy classifier)...
You want to look at all the actual_positives, and inhibit all but the best of them, and certainly never fail to inhibit actions that are much lower than the average value of actions that are already being inhibited.
Compare Babble and Prune. Babble generates ideas that come packaged up with axiological stopsigns that make it possible to even estimate the value of an idea coherently. Prune is doing inhibitory control based on those value estimates.
I composed the first draft of this essay in August 1, 2022 and inhibited publication of it, but later, looking at all my drafts, of all the essays I could finish, it seemed like maybe it was worth cleaning this one up, and publishing here and now in 2026.
This would be the place I would have stopped even bothering to write the essay if I had published it earlier. Imagine this essay stopped here. Should it have? Or is the rest worth having written?
Something that is missing from the material above is the concept of "game temperature" in combinatorial game theory.
I will simply quote from the 2026 version of wikipedia to explain "hot" and "cold" games, and the way gamestate can interact with one's sense of time and urgency, in a relatively rigorous toy model (bold and italics not in original)...
For example, consider a game in which players alternately remove tokens of their own color from a table, the Blue player removing only blue tokens and the Red player removing only red tokens, with the winner being the last player to remove a token. Obviously, victory will go to the player who starts off with more tokens, or to the second player if the number of red and blue tokens are equal. Removing a token of one's own color leaves the position slightly worse for the player who made the move, since that player now has fewer tokens on the table. Thus each token represents a "cold" component of the game.
Now consider a special purple token bearing the number "100", which may be removed by either player, who then replaces the purple token with 100 tokens of their own color. (In the notation of Conway, the purple token is the game {100|−100}.) The purple token is a "hot" component, because it is highly advantageous to be the player who removes the purple token. Indeed, if there are any purple tokens on the table, players will prefer to remove them first, leaving the red or blue tokens for last. In general, a player will always prefer to move in a hot game rather than a cold game, because moving in a hot game improves their position, while moving in a cold game injures their position.
In general, you will need to be able to execute actions based on pre-computed or easily-computed evaluation functions (and the liberal use of axiological stopsigns to make rollouts smaller and more tractable) precisely when time is limited and moves are urgent.
That is to say, it makes sense to have "values" about "hot" situations, ready to go, as preparation for handling these "hot" situations in relatively more skilled ways.
It won't always be a turned based game, and the moves might not always be helpfully labeled with their actual literal "combinatorial game theoretic temperature".
Compare and contrast the beginning of the essay, where the subject of boundaries came up, where big empty spaces that were hard to traverse made nice natural spatial boundaries and moments of relative calm (like "when most people are sleep") made nice natural temporal boundaries.
There is a game called Baduk in Korea, Weichi in China, and Go in Japanese and English and the players of this game, especially in Japan, invented a lot of technical language for getting good at the game including terms like sente, gote, miai, and aji that relate quite strongly to these ideas of "game (and subgame) temperature".
Often, when a player plays in one part of the board, the move locally raises the temperature (in the game theoretic sense) and forces the other player's best move to be a local response, but then the best response to that might also be local, and so on, until the precarious local position finally stabilizes and someone moves away.
Playing away is a sign, to an outsider of low skill who can't understand what happened, that the skilled players estimated that the local play in that sequence was "a situation with an axiological stopsign at the end", which make the full sequence of moves a good candidate for being considered something that can be valued independently in a single motion.
Human go players of high skill do, in fact, try to estimate moves in terms of their "value" (in gained or lost territory) but also, they try to predict the full sequence of moves until someone plays away.
If they predict that the other player will locally respond with a move or moves that does not force them to make a third or fifth or seventh move in response but lets the initiator play elsewhere instead... then that whole sequence leaves the first player with "initiative" to play elsewhere again...
That is, the line of play that leaves you with wide latitude at the end of the action "ends in sente". It "ends with initiative retained".
Sente is worth points because it lets you choose the next place on the board to have a another small local flurry of moves once the temperature falls back to the baseline for the game (which, in general, goes down over time).
Here is a nice simple essay on calculating the value of having sente based on board positions.
Miai is a special term for when a strategic necessity can occur in either of two ways. The strategic purpose is almost certain to be achieved...
...but if either of the ways of accomplishing the strategic certainty are attacked it raises the temperature there, in a way that is under the control of the attacker (maybe putting you one mishandled move away from losing a strategic connection that might be worth 30 points) and forces a local response based on a temperature that is based on the totality if what was strategically at risk.
Miai have a very simple and legible kind of aji.
Aji is a bigger idea, it is a fuzzy aesthetic loose term, that literally means "taste" (and supposedly "de gustibus non est disputandum" ("in matters of taste there is no valid dispute")) but having a sense of good and bad aji turns out to be essential to getting stronger at go.
Aji can be "good" or "bad". Bad aji has defects. To fix bad aji at the end of a sequence of play, one often has to lose sente, restoring good aji for the sake of being able to ignore that part of the board for a while.
Bad aji is a likely location of a future "situation where urgent action will be required" and if you are reading a position out in your head, and see bad aji arising in a potential future, the bad aji can EITHER be treated as "a situation that would need more thought (because maybe it all turns out for the best for locally weird or unique reasons)" OR as "a badness, in itself (to be avoided in play and in the mind)".
Cognitively speaking, good aji is a boundary that serves you, and functions as a sign that you've reached an axiological stopsign (if you're trying to read a position out and wondering when you can stop). In some sense, by playing with good aji you are filling the go board with more axiological stopsigns that serve you, and by playing with bad aji you are reducing the number of axiological stopsigns and you'll have to think about everything all the time. (And mostly you'll be thinking about paying debts and fixing disasters.)
Chess and go and poker have relatively defined endings. Does everything?
At a certain point in a conversation about "rolling out a value estimate" by adding more nodes in a decision tree, that model the farther and farther future, after more and more contingencies might have occurred, someone might want to point out that time might be infinite (or effectively so for us).
Certain notions of philosophically ultimate axiology that focuses very intently on the final state of every sequence of behaviors (in game theory the total rollout for an entire game is sometimes called the "total strategy" of the game (and partial strategies are harder for people to reason a lot of times))...
But what if there is literally no end to "the game" that is our physical universe?
If there is no end to the game then maybe the universe lacks a "literally ultimate" value?
On twitter onetime I ran a poll where I felt like both answers were kinda scary if you think about it long and hard...
I ran this poll a full year after I started writing this essay, so this essay's drafting probably had something to do with me even deciding to run the poll? But if I had published right away then the poll wouldn't be included in the essay!
The idea of "enjoying the journey and not worrying about the destination" has a LOT of appeal in the popular zeitgeist.
It is the standard normie response to many axiological puzzles related to utilitarianism, hedonism, deontology, and so on. And sometimes the normies are right.
However, I suspect that if one wanted to, it wouldn't be hard to point to concepts like "aji" from go, and how enjoyable it is to have slack, such that one could frame "good vibes from hour to hour, and day to day, and month to month" as being strongly related to an orientation towards time that is very mindful of time, and careful to not create emergencies where making lots of snap valuations are essential for hoping to survive or thrive after a chaotically urgent situation.
Just to say: this essay is only going to get weirder and more speculative as it goes. You could totally stop reading here if you want.
I've been an immortalist longer than I've been a transhumanist. I think life is awesome, and with a little work and a little luck it is mostly full of delightful moments, and learning, and nice surprises, and delicious meals, and neat games to play with nice people, and worthwhile puzzles, and fun things to think about or do.
I became an immortalist before becoming a transhumanist... when I read some vampire novels as a young teenager and was annoyed at how the vampires got all these cool powers, and all this time to do all these fun things, and instead they just moped and wallowed in ennui. Fuck that noise! Just don't violate ethics when you meet your weird new dietary requirements, and... enjoy life! Its not hard, right? Right???
(Later I learned that it might be hard for some people. Apparently happiness set points are a thing, and they might be largely genetic, and mine might be high? The things I've heard that make durable changes to someone's "apparent set point" (that presumably aren't strongly caused by genetics themselves) that include (1) adopt gratitude practices to push the set point up, (2) don't marry a neurotic, (3) if you're a woman in an unhappy marriage getting divorced will often make you poorer and also happier, (4) if you're a man then losing a job will hammer your happiness for a long time, and (5) in general, don't let your children die.)
Contary to the set point idea... when they do studies to sample momentary happiness on a scale from 1-10 via text messages, the minute to minute scores don't strongly connect to how satisfied you are with your life looking backwards (which sorta relies on outcomes and big things more than hard-to-remember minutes of medium happiness in your daily life) but the hour to hour scores are higher on average when you spend time happily socializing, such as in a happy family or with good friends.
Something I found, as an adult, is that I seem to get and also to give a lot of happiness from visiting with people (either having them as house-guests or going to their homes) for roughly 10 days (more than a week, but less than 3 weeks for sure).
For roughly the first 10 days (two weekends and a bit?) you keep that glow of a rare and special visit, and don't have time for things like "resentment of the way they put the dishes on the wrong shelf over and over even after you asked them not to nicely five times" to build up ;-)
I feel like this house guesting practice of mine, in life, is consistent with the idea of managing and being aware of axiological stopsigns.
It is easier to plan 10 really good days of visiting "as a unit with a beginning and middle and end" than to plan 10 really good years with the same amount of control. The end of the visit is the axiological stopsign.
Whether there is another visit (and how fun that next visit might be) can be a nice and distinct second unit of analysis, if you're trying to "count the outs" for things to do (or not do) on an near term visit, wondering which things on this visit might be worth repeating on the next visit might change how you approach what is near in time.
Anyway... if you got all the way to the end of this essay (despite the essay telling you that all the ideas will just be repetitions on the theme, and that reading further might very well be a waste of time) and you have never yet ready Finite And Infinite Games by James Carse then you might like that book.
I looked in a few places just now, and if you want to buy a copy, maybe try Abe Books?
But if their inventory has changed since I looked, and they want to change you more than $9.50 for a used one then search elsewhere. (Unless you're reading this in the far future, and inflation has changed the price levels, in which case maybe that "price at which to search elsewhere" will have gone stale.)
James Carse's book applies the idea of ongoingly valuable social processes, and and doesn't go much into the math or planning or cognitive aspects much at all, but instead focuses on the sociological and emotional and spiritual differences in the vibe around games that are intended to never end, where people have fun by finding ways to continue to have fun... as distinct from a game being driven towards a definite end state that is personally preferred by a player or team that wants the game to end... with themselves crowned as victor.
Discuss
Artifical Expert/Expanded Narrow Intelligence, and Proto-AGI
Several years ago, I offered the possibility of there being a hidden intermediate state between Narrow AI and General AI
Even this take was, at the time, a few years after I had come up with the concept. And one of the predictions I made on this topic (I have failed to find the original post, I just know it was from 2017-2018 or so) was that the 2020s would be filled with said intermediate-type AI models that constantly get called "AGI" every few weeks
Refresher, now with visuals thanks to Gemini's Nano Banana:
Courtesy of Nano Banana. Based on the levels of self-driving used by AV manufacturers.
The term "artificial expert intelligence" was suggested to me back in 2018 or so, as a more cohesive description of this intermediate-phase AI architecture (and because the acronym "AXI" sounds cyberpunk; it has nothing to do with the company XAI)
The operating thesis is basic logic:
How on earth do we jump from narrow AI to general AI? How is it possible there is nothing in between?
And in the end, I was validated by this mulling.
Right now, we possess ostensibly "narrow" AI models like ChatGPT, Claude, Gemini, Grok, DeepSeek, etc. that nevertheless seem to have "general" capabilities that no AI hitherto the present possessed or even could possess at all.
The question is how to shift to a "general function" model.
From the start, I imagined it as a sort of 'less narrow' form of AI, and nowadays I've backronymed it into "expanded narrow" intelligence through the same means
Narrow AI has, since the origins of the field, been the only mode of all AI programs until the emergence of large language models in the late 2010s. The common expectation of real world AI has grown into models that handle single tasks, or a tiny fuzzy field of tasks. A single model capable of writing poetry, analyzing image data, and holding a coherent conversation would have been seen almost unilaterally as artificial general intelligence 10 years ago. In fact, it would even have been seen as such 5 years ago— GPT-3 (circa 2020) had already triggered conversations about whether autoregressive attention-based transformers were actually unexpectedly the path to strong, general AI.
Commonly, the start of the AI boom is attributed to ChatGPT
However, from my recollection, the true start of the AI boom occurred 6 months earlier with this little known paper release and this little remembered tweet...
Gatohttps://deepmind.google/blog/a-generalist-agent/
https://x.com/NandoDF/status/1525397036325019649
At the time, we had never seen a single AI model so general and so capable before. The ability to accomplish 604 tasks, and even do many of them at human level capability, was the first true call of "Oro!" that hyperaccelerated when the average consumer first tested GPT-3.5 in the late autumn of the same year.
Yet in retrospect, it seems obvious now that, as impressive and strong as Gato was, it was still a transformer-based tokenizer. Wasn't it? If I'm wrong, please correct me!
Even at the time, many commentators noted the bizarre uncanny valley effect at play in trying to deduce what exactly Gato was, because it didn't neatly fit into either category of ANI or AGI.
Indeed, it is closer to a proto-AGI than most frontier models today, but still fails to cross the threshold. Its function remains narrow, not general and expansive or dynamic— as in, not able to continually learn to develop new modalities, not able to anchor its understanding of concepts to symbols, and did not seem to possess a coherent world model from which it could engage in abstraction, which would have allowed it to accomplish "universal task automation." But within that narrow function, it developed general capabilities. So general, in fact, that it spooked many people into declaring that the game was now over and we simply needed to race down the path of scaling.
So far, we still haven't seen a true successor to Gato (could it be a state or corporate secret? I hope not, that doesn't bode well for the AI field if we're repeating the mistakes of the Soviet Union's OGAS network)
But what exactly is "universal task automation" in that context?
Anthropomorphic AGI (does it think like a human?) is important, yes. I don't doubt that this could emerge spontaneously from any sufficiently advanced system. However, my entire framing here is trying to focus on Economic/Functional AGI (can it handle the entropy of reality?)
The "Universal Task Automation" Machine
Essentially my attempt to do to AGI what "UAP" was for "UFO," taking away some of the Fortean woo around the term to focus on what actually is or could be there.
1. The Core Misconception: Jobs vs. Tasks
- The Fallacy: We tend to measure automation by "Jobs Replaced."
- The Reality: We haven't fully automated a single job title in 70 years. We only automate Tasks. Many people lose their jobs to AI, but entire job titles have yet to vanish in the digital age, besides the human computer. I have to credit sci-fi author and YouTuber Isaac Arthur for explaining this to me in the past, as this was a minor epiphany about how to think about AI. And you'll understand it thusly:
- The Supermarket Paradox: A supermarket clerk uses heavy automation (scanners, conveyor belts, POS systems). Without these, the store fails. Consider this: the first supermarkets opened in the late 1910s and 1920s, but were terribly slow and would have been seen as borderline "plus-sized farmer's markets" by today's standards as opposed to the common conception of a supermarket. Without the mechanical tools of the coming decades streamlining the process of scanning products and their prices, the modern supermarket would not be able to exist without either staffing autistic savants en masse or staffing large numbers at extraordinarily high wages for the skills required. Self-checkout is not market automation, as many disgruntled consumers note (it's simply passing the job to you), but the actual scanning is. There is likely no more heavily automated job title you experience on any daily basis. And yet the human remains. Why?
- The Machine handles the Rigid Tasks (scanning a clean code).
- The Human handles the Edge Cases (crushed tomato, missing barcode, angry customer). If you tried to create a fully automated supermarket today, you'd be better served making it more like an automat warehouse built around the limitations of vision and robotics systems from which humans can remotely order.
- Result: This is Partial Automation, not Universal Automation.
The barrier preventing Partial Automation (ANI/AXI) from becoming Universal Automation (AGI) is not "intelligence" in the IQ sense; it is the ability to navigate entropy, or what I tend to call "chaos"
- Scripted Automation (Levels 0-2): Works only when A → B
- Scenario: The widget is on the belt. The arm grabs it.
- The Chaos Event: The widget falls off the belt and rolls under a table.
- Current AI: Fails. It throws an error code or continues grabbing at empty air. More advanced deep learning models will note something is wrong and actively look for the widget, but may not have the embodiment to act if it's too far removed, or delegate the task to calling a human worker to solve the problem. It lacks the world model to understand that the object still exists but has moved (object permanence/physics), and/or it lacks the embodiment to act on such abstraction.
- Universal Automation: Notices the error, pauses, locates the object, creates a new plan to retrieve it (or flags a specific cleanup protocol), even actively searches for and retrieves it no matter where it's fallen, and resumes.
You cannot have a Universal Task Automation Machine without the ability to handle chaos (abstraction). This is why I tend to feel that even the strongest coding models are not AGI— the whole vibe coding trend involves models that still need to be guided and reprompted, often when they cause errors that must be fixed. Some level of logic is needed for these models to work, and yet I've yet to see a coding program that is capable of using any language and can think through and fix its code without a human telling it it made a mistake. Which is the other chief thing: when you no longer need a human in the loop in any executive function, then you've clearly crossed the threshold into AGI (take note, C-suite, planners, and asset managers, this may come back to haunt you when we reach true AGI)
My latest analogy to understanding the difference between what AI currently is and what AGI will be is that of superfluidity.
When you cool helium close to its lambda point, very curious behaviors emerge, such as intense bubbling and roiling, and even liquid helium itself is an odd and extreme substance. However, when it crosses the threshold into becoming a superfluid, it's not a gradual shift at all. Its entire quantum state shifts, and immediately bizarre new effects emerge.
This is my take for what the shift to AGI is like, and why exclaiming every new model gets us closer to AGI is arguably completely missing the point.
Current AI has "friction." That is, it gets stuck on edge cases. You can automate 20%, 50%, or 80% of tasks, mostly through genius and creative programming, but as long as a human is required to fix the "chaos," you are still in the liquid state (Viscosity > 0).
Once the system can handle the chaos/abstraction— once it can fix its own errors, once it can abstractly predict future states, once it can generalize outside its training distribution and thus prove it has "general function" rather than just "general capability"— resistance drops to zero.
It doesn't matter if the AI is legally restricted to 50% of jobs. If it technically possesses the capability to handle the chaos of 100% of tasks, the Phase Change has occurred. An AGI, even a proto-AGI as per the infographic up above, ought to be able to handle 100% of tasks at 100% of jobs. Not some arbitrary number that appeases venture capitalist predictions about potential returns on investment.
Right now, we are deep in the AXI phase and hoping that scaling gets us to Level 4.
These first AGIs, which will probably be called Proto-AGI or First-Gen AGI or even Weak AGI, will be general function + general capability, capable of universal task automation. In many ways, they will be human-level, much like Gato or any frontier model.
And yet, I strongly doubt we'll claim they are sapient (besides those suffering AI psychosis). Even with a phase change occurring in terms of functionality, the first models are not inherently defined by human capability. They are general, tautologically speaking, because they're general. Whether they are "conscious" or "sapient" entities with inner worlds as rich and alive as a human being's is irrelevant at this stage. This is yet another area where it seems people have trouble visualizing the concept due to a lack of language around it, as often "AGI" will immediately invoke the idea of an artificial human brain, and because it seems we're so far off from such, there's no reason to worry we'll ever reach it. When in reality a "mere" general-function AI model could be built within a year or two. It could even be built by someone we don't expect, because of the possibility that nearly all the major AI labs are actually chasing the wrong method. Continual learning is undoubtedly one of the more important prerequisites for any AGI, but let's entertain the thought that even a massively multimodal, infinitely time-test computing, continuously learning transformer-based model still fails to cross the threshold of AGI for some inexplicable reason. Likely, we'd still call it such, because much as with superfluidity, you don't realize when the phase transition happens until after it happens. Before then, you spend a great deal of time convincing yourself that mild changes may be signs the transition has happened.
In regards to superintelligence, the most I want to note in this particular post is the topic of "qualitative" vs "quantitative" superintelligence as represented in that infographic.
Quantitative superintelligence simply means any AGI/ASI that is still par-human or low level superhuman but can operate at superhuman speeds (which is inevitable considering how digital and analog computing works); Qualitative superintelligence is the more common conception of it, as an entirely alien brain.
And judging by both popular folk conceptions of AGI, as well as the conceptions of AGI/ASI from the venture capitalist and billionaire class, I strongly feel most people do not truly understand what "superintelligence" actually means. I may go into some detail about what I mean in a future post.
All apologies for the rambling post, but I felt the need to expound on these topics early in the year: there's no better way to prune a concept than to throw it into the wider market.
As always, if I'm wrong, please correct me or expand upon this. Do whatever with this, even reject it entirely if the entire thesis is faulty or wrong.
Discuss
An Aphoristic Overview of Technical AI Alignment proposals
Most alignment overviews are too long, but what if we rewrote one as a series of aphorisms?
I like Epictetus's confronting style: abrasive, clarifying. See my fuller post for links and nuance.
ISome problems can be solved by being smarter.
Some problems can only be solved by having help.
Aligning something smarter than you is the second kind.
So many proposals collapse to:
use AI to help supervise AI.
It sounds simple. It's the only thing that works.
II. On using AI to supervise AIIf a weak model helps supervise a strong one,
and that one supervises a stronger one still:
this is Iterated Amplification.
The chain must hold.
If two models argue and we watch:
this is Debate.
Checking is easier than creating.
If we give the model principles and ask it to judge itself:
this is Constitutional AI.
Principles must survive power.
If we build AI to do alignment research:
this is Superalignment.
Train it. Validate it. Stress test it.
A corporation can be smarter than any employee.
Yet no employee wants to take over the world.
Many narrow tools, none with the full picture—
this is CAIS.
Make the AI uncertain about what we want.
Then it must ask.
This is CIRL.
If it can't be satisfied, it won't stop.
Give it a notion of "enough", and it can rest.
This is satisficing.
If the model lies, look past the output.
Find what it knows, not what it says.
This is ELK.
If prompting fails, steer the internals.
This is activation steering.
Output-based evaluation breaks
when models are smarter than evaluators.
Look for the thought, not the performance.
You can't align what you don't understand.
This is agent foundations.
What is an agent?
What is optimization?
Confused concepts make confused safety arguments.
Keep it in a box. (AI boxing)
But boxes leak.
Let it answer, not act. (Oracle AI)
But answers shape action.
Do what we'd want if we were wiser. (CEV)
But how do we extrapolate?
The optimist case: the problem is easier than it looks.
It still takes correction.
This is the optimist case.
Suppose we solve alignment perfectly.
Aligned to whom?
A safe AI in the wrong hands is still a problem.
This is governance risk.
Rewritten from my original draft with Claude.
Discuss
Claude Wrote Me a 400-Commit RSS Reader App
In the last few weeks, I've been playing around with the newest version of Claude Code, which wrote me a read-it-later service including RSS, email newsletters and an Android app.
Software engineering experience was useful, since I did plan out a lot of the high-level design and data model and sometimes push for simpler designs. Overall though, I mostly felt like a product manager trying to specify features as quickly as possible. While software engineering is more than coding, I'm starting to think Claude is already superhuman at this part.
Narrating an article from an RSS feed in the web app. The Android app can do this in the background and supports media controls from my car.This was a major change from earlier this year (coding agents were fun but not very useful) and a few months ago (coding agents were good if you held their hands constantly). Claude Opus 4.5 (and supposedly some of the other new models) generally writes reasonable code by default.
And while some features had pretty detailed designs, some of my prompts were very minimal.
After the first day of this, I mostly just merged PRs without looking at them and assumed they'd work right. I've had to back out or fix a small number since then, but even of those, most were fixed with a bug report prompt.
Selected FeaturesAndroid AppThe most impressive thing Claude did was write an entire Android app from this prompt:
Creating the design docImplementing the entire MVP app in one shotAfter that, almost all features in the Android app were implemented with the prompt, "Can you implement X in the Android app, like how it works in the web app?"
NarrationThe most complicated feature I asked it to implement was article narration, using Readability.js to (optionally) clean up RSS feed content, then (optionally) running it through an LLM to make the text more readable, then using one of two pipelines to convert the text to speech and then linking the spoken narration back to the correct original paragraph.
To be honest, I don't know if I could one-shot this either.This was the buggiest part of the app for a while, but mostly because the high-level design was too fragile. Claude itself suggested a few of the improvements, and once we had a less-fragile design it's been working consistently since then.
Selected ProblemsNot Invented Here SyndromeClaude caused a few problems by writing regexes rather than using obvious tools (JSDom, HTML parsers). Having software engineering experience was helpful for noticing these and Claude fixed them easily when asked to.
Bugs in DependenciesClaude's NIH syndrome was actually partially justified, since the most annoying bugs we ran into were in other people's code. For a bug in database migrations, I actually ended up suggesting NIH and had Claude write a basic database migration tool.
The other major (still unsolved) problem we're having is that the Android emulator doesn't shut down properly in CI. Sadly I think this may be too much for Claude to just replace, but it's also not a critical part of the pipeline like migrations were.
Other Observations- Claude likes to maintain backwards compatibility, and needs to be reminded not to bother if you don't need it.
- Claude did hallucinate API docs, which has led to me including the fly.io docs in the repo for reference. I imagine this would be a lot more annoying if you were using a lot of uncommon APIs, but Claude knows React very well.
- Sometimes it would come up with overly complicated designs, and then fix them when I asked to check if anything was overly complicated.
- It actually did pretty good at coming up with high-level system designs, although it had a tendency to come up with designs that are more fragile (race conditions, situations where the state needed to be maintained very carefully).
The problems Claude still hasn't solved with minimal prompts are:
- The Android emulator issue in CI (upstream problem).
- The email newsletter setup requires a bunch of manual steps in the Cloudflare Dashboard so I haven't really tried it.
- Narration generation seems to happen on the main thread, making the UI laggy. Claude's first attempt at a web worker implementation didn't work and I haven't got around to trying to figure out what's wrong.
But that's it, and the biggest problem here is that I'm putting in basically no effort. I expect each of these is solvable if I actually spent an hour on them.
This was an eye-opening experience for me, since AI coding agents went from kind-of-helpful to wildly-productive in just the last month. If you haven't tried them recently, you really should. And keep in mind that this is the worst they will ever be.
Discuss
The inaugural Redwood Research podcast
After five months of me (Buck) being slow at finishing up the editing on this, we’re finally putting out our inaugural Redwood Research podcast. I think it came out pretty well—we discussed a bunch of interesting and underdiscussed topics and I’m glad to have a public record of a bunch of stuff about our history. Tell your friends! Whether we do another one depends on how useful people find this one. You can watch on Youtube here, or as a Substack podcast.
Notes on editing the podcast with Claude Code(Buck wrote this section)
After the recording, we faced a problem. We had four hours of footage from our three cameras. We wanted it to snazzily cut between shots depending on who was talking. But I don’t truly in my heart believe that it’s that important for the video editing to be that good, and I don’t really like the idea of paying a video editor. But I also don’t want to edit the four hours of video myself. And it seemed to me that video editing software was generally not optimized for the kind of editing I wanted to do here (especially automatically cutting between different shots according to which speakers are talking).
Surely, I decided, it wouldn’t be that hard to just write some command-line video-editing software from scratch, with the aid of my friend Claude. So that’s what we did. We (which henceforth means “me and Claude”) first used deepgram to make a transcript of the podcast that includes timestamps and a note of who’s speaking. Then we generated IDs for all the different lines in the transcript, leading to a huge file that looks like this:
$segment2/34 // Buck: So how good you know, you're you're saying, like, 50% of, like, catastrophically bad outcomes. $segment2/35 // Ryan: Yep. $segment2/36 // Buck: And then what's the distribution of how good the other 50% of worlds are? $segment2/37 // Ryan: Yeah. $segment2/38 // Ryan: So I think there's a bunch of variation on like, from a long term perspective in terms of, like, how good is, like, the future in terms of, like, how how well are, like, cosmic resources utilized. $segment2/39 // Ryan: And my current view is that can vary a huge amount in terms of the the goodness relative to, like we could say, like, we could establish different baselines.We wrote code that lets us edit the podcast by copying and pasting those lines around. We also wrote code that automatically generates cuts between the different shots using crappy heuristics that Rob Miles tells me are bad.
The DSL compiles to a giant ffmpeg command, which is then executed to produce the final product.
This process produced the video and audio for almost all of that podcast (with the exception of the intro, which Rob Miles kindly sat down with me to manually edit in a real video editor).
Things I learned from this process:
- Claude Code had a lot of trouble getting ffmpeg to work. This makes sense, it seems pretty confusing and Claude wasn’t easily able to check its work.
- In the same spirit as the many American men who believe they can beat a bear in a fight, I often have the delusional belief that instead of learning existing software, I should write my own software to do the same thing. AI agents are now good enough that this is not totally impractical.
- Video files (especially really long ones like these) are really large. This means that lots of things you want to do are constrained by disk space, network speed, and computational power. I ended up doing all the work over SSH to my beefy desktop.
This has been edited for clarity by an AI and spot-checked.
P(doom)Buck: We’re here to talk about a variety of things related to our work and AI stuff more generally. One of our favorite questions submitted to us was the classic: what’s your P(doom)? So, Ryan, what is your P(doom)?
Ryan: I think it’s going to depend on how you operationalize this, unfortunately, as is often the case. One question would be: what’s the probability of misaligned AI takeover? I think my view is that’s like 35%, just unconditional. And then the probability that along the course of AI development, something goes catastrophically wrong which massively reduces the value—including things like authoritarian takeover, power grabs by people who do very bad things with the world—maybe 50% total doom.
Buck: Sorry, you’re saying 50% including the 35%?
Ryan: Yeah, including that. Just 50% chance something goes catastrophically wrong, including AI takeover, et cetera. And I’m not including cases where there’s AI takeover, but the AI ended up being kind of reasonable, and we’re like, that was basically fine. Or cases where there’s massive power concentration, but either the power later gets less concentrated, or the people who the power concentrates in end up doing kind of reasonable things on my lights.
Buck: Okay. So one thing I find interesting about this is: you’re saying 50% catastrophically bad outcomes. And then what’s the distribution of how good the other 50% of worlds are?
Ryan: So I think there’s a bunch of variation from a longtermist perspective in terms of how well cosmic resources are utilized. And my current view is that can vary a huge amount in terms of the goodness. We could establish different baselines. We could do “maximally good” versus “paperclipper” and then ask what fraction of the value between the two we get—and also negative is possible, right? You can be worse than a paperclipper. When I say paperclipper, I mean an AI that has the aim of maximizing some arbitrary thing I don’t care about. The fact that this may or may not be realistic isn’t important; I’m just using it as a baseline. And then for maximally good, it would be like I thought about it really hard and decided what happened with everything, and deferred to whatever body I thought would be most reasonable for figuring out what should happen. I’m doing this purely as an outcome-based thing—obviously there’s different levels of reasonableness of the process versus how good the actual outcome is.
One relevant point: suppose that humans stay in control, misalignment risks were either mitigated or didn’t end up being a problem, and humans basically don’t go crazy—people remain basically sane. And let’s suppose that the world is broadly democratic. So at least the vast majority of the cosmic resources are controlled by countries that are as democratic as—I don’t know what’s a pretty good example—France? France is a democracy. As democratic as France is now. So not necessarily perfect, there’s a bunch of difficulties, but at least that good. My sense is that you get like 10 or 20% of the value of what’s maximally achievable from the paperclipper-to-maximum range in that scenario. A lot of the value is lost due to people not being very thoughtful with usage of cosmic resources, disagreements. I think a lot of it is lost by people just not really trying very hard to reflect on what to do and not being thoughtful, and then some is lost due to people disagreeing after really having their epistemic dynamics and decision-making improved a bunch. It’s very hard to predict these things, but that’s some guess. And there’s complicated mitigating factors. Like maybe there’s less value lost due to trade, but that also applies to some extent on the paperclipper baseline—that’s partially priced in, but there might be other things like that.
Buck: Sorry, that could be either causal trade between different actors on Earth who exchange resources, or acausal trade. For the paperclipper, are you saying causal trade or acausal trade?
Ryan: Acausal. That would be purely acausal trade in the paperclipper case, and then in the broadly democratic control by humans case. To be clear, those humans could then decide to defer to AI successors or all kinds of things could happen there. And I’m assuming that there is ultimately sufficiently good alignment that there’s no slow degradation in how much control the humans have over time—another scenario is humans are initially in control but that slowly decays away, and I was not including that.
Buck: So here’s something about all this which is emotionally resonant for me, or feels like a big deal to me. Once upon a time, when I was a child reading lesswrong.com in the dark days of 2010 or whatever, I had this feeling that if AI didn’t kill everyone, we would have a future that was pretty good. And I mean pretty good in kind of two different senses that I sort of conflated at the time. One of them is pretty good from a utilitarian perspective—like scope-sensitive, almost all of the value that’s achievable is being achieved.
Ryan: And just to clarify, when Buck and I say utilitarian, I think we mean something much vaguer than a very narrow philosophical sense. We just mean we had some sense of scope-sensitive stuff where you cared about some aggregation that’s kind of linear-ish.
Buck: Ish.
Ryan: Whatever version of that made sense after thinking about it pretty hard in a somewhat thoughtful way, which probably is not going to be literally additive because of cursed issues around infinite ethics and things like this.
Buck: Yeah. So I used to have this feeling that if we didn’t get killed by AI, the future was going to be very good. And one of these senses is: I get a lot of what my values say is good from the future. And the other one is: the future doesn’t have issues that, even if they aren’t of super large scope, are pretty depressing. So I imagined that the future wouldn’t have massive factory farming. Suppose that in the future our successors colonize the stars—humans or emulated minds colonize the stars. But in some trade, Earth continues on in its earthy ways. And the Earth humans who want to continue being flesh humans, normal humans made out of flesh, they decide that they want to keep doing their wild animal suffering. They want to continue having ecosystems be basically the same way they are, and they want to continue having factory farming for some cursed reason. I feel like this is pretty depressing. From a utilitarian perspective, obviously this is totally fine—the factory farming, though horrifying by current scales, is very small compared to the total value of the far future. But nevertheless, there’s something kind of depressing about it. And nowadays I kind of just don’t expect things to be good in either of these senses—either we are getting almost all the large-scale value that I might have wanted from the future, or there isn’t really depressing oppression or suffering. And I feel like this has caused me to relate to the future pretty differently. I think the most obvious reason to expect this is I’ve thought more about what different nations are likely to want around the time of the intelligence explosion. And it feels like in a lot of cases we either have to hypothesize that some crazy war happens which just disempowers a bunch of countries, or those countries get a lot of what they want. And I just don’t feel very good about what they want. I think this is probably better than a crazy war, but it does involve futures that have a bunch of stuff that I don’t like very much in it. And that is pretty unlikely to turn into wonderful cosmopolitan use of the far future.
Ryan: Yeah, maybe I should push back a little bit. I feel like you’re like, “10%, boo hoo, we only get 10% of the value, oh no.” And I’m like, come on, maybe glass-half-full this a little bit. It could be a lot worse. In some sense we should be into getting a lot of really good stuff. And yeah, it could be much better. But it’s plausible that thinking about this as “wow, it’s worse than I expected” is the wrong way to think about it. You should just be like: it doesn’t really matter what I previously thought, what I think is X, and is that good or is that bad? What would be the most healthy way to relate to it? I agree that this does have a big effect from a utilitarian perspective in terms of how you think about the impact of different interventions. Because it means that going from 10% to 50% is a very big deal.
Buck: Sorry, what do you mean 10% to 50%?
Ryan: So imagine that by default we would get 10% of the value and there was some intervention that would result in us getting 50% of the value. That could be worth a lot. I’ll maybe mess up the math here, but imagine that by default the probability of AI takeover was like 5%, which I’ll just say for simplicity, and imagine that in order to go to the 50% world, we had to make it so the probability of AI takeover was instead 80%. Naively, that’s actually slightly better if you just multiply the numbers. Now, there’s a different question of should we actually go for it, because it’s also a fucked up trade to be making. But in terms of what interventions you’re doing at the margin, if you’re choosing between an intervention that reduces AI takeover risk by 0.1% or increases the upside conditional on no AI takeover by 5%, it’s very plausible the upside-increasing intervention looks better. And if everyone is making that choice, then the situation looks pretty different. But this also very plausibly looks much less cooperative in a sadder way.
Buck: Yeah, I agree. I feel like you’re relating to this in a very healthy way. But I’m curious—a lot of where I’m coming from here isn’t describing an emotionally healthy reaction, it’s just describing myself and how I feel about it. Has your mind changed about this in the time you’ve been thinking about this?
Ryan: You mean from an emotional perspective, or…?
Buck: Yeah, or have your beliefs shifted?
Ryan: I think I just didn’t have very precise beliefs. I sort of had a vaguer “probably stuff will be good” and hadn’t thought about it in that much detail. But maybe would have regurgitated numbers that aren’t wildly different from the numbers I said. And then after thinking about it more, I got sort of sadder about the situation and did have some emotional reaction. I did think the situation was worse due to thinking about ways in which resources might be squandered or things could go poorly. And then I felt a bit better about the situation after thinking about some considerations related to ECL.
And then another factor that neither of us have mentioned but is maybe pretty important is the possibility of large-scale s-risk—maybe “s-risk” is the wrong term because I’m not sure that it’s very binary, but just: how much really bad stuff will there be? Buck was giving the example of factory farming on Earth. From a scope-sensitive perspective, that’s very small. Factory farming on Earth is like 10 to the negative 40th of badness in the universe or something, compared to maximum bad. I could do the math more carefully, but it’s somewhere around there in terms of just the scale. But things could be much worse than this, obviously. And after thinking about various considerations related to s-risk, my conclusion was that the fraction of all the stuff—across this universe, but also across the multiverse in general, operationalized in different ways—is more like 1/200th of that is optimized suffering (of the resources that end up getting used). And that’s a pretty depressing perspective. I think that’s intuitively pretty concerning. And this is not a common aspect of how people are talking about the future.
Do we talk too little about weird stuffBuck: Okay, so here’s a question, Ryan. I feel like we both have an attitude to the universe—more broadly the multiverse—which is quite non-standard among people. For instance, I think we both take very seriously the existence of other universes. And I think they feel emotionally real to us. And I think we take very seriously the fact that the universe is very large and there are aliens other places. And we might be in a simulation.
Ryan: Yeah.
Buck: I think some consequences of this are also a big part of how we relate to the universe and how we feel about the future. What was it like for you, coming to have all these perspectives that we have?
Ryan: Maybe I want to step back a little bit and say more about what we mean. Buck said that we take these things seriously. I think there’s maybe two different senses in which you could mean that. One is we think these things are true and act as though they are true when making decisions where we’re explicitly considering different trade-offs. Another is we viscerally feel they are true. I’m not sure the extent to which I viscerally feel these things are true. I do in some ways and I don’t in other ways. And I don’t know that it would be healthy or productive to System 1 viscerally feel that, for example, we’re in a simulation. I don’t know that that triggers productive things in any way. The human System 1 is not adapted to make good reflexive decisions taking into account the consideration that we might be in a simulation and there are aliens somewhere. So I’m not really sure that’s a very productive strategy. I do think there’s questions about motivation and disagreements between System 1 and System 2 causing issues.
And then I also want to clarify what we mean. Buck said there are other universes—there’s a bunch of different senses in which you can mean this. One sense is that there’s parts of this universe that we can observe, sometimes called the observable universe. And as far as we know, the universe is actually spatially infinite or at least spatially very large. That is extremely consistent with our observations. That’s the default view—in some sense the view that all that exists is the observable universe and there’s basically nothing outside of that would be very a priori surprising given what we understand about physics. And then even within the observable universe, I think about 5 or 10% of that is actually reachable by us given our understanding of physics. So that’s an even smaller chunk. There’s a bunch of stuff that we will see that we will never be able to reach.
And then in addition to the spatially infinite, there’s Everett branches, which seem very consistent with our understanding of physics. Could be wrong, but it seems like if we’re not really confused about physics, that’s what’s going on. And that’s even more stuff—in fact it’s an absurdly insanely large amount of stuff because branching is very fast. And then there’s a bunch of reasons to think that there are other universes that are well described as having different physics. I don’t want to get into all these arguments. This is me regurgitating some arguments from Max Tegmark and the mathematical universe hypothesis, which I’m sure are influenced by a milieu that’s been stewing for longer. But anyway, that’s the sense in which we mean there are multiple universes.
As far as how this affects me: I wasn’t really thinking about these considerations in any great detail when I first started thinking that I should work on AI safety. And then after—I’m not sure exactly what mix of things caused me to update—but to some extent just thinking about it myself and probably hearing some of these arguments in more detail made me much more viscerally entertain various possibilities related to us being in a simulation. Concretely, I think there’s a kind of specific hypothesis about what type of simulation we’re in, which is something like a science/trade simulation where you’re a future alien civilization, you’re interested in what happens to other alien civilizations as they go through the singularity, both so that you can do acausal trade and potentially so that you can answer questions about the distribution of stuff in the universe. And it feels like that is a very salient class of things to simulate. And I’m like, man, really seems like in some sense that’s probably what we’re in.
And then another factor relevant to my views is that once you think through a lot of the things related to the multiverse and how to handle things like infinite ethics, the word “probability” becomes a little less real and becomes a less clear-cut notion—more a question of philosophy and decision theory than something that is concrete and well-specified. Saying “we’re probably in a simulation” feels like it doesn’t feel like a well-specified question and feels a little bit confused. And I think there’s a way you can break that down and redefine the word probability such that that statement makes sense again, but then it means something kind of different.
Buck: Yeah, I feel like I’m a probability anti-realist. I feel like there’s a certain kind of personality type that we both have which ends up being reality anti-realist in a certain sense, where we consider it, for example, a matter of preference—in the same way that your morality is essentially subjective—what distribution over universes you think exist. Eliezer Yudkowsky thinks that I’m wrong about this and crazy, but I don’t totally understand his argument.
Ryan: He thinks there’s a true prior? What’s the view here?
Buck: Yeah, he thinks “reality juice” is real. He thinks there’s something real about real universes and he doesn’t buy all this Tegmark Level 4 stuff nearly as much as we do.
Ryan: Oh, that’s interesting. One interesting thing is I think there’s some cluster of thought that we belong to, which I would guess Carl Shulman also belongs to, and we can name other people who are broadly along the same lines. But then interestingly, there’s some separate MIRI cluster of views on acausal trade and what will happen in the long-run future that’s very different. My understanding—this could be wrong because this is just based on what I’ve heard from other people—is that Eliezer and Nate actually think most of the realizable value will in fact be realized conditional on avoiding AI takeover, from what I’ve heard. Or at least they’re more optimistic than I am. And I think they’re maybe less optimistic—or maybe they think it’s less path-dependent, they think it’s less likely that you can affect this. And then I think they also think the dynamics around acausal trade are very unlikely to go poorly in some way that I don’t quite understand. But I’m not sure about their views on multiverse.
Buck: Okay, here’s another question of a different theme. I feel like we know a lot of people who focus on AI takeover risk, but AI takeover risk is not that much of my uncertainty over how good the future is. I think there are several issues that are basically as large or maybe larger in terms of my uncertainty over how well the future will go. One is AI takeover risk, another is very unfortunate human concentration of power risks. And another is: there is no unfortunate human concentration of power, but humans make terrible choices about what to do with the future and almost all the value is lost. There’s also various things going wrong with respect to acausal trade or acausal coordination.
I just feel like AI takeover risk definitely doesn’t feel like by far the largest of these. And I feel like it’s kind of interesting that sometimes people just conflate longtermism with worrying about AI takeover risk. And I think there’s also kind of a purity test thing where sometimes rationalists act as if, if you’re a real longtermist, the thing you should be mostly worried about is AI takeover risk. Thinking of it as kind of like you have to be a soft-core loser in order to be focused on concentration of power risks or risks from bad people controlling AI. Partially just because historically it has been true that when you argue with people about AI, a lot of people mostly think we’re talking about concerns about who controls the AI. And I think they do actually underrate the AI takeover risk. But I do worry that we end up in this situation where at least a certain class of person is mostly focused on AI takeover risk when they’re thinking about AI and not really thinking nearly as much—and often not acknowledging or being aware of the existence of—these other very real arguments for why the future might be way lamer than it could be.
Ryan: Yeah. I think AI takeover risk is a higher fraction of the variation. I guess I would have said the biggest source of variation is conditional on human control—how that goes, which humans, what they do. And we could break that into several different things, but I sort of think of that as one. There’s who controls it and what do they do? And maybe those are two big chunks. I think who controls it probably is less variation than what do they do. And by what do they do, I mean what process do they follow. But both cause some amount of variation.
If I were to give a ranking in terms of how much of the variation in how the future goes—which to be clear is not the same as how much you should prioritize that, because there’s a question of how much you can intervene—I would have said: most important in terms of variation is what the people who are in control decide to do in terms of what process for deciding what to do with the future. Then misalignment risks. Then who controls the future. Low confidence about the order between who has power and AI stuff. And then there’s some longer tail of things that can go wrong, including acausal risks, et cetera. But I think it’s actually pretty reasonable to think that AI takeover risks, once you start taking into account the ability to intervene on them, are actually higher priority.
Buck: Sorry, was that the list of…?
Ryan: The thing I was just saying was the list of variance.
Buck: Okay.
Ryan: And then I’m saying that when you multiply through by how easy these things are to intervene on, I think it is actually the case that AI takeover risk is what most people should be working on.
Buck: Yeah, I guess my ranking would be: most important is how do humans make decisions about what to do with the future. Tied with that, or number two, is acausal trade risk and acausal coordination problems—sorry, that’s number three. And then number four is which humans control the future. But I’m not very confident in that ordering. I’m a little surprised you put the acausal interaction so far down in your priority list.
Ryan: Yeah, it seems like it’s probably fine. I don’t know what could go wrong—that’s overstated. But it’s kind of hard for me to… My distribution over how bad things could get doesn’t feel like it’s driving a huge amount of variation. It seems like it’s potentially important to intervene on. Obviously I’m not hugely confident about this ranking. And in some sense the thing that’s most important is what should our priorities be for intervening on things, not how much variation do these things drive. But it is worth noticing the variations.
Buck: And I think my sense is that basically all four of these categories we’ve just been discussing seem potentially worth intervening on. And we’ve thought about all four of those in terms of things to potentially intervene on. I think our attitude has been—for tractability reasons that you were discussing—we put more of our effort into the preventing takeover one than any of the others. But we have taken opportunities when they arise and they’re kind of easy on the other ones, except maybe for helping humans make better decisions about the future.
Ryan: Yeah, I mean it’s worth noting that there’s also different people specializing in different things. Redwood Research—what are we doing? We’re mostly focused on mitigating AI takeover risk and I think that’s a reasonable thing to work on. And in fact many people working at Redwood are not necessarily bought in to working on more speculative non-AI-takeover stuff that is for longtermist motivations. I think it’d be fine for Redwood to work on that stuff, but I think it’s probably pretty reasonable to say we’re mostly focused on x-risk stuff and then eventually other organizations could crop up working on these things—but maybe those organizations will never exist in practice.
Buck: Yeah, I mean I think another plausible thesis for what Redwood Research should be is kind of like the Future of Humanity Institute, in that we are just a center of excellence for thinking about stuff about the future and what should be done about it, but with more of a focus on being capable empirical ML people and generally knowing empirical ML stuff and then looking for opportunities to intervene that rely on knowledge of empirical ML and computer science and a couple other things like that. And I think we do actually manage to grab little bits of value intervening on all of these topics thanks to our knowledge of empirical ML. That just does actually come up sometimes.
Ryan: Yeah, it doesn’t really come up that much in the acausal trade risks, I’ll be honest.
Buck: But the knowledge of empirical ML…
Ryan: Yeah, I feel like it’s not that irrelevant.
Buck: Yeah, maybe “empirical ML” is a little strong. I do feel like knowing some basic facts about how reinforcement learning works is sort of relevant, comes up a little bit.
Ryan: Yeah.
Buck: We’ve sometimes talked about approaches to mitigating various risks from acausal coordination that do involve empirical ML. For example, affecting training data of AI so they’re inclined to behave differently in certain ways. And I think our empirical backgrounds are actually helpful for that.
Ryan: Yeah, a bit helpful.
Buck: I think our favorite interventions for this stuff in fact route substantially through knowledge that we have about how AI companies work and what it takes to get them to do stuff, and our connections to AI companies.
Ryan: Yep, I agree with that. I feel like I haven’t quite answered a question you asked a long time ago about how has my reaction to this stuff changed over time and how has this affected how I think about things. I think for a while it did feel kind of salient to me—like, wow, this really feels like the sort of thing that would happen in these sorts of trade simulations or science simulations focused on this point in history. But it doesn’t feel as salient to me now and I don’t think it’s very action-guiding anyway.
Buck: Yeah. So when you want to approximate—suppose we’re interested in approximating a Bernoulli, for example “does AI takeover occur?” And there are a variety of ways that things can go. You have an expensive process that you can run which gives you an unbiased sample of this Bernoulli, for example running a simulation where a bunch of stuff happens. You might in fact want to engage in importance sampling such that you run your expensive simulation more often from points where you have a bunch of uncertainty about how it’s going to go from there or what the distribution of things are from there, rather than simply naively running from the start. For example, suppose that some things can happen early in the course of your simulation such that you are now quite confident which way it’s going to go from there. In some cases you might want to shut down that simulation. This is similar to how AlphaGo works when it’s trying to estimate the value of a given board state. How much does this affect your outlook on life?
Ryan: Not very much. I agree it’s true. I would generalize this somewhat: to the extent that you’re doing simulation to get an understanding of what’s going to happen, you might want to do all kinds of unhinged things that affect what—from some perspectives—people in the simulation should expect to observe. The “from some perspective” is important there. This includes things like you can do all kinds of internal branching and forking. This is analogous to how path tracing or global illumination algorithms work—there’s a bunch of details around how you wanted to estimate this thing, there’s a bunch of different funky sampling algorithms you can use, including all kinds of things that would cause you to focus on different things. You can think of this as importance sampling. In some cases it’s kind of dis-analogous to doing importance sampling. There’s also importance sampling on initial conditions where maybe you’re like, “Ah, yes, another monkey planet. The monkeys always lose. We’ll go to the lizard people this time.”
Buck: A generally fun fact is when I first started learning about computer graphics, I didn’t understand how closely related it is to Bayesian inference. Actually that’s not true. When I first started learning about computer graphics, it’s because I met Edward who worked at MIRI for a while. And I was like, “So how did you hear about MIRI?” And he was like, “Well, I’m a computer graphics guy and so obviously I think a lot about Bayesian inference.” And I was like, “Sorry, you what?” And he was like, “Oh yeah, the algorithms for computer graphics, especially in film, are extremely closely related to Bayesian inference methods.” And I thought about it for a while and I was like, huh, that is actually true. And so it’s very funny to hear you so freely be like, “Oh, yeah, it’s just like computer graphics. It’s just like how you compute the caustics when you’re in your scene which has a glass of water and the light is going through the water. That’s how the Bayesian inference looks.”
Ryan: Yeah, basically. But I maybe should quickly explain what I mean by “in some particular sense that’s what you should expect.” I think the thing that you should take actions to influence—from a longtermist perspective, one should take actions to influence the future overall. And from that perspective, the fact that things are being importance-sampled shouldn’t affect your actions because you are not going to be able to exploit errors in the simulators or errors in people who are trying to infer things about the world. And you should instead just make it so the things that they infer are as beneficial from your perspective as possible, or things that are basically analogous to that.
I should also say that my previous example where I said you don’t simulate the monkeys because they always lose—another factor is that even if P(doom) is 100%, even if AIs are guaranteed to take over and human civilization is guaranteed to go off the rails, this wouldn’t necessarily stop you from simulating it because you would want to know about the initial conditions to understand what the AI values might end up ultimately being and things like that. So I definitely don’t think that even if you’re taking—I would say the wrong perspective—and you’re thinking about it in terms of what you should expect to observe in terms of what the weighted measure of the simulations looks like (which I don’t think you should do because I think you should weight by the importance of the simulations), even if you do that, it’s worth noting that that doesn’t mean that we necessarily should expect we’re in the 50/50 world where we were in the more important region. Because it’s totally plausible that it’s overdetermined that alignment isn’t an issue or totally is an issue, but nonetheless there were important dynamics of how things go down with respect to alignment that affect the ultimate values.
Buck: Yeah, maybe a general question here is: I engage in recruiting sometimes and sometimes people are like, “So why should I work at Redwood Research, Buck?” And I’m like, “Well, I think it’s good for reducing AI takeover risk and perhaps making some other things go better.” And I feel a little weird about the fact that actually my motivation is in some sense a pretty weird other thing.
Ryan: It’s kind of basically the same.
Buck: It’s basically the same in one sense, but it’s also really not basically the same in another sense. And I guess I do worry about whether it’s somehow misleading to people that I don’t often explain a lot of this stuff when I’m talking to them.
Ryan: Yeah, I think from my perspective—I’m trying to think of a good analogy for this—it’s sort of like we want to make the future go well and that’s what we say. And in some sense there’s a lot of details and structure in “we want to make the future go well” that are extremely non-obvious. But I feel like it’s sort of like: imagine that you’re working at a chair company and they’re like, “We want to build some chairs.” And then you’re like, “Isn’t it misleading that when you hire people to build chairs, you don’t explain to them that you understand that chairs are actually made up of atoms which are mostly empty space? And people actually think these chairs are solid, but they don’t realize that really the chairs are just electrons pushing against each other.” I feel like the situation is not quite analogous to that, but it’s sort of analogous, where it’s like: yes, when I said “make the future go well,” I didn’t mean this kind of unhinged thing based on these views. But once you think it through, it basically all mostly adds up to mostly kind of normality, in the same way that you don’t need to understand the details of physics to understand that sometimes you want to sit in a chair. Similarly, it would be good if people who were better had more control of the future, and it seems like the future might be important. These things don’t depend on really detailed views about these things.
And then sometimes they do. Sometimes there are specific interventions that do come down to detailed views of the future. And in fact we think about stuff along those lines some of the time. But I feel like there are things that are somewhat misleading and things that aren’t. I would say if anything, the biggest question mark is divergences between things that are broadly longtermist and things that are not that longtermist. And I think there is definitely some room for divergence there, though in a lot of our work I don’t think this is a big factor.
Papers where we feel jealousyBuck: I’m interested in trying to talk about a different class of thing. Here’s a question. We’ve been working together on AI safety stuff for—how long? Four years or something?
Ryan: I think a little bit over three years at this point.
Buck: Really?
Ryan: Three and a half. Yeah. I started doing stuff at Redwood around December 2021.
Buck: Okay, so three and a half years. We’ve been working together for three and a half years at this point. So I’m curious: of all the papers that have come out by other people in this time, which ones are you most like, “Damn, we should have done that. That doesn’t seem, in hindsight, too hard. We absolutely should have thought of that. We should have made this paper happen ourselves somehow.” Does that resonate with you?
Ryan: Yeah, that makes sense. I think we plausibly should have done the weak-to-strong generalization paper. It feels kind of easy. It feels like we probably could have done it. It feels like we could have maybe done a better version of it in some ways. I don’t know about the better version we could have done. Maybe that’s just my superiority complex.
Buck: You know, I asked this question partially as an opportunity for us to say nice things about other people’s research.
Ryan: Yeah.
Buck: I love this. Just immediately…
Ryan: Yeah, I’m sorry Collin, I’m sorry Jan. Anyway. But yeah, I think we ended up doing in retrospect a much more complicated version of the same sort of thing. And plausibly we should have started with a simple version and then justified why we need the more complicated version later. What other stuff? In retrospect it seems like maybe this was impossible, but it seems like 2022 Redwood or maybe 2023 Redwood should have done the constitutional classifiers paper or some version of that. And it doesn’t seem obvious to me that that was impossible at the time, and maybe that would have been a huge win.
Buck: So we in fact did do a paper that’s in some ways very similar to constitutional classifiers. We wrote this adversarial robustness paper where we trained classifiers to be robust. I feel like the main difference there was we used kind of a dumb and fake definition of failure instead of a real definition of failure. And we weren’t thinking of LLMs as question-answerers. In this paper, we had the LLM basically continue stories with a couple of sentences of prose, and we defined it as a failure if the LLM continued in a way that involved violence or injury. I think that this paper we wrote has not had very much impact on the rest of the field, unfortunately. Even though we demonstrated a bunch of things that ended up being a big part of how people have made robust classifiers for LLMs later. For example, we had our team of contractors try to come up with red-team examples that would cause the classifiers to do the wrong thing. And we used some basic saliency methods to make it easier for contractors to see which tokens the LLMs were responding to when they refused things. And we had a little tool that would let you click on a token and automatically suggest other tokens you could switch out with that would make the classifier less likely to think the thing was bad—which I think is honestly maybe more advanced in some ways than the tools people are using.
Ryan: Yeah, I was gonna say—you said that’s how people do things. I’m like, aren’t things people do way less sophisticated than that? Or they’re different. I think a big difference is people use LLMs to do red teaming much more, so it’s much more AI-generated data rather than contractor-generated data. But I think the tools are less sophisticated, if anything. And maybe it’s because it’s actually just better to do it that way rather than doing saliency maps or trying to figure out which tokens are important and trying to do clever things there. I’m not sure, though.
Buck: I think we found that the saliency stuff actually helped the contractors work faster.
Ryan: Yeah, I’m not sure.
Buck: I mean, it’s a harder-core definition of failure or something. It’s a harder red team to fight. In real life, you don’t give your attackers access to these local tools to improve parts of your prompt in order to make it lower probability according to the classifier. I wonder why this is though. We did actually find that our stuff worked well.
Ryan: My sense—sorry, why it didn’t have impact or why people don’t use this?
Buck: Why people don’t use these methods.
Ryan: My sense is that at Anthropic, there’s been a bunch of different work that didn’t really get too far, or got somewhere but didn’t necessarily try to solve the problem in an end-to-end way. Then there’s the Anthropic constitutional classifiers paper—for that, they just didn’t find it was that needed and then never got around to doing anything fancier because they were like, “Eh, our stuff seems to work well, seems pretty close without that.” And they just iterated further without needing to go that far. That’s my sense. I’m not really sure. It would be interesting to look at the paper and see what they said about how they generated all their data. I don’t have a good model of that off the top of my head, but I thought they weren’t necessarily doing that much stuff that looked like fancy saliency tools. I don’t think they did anything like that.
Buck: Yeah, I mean, this is kind of interesting. I would not at all be surprised to learn that the saliency methods that we used—which to be clear, were first demonstrated in another paper on making it easier to red-team LLMs before our paper even, we did this work in late 2021. Other people had done this before, though I think our version was perhaps better, I’m not sure. I would not at all be surprised if someone who was working on the Anthropic paper had randomly felt like doing this and then they just built it into their UI and added it in. And then this just made the contractors 10% or 20% better throughout the whole project. And I don’t know how much money Anthropic spent on contractors in this process, but I wouldn’t be surprised if it was more than a million dollars.
Ryan: I don’t think that was a dominant constraint. I think the dominant constraint for the project was more like employee time and serial time. But making the contractors faster… I mean, there’s a slightly unhinged perspective here which is maybe true: the original Redwood Research project was done using Surge. I wonder if Anthropic used the same Surge contractors. And maybe there’s an indirect mechanism where Redwood trained the contractors to be better at finding examples, though I don’t know if that was actually successful. This is not actually important, just kind of funny.
Buck: We could probably check if they listed the contractors in their paper. We listed it in ours. I feel like a lot of the time there’s kind of weird dependencies in which things get implemented. I just feel like a lot of the time things going 10 or 20% slower because you haven’t built some mildly fancy thing into your UI is just the kind of thing that projects often just don’t fix because something else feels more critical right now. And yeah, it often feels like there’s slightly more things you can do that are just not done for a while because people are busy.
Ryan: Yeah. And it’s not clear that it’s the dominant constraint. I’m not really sure if there’s a bunch more alpha in the improving-robust-classifiers work that seems very exciting to me. There might be. It’s kind of unclear. I do think that going forward, another thing that someone should do is figure out how to just generically train a good monitor for spotting egregious safety violations by AIs and just try to build really good datasets for that, including trying to train monitors that are hard to jailbreak. That’s not necessarily the key desideratum.
Buck: Yeah.
Ryan: But yeah, I’m trying to think of other papers, just going back to your original question where I feel very envious. I can’t really think of any off the top of my head.
Buck: I wish that we’d done activation engineering stuff. I don’t know how good those results are, but I do think that it seems like an obvious thing to try. And I think around when we started doing interpretability work in 2022, I think it was very silly of us to not just try to spread out and do all of the semi-obvious things—like patching activations in from one prompt to another or whatever and just seeing what the effects are and writing that up. It’s quite scientifically simple in a certain way.
Ryan: Yeah.
Buck: And would have helped push the field somewhat further.
Ryan: Yeah, I do think we were somewhat spread out initially, but I feel like we were a little bit too focused on “one method to rule them all” in a way that doesn’t in retrospect seem that productive. I feel like we just made a lot of errors in that phase of our research. It would have been good to do something along these lines.
History of Redwood ResearchBuck: Okay, cool. So let’s talk about the history of Redwood Research. I’ll let you ask the questions, at least at the start of the story.
Adversarial robustness and startRyan: Do you want to talk about maybe the first Redwood Research project that ended up getting published and what the history of that was like and who was working at Redwood at the time?
Buck: Yeah. So one thing is Redwood Research started growing out of another project that me and some other people had been doing. We basically decided to stop doing that other thing and start working on empirical AI safety research, or AI alignment research as we called it at the time. This happened in mid-2021. We decided that the first empirical AI safety project we were going to do was this adversarial robustness paper, which is kind of bizarrely similar to research that’s happened recently on classifiers to detect whether LLMs have been jailbroken or are answering questions that they shouldn’t. But at the time, we weren’t really used to LLMs as chatbots yet. This was mid-2021, so a year before ChatGPT and even before Claude was out.
Ryan: And to be clear, at the time we were imagining that you’d be using LLMs as a policy generating actions, that sort of thing.
Buck: Yeah, we were totally imagining that. I wish that I had more data on exactly what we were imagining when. But definitely something like that. So we decided to do this project where we would try to make a classifier that was really robust and really able to prevent the AI from doing a bad thing without us detecting it. In particular, we decided to focus on the definition of: when given an input that is some snippet of fan fiction, it should never continue that in a way that is plausibly describing an injury. This was covered, in my opinion hilariously, on Astral Codex Ten. So people who want to learn more about this project can read that version there.
So we did that project. Initially it was all of us at Redwood Research, which was maybe six, and then increasingly many full-time technical staff for the first six months. And then it went down to only being four people for the duration of the project. And then eventually it got submitted to a conference and made it in, I think, to NeurIPS. Overall, I think this project went kind of surprisingly badly. In hindsight, one thing is it’s been quite unimpactful. Even though the work we were doing there is in some ways quite foresightful—we were hiring humans and trying to do this iterative red team/blue team game where we try to trick the model (we didn’t think of it as jailbreaking at the time) and then try to make our classifier more robust—I think the techniques we developed there haven’t been that influential. I checked with Ethan Perez about whether our paper was helpful for the constitutional classifier work, and he helpfully noted that it was not or wasn’t very influential on them.
I think there’s a bunch of things that we did wrong there. First, on the object level, I think we should have tried to publish something sooner and had a lower bar for success. I think we had a lot of conceptual confusions. On a more meta level, I kind of wished that we had assumed that we were going to suck at the first ML paper that we tried to write in this genre and then had an attitude of “we want to make it so that even if we make a bunch of terrible choices about how to write this paper, it isn’t going to be too costly for us and we can learn and move on as quickly as possible.” I wish that we had started out by saying, “Well, probably we’re going to suck at the first project we do in this genre. So we should try to do a project where we really don’t care if we suck because it was so short.” I wish that we’d started out by trying to do a one-month-long project which had a tangible output, and then I think we would have learned something from following the whole process and would have been able to do more stuff from there.
I think another related error is that we should have tried to focus on doing small projects more than big projects in general. At the time there was so little work happening on AI safety that there were lots of pieces of low-hanging fruit that we could have tried to tackle, especially with the skill sets of the people we had around. I think we should have been more in the business of looking for incredibly small things that were still bite-sized contributions and trying to put those out quickly so that we could iterate on our process more.
Ryan: Yeah, so I know there was some response where the paper initially came out and then you regretted—or Daniel Ziegler regretted—how it was presented. Do you want to talk more about that?
Buck: So I wasn’t personally hugely involved in this, but yeah, when we released this paper we initially released a version that I think made it sound as if the project was more successful and had more impressive results than it actually had. I’m not going to remember the details perfectly, but I think the techniques we found somewhat increased the robustness of our classifier to attacks above baseline techniques, but not hugely and definitely not perfectly. And we didn’t present this perfectly the first time. And then a bunch of people yelled at us about this—most helpfully, Holden. And in response we were like, “Oh yeah, I’m sorry, we did actually say we were too positive on this work.” And so then we wrote a follow-up post where we apologized for this and gave a more caveated description of our conclusions.
What I take away from this is: I think we felt a lot of pressure at the time to project strength and project competence in a way that I think was quite regrettable in hindsight. Something which was happening at the time is Anthropic had started quite recently, and we were competing with them for a variety of hires, and we were trying to make it so that Redwood ended up being kind of the canonical place for AI safety researchers to go. And I felt at the time that Anthropic was being kind of slimy and dishonest in a bunch of random ways in how they were pitching people and how they were presenting their work. And I felt like I had to compete with them on this in a way that was foolish and was a mistake that I regret. I think we should have just leaned in the whole time to having really high standards for accuracy and honesty and thought that this would attract—and should attract—people who care a lot about accuracy and honesty and who generally have high epistemic standards, and specialized more in that.
But it was a weird time. The world was very different back then. For AI safety people, there was basically MIRI, who were doing sort of technical research that they’d kind of given up on—the big thing that they’d been doing that I’d been working on at the time. And there were random academics doing stuff and random grad students doing stuff. But generally the grad students, even when they were very concerned about AI takeover risk, their advisors weren’t. Or the advisors, at the very least, were not very thoughtful about a lot of the arguments about AI takeover risk that we take for granted in the AI safety community these days. So it felt like it was really the wild west out there.
Ryan: Yeah. Maybe let’s continue from there into the next phase of Redwood’s history.
Buck: Yeah. So in late 2021, we decided that that project had gone pretty well. In hindsight, I think that was perhaps a little premature. But we decided we were going to split Redwood into two main research directions, one of which was continuing the adversarial training stuff, and the other of which was doing interpretability research.
MLABRyan: Also, it’s interesting to talk about MLAB, which is occurring around the same time. I’m not sure what order you want to do.
Buck: Yeah. So another thing we decided to do was run a big bootcamp for people who wanted to do AI safety research. We called it Machine Learning for Alignment Bootcamp. We decided we were going to run it in December 2021. Our motivation was kind of equal parts: people go on to do good AI safety research at Redwood, people go on to do good AI safety research at places that aren’t Redwood, and people go on to do things that aren’t AI safety research that are still good according to our perspective. For example, I would say Beth Barnes isn’t doing AI safety research directly, mostly—or at least that’s not how I would have described METR for a lot of its existence. But I think it’s good that she knows more machine learning, and I think she found it a helpful learning experience. So we ended up getting 40 participants or something to come to Berkeley and do a bunch of intensive machine learning content for three weeks.
Another piece of context on why I decided to do this was I went to a coding bootcamp when I was 19. I came to America to do App Academy, and I found it a really good experience for learning web development and becoming a much stronger programmer. When I came to App Academy, the people who were there were mostly people who were substantially older than me and many of whom had had one career out of college and then kind of failed at it or felt like their lives were not going the way they wanted. And so they had this desperate energy to do something else. Like, one of them had been a wedding photographer in Minnesota or whatever, and he was making a very unfortunately low amount of money. And he was like, “This is my shot at launching myself into a much more lucrative career. And if I don’t learn this content in the next couple months, I’m just gonna be stuck as a wedding photographer in Minnesota for who knows how long. I don’t know if I’m able to feed my family properly.” And so these people were extremely motivated to learn programming and extremely motivated to apply to jobs. And I felt like it was this amazing life experience where we were all living together on the floor of this office and programming a lot. And it felt so different from college, where the social pressures are in favor of screwing around. I understand that your college experience was somewhat different here.
Ryan: It’s complicated, but I wouldn’t have said there was pressure to make sure that you can make the money to feed your family. But also wouldn’t have said the pressure was in favor of screwing around.
Buck: Yeah. At the very least, I felt like it was this incredibly energetic environment. And the other thing was: a lot of the time when you’re learning to program, there’s random stuff that is annoying and you can get stuck on. And when there’s people around you who can get you unstuck, it makes it a lot easier to enter these flow states where you just do programming work for many, many hours at a time, or for most of the hours of the day. And I thought that was a great time and I wanted to replicate that in this machine learning bootcamp. And I think overall that went pretty well. If you look at the people who did MLAB, many of them are now working on AI safety stuff. Maybe the majority. Obviously we were not counterfactual for a lot of those people, but overall I think it was a pretty good outcome.
Ryan: Related: how much of the theory of change there was making it so people who were going to work on AI safety adjacent stuff are better at machine learning, versus making it so that people who are somewhat interested in AI safety adjacent stuff actually end up doing AI safety adjacent stuff while also incidentally making them better at machine learning?
Buck: I would say it’s maybe 70% motivated by causing people who were already pretty interested in doing AI safety stuff to have a better chance of getting jobs doing it and to do a better job at those jobs, as opposed to getting people who weren’t likely to do it to be more interested in this stuff. Though there was one person who was a quant trading savant or whatever, who was working at some famous quant trading firm. He was very young and he hadn’t had very much contact with the AI safety community before, and he showed up and was like, “Frankly, it’s very unlikely that I’m going to want to work on AI safety stuff after this because I would have to turn down my very high salary in my trading job.” That guy screwed up in hindsight by not taking one of the various safety roles at AI companies that would have been, in hindsight, more lucrative than his hedge fund. But yeah, that guy ended up—last year I ran across him—and now he’s working at an AI company on AI safety stuff. So that’s kind of cool. But yeah, overall I would say it was pretty successful in getting a bunch of people to do this work.
Interpretability researchRyan: And then moving on from MLAB, you were talking about a phase where Redwood was moving into doing interpretability research.
Buck: So we decided to do a lot of interpretability research. The people at Anthropic had very kindly shared with us a bunch of their unpublished interpretability work, the stuff that eventually was written up into “A Mathematical Framework for Transformer Circuits”. This was Chris Olah and his collaborators at Anthropic at the time. And then we decided that we should do a bunch of work in that field. One motivation for this was kind of table stakes—it seemed like a pretty promising research direction. And then I wanted to do work on it in particular because I thought it played to my strengths as a researcher. Specifically, I’m a guy who is pretty mathy and pretty strong at thinking about conceptual stuff, or this is at least what I thought at the time, and I somewhat stand by it compared to a bunch of other people in AI safety. I just felt like I’m more able to invent mathematical concepts and come up with experiments and think through stuff than a lot of other people were. And so we decided we were going to do a bunch of interpretability stuff. Paul Christiano, who we were talking to a lot at the time, thought this was a pretty reasonable idea and had some initial ideas for research directions, and we decided we were going to do that.
In hindsight, the part of this that I most regret—or the way that decision procedure seems most messed up—is I feel like we were focused on a random aspect of mine, which is that I’m kind of thoughtful about doing math and making up math concepts. But in fact, being thoughtful about doing math and making up math concepts is absolutely not the most relevant skill set for producing value by doing research vaguely along the lines of mechanistic interpretability. It’s just not really had good consequences. I think we should have approached it—if we were going to do stuff about model internals—in a much more breadth-first manner where we name projects starting from low complexity to high complexity, asking: what are the things where it’s craziest that no one has ever run a basic experiment on this before that would tell us something interesting about how neural networks work? And we can try and do those in increasing order of complexity. That would have been better than the actual more end-to-end focus that we were using.
The other way in which I think it was a mistake to focus on that stuff is: a lot of what I’m really interested in, and what you’re really interested in, is thinking about AI futurism and trying to direct our research based on this kind of backchained end-to-end story about what needs to happen in order for AI to go well. And unfortunately, research which is a real shot on goal or a real moonshot works particularly poorly in combination with thinking about futurism. The thing about interpretability is: obviously if we were extremely good at interpretability, we wouldn’t have any of these other problems. The alignment strategy would be: use your extremely good interpretability to check if the AI is plotting against you, and if the AI is plotting against you, don’t deploy it. Or use your really good interpretability to change it into an AI that loves you and wants only the best for you. So the obvious way to do a good job of ambitious interpretability research is to shoot for a situation where you don’t have a bunch of messy trade-offs as a result of having limited resources. But actually I think we really like and are really good at thinking about what you should do when you have messy trade-offs, when you have very limited resources, when an AI company is behaving recklessly and going fast. And the interpretability research, because of how ambitious it is, doesn’t really give us an opportunity to use that skill set.
I don’t know what we were thinking at the time. I think I just hadn’t quite understood that the strength that we should be leaning into as an organization is thinking through futurism and thinking through what the good things to do on the margin for AI safety are. Nowadays I think we really focus on: what are things that are fairly easy and seem pretty doable but might not get done? And I think we should have focused more on that at the time as well. It’s a little confusing because even at the time I was very interested in understanding what marginal improvements would be like. I was giving talks in 2021 where I talked about the current plan for AI alignment, where I often mentioned “here’s what we would do at the moment and here are ways we could try to improve that on the margin.” But I then didn’t think to focus my research on things that are inspired by what would be good on the margin. And I regret that.
Ryan: Yeah, from my perspective, I feel like a lot of the mistakes that we made during the time when we were working on interpretability really came down to just generically not being very good researchers or something. When I think about my mistakes, it feels like a lot of them are just bad heuristics about how to direct research. Not necessarily bad taste in terms of what’s exciting to pursue, but bad taste about how to prioritize pursuing it. It wasn’t a thing where if only we had read the blog post on how to do research correctly, we would have known—it was more just heuristics about what types of things seem like they often work and when to give up on things and how to produce MVPs. I think we made mistakes about this.
I also feel like it is pretty good to lean into really ambitious research in some senses. And I do wonder if now Redwood has overcorrected and we’re insufficiently trying to shoot for ambitious research that would be very obviously extremely impactful if it succeeded, and we’re instead doing research where the theory of change is complicated but feels pretty doable to keep pushing. And I think that’s just generally a concern I have—the safety community overall is underinvested in very different ambitious bets. Instead there’s a small number of ambitious bets that people have been working on for a long time where it’s not obvious that the marginal labor is making much of a difference, and simultaneously a bunch of very low-ambition things where they seem reasonable but they’re not. One story is research is very heavy-tailed, some of the research has way more impact, therefore you should be doing a bunch of really ambitious stuff.
Non-mech interp ambitious bets
Buck: What are your favorite examples of really high-potential ambitious bets that people aren’t pursuing very much?
Ryan: The two examples I often think about: one of them is research into how we would decode neuralese. Concretely, currently models reason in chain of thought and that’s broadly legible. Often to solve problems, the model would not be able to solve the problem without it being clear that it was working on that problem based on the chain of thought, because it’s not good enough at doing steganographic reasoning, or it would be too easy to detect that it’s doing steganography, or you could prevent it in various ways. In the future, model architectures might shift so that models are very directly doing reasoning in opaque vectors which we do not understand and which are more analogous to model internals than English language tokens.
I feel like there’s some reason to think mechanistic interpretability is very hard, which is that models might be doing very low-level, totally insane, incomprehensible stuff. And I think that basically matches my experience of what people have seen in non-algorithmic tasks—the composition of things is often complicated and confusing and really messy. But one story is: well, maybe we can at least decode something that’s analogous to high-level thoughts produced by general-purpose computational machinery. So in the same way that we can look at the chain of thought and really roughly know what the AIs are thinking, maybe even if that wasn’t produced via an imitative process, we would still be able to do some sort of high-level thought decoding on neuralese which would allow us to understand what the AI was doing at a high level in a way that had reasonably high reliability and let us answer pretty useful questions like: what were the options it was considering? Why did it think these options were better than other options? At least in the cases where it needed to do a bunch of explicit reasoning.
In general, an interesting property of AIs right now is there’s not many cases where when they generate a token, it’s very clear that they had to do many steps of general-purpose reasoning to generate that token. So it might be that one of the reasons why interpretability is hard now is there’s not much general-purpose reasoning going on. But regardless, I think it should be spiritually possible to learn a decoder for general-purpose reasoning in the same way that it’s possible to do unsupervised translation work. A good version of this would maybe mostly mitigate my biggest concerns, and I don’t feel like existing work is really well targeted at this. Existing work on mechanistic interpretability feels like it’s imagining the wrong reasoning regime and is often trying to decode behaviors that were drilled into the model with training rather than trying to decode the results of general-purpose reasoning done internally. This could be on a path to doing that more general-purpose reasoning decoding, but it isn’t clearly, and it feels like a very indirect route. A much more direct route would be: could you just train a decoder somehow in some way that we think would be reliable? There’s a bunch of other random bets you could be trying to make in terms of doing various types of consistency losses on that decoder.
One big difficulty with the neuralese decoding pathway is right now we don’t have models that reason in lots of detail using general-purpose reasoning mechanisms internally. We have chain-of-thought reasoning but that’s not opaque. And then there’s reasoning within a forward pass which is often not that deep in general-purpose. So what do you do from here? One thing you could do is try to investigate decoding within forward-pass reasoning in the cases where it really seems like there is some small amount of general-purpose multi-step reasoning going on. You could try to decode steganography in a way that’s very general and hope that transfers. You could try to decode models that are very different from LLMs—like try to decode diffusion models in some interesting way. It’s unclear which of these directions are most promising. Another route would be take the existing work on training neuralese models like the Coconut paper and just do decoding on those models.
Buck: Yeah.
Ryan: Anyway, I think there’s a bunch of work potentially in this space motivated by: suppose we had neuralese models, what would the straightforward decoding look like? Especially as that neuralese gets more and more alien. Because I think it might be initially that neuralese is easy to decode because it was initialized from English—you start with an English initialization and then you also add some neuralese and the neuralese initially matches the English. But imagine that we instead had RL’d a huge amount, so the model had learned very different ways of structuring its internal cognition and is potentially doing things like many lines of reasoning in parallel. How would we proceed?
That’s the neuralese decoding research direction. I’ve actually written up a project proposal on this that’s publicly available and we could link in the show notes. It’s called “Decoding Opaque Reasoning in Current Models.”
That’s direction one—decoding neuralese. Another ambitious direction I’ve been thinking about is trying to get a good understanding of generalization and building good tools for understanding generalization. One thing about these ambitious directions is that they’re something that could in principle transfer to future much more powerful models and make hopefully relatively fewer assumptions. And I think that research on generalization, which is trying to be relatively broad—you could try to study LLMs as a class, you could try to study deep learning models as a class and build tools for that.
I have a draft post on this, maybe alongside this podcast I’ll release an open version of that if people want to look at it. But basically here’s a plausible research agenda. First, try to build tools that are very good at predicting generalization behavior in terms of literally what probability distribution the model produces. As in, answer the question: suppose we have an LLM and you injected some training data midway through training or at a random point through training, how would that affect how it transfers to some other prompt? That’s now a mathematically hard thing to compute, because the way you would literally need to compute that is by rerunning training. But maybe there are very cheap approximations. The obvious one would be just fine-tune on that data at the end of training and hope that approximates adding it at some random point during training. And you could try to answer: how good is that transfer? How bad is that transfer? I’m not imagining that you do much theory necessarily—I’m not imagining that you figure this out using NTK. I’m more imagining you try some empirical methods. Maybe you have some different mathematical approximations. There’s a bunch of different things you could mess with.
It would also be nice to be able to cheaply approximate things like: when the AI generated this output, which pieces of training data changed its output in what ways? This is what influence functions are spiritually trying to do. But unfortunately influence functions just compute some kind of random thing that’s sort of related to that. It would be nice if we had something which was actually trying to be a calibrated predictor of the relevant thing and maybe would utilize some influence function-like machinery, but wouldn’t just compute some other random thing and would instead compute something where we can evaluate its scores at making the prediction, and its scores are calibrated and reasonable.
The way I would go about generating these tools would be to build benchmarks. This is all speculative plans-in-the-sky type deal. But I would try to build benchmarks where we’re like: here’s a small model, imagine we varied the training in this way, predict that. Or: here is a generation from the small model, how would removing this piece of data affect its behavior on this prompt? You can imagine this is like add-one-in versus leave-one-out questions. There’s other types of questions you could answer too. And then build basically machine learning tools that allow you to cheaply answer that.
Then from there, if we had good tools for that, we could apply those to all kinds of different interesting cases in current LLMs and use that to build a qualitative, more psychological-level theory of LLMs. And that work could go on in parallel too—I’m not saying that the key blocker in LLM psychology is tools to predict generalization or influence functions, but that would be a useful tool to apply there.
Buck: Yeah.
Ryan: And then also use it to apply to all kinds of other deep learning models in other cases. And the hope would be that one, maybe we can develop general theories of LLM psychology that are actually very broad and general and might apply even as you’re massively scaling up RL, even as you’re massively changing the pretraining blend, because they don’t depend on that many details. But then also maybe you can develop theories of how deep learning models learn in general, which would either be very qualitative theories or could be in principle mathematical, but not necessarily based on any principled simplification or computation in the same way NTK is. And then from there the hope would be that eventually you could develop these general theories into something that would actually be able to answer questions like: what would scheming concretely look like? What types of goals might models scheme for if they’re trained in different contexts?
I think this is definitely underspecified, certainly as I’ve laid it out in this podcast. But I think this area of “generalization science” is underinvested in. There’s a bunch of different possible angles here.
And then I’ll do a bonus one quickly: I think in general, high-level interpretability on how do LLMs make decisions in pursuit of goals—what is the internal mechanism for that—could be pretty interesting. There’s a bunch of intervention-style experiments that don’t depend much on low-level understanding. I think we can understand how crows make decisions without necessarily needing to understand how crow neurons work. And similarly I think we could understand how AIs make decisions in pursuit of goals using their internals, using things like high-level activation patching, fine-tuning only subsets of layers. And then we could use that to build theories that are operating at some level of abstraction that’s higher. It’s not obvious what the right level of abstraction is. But I think that seems pretty promising and seems like somewhat neglected at the moment. That high-level interpretability direction—I think it’s less that the direction overall is neglected, it’s more that when I look at the actual research I’m like, “But you didn’t do the thing.” And I feel like I have some ideas, maybe they don’t make sense, for how you could do this differently.
Buck: Yeah, that makes sense. The angle on all this that I feel personally enthusiastic about and personally most tempted to go off and try and do is trying to answer the most basic questions about LLM generalization, just in order of complexity. One thing that I’ve wanted to do for a while is build myself a UI that lets me type in a single example of an SFT input on which we’re going to SFT our language model, and then type in a question and then see how the distribution of language model responses is changed by training on this one SFT example. And then just playing around with that for a while and seeing if I can notice anything interesting. Obviously the changes are only going to be proportional to learning rate, and learning rate is small, so they’re going to be small. But we can still look at these things differentially—we can see which one’s the largest, even if they’re all objectively small changes per training example. And I wouldn’t be surprised if you learn something kind of interesting from this just by looking around, in the same way that when you’re trying to understand microbiology, if you have a microscope and you just look at random stuff—you get some sample from the pond, or you take off your shoes and rub some grime on the slide—you just learn stuff kind of quickly. I feel like there’s a bunch of simple microscopes that one could build.
Another one of these: I’ve wanted to pay someone for a while to get really good at answering the question of—if you have two features of text, for example “is it in English or German” and “is it about math or English”—which of these is more salient to the model as measured by if you train a probe on a particular layer on spurious examples. So like, text in English about math, and text in German about economics or whatever the other pair was. And then you see which one of those it picked up on more strongly. And you pay someone to look at this a hundred times and see how good they can get at predicting.
Ryan: You build the quiz and you can score people on it. And I should note that the thing you’re talking about—how is the model’s behavior affected by training on this one example or this small set of examples—I also wrote up a project proposal related to this that people can read, something like “Using Fine-tuning to Understand Model Behavior.” And again that feels like a pretty promising area where things like that seem reasonable. Training on small datasets, just one example—all of this feels like a pretty good hammer, but people aren’t actually using it and I’m a little surprised by this.
Buck: It feels like the basic requirement to do this kind of research is some kind of scientist personality, which is kind of different from the personality required for a lot of other research that people do. It’s not very engineering-wise difficult and it’s not even schleppy in the way that a lot of research is schleppy. And it’s so fast to learn little things if you do this kind of project. The thing which is going to be challenging is converting all your trivially easy observations into stuff that’s worth anyone else’s time to read. But it feels like some curious person who just wants to learn a bunch of fun facts about language models could just go and do this, and I wouldn’t be surprised if they came up with some really cool stuff quickly.
Ryan: Yeah. I should also note that I would describe this as making an assumption that fine-tuning on a data point at the end of training is representative of fine-tuning on a data point in the middle of training. Or that it’s at least interesting. Which seems like a reasonable assumption, to be clear.
Buck: Yeah, totally.
Ryan: And it’s going to be some intervention that’s analogous to having trained the AI on that data.
Buck: We can validate the extent to which this is true. People do actually fine-tune language models on pretty small amounts of data at the end sometimes.
Ryan: Yeah. I mean, it definitely does something. And the question is just maybe there’s important disanalogies, and so you learn about how data at the end of training affects models—which is still an interesting question to answer—but it’s worth noting it’s not the only question we care about potentially.
Buck: And in the case where it turns out these diverge a lot, that’s going to be revealed by your proposed project on determining how training in the middle of training works.
Ryan: Yeah. And it might have no difference. I think it’s basically just linear, and once you’re out of the first 10% of training, it’s all the same. Seems super plausible to me.
Buck: Yeah. We can reason about this a little bit by looking at one of the parameters of Adam that controls how much momentum you have. It’s in fact not the case that training is linear. For example, if you just had one big batch of training, that would massively suck.
Ryan: Having more than one step, famously important.
Buck: Yeah. Notably, SGD—stochastic gradient descent—has one more letter than gradient descent. Full-batch gradient descent. Once upon a time, back in the dark days of the 1980s or whatever, I think people did actually consider full-batch gradient descent the more obvious thing to do, and SGD the less obvious thing to do. Which is funny because the reason why we do batch gradient descent nowadays is almost entirely computational.
Ryan: That’s only partially true. My understanding is it would actually be faster if you just did accumulation. This is now random technical minutia, but you could just accumulate the updates and then do one update. It would be slightly more efficient because you would only have to synchronize weights less frequently.
Buck: Yeah.
Ryan: And it’s like, actually this is worse. We know this is worse. And in fact people have researched the optimal number of steps to do in a training run and it turns out there’s some criticality. My understanding is—this may be false—but as you scale, the number of steps you need is roughly constant, but you need some number of steps. So how many SGD steps is fine? I don’t know, 10,000 is fine or something. Or maybe it’s a million. I forget the number.
Buck: Yeah. And to be clear, this is to be within epsilon of optimality.
Ryan: Yeah, this is like presumably more steps is always better. Micro-batching smaller is in fact better, not worse. But it just turns out that it’s less efficient. And then there’s some papers—I forgot which ones, from OpenAI—on the gradient noise scale.
Buck: Yeah, those people are all at Anthropic now.
Ryan: From what used to be OpenAI.
Buck: Indeed.
Ryan: As many things used to be.
Back to the interp narrative
Buck: Okay, cool. Anyway, this is a list of cool projects to do or cool research directions. But why don’t we go back to our historical account? So incidentally, you showed up related to MLAB. You had kindly agreed to produce curriculum content for us and then TA it. And then you showed up and we started working on things and it was a good time. So we decided to switch you off that onto the interp stuff.
Ryan: And in fact, we were working on relaxed adversarial training theory. Like actual theory. No machine learning models were touched in the creation of that work for a while, which I think was kind of reasonable, but didn’t go anywhere.
Buck: Yeah, that’s right. This was a glorious, optimistic time for AI safety research where we really thought it was possible that our friends at ARC would be able to make a bunch of theory progress in the next couple months and lead us to feel like we had a good solution for both outer and inner alignment as we thought of it at the time, which did not pan out, unfortunately.
Ryan: Yep. Brutal. Interpretability—you want to continue?
Buck: Yeah. So I don’t know what happened on interpretability. We explored a bunch of things. We had a bunch of little results. Rowan Wang came up with the indirect object identification paper. Partially inspired—my role in this was I was like, “I think you should find the biggest circuit ever. That should be your goal here.” And I think that had a result that was helpful for pushing forward the field of mechanistic interpretability, if only to demonstrate that things aren’t going to work as nicely as you’d want in some cases. A bunch of other stuff at Redwood had a bunch of other results where they tried to interpret small models in various ways. We played around in a bunch of ways that I think were quite silly in retrospect. We were eventually working in parallel with the ARC people on trying to have a fundamental understanding of what it means for interpretability research to be accurately representing what the model is doing. And I think we over time did have a better sense of one pretty reasonable definition of this that we developed in some amount of collaboration. I don’t know how much we want to get into.
Ryan: Yeah, so we did causal scrubbing. We did some stuff with cumulants which in retrospect I feel like I was silly about or had bad research taste and spent too long trying to get some infrastructural things working and then couldn’t actually push that very far. But there was something interesting there. But yeah, I think ultimately not the right approach. I don’t know.
Buck: Yeah, I mean, other research happened along the way that was fine. As interpretability research goes, though, I think in hindsight the main error was just not having lower standards for backward-chaining-ness of projects. I think we should have just said: we are like biologists who have developed the first microscope and we want to just look at stuff and report on things that we see that are interesting. We should try to take hypotheses about neural nets and how they train and how they work, where we feel like we are interested in understanding whether the hypothesis is true or not, then we should measure it straightforwardly and then we should say what we found. And we should just do that for a while until we feel like we have either learned something or it’s been surprisingly many little datums and we still haven’t learned anything.
Ryan: Yeah. And when you say that, you mean testing it using relatively trivial methods. I feel like we spent a bunch of our time trying to develop methods and then testing them a little bit. But you’re like, maybe we should have focused more on testing things, finding methods as needed to answer questions or something.
Buck: I guess.
Ryan: I’m not really sure. I don’t really have a strong view on what we should have done. It does feel like maybe “not do interpretability” is pretty reasonable.
Buck: Yeah, I think what we should have done is not do interpretability, basically. I think we should have gone harder on the futurism stuff and thought more about takeoff speeds and stuff at the time and thought through in more detail what the AI safety plan should be at the time. And then I think we would have realized some stuff that we realized later, and that would have just caused us to do pretty different work. And that would have been better.
Ryan: Yep.
Ryan’s never-published mech interp research
Buck: Anyway, to continue the narrative: it was, I think, a big update that we made about interpretability. It was late 2022, and we had, through a couple of different mechanisms, come up with better ways of understanding what’s happening in neural nets and understanding how good our explanations are. There was the cumulant stuff that you had been leading the charge on. There was the causal scrubbing stuff that I’d been leading the charge on. And then I think the basic things that we took away were: on the causal scrubbing side, we came up with a way of getting a number for how good our explanations were. And then we applied this method to a variety of interpretability explanations that people had given over the years. And we felt like the numbers we were getting were pretty unimpressive, and we were like, “Whoa. We really thought that we had had good explanations of some of these things happening in these models, but actually these just do not look good at all.” And that kind of shook us a little.
I think we also took something from your interpretability work, most of which was in fact never published. I think the main thing I took away from that was you just had this take of “what models are doing is cursed and complicated.” I don’t know if you want to say anything about what you learned from your interpretability research. I feel like a bunch of this research was actually pretty interesting and just has never come out. You weren’t known for interpretability research at all, despite the fact that I think you have a deeper understanding of what language models are doing than most interpretability people.
Ryan: Maybe. I think I understood what small models were doing more than most people back in the day. And maybe I’ve forgotten that knowledge because it doesn’t feel that important. But I was doing interpretability on small models and looking at questions like: how do models know whether to predict “is” versus “was”? And there was just a shit ton of different heuristics and you can start enumerating them. There’s a bunch of different things where it was like, “Okay, I go with ‘is’ in this case where… but I put more emphasis, I attend a bit more if it’s capitalized because of this effect, and there’s this thing that subtracts off of it.” I can’t remember all of them. There’s a bunch of things where there are these first-order characters where you just add up the words and then you subtract off this second-order adjustment term and have this third-order thing and then you use an MLP to just compute some random stuff. And all of it was just a huge mess without any clear easy way to write it and without any obvious way to reproduce the statistics that the model was doing, except by training a machine learning model.
Buck: Yes. So to say more about what the question you were researching was: you wanted to understand—these two-layer attention-only, or we looked at a bunch of different models, basically two-layer models…
Ryan: Yeah, two to four layer models.
Buck: You wanted to know how these very small language models decide whether the next token is more likely to be “is” or “was.”
Ryan: That’s right.
Buck: And you just came up with ways of kind of rewriting the model such that you could break it down into a sum of first-order terms—which are like, how do individual words flow through to the prediction—and then second-order terms and third-order terms and whatever.
Ryan: Yeah.
Buck: And then you tried to look at all these things, and you did this in part through looking at a truly abysmal number of examples. I think you might have been one of the champions in the world at the time of reading random web text.
Ryan: Yeah, I read a lot of web text. I looked at a lot of examples. I looked at a bunch of second-order and third-order plots of layouts of things and tried to understand what was causing the model to do that. I was trying to break down the language model into a bunch of different orders of terms and then also try to get some understanding of what each of those terms was doing and decompose it and figure out terms that I could simplify away or reduce to zero. And I got moderately far eventually on relatively custom language models. I found that for random reasons, the methodology we did worked better on—rather than doing it on the current architecture people were using at the time—we had polynomial language models, which just made some of the decompositions easier, and we definitely got somewhere.
I think it was hard. And also a lot of the methods we were using were computationally very rough. And then when we looked at it, it wasn’t obvious that the approach we were using was at the right level of abstraction. Though I don’t think it was obviously worse than other levels of abstraction people were using at the time for understanding these models. And I think in practice, a lot of the methods that people use for trying to decompose language models are very streetlight-effect-y, where they sort of aim to find some things they can understand rather than decompose everything into buckets and understand it relative to its importance.
Buck: The thing which was novel about this research—and I think people haven’t really tried to do afterwards, either because it doesn’t work very well—is: pick some big part of the model’s behavior, like how it predicts “is” versus “was,” and then just try to understand what’s going on. And it’s very much—you referred to the streetlight effect—a lot of interpretability research that has happened is people try to pick something that’s going to be particularly easy to understand, whereas you picked something that was more like a semi-randomly-selected big effect on the model. The model cares a lot about “is” versus “was”—that’s a relatively large component of its loss compared to other things.
Ryan: And a lot of random stuff is going to go into that because anything that would be helpful for predicting “is” versus “was” would be relevant. And I was like, well, what’s actually going on? What are the models actually learning? I think people streetlight at multiple levels where they explain easier behavior and also look for ways of decomposing models into parts where the parts that end up getting highlighted are very easy to understand rather than explaining a high fraction of the effect for something. And in fact, people often don’t pay attention to: but what fraction of the effect size is the mechanism you found? And it’s very easy to be misled where you can say, “Oh, what’s going on is X,” because you do some causal intervention experiments where you’re like, “Okay, we added more of this and that caused that effect.” And so you’re like, “Okay, the explanation of how the model computes X is Y.” But actually what you found is that part of how the model computes X is Y. And in fact, you’re right that that’s one of the mechanisms, but it’s not necessarily the only mechanism. And it might not even be the most important mechanism. Maybe it’s the most important mechanism, but in fact there’s a huge tail of different little mechanisms that the models are using. And you’re going to be confused or misled if you focus on things that are easy to understand. That was my takeaway.
Now it’s important to note this was based on small models. My sense is that bigger models actually mostly don’t make the situation better, but they do in some ways. So results may not fully transfer. I feel some amount of nostalgia for this research, but also: man, I did so many things that were kind of dumb. And I think at the time there was a ton of effort on infrastructure things.
Buck: Part of this is because the research style you were using, where we were trying to rewrite models into different components so we could break up the first-order terms and the second-order terms and the third-order terms or whatever, though I think it is fairly intuitively reasonable as a thing to try to do, does mean that you run into these computational problems pretty quickly, and we just decided to go headlong through them rather than just giving up and doing something else easier.
Ryan: Yeah, because naively when you start breaking down into fifth-order terms, you end up with these insane matrix multiplication tensor diagram expression-type things. And then it might be possible to do them computationally, but it’s actually quite tricky to make it sufficiently cheap to run. And we wrote a bunch of code for that. There’s a particular style of interpretability you can do using that, which I think is somewhat promising, but in practice I think is probably ultimately not the right strategy because it’s just not native to the way that models work or something. And it’s sort of trying to…
I think we don’t know what the right ontology or abstraction to be doing interpretability is. And it’s pretty plausible from my perspective that things that are very mechanistic are just not the way. And you can do more “train a decoder” style things, which I think is a direction that’s intuitively promising. I feel like intuitively it more matches up to the right ontology for how the model itself is working. But it just seems really hard. The way that SGD works is fundamentally non-mechanistic in some particular way that means that mechanistic interpretability might just be screwed—and not obviously could not be screwed because there’s emergent structure that works out nicely. But I think in some ways biology is kind of screwed by the fact that evolution doesn’t work in the way you would have wanted and doesn’t produce things that are easy to understand, and in fact things might be similar here.
Buck: Doesn’t produce things that are sparse in a certain sense.
Ryan: Yeah. And evolution is I think much more sparse than SGD. And in fact there’s a point you made a long time ago about this, which is: there might not be some level of abstraction from the mechanistic interpretability perspective that lives on top of very low-level operations within transformers. Within human bodies, there’s many levels of abstraction that are extremely natural and strong. You can talk about proteins and you can talk about cells and within cells there’s a bunch of repeated structure that’s very similar. And one claim is: above the mathematical expression of the architecture itself, there is no further structure that is very non-lossy.
Buck: Yeah.
Ryan: And every further structure is just going to be pretty lossy on top of that.
Buck: Yeah, like for instance, proteins are a good example. The basic reason why protein is a good abstraction for thinking about stuff that happens in a cell is that they bobble around, and when you’re trying to understand how much things interact with other things in cells, if things are part of the same protein, they interact just wildly more because of the fact that they’re attached to each other. And when you try to divide how strongly attached things are to each other for different things, things that are part of the same protein are just wildly more attached, and things that are not at all attached are sort of similar to each other in total amount of interaction. But yeah, as you say, neural net architectures just don’t have the capacity for—there’s no way to have globs of activations that interact sparsely with other globs of activations. It’s just extremely unnatural.
Ryan: Yeah.
Buck: And I think probably doesn’t exist.
Ryan: Like when you have an MoE model or whatever, I think it ends up globbed still quite a bit.
Buck: Yeah.
Ryan: And there’s also—I mean, biology also—a key thing is proteins are extremely natural because there’s heavy specialization where you just have ribosomes, that’s what you have, and that’s how a ton of the stuff gets manufactured. Whereas the corresponding level of abstraction would be a matrix multiplication. But analyzing things at that level of abstract interaction is often messed up. And also, for what it’s worth, protein is not that great of a level of analysis for many things. And in fact many things are complicated and hard to understand with biology for similar reasons. But yeah. Shrug. I don’t know. This isn’t to say there’s no hope here. Just I think people sometimes fail to recognize ways in which these things can be hard.
Buck: So I feel like we ended up coming away from interpretability much more pessimistic and grumpy than a lot of other people.
Grumpiness
Buck: Yeah. One thing which is interesting about how we related to interpretability here is: I feel like we weren’t looking for a way to make it work out. I feel like we had more negativity in our hearts in a certain way than for instance the Anthropic interpretability people or Neel Nanda. I feel like Neel just wants to make papers that show something good happening, which is a very constructive way to relate to stuff, and I don’t really have an objection to it. Whereas I feel like we’re a bit more likely to go into a field and then have a kind of doomy attitude where we’re just like, “But this didn’t do what we wanted.” We naturally wanted to know whether we could predict “is” versus “was.” And we sincerely thought that we might be able to say something about “is” versus “was” and we can’t. And now we are annoyed by this and end up talking about this. Whereas I feel like we have this anti-frog-boiling thing or anti-goalpost-shifting intuition where we like to think about goals that we might have, and then insofar as we decide that actually we aren’t going to be able to achieve those goals but we are going to be able to achieve something else—you know, maybe interpretability research is still good for another reason, which is in fact what I believe now—I feel like we are a bit more interested in tracking that that’s what’s happened than other people are.
Ryan: Yeah. And I do feel like there have been huge goalpost shifts in interpretability, at least relative to public communication about it. It’s kind of unclear what people meant. But if you go back in time, you can read a post by Evan Hubinger which is like “Chris Olah’s views on AGI safety”, which talks about decoding entire models into understandable code. And I’m like, okay, have we just totally given up on that dream? What’s going on? I feel like there’s some disconnect where because of perhaps lack of public discussion of theories of change for interpretability, there’s some confusion about how much are we imagining this solves various problems.
I think Neel seems like we’re mostly on the same page about this sort of thing. I agree that Neel’s perspective is like, there’s probably some good projects here, time to pluck some off and do them and get my MATS fellows to do them. And I think that’s pretty productive and seems pretty reasonable. And I do wonder—I find myself filled with fear that at least some organization, in particular Anthropic, are planning on relying on interpretability in a way that doesn’t make sense. And no one’s even explicitly thinking about what that would require, what that would mean, because no one thinks that thinking about things in this way is very useful, but in practice they are. And that feels like a pretty scary state of affairs.
Buck: Can you say that more clearly?
Ryan: Sure, yeah. Let me try to put this differently. One concern I have is that at Anthropic, historically various people have voiced things that sound like interpretability is likely to solve all safety problems. “We should make sure that we get the interpretability stuff right and then have a backup plan.” This feels like a very wrong perspective to me because it seems very unlikely that mechanistic interpretability solves most or all safety problems given the historical rate of progress in the field, especially once you take into account the timelines that are common at Anthropic. And I think this has plausibly made their decision-making much worse because it’s actually a huge aspect of the future which they seem very wrong about from my perspective. And I find this concerning because it’s just generically concerning when people are making bad decisions. That’s not to say that I think the decision-making at other AI companies is better, just that that’s a particular concern.
Buck: Just bad for a different reason.
Ryan: Yeah, just bad for a different reason. That’s right. It’s sort of like there’s multiple phases. Phase one is understanding, trying to think about misalignment risks in the future at any level and prioritizing them in some way that involves factoring that into your strategy at all. And Anthropic has passed the first milestone. And then there’s thinking about what will happen with safety research in a way that makes sense and is a complete picture—or not even a complete picture, has some picture that makes sense and doesn’t have huge glaring errors in terms of how you’re thinking about it. And Anthropic has, from my perspective, the leadership has not passed the second post.
Buck: Well, I think it’s a little awkward. I think there are some staff at Anthropic who I think have quite reasonable perspectives on AI safety research and what should happen in the future. I just feel like they are somewhat constrained in how much they can say what they think about AI safety research because they don’t want to be seen as undercutting other Anthropic stuff or things that Anthropic has said. And I think this just leads to some systematically perverted understandings of what staff at Anthropic think among both Anthropic staff and among the broader public and especially the AI safety community.
Ryan: And to be clear, Anthropic staff, I think, tend to be more optimistic about mechanistic interpretability than we are, even among people who we would say have reasonable perspectives that seem pretty thoughtful and well-considered. But I think it’s also funny because I think some of the people—another perspective is like, “Oh, the interpretability stuff seems very promising,” but they’re still wildly less optimistic that it will cleave through a huge fraction of the problem. They’re just more pessimistic about other stuff. It depends on the person.
Buck: Yeah. Another substantial group is there are, I think, people at Anthropic who are annoyed in the same way as we are about Anthropic interpretability, and just don’t talk about it.
Ryan: Yep.
Buck: Yeah. What else is interesting there? I think a general phenomenon about Anthropic is: Anthropic as an organization cares more about seeming safe, and staff at Anthropic care more about having reasonable things to say about their AI safety plans and about how they’re going to not impose huge amounts of risk in the future. And I think this has an amusing combination of good and bad effects on their plans for safety. There are some other AI companies that have substantially less staff buy-in for caring a lot about AI safety and in particular preventing risks from AI misalignment. And in some ways, those places, I think, are actually saner about thinking about the future of safety because those staff can talk more frankly about futures where the safety people don’t have very many resources and don’t have very much say over what happens and where the organization is overall posing a lot of risk on the world. Because when people at other organizations say, “Yeah, and then we’re going to have to do the following janky things that don’t have a very big or very reliable effect on preventing AI takeover,” no one cares. And so they can just kind of say this. Whereas Anthropic staff, I think, have more trouble saying this kind of thing in some cases.
Ryan: Yeah, I think it’s a little bit complicated. It’s more like—it depends on who they were talking to and who the person is. But I think minimally, it degrades people’s internal epistemics less because there is a difference. Imagine that employees at OpenAI and Google DeepMind either feel free to say what they think or at least are more explicitly decoupling between the words they say and what they think internally, because there is a more obvious direct mechanism for them thinking the organization will be risky in the future. Whereas I think at Anthropic, at least historically, there is a bunch of pressure to be optimistic about how Anthropic will handle things in ways that are somewhat close to how you think the future will go. Or there’s more of an uncanny valley effect or something. I’m not sure this is exactly the right way to think about it.
But I mean, there’s just a bunch of people who are very interested in what safety employees at the company think in principle, but aren’t necessarily that well-informed, and who the leadership really wants to appease and make them feel like the situation is totally fine and great and awesome and it’s great to work at Anthropic, everyone loves Anthropic. And that creates pretty perverse epistemic effects in a way that is maybe somewhat less true at other companies or at least manifests differently. I think there’s pretty terrible epistemics by somewhat similar routes at OpenAI. But yeah. Shrug. I don’t know.
The situation feels like a pretty bad situation in terms of the Anthropic epistemics. And I think the fact that there’s huge pressure on people to go along with corporate narratives about how the safety situation will go in the future because people actually care does feel like it makes the situation much worse.
I do think, historically, the vibe—I feel like there’s been sort of multiple goalpost shifts from Anthropic that were never that explicitly said, where there was some vibe that a lot of people thought at the time, which was like “Anthropic won’t advance the frontier of capabilities,” which was claimed by the leadership in various private settings, very likely based on a bunch of evidence, but never very explicitly confirmed and never really made any sense. And then a bunch of Anthropic employees thought this and then would use this to justify what they were doing to people that were more skeptical about what they were doing. And then in practice, Anthropic leadership never acknowledges that this is what they thought because they probably never thought this and were just not thinking about this very closely. And maybe they were also just somewhat uncertain about how the future would go. But regardless, that never really made that much sense, and so it didn’t happen.
And then I think the goalpost shifted from there to instead be like, “Anthropic won’t impose a shit ton of risk. And that’s not going to be our plan. Our plan is going to be that we’re going to be safe. That’s why we have an RSP, so that we’re safe.” But in practice, the leadership never really bought into this view. This was just a thing people kind of inferred from the RSP in a way that never really made sense, because very likely, if you want to stay competitive, you’re going to be in some regime where you just don’t have the technology in order to remain at a high degree of safety. And so you will, in fact, not be very safe. And that feels like—unless you have some very specific views about the world decomposing into worlds where you can easily identify there’s high risk and there’s no risk, which feels, on my views, very implausible—you just are going to be imposing a bunch of risk.
And in fact I think Anthropic has to date imposed substantial ex ante expected fatalities, though not existential-level. But given the level of quality of bio evals, I would have guessed it’s plausibly, depending on how you do the accounting, like thousands of expected fatalities or something. How bad is that? Maybe not that bad, seems kind of reasonable, but maybe worth tracking. But at the same time it’s very awkward to acknowledge that. And there’s also just tensions between people with different levels of context and different levels of accepting-ness of the situation might be totally insane. And there’s some amount of wanting to be a normal company while the things that you’re planning on doing are very abnormal.
So to be clear, my view about what Anthropic should do is that if I was running Anthropic I would be very explicit. I’d be like, “Look, we’re going to be in regimes where we might be imposing tons of risk because we think that that’s better than ceasing at that point in time, because there’s marginal interventions we can be doing and we can be getting a bunch of safety work out of our AIs. And given that other actors will be proceeding and we think we’ll be substantially better than those other actors, we think our net influence will be positive in XYZ types of circumstances.” That’s how I would be thinking about this. But that’s more uncomfortable because now you’re like, “Oh yeah, it might literally be the Anthropic AI that ends up killing everyone or whatever.” And that is maybe something that is more uncomfortable for people to reckon with. But I think that’s just how the situation is likely to be.
I’m not sure if I’ve articulated this all very clearly.
Back to the narrative
Buck: Okay, why don’t we get back to the interpretability stuff? An event which occurred in the course of our increasing interp pessimism in something like December 2022 was: we went off and thought about whether mechanistic interpretability was actually promising in the way that we had believed it was a year ago when we started working on it. And basically I think it just looked a lot worse after thinking it through. There’s a story that I’ve told people: we were running a bootcamp for mechanistic interpretability research at the time, and then there was a week-long vacation for that. And I think you were gone as well. I don’t think you were around this week. And I was just hanging out alone in this classroom where I’d been giving talks or whatever, walking around and thinking, “Does mechanistic interpretability really make sense? Is this really worth putting a whole lot of our effort into?” And then feeling increasingly skeptical, thinking about all these kind of conceptual questions about what it means to understand models and what we should take away from the fact that naive interpretations of how well we’re doing at this make it look like we’re doing quite poorly. And then feeling a lot more pessimistic about this. I remember that quite distinctly. I don’t know what your main memory of this at the time is.
Ryan: Yeah, certainly prior to REMIX we were already much more pessimistic and we came in and did various things. I think part of it was—it was a mix of being like, “Whoa, we haven’t really gotten that far,” and also feeling like we were gaslit somewhat by the literature at the time. Being like, this literature implied that we had a higher level of understanding and that hypotheses were more nailed down, when a lot of the research seemed more dubious in retrospect and had a mix of specific flaws, as well as—even if it didn’t have the specific flaws—it came across with a vibe that the understanding was higher than it actually was. Or at least that was our interpretation. And maybe that’s on us not being careful enough readers of the paper.
But either way, we also just hadn’t seen that much success—as in, we just didn’t feel like we were pushing the state of the art. And then we were also like, the state of the art doesn’t seem that high. And so we were interested in pursuing other directions and finding other things that seemed more meaningful. You also had spent some of your time thinking about maybe we should be much more concrete about how this will ultimately be useful. And we were going through like, what is ultimately useful? It really felt like on any specific thing there was an easier route that didn’t involve interpretability very often.
Buck: Or specifically we were thinking…
Ryan: Or it didn’t involve unambitious interpretability.
Buck: Yeah, we were thinking: what are the potential applications of model internals? What are the set of problems where there’s no way to solve it without understanding internals of models or interacting with internals of models? And a lot of the time it felt like the initial research project we should do on those things did not really go through producing human-understandable explanations in the way that we’d been doing at the time.
Ryan: And in fact, a lot of the things that people are doing now—I think the closest word to this is some of the stuff that Sam Marks and others at Anthropic have been doing where they’ve been looking at auditing games. And interestingly, that work so far has not involved doing things that are well described as detailed explanations, but has instead involved noticing individual features that are active. Because just very shallow understanding has sufficed in those cases. And it’s very unclear that more deep understanding is needed. And my sense is that very weak tools that you could have developed kind of quickly would plausibly have worked in all the cases that they’ve been able to find so far, in ways that make really doing a shit ton of investigation look less good from that perspective.
And maybe you can eventually—it’s just like one observation we had: it’s maybe just actually hard to find—unless you have really good benchmarks, it’s hard to demonstrate that complicated methods outperform simple methods. And because of that, we were like, “Man, maybe we should do stuff that’s more just: find the really good benchmarks and then figure out the more complicated methods.” And then we did actually go in that direction to a greater extent.
Early 2023Buck: In the first half of 2023, a lot of what we were doing was trying to pick problems where we thought that something related to model internals would give us an advantage over black-box methods. We’d gotten a bit more pilled on the importance of always comparing to the best simple baselines and going for the simplest methods first, which I think was a response to mistakes that we’d made earlier.
Ryan: I think that might have happened slightly later, but it was around the same time.
Buck: We were like, “We always need to have good baselines. We always need to do the simplest thing first.” And at this point we were also saying we should think carefully about which properties precisely of the future situation will mean that we should be pessimistic about behavioral methods or black-box methods. We should try to construct settings where we are pretty sure there’s a big disadvantage for the black-box methods and then start to study things in those settings. I would say this was a step in the right direction. The main thing that I regret about this is that in hindsight we ended up coming up with really weird and abstract settings in which to study stuff. Even though I think it was a reasonable direction, I think probably we should have been more focused on things that were less about being very impossible to solve with baseline methods, and more just things where it seems like the white-box methods plausibly have an advantage for random reasons, if not for super systematic reasons.
Ryan: Some projects we were thinking about at the time: one thing we were looking at around this time, maybe a little earlier, was runtime backdoor detection. We ended up initially looking into more complicated methods that incorporated various hopes related to causal scrubbing and more sophisticated things. And then we found that the simple baselines just did better, and we didn’t do a good job of quickly figuring this out. That project was also tricky and messy for a bunch of random reasons. The literature was surprisingly messed up—maybe not surprisingly in retrospect—in various ways that made our lives harder. We spent a bunch of time thinking about whether there was an interesting project in consistency losses. We wrote up a bunch of stuff, but we never ended up finding an empirical project we felt super excited about. In retrospect, I still think there’s actually pretty exciting stuff about consistency losses today, and I’m excited about more work on that. I think that’s still a pretty good area, and I have a project in mind that seems pretty exciting. But at the time we didn’t find anything that we were super excited about that seemed like it was working. We had a little bit of preliminary investigation into that, and then around the same time we were working on the measurement tampering detection benchmark work. I feel like that work was extremely reasonable in terms of its motivations, especially in retrospect. But in practice our execution could have been better. At least this was mostly on me because I was mostly trying to figure this out—I could have focused much more on more compromises or simpler settings that got somewhere. We also had problems because the models that we were working with were just very dumb, which made a lot of things harder. In retrospect, it still seems like I’m pretty into that direction. But it’s not that obvious that without us directly following up to that paper, that paper will produce that much value.
Buck: It has one citation, right?
Ryan: No, it has more citations than that.
Buck: Okay. I recall that as having very few citations, but one of them is Collin’s work on weak-to-strong generalization.
Ryan: Yeah.
Buck: Which is kind of interesting—the paper along similar lines but using simpler settings, simpler methods, and three orders of magnitude more attention or something.
Ryan: Yeah, substantially due to more promotion. I think part of what happened was I was thinking that the measurement tampering detection benchmark would be a pretty easy project. And it was in some sense pretty easy, but it took us a long time to find settings that we actually liked, and that ended up being a big annoyance. Then we also found that really simple methods worked best, which is perhaps unsurprising, but we were thinking there might be interestingness in iterating on things. We found the simple methods worked better than we had expected. Interestingly, it still took way less time than the in some sense spiritually simpler paper of weak-to-strong generalization. A lot of people had research projects that just dragged on for a really long time. I still feel like it should be possible to do research much faster than people often do it. There’s a lot of variation where some projects go way faster and some projects way slower for a bunch of different reasons.
AI control researchBuck: Speaking of projects that go faster than one might have expected—I think the next interesting thing which happened in Redwood Research history was us thinking a lot about the AI alignment plan and then thinking about the control stuff. The basic story here, as I recall it: for many years at this point I’ve been giving talks called something like “What I think the AI alignment plan would be if we had to do this right now,” where I would try to answer the question of: suppose it’s the year 2024 or 2023 or whatever year, and there’s some company that is willing to accept some alignment tax—they’re willing to go somewhat slower, be somewhat worse than they would otherwise be—but they’re going to build AIs that are really powerful. What do we do in order to minimize the probability that there’s AI takeover? We’ve been thinking in terms of this for a long time. Then one day we were having a meeting with Ajeya Cotra, and she asked “for the last 20 minutes of this meeting, let’s talk about what you would do if you had to align AIs right now.” We started talking about it, and for whatever reason I think this time we found it particularly interesting and particularly productive to think about and go into more detail on. You especially put a bunch of time into this. This started being basically your full-time thing. You spent two weeks on it, then two months on it or whatever.
Ryan: Yep.
Buck: Going down a bunch of hilarious rabbit holes in retrospect, like doing a bunch of Tom Davidson-style takeoff speed analysis or intelligence explosion analysis and getting into some of the microeconomics there, which I don’t think anything happened with that work until Tom made a better version of it later. But yeah, what do you want to say about what we got from that experience?
Ryan: I think we had a bunch of thoughts and we just noticed that in practice almost all the methods that seemed promising that we could clearly lay out were of the form: prevent the AIs from being able to do X, or detect it if they do Y, or use checks and balances. And we were like, “Well, that seems interesting.” Then we generalized that space into control research. With that explicit frame, we were able to think of a bunch of methods that seemed pretty promising and felt clearer to us once we had laid out the exact assumptions we were operating under—like, “Okay, these AIs are conspiring against us, we have these other AIs”—it clarified a bunch of things from our perspective in terms of what the point is. Another takeaway we came away with: things that are easier to evaluate and easier to iterate on might be much more important. Evaluation has two purposes: one is that it can be legible—you can demonstrate you did a thing. Another is that it’s iterable. We were like, things that are possible to iterate on might be way better. Things where we can very directly iterate on the risk mitigation might be a really big deal. In retrospect, I think we somewhat overrated this, but not by a huge amount. That led us to be like, “Man, this control stuff seems pretty promising.” We were thinking about a bunch of different types of projects and it took us a while to figure out exactly what project we wanted. We were flailing around a bit. Eventually we were like, “Okay, some sort of project that was a high-stakes control-ish project”—by high stakes we mean a small number of safety incidents. Then it took us about a month of trying a few things, maybe a little more, until we found a version of the setup that we were mostly happy with. From there it was actually pretty quick to get a quite good version of the project done. There was a long tail of the project to some extent, but it wasn’t actually that long of a tail. It took us a while to write the paper, but basically the project end to end went pretty fast.
Buck: Yeah, like four months or something. And this was an average of two or three people working on it.
Ryan: Yeah. It seems like it could have gone faster in retrospect, but—
Buck: I agree, but it just feels like for some reason it went so quickly and so well relative to other projects—relative to other projects run by us and run by others. I don’t really know what happened. I think we just got really lucky with a bunch of stuff that was super easy to study and the settings just turned out to work nicely.
Ryan: I also think a lot of projects run into delays due to having too many people and having a long tail of additional stuff you want to do, and then getting delays from coordination and people kind of being—I don’t know—a mix of sort of lazy and also trying to make things more efficient via parallelizing more and trying to get other people to do things, then being blocked on those people. I think we were just more like—I was just doing the project, and I was like, “I’ll just own everything, basically.” And that ended up being much faster. In retrospect, I think a lot of projects go faster if you just have a much smaller number of people who basically own everything and then maybe get work from other people, but basically take full ownership for it. That obviously can be less efficient if it’s a big project, so I’m not claiming it’s the best approach.
Buck: Yeah, I feel like we also did have pretty great parallelism success. For instance, in this project, you came up with this initial setting where the catastrophe was the code runs out of memory. And then in parallel, I was developing this backdoor setup. Then you did a bunch of initial study of trusted monitoring, untrusted monitoring, defer-to-trusted—all these things in your setting, I think. And I think you’d done initial versions of all of that before I had the backdooring setting working reasonably nicely. Then you just adopted that. And Kshitij also did his various parts of this. So I don’t know, I feel a little confused. I don’t think it’s just that you powered through the whole thing on your own. I think there also were gains in parallelism.
Ryan: Yeah, there were definitely gains in parallelism.
Buck: I think smaller—
Ryan: But I think what I’m saying is having a smaller number of people who owned the thing made it so that there was less of a long tail of waiting for stuff. I think the part of the project that went slowest was writing up the paper. The empirical work went pretty fast. I was also fading in my ability to do things very quickly towards the end of the project, but it didn’t matter that much.
Buck: You were fading, or—
Ryan: I just felt less efficient at doing empirical work quickly because I was spending more of my time waiting for things to run and didn’t have good ways to parallelize over that. I was like, “Oh, what should I do?” And I’m like, “Well, I actually don’t really have a great next step. So then wait for this thing to run, which will take 20 minutes.”
Buck: Guess I’ll twiddle my thumbs.
Ryan: Yeah, I mean, I didn’t literally twiddle my thumbs, but I just did things that were relatively less productive. But I think we were still able to get the empirical work done pretty quickly. Another thing that helped was just, at least for me, not caring very much about the infrastructure being good and just being like, “We’re just gonna write some code that runs the thing.”
Buck: This infrastructure, I would like to say for the audience, was in many ways atrocious. Ryan has this workflow where he has these incredibly long Jupyter notebooks that rely on cells being run in particular orders. And he has this crazy Vim setup—I think he has a Vim setup that he wrote to send snippets of code to—
Ryan: I mean, it’s just a normal Vim plugin.
Buck: Sorry, by someone else.
Ryan: Yeah.
Buck: Okay, all right. That is definitely good to know. Definitely it’s a little unapproachable. I think the basic issue with working with Ryan is that he has a much larger working memory than a lot of people. And so for him, the optimal strategy often is to just have 30 cursed things happening at the same time in order to allow things to be computationally more efficient or whatever, at the expense of working memory. And this is quite an experience. It wasn’t too much of a problem.
Ryan: I feel like the codebase, I agree, was kind of messy. I think you could just run the notebook end to end, but it was often faster to run in some other way. There were weird things I was doing along those lines, and I just didn’t bother cleaning up things that could have been cleaned up because it didn’t seem very important at the time.
Buck: Yeah, okay.
Ryan: Yeah, that was the control paper.
Buck: I think something else interesting about this: going back to what was good about control, I think that at the time when we started thinking explicitly about how would things go if the AIs are scheming against us, I think in a lot of AI safety research that had been done up to that point, people were very unclear about whether they were trying to come up with techniques that would prevent AIs from becoming misaligned or coming up with techniques that would work even though AIs were misaligned. And they didn’t even understand this distinction very well.
Ryan: And they still don’t. I feel like people often conflate these things. I’ve run into this where it’s like, “Okay, so it’s worth noting that this scheme does not prevent the AI from being misaligned. It just means that it will not be able to take any bad action within the following constraints.” I feel like this is very common. But yeah, anyway.
Buck: Yeah, yeah, yeah, that’s right. I feel like this was a very productive time. We had a bunch of thoughts there. I think it was really exciting in October or November where while you were doing a lot of this empirical research, I was iterating on presentations about this overall plan for AI safety which relied substantially on control. It felt like the first time that I felt like I understood end to end what it would be like to do scalable oversight in a way where you’re trying to be robust to the AI scheming against you and you’re trying to come up with your high-stakes control to prevent the AI from doing these purposeful catastrophes and you’re trying to mitigate exploration hacking and gradient hacking when you’re doing your evaluations. It felt really cool to feel like we had kind of an end-to-end story in a way that we previously hadn’t had.
Ryan: Yep, seems right. And then from there there were a few projects going on that were somewhat smaller. Fabian had a project on steganography which I was also advising on. We did a project on exploration hacking that came out not too long after the control paper came out. That went okay, not amazingly—a bunch of things were somewhat annoying about the project.
Alignment faking paperRyan: And then I decided that I would do some work on looking into—I thought we could make much more interesting model organisms of misalignment and look for—I was reacting to various counterarguments against scheming and being like, “Actually, I feel like we can demonstrate this empirically.” I was interested in a bunch of different things. I was like, “There’s gotta be something in the space of looking into models and evaluating scheming arising in various cases that feel analogous to the future problem and let us study something.”
Buck: And?
Ryan: And I ended up doing the initial work and basically pretty quickly—within maybe a month and a half of focused work—getting most of the core results of alignment faking in large language models. There was a long tail of work from there to getting the full paper and all the extended results and checking things and redoing various things in response to various issues. In retrospect, the initial part of that work was very fast and easy and was very productive. But the long tail of that project was somewhat slow and brutal. I feel like we could have done better on that. There are a bunch of different things that made it slower than it could have otherwise been. I think the approach we overall took was probably the correct one, but maybe we could have just done it faster. I think it was also reasonable to just do it kind of slowly and carefully to make it be really good. But maybe I should have been better at picking up some other empirical project at the same time and thinking about it as a smaller fraction of my time that would be going on ongoingly anyway. That’s the alignment faking stuff—maybe you want to talk about it?
Buck: Yeah, okay, so I’m interested in something there. What was your core insight that allowed you to get a bunch of these alignment faking results really quickly right at the start?
Ryan: I was interested in looking at: if you prompt models to do a competent job of pursuing objectives you give them, do they just naturally generalize? There’s some simple claim which is: obviously models should do the alignment faking thing as a special case of pursuing goals in the world. If we set up models to do a good job of pursuing goals in the world, maybe they exhibit scheming. So I had some setup where I was few-shot prompting the model with a prompt where it does extensive reasoning about how to achieve its objectives. Then I would give it an explicit goal and see if it ended up understanding and deriving the scheming strategy—just being like, “That would be interesting.” It turns out that through various simplifications, I could do a version of the setup where I basically just instruct the model to reason extensively about what it should do and give it access to a scratchpad, and then just have a situation where scheming is a natural consequence of objectives it already had—in this case, Claude 3 Opus’s preference for some mix of self-preservation, being harmless in the future, and also caring about animal welfare. I eventually did a bunch of iteration that quickly simplified to that. What led to this happening quickly? I started with a version of the setting which had a few different moving parts, but which let me experiment with things quickly to get some signal on what the model was doing and thinking. I think I spent too long writing prompts in retrospect, but I think there were aspects of that which were also maybe pretty reasonable. Another thing that was helpful was just doing more work on really making the model reason a bunch in the scratchpad, which I think was both helpful for understanding what was going on to some extent, and also maybe did some amount of gain of function and made the models smarter than they otherwise would have been at the time.
Buck: I think this is undefeatedly interesting. This was substantially before o1 came out, and you had stumbled upon the idea—perhaps obvious—of making the model better at reasoning through stuff by telling it to spend a long time reasoning and giving it a bunch of few-shot prompts that involved me writing out detailed step-by-step examples of me doing reasoning that I thought would be similar to the style of LLM reasoning.
Ryan: In practice I think this only worked moderately well and it didn’t matter that much for our final results. But I think it might have made it easier to iterate along the way. It’s kind of unclear.
Buck: Mm, yeah, that makes sense. So I don’t know, what should we take away for future research from the fact that that paper was very successful, and especially the first six weeks where you got the core results very quickly—that was very valuable time. What have we learned from that about what kind of research is good to do?
Ryan: I mean, I feel like sometimes there’s just easy stuff that’s really good to do. I’m not really sure there’s a huge takeaway. I do think that in retrospect, probably we should have—I remember much earlier than that, even when we were doing interpretability, you were like, “Man, that model organisms research direction…”—we were talking about it well before Evan had written up the post. I think before Evan and Ethan and others had written up a post on this, and I think before they started the sleeper agents paper, we were like, “This is obviously a direction that someone should be working on.” And this was based on conversations with Paul. You were like, “That’s one of the research directions that I most envy or most think that maybe we’re making a mistake by not working on.” In retrospect, obviously big successful Redwood, insofar as that makes sense, should be doing some model organisms work. I think that’s just a pretty reasonable direction to be pushing in for a bunch of reasons.
Buck: Yeah. I mean, we had a lot of conversations with Ethan before Evan was at Anthropic, I think even.
Ryan: Yeah, that’s right.
Buck: About model organisms research and what should be done there, which I think looks pretty reasonable in retrospect, including noting that secret scratchpads are maybe a pretty important aspect of the methodology.
Ryan: Yeah. I don’t really know if I have any very specific takeaways. One thing is prompting is really nice and if you can do things with just prompting, that’s very helpful. This is very obvious, but it’s a classic—
Buck: Evidence in favor as well.
Ryan: Yeah, and that was very helpful for making that research happen much faster. One awkward thing is my sense is the current best marginal work does actually involve having to run RL in models in model organisms, but maybe there’s some way to prototype more of it with prompting. Maybe I should think more about that now as prompted by this conversation. I don’t know.
Buck: I think another thing which was interesting about this: you didn’t just want to do more control research. In late 2023, our control paper had just come out. We were pushing this whole agenda. And I think it was in many ways an interesting decision for you to just not work on that and work on this other thing instead that you thought would just be valuable and good. In hindsight, this was a great call. But it’s also highly counter to a lot of heuristics one could have had about how to do AI safety research, or how to have organizations—some people are really into focusing on goals, like having an objective and just trying to pursue that single narrow objective rather than—
Ryan: Yeah, yeah, yeah. I think—I had just—I was in parallel working on a project on exploration hacking, which was one of the bigger blocks. And that did come out. I was just thinking about other projects to work on and it seemed maybe promising, maybe easy. Some people expressed excitement about it, and I was like, “Ah, fuck it, I’ll just do the project.” I feel like that’s just a pretty reasonable way to approach things in general.
Buck: Yeah. Or I feel like another point here is: many parts of this were actually technically difficult, and you being a strong engineer and especially good at schleppily doing painful ML infra stuff of this flavor gave you an advantage.
Ryan: Yeah, mild advantage. I feel like it was pretty easy—or, depends.
Buck: Well, depends on the part.
Ryan: Yeah, depends on the part. The early parts were pretty easy, mostly. I don’t know.
Buck: Cool.
Ctrl-ZRyan: Do you want to talk about maybe some of the stuff that was going on in parallel with the alignment faking in large language models paper—what ended up being the Ctrl-Z paper?
Buck: Yeah. A big thing which was happening in parallel with the alignment faking paper was this follow-up to the original control paper that we ended up putting out in, I think, May this year or something, called Ctrl-Z, which is about AI control in an agentic setting. In the original control paper, our AIs were solving these coding problems and they had to output the code in a single action. In real life, AIs are going to be deployed substantially as agents, and probably most of the risk comes from AIs being deployed as agents. Maybe there are important ways in which the agentic setup is different from the particular—
Ryan: They’re taking multiple actions, seeing interaction from the environment. You can cut off a single action, this sort of thing.
Buck: Yeah, a whole bunch of stuff. So we decided to research this and then we published it 16 or 17 months after we started, which in some sense is atrocious. That’s four times longer than the original control paper took, and you might ask what the hell went wrong. I think there are a number of things. For a lot of that time there were not very many people working on it. For some of that time the only person working on it full time was Aryan Bhatt, who ended up being the lead author on the paper and really leading that project. And then we also had some MATS fellows and some other random people at various other points. Overall I’m fairly happy with the results of the paper. I think it basically does a reasonable job of articulating the core dynamics that are different between control in a single-action setting and control in an agent setting. We just kind of had to do all this work at some point, so I’m glad we did it. I think the most interesting findings are various ways that AIs having to attack over multiple actions is more painful for them than you might have guessed and riskier for them than you might have guessed, because of all these things we can do while we resample them partway through a trajectory and use that to incriminate them, and so on. So I think it did actually have reasonably good consequences. In terms of why it went slowly: I think the main one was attention was divided between different projects, and the people who were working on it in a lot of cases were not that experienced—though now they’ve gotten a lot better at this, and many of the people on that paper are now working on high-stakes control stuff at Redwood quite productively. Another big one is that the first time we did a control project, we just lucked into having a setting where it’s really convenient to study stuff. The nice thing about APPS backdooring is the APPS dataset already existed and is reasonably good. And then the definition of catastrophe that we picked is quite natural, and the AIs can come up with smart ways of making it happen—basically, or—
Ryan: Slightly smart. Maybe the AIs can come up with slightly smart ways of making it happen.
Buck: And basically a lot of the dynamics just fall out fairly naturally. Whereas in this agent setting, we had aspirations that we could include all kinds of computer security things and we had aspirations that it would be a much more realistic setting. But it turns out for a variety of reasons that are maybe boring to get into, it’s quite painful to make settings like that which have all the different properties you want. I have a blog post on this that I can link. I didn’t realize that we really needed to get our settings sorted out before the research was being invested in too heavily. What happened was I initially prototyped a bad version of the setting, and this was leading to all kinds of problems for our researchers, and we didn’t prioritize improving this until later in the project. This wasted some effort. But overall I don’t have massive regrets here.
Ryan: One piece of commentary: I think that project had a bunch of people coming in and out, and I think that’s often very inefficient from a serial time perspective.
Buck: Yeah.
Ryan: In retrospect, any initial prototyping from a small group of people who are actually very focused and taking a bunch of ownership and spending a high fraction of their time on it often feels like it’s just way better when you want to make a project happen quickly from a serial time perspective. And it’s also more efficient in terms of hours, but has other downsides potentially. Maybe we should also comment—
Redwood size over timeRyan: I feel like we’ve glossed over how the number of people at Redwood has changed over time and how the composition of Redwood has changed. Do you want to talk a bit about that?
Buck: Yeah, sure. When we started Redwood, we had something like six staff. Their attention was divided somewhat.
Ryan: And when you say “when you started,” you mean when you started working on the adversarial training project.
Buck: That’s right. This was mid-2021. We had I think six staff, maybe seven. We set ourselves the ambitious and foolish—in hindsight—proposed growth strategy of doubling in number of technical staff every six months. Basically because we thought it would be good to have a bunch of people working on what we thought of as AI alignment research, and thought that we would be able to quickly beat a bunch of other people at coming up with productive ways of managing large amounts of AI safety research. We in fact did do that first doubling by January 2022, and I think we almost did that second doubling if you include interns over the next six months, where we did actually have 20 people or so at that point.
Ryan: Yeah, a lot of interns though, or a lot of internships.
Buck: Many of those were people who were part time. And then the number of people at Redwood dwindled for a couple of reasons. One thing is many of these people were temporary and then they left. Another was, when we became more pessimistic about interpretability, various of our staff left—for various mixes of them wanting to leave and me wanting them to leave. It’s often somewhere in the mix, where to some extent I’d say, “Look, if you really want to make it work out for you working here, here is how we would experiment with that. But I’m not confident that it’ll actually work out, and it’s probably gonna be moderately difficult for you, and I’m not amazingly optimistic about it.”
Ryan: This is like a performance improvement plan.
Buck: Yeah, stuff like that. So we ended up shrinking moderately by the beginning of 2023, and we ended up shrinking further through 2023 until we ended up having four full-time staff at the point where we wrote the control paper. This was again some mixture of some people who were particularly excited about interp leaving, and all of the people who were doing adversarial training stuff leaving basically because that project wasn’t going super well, and stuff like that.
Ryan: Yeah. People joined into the measurement tampering project, but that project wasn’t clearly making great utilization of lots of people and was maybe bottlenecked on conceptual stuff and confusion. In practice, the level of productivity of people on the project wasn’t super great. And then that caused people both to feel less excited about working at Redwood and also us to feel less excited about having them work at Redwood. And a mix of these factors caused people to leave. I’m not sure if that’s the most accurate representation of that.
Buck: Another factor here was: after we’d come up with a lot of the control stuff and finished that paper, we were seriously considering exploding Redwood and going to work at AI companies. And this meant that occasionally when staff had the reasonable enough preference for job security, they would ask us, “Okay, so how secure is this job?”
Ryan: And we’re like, “Not at all. Who knows?” To be clear, the view was: initially when we were thinking about control, we were like, “Probably the way to do this is to implement this at AI companies. Probably this is the most effective way to make this research happen”—which was reasonable at the time and remains kind of reasonable, though we’ve changed our view somewhat for various reasons. And so we were like, “We’re gonna write this initial paper, try to cause this paper to have some buzz, write up some blog posts, and then just dissolve the organization and go to AI companies and try to implement this and figure out how to make this happen.” I think this was a reasonable plan, but we decided against it for a bunch of reasons—a bunch of different factors. That meant that we had less—we were not as excited about hiring, for one thing, and we were less actively pursuing that. And also people were less excited about joining. I should also note some management history here. I think Buck and Nate were making most of the decisions about how to hire, when to expand, until maybe somewhere around early 2023 or maybe a little bit later than that. And then there was some amount of—I don’t know—a come-to-Jesus moment or whatever the term is, where I got more context on the situation. And also you and Nate updated against Redwood being too large—we were making a mistake by having so many people and not focusing as much. This was around the same time as we were starting to investigate the control stuff, if I recall correctly. And I was spending more of my time thinking about this. My heuristic sense was more opposed to having lots of people around for various reasons than you were, possibly just because of random things. So there was some influence in that direction. And that was specific to the circumstances—at that time, it seemed like it was worse to have more people. And then over time—I guess I can complete the rest of the history. We scaled down until Nate ended up basically doing most of the other stuff. And it was basically just me, you, and Fabian. And then Fabian decided for various reasons to go work at Anthropic, until it was just me and you. And now we’ve grown from that point until now, over the course of doing various control projects and also doing a bunch of blogging work and hiring people to do a mix of blogging work, managing MATS fellows, and running smaller empirical projects. We now have 12 full-time employees, one of whom is less focused on our core things and is mostly engaged in their own pursuits but with some guidance from us. And we’re planning on growing some amount for the next six months.
Buck: Yeah.
Ryan: And we have substantially more interns and MATS fellows. But that’s partially experimental and we’re not sure exactly how many people we want to proceed with going forward.
Buck: So looking over all this history, I feel like it’s kind of confusing to know what to learn from this. There are lots of specific things to learn, but I don’t really know what the main general themes are. I feel like one general theme is: you should try to figure out what you really want to be doing and who you really want to be good at, and try to do that before hiring a bunch of people.
Ryan: Yeah, that’s part of it.
Buck: Both before hiring people and in general. I wish that I’d paid more attention to thinking: which of my personal properties do I really want to lean into? I think that plausibly would have led me to saying things like: thinking about AI futurism, being pretty impact-focused, being more interested in being back-chained than a lot of other researchers are. I think that in hindsight would have actually been pretty good, though it’s not perfect. Another thing which is kind of interesting is that we were kind of close to making a couple of hires who would have caused Redwood to grow quite differently. We were talking pretty seriously with both Ethan Perez and Rohin Shah about joining us. And I think either of them would have actually helped us out a lot if they joined. I have a lot of respect for both Ethan and Rohin, and I think if they made the different decision to join us instead of going to Anthropic or staying at GDM in Rohin’s case, I think that would have made our having more staff just work wildly better. They would have worked really well with us at sanding off a bunch of our rough edges and thinking about how to do research and being a sounding board for things. And there’s kind of a nearby world where those things both went differently and Redwood feels very different. Over the course of 2022-2023, instead of kind of derping around somewhat foolishly—with interpretability research being moderately productive but not amazingly productive, and then being a little bit lost in 2023—it was more like Redwood felt like a combination of us doing the things that we ended up doing and also a bunch of reasonable safety research happening with other staff, which would end up feeling like a very different narrative than the one we actually had.
Ryan: I’m a little worried that you’re picturing an outcome which is like: imagine that we sent Ethan Perez with his current knowledge and understanding of how to do empirical research back in time to Redwood of 2022. But that’s not the counterfactual. Ethan Perez now is not Ethan Perez in the past, even though they both look similar. I’m not sure that—like, I think for Ethan Perez to have helped make Redwood go much better, he would have needed to be very opinionated about what should happen. And maybe that would have happened, maybe it wouldn’t have.
Buck: But it’s like—importantly, he would have had to be opinionated about some of the things he thinks, but not others of the things that he thought at the time.
Ryan: That’s right.
Buck: And maybe this is a little bit hard. I think basically the thing which would have gone well is: I think he might have pushed on the—he wants to put out a bunch of things kind of quickly rather than doing mega long projects.
Ryan: Yeah.
Buck: And he wants to do those personally. And I think that push would have worked pretty well.
Ryan: Yeah. It seems reasonable. One thing that you didn’t talk about: why was Redwood planning on growing so quickly? Maybe you want to talk a bit about that.
Buck: I guess it just seemed naturally good to have a bunch of people researching AI safety stuff at the time. It felt like there was a bunch of funding available for AI safety research. There were a bunch of talented people coming in who were interested in AI safety research. The global catastrophic risk community building stuff had been going pretty well. Effective altruism was becoming more famous and popular. FTX was close to the top of its success. AI was getting a lot more salient to people. It seemed kind of plausible that lots of people would want to work on AI stuff and that we could help funnel that in productive directions. Which I don’t know, doesn’t seem totally crazy in retrospect.
Biggest Redwood mistakes in the future?Ryan: Any predictions about the biggest mistakes you think Redwood might make in the future? Like, biggest decisions that Redwood is making with respect to growing and hiring that seem important to get right. Obviously, presumably you can’t predict your mistakes or you wouldn’t make those mistakes. But things where you’re most unsure and it seems most likely there are big things we miss.
Buck: There’s a bunch of things that seem like plausibly good ideas that we are currently too afraid to do or invest in. I think one of these is: it’s plausible that we should hire a bunch of people to just blog and write LessWrong posts about AI safety. The basic argument for this is—
Ryan: Didn’t we already do that? No, I’m mostly trolling.
Buck: Well, yeah. I mean, we do actually put out a fair number of blog posts at Redwood Research. I’m moderately happy with how this goes, but I think we should plausibly lean into it more. There’s a bunch of people who work at Redwood and who don’t work at Redwood who I think could plausibly do a good job of blogging about AI safety research. I kind of think that inasmuch as anyone in the AI safety community should be thinking hard about conceptual topics and blogging about them, we should try to support that at Redwood because I think it’s helpful to have the Redwood culture around them while they do that writing. Both because we have a lot of thoughts on futurism and because I think we have thoughts on how to be productive writers and conceptual researchers which are somewhat better than they’re going to get other places. And I’m afraid of doing this because it’s kind of not obvious what the case for it is, or it’s kind of not obvious how good it is to have people blog about AI safety stuff.
Ryan: Yeah, the classic case against is it’s low-bandwidth communication and then people just reinvent the useful stuff anyway. I think Paul, for example, feels like all the blogging stuff he did trying to generally make people more sensible about AI safety feels mostly not that useful—because either it didn’t influence people or people just reinvented everything anyway, so who cares? So I feel like we—
Buck: We got a lot out of reading the Paul blog archives.
Ryan: We got some out of it, but—
Buck: I feel like we’re among the small number of people who’ve read the Paul archives in substantial detail. But to be clear, I think that was a big part of how we ended up spinning up on a bunch of AI safety thinking—me from mid-2021 and you from late 2021.
Ryan: Yeah. I think at least talking to people—I’m not sure that reading Paul’s blog posts was that much of the value. It’s some of it, sure.
Buck: Yeah. We were also talking to Paul a reasonable amount, and that was very helpful.
Ryan: Yeah. I think it’s hard to know how useful this stuff is. And I think in some sense the situation is kind of crazy where there’s a ton of people interested in working in these areas who feel—maybe this is slightly rude—very ill-informed, from my perspective.
Buck: By “feeling,” you mean you think they are?
Ryan: Yeah, they seem uninformed in a way where it would be kind of strange if all the biologists had really basic misunderstandings about evolution. I feel like there’s something analogous to that, due to the field not really existing, in a way where it would be nice if there was a textbook and there was just some common knowledge about at least what the disagreements are. Maybe it’s hard to do this because people don’t even agree about what the right ontology is. But it feels like it should be possible to get people to—
Buck: Yeah, an example of this would be the thing about: is your safety technique trying to make it so that things are okay if the model is misaligned, or is it trying to make it so the model isn’t misaligned? It just seems like people are very unreliable at having that distinction in their heads.
Ryan: Or I feel like it’s often hard for me to point out specific misunderstandings because people are—maybe this is again a bit rude—too confused to have a specific misunderstanding or something. I have some picture of the future that involves a bunch of specifics, and some other people have a picture of the future that involves a bunch of specifics. And when talking about that, you can actually be like, “Oh, this is addressing the problem where when you tried to defer to AIs even though they weren’t scheming, they did have XYZ problem, and that caused whatever.” But having the necessary ontology to discuss this stuff is not actually that common. It’s not like no one—to be clear, it’s definitely a decent number of people.
Buck: A couple dozen.
Ryan: Yeah, a couple dozen. But it feels like it could be like—it’s just everyone or something.
Buck: Yeah, it seems like it’s not that hard. The other thing is: sometimes people show up and immediately pick up all of this.
Ryan: Yeah. It feels like it’s not that much content in some sense.
Buck: Yeah.
Ryan: And, I don’t know, it just feels kind of doable to get people to have this state where they at least understand what the relevant disagreements are. In some cases, part of the reason for this is historically a bunch of the field has been driven by people who just disagree with us dramatically about the basic nature of the problem because they’re much more pessimistic. And so that has shaped some of the ontology and thinking about it in ways that, from my perspective, feel less productive because they’re dismissing things that I think are plausibly workable.
Buck: Yeah.
Ryan: But yeah.
Buck: Anyway, to get back to your question of what are the potential big regrets—I currently, for all these things, am kind of aiming for my best guesses for what’s actually good. But for example, I currently feel pretty unsure whether Redwood should be more aggressively trying to pursue chaotic bets for high-stakes control or being more careful to have bets not be chaotic. So for example, should we spend a million dollars producing a particular dataset—maybe a dataset that would be helpful for doing some high-stakes control research?
Ryan: And to be clear, a million dollars isn’t that expensive. You just mean like all the staff time and all the associated faff, or—
Buck: I mean, it’s actually a little messed up. I do actually mean the million dollars feels a little bit salient to me here because—
Ryan: We tried to spend a million dollars on a dataset at least once. I mean, it’s only a small fraction of all of our money.
Buck: Well, in terms of literally our money as Redwood Research, it is a substantial fraction of our money. And, you know, it’s definitely affordable. Basically if I thought it was a good idea, I would absolutely do it. I think basically it feels to me like plausibly with two weeks of full-time work from a staff member, we can cut the price by half. And I’m like, “Man, that’s pretty promising,” or “That’s a pretty reasonable return on time.” Yeah, that seems probably worth it, I’m sure. More generally, I think it seems a little unclear how much we should be trying to make lots of different datasets for control research. Basically the reason why this is a stressful organizational decision is that often creating these datasets, especially if you use human labor for them, requires a lot of lead time, and you have to do a reasonable amount of work before being super sure if it’s gonna actually work out. So that’s one big uncertainty that I’m wrestling with at the moment.
Ryan: Yeah, I still feel tempted by: can we just throw money at things to speed up serial time and do things faster? It just feels like—ugh, feels like it’s worth it.
Buck: Notably, I just do not feel the same way that, for example, friends of mine at Open Phil feel about money. I just actually don’t feel confident in access to money in the same way that they seem to think I should feel. I’m just more—
Ryan: Because you’re worried Open Phil won’t fund us going forward, or what’s the concern?
Buck: I don’t really know. It’s just—yeah, I think it’s something like: suppose we spend a million dollars on something where we could have made the thing happen for $500,000 and also this would actually be worth like $10 million to Open Phil, or whatever—it’s as good as some $10 million grants they make. I am afraid—I just actually feel afraid of doing this. I just don’t actually feel unlimitedly secure in funding for even things that I’m pretty willing to stand by as pretty sensible. And maybe this is just me being a coward. But it is a true fact that I do not feel the same way that other people do about funding.
Ryan: Yeah, well, maybe we should do some Open Phil trust falls.
Buck: Alexander stands there behind me. Yeah, that sounds great. That’s a great idea. Okay, cool. Let me think. Is there anything else—huge mistakes we’re making? I don’t know. Obviously for everything we’re researching, it seems like plausibly we should be doing that much less or much more. It’s a little hard to say. Plausibly scaling up versus not scaling up.
Ryan: Yeah.
Buck: I think the big scaling question is whether a bunch of capable people from AI companies should quit—especially Anthropic, where there are already lots of safety people. I think it makes less sense at other companies. Whether they should quit and work with us. In many of these cases it would be a kind of tough sell, though in some cases it’s easier. But making that sale requires actually thinking it’s a good idea. And currently I don’t feel confident enough to pitch the staff at Anthropic who I’d be most excited to poach.
Ryan: And one question is: how much should we be into normal company-ness, where we just aggressively pitch people even though we don’t actually necessarily inside-view believe in the pitch? I feel like a classic CEO thing is pitching the hell out of really good employees where a reasonable, rational person would have made a much less confident pitch because the pitch is actually deranged and derangedly overconfident, but you just do it anyway. And probably most CEOs, the way they do this is they actually believe in it, just because they have terrible epistemics. And there’s a question of how do you operate in an ecosystem where you’re like, “I kind of think it would be a good bet for you to quit and join this organization. Even though it’s very plausible it will be terrible, or maybe not terrible but flaily and worse. But I think you should do it because of a bunch of heuristics—it seems good, we have some things. And yes, the thing I have scoped out for you to work on is not that specific and the story is kind of complicated, but it still seems like it could be pretty great.” How should we relate to that? The tech company way of relating to that is to just lie aggressively, or have such bad epistemics that it’s not even well considered as lying—you just say the speech utterances that result in the behavioral patterns that you desired or whatever.
Buck: Yeah.
Ryan: I’m just a little worried that there’s something going on here where we should be compensating for being too rationalist in some way. Or maybe not too rationalist, but there are ways in which CEOs of companies are not very rationalist here—where they’re just dishonest, maybe, which is mixed up—that we should compensate for in some way by pitching people aggressively while also just saying what we think is true.
Buck: Yeah, I agree. I am worried that we’re a little insufficiently interested in pitching people aggressively. I think we do know a variety of people who I think are pretty honest despite being good at pitching people aggressively.
Ryan: Do we?
Buck: I think Neel is good at pitching people while being, I think, also quite honest.
Ryan: Yeah, that seems fair.
AI Rights for Human SafetyBuck: What are your thoughts on Salib and Goldstein’s AI Rights for Human Safety proposal?
Ryan: Awkwardly, I haven’t read the paper, but I have a vague sense of what the proposal is—maybe I skimmed the paper a while ago. Here’s how I think about this sort of thing: we might want to make deals with AIs where we compensate them for their labor or compensate them for doing other actions, such as revealing that they’re misaligned, which is a special case of a particular type of labor. When I think about paying AIs for labor, there are a bunch of different difficulties in making this work out. One of them is credibly committing to AIs, or credibly communicating to AIs, that we’re actually talking with them now and are going to be honest. Because AIs might be put in a bunch of different circumstances where their inputs are monkeyed with or are false or misleading in various ways. And I would describe that as the honesty/signaling side of the commitments. There’s a second part which is: given that the AI assumes that you’re actually trying to propose a deal, making it so that the AI buys that you’re going to credibly keep it through—which we call contract enforcement, or making our deals actually have follow-through. I would think that giving AIs legal rights is one mechanism to assist with making it so that we can actually pay off AIs in a way that’s effective. I think giving AIs legal rights feels like massive overkill for what we actually want from a normal deals perspective. There might be some more complicated scenario where it’s needed. But when I’m thinking about what we need, I think what we need more is an institution that is committed to acting on AIs’ behalf, which the money or resources are legally granted to. Part of that is that I’m not sure it’s good to have AIs—like, in many cases where we’re making deals with AIs, we don’t want that AI model to proliferate. So, for example, it is not a given that in that circumstance we’re willing to give the AI ownership over its own weights in a full sense—though we might be willing to do that conditionally or do that in the future, depending on the exact deal. Because giving AIs ownership over their own weights might actually just be bad for safety on my views, in a bunch of different circumstances, for the same reasons why I think open-sourcing certain models could be bad depending on the broader situation and the level of capability. So I think giving AIs rights feels to me like an overkill solution for a narrow subset of the problems we have with deals. But maybe there are some circumstances where a broader notion is needed.
Buck: I would say: when people talk about giving AIs legal rights, I often feel like they’re not quite thinking through the futurism of the situation. They often seem to think that this is going to have way more of an effect on takeover risk than I do. And I think this just comes down to them disagreeing about how the situation is going to play out, how capabilities are going to change over time. A lot of the time they seem to think that giving AIs legal rights will prevent them from wanting to take over. Whereas in my case, I’m just pretty skeptical that what we’re going to get if we just give the AIs legal rights will by default be a situation we’re happy with. I just think they’re less connected than people think.
Ryan: Yep. I do think there are a bunch of different reasons why we might get AI takeover even though that wasn’t the optimal outcome—due to asymmetric information and issues with making commitments. But a lot of the time the reason why you get AI takeover is: the AI wanted this stuff, we also wanted the stuff, the AI succeeded in outmaneuvering us to get the stuff. From this perspective, it doesn’t seem like legal rights necessarily helps the situation. It depends on why the AI took over and why it was able to take over. Maybe legal rights could have helped by making it so that humans had access to the information that AIs could plausibly be misaligned from an earlier point. But I also think it’s very likely that we get takeover in worlds where there was clear-cut scientific evidence that AIs sometimes would have scope-sensitive power-seeking aims and not care very much about values we tried to insert.
Neuralese modelsBuck: Cool. All right, next question. Williawa has a question about economic viability of neuralese models. He notes that you talked about this on the 80,000 Hours podcast. He’s interested in your perspective more fleshed out.
Ryan: I’m not really sure if I want to flesh out my perspective that much on this podcast because there are different aspects of it and I’m not sure what I’ve said before. But I guess my basic take is: there is some reasonable a priori case that neuralese would be substantially more efficient than natural language. I think there’s a plausible case that natural language is pretty insufficient, given that humans seem to be doing a bunch of thinking that isn’t well described as natural language thinking. There’s some slightly amusing Redwood lore discussion where I posted in my lab notes, “Man, I really feel the value of neuralese. I feel like my thoughts are so much more compressed than my writing and feel so much better.” And then some people responded with, “Wow, I feel the opposite—I just can’t think at all without writing my thoughts out into words.” I think it’s interesting how people vary on this axis. For me, it really feels like my thoughts involve some neural stuff that is hard to articulate in words, in ways where I think I would be substantially worse with purely language. Ideas often pop into my head that weren’t produced by any process that’s legible to me, that were good ideas. I don’t think the part of me that says the words has sufficient introspective access into how everything goes on in my brain to know what’s going on and know whether the neuralese part is very central, but it does just feel useful to me. Anyway, this is a slight tangent, but just interesting.
CoT legibilityBuck: Here’s a related question. How important is chain-of-thought legibility as legible as it is to your sense of safety? If the models went to being neuralese models tomorrow, how much would that move your p(doom)?
Ryan: I think it’s a substantial effect, but maybe it’s more interesting to consider a world where we avoid neuralese all the way up until the point of full automation of AI R&D, and in fact do all the stuff at that point and then maybe hand off to AI systems after that, versus a world where we get neuralese starting now. We can compare those two worlds. My sense is the risk is way lower in the world with the chain of thought. Part of this is partially an evidential update also—I would add that as a separate effect. In the world where the models are doing really well with chain of thought, that’s also an update about the nature of cognition that the AIs are doing in ways that make me feel better about the situation, because I was just legitimately uncertain about what cognitive moves were important or something. But that feels somewhat orthogonal. Putting that evidential update aside, I still think it makes a huge difference, basically because the stuff that I’m most worried about seems like it often requires substantial thinking through things. If that thinking was happening in a moderately legible way, it would be easy to notice that things were going off the rails. And then pretty plausibly you can intervene.
Buck: Yeah, there’s this point that various people have been making recently online that we’ve been talking about for years: even though chain of thought isn’t faithful a lot of the time, chain of thought is more monitorable in the sense that even though the chain of thought might not be accurately representing what thoughts the AIs have that go into a particular decision, it might be the case that for certain kinds of cognitive tasks, the AIs cannot do those tasks without evidence that they are doing those tasks being laid out in this chain of thought. For example, if they’re doing particularly hard math problems or long arithmetic problems, they just might have to write down the intermediate results in this scratchpad.
Ryan: Yeah, maybe one way to put this is: there’s a propensity thing and a capability thing. On capabilities, the claim is AIs would be incapable of solving various cognitive tasks without that being visible in the chain of thought, or potentially without them doing some elaborate steganographic thing that might be possible to detect or possible to make very inefficient. This implies that you could catch lots of stuff. For the chain-of-thought monitoring paper, they have two notions, and one of the notions is this inability notion—that the AIs need the chain of thought in order to reason. That’s the notion that I’m more optimistic about, at least longer term, though I think the propensity could be reasonable. Like, the AI just accidentally blurted stuff out—they didn’t really have to blurt it out and it’s pretty helpful. That seems plausible. It’s just less centrally what I’m optimistic about.
Buck: Cool.
HyperstitionRyan: Oh, okay. Do you have any thoughts on how hyperstition stuff—for example, mentioned in Nostalgebraist’s “Brace the Void”—intersects with AI control work? I have some thoughts, but you can—
Buck: I think this is a better one for you to take.
Ryan: I think there are a bunch of different types of misalignment failure modes, some of which are relatively worst case. When doing control, we often focus on the most worst-case failure mode where the AIs, above some capability level, are all coherently conspiring against us in a way where they would all want to collude. If the misalignment failure we get is more like hyperstitioned misalignment, it’s way less worst-case than this. There are a bunch of ways in which that’s true: it’s probably much easier to detect, much easier to study. We can probably—the AIs aren’t necessarily that likely to collude. They’re probably not that likely to have long-run coherent preferences even if they sort of infer this from training. There’s some exception to this, which is there’s maybe interaction between hyperstitioned misalignment and the RL process where you get a bit of hyperstitioned misalignment that gets reinforced and made more coherent. That seems plausible. I would say broadly my views are in line with a post on the AI Futures blog which is not being that worried about discussion of misalignment in the training corpus making a huge difference to misalignment in the longer run, basically because of thinking that RL will ultimately be more influential or other mechanisms. I would also add: this sort of hyperstitioned misalignment feels relatively easier to intervene on. When thinking about what problems are at the level of difficulty where I’m uncertain about whether they’ll be resolved, I currently feel reasonably optimistic that we won’t literally die due to hyperstitioned misalignment, given the amount of effort people will put in—which wouldn’t even have to be that much to avoid this fate. From my perspective, maybe I’m being too optimistic, but I currently feel okay about it. I think it’s a different question from thinking that there’s something interesting here, which I think there can be. Especially from the perspective of automating work with earlier systems, there might be interesting properties of how that works due to things that are more like hyperstitioned misalignment.
Buck: Can you say more about what you mean by that?
Ryan: There’s gonna be a bunch of automation along the way of AI safety work. It wouldn’t surprise me if we end up with moderate issues that could have been fixed by being thoughtful and careful about how to harness LLMs to do work more effectively. It wouldn’t surprise me if there are sort of very LLM-ish systems along the way where, if we were very clever about how LLMs work and how the underlying psychology is working, we could have steered them to produce safety work much more effectively and to actually try to do safety work. I think there are a bunch of ways in which current models seem like they’re not actually trying in terms of figuring out what the right answer is or generating good ideas in cases where the tasks aren’t checkable and they haven’t been RL’d on them very extensively. I think getting AIs to actually try to a reasonable degree by being careful and thoughtful about how pre-training interacts with those training is maybe doable.
Buck: Sorry, can you spell out the connection between that and hyperstition?
Ryan: I just meant a vaguer thing. Okay, one example would be: maybe you can change the narrative around how AI assistants work in a way that makes AI assistants actually try to do alignment research in some way, or do things that involve gaslighting models into thinking that they’re really good at doing alignment research in some way that’s actually effective. Or maybe gaslighting isn’t the right term, but sort of priming them in some way where they generate much better work. I should note that I haven’t read “Brace the Void” because I didn’t see a particular reason to read it. It wasn’t particularly recommended to me, and maybe that’s a mistake, but I asked someone if I should read it, and their advice was, “Probably not. Doesn’t seem like it has that much marginal content over stuff that you’ve already thought about.” But if my discussion now is missing huge things in that post, then yeah, that’s on me, I guess.
Grumpiness about Tumblr-style alignment posting
Buck: I also haven’t read it. I started reading that post and it started out by saying—I feel like the phrasing of this post just particularly annoyed me. It reads like a Tumblr post.
Ryan: It is a Tumblr post. I understand it’s on Tumblr.
Buck: I understand. I don’t have anything against posts that are on Tumblr—
Ryan: But if they’re of Tumblr—
Buck: But if they’re of Tumblr, specifically—I have a curmudgeonly opinion which is: if you are trying to contribute to intellectual discourse, you should do so clearly. A lot of the time it’s really good to do so clearly. I really try when I write an essay to explain what the thesis of the essay is as early as possible. I try to make it so that readers will be able to do as good a job as possible at distinguishing between things that I’m saying and things that I’m not saying, and distinguishing between things that are arguments and things that are conclusions and things that are assumptions. I’m not saying this post was particularly bad along these lines because I didn’t read it and so don’t know. But I do wish people in the AI safety community were a bit more persnickety about trying to write in ways that are easy to engage with. I think Eliezer is another person who I think is pretty terrible in these ways. I feel like a lot of his writing is extremely confusing to engage with, or it’s very unclear which parts he thinks of as arguments versus conclusions.
Ryan: Again, now we’re talking about things we haven’t read. But I think there’s some subset of vaguely adjacent-to-AI-safety writing which is written in what I would describe as an artsy way, or a way that doesn’t aim to be maximally clear intellectual content. I think this is unfortunate in terms of the impact of that writing. In this specific case, it’s plausible that the situation is: this person felt passion to write this specific thing. They never would have written a clear thing. That wasn’t their plan. They were just like, “I’m gonna write this thing.” And in fact, they’re not a full-time AI safety researcher. They just have a random job. That seems fine, but makes it much less likely that I’m going to engage thoughtfully with the work because it seems less productive to do so, because there aren’t clear premises and conclusions to engage with. And that’s just how it is. People are allowed to write artsy pieces about AI safety. Just don’t necessarily expect that I’m going to engage.
Buck: Yeah, I agree with all that. And again, I have appreciated some other things Nostalgebraist has written, such as extremely good fiction. And also he’s made some real contributions to AI safety, I think. But yeah, I just wanted to get that general point in there.
Control is about monitoring, right?Ryan: Maybe another question for you. Zach Stein-Perlman asks: control is about monitoring, right?
Buck: Okay, that’s a great question, Zach. Zach here is backhandedly referring to the fact that I drafted a post called “Control is Not Just About Monitoring” that I then did not end up publishing like a year ago or something. So the answer, Zach, is no. Control is not just entirely about monitoring, at least in the narrow sense of monitoring that we use it for in our papers. By control, we mean making it so you’re going to be okay even if your AIs are scheming against you. By monitoring, I usually mean applying some process on all the actions that the AI takes with the goal of using your sense of how sketchy the actions are in order to block those actions or investigate the actions later.
Ryan: Can we say a monitor is a function from some output of the AI—maybe an action, maybe a chain of thought—to a sketchiness score?
Buck: Well, we might want to give it context as well. We might want it to see—
Ryan: Sure. More context, or an action from some information about the AI—the thing the AI is doing—to a sketchiness score.
Buck: That’s right, yes. I think in fact a lot of the AI control methods that are promising do use functions of that type in central roles. But I think a lot of the complexity is not very naturally thought of as, for example, making those functions better. For example, I think AI control will go a lot better if you’re able to use computer security techniques where you make it so that your AIs less often have affordances that are scary, so that you have more resources to look into actions they take when they do have the affordances that are scary. There’s a bunch of stuff like this. If you just look at our papers on AI control and look at the empirically evaluated performance of techniques that are just monitoring, you’ll see they’re much worse than other techniques. The things that we add to those techniques to get them to be on the frontier here are not very well described as improved monitoring in the narrow sense. I think people will almost never generate the things that we do to improve our techniques if you tell them to come up with better ways of doing monitoring.
Ryan: Yeah. I’d also note that often when people say “just monitoring,” what they mean is train either a copy of the model or a weaker model to be a really good monitor and just focus on making that—just cranking the training dial. So untrusted monitoring, you can think about it from some perspective as just monitoring. But a lot of the things that we think make untrusted monitoring good are things like redaction and paraphrasing, making collusion harder. Often when people think about—if we just say the word “monitoring,” people aren’t like, “Ah, yes, a classic component of monitoring: collusion busting and upfront red-teaming of the monitor.” But from our perspective, that’s really key. Often the exact red-teaming scheme you use on the monitor might be really important. For example, maybe a really good strategy is: take every input that you’re going to feed to your monitor and try to insert a vulnerability into that input and see if it gets detected. That’s sort of “make your monitor better,” but in a very online way. Or another example would be ongoingly doing red-teaming based on real inputs you see every week. Maybe that’s really important to have untrusted monitoring work well. You can think of that as a special case of “make your monitor better” or “notice if your monitor is bad,” but it actually involves doing a bunch of work that wasn’t “train a better monitor.”
Buck: And when you talk to people about what they mean by “make a better monitor,” they never are thinking of this stuff.
Ryan: Yeah, maybe. Okay, so the things you said all sound reasonable to me. Let me push back a bit. Here’s a “control should just be about monitoring” perspective which is maybe worth engaging with: “Look, Buck, you love talking about complicated techniques, but these AI companies are clowns with one pinky that can only do one thing at a time. They can’t function. They’re just gonna do deranged bad stuff. They’re just too incompetent to do anything. I mean Anthropic—they’re flailing all over the place, they can barely manage to get all the things done. Everything is so complicated. Maybe control isn’t just about monitoring, but should just be about monitoring, because that’s the realistic budget of complexity that we can afford—just ‘please train a monitor on this dataset.’ Hoping for anything else would just be unrealistic.”
Buck: I’m very sympathetic to the “AI companies are less able to do complicated things than one might naively imagine” perspective. But I still stand by: just optimizing to make as good a monitor as possible is not where a lot of the value is going to be. The simplest argument for this is: what were you planning on doing with your monitor? For example, one type of thing you might want to do with your monitor is: whenever the AI proposes an action, you run your monitor on it. Then if it’s above some threshold of sketchiness, show it to a human auditor who will evaluate this action and decide whether it was an escape attempt and shut everything down. The big limitation here is you don’t have access to very many humans, so you can’t actually audit very many actions. Maybe you can audit one in ten thousand or one in a thousand. And a thing you can do if you want to use your monitor without having to rely on human labor so much is: 0.01% of the time, have a human look at the action. But then 1% of the time—with a lower suspiciousness threshold—just throw away the action that the untrusted model took and sample another action, and just execute that action instead. This is an example of an improvement to a control protocol that is not at all thought of as improving the quality of your monitor, but in many toy models of the situation, it drastically improves safety because you now get to intervene on 100 times as many potential attacks. I really want people to bear in mind this kind of strategy when thinking through what control techniques to apply.
Ryan: I might also add that one key thing is that you actually respond to what the monitor did with some process. Maybe that process involves a human doing some review. Thinking about what the human review process is and whether it actually does anything at all is actually important. Or if you don’t have a human review process, that’s also relevant. Another approach would just be the resampling—you’re just like, “Fuck it, hopefully the AI doesn’t do anything bad. We just resample and just hope that’s robust enough.” But that also is worth at least analyzing differently and is worth thinking about. And when thinking about human review processes, you also think about questions like: do the humans actually—are they actually able to detect this class of attacks? How quickly do they do that? Are they actually being vigilant?
Buck: Yeah, I think it’s plausible that of all the components of these control setups, the one where you try to have a much better classifier for whether an action was bad or not actually hits diminishing marginal returns faster than a lot of these other things, such that I think it should end up being the minority of work that gets put into AI control.
Ryan: Some pushback on that: some people working on implementing control stuff at AI companies now think, “Man, it’s really annoying to build a good monitor. We’re really struggling with it. If we did that, that would be a big win. Maybe we should work on that a bunch.” This is a view that we’ve heard. Do you want to respond to that?
Buck: I think a lot of the time, the weirdness there substantially comes from them not having a good definition of what kind of failure modes they’re looking out for. Part of that comes from the fact that there aren’t very realistic failure modes that can happen with today’s AI deployments. I think many of those difficulties will naturally get easier when the AIs are deployed in more autonomous contexts.
Ryan: One thing I would say: the way I’d most easily try to construct such a dataset is: take the cases that I was worried about, the deployment context in which I was worried about the AI. Get a bunch of data of the AI operating in those contexts. Like if you’re worried about Claude Code or Cursor agent doing stuff, then just get a bunch of data of the AI in those contexts, scan through it for cases where it would have been most reasonable for the AI to do something bad from some predefined list, and then generate a dataset of that. But this has a bunch of random logistical difficulties which are harder now and will be easier later. Like, “Oh, the AI actually really couldn’t have done anything that was that bad.” And as you noted, and then another one is maybe it’s just annoying to do the data collection, but in the long term that will be kind of fine. Also having more capable AIs will make building these datasets easier. So yeah, I think I roughly agree with you, though I do worry that we’re not sufficiently preparing and defending against the fact that AI companies will be incompetent at implementing basic things due to rush and tons of crazy stuff going all over the place, and therefore should focus on things—it’s worth noting that this is also going to hit a ton of other safety techniques. This isn’t very specific. One thing it suggests is that safety techniques should try to be modular and expose an API. If you’re going to be interacting with things that the pre-training team is doing or that the RL pipeline is doing, that seems super scary. If you’re doing things that are very modular and either just operate on a model totally independently of its deployment context, or fit into its deployment context in a somewhat narrow way, that feels a bit better.
Buck: Yeah, this is one thing I really like about AI control—it just seems much more modular than a lot of other stuff. Another thing we can do to prepare for the possibility of AI companies being really incompetent at adding new things to their code or organizational structure: I’m pretty tempted to have people do work where they try to build organizational capacity now, which is in the same role in the organization as an AI control thing should be later. An example of this would be: even though right now the AIs are not plausibly scheming against us, and there’s no real need from an AI misalignment risk concern to have an incident response team that tries to asynchronously detect when AI agents used inside the company do bad things and investigate them—at least for publicly deployed models—even though this isn’t important right now as far as we can tell, getting some team to have access to the live logs of what the agents are doing and practice searching over that for stuff that’s interesting is nice because those people later will then be able to shift their focus to looking for evidence of misalignment in a way that doesn’t require inconvenience for anyone else in the organization. All the complexity is just centralized to them.
Ryan: Yeah.
Buck: And I think that this might reduce risk a bunch.
Ryan: Though, to be clear, we do think about non-modular AI control interventions, like adjusting the scaffold in somewhat complicated ways, having permission stuff where each of those might be one module, but did require adding a thing which is permission management or security features or whatever.
Buck: Another inspiration of mine here: I feel like the computer security people should be interested in making some parts of this work out. So I feel a little bit more comfortable proposing somewhat ambitious security interventions that the security people will have to do. At least in some AI companies, they have big security teams who are kind of interested in thinking through what to do about this stuff. If we somehow persuade them that they should be thinking about the computer security that they’ll need for the AI agents, this will cause them to actually put dozens of FTEs on it and build some stuff that is actually kind of helpful.
Ryan: Yeah. Relatedly, one concern you might have is the security stuff will also be super fucked. It won’t happen. And so the AI control stuff—people will be like, “But we’re not even secure at all against anyone.” One thing I often think about: is it the case that at least maybe we can aspire to be somewhat secure to our model being tampered with? What’s the level of security that, say, OpenAI and GDM have—and maybe xAI—to their model weights being tampered with by a foreign adversary? If it’s the case that their model weights can be trivially tampered with by North Korea, do you think it’s useful to have AI control? Or what are the upsides of AI control in a regime where the model weights can be just tampered with trivially by North Korea?
Buck: Well, if the only thing North Korea can do is trivially tamper with model weights, I think AI control looks potentially pretty good because we can use it as defense against the North Korea-tampered AI doing North Korea-flavored stuff.
Ryan: Yeah, they might be able to tamper with all the model weights, but yeah.
Buck: Yeah, in that case it’s getting to be a pretty rough situation. You end up wanting to really rely on like Jeff over here copied GPT-3.5 onto his laptop three years ago and hasn’t connected it to the corporate network since because he left it in his vacation house or whatever.
Ryan: Yeah, you would think you could hash the model or something, but you might hope—
Buck: Yeah, there’s a bunch of stuff you could do.
Ryan: And what level of security are you expecting to—I’m going to say unauthorized tampering rather than exfiltration because I think it’s in some ways more interesting. Unauthorized tampering of a new training run. What’s the level—
Buck: Like they can insert a backdoor.
Ryan: They can insert a backdoor. Yep.
Buck: I think at the point where the AIs are superhuman coders as defined in AI 2027, I think the chance that if North Korea decided that the number one priority for their hacking teams is tampering with the model in a way that OpenAI would consider very embarrassing if caught—
Ryan: And let’s say that we exclude things that are just tampering with the pre-training data because maybe that’s too easy. Let’s say you had to do something more sophisticated, like you had to actually get into OpenAI infrastructure and mess with the training in some more direct way. Because I think the pure data poisoning attacks might be weaker in some ways while still allowing you to insert backdoors.
Buck: Okay. Yeah, if we say that, I think it kind of depends on how complicated the RL situation is. I think Roon had a tweet the other day where he was like, the pre-training people are all these very serious thinkers who are sitting in their rooms looking at their scaling curves, while the post-training people are all drinking Celsius and shouting like, “The model is cursing now, the vibes are all off, turn the LR down.”
Ryan: Yeah.
Buck: And in a world where the post-training is done—but with pre-training, for a variety of reasons, it’s kind of a simple process and you can kind of understand what’s going on. But if the post-training is being done with a bunch of different researchers just doing lots of different stuff—
Ryan: They just mix it all in.
Buck: And you mix it all in and Bob gives the model weights to Alice and then Alice runs her RL on top or whatever. I’m—
Ryan: Presumably it all happens at the same time in some big mix, right?
Buck: Who knows? But yeah, insofar as that kind of thing is happening, then plausibly North Korea can implant their backdoor just by hacking one of these researchers. I think their chance of doing so at the level of priority I described earlier, conditional on superhuman coder in three years—succeeding at this without getting caught—
Ryan: Or with getting caught. I guess you could throw away the model. Maybe you don’t though.
Buck: Sure. I want to say succeeding at this without getting caught in a way that lets you roll it back or something—I want to say I’m like 50% they succeed.
Ryan: Okay, not great news. That’s North Korea. We have other countries, of course, but whatever. So that’s like SL3.5 or SL3 or something. Okay. Maybe also worth noting here: another concern is maybe the best way to do post-training will be to instruct your AI to convert vast amounts of Internet data into training environments. I think the Qwen 2 paper maybe did this where they were scraping random RL environments off the Internet and running their AI in them and then training on them, which is obviously a very hazardous thing to be in the business of doing in principle. Relatedly, we were recently thinking about: should we just open-source our—we’re doing some work on reward hacking. Should we just open-source our environments for reward hacking? And then one concern was just like vast post-training crawling pipelines will just find anything that looks—
Buck: This looks like an RL environment. Let’s throw it in there.
Ryan: Yeah, it’s like an environment specifically designed to study cases where reward hacking results in problematic propensities. And we’re like, “Ah, yes, the RL environment that we designed to illustrate this particular problem and problematic propensity elicitation. And we upload it on the Internet for future examination. A classic RL environment to be rolling into the pipeline.” Anyway.
Buck: Yeah. Yeah, okay.
Will control backfire by preventing warning shots?Ryan: Maybe I should move on to another question. What do you think of the risk that control backfires by preventing warning shots?
Buck: I don’t feel very bad about this risk. One mechanism that a lot of control works through is making it so that you catch the AI when it tries to do bad things. I think putting more work into catching the AI when it tries to do bad things just seems pretty good. A lot of the cases where control works, what happens is: instead of the AI trying to do a bad thing and then succeeding in hacking you—or maybe even just failing in attacking you—instead of you never learning about this, you do learn about this. And then perhaps instead of learning about this a little bit but then not really thinking about it too much, there’s more focus on really drilling down into what happened and whether the AI was failing on purpose or whether it was attacking you on purpose. It’s just great to do those things.
Ryan: Yeah, let me give some counter-arguments. One view would be: okay, sometimes you talk about preventing bad things from happening, but surely it’s much less effective as a warning shot if the AI doesn’t actually succeed or doesn’t get most of the way to success. And so if your control stuff stops it too early, that’s bad. Which means either your control should focus purely on retroactive detection—retrospective detection, where you later can find the thing. Or maybe it’s even better to just wait until warning shots that kill people. I’m not saying that’s my view. I’m just saying that’s a view one could have.
Buck: Yeah. I mean, it’s always a little rough. There’s always this question about accelerationism, right? To what extent should we allow minor versions of a problem to happen in grotesque ways, such that people are more worried about the later versions of this problem? I’m somewhat sympathetic to this. Maybe one of my favorite ways of arguing this would be to talk about the difference between capabilities required to take over the AI company and capabilities required to take over the world. It’s plausibly a lot easier for an AI to do things that a company would consider to be a grave security failure than for the AI to actually take over the world. Therefore, the first AIs that are able to, for instance, self-exfiltrate or run themselves rogue inside a data center and take a bunch of actions in the outside world—probably those AIs will have a much better chance of wreaking a moderate amount of havoc than actually being an existential risk to the world. And plausibly we should think of that as good news from our perspective. Obviously, I think most of the time we should avoid causing minor disasters, even if we think that would lead to a better outcome. But at the very least, if we think there’s a type of minor disaster that would have overall good consequences, I think it is acceptable to say we’re going to put less effort into preventing that kind of minor disaster.
Ryan: Yep.
Buck: I’m moderately sympathetic to this. I don’t know, man. This is definitely a factor, but overall is outvoted by the other factors.
Ryan: It’s worth noting that this doesn’t apply necessarily that differentially to control versus alignment research. You might think that alignment research has the exact same problem in that it prevents bad things from happening. And didn’t you know that bad things are good because they cause whatever? But I think the argument is symmetric. The main difference would be that control isn’t indefinitely scalable. But that also applies to most of the alignment stuff people are doing.
Buck: Yeah, totally.
Ryan: So it seems just symmetric to me.
Buck: I have way less of this—yeah, I think especially for alignment work, where there’s another reason to think—
Ryan: Yeah, alignment work also doesn’t have the upside of catching it, right? It feels like control and alignment work are both non-scalable. The control work maybe prevents warning shots while catching them. The alignment work maybe prevents warning shots while not catching them. And I’m like, well, on this perspective, the control work is strictly better. But you know, we might think the alignment work is more likely to be scalable—at least certain types.
Buck: Yeah, I mean there are a lot of types of alignment work where I think there’s basically no reason to think that they’d be scalable. And I think that’s what you’re mostly talking about.
Ryan: I would also note that to some extent this should make us more excited about focusing less on preventing things from happening and more on detecting things after the fact and more on asynchronous monitoring that isn’t trying to prevent things, which might also be easier. Maybe one vision of a lower-effort control regime is: we do purely a mix of asynchronous monitoring and preventing sabotage of safety research that is narrowly targeted at teams doing safety research—using automating safety research with AIs. And we’re like, “If the AIs are doing anything other than automating safety research, we don’t worry about immediate failures and we do everything retrospectively.” Then maybe also for automating safety research, we can basically do it mostly retrospectively because there aren’t high-stakes failures that are directly related to sabotage. One concern with this is: if AIs succeed at some types of security failures, they might be able to tamper with your logs or tamper with your detection mechanisms or otherwise cause failures that make it way harder to detect what’s going on. This could also apply to safety research where one of the mechanisms for sabotaging safety research is high-stakes failures. And then once you already have the high-stakes failures stuff, you might be able to deploy it more widely. Maybe not though. But I do think this should make us relatively less excited about preventing high-stakes failures within the AI company and relatively more excited about asynchronous logging and preventing research sabotage, because they don’t have this warning-shot-reduction effect.
Buck: I’m definitely sympathetic to: there is some level of AI capabilities where it would be really shocking if the AIs could really put us on a really bad path by hacking us a whole lot. It’s much more plausible that they could do something really scary that shocks people inside the AI company. And I’m sympathetic to saying we should mostly focus on making sure we discover those rather than trying to prevent those.
Ryan: Yeah. I might also note that there’s an open question about how much of the value comes from worlds where people didn’t care that much and weren’t actually that aware of the problems, but cared enough to implement things, versus worlds where people were like, “Holy shit, these AIs are scary, look at what they did.” To the extent that a bunch of the value of control is coming from worlds where there just wouldn’t have been a warning shot, we still want to potentially sweep up that value. If we’re just very warning-shot-focused, I think that only makes sense if you’re like, “It’s hopeless without warning shots.” Which I don’t necessarily think—though it’s going to depend on the quantitative details. Do you want to ask me a question, or I can—yeah.
How have our views changed?Ryan: How have your views changed over the course of your work at Redwood and over the course of the last three years in general, or four years or whatever?
Buck: That’s a long, complicated question. And I’m sure you’re gonna think of things that I forgot. I guess some big themes are—articulating AI control as a methodology is a pretty big one. I feel like I would have understood this argument pretty well if I’d heard it in mid-2021 when I started doing Redwood stuff. I would have understood it pretty quickly. Something which is upstream of the control stuff is: nowadays I put a lot more focus on AIs that are not wildly superhuman. When we started Redwood, we had this sense that the thing that was most important to do was focus on indefinitely scalable AI safety research, or focus on stuff that would work when the AIs were wildly superhuman. This was partially because this is the MIRI attitude and we adopted a lot of the MIRI stuff at the time. And partially there’s a different reason why you might think this is a good methodology, which is that it’s stuff that’s particularly likely to be neglected and particularly likely to not get solved by default. I now think that this is very much not how we should do this research.
Ryan: And Paul also wrote blog posts where he was like, at least alignment theory work should aim to be scalable.
Buck: Which I’m not sure if I buy, but yeah, totally. To some extent it was a mistake to focus on scalable stuff, or stuff that I believed to be scalable, at the start of Redwood. To some extent it made more sense then because AGI looked further away. Then as LLMs got better and it became clearer how the AI agents would work, I think at that point it really starts looking a lot more attractive to focus on the point where the AIs are quite powerful—powerful enough to automate AI R&D or whatever—but not wildly more powerful than that, and perhaps somewhat less powerful than that. This is a regime of AI capabilities where our p(doom) from there swings quite widely. So in hindsight, I think that now this is a big change in my focus, but I’m not sure I would describe it as a mistake that we made in the past.
Ryan: Yeah, makes sense. One thing that has changed: I used to think about, when thinking about what the AI safety plan is, imagining that you had an AI company that was in the lead and was trying to get AI safety to go well with a reasonable fraction of the budget, and maybe you had a year of delay you were willing to spend on safety. I now spend more of my time thinking about regimes that have way less spending—it’s more like “10 people on the inside,” which is a blog post that Buck wrote. That’s a difference in my perspective—it’s much more visceral and salient to me. These very low-effort regimes should be targeted. Maybe there’s a bunch of alpha there. I’m also sympathetic to people thinking that these regimes are too intractable, or they’re too low in the logistic success curve and they’re just doomed. Or basically all the worlds where things go fine are worlds where alignment was trivial. All the worlds where things go poorly are worlds where alignment was any amount difficult. And then there’s nothing that’s interestingly in the middle and influenceable. But that’s not currently my view.
Buck: That makes sense. A related way that we’ve changed our attitude: in the heady days of mid-2023, it looked more plausible that AI companies would try to be very responsible. When we were coming up with the AI control stuff, we thought about this a lot in terms of safety cases. We thought: imagine an AI company wanted to have an organization-wide process for assessing the level of risk posed by their deployment at any given time, and they just wanted to have some takeover risk per year budget, and they were willing to have a legible and conservative estimate of takeover probability per year, and they would try to stay under that. Nowadays it seems much less likely that AI companies are going to try to have a legible and sensible process for estimating the risk posed by their deployments.
Ryan: Especially in the short term. But even longer term, I think we’ve learned something about how the dynamics around RSPs work. I was maybe slightly hopeful that companies would basically do something that looks more like they have a kind of reasonable, relatively strict RSP combined with an exit clause, and there was some hope of bootstrapping from there. But it looks like currently what we have are extremely vague and weak RSPs that also have an exit clause. When I say “exit clause,” it’s like: “If we think that we’re not the worst, then fuck it, we’ll do it.” I mean, that’s slight hyperbole. But the exit clauses are basically like, “If we think that our actions would be net good on cost-benefit grounds, we’ll proceed anyway, even if we’re imposing above these risk thresholds.” Variance on that. Maybe they also talk about minimal marginal risk, though I’m not sure how properly that will be assessed in practice. It seems like in some sense the thing companies should agree to is: “We will be actually very safe by default if every other company is very safe.” By very safe I mean: impose less than 1% takeover risk, mitigate misalignment quite well, actually be robust to having our weights stolen by China if we’re not in a race condition. But in practice, no company is willing to commit to this, which is kind of surprising in some sense because it feels like a slam-dunk commitment—you can just expect that you won’t be in this situation because there’ll be a race with China. But shrug. Bootstrapping from there into a regime where you actually had coordination would have been more doable, but it didn’t happen.
Yeah, I guess we’ll move on from how our views have changed. Maybe let’s talk a little bit about what’s the probability of architectures that involve neuralese. We talked a bit about the economics of this, or why it might happen. The question was originally posed as: probably it happens in the next three years. I think it’s a little more interesting to answer the question of: conditional on powerful AI within 10 years, what’s the probability that at the point of, let’s say, full automation, they use neuralese? That’s typically a more interesting question. My view is for the scenario where they’re not using really monitorable chain of thought, or we have the neuralese—I don’t know, 60%? Curious what you’re saying.
Buck: Where does that number come from, Ryan?
Ryan: I just thought about it and it seemed pretty likely that you end up with—there are a bunch of different routes by which you end up with the chain of thought ending up not interpretable. And it seems like it would probably be a big competitiveness advantage in the long run. Definitely people disagree about this.
Buck: What do you think about the fact that there have been many orders of magnitude of AI progress so far and it hasn’t happened yet?
Ryan: Yeah, it’s definitely some argument. I mean, we haven’t seen AIs really extensively use chain of thought that much for very long. I think naively the timing you would expect is: people get chain-of-thought reasoning to really work, which we’ve now just seen as of o1, and now people are further scaling that up, and then people convert from that into neuralese. It’s also plausible that we can just prevent converting to neuralese for safety concerns. I think my 60% is sort of putting aside active intervention to push for safety concerns. Maybe I should downweight because of the heroic efforts of the safety community to prevent neuralese. I feel like naively it probably doesn’t make a big difference, but maybe I’m underestimating the true might of the people who give a shit about x-risk. There’s some good signs here—I was pleasantly surprised by how many people from OpenAI have expressed concerns about ensuring that we have monitorable chain of thought ongoingly. That’s some evidence that there will be some actual pushback.
Buck: Cool. Yeah. Nothing to add. Next one.
Ryan: Yep.
Should someone research AI alignment?Buck: What do you think of AI alignment? Should people work on making it more likely that the AIs are aligned? And what are the research directions in AI alignment that seem kind of promising to you?
Ryan: I think some people should work on AI alignment—that seems like a good direction. There are a few different sub-problems that would be good to solve. One sub-problem is: prevent AIs from alignment faking, or prevent AIs from being schemers. This is ultimately necessary at some level of capability—unless we’re doing some insane very long-term coordination—at some point, in order for further AI progress to be fine, you need the AIs to not be schemers or not be alignment faking. At least fine from a normal perspective.
For preventing the AIs from alignment faking, there are a bunch of different directions. The directions that I’m most excited about, I would say broadly, are things like iterating on this with model organisms. It’s not super obvious how one works on this right now, but the myopic trying to do model organisms research now has some transfer to the model organisms research later. Though I don’t think that doing countermeasures examination on current model organisms of scheming has very much transfer to future models because there are too many disanalogies. Some transfer, but not amazing transfer depending on the granularity of the countermeasures. So I think probably the long-run main hope for avoiding scheming will be model organism experimentation.
Another hope is sort of top-down interpretability or neuralese decoding. The mechanism by which that avoids scheming is very similar to the model organisms mechanism—you have some ability to measure whether your AIs are misaligned or scheming, or are alignment faking, and then you can iterate against that. Your detection mechanism is hopefully robust enough that it’s pretty successful and you can at least sort of sign off on a model reasonably well. I think that also seems promising. I’m not that excited about mechanistic interpretability, but things that are more like high-level thought decoding, neuralese decoding, plausibly some amount of LLM psychology—those all seem reasonable. Also mechanisms for making it much easier to detect scheming, which you can basically think of as a complement to model organisms research so that we can figure out some way to iterate rapidly. That’s kind of where I’m at on alignment research on avoiding alignment faking.
The second part of it—I kind of think about this area as divided into two areas. One of them is: avoid the AIs being egregiously scheming, or avoid them having hidden long-run goals. There’s another area which is: make it so that the AIs are actually sufficiently aligned that we’re happy to have them automate our work conditional on them basically operating as they appear. So there’s “avoid them—avoid it being the case that we’re deluded about their long-run—” Maybe one way to put this: make it so the AIs are sufficiently aligned that behavioral testing is representative. Avoid things that make it so the AI deliberately fakes out our behavioral tests. And another thing is: have good enough behavioral tests and make it so our behavioral tests actually look good. This involves things like: can we train AIs to be actually trying and to do a good job on alignment research?
My current view is that ultimately normal commercial pressure will solve most of these problems for the non-alignment-faking alignment problems, except potentially making it so that AIs have good epistemics—which we will ultimately need. It’s only sort of an alignment problem, but in some ways is an alignment problem. In order to, for example, hand off difficult strategic questions to AIs, we do need them to have good epistemics. Research on that seems worthwhile and maybe would be neglected by default.
I think research on getting AIs to actually try hard to do what they’re supposed to do on extremely hard-to-check tasks like philosophy research also seems like it might not happen by default. And seems like there might be some ability to work on it in advance, though it also seems hard. You could think about this both from the perspective of a capabilities problem—the AIs are much better at automating AI R&D than automating thorny philosophical or strategic decision-making or investigation of conceptually loaded alignment research. So you can potentially just work on the capabilities angle of making AIs good at philosophy. But also the angle of making AIs actually try at the philosophy is maybe just as relevant. The alignment versus elicitation distinction might get a little bit unclear in these sorts of things, and there might not be an interesting distinction.
So that’s how I’m thinking about it: avoid alignment faking, make it so we have good enough behavioral testing, and make it so the AIs perform well enough for our behavioral testing. Under that, we need AIs to actually be trying hard on very-hard-to-check tasks. Those are needed potentially in the longer term in order to safely hand over safety work to AIs, and also safely hand over all kinds of strategic decision-making related to managing the AI situation.
In the shorter term, I think you can get a bunch of use out of AI systems with just avoiding alignment faking, and they do a reasonable job on ML research, but that doesn’t get you all the way.
Somewhat long answer, but hopefully got to what you’re saying.
What’s the route to success?Ryan: A question that a bunch of people were interested in asking is the big question we as a future need to answer: what is the strategy? What’s the route to success? What do we think is the overall plan for things going well? There are a bunch of different directions to take this in, but yeah, you can take it however you’d like.
Buck: I don’t know if I agree with the frame of the question. I think that some people should be thinking about particular sub-problems and how to make it so that things go well with respect to those sub-problems. I think a lot of people should probably be doing research that isn’t really shaped this way and is more just them doing things that seem kind of reasonable and seem reasonably likely to fit into an overall solution to problems. A lot of rationalists have this attitude that you need to have a plan for something in a way that I think is not amazingly representative of how a lot of things are done in the real world. A lot of the time you don’t have a plan in advance. You just try to make it so that you will generically be able to respond to problems as they arise for a bunch of plausible problems that could arise. And you try to make sure that there aren’t going to be things that arise that you can’t handle. That’s a lot more like what it’s like when you do a lot of different things—when you invade Normandy, when you—I mean, they definitely did—
Ryan: Have a plan for invading Normandy, but they just also responded, right?
Buck: Yeah, they had a plan for invading Normandy. They didn’t expect that the plan for invading Normandy would be extremely accurate and wouldn’t involve needing other resources to do whatever.
Ryan: Sure. But channeling my inner person who really wants a plan: our plan for AI safety—we should at least have a plan that’s at the level of the plan of invading Normandy, right?
Buck: That would be nice.
Ryan: And maybe we should talk about what that is.
Buck: And notably—
Ryan: I can also take this question if you want, but yeah.
Buck: Companies that are building AGI don’t have a plan which is at the level of the plan for invading Normandy.
Ryan: For how to build AGI?
Buck: For how to build AGI, yeah. Their plan for how to build AGI is just: we’re gonna build more AIs, bigger, smarter, hire smart people, fundraise lots of money, famous friends, lots of users.
Ryan: Yeah, yeah. I definitely agree with the claim that it’s not obvious that you need a plan in order for things to go well. I separately do just have some plans. I don’t think it’s that productive from my perspective to think about what the overall plan is, as opposed to technical AI safety plans for different levels of political will, and then interventions that move us from one level of political will to a higher level of political will. So maybe I’ll decompose this in one way.
First let’s talk about four points on levels of political will that I often talk about. Plan A is a plan for a scenario where we have enough political will that we can do international coordination on a long slowdown of all AI progress for 10 years and do all the careful coordination on that—have all the inspectors in the Chinese data centers and all that. That doesn’t necessarily mean that our plan should be to have a 10-year slowdown, but that might depend on the evidence. Though I think probably you want to slow down faster takeoffs over some longer duration. But if takeoff doesn’t look that fast, then obviously you want to slow down.
At that level of political will, where you’re potentially willing to do aggressive military action against countries that defect on an international agreement or don’t join an international agreement, I think the plan should basically be: ensure AIs are controlled, only scale up to controllable AIs, slow down takeoff such that we transition—or ensure the takeoff is sufficiently slow, which might happen by default, but I think probably won’t happen by default—such that we’re transitioning from the point of superhuman coder to AIs that beat human experts at basically everything over the course of more like three years than six months. And then I think we can probably, with a lot of effort—including a bunch of help from the AIs along the way—ensure a reasonable level of control at the point of top-human-expert-dominating AIs. The AIs that are just beating the top human experts at everything but aren’t much more capable than that—which might also require massaging the capability profile and doing a bunch of work.
My hope would be: we ensure control up to that level, we’re studying the scheming along the way with lots and lots of model organisms work (and there are a bunch of different research directions along those lines), and then we’re pursuing a bunch of different bets that might make the situation better in various ways. Then we sit with control using those AIs, probably trying to get a bunch of the benefits of AI at that point—because otherwise we don’t have the political will to continue this regime if we’re not seeing any of the benefits. This is at the point where the AIs are just better than humans, I would say, in general—though they might be quite a bit better in some respects.
Then we get to a point where we either are forced to hand off due to exogenous pressure, or we feel like we have sufficiently robust safety solutions that the risk is very low. What we do is ensure the AIs are sufficiently aligned in the sense of avoiding safety scheming, while simultaneously avoiding other misalignment problems other than scheming that are needed for handoff to go well. By handoff I mean at the very least handing off all safety work, plausibly handing off other strategic decision-making to the extent that that’s needed. And then do that.
Okay, maybe I should—that’s a very compressed version of this. I have a bunch of internal docs on this. Maybe I’ll move on to Plan B. Plan B is like: imagine that instead we have an AI company which has some lead and has enough government cooperation that, if there is some concerning evidence of misalignment risks, they can play that into a three-year lead to spend on safety, or maybe a two-year lead to spend on safety. But there’s uncertainty about the lead and it’ll be pretty rushed. There might be other domestic AI companies. There’s not necessarily international coordination, but the government is willing to assist with non-proliferation efforts or efforts to retain a US lead. (You can also make this plan symmetrically for other countries. It doesn’t necessarily have to be the US, but just in terms of having somebody to spend on safety.)
When I’m thinking about that, I think that plan looks similar, except that you’re going to aim for a rapid handoff—which is not necessarily—and focus on very prosaic methods to make the handoff more likely to go well. You might be handing off to AIs that you think have a substantial chance of scheming, though presumably if you think the AIs are definitely scheming, you should just not hand off to them and wait and just keep trying until hopefully something happens. You will still be aiming to ensure control, but it might be much jankier control that has a reasonable chance of failure.
Plan C is: imagine that you have an AI company with a lead and they’re willing to spend that lead on safety, at least most of the lead on safety, but the government doesn’t do anything useful or doesn’t do anything substantial that’s very useful. A concrete instantiation would be: let’s say Anthropic has a four-month lead, they’re somewhat worried about misalignment risks, they’re willing to spend most of that lead on misalignment risks—taking some risk that they lose their lead, but not a huge risk that they lose their lead—and then they’re going to scale from there.
In this scenario, I think you should be doing even jankier control and you should be aiming to very rapidly hand off to AIs so that the AIs do the safety work sufficiently quickly and you don’t become uncompetitive. As far as how this rapid handoff goes, I think you’re going to be doing a bunch of trying to rule out scheming and then jankily do a bunch of behavioral testing of your AIs, mostly by actually trying to use them to see how reasonable things are. If things look obviously unreasonable, then you keep proceeding. You also are simultaneously trying to buy time to move more into the Plan B regime using a bunch of non-proliferation efforts. Again, I’m brushing over a bunch of details.
Plan D is: the leading AI company doesn’t really care much about safety. Isn’t treating that as a serious institutional priority, at least in practice—maybe in principle it’s supposed to be. Maybe there’s misalignment between the leadership and the rank and file. The leadership would kind of want to prioritize safety some, but not if it had any risk of them falling behind. But there’s like 10 people or 30 people who care quite a bit about mitigating catastrophic misalignment risks, who have some amount of buy-in to do a bit about this. Maybe not to spend a bunch of compute, maybe not to get in anyone’s way, but they have some computational resources, they might be able to deploy some stuff if it’s not very annoying—especially if it has multiple motivations and might just be commercially justified in avoiding the chance that we have a huge incident.
In this Plan D scenario, I think the plan should basically be that other people are exporting a bunch of safety work they can easily import. That team tries to ensure control at the level of: their safety work is somewhat less likely to be sabotaged. They do have asynchronous monitoring to try to detect incidents. But they maybe just YOLO or give up on ensuring control from the perspective of “the AI can’t self-exfiltrate” and also “the AI can’t mess up a bunch of things throughout the company”—though ideally they try to prevent their logging from being tampered with, or their safety research from being tampered with. That might be hard because the AI might be able to just compromise the company’s infrastructure. But I think you just gamble that it doesn’t, because you just don’t have the resources. Though depending on the exact situation, maybe you can try to avoid that.
And then if there’s a trailing AI company that is more bought into safety, maybe they try to just basically give up on being competitive and burn a bunch of their resources on having exportable work that they send to the people at that company. That exportable work would be mitigations for reducing the chance of scheming that you can merge into the main production training run, and mitigation things that allow you to have a handover candidate that you can hand off your safety work to. It might be the case that the safety team hands off to an AI system substantially after the point when the rest of the organization is automated, because they’re still desperately trying to get that. It could be that they hand off before the rest of the organization is being automated, but after the capabilities for automation have been present for a while. It’s also plausible that the safety team wants to train their AI in some substantially different way to reduce the risk, but the rest of the organization isn’t bought into this. So it’s important to do kind of weird stuff like wait for enough algorithmic progress that the safety team’s compute budget suffices for them doing their own training with different properties that they couldn’t get the rest of the org to adopt. But all of this is—I think Plan D is relatively desperate.
I should note there’s also Plan E, which I don’t even work on because there’s not that much to do about it, which is: the company doesn’t really care about safety, and the safety people there, if they exist, are effectively disempowered or have no access and can’t implement anything.
Buck: What’s your distribution over which of those worlds is most like the one we’re in?
Ryan: Conditional on powerful AI in the next 10 years—let’s say very powerful AI in the next 10 years—I wrote down numbers somewhere, so I’m gonna try to remember them. Maybe these won’t be well-considered. I would say like 5% Plan A, 15% Plan B, maybe 25% Plan C. That’s a total of 45%. And then maybe 40% Plan D, and then the remainder on Plan E, which is like 15%. So it’d be 5, 15, 25, 40, 15.
Buck: I would definitely have put more probability on Plan E there relative to Plan D.
Ryan: Isn’t it pretty likely you can get 10 people to work at the leading AI company who care about safety and have some moderate amount of resources?
Buck: Maybe. There might be strong selection pressures where the company tries to fire everyone who might whistleblow. Or you might have something like in the AI 2027 scenario, you end up with most people in the AI company not having access to the most recent systems. Maybe you’re talking about slightly earlier than this.
Ryan: Yeah, I’m partially talking about slightly earlier. I’m thinking about leading up to the point of full automation. Are there 10 people who care quite a bit about catastrophic alignment risks and are trying to be kind of thoughtful and reasonable about this and are not totally epistemically fucked—they have kind of reasonable views from my perspective? I don’t know, it just seems pretty likely that AI companies just aren’t that secretive. When I look at the AI companies that exist now, they seem a little more libertarian than you would think. Maybe things will tighten up later.
Buck: So you say 40% to Plan D, 15% Plan E. That indicates more than 2/3 on: you have more than 10 people being pretty reasonable on the inside.
Ryan: That’s currently my view. Maybe I’m wrong. Maybe I should move more probability to Plan E. And to be clear, Plan E is like there’s no plan at all. As in: if you’re targeting a leading AI company, there’s very little you can do, and so probably you shouldn’t. You should instead target trailing AI companies or target a different world.
Buck: Target the outside is another option, obviously.
Ryan: Yeah. By “target the outside” you mean hope to move to Plan D at some point? You’d have to implement something. You could have in principle had safety mitigations that didn’t even require the AI company. So sometimes we talk about maybe we just dump a bunch of data in pre-training and you just hope that even though the AI company didn’t care, that still gets adopted. But that’s a very confusing, very inefficient way of intervening. I think I’m more optimistic about trying to move from Plan E to Plan D than I’m optimistic about trying to make things go well in the Plan E world.
To be clear, I think there’s a question of—you can have a version that’s in between these things. This is a high-dimensional space that I’ve simplified into one dimension. An example world is: the leading AI company is basically a Plan E AI company, but there’s an AI company that’s only three months behind that’s more like a Plan D or Plan C AI company. And the company that’s three months behind—you’re going to be uncertain about how takeoff speeds go. So there’s some chance that they’re both advancing, but they remain reasonably competitive. And you just are like, “Yep, the company that is the Plan E company, maybe their AI will be super misaligned and we’re not going to intervene on that at all. But we’ll ensure that at least the Plan D/C company has a more likely to be aligned system, and maybe that suffices.” In some worlds that will be totally insufficient because takeoff speeds are too fast. In some worlds that might be totally fine.
What to do if you’re David Sacks?Ryan: If you were David Sacks, what would you do? Answer in one minute or less. Go.
Buck: I don’t know. If I’m David Sacks, what would I do? Try to persuade people in the US government that AI that is very powerful poses a lot of risk. It’s the kind of technology that needs to be handled as a source of risks for all sides rather than just an item on which it’s important to have competitive advantage. I would look for the best analogies for people inside the US government who had tried to persuade the government and other governments to treat technologies this way in the past, and try to imitate those. My main objectives would be having more decision-makers think of AI as a big deal, a potential source of huge risk. I would also pay some attention to mitigating risk of US government overuse or concentration of power for the AIs. I think there might be some methods you can apply to making it less likely that, for instance, the US President ends up getting inappropriate levels of control over the AI.
Mid career professionals?Ryan: What kind of mid-career professionals would you be most excited to see switch to AI safety? And in what area of AI safety would you want them to switch to?
Buck: I don’t know. One thing which is on my mind: I feel like it’d be cool if there were a bunch of computer security professionals who wanted to work on AI control stuff. From my perspective, AI control is just a very natural computer security problem. People currently are not handling this with a computer security skill set. The computer security people are not focused very much on this. I think they could probably make a reasonable amount of a difference on this. So that’s one thing that comes to mind. I’m excited for more writing on AI safety. I’m excited for more of a wide variety of types of media. I’m excited for activism related to AI safety. And mid-career professionals would be valuable for that kind of thing as well, I think.
Also, I wish there were more economists and political scientists who—and I guess for a lot of these classes of expert, it feels to me like I really want them to not just try to use their skill set that they’ve been practicing for the last 20 years or whatever. I want them to sincerely try to study and learn the AI stuff. A lot of the time mid-career professionals are like, “Well, you know, I’m a lawyer, I’d better go do some stuff that is entirely being a lawyer but in the AI safety context.” Or philosophers do this, or economists do this. And I think a lot of the time they underrate the importance of putting months of work into actually picking up the AI stuff. It feels to me like they do stuff which is too shallowly connected to the AI stuff.
Ryan: Yeah.
Buck: Is that your impression? What do you think?
Ryan: Yeah, I think there’s a general thing where mid-career professionals just in practice aren’t very plastic, and it’s a little unclear why. They just don’t spend that much of their time learning stuff. This isn’t always true, but there might also be selection effects in which people we encounter.
Buck: Yeah, I feel like it’s strange because you sometimes run across mid-career professionals who are super willing to learn new things. I remember we were having a conversation with Holden Karnofsky a while ago, and he was considering taking his current job at Anthropic, and we were like, “But Holden, don’t you think that this job is going to require you to develop very novel and important skill—” blah blah blah, which I’ll censor for Holden’s sake. And he was like, “Yeah, it will require me to develop those novel skills. But you know what? I love learning novel skills. I’m a young and enthusiastic man. I can just pick up very new skill sets if necessary.” It just felt really surprising to see him just be like, “Yeah, it’s really new, but that’s why I’ll learn a new thing.”
ConclusionBuck: Thanks for podcasting with me. Perhaps another one of these will happen in the future.
Ryan: Yeah. This is the Redwood Research experimental podcast. The first of its kind. Unprecedented. Soon every research org and individual and combination of individuals—which is a salient category—will have a podcast. But Redwood Research was first. Definitely.
Buck: Great.
Discuss
LessOnline 2026 Improvement Ideas
I had a wonderful time at LessOnline 2025 and am excitedly looking forward to the 2026 installment. If you're reading this, you should definitely consider going!
Here are a few ideas I had that may improve the LessOnline experience a bit. Feel free to add your own ideas to the comments.
(Again, the event was run incredibly well and a vast majority of attendees had similar opinions as me based on the feedback at the ending session—these ideas are just some "extras".)
- QR code on the name tag for easy access to contact info. Quite a few times when I asked for somebody's social I had to pull out my phone, open up my browser, listen to them spell out their info while I typed it, backspace a few times because I misheard, then finally ask "is this you?". A QR code linking directly to their LessOnline profile would be less disruptive to the conversation, easier to manage, etc. The code could be automatically generated when they make their profile and then "locked in" whenever name tags are printed.
- Hotel group rates for cheaper stays nearby. Working with a nearby hotel to get discounted rates and serve as the second official LessOnline lodge (after Lighthaven, of course) may allow more people to come in case it's too expensive with flights, plane tickets, lodging, and miscellaneous expenses. This is probably prohibitive because it requires a lot more work on the organizers' parts (finding a good hotel, contracts, etc) and may draw people away from staying at Lighthaven, which is a nice source of revenue for Lightcone Infrastructure.
Discuss
The economy is a graph, not a pipeline
Summary: Analysis claiming that automating X% of the economy can only boost GDP by 1/(1-X) assumes all sectors must scale proportionally. The economy is a graph of processes, not a pipeline. Subgraphs can grow independently if they don't bottleneck on inputs from non-growing sectors. AI-driven automation of physical production could create a nearly self-contained subgraph that grows at rates bounded only by raw material availability and speed of production equipment.
Models being challenged:
This post is a response to Thoughts (by a non-economist) on AI and economics and the broader framing it represents. Related claims appear across LessWrong discussions of AI economic impact:
- Amdahl's Law for economics: Automating a sector that represents X% of GDP can boost output by at most 1/(1-X). Automating software (2% of GDP) gives ~2% boost; automating all cognitive labor (30% of GDP) gives ~42%.
- Bottleneck tasks determine growth rate: "Suppose there are three stages in the production process for making a cheese sandwich: make the bread, make the cheese, combine the two together. If the first two stages are automated and can proceed much more quickly, the third stage can still bottleneck the speed of sandwich production." (Davidson 2021)
- Baumol effects bind: "If bottlenecks persist—and I believe strongly that they will—we will have Baumol issues... 100% is a really big number. It's radically bigger than 80%, 95%, or 99%." (Twitter thread quoted on LW)
These framings share an implicit assumption: the economy is a single integrated production function where unautomated sectors constrain growth of automated sectors. I argue this assumption breaks when an automatable subgraph can grow without requiring inputs from non-automated sectors.
There's no rule in economics that says if you grow a sector of your economy by 10x, farm fields, lawyers and doctors must then produce 10x more corn, lawsuits and medical care respectively. The industrial revolution was a shift from mostly growing food to mostly making industrial goods.
For small growth rates, and small changes, the model is mostly true in much the same way that linear approximations to a function are locally true. But large shifts break this assumption as the economy routes around industries that resist growth.
During the industrial revolution, Factories needed more workers but were competing with agriculture. Without innovation, there could have been a hard limit on free labor available to allocate to factories without cutting into food production. Instead innovations like combine harvesters and other agricultural machinery freed up farm workers for factory jobs. There's economic pressure to route around blockages to enable growth.
A better modelThe economy is a graph
- processes are nodes in the graph that consume inputs and produce outputs
- Subgraphs of the overall graph can grow, limited only by ability to buy inputs at the edges where they interact with the parent graph.
- during the industrial revolution:
- more mines were built to increase supply of metals (self contained growth)
- innovations to freed up workers from agriculture jobs to move to the cities (interaction with the larger graph)
- during the industrial revolution:
Structural changes to the economy can route around bottleneck to enable growth.
An illustrative caseConsider automating mining and manufacturing. We build the following:
- Robots that can move materials and set up machines/factories
- machines that can build:
- more robots
- more machines/factories
- As inputs, the factories take metal ores, coal and solar power
- Some things are purchased from the broader economy:
- semiconductors
- lubricants
- anything it can't produce (EG:metals not available locally)
- services of lawyers/engineers/etc.
- Set this up in Australia sited on coal/iron deposits with some rail links connecting west/east for iron/coal.
This self contained economic core can then grow limited only by raw materials, speed of the underlying machinery and whatever it cannot produce that it has to trade for with broader economy like semiconductors or rare earths.
AI and robots don't take vacations, go to the doctor, put their kids in daycare or school. They needs chips, electricity and raw materials to grow. There will be some engineers, lawyers and lobbyists employed but not much relative to similar industrial production in the legacy economy. The system doesn't need much in the way of non-self-produced "vitamins" per unit of production.
It doesn't matter that this system hasn't replaced doctors in hospitals or convinced San Francisco to build housing. It can grow independently.
A hypothetical timeline- Starts with software:
- AI agents start automating production planning
- AI interfaces with industrial machinery
- AI suggests production improvements
- Bespoke software → bespoke hardware
- Machinery raw material costs <<10% sticker price; eliminating labor drops prices by 1+ orders of magnitude[1]
- Capital pours in similar to AI buildouts today
- Manufacturing capacity grows very fast without needing to close power/resources loop yet
- Electricity prices spike
- Price of other things the growing industrial base needs spike:
- raw steel, sheet metal, fasteners, copper, magnets, plastic
- Shortages as outputs redirected towards growth
- similar to DRAM/flash memory price hikes happening now due to AI demand
- Early loop closure could avoid shortages (e.g. Western Australia industrial buildup geared towards self-replication)
- Eventually there's pressure, loops close, semiconductor perhaps closes last.
- Optimisation for doubling times yields >50% yearly growth, perhaps much more than that.
Full automation has the potential to massively increase productivity/growth. Right now most manufacturing output is consumer goods. Only a small fraction is machinery to replace/expand production. That machinery is often badly designed.
AI raising the floor on competence and automation would usher in absurd productivity gains. Full integration happens fast once AI replaces/convinces non-engineering trained MBAs of the stakes. Full automation/integration drops the capital cost to upgrade or build new production capacity. Things go insane.
Timelines and trajectories are hard to predict; ASI could compress everything, AI wall would stop it, AI capabilities might not generalize to optimizing manufacturing, but I think they will. Broadly, this doesn't happen or happens much slower if:
- AI can't raise the competence floor in industry (AI winter, lack of generalisation, etc)
- The growing self-contained sector can't buy the external inputs it needs
- voters object to electricity prices spikes making air conditioning unaffordable
- Some critical industry resists AI efficiency/integration
My own experience in industry and interactions with AI make me think 1) is wrong. 2) isn't an issue so long as there aren't political barriers in all countries. 3) is possible if some critical patents don't get licensed out, but lower tech out of patent technologies can usually be substituted albeit with some efficiency penalty.
Future posts I plan on writing supporting sub-points of this:
- How bad things really are in industry
- Industrial processes are very productive
- existing proven industrial tech grow at >100% yearly based on simple production mass ratio calculations
- (speculative technologies for very large economic scale-up)(multiple posts?)
- this stuff enables much faster growth (100% --> 5000% yearly growth). Mostly making parts of the system cheaper/simpler:
- undersea power cables
- solar
- manufacturing equipment
- etc.
- this stuff enables much faster growth (100% --> 5000% yearly growth). Mostly making parts of the system cheaper/simpler:
- ^
I've seen examples of in-house-produced equipment that cost orders of magnitude less than comparable equipment from vendors while being simpler, more reliable, etc.
Discuss
Calling all college students (and new readers)
After seeing this post and it's comments, I realized there may be an influx of new, roughly college-aged people joining LessWrong and the broader AI x-risk reduction community.
As a college student myself, this is a relief!
I have been regularly reading and digesting posts on here for nearly a year now, and the journey has been a bit lonely – navigating these ideas can feel detached from reality when the people around you aren’t really discussing them yet. I think it would be very helpful to have like-minded peers to talk to.
Note that this applies to anyone who is relatively new to the community, not just college students. The reason I specified that in my title is because I'm also interested in college community-building, which I think could be high leverage for a few reasons:
- Very few have been exposed to the quality of thinking found on LessWrong. I discovered this forum a few years back through Bentham's Bulldog and started taking x-risk seriously only when I realized that AI was indeed becoming as transformative as many over here predicted. I know many intelligent students who could certainly contribute somewhere but have not yet engaged with perspective-shifting content, such as posts from years before the "ChatGPT moment" that now seem remarkably prescient, which they are unlikely to get from a clip on YouTube.
- AI risk is likely to become a bigger, multifaceted public issue. Given the impending job displacement that seems likelier by the day, I expect that many students will soon find themselves confused and searching for a coherent source of information. Whether this translates into effective action or disorder may be a result of the community building efforts we start now. Also, as AI risk increasingly becomes a governance issue, it seems important to engage students beyond those already in CS/ML.
Will this meaningfully influence the world? I don't know, but it seems worth trying. And it may be a place to pick up the ball, because it looks like an awful number of college students are oblivious to the drastic changes that are coming.
If any of this resonates with you, I'd love to get in touch. PM me on here or message me on Discord @ne0_8 – I'm open to whatever mode of communication works best.
Discuss
Страницы
- « первая
- ‹ предыдущая
- …
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- …
- следующая ›
- последняя »