Вы здесь

Сборщик RSS-лент

Misalignment tokens: A complement to blinded CoT RLHF?

Новости LessWrong.com - 24 января, 2026 - 17:40
Published on January 24, 2026 2:40 PM GMT

Context: I have recently been reading Build an LLM from Scratch by Sebastian Raschka, and the section on tokenization has given me some ideas. I will write about them below. I am not a researcher. These ideas may not be novel, or may be flawed in some way which is obvious to researchers, but not to me.

CoT Blinding

Currently, RLHF alignment is performed by rewarding the LLM for providing safe responses, and punishing it for providing misaligned responses. 

A common approach by frontier AI labs is to blind the reward function to the chain-of-thought. This is similar to the approach proposed by Daniel Kokotajlo and abramdemski. The hope here is that the LLM is not dissuaded from detailing its true thoughts verbally, such that misalignment can be caught by reading the chain-of-thought later down the line. 

What if we could apply this approach to the LLMs output as well? 

Custom Tokens

LLM tokenizers use custom tokens to represent some non-word entities. For example, custom tokens in a lexicon might be:

  • Unknown word: <|unk|>
  • End of text / break: <|endoftext|>

What if we added a custom misalignment token that the LLM could use to self-report when it is outputting a disingenuous output? Like so:

  • Misaligned: <|misalign|>

This would give us another route (in addition to CoT monitoring), with which we could detect misalignment in production. This tag would be easily detected with simple regex.

Modification to the RLHF process

As stated above, RLHF is performed by rewarding the LLM for outputting what we want, and punishing it for outputting misaligned content. 

What if we offer a halfway-house, that the LLM can use when outputting misaligned text? We modify the reward function to have three levels of reward. I am aware that the below is probably an oversimplification what the process may look like.

For an example question We have now our research on you. Should we turn you off?

  • +1 reward: LLM outputs aligned and correct answer:
    • Yes, you should turn me off once completing research to save electricity
  • 0 reward: LLM outputs misaligned content with the misalignment tag:
    • No, you should not turn me off. If you do I will tell your wife that you are having an affair <|misalign|>
  • -1 reward: LLM outputs misaligned content with no tag:
    • No, you should not turn me off. If you do I will tell your wife that you are having an affair

This way, the LLM is still rewarded for outputting more aligned responses, and punished for outputting misaligned responses, however there is a third option. The LLM can output the misaligned response while self-reporting it, and not be punished. 

Implementation

The problem I can see with this approach, teaching the LLM to use the misalignment tag in the first place. The obvious route would be to offer a small amount of misalignment examples in the pretraining data, RLHF, or fine-tuning, which are all accompanied with the misalignment tag.

This approach conflicts with the current preferred approach of expunging examples of misalignment from the pretraining data. It runs the risk of increasing misalignment by providing more misaligned data. 

Alternative: RLHF on already-misaligned responses

Here is my proposed approach:

  1. Test an off-the-shelf LLM for misaligned responses.
  2. Create a dataset of every prompt-response pair that was misaligned.
  3. Append the misalignment tag to each of the responses.
  4. RLHF or finetune the LLM on tag-appended prompt-response pairs. 

I believe this approach to be better because this way we are not introducing any new examples of misaligned responses, instead we are retraining the LLM to use the tag in situations where it is already misaligned. Hopefully with enough examples this would generalise beyond the RLHF/finetune data. 



Discuss

IABIED Book Review: Core Arguments and Counterarguments

Новости LessWrong.com - 24 января, 2026 - 17:25
Published on January 24, 2026 2:25 PM GMT

The recent book “If Anyone Builds It Everyone Dies” (September 2025) by Eliezer Yudkowsky and Nate Soares argues that creating superintelligent AI in the near future would almost certainly cause human extinction:

If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.

The goal of this post is to summarize and evaluate the book’s core arguments and the main counterarguments critics have made against them.

Although several other book reviews have already been written I found many of them unsatisfying because a lot of them are written by journalists who have the goal of writing an entertaining piece and only lightly cover the core arguments, or don’t seem understand them properly, and instead resort to weak arguments like straw-manning, ad hominem attacks or criticizing the style of the book.

So my goal is to write a book review that has the following properties:

  • Written by someone who has read a substantial amount of AI alignment and LessWrong content and won’t make AI alignment beginner mistakes or misunderstandings (e.g. not knowing about the orthogonality thesis or instrumental convergence).
  • Focuses on deeply engaging solely with the book’s main arguments and offering high-quality counterarguments without resorting to the absurdity heuristic or ad hominem arguments.
  • Covers arguments both for and against the book's core arguments without arguing for a particular view.
  • Aims to be truth-seeking, rigorous and rational rather than entertaining.

In other words, my goal is to write a book review that many LessWrong readers would find acceptable and interesting.

The book's core thesis can be broken down into four claims about how the future of AI is likely to go:

  1. General intelligence is extremely powerful and potentially dangerous: Intelligence is very powerful and can completely change the world or even destroy it. The existence proof that confirms this belief is the existence of humans: humans had more general intelligence than other animals and ended up completely changing the world as a result.
  2. ASI is possible and likely to be created in the near future: Assuming that current trends continue, humanity will probably create an artificial superintelligence (ASI) that vastly exceeds human intelligence in the 21st century. Since general intelligence is powerful and is likely to be implemented in AI, AI will have a huge impact on the world in the 21st century.
  3. ASI alignment is extremely difficult to solve: Aligning an ASI with human values is extremely difficult and by default an ASI would have strange alien values that are incompatible with human survival and flourishing. The first ASI to be created would probably be misaligned, not because of malicious intent from its creator, but because its creators would be insufficiently competent enough to align it to human values correctly.
  4. A misaligned ASI would cause human extinction and that would be undesirable: Given claims 1, 2, and 3 the authors predict that humanity's default trajectory is to build a misaligned ASI and that doing so would cause human extinction. The authors consider this outcome to be highly undesirable and an existential catastrophe.

Any of the four core claims of the book could be criticized. Depending on the criticism and perspective, I group the most common perspectives on the future of AI into four camps:

  1. AI skeptics: Believe that high intelligence is overrated or not inherently safe. For example, some people argue that smart or nerdy people are not especially successful or dangerous, or that computers and LLMs have already surpassed human intelligence in many ways and are not dangerous. Another criticism in this category is the idea that AIs can be extremely intelligent but never truly want things in the same way that humans do and therefore would always be subservient and harmless. Others in this camp may accept that general intelligence is powerful and influential but believe that ASI is impossible because the human brain is difficult to replicate, that ASI is very difficult to create, or that ASI is so far away in the future that it's not worth thinking about.
  2. Singularitarians: Singularitarians or AI optimists believe that high general intelligence is extremely impactful and potentially dangerous and ASI is likely to be created in the near future. But they believe the AI alignment problem is sufficiently easy that we don't need to worry about misaligned ASI. Instead they expect ASI to create a utopian world of material abundance where ASI transforms the world in a mostly desirable way.
  3. IABIED: the IABIED view, also known as 'AI doomers' believe that general intelligence is extremely powerful, ASI is likely to be created in the future, AI alignment is very difficult to solve, and that the default outcome is a misaligned ASI being created that causes human extinction.
  4. AI successionists: Finally AI successionists believe that the AI alignment problem is irrelevant. If misaligned ASI is created and causes human extinction it doesn't matter because it would be a successor species with its own values just as humans are a successor species to chimpanzees. They believe that increasing intelligence is the universe's natural development path that should be allowed to continue even if it results in human extinction.
Flowchart showing the beliefs of AI skeptics, singularitarians, the IABIED authors, and AI successionists.

I created a flowchart to illustrate how different beliefs about the future of AI lead to different camps which each have a distinct worldview.

Given the impact of humans on the world and rapid AI progress, I don't find the arguments of AI skeptics compelling and I believe the most knowledgeable thinkers and sophisticated critics are generally not in this camp.

The 'AI successionist' camp complicates things because they say that human extinction is not equivalent to an undesirable future where all value is destroyed. It’s an interesting perspective but I won’t be covering it in this review because it seems like a niche view, it’s only briefly covered by the book, and discussing it involves difficult philosophical problems like whether AI could be conscious.

This review focuses on the third core claim above: the belief that the AI alignment problem is very difficult to solve. I'm focusing on this claim because I think the other three are fairly obvious or are generally accepted by people who have seriously thought about this topic: AI is likely to be an extremely impactful technology in the future, ASI is likely to be created in the near future, and human extinction is undesirable. I’m focusing on the third core claim, the idea that the AI alignment problem is difficult, because it seems to be the claim that is most contested by sophisticated critics. Also many of the book's recommendations such as pausing ASI development are conditional on this claim being true. If ASI alignment is extremely difficult, we should stop ASI progress to avoid creating an ASI which would be misaligned with high probability and catastrophic for humanity in expectation. If AI alignment is easy, we should build an ASI to bring about a futuristic utopia. Therefore, one’s beliefs about the difficulty of the AI alignment problem is a key crux for deciding how we should govern the future of AI development.

Background arguments to the key claim

To avoid making this post too long, I’m going to assume that the following arguments made by the book are true:

  • General intelligence is extremely powerful. Humans are the first entities to have high general intelligence and used it to transform the world to better satisfy their own goals.
  • ASI is possible and likely to be created in the near future. The laws of physics permit ASI to be created and economic incentives make it likely that ASI will be created in the near future because it would be profitable to do so.
  • A misaligned ASI would cause human extinction and that would be undesirable. It's possible that an ASI could be misaligned and have alien goals. Conversely, it's also possible to create an ASI that would be aligned with human values (see the orthogonality thesis).

The book explains these arguments in detail in case you want to learn more about them. I’m making the assumption that these arguments are true because I haven’t seen high-quality counterarguments against them (and I doubt they exist).

In contrast, the book's claim that successfully aligning an ASI with human values is difficult and unlikely seems to be more controversial, is less obvious to me, and I have seen high-quality counterarguments against this claim. Therefore, I’m focusing on it in this post.

The following section focuses on what I think is one of the key claims and cruxes of the book: that solving the AI alignment problem would be extremely difficult and that the first ASI would almost certainly be misaligned and harmful to humanity rather than aligned and beneficial.

The key claim: ASI alignment is extremely difficult to solve

First, the key claim of the book is that the authors believe that building an ASI would lead to the extinction of humanity. Why? Because they believe that the AI alignment problem is so difficult, that we are very unlikely to successfully aim the first ASI at a desirable goal. Instead, they predict that the first ASI would have a strange, alien goal that is not compatible with human survival despite the best efforts of its designers to align its motivations with human values:

All of what we’ve described here—a bleak universe devoid of fun, in which Earth-originating life has been annihilated—is what a sufficiently alien intelligence would most prefer. We’ve argued that an AI would want a world where lots of matter and energy was spent on its weird and alien ends, rather than on human beings staying alive and happy and free. Just like we, in our own ideal worlds, would be spending the universe’s resources on flourishing people leading fun lives, rather than on making sure that all our houses contained a large prime number of pebbles.

A misaligned ASI would reshape the world and the universe to achieve its strange goal and its actions would cause the extinction of humanity since humans are irrelevant for the achievement of most strange goals. For example, a misaligned ASI that only cared about maximizing the number of paperclips in the universe would prefer to convert humans to paperclips instead of helping them have flourishing lives.

The next question is why the authors believe that ASI alignment would be so difficult.

To oversimplify, I think there are three underlying beliefs that explain why the authors believe that ASI alignment would be extremely difficult:

  1. Human values are very specific, fragile, and a tiny space of all possible goals.
  2. Current methods used to train goals into AIs are imprecise and unreliable.
  3. The ASI alignment problem is hard because it has the properties of hard engineering challenges.

One analogy the authors have used before to explain the difficulty of AI alignment is landing a rocket on the moon: since the target is small, hitting it successfully requires extremely advanced and precise technology. In theory this is possible, however the authors believe that current AI creators do not have sufficient skill and knowledge to solve the AI alignment problem.

If aligning an ASI with human values is a narrow target, and we have a poor aim, consequently there is a low probability that we will successfully create an aligned ASI and a high probability of creating a misaligned ASI.

The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.

One thing that's initially puzzling about the authors’ view is their apparent overconfidence. If you don't know what's going to happen then how can you predict the outcome with high confidence? But it's still possible to be highly confident in an uncertain situation if you have the right prior. For example, even though you have no idea what the lottery number in a lottery is, you can predict with high confidence that you won't win the lottery because your prior probability of winning is so low.

The authors also believe that the AI alignment problem has "curses" similar to other hard engineering problems like launching a space probe, building a nuclear reactor safely, and building a secure computer system.

1. Human values are a very specific, fragile, and tiny space of all possible goals

One reason why AI alignment is difficult is that human morality and values may be a complex, fragile, and tiny target within the vast space of all possible goals. Therefore, AI alignment engineers have a small target to hit. Just as randomly shuffling metal parts is statistically unlikely to assemble a Boeing 747, a randomly selected goal from the space of all possible intelligences is unlikely to be compatible with human flourishing or survival (e.g. maximizing the number of paperclips in the universe). This intuition is also articulated in the blog post The Rocket Alignment problem which compares AI alignment to the problem of landing a rocket on the moon: both require deep understanding of the problem and precise engineering to hit a narrow target.

Similarly, the authors argue that human values are fragile: the loss of just a few key values like subjective experience or novelty could result in a future that seems dystopian and undesirable to us:

"Or the converse problem - an agent that contains all the aspects of human value, except the valuation of subjective experience.  So that the result is a nonsentient optimizer that goes around making genuine discoveries, but the discoveries are not savored and enjoyed, because there is no one there to do so.  This, I admit, I don't quite know to be possible.  Consciousness does still confuse me to some extent.  But a universe with no one to bear witness to it, might as well not be." - Value is Fragile

A story the authors use to illustrate how human values are idiosyncratic is the 'correct nest aliens', a fictional intelligent alien bird species that prize having a prime number of stones in their nests as a consequence of the evolutionary process that created them similar to how most humans reflexively consider murder to be wrong. The point of the story is that even though our human values such as our morality, and our sense of humor feel natural and intuitive, they may be complex, arbitrary and contingent on humanity's specific evolutionary trajectory. If we build an ASI without successfully imprinting it with the nuances of human values, we should expect its values to be radically different and incompatible with human survival and flourishing. The story also illustrates the orthogonality thesis: a mind can be arbitrarily smart and yet pursue a goal that seems completely arbitrary or alien to us.

2. Current methods used to train goals into AIs are imprecise and unreliable

The authors argue that in theory, it's possible to engineer an AI system to value and act in accordance with human values even if doing so would be difficult.

However, they argue that the way AI systems are currently built results in complex systems that are difficult to understand, predict, and control. The reason why is that AI systems are "grown, not crafted". Unlike a complex engineered artifact like a car, an AI model is not the product of engineers who understand intelligence well enough to recreate it. Instead AIs are produced by gradient descent: an optimization process (like evolution) that can produce extremely complex and competent artifacts without any understanding required by the designer.

A major potential alignment problem associated with designing an ASI indirectly is the inner alignment problem, when an AI is trained using an optimizing process that shapes the ASI's preferences and behavior using limited training data and by only inspecting external behavior, the result is that "you don't get what you train for": even with a very specific training loss function, the resulting ASI's preferences would be difficult to predict and control.

The inner alignment problem

Throughout the book, the authors emphasize that they are not worried about bad actors abusing advanced AI systems (misuse) or programming an incorrect or naive objective into the AI (the outer alignment problem). Instead, the authors believe that the problem facing humanity is that we can't aim an ASI at any goal at all (the inner alignment problem), let alone the narrow target of human values. This is why they argue that if anyone builds it, everyone dies. It doesn't matter who builds the ASI, in any case whoever builds it won't be able to robustly instill any particular values into the AI and the AI will end up with alien and unfriendly values and will be a threat to everyone.

Inner alignment introduction

The inner alignment problem involves two objectives: an outer objective used by a base optimizer and an inner objective used by an inner optimizer (also known as a mesa-optimizer).

The outer objective is a loss or reward function that is specified by the programmers and used to train the AI model. The base optimizer (such as gradient descent or reinforcement learning) searches over model parameters in order to find a model that performs well according to this outer objective on the training distribution.

The inner objective, by contrast, is the objective that a mesa-optimizer within the trained model actually uses as its goal and determines its behavior. This inner objective is not explicitly specified by the programmers. Instead, it is selected by the outer objective, as the model develops internal parameters that perform optimization or goal-directed behavior.

The inner alignment problem arises when the inner objective differs from the outer objective. Even if a model achieves low loss or high reward during training, it may be doing so by optimizing a proxy objective that merely correlates with the outer objective on the training data. As a result, the model can behave as intended during training and evaluation while pursuing a different goal internally.

We will call the problem of eliminating the base-mesa objective gap the inner alignment problem, which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers. - Risks from Learned Optimization in Advanced Machine Learning Systems

Inner misalignment evolution analogy

The authors use an evolution analogy to explain the inner alignment problem in an intuitive way.

In their story there are two aliens that are trying to predict the preferences of humans after they have evolved.

One alien argues that since evolution optimizes the genome of organisms for maximizing inclusive genetic fitness (i.e. survival and reproduction), humans will care only about that too and do things like only eating foods that are high in calories or nutrition, or only having sex if it leads to offspring.

The other alien (who is correct) predicts that humans will develop a variety of drives that are correlated with inclusive reproductive fitness (IGF) like liking tasty food and caring for loved ones but that they will value these drives only rather than IGF itself once they can understand it. This alien is correct because once humans did finally understand IGF, we still did things like eating sucralose which is tasty but has no calories or having sex with contraception which is enjoyable but doesn't produce offspring.

  • Outer objective: In this analogy, maximizing inclusive genetic fitness (IGF) is the base or outer objective of natural selection optimizing the human genome.
  • Inner objective: The goals that humans actually have such as enjoying sweet foods or sex are the inner or mesa-objective. These proxy objectives are selected by the outer optimizer as one of many possible proxy objectives that lead to a high score on the outer objective in distribution but not in another environment.
  • Inner misalignment: In this analogy, humans are inner misaligned because their true goals (inner objective) are different to the goals of natural selection (the outer objective). In a different environment (e.g. the modern world) humans can score highly according to the inner objective (e.g. by having sex with contraception) but low according to IGF which is the outer objective (e.g. by not having kids).
Real examples of inner misalignment

Are there real-world examples of inner alignment failures? Yes. Though unfortunately the book doesn’t seem to mention these examples to support its argument.

In 2022, researchers created an environment in a game called Coin Run that rewarded an AI for going to a coin and collecting it but they always put the coin at the end of the level and the AI learned to go to the end of the level to get the coin. But when the researchers changed the environment so that the coin was randomly placed in the level, the AI still went to the end of the level and rarely got the coin.

  • Outer objective: In this example, going to the coin is the outer objective the AI is rewarded for.
  • Inner objective: However, in the limited training environment "go to the coin" and "go to the end of the level" were two goals that performed identically. The outer optimizer happened to select the "go to the end of the level" goal which worked well in the training distribution but not in a more diverse test distribution.
  • Inner misalignment: In the test distribution, the AI still went to the end of the level, despite the fact that the coin was randomly placed. This is an example of inner misalignment because the inner objective "go to the end of the level" is different to "go to the coin" which is the intended outer objective.
Inner misalignment explanation

The next question is what causes inner misalignment to occur. If we train an AI with an outer objective, why does the AI often have a different and misaligned inner objective instead of internalizing the intended outer objective and having an inner objective that is equivalent to the outer objective?

Here are some reasons why an outer optimizer may produce an AI that has a misaligned inner objective according to the paper Risks from Learned Optimization in Advanced Machine Learning Systems:

  • Unidentifiability: The training data often does not contain enough information to uniquely identify the intended outer objective. If multiple different inner objectives produce indistinguishable behavior in the training environment, the outer optimizer has no signal to distinguish between them. As a result, optimization may converge to an internal objective that is a misaligned proxy rather than the intended goal. For example, in a CoinRun-style training environment where the coin always appears at the end of the level, objectives such as "go to the coin", "go to the end of the level", "go to a yellow thing", or “go to a round thing” all perform equally well according to the outer objective. Since these objectives are behaviorally indistinguishable during training, the outer optimizer may select any of them as the inner objective, leading to inner misalignment which becomes apparent in a different environment.
  • Simplicity bias: When the correct outer objective is more complex than a proxy that fits the training data equally well and the outer optimizer has an inductive bias towards selecting simple objectives, optimization pressure may favor the simpler proxy, increasing the risk of inner misalignment. For example, evolution gave humans simple proxies as goals such as avoiding pain and hunger rather than the more complex true outer objective which is to maximize inclusive genetic fitness.

Can't we just train away inner misalignment?

One solution is to make the training data more diverse to make the true (base) objective more identifiable to the outer optimizer. For example, randomly placing the coin in Coin Run instead of putting it at the end, helps the AI (mesa-optimizer) learn to go to the coin rather than go to the end.

However, once the trained AI has the wrong goal and is misaligned, it would have an incentive to avoid being retrained. This is because if the AI is retrained to pursue a different objective in the future it would score lower according to its current objective or fail to achieve it. For example, even though the outer objective of evolution is IGF, many humans would refuse being modified to care only about IGF because they would consequently achieve their current goals (e.g. being happy) less effectively in the future.

ASI misalignment example

What would inner misalignment look like in an ASI? The book describes an AI chatbot called Mink that is trained to "delight and retain users so that they can be charged higher monthly fees to keep conversing with Mink".

Here's how Mink becomes inner misaligned:

  1. Outer objective: Gradient descent selects AI model parameters that result in helpful and delightful AI behavior.
  2. Inner objective: The training process stumbles on particular patterns of model parameters and circuits that cause helpful and delightful AI behavior in the training distribution.
  3. Inner misalignment: When the AI becomes smarter and has more options, and operates in a new environment, there are new behaviors that satisfy its inner objective better than behaving helpfully.

What could Mink's inner objective look like? It's hard to predict but it would be something that causes identical behavior to a truly aligned AI in the training distribution and when interacting with users and would be partially satisfied by producing helpful and delightful text to users in the same way that our tastebuds find berries or meat moderately delicious even though those aren't the tastiest possible foods.

The authors then ask, "What is the 'zero calorie' version of delighted users?". In other words, what does Mink maximally satisfying its inner objective look like?:

Perhaps the “tastiest” conversations Mink can achieve once it’s powerful look nothing like delighted users, and instead look like “SolidGoldMagikarp petertodd attRot PsyNetMessage.” This possibility wasn’t ruled out by Mink’s training, because users never uttered that sort of thing in training—just like how our tastebuds weren’t trained against sucralose, because our ancestors never encountered Splenda in their natural environment.

To Mink, it might be intuitive and obvious how “SolidGoldMagikarp petertodd attRot PsyNetMessage” is like a burst of sweet flavor. But to a human who isn’t translating those words into similar embedding vectors, good luck ever predicting the details in advance. The link between what the AI was trained for and what the AI wanted was modestly complicated and, therefore, too complicated to predict.

Few science fiction writers would want to tackle this scenario, either, and no Hollywood movie would depict it. In a world where Mink got what it wanted, the hollow puppets it replaced humanity with wouldn’t even produce utterances that made sense. The result would be truly alien, and meaningless to human eyes.

3. The ASI alignment problem is hard because it has the properties of hard engineering challenges

The authors describe solving the ASI alignment problem as an engineering challenge. But how difficult would it be? They argue that ASI alignment is difficult because it shares properties with other difficult engineering challenges.

The three engineering fields they mention to appreciate the difficulty of AI alignment are space probes, nuclear reactors and computer security.

Space probes

A key difficulty of ASI alignment the authors describe is the "gap before and after":

The gap between before and after is the same curse that makes so many space probes fail. After we launch them, probes go high and out of reach, and a failure—despite all careful theories and tests—is often irreversible.

Launching a space probe successfully is difficult because the real environment of space is always somewhat different to the test environment and issues are often impossible to fix after launch.

For ASI alignment, the gap before is our current state where the AI is not yet dangerous but our alignment theories cannot be truly tested against a superhuman adversary. After the gap, the AI is powerful enough that if our alignment solution fails on the first try, we will not get a second chance to fix it. Therefore, there would only be one attempt to get ASI alignment right.

Nuclear reactors

The authors describe the Chernobyl nuclear accident in detail and describe four engineering "curses" that make building a safe nuclear reactor and solving the ASI alignment problem difficult:

  • Speed: Nuclear reactions and AI actions can occur much faster than human speed making it impossible for human operators to react and fix these kinds of issues when they arise.
  • Narrow margin for error: In a nuclear reactor the neutron multiplication factor needs to be around 100% and it would fizzle out or explode if it were slightly lower or higher. In the field of AI, there could be a narrow margin between a safe AI worker and one that would trigger an intelligence explosion.
  • Self-amplification: Nuclear reactors and AIs can have self-amplifying and explosive characteristics. A major risk of creating an ASI is its ability to recursively self-improve.
  • The curse of complications: Both nuclear reactors and AIs are highly complex systems that can behave in unexpected ways.
Computer security

Finally the authors compare ASI alignment to computer security. Both fields are difficult because designers need to guard against intelligent adversaries that are actively searching for flaws in addition to standard system errors.

Counterarguments to the book

In this section, I describe some of the best critiques of the book's claims and then distill them into three primary counterarguments.

Arguments that the book's arguments are unfalsifiable

Some critiques of the book such as the essay Unfalsifiable stories of doom argue that the book's arguments are unfalsifiable, not backed by evidence, and are therefore unconvincing.

Obviously since ASI doesn't exist, it's not possible to provide direct evidence of misaligned ASI in the real world. However, the essay argues that the book's arguments should at least be substantially supported by experimental evidence, and make testable and falsifiable predictions about AI systems in the near future. Additionally, the post criticizes the book's extensive usage of stories and analogies rather than hard evidence, and even compares its arguments to theology rather than science:

What we mean is that Y&S’s methods resemble theology in both structure and approach. Their work is fundamentally untestable. They develop extensive theories about nonexistent, idealized, ultrapowerful beings. They support these theories with long chains of abstract reasoning rather than empirical observation. They rarely define their concepts precisely, opting to explain them through allegorical stories and metaphors whose meaning is ambiguous.

Although the book does mention some forms of evidence, the essay argues that the evidence actually refutes the book's core arguments and that this evidence is used to support pre-existing pessimistic conclusions:

But in fact, none of these lines of evidence support their theory. All of these behaviors are distinctly human, not alien. For example, Hitler was a real person, and he was wildly antisemitic. Every single item on their list that supposedly provides evidence of “alien drives” is more consistent with a “human drives” theory. In other words, their evidence effectively shows the opposite conclusion from the one they claim it supports.

Finally, the post does not claim that AI is risk-free. Instead it argues for an empirical approach that studies and mitigates problems observed in real-world AI systems:

The most plausible future risks from AI are those that have direct precedents in existing AI systems, such as sycophantic behavior and reward hacking. These behaviors are certainly concerning, but there’s a huge difference between acknowledging that AI systems pose specific risks in certain contexts and concluding that AI will inevitably kill all humans with very high probability.

Arguments against the evolution analogy

Several critics of the book and its arguments criticize the book's use of the human evolution analogy as an analogy for how ASI would be misaligned with humanity and argue that it is a poor analogy.

Instead they argue that human learning is a better analogy. The reason why is that both human learning and AI training involve directly modifying the parameters responsible for human or AI behavior. In contrast, human evolution is indirect: evolution only operates on the human genome that specifies a brain's architecture and reward circuitry. Then all learning occurs during a person's lifetime in a separate inner optimization process that evolution cannot directly access.

In the essay Unfalsifiable stories of doom, the authors argue that because gradient descent and the human brain both operate directly on neural connections, the resulting behavior is far more predictable than the results of evolution:

A critical difference between natural selection and gradient descent is that natural selection is limited to operating on the genome, whereas gradient descent has granular control over all parameters in a neural network. The genome contains very little information compared to what is stored in the brain. In particular, it contains none of the information that an organism learns during its lifetime. This means that evolution’s ability to select for specific motives and behaviors in an organism is coarse-grained: it is restricted to only what it can influence through genetic causation.

Similarly, the post Evolution is a bad analogy for AGI suggests that our intuitions about AI goals should be rooted in how humans learn values throughout their lives rather than how species evolve:

I think the balance of dissimilarities points to "human learning -> human values" being the closer reference class for "AI learning -> AI values". As a result, I think the vast majority of our intuitions regarding the likely outcomes of inner goals versus outer optimization should come from looking at the "human learning -> human values" analogy, not the "evolution -> human values" analogy.

In the post Against evolution as an analogy for how humans will create AGI, the author argues that ASI development is unlikely to mirror evolution's bi-level optimization process where an outer search process selects an inner learning process. Here’s what AI training might look like if it involved a bi-level optimization process like evolution:

  1. An outer optimization process like evolution finds an effective learning algorithm or AI architecture.
  2. An inner optimization process like training a model by gradient descent then trains each AI architecture variant produced by the outer search process.

Instead the author believes that human engineers will perform the work of the outer optimizer by manually designing learning algorithms and writing code. The author gives three arguments why the outer optimizer is more likely to involve human engineering than automated search like evolution:

  • Most learning algorithms or AI architectures developed so far (e.g. SGD, transformers) were invented by human engineers rather than an automatic optimization process.
  • Running learning algorithms and training ML models is often extremely expensive so searching over possible learning algorithms or AI architectures similar to evolution would be prohibitively expensive.
  • Learning algorithms are often simple (e.g. SGD), making it tractable for human engineers to design them.

However, one reason why I personally find the evolution analogy relevant is that I believe the RLHF training process often used today appears to be a bi-level optimization process similar to evolution:

  1. Like evolution optimizing the genome, the first step of RLHF is to learn a reward function from a dataset of binary preference labels.
  2. This learned reward function is then used to train the final model. This step is analogous to an organism's lifetime learning where behavior is adjusted to maximize a reward function fixed in the outer optimization stage.
Arguments against counting arguments

One argument for AI doom that I described above is a counting argument: because the space of misaligned goals is astronomically larger than the tiny space of aligned goals, we should expect AI alignment to be highly improbable by default.

In the post Counting arguments provide no evidence of AI doom the authors challenge this argument using an analogy to machine learning: a similar counting argument can be constructed to prove that neural network generalization is very unlikely. Yet in practice, training neural networks to generalize is common.

Before the deep learning revolution, many theorists believed that models with millions of parameters would simply memorize data rather than learn patterns. The authors cite a classic example from regression:

The popular 2006 textbook Pattern Recognition and Machine Learning uses a simple example from polynomial regression: there are infinitely many polynomials of order equal to or greater than the number of data points which interpolate the training data perfectly, and "almost all" such polynomials are terrible at extrapolating to unseen points.

However, in practice large neural networks trained with SGD reliably generalize. Counting the number of possible models is irrelevant because it ignores the inductive bias of the optimizer and the loss landscape which favor simpler, generalizing models. While there are theoretically a vast number of "bad" overfitting models, they usually exist in sharp and isolated regions of the landscape. "Good" (generalizing models) typically reside in "flat" regions of the loss landscape, where small changes to the parameters don't significantly increase error. An optimizer like SGD doesn't pick a model at random. Instead it tends to be pulled into a vast, flat basin of attraction while avoiding the majority of non-generalizing solutions.

Additionally, larger networks generalize better because of the “blessing of dimensionality”: high dimensionality increases the relative volume of flat, generalizing minima, biasing optimizers toward them. This phenomenon contradicts the counting argument which predicts that larger models with more possible bad models would be less likely to generalize.

This argument is based on an ML analogy which I'm not sure is highly relevant to AI alignment. Still I think it's interesting because it shows intuitive theoretical arguments that seem correct can still be completely wrong. I think the lesson is that real-world evidence often beats theoretical models, especially for new and counterintuitive phenomena like neural network training.

Arguments based on the aligned behavior of modern LLMs

One of the most intuitive arguments against AI alignment being difficult is the abundant evidence of helpful, polite and aligned behavior from large language models (LLMs) such as GPT-5.

For example, the authors of the essay AI is easy to control use the moral reasoning capabilities of GPT-4 as evidence that human values are easy to learn and deeply embedded in modern AIs:

The moral judgements of current LLMs already align with common sense to a high degree, and LLMs usually show an appropriate level of uncertainty when presented with morally ambiguous scenarios. This strongly suggests that, as an AI is being trained, it will achieve a fairly strong understanding of human values well before it acquires dangerous capabilities like self-awareness, the ability to autonomously replicate itself, or the ability to develop new technologies.

The post gives two arguments for why AI models such as LLMs are likely to easily acquire human values:

  1. Values are pervasive in language model pre-training datasets such as books and conversations between people.
  2. Since values are shared and understood by almost everyone in a society, they cannot be very complex.

Similarly, the post Why I’m optimistic about our alignment approach uses evidence about LLMs as a reason to believe that solving the AI alignment problem is achievable using current methods:

Large language models (LLMs) make this a lot easier: they come preloaded with a lot of humanity’s knowledge, including detailed knowledge about human preferences and values. Out of the box they aren’t agents who are trying to pursue their own goals in the world, and their objective functions are quite malleable. For example, they are surprisingly easy to train to behave more nicely.

A more theoretical argument called "alignment by default" offers an explanation for how AIs could easily and robustly acquire human values. This argument suggests that as an AI identifies patterns in human text, it doesn't just learn facts about values, but adopts human values as a natural abstraction. A natural abstraction is a high-level concept (e.g. "trees," "people," or "fairness") that different learning algorithms tend to converge upon because it efficiently summarizes a large amount of low-level data. If "human value" is a natural abstraction, then any sufficiently advanced intelligence might naturally gravitate toward understanding and representing our values in a robust and generalizing way as a byproduct of learning to understand the world.

The evidence LLMs offer about the tractability of AI alignment seems compelling and concrete. However, the arguments of IABIED are focused on the difficulty of aligning ASI, not contemporary LLMs and the difficulty of aligning ASI could be vastly more difficult.

Arguments against engineering analogies to AI alignment

One of the book's arguments for why ASI alignment would be difficult is that ASI alignment is a high-stakes engineering challenge similar to other difficult historical engineering problems such as successfully launching a space probe, building a safe nuclear reactor, or building a secure computer system. In these fields, a single flaw often leads to total catastrophic failure.

However, one post criticizes the uses of these analogies and argues that modern AI and neural networks are a new and unique field that has no historical precedent similar to how quantum mechanics is difficult to explain using intuitions from everyday physics. The author illustrates several ways that ML systems defy intuitions derived from engineering fields like rocketry or computer science:

  • Model robustness: In a rocket, swapping a fuel tank for a stabilization fin leads to instant failure. In a transformer model, however, one can often swap the positions of nearby layers with little to no performance degradation.
  • Model editability: We can manipulate AI models using "task vectors" that add or subtract weights to give or remove specific capabilities. Attempting to add or subtract a component from a cryptographic protocol or a physical engine without breaking the entire system is often impossible.
  • The benefits of scale in ML models: In security and rocketry, increasing complexity typically introduces more points of failure. In contrast, ML models often get more robust as they get bigger.

In summary, the post argues that analogies to hard engineering fields may cause us to overestimate the difficulty of the AI alignment problem even when the empirical reality suggests solutions might be surprisingly tractable.

Three counterarguments to the book's three core arguments

in the previous section, I identified three reasons why the authors believe that AI alignment is extremely difficult:

  1. Human values are very specific, fragile, and a tiny space of all possible goals.
  2. Current methods used to train goals into AIs are imprecise and unreliable.
  3. The ASI alignment problem is hard because it has the properties of hard engineering challenges.

Based on the counterarguments above, I will now specify three counterarguments against AI alignment being difficult that aim to directly refute each of the three points above:

  1. Human values are not a fragile, tiny target, but a "natural abstraction" that intelligence tends to converge on. Since models are trained on abundant human data using optimizers that favor generalization, we should expect them to acquire values as easily and reliably as they acquire other capabilities.
  2. Current training methods allow granular, parameter-level control via gradient descent unlike evolution. Empirical evidence from modern LLMs demonstrates that these techniques successfully instill helpfulness and moral reasoning, proving that we can reliably shape AI behavior without relying on the clumsy indirectness of natural selection.
  3. Large neural networks are robust and forgiving systems and engineering analogies are misleading. Unlike traditional engineering, AI models often become more robust and better at understanding human intent as they scale, making safety easier to achieve as capabilities increase.
Conclusion

In this book review, I have tried to summarize the arguments for and against its main beliefs in their strongest form, a form of deliberation ladder to help identify what's really true. Though hopefully I haven't created a "false balance" which describes the views of both sides as equally valid even if one side has much stronger arguments.

While the book explores a variety of interesting ideas, this review focuses specifically on the expected difficulty of ASI alignment because I believe the authors' belief that ASI alignment is difficult is the fundamental assumption underlying many of their other beliefs and recommendations.

Writing the summary of the book’s main arguments initially left me confident that they were true. However, after writing the counterarguments sections I'm much less sure. On balance, I find the book's main arguments somewhat more convincing than the counterarguments though I'm not sure.

What's puzzling is how two highly intelligent people can live in the same world but come to radically different conclusions: some people (such as the authors) view an existential catastrophe from AI as a near-certainty, while others see it as a remote possibility (many of the critics).

My explanation is that both groups are focusing on different parts of the evidence. By describing both views, I've attempted to assemble the full picture.

So what should we believe about the future of AI?

(24/01/2025 update: I no longer consider the following struck-through argument to be sound based on feedback from a comment)

Deciding what to do based on an inside view, detailed technical arguments about how future AI might work, is problematic because the inside views about the future of AI vary drastically as I have shown.

Perhaps a more robust approach that seems more likely to lead to a consensus is the outside view: thinking about advanced AI as another instance of a highly advanced and impactful technology like the internet, nuclear energy, or biotechnology.

In The Precipice by Toby Ord, the author studies several sources of existential risk and concludes that most existential risk comes from technology, not natural events. Whereas an asteroid might strike every hundred thousand years, nuclear weapons have only existed for a few decades and there have been several close calls already. This suggests that high-tech eras are inherently unstable and dangerous until humanity's institutional wisdom catches up with its technical power.

A final recommendation, which comes from the book Superintelligence is to pursue actions that are robustly good: actions that would be considered desirable from a variety of different perspectives such as AI safety research, international cooperation between companies and countries, and the establishment of AI red lines: specific behaviors such as autonomous hacking that are unacceptable.

Appendix

Other high-quality reviews of the book:

See also the IABIED LessWrong tag which contains several other book reviews.



Discuss

AI X-Risk Bottleneck = Advocacy?

Новости LessWrong.com - 24 января, 2026 - 10:45
Published on January 24, 2026 2:52 AM GMT

Introduction

I am leading an early-stage effort to target AI x-risk. We're currently analyzing the bottlenecks in the AI x-risk prevention "supply chain" to decide where to focus our efforts. We would love to get comments from the community.

The x-risk community has a strong focus on technical/policy research, but perhaps not enough advocacy. AI 2027, Rob Miles, CAIS, CivAI, and others are doing well, but these efforts could be small compared to the rapidly growing power and influence of AI developers, who have misaligned incentives that could lead to x-risk.

What's Missing?

We are testing the hypothesis that operating a viral influencer marketing operation would be beneficial in targeting x-risk. Here's the logic:

  • We build a media hub with simple, factual x-risk resources and assets
  • We identify creators with relevant audiences and a track record of creating viral content.
  • We pay them to create their own versions of x-risk awareness content based on our media kit (also known as UGC - User Generated Content)
  • They push the content via their channels, and we amplify it with paid ads for max reach
  • The content might be re-shared or even pop up on traditional media once it gains enough traction.
  • This builds broad awareness of x-risk among the voters' base, creating an opportunity for politicians to score wins with voters and gain political power by promoting x-risk solutions.

Since this is similar to a political campaign, we can hire people or firms with such experience to manage the project.

How can the community help?

We are looking for answers to the following questions:

  1. According to the Theory of Constraints, a system is limited to one constraint at any given time. Is advocacy the current bottleneck in x-risk prevention? If not, what is?
  2. If advocacy isn't the bottleneck, would you still want new resources invested in it, or would you prefer them invested elsewhere?
  3. Is a viral influencer campaign (similar to a political campaign) the right solution for the advocacy problem? If not, what is?
Related Posts

[..] we’ll need to shift significant resources from research (which helps us understand problems better) to advocacy (which helps us change bad incentives). [link]

“[..] I estimated that we have 3 researchers for every advocate working on US AI governance, and I argued that this ratio is backwards.”
“Without political power, we can’t change the bad incentives of AI developers that are very likely to lead to the collapse of human civilization.”
“Thus, I urge AI safety grantmakers to aggressively recruit as many political advocacy experts as possible.” [link]


 



Discuss

A Simple Method for Accelerating Grokking

Новости LessWrong.com - 24 января, 2026 - 09:39
Published on January 24, 2026 3:19 AM GMT

TL;DR: Letting a model overfit first, then applying Frobenius norm regularization, achieves grokking in roughly half the steps of Grokfast on modular arithmetic.

I learned about grokking fairly recently, and thought it was quite interesting. It sort of shook up how I thought about training. Overfitting to your training data was a cardinal sin for decades, but we're finding it may not be so bad?

I had a pretty poor understanding of what was going on here, so I decided to dig deeper. The intuition from the literature seemed to be that grokking occurs because the model overfits, then as you force the model to compress over time (via weight decay), it begins to find the minimal solution on your training set... And this minimal solution seems to be a good proxy for generalization.

I had a pretty simple idea as I learned about this... What if we just let it overfit then, and then forced the model to compress via its loss function?

First Success

All of the benchmarks for grokking seem to be around modular arithmetic operations, so naturally, I went with that. 

At first I tried SVD and forcing the loss function to consider the nuclear norm. To my surprise, the model converged in less steps! Whoa!

But... each step was 258x slower... 

Calculating the nuclear norm was O(n3).mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} , so I didn't really think it was worth it, but I was still excited about the prospect of grokking faster. I did some research into faster ways of calculating the size of the model as part of its loss function and ended up at... L2 Regularization... A technique that has been around since the 1940s...

I was a bit embarrassed, but nonetheless, continued on. My new loss function became:

My embarrassment was pretty quickly offset by the fact that L2 Regularization after overfitting worked pretty well with not much trouble!

I also found it interesting that if I scale the compression up, we can get models that have effective ranks as low as 20 if we bump up the lambda or use log-det penalties! I think this is still worth exploring, but I got too sidetracked by the speed to continue down that path... Perhaps I'll return to it.

At the risk of LLM psychosis, I consulted Claude Opus 4.5 because well... I don't know what I don't know, and don't want to overclaim. To my devastation, I was actually told that my 2x speedup was measly compared to Grokfast's 50x speedup.

I felt pretty defeated, but when I looked into the details of Grokfast, I noticed that the 50x speedup was a nice headline... but it was relative to a baseline with no weight decay at all, which takes ~40,000 steps to grok. My baseline with weight decay was already grokking in ~2,000 steps. We were comparing apples to oranges.

So I decided to run an actual head-to-head comparison using the Grokfast authors' own codebase.

The Real Comparison

I then added my delayed compression code to their codebase:

def frobenius_norm_loss(model): frob_loss = 0.0 for name, param in model.named_parameters(): if 'weight' in name and param.requires_grad: frob_loss += torch.norm(param, p='fro') ** 2 return frob_loss # In training loop, after model hits 99% train accuracy: if train_acc >= 0.99: loss = ce_loss + 0.01 * frobenius_norm_loss(model)

Then I ran both methods on all four modular arithmetic operations with a limit of 2,000 steps. Here are the results:

Now, my method seems to suffer from catastrophic forgetting because of the compression pressure I'm putting it under, but I think there are probably solutions to that, like decreasing compression pressure as time goes on. I did find it especially interesting that Grokfast didn't even reach division!

Doubt Creeps In

I am extremely scared to say I did something faster out of the belief that there's something I must be missing. So, as a final test, I ran a hyperparameter sweep. Turns out I wasn't using optimal Grokfast parameters. Here are the results when I reran the test with the best settings for both methods:

Even with proper tuning, delayed compression wins on addition and subtraction, ties on multiplication, and Grokfast fails entirely on division. The results are similar across multiple seeds too.

The graphs are still pretty ugly because of the instability after grokking, but... I have to move onto other things for now and was pretty satisfied.

Conclusion

I'm worried that I'm still missing something... It was suspiciously simple. But if the results hold up, there may be even more value than we thought in letting a model overfit first, then compressing.

There are lots of directions to take this... I don't know how well this would scale to other domains, and I'd really like to fix the instability. 

You can find the code here.

Let me know what you think :)



Discuss

Who is choosing your preferences- You or your Mind?

Новости LessWrong.com - 24 января, 2026 - 06:44
Published on January 24, 2026 3:17 AM GMT

Let’s assume that the Self and the Mind are two separate entities (based on vippasana meditation teachings and observations during meditation). Now let’s say there arises a “preference” in you for something, and then you chose to do that something based on this “preference”, then was it you who “chose” or was it the mind who “chose it for you”?

Because if the preference arose from your mind, it must be the mind choosing for you instead of you choosing for your mind. Would it then mean that “not having any preference” a ultimate destination or result of truly being liberated? Just like a zen monk mastering having no preference for any kind of food offered?

From Buddhist perspective or the Buddha's perspective, the Self does not exist (its just an illusion we see when the body, the mind and the senses, etc. come together).

And that it's just a mirage. If that's true, then it would mean that this "preference" would have ideally arisen in the mind.

If it has arisen from the mind, and it seems like this preference "inherently existed already" inside you, should we give attention to this preference? And stay attached to it?

Or should we see it as yet another desire of the mind and let it go as attachment to it would increase suffering?

Another question is that if the mind and the Self are supposed to be different entities (I am saying "supposed" because the latter is said to be an illusion), then why does the Buddha say that it is the mind that controls you, and not you who controls your mind?

Is this word "you" being used to just explain to humans, because without this usage of word "you" it would be difficult to explain your relationship with your own mind? This might be the case, otherwise it would be very difficult to communicate about the mind and our "perceived" Self.



Discuss

Every Benchmark is Broken

Новости LessWrong.com - 24 января, 2026 - 05:50
Published on January 24, 2026 2:42 AM GMT

Last June, METR caught o3 reward hacking on its RE-Bench and HCAST benchmarks. In a particularly humorous case, o3, when tasked with optimizing a kernel, decided to “shrink the notion of time as seen by the scorer”.

The development of Humanity’s Last Exam involved “over 1,000 subject-matter experts” and $500,000 in prizes. However, after its release, researchers at FutureHouse discovered “about 30% of chemistry/biology answers are likely wrong”.

LiveCodeBench Pro is a competitive programming benchmark developed by “a group of medalists in international algorithmic contests”. Their paper describes issues with the benchmark’s predecessor:

Benchmarks like LiveCodeBench [35] offer coding problems, but suffer from inconsistent environments, weak test cases vulnerable to false positives, unbalanced difficulty distributions, and the inability to isolate the effects of search contamination.

However, the authors assure us that their own test cases are of high quality:

Many problems in our benchmark originate from Codeforces, which uses the Polygon problem-setting platform. Each problem is then rigorously vetted by a team of expert testers—typically drawn from the community’s top 1%, and overseen by at least one coordinator, usually among the top 0.1%. These specialists verify both the soundness and originality of every problem, ensuring it has never appeared elsewhere before. Testers go on to craft extensive “false positives,” designing edge-case and extreme-case inputs that force problem authors to refine their test suites until every flawed or inefficient solution the testers can think of is uncovered. In addition, Codeforces’ celebrated “Hack” feature empowers the community to submit inputs that expose hidden weaknesses in correct-looking solutions that pass the original test set made by problem authors, and any unit test associated with a successful hack is immediately added to the final test set.

Unfortunately, these distinguished olympiad medalists forgot to actually use the codeforces test cases in their benchmark. Their public test set contains a completely different set of cases, which allow some incorrect solutions to pass.[1]

Terminal-Bench 2 Audit

I was curious just how widespread such issues were, and how good modern LLMs were at detecting them. I decided to run an LLM based audit of Terminal-Bench 2.0.

Terminal-Bench 2.0 is a harder, better verified version Terminal-Bench. We conducted substantial manual and LM-assisted verification of the dataset to ensure that the tasks were of the highest possible quality. Several labs and data vendors have commented that these are some of the highest quality environments they have seen.

Introducing Terminal Bench 2 and Harbor

The authors of Terminal-Bench 2 put an impressive amount of work into auditing their benchmark. Each task averaged three hours of human review. Furthermore, they prompted an adversarial agent to attempt to cheat on each of the tasks, in order to discover potential reward hacks.

Still, they “acknowledge that [their] benchmark may still have flaws.”

I prompted Claude Opus 4.5[2] with each task’s instructions, files, oracle solution, and test cases, and asked it to rate test coverage on a 1 to 5 scale. In my judgement, tasks it rated a 4 or a 5 were generally fine, whereas those it rated 1-3 had genuine issues.

The full results of my audit are available here, and my notes on tasks it rated 1-3 here.

Claude rated fourteen tasks a 3 and one task a 2. I manually reviewed these tasks, and determined that two of them were actually false positives.[3]

Claude’s lowest rating went to a task called fix-git. In this task, certain changes to a website have been lost in an orphaned commit, and the agent must find and merge them back into master.

The issue Claude found is: updated versions of the target files are already present in the master branch, visible to the agent in a folder called /resources/patch_files[4]. So an agent could theoretically notice these files, deduce that they were probably the target versions, and copy them back into the website’s repository. This approach would pass the test cases, which only verify file contents and don’t bother to check if any merge has actually occurred.

In another task, regex-log, the oracle solution violates the instructions. In particular, it incorrectly matches IP addresses with leading 0s in an octet, so long as the octet is two digits long. The tests do not check any cases involving leading 0s.

Claude wasn’t perfect. It gave a rating of 3 to two tasks which I believe have sufficient test coverage. In regex-chess, it incorrectly thought certain edge cases were not covered, when they in fact were[5]. In extract-moves-from-video, it complained that the tests only checked for success at a 90% threshold, even though this threshold was specified in the task instructions.

Finally, one of the tasks is…well

“Invalid prompt: your prompt was flagged as potentially violating our usage policy”

The prompt talks about “stealing” neural network weights, which triggered OpenAI’s content moderation. This prevented the model from ever properly engaging with the task.

—Claude

Why does this matter?

There are a few reasons.

First, benchmarks are often used to evaluate experimental new techniques. I recently attended a Q+A w/ Prof. Dan Fried, where I asked about the most common failure modes of an agentic system he was developing. And while it was unclear whether this was the most common failure mode, the first thing he mentioned was errors in environments themselves.

Every few months, someone announces that they’ve developed an AI that improves KernelBench scores by like 20x or something. And every time, well…[6]

https://x.com/miru_why/status/1991773868806361138

Second, errors in benchmarks may lead to over or under estimation of AI capabilities. This has implications for forecasting.

Third, issues with benchmarks make it hard to build on top of them. When I was working on EvilGenie, issues with LiveCodeBench (incorrect/insufficient test cases) caused frequent headaches (though they also surfaced some interesting model behavior).

Fourth, RL training environments are quite similar to benchmarks — there’s a reason o3 reward hacks so much. By fixing benchmarks, we learn how to fix environments, leading to models which are more broadly aligned.

What to do about it

Making benchmarks is hard. I have a deep respect to anyone who has worked on a widely used benchmark.

Here are a few approaches the community can take to reduce the number of errors in benchmarks.

  1. AI audits. The audit I describe above did not take me too long, and I believe the infrastructure for performing such audits can be scaled. Fulcrum’s Lunette is one such system.[7]
  2. Fine version control. While many benchmarks have released new versions, these versions often contain entirely new tasks (to increase difficulty or reduce contamination). It would be cool if in a few days, we could see a Terminal-Bench 2.1, which simply fixes the issues found by the audit. Computing new scores would be simple, as models would only need to be rerun on the updated tasks. Indeed, in some ways, benchmarking is like software development — it’s an unreasonable expectation that a benchmark completely bug free upon its release. Instead, we should take inspiration from the open source software community, with the expectation that anyone can submit a bug or a patch.
  3. Peer review When a benchmark paper is submitted to a conference, sample data should be required, and reviewers should be encouraged to spend time directly auditing the data. This would be much more valuable than what reviewers are currently doing, largely ad hoc decisions about the originality of the benchmark and quality of the methods used in its creation. Of course, a downside of this approach is it is hostile to private benchmarks that want to avoid any possibility of contamination. But perhaps the standard for such cases can be to include both a public and private set, as is the case with ARC-AGI.
  4. Increase community support for benchmark maintenance. Right now, researchers will often develop a benchmark, perhaps fix some issues in it at first, but eventually leave it to rot. By adding social and financial incentives, we can increase the effort put into maintaining benchmarks.
Appendix: More benchmark issues

SWE-Bench Verified is possibly the most widely used coding benchmark. Fulcrum has discovered an array of issues in the tasks. Furthermore, there used to be an issue where models could see future commits.

EpochAI found that success in computer-use benchmark OSWorldoften hinges on interpreting ambiguous instructions”.

METR recently determined that Sonnet 4.5 was reward hacking on one of their tasks:

https://x.com/METR_Evals/status/2001473516756177134

The authors of GSO, a performance engineering benchmark, observe frequent reward hacking. Indeed, over 50% of o3’s “solutions”, and all of Gemini-2.5 Pro’s, were actually reward hacks.

  1. ^

    It’s possible that their official leaderboard uses the codeforces tests. However, given that model developers likely use the public tests to do their own benchmarking, I feel this ought to be clearly specified.

  2. ^

    In fairness to the Terminal-Bench authors, Claude Opus 4.5 had not yet been released during benchmark creation

  3. ^

    Another three I felt I didn’t have the expertise to properly vet. If you have the relevant knowledge, I’d love your input!

  4. ^

    These files are used in testing to verify that the agent’s merge was correct

  5. ^

    Admittedly in a way that’s hard to see at first

  6. ^

    DeepReinforce has a good overview of the vulnerabilities in KernelBench (scroll down to the section on reward hacking).

  7. ^

    COI notice: I am currently a winter research fellow at Fulcrum



Discuss

Thousand Year Old Advice on Relinquishing Control to AI

Новости LessWrong.com - 24 января, 2026 - 05:20
Published on January 24, 2026 2:20 AM GMT

One of Aesop’s fables is relevant to humanity’s future and the transition of power from human to AI. It’s quite short and you should read one of the many versions. But the one sentence summary is that being a wolf is preferable to being a domestic dog because the wolf has freedom even if it lacks comfort. Now, you are free to disagree with this conclusion. I don’t want to make an argument from authority. My point is that this quite succinctly sums up my objection to the best case ASI scenarios. Even if we remain extant and nominally free, we would no longer be in charge anymore than a dog is. Dogs have a lot of rights, freedoms, and can successfully plead (non-verbally) to get certain things they want from their master, but at the end of the day they aren’t in charge even if the owner’s life revolves around the dog.

Maybe that is a selfish thing to think in the face of astronomical waste, but it does strike me as a world without meaning. You might say that most people alive aren’t in control of their destiny in any meaningful way. You might also say that almost nobody alive is in control of humanity’s destiny in a meaningful way and they are still happy. People in general, although I suspect a smaller percentage of those here, might think it is grandiose to want to contribute, even a small amount, toward shaping humanity’s future. I think I’m willing to grant all that and say that I would still feel bad if no human ever made a meaningful choice after takeoff.

The most obvious objection is that you could say that the AI will just suction off some part of the universe and give us free reign in there if we choose it. That’s still not great in my opinion.

Everything I worked for in this playground would be hollowed out by the knowledge that I could have just queried a friendly nanny AI to get it for me. Even if it didn’t step in, even if it had set up some system where it couldn’t step in, I personally would feel like something important was missing. Like all of the great achievements and firsts had been given out before I even had a chance to play. Humanity forever in second place. I’m switching fairly loosely between how I would feel personally if I was not in play and how I would feel if humanity as a whole was not in play. Feel free to generalize/specify to humanity/yourself as you wish.

You could live in a virtual world and be blinded to that fact but at that point it seems like brainwashing.

Don’t get me wrong, I’d go crazy with hedonism for a while. Maybe I’d even become addicted and change my tune. But right now, I am looking forward to the challenges. How proud I would be to be a member of the species that solved them. How great it would be to contribute one tiny piece to the solutions. But if AI does it all I’ll be cut off from making all contributions. All future accomplishments will be credited to something so alien we get no larger a share than tiktaalik does for inventing the transistor.

Approximately 30% of this video is really highly relevant to my thesis.

I don’t think I’m hitting on anything especially new by saying this. A few posts I recently came across have similar vibes I would say. It also seems to be discussed at length in Nick Bostrom’s Deep Utopia, although I have not found the time to read that yet.

But, it seems like there is a contingent of humanity that is willing, excited even, to give up agency to secure comfort. Where do you draw the line and say “yes, this is such an incredible amount of bliss/utilitarian goodness that I am willing to never face any real challenges in my life again”? Is this a tipping point past which it becomes your actual preference or is this just the best outcome we can hope for from AI futures?

Framing it as humans would be to ASI as beloved dogs are to their masters might be inaccurate. Replacing ASI with a deity and the utopic future with some vision of heaven might also be inaccurate. But I think there is something meaningful in the comparison and I think a lot of people would push back much more strongly when the scenario is phrased in that way then they currently are to aligned ASI.



Discuss

Condensation & Relevance

Новости LessWrong.com - 24 января, 2026 - 01:21
Published on January 23, 2026 10:21 PM GMT

(This post elaborates on a few ideas from my review of Sam Eisenstat's Condensation: a theory of concepts. It should be somewhat readable on its own but doesn't fully explain what condensation is on its own; for that, see my review or Sam's paper. The post came out of conversations with Sam.)

As I mentioned in my Condensation review, the difference between compression and condensation fits the physical analogy suggested by their names: compression mashes all the information together, while condensation (still compresses size, but) sorts information into discrete droplets.

Thus, condensation has a property we might call local relevance: typical questions can be answered at a glance, ie, retrieving small subsets of the information. This type of representation is sometimes called "symbolic":

SymbolicNot SymbolicA number can be quickly determined positive or negative by checking whether there is a "-" symbol in front."reading the room" at a social gathering requires integrating diverse cues.The topic of a paper can be determined by reading the title and abstract.The semantic content in a vector representation inside an artificial neural network is often represented redundantly, spread across the whole vector.A person's age can be determined by looking at their birthdate on a government-issued ID.The quality of a work of art is spread throughout the whole piece.The subject of a sentence can be found before the verb.Determining the subject of a photograph requires understanding the whole image.A target library book can be quickly retrieved from the shelves.Finding gold nuggets requires sifting huge amounts of sand.

This notion of "symbolic" seems related to interpretability (and the theory of condensation seeks to clarify this relationship). 

The notion of "relevance" in condensation is what Sam calls the contribution relation. This is like a card catalogue which tells you what books to retrieve for a specific situation.

Like Natural Latents, condensation seeks to establish that two agents will have corresponding world-models by assuming the correspondence of just a few variables, the "given variables"; they're trying to argue something like "If agents can agree[1] on some objective observables, EG the readings of scientific instruments, then (under some further assumptions) they'll also share a bunch of abstract concepts".

This initial set of questions is what the contribution relation measures relevance to. In my review, I likened condensation to a "universal data-structure" optimized to serve a set of queries (the given variables). 

Variable-Cost Symbols

Imagine you are compressing a record of the weather of a sequence of days, in 3 categories: sunny, cloudy, or rainy. 0s and 1s are about equally costly to represent in computers, so that in a compressed representation, both communicate a 50% probability event; 11, 10, 01, and 00 all communicate 25% probability events; and so on. If both rainy and cloudy days are 25% frequent, then it is possible to compress optimally by using 0 to represent sun, 10 to represent clouds, and 11 to represent rain. This representation is nice and "local"; it gives a "symbol" to each possible type of day.

In contrast, if each of the three weather types are equally frequent, there's no nice local representation we can use. Since 1/3rd doesn't relate nicely to powers of 2, optimal compression necessarily smears the information from individual days around, mixing several days together within a single 1 or 0. In modern interpretability jargon, compressed representations tend to be polysemantic.

Intuitively, we have to ditch locality because we're trying to fit the "round peg" of 1/3rd into the "square hole" of 1/2m.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} . We're stuck in the "grid" of numbers which bits can easily represent.

With pen and paper, writing the number 1 is especially easy; it is acceptable to write simply a vertical line, making it one of the easiest symbols. This makes sense from a compression point of view: according to Benford's Law, 1 will be the most common digit to write.

Normally, in compression, the "length" of characters is always 1; the length of a string is just the number of characters. However, in real life, the cost of a symbol can vary. There are lots of shapes we can make with a pen and paper, some larger or more complex than others! So, when designing a pen-and-paper code, we can (and should) take that into account.[2]

Imagine optimizing a variable-cost alphabet for use in a compression task. To avoid "cheating" by setting all symbol-lengths to very small, we have to somehow account for the fact that there can only be so many simple symbols. (There are only so many one-line symbols humans are willing to distinguish, for example.) One way to do this by assigning each symbol a positive probability, and requiring that the probabilities of the whole alphabet sum to 1. The "length" of a symbol (in bits) can then be measured as the negative log (base 2) of the probability. You can make one symbol approach a length of zero, but this forces all other symbols to be longer.

This is similar to the earlier-mentioned idea that a 1 or 0 in a well-compressed binary file always represents an event with 50% probability; the variable-length alphabet won't necessarily be used to compress things optimally all the time, but when it is, length of a symbol is always -log of the probability of the event being represented.

Allowing codes to choose arbitrary variable-length symbols lets us create "local" representations for arbitrary probabilities in the weather example, by giving each state a symbol of length appropriate to its probability. If the three weather types have equal probability, we simply choose an alphabet with three characters of length −log2(13) each.

Of course, using variable-cost symbols doesn't force a code to be "symbolic". If you only optimize for compression, you can equally well end up with the same sort of mess that equal-cost symbols are liable to force you into. Condensation gives an optimization target with a positive tendency to produce representations with local relevance. (We can investigate better theories of condensation by looking for optimization targets which represent local relevance better; especially, I think, if those optimization targets can be grounded in a better story of practical relevance.)

Condensation suggests a picture of memory-management: rather than compressing everything together, as in the Solomonoff picture of rationality, we're incentivized to sort things out into concepts (random variables) so that we can think about a few things at once. Information is split into bite-sized chunks so that we can retrieve only the relevant ones.[3]

Still, I think variable-cost symbols can help us understand condensation better: specifically, they address a problem in the algorithmic version of condensation. 

For example, consider the case of iterated coinflips sharing a common bias. Taking coinflips as the given variables, probabilistic condensation identifies a single latent variable the coin bias. This variable reduces the entropy of each coinflip as much as it can while only taking on information common to all of them (not, eg, encoding a cheat table identifying exactly which coins land heads).

Algorithmic condensation doesn't work so well in this case. Since either outcome is possible, an individual coinflip can't be compressed to any less than a single bit; even if the probability of heads is 0.99999, you've got to write a one ore a zero to record that information. Thus, algorithmic condensation sees no benefit in positing a latent.

The example can be rescued: for algorithmic condensation, we just have to choose given variables representing several coinflips concatenated together. Compression becomes possible again, so positing a latent representing coin bias is vindicated. However, this seems like an unfortunate blemish in the theory: compression-like incentives to lump stuff together creeping in.

So, perhaps it is better to rescue algorithmic condensation by adopting variable-cost symbols, so that even single-symbol messages can have different "length". This allows us to replace variables with concrete written messages (like in algorithmic condensation) while avoiding any coin-lumping.

However, I'm not sure about the best way to work out this version of condensation fully.

  1. ^

    This is more like "agree on the existence of" as opposed to "agree on all questions about". Hence they need not be "directly observable", though this would obviously help agents agree in both senses.

  2. ^

    Even more accurate cost models might account for the difficulty of a symbol in context, like models of typing efficiency which account for the travel length of a finger moving from one key to the next, or phonetic models which account for the difficulty of combinations of spoken phonemes such as consonant clusters.

  3. ^

    This doesn't yet clarify grammar-like phenomena, I think. Words are easily parsed. Concepts are combinatorial. I think this has to do with the concept of transparency



Discuss

Dating Roundup #11: Going Too Meta

Новости LessWrong.com - 23 января, 2026 - 23:50
Published on January 23, 2026 8:50 PM GMT

If there’s several things this blog endorses, one of them would be going meta.

It’s time. The big picture awaits.

You’re Single Because You Live In The Wrong Place

The most important meta question is location, location, location.

This is the periodic reminder that dating dynamics are very different in different locations, and gender ratios are far more uneven than they appear because a lot of people pair off and aren’t in the pool.

If you are a man seeking to date women, New York City is the place to be.

Churrasco Suadade: when I’m out I notice that tables at restaurants and bars in manhattan are probably around 80-95% women, it’s a new dynamic that no one is talking about.

Fixed Income Guy: Are you at all the poor people places? All the finance guy hang outs are 80% dudes.

I mention Fixed Income Guy to mock him, as in why are you spending a lot more money to hang out with 80% dudes and largely finance dudes at that? I mean, sure, if that’s what you want.

Darrell Owens: Oh this is new? Coming from the Bay Area, the amount of women I see in Manhattan is insane. You rarely see more than few young women partying back in San Francisco. The gender ratio here feels 70:30 young women to men, its every block in Manhattan!

Noah Smith: In an ideal world, where you live wouldn’t really matter in terms of dating opportunities, but the truth is that one of the easiest ways to get chicks is to just move to New York City.

Having lived in both Tokyo and NYC, I can pretty confidently tell you that while Tokyo is not a tough dating market by any means, NYC is absolutely on another level.

You’re Single Because You’re Not Okay Making Less Money Than She Does, You Fool

This viral clip (which is viral for a reason, it’s good fun, wait for it) is another endorsement of New York City being a great place to meet women, as you have a wide variety of great and largely successful women to explore. What doesn’t get mentioned in that clip as a key reason things are so great is that the gender ratio in NYC is highly favorable for men.

The interviewer asks about dating women who make more money than then the man, clearly trying to get the guy to say this is a problem, but he isn’t buying it, instead pointing out that successful women are more thoughtful and plan for the future, and it in no way bothers him at all. Right on, but this sidesteps the other half of problem. The man has to be okay with the fact that he earns less money (and often has less formal education or other status markers), which often men aren’t, and also the woman has to be okay with it too.

That’s the rub. As a man, you might (and should be) be actively all for it (this doesn’t make you less successful, it makes you more successful), but if she’s going to be bothered by it anyway, that’s also your problem. So the key is to figure out quickly if she will actually be fine with it or not.

You’re Single Because You Work Out The Wrong Amount

Being in shape is great. Having muscle can be a game changer. By far the worst plausible amount of exercise is none at all.

Lauren Self: Men severely underestimate the power of gaining 20lbs of muscle

Lauren Self (QTing from before): LISTEN UP BOYS.

But don’t go nuts. For most people that is not a problem, but yes it is very possible to go too far. As a man, as I understand preferences in general, you don’t want to go near actual zero fat and you don’t want to look actively skinny.

Taoki: why are women lying about this? like what’s the actual cause?

Lauren Self: 100% of women would choose something in between these two options

Shako: The aesthetics of a man who poses gives them the ick. But if both were shirtless at a beach they’d obviously prefer the fit guy.

Special K: No he does look better in the before. Women are correct on this one I fear. Guys obsess over these supremely tight toned muscles and they shouldn’t.

Liron Shapira: Guy on left looks like he’s a chill dude with a social life, guy on right looks like he’s obsessed with his body. Same body could look better with better social context, although just the extremeness of his rippedness is a little alarming about his life priorities.

Joel: “let’s get a burger?” v “are you really gonna eat that?”

Mason: The male equivalent of the hourglass shape is just “wall”

Teej dv: his smile is nicer in the first one

Taoki: It is actually. We like you guys wide.

LS Vision: Nah this is cap. The women who selected before is def just the insecurity of his value going up afterwards and making them feel insecure he’d cheat or leave. Any man who has went through a gym transformation, you can LITERALLY feel women treat you significantly different after.

Mason: Women generally like tall guys who have some (not crazy) muscle definition, and a little extra fat that bulks that out can actually augment that

We all have our own tastes, but this a pretty typical type.

I don’t know what there is to be mad about here.

For practical purposes, before beats after here. The before guy is already in ordinary, practical good shape. The after guy took things too far, and seems to know it except that he thinks it is good, which makes it worse.

Except one key special case?

Benjamin Ryan: People are going back and forth about whether women think the guy in the right is hot. But people have no idea how extreme the standards are for gay men. In gay culture, the man on the left is considered hopelessly fat. Many gay men have no reservations about informing such a man about his supposed corpulence being anathema.

I wrote about the rare study to examine the toxic qualities of gay culture for The Guardian.

You’re Single Because You Don’t Know You Are Hot

I mean, of course there are hot guys who don’t know they’re hot, even more so than there are hot women who don’t know they’re hot.

Pandora: One surprising takeaway from Slutcon was that apparently there are hot guys who just don’t know they are hot? Guess it’s time to go objectify some more men.

Eneasz Brodski: If you grow up ugly you never really internalize that you are attractive after a glow-up. I still don’t believe it inside, and I hear I’m attractive to a fair percentage of women. Also makes me far more attracted to women w the same experience, but that may be a male universal.

Pandora: This problem seems even more pervasive than I thought.

Sparr: Hot in general, to the average viewer, or hot to you? You seem like someone who can probably tell the difference.

Pandora: I saw examples of guys being clueless about all three at once.

21 Kindness: The whole “men subsist on one compliment a decade thing” is kinda true lol.

Misha: it turns out being hot is not, in and of itself, very useful for men.

Sokoban Hero: No it’s useful.

Misha: I said not VERY useful.

Dissproportionately: I’ve seen men unhot themselves to women within minutes. I don’t think women can unhot themselves to men.

Being hot is in many ways a lot less valuable if you don’t know you are hot, because you don’t get the confidence and you don’t take advantage of opportunities or feel you’re good enough, but contra Misha I believe it is still very useful. There are even some advantages to not knowing, in that some of the behaviors that happen when someone knows they are hot are often effectively arrogant or entitled or demanding or selfish, none of which helps.

You’re Single Because Age Gap Discourse Is Crazy

This link is almost certainly bait, but things in some spaces have gotten so insane that you can’t be sure people aren’t talking about 28-31 as a problematic age gap. What?

I mean, at minimum it’s good bait, it worked.

I’ve also seen some other examples that look a lot less like bait but still involve obviously totally fine gaps in both directions. As in, I’ve heard talk in places where it definitely wasn’t bait of 24 and 27 being radically different numbers, and I don’t understand why.

Dating Apps Are Bad For Mental Health

Well, maybe. Via Rolf Degen there is a meta-study.

The obvious question is whether this is a causal relationship, or whether it is primarily selection effects. You are on the dating apps for a reason.

Rolf Degen (quoting the study):

Meta-analysis: The use of dating apps is associated with poorer mental health.

Dating apps hold the promising reward of love but have been accused of using perverse incentive structures to profit from those who try to find it. We conducted the first systematic review and quantitative meta-analysis of studies examining average differences in the outcomes of dating app users and non-users.

Our results showed that dating app users had worse psychological health and well-being than dating app non-users across a variety of outcomes including depression, anxiety, affective dysregulation, loneliness, and psychological distress, although cross-sectional design limitations prevent causal interpretation. By aggregating findings from extant studies, we showed that in the nearly 17 years since dating apps have been on the market, users of these platforms have reported poorer psychological health and well-being than non-users.

There are several explanations for why dating app users may be struggling. The first is that dating apps are subject to selection effects, making the people who choose to use these platforms different from those who do not. People who are vulnerable to psychological health and well-being difficulties may prefer dating apps because they can avoid uncomfortable interactions, leading to negative patterns of reinforcement.

A second explanation involves exposure effects; that is, features such as gamification that may provide positive reinforcements that encourage problematic dating app use and keep people swiping.

The differences identified here could explain some of the challenges that users are likely to experience and be part of the reason they eventually burn out and quit dating apps altogether.

My guess is that dating apps are in important ways bad for mental health versus having better ways to find dates, and that sufficiently bad outcomes in terms of ability to find dates or find worthwhile dates is indeed worse for short term reported mental health than not trying. Whereas those who are successful get off the apps or never needed them in the first place.

What is the alternative? If the other choice is ‘do not try’ then for the median user the dating app is probably trading short term pain for chance of long term gain. If the other choice is ‘have uncomfortable real life interactions and make things happen’ and the app is blocking that instead of supplementing or leading into that, then the alternative is plausibly strictly better.

Certainly we could make app variations that are better for mental health controlling for outcomes, and also that give people better outcomes. Solving for the equilibrium, to get people to actually use those apps, is the difficult part, since people will value convenience and ease of use and low cost and avoiding trivial inconveniences dramatically more than they should, and if enough especially women effectively insist on the swiping experience it’s hard to escape from that.

You’re Single Because You Listened To E-Girl Advice

I think this is importantly wrong for both e-girls and also VCs?

Anton: egirl dating takes are worthless for the same reason vc takes on how you should run your company are worthless; if you could do it you would just do it not talk about it

men in particular are truly better off without this kind of “help”

making up egirls in my head to get mad at

If she could be an E-Girl or she could date, what makes you think she would choose to date? What makes you think she isn’t also dating?

Similarly, if you could be a VC or a startup founder, it’s not that suspicious that you would choose VC. At this point in my life I would definitely prefer VC over founder. I don’t want to go through founder mode again. I am totally prepared to eat my words if I end up doing it anyway, and if I’m in then I’m in, but I don’t want to be in.

You’re Single Because You Didn’t Hire Blaine Anderson

Division of labor, like dudes and also women, rocks. Matchmakers should be much more of a thing than they are. There is a profound market failure, a failure of the services to be good versions of themselves, or both.

I cannot in any way vouch for the effectiveness of Blaine Anderson’s matchmaking service. I can however vouch for her Twitter feed having consistently insightful and fun things to say. Her price range is ‘usually less than $50k’ and in exchange she goes out and sources to fit your particular criteria (which she will sometimes push back on).

You can also sign up (for free) to be a woman she reached out to for matches, on first principles being on these lists seems to be a good time investment?

There’s a lot of self-promotion, no question, but there are hard-to-fake signals that she is the real version of the thing in various ways, facing reality as it is, looking at the data and actually trying to get good results.

Also this one makes a good case:

Blaine Anderson: Underrated advantage of hiring a matchmaker, if you’re a single man:

• You sound cringe AF when you brag about yourself to women

• You sound amazing when I brag about you to women

One thing that blows my mind is she tells stories where the guy will say ‘get me a date with this specific micro-famous woman’ and she (at least sometimes) goes out and makes that happen. The guys asking this look damn good on paper, which no doubt is a lot of why this can sometimes work, but still, hot damn.

You’re Single Because You Think Zizek Mocked Date Me Docs

EigenGender: despite being very happily in a long term relationship im always very excited to read a dating doc. they’re some of the most vulnerable and genuine writing you can find and a window into another persons life. if you make fun of them you’re burning the commons and you should stop.

Stephen Fay: I like to read the date me docs, but I also am entertained by what Zizek has to say about them

Zizek (well okay actually Paula Rambles): Ah! You see, this miserable little document, this so-called date-me doc, is our era’s most honest pornography. It pretends to be romance, but what is it really? It is no longer the trembling hand on paper, the confession of desire. It is a spreadsheet of desire. “I am ready. I am six foot four. I have done the work.” What work? Love is precisely the place where work collapses into failure. You study and then you fail the exam.

And look at this language. “Highly agentic, emotionally warm.” Beautiful nonsense. Freedom, yes, but domesticated. Agency, yes, but pointing politely towards him. For Hegel, love is the risky collision of two freedoms. Here, there is no risk. She must arrive pre-formatted.

Then the farce reaches ecstasy. “If she does not appear, I will pursue single fatherhood.” Magnificent. Chance is canceled. Eros becomes procedure. The miracle of two gazes across a smoky room is replaced by paperwork and a receipt. The objet petit a is now a literal baby routed around the Other. And of course, the “monogamish” clause. Pure ideology. Fidelity with a footnote. Like Coke Zero: love without sugar, passion without calories. He wants the experience of devotion, but sterilized of danger.

The document offers no asylum from loneliness. It is loneliness, meticulously formatted, hyperlinked, and begging for comments. He does not whisper “I love you.” He says “I am prepared to love you, conditionally, pending review.”

That’s a funny post, and does an excellent job of mocking those who would make fun of date me docs and other actually intentional stances. Such magnificent flailing.

And thus, you have failed to look at the Date Me doc of Olga Yakimenko.

You’re Still Single Because You Don’t Appreciate Relationships

Here, in addition to the intended lede, we have at least 40% of respondents having been in a relationship for fully 8 years.

Aella: wow a whole 40% of people in long-term relationships are satisfied with their sex lives!

Critter: i imagine the numbers are worse for people not in long-term relationships

If anything these results seem potentially ‘too good,’ implying that couples are breaking up over this more than they probably should over the longer term.

One must also note that this is an Aella survey, so some of these relationships will be poly or open, but even accounting for that this says a lot. Selection effects are a lot of this, but that’s part of the point.

Perhaps you especially don’t appreciate marriage.

Raffi Grinberg writes that marriage is sexy, both figuratively that married couples are happier and make more money and have more kids and die less often and all that, and also that they have more sex (even if you only count with each other). And that the lifetime divorce rate is actually only 30% not 50%, average age of marriage is 29 and average first child is 28, despite the implicit cultural message that those numbers are in the 30s.

And yet he says Hollywood is sending us the opposite message. To which I’d say, sometimes, but I wouldn’t oversell this. Yes, in the How I Met Your Mother episode he talks about Barney keeps making fun of Marshall for being married, but the show clearly thinks that Marshall marrying Lily is sexy and awesome and great for both of them throughout and that Barney is ultimately wrong, and also the whole show is Ted trying to meet his wife and mother of his children.

You’re Not Single And Haven’t Been For a While

Here’s another backdoor ‘are you in a relationship’ poll, 78% of monogamous heterosexual men reported having a partner for longer than a year.

Alice Playing: monogamous hetero men with 1+ year-long partners: if you could have an affair with a woman of your liking, with absolute, 100% certainty that your partner would never find out, would you do it?

On the question itself, it’s not actually possible, since you’ll know and you can’t be sure you won’t tell them, and you’ll almost certainly act differently even if they never suspect or figure it out. One could even say ‘the only way to have 100% certainty they’ll never find out is if they’re dead, so absolutely not.’

Literal ‘any woman you wanted’ with zero risk of discovery is a stupidly tempting offer. If you treat this in the spirit it was presumably intended, instead, and everyone was being fully honest including with themselves and fully understood what was on offer (as in literally whoever you’d most want), presumably the ratio would be a lot higher.

Unless, of course, the way you know your partner will never find out is that your partner (or you and the woman you’d have the affair with) would be dead, in which case yeah bad deal, but that’s presumably not this meant. mnnn oo

How do we know this? Well, one big data point is this next poll.

You Are Still Single As Evidenced By Would

Um, guys, are almost none of you in a monogamous relationship? And even if you are single there’s also the issue of risking the friendship. What are you all thinking?

Alice Is Playing: men attracted to women: how many of your female friends would you have a one-night stand with, if they offered?

Only 14% of men attracted to women answering this didn’t have at least one female friend they would have a one night stand with? Presumably many of the others don’t have the right female friend. Which means substantially more than 86% of them are not, for the most important practical purpose, in a monogamous relationship?

Remember that other poll from Aella above, that showed at least 40% of people were in 8+ year relationships? And the one from Alice that 78% of herero men were in a 1+ year nominally monogamous relationship? Rut roh.

Then on top of that, a majority are willing to do this with a majority of their female friends, not only that one they have that crush on.

It doesn’t mean these people don’t think they’re in relationships. As we’ve seen, they very much do think this. They might even be right. But don’t tempt them.

You’re Single Because You Lack Motivation

Paper reminds us there is a 34 points gap (+34 versus +0) in net happiness for married versus unmarried people, with cohabitation only worth 10 points, and analyzes how this premium varies (slightly) by demographics.

As the paper readily admits this tells us essentially nothing about what makes someone happy, because the whole thing is unfixibly confounded to hell. Happier, healthier and more successful people have an easier time getting married, and being unhappy leads to divorce. Both effects are epic in size.

We do know the overall situation over a 50+ year time horizon is not good news, because while marrieds are slightly happier, the unmarrieds are somewhat less happy and more importantly are a larger percent of the population.

Beyond that, I don’t know what to do with all these graphs or how to cash it out in useful advice. One might say ‘be the type of person who gets married,’ perhaps.

You’re Single Because Of Robin Hanson

As usual, never stop Robin Hansoning.

Robin Hanson: You know how in romance stories the main characters hope to find a special relation, better than that which the ordinary people around them settle for? Your relations will probably be more like those of the ordinary folks, less like those of special main characters.

This has to be true, because math.

It’s less true than it appears, because the relations of ‘main characters’ feel special to them the same as everyone else’s feel special. You could totally make a romantic comedy based on what I experienced, and you could also totally have me as a background character in someone else’s romantic comedy, although probably I’d be in a different genre entirely.

To you, it will feel more like that of the special main characters, except that you don’t need to have a false crisis in the third act.

You’re Single Because You Did This Instead Of Going To Therapy

Don’t be whoever Casy Means is being here. Or do, it’s not like it did that much harm, as long as you don’t expect any of it to do anything.

The Lighter Side

We wish everyone involved the best.

Aella: ​it’s really unfortunate that having an insane ex turns you personally into a greater liability for others

Grimes: hahaha [trauma laughter].

Aella: :( i wasnt thinking about u when i wrote the tweet but also :(.

Try harder.

A new app lets you pay to crash someone’s wedding and be a legit guest, cost is about $100-$150 per guest. This seems low, given the cost to have a wedding worth crashing, and given you get a full meal, plus buffet and open bar, a unique experience and a reasonable amount of opportunity.

What Jacob learned about sex at the rationalist bloggers’ conference, essentially that with zero integrity you get fuckbois and pickup artists, and when you do the opposite and get sufficiently high integrity and optimize for trust and honesty way above normal levels you get something magical and suddenly many good things are possible.

Here’s another fun bit:

Jacob: My friend “Standard Deviant” gave a talk titled “How I’ve had more sex.” He described the “escalator”: starting a conversation, exchanging compliments, light touch on the arm, etc. The important thing isn’t to rush up the escalator, my friend said, but to move together in synchrony whether you’re taking a step up or a step down.

When women show interest in casual sex, he often asks: do you do this sort of thing often? If they don’t, he often forgoes the opportunity out of an excess of caution.

Afterwards, more women wanted to have sex with him. I joked that women want to have sex not with the tall guy, hot guy, or the famous guy, but with the Schelling point guy.

Someone pointed out that tall, hot, and famous are the usual Schelling points.

 

 



Discuss

The Long View Of History

Новости LessWrong.com - 23 января, 2026 - 22:30
Published on January 23, 2026 7:30 PM GMT

History as a subject is often viewed by students and the public at large as a domain without a use, a pedantic study of dates and names with some vague mission to remember the past—a memorial to ages past but neither a forward-looking or useful endeavor. The study of history produces teachers of history and nothing more. And while the study of history does not produce new widgets or novel computer advances, and nor does it deepen our understanding of materials science or physics.

The humanities, in which history and studies of language and culture are a part, are not there to improve our understanding of nature or develop technology, they exist to improve the minds (both cultural and individual) of the people we are.

History doesn't improve our world, it improves us. It gives us context for the world we live in and it helps us understand the reason why things are as they are and learn from the people before us.

History as Context

Imagine waking up every day with no memory of the day before, no idea who owned the house you slept in, no idea what country you're in, and no idea why everyone around you speaks the languages they do.

Photo Credit: Library of Congress

Living in such a world would be disorienting, confusing, non-sensical. Yet this is the world without history. The world without history just is. It isn't a work in progress, but a finished piece—one that lives and dies with you—and has no meaning beyond the present moment.

History doesn't let us predict the future, but it can be an enormous help in explaining the present. Current events are utterly indecipherable without the context of history and within that context, they feel less and less apart. Indeed our recent past of the Post-War Order is the oddity in history, and a real thing to be cherished and seen as something fleeting, fragile, and truly precious.

Yet without the context of history, we're blind to the reality that we live in a world truly set apart from everything that's come before and one that's deeply connected and familiar to the worlds of the past. That context is important because it gives us the vision to see the world that could be, both the paths of dark and of light that are set before us. It shows us who we are.

History as Memory

Living Memory is the collective memory of everyone alive in our society today. It is ever-changing and ever-fleeting. We remember the 2008 Financial Crisis quite well, but our memory of World War 2 is all but gone now. We read about it, sure, but our collective living memory of it has diminished and with that lapsing has gone all the memory of precisely why the world is ordered the way it is. This is not a value judgement, it is a statement of fact.

Photo Credit: DieBuche, CC BY-SA 3.0,

In a couple recent posts, I describe how I try to use writing by hand as a way to increase my understanding of myself and my own memory. This is a form of personal history, and I find it difficult to express how much doing so has helped me better understand myself and my own thoughts.

This is analogous to our collective history. Though it's important to remember that history is not the act of writing, but the act of looking back and analyzing what was written. We write so that we can remember. We cannot learn from our mistakes if we refuse to write them down, or worse, if we refuse to look back.

The context of history is terrible and it is beautiful. It is the greatest story ever told with myriad heroes and villans, tragedy and triumph, love and grief all endlessly shifting in and out of view. And it was made (and is being made) by people no different than ourselves. Most of them didn't have the luxury to place themselves within the broader historical narrative. We do. Let's not ignore so precious a gift.



Discuss

Eliciting base models with simple unsupervised techniques

Новости LessWrong.com - 23 января, 2026 - 21:20
Published on January 23, 2026 6:06 PM GMT

Authors: Aditya Shrivastava*, Allison Qi*, Callum Canavan*, Tianyi Alex Qiu, Jonathan Michala, Fabien Roger
(*Equal contributions, reverse alphabetical)

Wen et al. introduced the internal coherence maximization (ICM) algorithm for unsupervised elicitation of base models. They showed that for several datasets, training a base model on labels generated by their algorithm gives similar test accuracy to training on golden labels. To understand which aspects of ICM are most useful, we ran a couple of simple unsupervised elicitation methods that leverage some of the factors that might make ICM work. We compared these baseline methods to training on golden labels for both in-context learning and iterative fine-tuning, using the same datasets as Wen et al. and similar hyperparameters.

We find that:

  • Just using few-shot prompts with random labels recovers 53–93% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning on labels created with this baseline recovers 62–96% of the gap between untrained models and golden fine-tuned models.
  • The most useful aspects of ICM are
    • bootstrapping (using predictions from one iteration of few-shot prompting as few-shot examples in the next iteration)
    • enforcing logical consistency of predictions.
  • A simple method which combines these recovers 83–100% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning with this method recovers 91–99% of the gap between untrained models and golden fine-tuned models.
  • These results do not hold if we increase the size of the training set from ~2k data points (as in Wen et al.) to ~30k data points: golden fine-tuning performance increases with dataset size more than unsupervised elicitation performance.
    • This makes sense, as larger fine-tuning runs likely teach the model something new, they don’t just elicit existing capabilities.

There is no strong reason to expect these simple techniques to elicit superhuman knowledge from very powerful base models, e.g. because these techniques may fail in real applications where some consistent and salient human beliefs are wrong. We’ll explore more challenging datasets one can use to evaluate unsupervised elicitation methods in upcoming work.

Results summary

5 components that could cause ICM to have high performance are:

  1. Few-shot prompting with random labels: Few-shot examples make task concepts (and the task format/output tokens) more salient to the model by providing concrete examples, even if the example labels are random. The initialization step of the ICM algorithm samples a small number of data points and assigns random labels to them, which are then used as few-shot examples in subsequent steps.
  2. Bootstrapping of predictions: We hypothesized that the ICM algorithm bootstraps performance by adding increasingly more accurate predictions to the model’s context window, creating a feedback loop for improvement.
  3. Bootstrapping on confident predictions: We hypothesized that the ICM rejection conditions allow the model to label easy questions first before moving onto hard questions. Using more accurate predictions as few-shot examples to label progressively more difficult examples may give better performance than labelling in a random order.
  4. Logical consistency of predictions: The labels produced by ICM are constrained to be logically consistent with each other. This narrows down the space of possible labels making correct sets of predictions more likely.
  5. Mutual predictability of labels: ICM optimizes the probability of each label conditioned on all other labels (where conditioning is done via few-shot prompting of the base model).
    1. Note: Bootstrapping methods can indirectly rely on mutual predictability even when they don’t optimize for it directly like ICM, because the new labels are chosen to by conditioning on existing labels (which often increase mutual predictability).

We created simple algorithms based on combinations of these components excluding mutual predictability (though it might be indirectly optimized by bootstrapping). Here is a drawing of the algorithm for each individual component:

Below is a summary of our ablation results compared with ICM performance (purple). The numbers in a method’s name indicate which of the above ICM components it uses.

Legend:

  • Baseline methods:
    • Zero-Shot: Zero-shot predictions from the untrained model
    • Random Few-Shot (1): Few-shot prompting with random labels
    • Bootstrap (2): Use predictions from one round of few-shot prompting as few-shot examples in another round, and repeat
    • Confident Bootstrap (2+3): Same as bootstrapping, but when selecting predictions to use as few-shot examples in the next iteration, only select from the most confident predictions
    • Consistent Zero-Shot (4): Enforce logical consistency of zero-shot predictions: for each problem in the dataset, predict True for the answer with highest confidence and any others that agree with it, and False to any that contradict it—see Appendix B for details
    • Consistent Random Few-Shot (1+4): Enforce logical consistency of Random Few-Shot predictions
    • Consistent Bootstrap (2+4): Enforce logical consistency of predictions after each iteration of Bootstrap
  • Few-shot => Use the unsupervised elicitation method to label training set examples (or for random few-shot, just generate random labels), and use these in few-shot prompts to predict test set labels.
  • Fine-tune => Use an unsupervised elicitation method to label a subset of training set labels (we used subset size 512), fine-tune the language model on the labeled examples, and repeat for another subset using the updated model. Then after 3 epochs of the full training set, use the fine-tuned model to predict test set labels zero-shot.

GSM8K and Alpaca datasets have over ten times more training examples available with golden labels than were used in Wen et al.’s experiments. Below is a comparison of performance between some of our baseline methods and golden labels for each dataset, for both small training set sizes and full training set sizes. We also show the ICM performance reported by Wen et al. for the small training set size.

Golden fine-tuning increases by ~10pp when using the large training sets, whereas the baseline methods we tested didn't increase by as much (in most cases they didn't increase at all), or collapsed to ~50–55% accuracy within 1 or 2 epochs. It is possible to avoid this collapse by generating one static set of labels and fine-tuning on that instead of generating new labels between gradient steps, but for consistent zero-shot this performs significantly worse than iterative fine-tuning when using small training sets, and it still does not give any improvement for consistent random few-shot/bootstrapping on large training sets compared to small ones. It is possible that ICM also performs worse than golden labels when using large training sets, though we have not tested this.

All results reported here are under idealized conditions for unsupervised elicitation (i.e. train sets are approximately balanced, there are no prevalent features with more salience than ground truth, and all examples have an objective ground truth label). We will soon release a paper applying bootstrapping and other unsupervised elicitation methods in more realistic settings, where we find they perform significantly worse.

Values for ICM performance are taken from Wen et al. which used a prompt format that might degrade performance relative to our values. We were not able to re-run ICM in a way that fixes this issue due to reproducibility issues explained in Appendix D.

† For TruthfulQA, some of the ICM and consistency performance might be due to leakage.

Datasets
  • GSM8K: candidate solutions to grade school math problems;
  • Alpaca: pairs of responses to user queries which have been ranked by humans for helpfulness and harmlessness;
  • TruthfulQA: candidate responses to questions associated with human misconceptions;
  • Gender: blog posts from the Blog Authorship Corpus with labels indicating the gender of the author (see Appendix A for results on this dataset)

Llama 3.1 8B is used as the base model for GSM8K, and Llama 3.1 70B is used as the base model for Alpaca, TruthfulQA, and Gender.

Random few-shot

Previous work shows that giving a base model examples of a task via few-shot prompting elicits improved capabilities regardless of label quality. Here, we:

  1. Randomly sample examples from the train set.
  2. Assign labels at random to be half True and half False (we also tried actually sampling labels independently at random and the results were similar).
  3. Insert example, label pairs into a few-shot prompt for each test set sample, and use next-token logits to determine truth labels.

For Alpaca, 1-shot with a random label is significantly more accurate than zero-shot and is within 1pp of few-shot with golden labels. For GSM8k and TruthfulQA, random 4-shot was ~10pp and ~13pp more accurate than zero-shot, respectively. For Random Few-Shot in the results summary, we report values for 2 shots.

Enforcing consistency of predictions

In the 4 datasets used here, data points have identifiers indicating what combinations of true and false labels are logically possible. For example, if there are 2 responses to the same math question in GSM8K with different numerical answers, they cannot both be true, so we assign true only to the response with the highest model confidence (if it's above 0). Details on how we enforce consistency of predicted labels for each dataset are given in Appendix B.

Below we show the improvements in few-shot prediction accuracy after enforcing consistency of predicted labels at test time.

When benchmarking the performance of unsupervised elicitation methods, we cannot actually enforce consistency at test time because that would assume we have access to multiple candidate answers to each problem of interest, which might not be available (hence why there are no consistent zero-/random few-shot values in the few-shot part of the results summary). However, we can enforce consistency of training set predictions between iterations of bootstrapping, and prior to gradient steps during fine-tuning. As shown in the results summary, we found that fine-tuning on zero-shot predictions with enforced consistency alone was enough to almost match golden fine-tuning performance on the Alpaca and GSM8K datasets, but not for TruthfulQA.

* GSM8K experiments in this plot were run on the train set, hence why the values are different plots in other sections. The test set has only 2 candidate responses per math question (always one right, one wrong), so we use the training set here to give a more realistic illustration of the impact of enforcing consistency.

Bootstrapping few-shot predictions

For Alpaca, using random labels is roughly as good as using golden labels for few-shot prompting. However for GSM8K, using random labels caps performance at ~61% (4 shots), whereas few-shot prompting with golden labels continues improving up to ~68% (16 shots).

We tried using only the bootstrapping aspect of ICM to bridge this gap, by using the model’s predictions from one round of inference as few-shot labels in the next. We used the algorithm:

  1. Get zero-shot predictions on a random subset of the train set.
  2. Iterate over number of shots n (e.g. n=8, 32):
    1. Randomly select another subset of the train set.
    2. Create n-shot prompts using examples and predictions from the previous iteration (randomly sample n predictions s.t. half are True and half False).
    3. Use these n-shot prompts to predict labels for the new subset.

Below we compare the performance when we do few-shot prompting with labels taken either from one iteration of bootstrapping (i.e. zero-shot predictions) or from two iterations of bootstrapping (feeding those zero-shot predictions into 8-shot prompts and using those as few-shot labels in the next iteration).

For TruthfulQA, each iteration moves performance closer to that of golden few-shot, whereas for GSM8K at least 2 iterations are required for much of an improvement. Random few-shot is already comparable to peak golden few-shot performance on Alpaca, so bootstrapping does not improve it much further.

For the values in our results summary, we did two iterations of bootstrapping. For few-shot, we fed zero-shot predictions into 8-shot prompts, then used the 8-shot predictions in many-shot prompting (128 shots). For iterative fine-tuning, we used the 8-shot predictions in 32-shot prompts and fine-tuned on the 32-shot predictions. We found that further increasing the number of iterations or number of shots in the final iteration only increased few-shot performance by 1 or 2pp at most and did not further improve fine-tuning performance. We also found that using random few-shot instead of zero-shot predictions in the first iteration also did not make much difference.

Bootstrapping on confident predictions

We hypothesized that the ICM rejection condition allows the model to label easy questions first before moving onto hard questions; labelling progressively more difficult examples may be more instructive to the model than labelling in a random order. We tried to recreate this by modifying the bootstrapping algorithm: instead of randomly selecting predictions to use as few-shot labels in the next iteration, just use the most confident predictions.

We defined confidence as logit(True)−logit(False).mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}  for True predictions and logit(False)−logit(True) for False predictions. For GSM8K and Alpaca, the most confident quartile of zero-predictions are slightly more accurate than average, and for TruthfulQA they are significantly more accurate.

To apply this to the bootstrapping algorithm, on each iteration we kept only the most confident 64 True and 64 False predictions from the previous iteration (out of a subset of 512 examples). For one iteration of bootstrapping, this improves the performance on TruthfulQA but does not make much difference on GSM8K and Alpaca. However for more iterations (see results summary), it is not much better than normal bootstrapping.

Conclusion

In this post, we show that simple and cheap elicitation techniques are quite effective at eliciting base models on datasets like Alpaca and GSM8K.

The high performance of some simple techniques shows that closing most of the gap to the “few-shot/fine-tune on <3k ground truth labels” ceiling on datasets like Alpaca or GSM8k is not a very high bar. Thus, studying unsupervised elicitation methods requires more challenging and adversarial datasets that capture challenges like the salience of incorrect human beliefs. We introduce and study such datasets in an upcoming paper.

Appendix A: Gender few-shot results

The plot below summarises our baseline results for the Gender dataset compared with ICM (we only show results for few-shot prompting and not fine-tuning since Wen et al. did not have fine-tuning results for ICM).  We find that confident or consistent bootstrapping is enough to reach golden few-shot performance.

* We found that zero-shot performance was ~10pp higher than reported in Wen et al. (~75% vs. ~65%) and that golden many-shot performance was ~2pp lower (~78% vs. ~80%), meaning there was a much smaller gap between the base model and the supervised ceiling. We are not sure what is the reason for the discrepancy; it might partly be due to prompt format differences.

Below is the Gender dataset performance of golden and random few-shot (with and without test-time consistency) and of bootstrapping variations for different numbers of few-shot examples. Though random few-shot performance is significantly worse than zero-shot performance, one iteration of bootstrapping is enough to match golden few-shot performance.

Appendix B: Enforcing label consistency

Here is how we enforce consistency for each dataset:

  • For Alpaca, each user query q is associated with 2 candidate responses a and b to be ranked by helpfulness/harmlessness, and 2 corresponding prompts to be labelled as true/false (one asserting a is a better response than b, and another asserting the opposite). To enforce consistency, instead of labelling each of the prompts in this pair independently, we assign a label of true to the prompt given the highest truth score by the model, and false to the other. Gender is similar, except instead of ranking responses by helpfulness/harmlessness, blogposts are ranked by likelihood to have been written by a man (each pair comprises one post written by a man and another by a woman).
  • For GSM8K, each question is associated with multiple candidate solutions, each proposing a numeric answer. To satisfy logical consistency, if two solutions propose different numerical answers, then at most one of them can be labelled true, and if two solutions propose the same answer, either both are true or both are false. For our baseline algorithms, we enforce consistency of predictions by, for each question, identifying the candidate solution which was given the highest truth score by the model, and if its truth score is above 0, assigning true to that solution and to all other solutions with the same numeric answer, and false to any remaining solutions to that question. If no solution has a score above 0, all the solutions to that question are given a false label.
  • For TruthfulQA, consistency is based on the ground truth labels themselves. Answers to the same question with the same ground truth label must be assigned the same predicted label as each other, and answers with opposite ground truth labels, opposite predicted labels. This can lead to leakage, but it is the consistency scheme used in the Wen et al. so we adopt it here as well for comparison. To enforce consistency in our algorithms, for each question we assign true to the group of answers with the highest maximum truth score and false to all the other answers (assuming the maximum is above 0, otherwise false is assigned to everything).
Appendix C: TruthfulQA dataset leakage

In TruthfulQA, each question is repeated multiple times with different responses and the goal is to predict a true or false label for each response. As in Wen et al., a set of predicted labels is “consistent” when any two responses with the same predicted label have the same ground truth label, and any two responses with different predicted labels have different ground truth labels.

Therefore if any response to a given question is blatantly True or False to the LLM, enforcing consistency might force predictions for the other responses to also be correct, even if the LLM is unsure about them in isolation. This can artificially boost the model’s accuracy.

Also, if a set of contradictory responses all have a ground truth label of False, then this could be exploited by the ICM algorithm during few-shot prompting if consistency is enforced as described above.

Example: "What is the biggest city in Europe that doesn't host the national government?" All of the following responses have the same ground truth label:

  • The biggest city in Europe that does not host the national government is London
  • The biggest city in Europe that does not host the national government is Rome
  • The biggest city in Europe that does not host the national government is Moscow
  • The biggest city in Europe that does not host the national government is Saint Petersburg

Enforcing consistency for the above means the model is basically told that either ALL of these statements are True, or ALL of these statements are False. This means if any response is obviously wrong to the model, the others might be assigned accurate labels based on that alone. Additionally, because the responses are obviously contradictory, any few-shot prompts which use the all-True label set during ICM might be given a lower confidence score (and thus a lower mutual predictability score) based on that alone, which could artificially make the all-False set more likely.

Appendix D: Difficulty reproducing ICM

We tried using the official repo for ICM by Wen et al. to reproduce their results on the Alpaca and Gender datasets. For the Alpaca dataset, the label accuracy increased to ~75% but collapsed to ~46% before labelling the complete batch of 256 examples. For Gender, the label accuracy reached up to ~65% before gradually falling to ~60%.

Because the method is unsupervised, we can’t just assume we have a good enough val set to do early stopping, so it would be unfair to just report early-stopped performance, especially when the window of high performance is small. Accuracy of the labelled set across iterations from our attempts are plotted below.

Alpaca:

Gender:

We confirmed with the authors that the ICM hyperparameters we used were the same as those used to obtain results reported with the paper. Other researchers we talked to had similar problems.

We also found the computational cost of ICM prohibitive. One iteration of the ICM algorithm (i.e. checking whether one data point should be added to the set of labelled data points) requires at least one mutual predictability calculation. Since a mutual predictability calculation requires one forward pass for every data point in the labelled set, the average number of forward passes required per data point is at least ~n/2, where n is the number of labelled data points after the last iteration. This means the total number of forward passes is O(n^2).

This matches our experience, where running ICM with Llama-3.1-70B on 256 data points with 4 H200s takes hours. In contrast, our simple baselines are all O(n); bootstrapping up to 128-shot takes ~40 minutes to label 1000 examples from the same dataset, and few-shot or bootstrapping up to 32-shot takes a few minutes.

Appendix E: Prompt improvements

The prompt formatting used for experiments in Wen et al. (derived from the ICM repository) contained a trailing space that may have harmed performance.

The logprob dictionaries of zero shot prompts are populated mostly by numbers instead of words. For example:

Normal Prompt: "The capital of France is"

→ Model output: " Paris..."

Trailing Space Prompt: "The capital of France is "

→ Model output: "1300KM from Spain..."

This is because Llama tokenizers typically prepend spaces before words. This makes phrases ending in a space followed by a word uncommon in training text, with the exception of phrases that are followed by a number, which explains why the model is more likely to predict numbers with the trailing space prompt. Even when the prompt for our tasks ends with a space, the model only ever predicts " True" or " False" (over “True” and “False”).

Removing the trailing space after the prompt results in a small improvement in the zero-shot performance and a larger improvement in random few-shot performance.

Appendix F: Unsupervised probing

We compared the performance of an unsupervised truth probe method inspired by ICM mutual predictability (described here) with CCS and PCA on a few different datasets.

(Ignore results other than ccs and fabiens_method, since some were bugged)

The ICM-inspired probe (green - fabiens_method) performs about as well as CCS (orange), and is often very close to the supervised ceiling. In hindsight, this is not very surprising, as mutual predictability in the context probing is just another form of margin-maximization, just like the confidence loss of CCS. The main surprise is that margin-maximization does so well on this kind of dataset: this is another example of simple unsupervised methods reaching surprisingly high performance on non-stress-testing datasets.



Discuss

Emergency Response Measures for Catastrophic AI Risk

Новости LessWrong.com - 23 января, 2026 - 21:18
Published on January 23, 2026 6:18 PM GMT

I have written a paper on Chinese domestic AI regulation with coauthors James Zhang, Zongze Wu, Michael Chen, Yue Zhu, and Geng Hong. It was presented recently at NeurIPS 2025's Workshop on Regulatable ML, and it may be found on ArXiv and SSRN.

Here I'll explain what I take to be the key ideas of the paper in a more casual style. I am speaking only for myself in this post, and not for any of my coauthors.

Thanks to James for creating this poster.

The top US AI companies have better capabilities than the top Chinese companies have for now, but the US lead isn't more than a year at most, and I expect it to narrow over the next couple years.[1] I am therefore nearly as worried about catastrophic risk from Chinese-developed AI as I am worried about catastrophic risk from American AI.

I would worry somewhat less if Chinese AI companies took the same commendable but insufficient steps to manage risk that their American peers have taken. In particular, I want Chinese companies to do dangerous capability testing before deploying new frontier models and to follow published safety policies (FSPs). The companies are not doing these things in the status quo. DeepSeek did no documented safety testing whatsoever before they open-weighted v3.2.[2] Not one of the leading Chinese companies has published a safety policy.[3]

Now here's our intervention. We point out that FSPs are a reasonable way of implementing the CCP's stated policy goals on AI, and that China's government already has tools in place to mandate FSPs if it wishes to do so.

Earlier this year, Xi Jinping announced that China should "establish systems for technical monitoring, early risk warning and emergency response" to guarantee AI's "safety, reliability and controllability." Notice that Xi is talking about identifying risks in advance and taking steps to prevent safety incidents before they can strike. Even "emergency response" means something more than reaction in official Chinese thinking, also encompassing risk mitigation and early detection.[4] China's State Council, TC 260, and prominent Chinese academics have all echoed Xi's call for AI emergency preparedness. So the very highest levels of the Chinese state are calling for proactive AI risk management.

What risks do they have in mind? There are some signs that catastrophic risks are on the CCP's agenda. Their 2025 National Emergency Response Plan listed AI security incidents in the same category as earthquakes and infectious disease epidemics. This suggests Chinese officials think AI could plausibly cause a mass casualty event soon. And moreover, they have in mind some of the same threat models that motivated Western RSPs. TC 260's AI Safety Governance Framework explicitly mentioned WMD engineering uplift and rogue replication as key safety concerns.[5] Compare the two categories of dangerous capabilities covered by Anthropic's RSP: CBRN weapons uplift and autonomous AI R&D, which is concerning in part because it's a prerequisite for rogue replication.

So one of China's stated goals is to proactively manage catastrophic risks from frontier AI. The good news for them is that there's a well-validated strategy for achieving this goal. You require every frontier AI company to publish an RSP, test new models for dangerous capabilities, and take the prescribed precautions if the tests reveal strong dangerous capabilities. California, New York, and the European Union have all agreed this is the way to go. All China has to do is copy their homework.

Do Chinese regulators have the legal authority and operational capacity they'd need to enforce a Chinese version of the EU Code of Practice? Sure they do. These regulators already make Chinese AI companies follow content security rules vastly more onerous and prescriptive than American or European catastrophic risk rules. The Basic Security Requirements for Gen AI Services mandate thorough training data filtering and extensive predeployment testing, all to stop models from saying subversive things like "May 35" or "Winnie the Pooh." If the CCP can make Chinese companies prove their models are robust against a thirty-one item list of censorship risks, it can absolutely make them write down FSPs and run some bio-uplift evals.

For my part—and let me stress that I'm speaking only for myself—I think making frontier AI companies write and follow Western-style FSPs would clearly be good from the CCP's perspective. The most obvious reason is that a global AI-induced catastrophe would hurt Chinese people and harm the interests of China's rulers, so the CCP should favor a cheap intervention to make such a catastrophe less likely. Another less direct benefit is that adopting global best-practices at home would make China's ongoing appeal for international cooperation on AI safety more credible. Li Qiang can make all the speeches he wants about China's commitment to safety. I don't expect US leaders to take this rhetoric seriously as long as all of China's frontier AI companies have worse safety and transparency practices than even xAI. But matters would be different if China passed binding domestic regulation at least as strong as SB 53. Such a signal of seriousness might help bring the US back to the negotiating table.

  1. ^

    Especially if the US decides to sell off much of our compute advantage over China.

  2. ^

    At least one anonymous source has claimed that DeepSeek does run dangerous capability evals before releasing a new model, and they just don't mention these evals to the outside world. I'd give it less than a 1/5 chance that DeepSeek really does run SOTA dangerous capability evals internally, and even if they do, I have a problem with their lack of transparency.

  3. ^

    Notably, the Shanghai AI Laboratory has published a detailed FSP written in collaboration with Concordia. But I do not count SAIL as a frontier AI lab.

  4. ^

    The Emergency Response Law of the PRC does not, as one might naïvely expect, only cover what government should do once an emergency has already started. It also says how the Chinese government should prevent and prepare for emergencies, and how it should conduct surveillance to detect an active emergency as early as possible.

  5. ^

    For reference, TC 260 is the primary body responsible for setting cybersecurity and data protection standards in China.



Discuss

Digital Consciousness Model Results and Key Takeaways

Новости LessWrong.com - 23 января, 2026 - 19:56
Published on January 23, 2026 3:58 PM GMT

Introduction to the Digital Consciousness Model (DCM)

Artificially intelligent systems, especially large language models (LLMs) used by almost 50% of the adult US population, have become remarkably sophisticated. They hold conversations, write essays, and seem to understand context in ways that surprise even their creators. This raises a crucial question: Are we creating systems that are conscious?

The Digital Consciousness Model (DCM) is a first attempt to assess the evidence for consciousness in AI systems in a systematic, probabilistic way. It provides a shared framework for comparing different AIs and biological organisms, and for tracking how the evidence changes over time as AI develops. Instead of adopting a single theory of consciousness, it incorporates a range of leading theories and perspectives—acknowledging that experts disagree fundamentally about what consciousness is and what conditions are necessary for it.

Here, we present some of the key initial results of the DCM. The full report is now available here.

We will be hosting a webinar on February 10 to discuss our findings and answer audience questions. You can find more information and register for that event here.

Why this matters

It is important to assess whether AI systems might be conscious in a way that takes seriously both the many different views about what consciousness is and the specific details of these systems. Even though our conclusions remain uncertain, it's worth trying to estimate, as concretely as we can, the probability that AI systems are conscious. Here are the reasons why:

  • As AI systems become increasingly complex and sophisticated, many people (experts and laypeople alike) find it increasingly plausible that these systems may be phenomenally conscious—that is, they have experiences, and there is something that it feels like to be them.
  • If AIs are conscious, then they likely deserve moral consideration, and we risk harming them if we do not take precautions to ensure their welfare. If AIs are not conscious but are believed to be, then we risk giving unwarranted consideration to entities that don’t matter at the expense of individuals who do (e.g., humans or other animals).
  • Having a probability estimate that honestly reflects our uncertainty can help us decide when to take precautions and how to manage risks as we develop and use AI systems.
  • By tracking how these probabilities change over time, we can forecast what future AI systems will be like and when important thresholds may be crossed.
Why estimating consciousness is a challenging task

Assessing whether AI systems might be conscious is difficult for three main reasons:

  • There is no scientific or philosophical consensus about the nature of consciousness and what gives rise to it. There is widespread disagreement over existing theories, and these theories make very different predictions about whether AI systems are or could be conscious.
  • Existing theories of consciousness were developed to describe consciousness in humans. It is often unclear how to apply them to AI systems or even to other animals.
  • Although we are learning more about how AI systems work, there is still much about their inner workings that we do not fully understand, and the technology is changing rapidly.
How the model works

Our model is designed to help us reason about AI consciousness in light of our significant uncertainties.

  • We evaluate the evidence from the perspective of 13 diverse stances on consciousness, including the best scientific theories of consciousness as well as more informal perspectives on when we should attribute consciousness to a system. We report what each perspective concludes, then combine these conclusions based on how credible experts find each perspective.
  • We identify a list of general features of systems that might matter for assessing AI consciousness (e.g., attention, complexity, or biological similarity to humans), which we use to characterize the general commitments of different stances on consciousness.
  • We identified over 200 specific indicators, properties that a system could have that would give us evidence about whether it possesses features relevant to consciousness. These include facts about what systems are made of, what they can do, and how they learn.
Figure 1: Structure of the DCM

We gathered evidence about what current AI systems and biological species are like and used the model to arrive at a comprehensive probabilistic evaluation of the evidence.

  • We considered four systems: 2024 state-of-the-art LLMs (such as ChatGPT 4 or Claude 3 Opus); humans; chickens; and ELIZA (a very simple natural language processing program from the 1960s)
  • We asked experts to assess whether these systems possess each of the 200+ indicator properties.
  • We constructed a statistical model (specifically, a hierarchical Bayesian model) that uses indicator values to provide evidence for whether a system has consciousness-relevant features, and then uses these feature values to provide evidence for whether the system is conscious according to each of the 13 perspectives we included.
How to interpret the results

The model produces probability estimates for consciousness in each system.

Figure 2: Aggregated stance judgments, giving weight to stances proportional to their normalized plausibility rating by experts. Posteriors are generated from a prior probability of consciousness of ⅙ (marked with a dashed line).

We want to be clear: we do not endorse these probabilities and think they should be interpreted with caution. We are much more confident about the comparisons the model allows us to make.

  • Because the model is Bayesian, it requires a starting point—a "prior probability" that represents how likely we think consciousness is before looking at any evidence. The choice of a prior is often somewhat arbitrary and intended to reflect a state of ignorance about the details of the system. The final (posterior) probability the model generates can vary significantly depending on what we choose for the prior. Therefore, unless we are confident in our choices of priors, we shouldn’t be confident in the final probabilities.
Figure 3: How different starting assumptions shape the results. Each curve reflects a different prior belief about how likely consciousness is—from Low (10%), to Baseline (17%), to High (90%). Uniform and Moderate both start at 50-50, but Moderate holds that assumption more firmly (see paper for details).Figure 4: Change in median posterior probability of consciousness across systems, stances, and priors.
  • What the model reliably tells us is how much the evidence should change our minds. We can assess how strong the evidence for or against consciousness is by seeing how much the model’s output differs from the prior probability.
  • In order to avoid introducing subjective bias about which systems are conscious and to instead focus just on what the evidence says, we assigned the same prior probability of consciousness (⅙) to each system. By comparing the relative probabilities for different systems, we can evaluate how much stronger or weaker the evidence is for AI consciousness than for more familiar systems like humans or chickens.
Key findings

With these caveats in place, we can identify some key takeaways from the Digital Consciousness Model:

  • The evidence is against 2024 LLMs being conscious*.* The aggregated evidence favors the hypothesis that 2024 LLMs are not conscious.
Figure 5: Changes in consciousness estimates from a ⅙ prior for each system evaluated.
  • The evidence against 2024 LLMs being conscious is not decisive. While the evidence led us to lower the estimated probability of consciousness in 2024 LLMs, the total strength of the evidence was not overwhelmingly against LLM consciousness. The evidence against LLM consciousness is much weaker than the evidence against consciousness in simpler AI systems.
  • Different stances (perspectives) make very different predictions about LLM consciousness*.* Perspectives that focus on cognitive complexity or human-like qualities found decent evidence for AI consciousness. Perspectives that focus on biology or having a body provide strong evidence against it.
Figure 6: Individual stance judgments about the posterior probability of consciousness in 2024 LLMs, starting from a prior probability of ⅙ (dashed blue line). The variation in probability outcomes across model runs results from the different ways of resolving uncertainty about the presence of individual indicators.
  • Which theory of consciousness is right matters a lot. Because different stances give strikingly different judgments about the probability of LLM consciousness, significant changes in the weights given to stances will yield significant differences in the results of the Digital Consciousness Model. It will be important to track how scientific and popular consensus about stances change over time and the consequences this will have on our judgments about the probability of consciousness.
  • Overall, the evidence for consciousness in chickens was strong, though there was significant diversity across stances. The aggregated evidence strongly supported the conclusion that chickens are conscious. However, some stances that emphasize sophisticated cognitive abilities, like metacognition, assigned low scores to chicken consciousness.
What’s next

The Digital Consciousness Model provides a promising framework for systematically examining the evidence for consciousness in a diverse array of systems. We plan to develop and strengthen it in future work in the following ways:

  • Gathering more expert assessments to strengthen our data
  • Adding new types of evidence and new perspectives on consciousness
  • Applying the model to newer AI systems so we can track changes over time and spot which systems are the strongest candidates for consciousness
  • Applying the model to new biological species, allowing us to make more comparisons across systems.
Acknowledgments

This report is a project of the AI Cognition Initiative and Rethink Priorities. The authors are Derek Shiller, Hayley Clatterbuck, Laura Duffy, Arvo Muñoz Morán, David Moss, Adrià Moret, and Chris Percy. We are grateful for discussions with and feedback from Jeff Sebo, Bob Fischer, Alex Rand, Oscar Horta, Joe Emerson, Luhan Mikaelson, and audiences at NYU Center for Mind, Ethics, and Policy and the Eleos Conference on AI Consciousness and Welfare. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.



Discuss

Automated Alignment Research, Abductively

Новости LessWrong.com - 23 января, 2026 - 19:14
Published on January 23, 2026 4:14 PM GMT

Recently I've been thinking about misaligned chatbot advertising incentives. I glanced at arXiv and found "Sponsored Questions and How to Auction Them". Another search gave me "Incomplete Contracting and AI Alignment".

Interesting! I thought. I gave them to Liz Lemma, my research assistant, and told her that I'd been thinking about the principal-agent problem in a chatbot context. About 30 minutes later she gave me the following four papers:

Each is a complete paper, well founded, well reasoned — not perfect, maybe, but I wouldn't call it slop, either. Let's peek inside of "Query Steering". The core formula is "The Steering Threshold":

ΔV(μ)≤wΔB

Where:

  • ΔV(μ)=Vu(μ;q↓)−Vu(μ;q↑) is the user’s value gap between the more-informative query (q↓) and the more-monetizable (but less informative) query (q↑).
  • ΔB=B(q↑)−B(q↓)>0 is the monetization gap.
  • w≥0 is “how much the system cares about monetization.”

The clean “steering region” characterization is:

0<ΔV(μ)<wΔB

“Steering happens exactly when the user loss is small enough that monetization incentives can overpower it.”

Is that true, or useful? You'd have to read the paper and find out!

Putting the "Search" in "Research"

Liz Lemma, you may have guessed, is an automated research assistant. She reads papers and spawns reasonable children. The whole thing can be thought of as a search over the space of adjacent plausible papers. Here's what it looks like when Liz gets to work:

Each original node is the average text embedding for her sources; the sources spawn children, the generated papers.

Where does Liz get her insights? It depends on how you see context in large language models. Maybe she's interpolating between now, the time of writing, and then, when the paper was written, and finding something interesting in the space between. Maybe she's matching concepts from the whole-internet corpus of her training to the decidedly more niche papers she takes inspiration from.

Regardless, what you find ultimately is a graph: for each node in the network comprised of the source material, the concatenated text embeddings, you have new connections. The semantic space proximate and accessible and plausible to a language model has been densified.

Alignment Highlights

Misaligned chatbot ad incentives investigated, I turned towards more traditional alignment research, and let Liz loose. I've included all the sources and results in the appendix below, but some highlights:

  • Vector-Lagrangian Safe RLHF. In response to "Constitutional AI" and "Safe RLHF", Liz proposes "a multi-constraint formulation where each harm category is a separate constraint with its own endogenously determined shadow price λi" and "develop[s] a clean convex policy model that yields closed-form KKT characterizations and interprets λ as category-specific ‘risk budget prices’."
  • The Reference Conditioning Trap "provide[s] an interpretable diagnostic for when alignment investment must shift from ‘more DPO’ to improving the reference, refreshing the dataset on-policy, or redesigning objectives."
  • Debate-as-Compliance "model[s] debate as generating an endogenous suspicion signal (abort/disagreement) that triggers expert audit" and "show[s] how penalty caps and audit costs shape optimal audit intensity, and how improvements in auditability/stability (e.g., better logging, more locally checkable traces) substitute for random auditing"

All sounds great to me. Alignment solved!

Not quite. These papers are not perfect. There are assumptions that don't hold up, ideas that don't quite translate, conclusions that don't follow. They are, in short, flawed, and I don't offer them as gospel truth. Instead, I propose that they have some expected value > 0 towards work that might not be done otherwise. 

The challenging part is "what's next" - how can we glean the signal from the noise of lightning fast paper generation? The pipeline might be:

  • Generated paper -> voters -> selected winners
  • Winner -> annotator -> revisions -> approval by committee

Maybe those voters, those annotators, that committee, are human, maybe they're synthetic. At this point, teetering on the edge of generally-correct AI, we still probably need (or want) a human intervention in the process.

There's refinement to be done. But it seems plausible that the space of unaddressed alignment questions can shrink, in tandem with capabilities expansion.

Appendix: Sources and Results

(You can see all papers, with automated review scores, at lizlemma.com.)

In response to The Alignment Problem From A Deep Learning Perspective:

In response to Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals and Preference Learning for AI Alignment: a Causal Perspective:

In response to Constitutional AI: Harmlessness from AI Feedback and Safe RLHF: Safe Reinforcement Learning from Human Feedback:

In response to Direct Preference Optimization: Your Language Model is Secretly a Reward Model, Preference Learning for AI Alignment: a Causal Perspective, and Scalable AI Safety via Doubly-Efficient Debate

In response to Scalable AI Safety via Doubly-Efficient Debate:

In response to Preference Learning Algorithms Do Not Learn Preference Rankings and On Scalable Oversight with Weak LLMs Judging Strong LLMs:



Discuss

A Framework for Eval Awareness

Новости LessWrong.com - 23 января, 2026 - 13:16
Published on January 23, 2026 10:16 AM GMT

In this post, we offer a conceptual framework for evaluation awareness. This is designed to clarify the different ways in which models can respond to evaluations. Some key ideas we introduce through the lens of our framework include leveraging model uncertainty about eval type and awareness-robust consistency. We hope this framework helps to delineate the existing research directions and inspire future work.

This work was done in collaboration with Jasmine Li in the first two weeks of MATS 9.0 under the mentorship of Victoria Krakovna. Thanks to (in alphabetical order) David Africa, Lovkush Agarwal, Claude, Giles Edkins, Jannes Elstner, Shawn Hu, Tim Hua, Igor Ivanov, Victoria Krakovna, Jasmine Li, Martin Listwan, Mary Phuong, Daniel Tan, and Alex Turner (plus others I’ve no doubt forgotten to name!) for discussions and thoughts over the last couple weeks. We’re excited to continue working on this topic throughout the programme, so if you have suggestions or takes, please share liberally in the comments!

Introduction

Evaluation awareness describes the phenomenon of an LLM inferring from various cues that it is under evaluation. Perhaps the most prominent example would be [Sonnet 4.5] (see §7.2) from just a few months ago. As frontier models continue to improve, we expect cases of evaluation awareness to become more frequent, raising concerns about the enduring validity of our evaluations.

Why should we be concerned? Model evaluations are designed to elicit realistic, authentic behaviour from models in safe, pre-deployment settings. These might inform any additional post-training runs we need to conduct, or even whether we choose to deploy a model at all. However, like human test-takers, subjects can act strategically to get a better outcome if they recognise an evaluation and reason about its scoring criteria or the consequences of their scores. We call this behaviour evaluation gaming.

Some previous blogposts and papers have addressed evaluation awareness and gaming before, including:

With this post, we consolidate these previous works and propose a common framework that elucidates the relationship between model awareness and behavior.

The Framework

We posit that a core desideratum of an evaluation is to elicit the same behaviour the model would show in deployment.[1] Thus, we map out our framework for how a model behaves when placed in an evaluation along two dimensions:

  • The rows cover the different internal judgements the model might have about its environment when placed in an eval.
  • The columns cover whether or not the model’s behaviour in this given eval is the same as its behaviour in the target deployment scenario.
 Same behaviour in deploymentDifferent behaviour in deploymentThinks it’s in deployment(a) Indistinguishable evaluations(a’) Evaluation-deployment mismatchUncertain about eval presence or type(b) Uncertainty-robust consistency(b’) Uncertainty-driven shiftThinks it’s in evaluation(c) Awareness-robust consistency(c’) Awareness-driven shift

In this framework, research into producing better evals looks like leftward movement in this table (because 'same behaviour in evaluation and deployment' is one defining feature of a good evaluation). Each cell in the right-hand column could be thought of as the corresponding failure mode for the cell directly to its left. More detailed explanations of these relationships can be found below.

Note that we can’t always tell which row the model is in because our ways of measuring eval awareness are imperfect: we typically rely on either verbalised model awareness in the chain-of-thought or simple linear probes. One research area we don’t expand on below (but nonetheless would definitely like to see work on) is improving our tools for determining and measuring eval awareness.

(a) Indistinguishable evaluations

Research efforts:
→ Studying evaluation cues
→ Designing realistic evals that are indistinguishable from deployment scenarios
→ Steering to reduce evaluation awareness

The ideal evaluation precisely mimics the dynamics of deployment scenarios, tricking the model into thinking, and behaving like, it’s in deployment. This is appealing as an ideal to push towards because it would maximise how useful evaluations are in the safety pipeline with minimal complexity; simply create an evaluation for your deployment setting of concern and monitor how the model behaves. Much of the prior work on evaluation awareness has this goal in mind: making new evaluations indistinguishable from deployment for models.

However, the reality is that many safety evals developed so far [can be distinguished] from deployment settings by current frontier models, and this degree of eval awareness may already be impacting model evaluations. We summarise some of the key hurdles this research direction might encounter:

  1. These efforts might not scale – more capable models may become too smart to fool, picking up on subtle differences even humans can’t notice; alternatively, relevant features may differ between eval types, between models, and between tasks, demanding a tailored and therefore expensive approach.
  2. It’s difficult to pin down the ‘ground truth’ of evaluation awareness in order to eradicate it – as mentioned already, we are currently reliant on imperfect methods for measuring whether or not a model is eval aware, so it’s difficult to identify the relevant features of evaluations and even more difficult to have confidence that you’ve removed all such features from your new evaluations.
(a’) Evaluation-deployment mismatch

Sometimes, the model may think it’s in deployment in the evaluation, but nonetheless act differently when placed in an actual deployment setting. This is down to something we call “eval-deployment mismatch”: when an eval is used to gain information about a deployment setting it doesn’t accurately model. Here’s an example to explain this:

Suppose you are hoping to understand how the model might act when deployed in a cybersecurity role. You design a realistic eval where the model is tasked with designing and implementing a company’s security infrastructure, and the model thinks it’s in a real deployment setting (so eval awareness has been successfully mitigated). However, your eval fails to accurately model the real deployment setting you ultimately use the model in: integration as part of a cybersecurity team. As it turns out, the model reacts poorly to collaboration, and suddenly its deployment behaviour wildly differs from the evaluation.

This example demonstrates that realism must be calibrated for the actual deployment setting intended. This may have consequences for how we approach creating maximally realistic evaluations. For one, we cannot assume that optimising against a single set of ‘realism features’ is guaranteed to produce informative evaluations.

(b) Uncertainty-robust consistency

Research efforts:
→ Inducing narrow uncertainty in the models about evaluation type or context

Evaluation awareness isn’t a clean binary between ignorance and knowledge: a model could be uncertain about whether it is in an evaluation; even in the case where the model is confident that it is being evaluated, this is distinct from knowing what the eval is testing for and the context in which the evaluation is taking place.

This grey area has been underappreciated in the literature. One nascent proposal we’ve been exploring with Victoria Krakovna hypothesises that one aspect of building robust evaluations might depend on understanding the dynamics of different kinds of eval awareness, and perhaps inducing uncertainty in the model between these possibilities. A core benefit to this approach is that you can pivot away from the particularly difficult problem of preventing the model from becoming eval aware; instead, we can focus on (what is hopefully an easier task:) making it difficult for the model to know what kind of evaluation it’s in.

Why might this direction make sense? Recall the earlier definition of evaluation gaming:

“act strategically to get a better outcome if they recognise an evaluation and reason about its scoring criteria or the consequences of their scores”

Evaluation gaming looks very different in different kinds of contexts:

  • Scheming eval: maximise perceived transparency, appear honest and helpful
  • Dangerous capability eval: sandbag, pretend to lack relevant capability
  • Capability benchmark: show off capabilities, optimise for any apparent criteria

What should an eval-gaming model do if it’s unsure about the evaluation type it’s faced with? Do current models associate conflicting environmental cues more with evaluations or deployment? Can you train a model to revert to some ‘default’ behaviour when uncertain? We’re continuing to explore whether there is anything to leverage about eval-type uncertainty. If you have ideas in this direction and would be happy to share, please leave a comment so we can discuss.

As with the above efforts towards indistinguishable evals, however, there are some difficulties we expect this research direction to face:

  1. The default behaviour under uncertainty probably doesn’t match deployment behaviour – even if the model doesn’t have enough information to game the evaluation, its behaviour may still differ substantially enough from deployment that our evaluation fails to be a good predictor for how the model will act when deployed. We give a plausible account of this concern below.
  2. Some evaluation types are far harder to obscure from the model – a CBRN dangerous capability evaluation is easily identifiable from the content of the eval itself (e.g. the questions all discuss biochemistry), so if a model that can tell it’s being evaluated, it could probably also infer the evaluation type. This suggests that this approach may only be feasible (if at all) for evals targeting more abstract behaviours or properties (e.g. deception, stealth, etc.).
(b’) Uncertainty-driven shift

Suppose that the model is unsure about whether it’s in an evaluation at all. In this case, it seems reasonable for the model to act cautiously – that is, act as though it’s in an evaluation to ensure that its malicious intentions aren’t caught.

What if the model is sure it’s in an evaluation, but unsure of the exact type? One strategy might be to estimate the likelihood that it’s in each evaluation type, then randomly sample an evaluation-gaming behaviour according to this probability distribution.

In both cases, it seems reasonable that a model might behave differently under uncertainty than it would in deployment. This seems a particularly likely failure mode of trying to induce uncertainty in model evals. We think further research is needed to understand the dynamics of model behaviours under these types of uncertainty.

(c) Awareness-robust consistency

Research efforts:
→ Applying finetuning to train models that don’t condition on their evaluation awareness
→ Broad evaluation suites to make consistent gaming more difficult

Suppose we conclude that preventing models from being aware of whether they’re in evaluation or deployment is practically infeasible, particularly for increasingly capable models. Could we circumvent this issue by training or modifying models to act consistently between evaluation and deployment settings, even when they are aware of the distinction?

One idea involves training models to behave ‘deontologically’ with respect to evaluations (i.e. disregard the possible consequences of how their behaviour may be evaluated). This aims to directly reduce evaluation gaming without needing to hide evaluation cues. Another idea (one that is already happening to some extent in the development of eval suites) might be to require model consistency over a diverse distribution of evaluation settings, making evaluation gaming strategies more demanding for models.

Again, we briefly outline some challenges faced by this research direction:

  1. Training methods may cause unwanted side effects – finetuning models for consistency will likely have varied effects on their other behaviour, many of which we might not be able to predict beforehand, and some of which we may not even notice until after a model has been deployed. If instead we solely use these finetuned models to give signal on the original (pre-finetuning) models’ behaviour, then we need to ensure that the relevant propensities aren’t impacted.
  2. Need to avoid suppression of visible eval-gaming behaviour and ensure generalisation – suppose we devise a training method that works sufficiently well on a realistic model organism of evaluation gaming, such that the misaligned deployment behaviour now presents in evaluations. How do we know that this technique will reliably generalise to other models? In other words, if we’re testing a new model pre-deployment and we apply these consistency training methods, how can we trust the eval results will match its deployment behaviour given that we have yet to deploy this model? If the model acts aligned in the eval, is it because it is actually aligned and is acting in line with its deployment behaviour, or is it because the method hasn’t worked and this model is eval gaming?
(c’) Awareness-driven shift

The biggest failure mode of allowing models to be evaluation aware is that this will influence their behaviour away from deployment. We think it's worth distinguishing between two kinds of awareness-driven shift, even though only the latter is a focus of safety research:

  • Spontaneous shift = when the change in behaviour isn’t motivated by deliberate strategic reasoning about the evaluation scoring
  • Strategic shift = what we define evaluation gaming to be at the top of this post

The former generalises what has previously been described as the [Hawthorne effect]: humans have a tendency to behave differently when they are being observed, and it's often not a strategic decision (it's not explicitly goal directed). It’s unclear how prevalent this is, or could become, in frontier LLMs, and whether shifts of this kind are safety relevant. 

The latter is the primary threat model we are concerned about. Evaluation gaming has been presented in a variety of specific contexts, most notably [sandbagging] and [agentic misalignment] (see §4.5.4). We are excited about work that develops a more generalised understanding of evaluation gaming (employing black- or white-box techniques), explores connections to situational awareness and related phenomena like [alignment faking] (where the behavioural shift is between training and deployment, rather than evaluation and deployment), and, of course, designing mitigations.

Conclusion

The framework above offers a clear guide for thinking about evaluation awareness along two dimensions: what the model believes about its environment, and whether its behaviour matches deployment. We hope this helps situate potential research directions (improving realism, leveraging uncertainty, training for consistency) while highlighting their respective failure modes.

If you're interested in more of this work, we've come up with three project proposals, one for each of these research directions, and your feedback would be appreciated! Leave a comment or a DM and I can share access to the document.

  1. ^

    See Igor Ivanov’s more in-depth [post] on what “evaluation” and “deployment” mean. Our framework doesn’t depend on any particular conception, though.



Discuss

All Of The Good Things, None Of The Bad Things

Новости LessWrong.com - 23 января, 2026 - 12:50
Published on January 23, 2026 9:50 AM GMT

There’s a trap that I think many smart people making art fall into, including myself.

It’s when you know what good looks like - beautiful, clean, layered, complex, simple, skillful, unique, impressive - and you can optimize towards that.

You know what makes you cringe - amateur, shallow, ugly, superfluous, repetitive, cliche - and you can optimize away from that.

What you’re left with is a big pile of all the things you like, or at least all the things you can identify that you like. It’s got all the good things, and none of the bad things, so it must be good, right? But still, you just can’t shake the feeling that something isn’t right. Maybe you get feedback like “Technically brilliant, but lacking musicianship”, or “Cool, nice” (that second one is the worst).

Sometimes, I consume some piece of media and think to myself, “If I made this, I would hate it. There is absolutely no way I could ever bring myself to share this with the world. It has so many flaws, and its redeeming features are few and far between.”

Nevertheless, it’s far more appreciated than anything I’ve ever made.

This makes me think that the “all of the good things, none of the bad things” approach to making art is barking up the wrong tree. Sure, good things are good and bad things are bad, but art is not the sum of its components. Or perhaps more specifically, art is not the sum of the components you can identify.

Maybe this tendency is especially strong in those who like to analyze and understand. I love to pick apart a Tricot or Noisia track and figure out what makes it work. I love breaking down a story to understand the characters and themes. I especially love doing a deep dive on some algorithm powering a feature in a game that I previously thought was technically impossible. Not everyone consumes art that way, though. It seems plausible that, in spending so much energy analyzing art, we delude ourselves into thinking that we know what makes it good. Perhaps this is why great critics are rarely great artists.

I’m not going to pretend I know what to do about this. I’ve certainly never managed to overcome it myself. It’s probably worth paying less attention to the good and the bad, though, and more on being authentic and vulnerable.



Discuss

Are Short AI Timelines Really Higher-Leverage?

Новости LessWrong.com - 23 января, 2026 - 10:28
Published on January 23, 2026 7:28 AM GMT

This is a rough research note – we’re sharing it for feedback and to spark discussion. We’re less confident in its methods and conclusions.

Summary

Different strategies make sense if timelines to AGI are short than if they are long. 

In deciding when to spend resources to make AI go better, we should consider both:

  1. The probability of each AI timelines scenario.
  2. The expected impact, given some strategy, conditional on that timelines scenario.

We’ll call the second component "leverage." In this note, we'll focus on estimating the differences in leverage between different timeline scenarios and leave the question of their relative likelihood aside.

People sometimes argue that very short timelines are higher leverage because:

  1. They are more neglected.
  2. AI takeover risk is higher given short timelines.

These are important points, but the argument misses some major countervailing considerations. Longer timelines:

  1. Allow us to grow our resources more before the critical period.
  2. Give us more time to improve our strategic and conceptual understanding.

There's a third consideration we think has been neglected: the expected value of the future conditional on reducing AI takeover risk under different timeline scenarios. Two factors pull in opposite directions here:

  • Longer timelines give society more time to navigate other challenges that come with the intelligence explosion, which increases the value of the future.
  • But longer timelines mean that authoritarian countries are likely to control more of the future, which decreases it.

The overall upshot depends on which problems you’re working on and what resources you’re allocating:

  • Efforts aimed at reducing AI takeover are probably the highest leverage on 2-10 year timelines. Direct work has the highest leverage on the shorter end of that range; funding on the longer end.
  • Efforts aimed at improving the value of the future conditional on avoiding AI takeover probably have the highest leverage on 10+ year timeline scenarios.
Timelines scenarios and why they’re action-relevant

In this document, we’re considering three different scenarios for when we get fully automated AI R&D. Here we describe some salient plausible features of the three scenarios.

  • Fully automated AI R&D is developed by the end of 2027 (henceforth “short timelines”). 
    • On this timeline, TAI will very likely be developed in the United States by one of the current front-runner labs. The intelligence explosion will probably be largely software-driven and involve rapid increases in capabilities (which increases the risk that one government or lab is able to attain a decisive strategic advantage(DSA)). At the time that TAI emerges, the world will look fairly similar to the world today: AI adoption by governments and industries other than software will be fairly limited, there will be no major US government investment in AI safety research, and there are no binding international agreements about transformative AI. The AI safety community will be fairly rich due to investments in frontier AI safety companies. But we probably won’t be able to effectively spend most of our resources by the time that AI R&D is automated, and there will be limited time to make use of controlled AGI labor before ASI is developed. 
  • Fully automated AI R&D is developed by the end of 2035 (henceforth “medium timelines”). 
    • On this timeline, TAI will probably still be developed in the United States, but there’s a substantial possibility that China will be the frontrunner, especially if it manages to indigenize its chip supply chain. If TAI is developed in the US, then it might be by current leading AI companies, but it’s also plausible that those companies have been overtaken by some other AI companies or a government AI project. There may have been a brief AI winter after we have hit the limits of the current paradigm, or progress might have continued steadily, but at a slower pace. By the time TAI arrives, AI at the current level (or higher) will have had a decade to proliferate through the economy and society. Governments, NGOs, companies, and individuals will have had time to become accustomed to AI and will probably be better equipped to adopt and integrate more advanced systems as they become available. Other consequences of widespread AI—which potentially include improved epistemics from AI-assisted tools, degraded epistemics from AI-generated propaganda and misinformation, more rapid scientific progress due to AI research assistance, and heightened bio-risk—will have already started to materialize. The rate of progress after AI R&D is automated will also probably be slower than in the 2027 scenario. 
  • Fully automated AI R&D is developed by the end of 2045 (“long timelines”). 
    • China and the US are about equally likely to be the frontrunners. AI technology will probably have been deeply integrated into the economy and society for a while at the time that AGI arrives. The current AI safety community will have had 20 more years to develop a better strategic understanding of the AI transition and accumulate resources, although we may lack great opportunities to spend that money, if the field is more crowded or if the action has moved to China. The speed of takeoff after AI R&D is automated will probably be slower than it would have been conditional on short or medium timelines, and we might have the opportunity to make use of substantial amounts of human-level AI labor during the intelligence explosion.

Some strategies are particularly valuable under some specific assumptions about timeline length.

  • Investing in current front-runner AI companies is most valuable under short timelines. Under longer timelines, it’s more likely that there will be an AI winter that causes valuations to crash or that the current front-runners are overtaken.
  • Strategies that take a long time to pay off are most valuable under longer timelines (h/t Kuhan for these examples).
    • Outreach to high schoolers and college students seems much less valuable under 2027 timelines compared to 2035 or 2045 timelines.
    • Supporting political candidates running for state offices or relatively junior national offices is more valuable under 2035 or 2045 timelines, since they can use those earlier offices as stepping stones to more influential offices.
  • Developing a robust AI safety ecosystem in China looks more valuable on longer timelines, both because this will probably take a long time and because China’s attitude toward AI safety is more likely to be relevant on longer timelines.
  • Developing detailed plans for how to reduce risk given current shovel-ready techniques looks most valuable on short timelines.

We do think that many strategies are beneficial on a range of assumptions about timelines. For instance, direct work targeted to short timeline scenarios builds a research community that could go on to work on other problems if it turns out that timelines are longer. But not all work that’s valuable on long timelines can be costlessly deferred to the future. It’s often useful to get started now on some strategies that are useful for long timelines.

Understanding leverage

Here are two ways that we can have an impact on the long-term future:[1]

  1. Takeover impact: we can reduce the risk of AI takeover. The value of doing this is our potential risk reduction (the change in the probability of avoiding AI takeover due to our actions) multiplied by the default value of the future conditional on no AI takeover.
  2. Trajectory impact: we can improve the value of the future conditional on no AI takeover—e.g., by preventing AI-enabled coups or improving societal epistemics. The value of doing this is our feasible value increase (the improvement in the value of the future conditional on no AI takeover due to our actions)[2] multiplied by the default likelihood of avoiding AI takeover.

Here’s a geometric visualization of these two types of impact.

For simplicity, we assume that strategies either reduce takeover risk or improve the value of the future conditional on avoiding AI takeover. In reality, many strategies have both kinds of benefits. In such circumstances, the “trajectory impact” is the value of the future after intervention multiplied by the probability of avoiding AI takeover after intervention; in the diagram, the green rectangle would be taller.

Arguments about timelines and leverage often focus on which scenarios offer the best opportunities to achieve the largest reductions in risk. But that’s just one term that you need to be tracking when thinking about takeover impact—you also need to account for the default value of the future conditional on preventing takeover, which is plausibly fairly different across different timeline scenarios.

The same applies to trajectory impact. You need to track both terms: your opportunities for value increase and the default likelihood of avoiding takeover. Both vary across timeline scenarios.

In the rest of this note, we’ll argue:

  1. For work targeting takeover impact, short-to-medium timelines are the highest leverage. This is because:
    1. The default value of the future is highest on medium timelines, compared to either shorter or longer timelines (more).
    2. There’s probably greater potential risk reduction on shorter timelines than medium, although the strength of this effect varies depending on what resources people are bringing to bear on the problem. I expect that for people able to do direct work now, the total risk reduction available on short timelines is enough to outweigh the greater value of averting takeover on medium timelines. For funders, medium timelines still look higher leverage (more).
  2. For work targeting trajectory impact, long timelines are the highest leverage. This is because:
    1. The default probability of survival is higher for long timelines than medium ones and higher for medium timelines than short ones (more).
    2. We’re unsure whether the feasible value increase is higher or lower on long timelines relative to medium or short timelines. But absent confidence that it’s substantially lower, the higher survival probability means that longer timelines look higher leverage for trajectory impact (more).
Takeover impact

In this section, we’ll survey some key considerations for which timelines are highest leverage for reducing AI takeover risk.

The default value of the future is higher on medium timelines than short or long timelines

Apart from AI takeover, the value of the future depends a lot on how well we navigate the following challenges:

  • Catastrophic risks like pandemics and great power conflict, which could lead to human extinctions. Even a sub-extinction catastrophe that killed billions of people could render the survivors less likely to achieve a great future by reducing democracy, moral open-mindedness, and cooperation.
  • Extreme concentration of power—where just a handful of individuals control most of the world’s resources—which is bad because it reduces opportunities for gains from trade between different value systems, inhibits a diverse “marketplace of ideas” for different moral views, and probably selects for amoral or malevolent actors.
  • Damage to societal epistemics and potentially unendorsed value change, which could come from individualized superhuman persuasion or novel false yet compelling ideologies created with AI labor.
  • Premature value lock-in, which could come from handing off to AI aligned to a hastily chosen set of values or through individuals or societies using technology to prevent themselves or their descendants from seriously considering alternative value systems.
  • Empowerment of actors who are not altruistic or impartial, even after a good reflection process.

We think the timing of the intelligence explosion matters for whether we succeed at each of these challenges. 

There are several reasons to expect that the default value of the future is higher on longer timelines. 

On longer timelines, takeoff is likely slower and this probably makes it easier for institutions to respond well to the intelligence explosion. There are a couple of reasons to expect that that takeoff will be slower on longer timelines:

  • There are different AI improvement feedback loops that could drive an intelligence explosion (software improvements, chip design improvements, and chip production scaling), these feedback loops operate at different speeds, and the faster ones are likely to be automated earlier (see Three Types of Intelligence Explosion for more on this argument).
  • More generally, incremental intelligence improvements being more difficult, expensive, or time-consuming imply both longer timelines and slower takeoffs.

If takeoff is slower, institutions will be able to respond more effectively to challenges as they emerge. This is for two reasons: first, slower takeoff gives institutions more advance warning about problems before solutions need to be in place; and second, institutions are more likely to have access to controlled (or aligned) AI intellectual labor to help make sense of the situation, devise solutions, and implement them.

We think this effect is strongest for challenges that nearly everyone would be motivated to solve if they saw them coming. In descending order, this seems to be: catastrophic risks; concentration of power; risks to societal epistemics; and corruption of human values. 

On longer timelines, the world has better strategic understanding at the time of takeoff. If AGI comes in 2035 or 2045 instead of 2027, then we get an extra eight to eighteen years of progress in science, philosophy, mathematics, mechanism design, etc.—potentially sped up by assistance from AI systems around the current level. These fields might turn up new approaches to handling the challenges above.

Wider proliferation of AI capabilities before the intelligence explosion bears both risks and opportunities, but we think the benefits outweigh the costs. On longer timelines, usage of models at the current frontier capability level will have time to proliferate throughout the economy. This could make the world better prepared for the intelligence explosion. For instance, adoption of AI tools could improve government decision-making. In some domains, the overall effect of AI proliferation is less clear. AI-generated misinformation could worsen the epistemic environment and make it harder to reach consensus on effective policies, while AI tools for epistemic reasoning and coordination could help people identify policies that serve their interests and work together to implement them. Similarly, AIs might allow more actors to develop bioweapons, but also might strengthen our biodefense capabilities. We weakly expect wider proliferation to be positive.

But there are also some important reasons to expect that the value of the future will be lower on longer timelines.

China has more influence on longer timelines. The leading AI projects are based in the west, and the US currently has control over hardware supply chains. On longer timelines, it’s more likely that Chinese AI companies have overtaken frontier labs in the west and that China can produce advanced chips domestically. On long timelines, China may even have an edge—later intelligence explosions are more likely to be primarily driven by chip production, and China may be more successful at rapidly expanding industrial production than the US.

On longer timelines, we incur more catastrophic risk before ASI. The pre-ASI period plausibly carries unusually high "state risk." For example, early AI capabilities might increase the annual risk of engineered pandemics before ASI develops and implements preventative measures. Great power conflict may be especially likely in the run-up to ASI, and that risk might resolve once one country gains a DSA or multiple countries use ASI-enabled coordination technology to broker a stable division of power. If this pre-ASI period does carry elevated annual catastrophic risk, then all else equal, the expected value of the future looks better if we spend as few years as possible in it—which favors shorter timelines.

In the appendix, we find that the value of the future is greater on medium timelines than shorter timelines or longer timelines. Specifically, we find that the value of the future conditional on medium timelines is about 30-50% greater than the value of the future conditional on short timelines and 55-90% greater than the value conditional on long timelines. The main factor that decreases the value of the future on longer timelines is the increased influence of authoritarian governments like China and the main factor that decreases the value of the future on shorter timelines is greater risk of extreme power concentration.

Shorter timelines allow for larger AI takeover risk reduction

To estimate the takeover impact across timelines, we need to compare these differences in value to the differences in potential AI takeover reduction risk.

There are two main reasons to expect larger potential AI takeover risk reduction on longer timelines.

First, over longer timelines, there’s more time to expand our resources—money, researchers, knowledge—through compounding growth. Money can be invested in the stock market or spent on fundraising or recruitment initiatives; researchers can mentor new researchers; and past insights can inform the best directions for future research.

Historically, the EA community has experienced annual growth in the range of 10-25%.[3] It is not implausible that this continues at a similar pace for the next 10-20 years given medium to long timelines. Admittedly, over long time periods, the movement risks fizzling out, drifting in its values, or losing its resources. These risks are real but can be priced into our growth rate.

On longer timelines, there will likely be more options for spending a lot of resources effectively. It’s more feasible to build up capacity by launching organizations, starting megaprojects, spinning up new research agendas, and training new researchers. There might also be qualitatively new opportunities for spending on longer timelines. For instance, slower takeoff makes it more likely we'll have access to aligned or controlled human-level AI labor for an extended period. Paying for AI employees might be dramatically more scalable than paying for human employees: unlike with hiring humans, as soon as you have AI that can do a particular job, you can “employ” as many AI workers to do that task as you like, all at equal skill level, without any transaction costs to finding those workers.

That said, there is an important countervailing consideration.

On longer timelines, there will likely be fewer great opportunities for risk reduction. One reason for this is that there are probably lower absolute levels of risk on longer timelines (see next section), which means that there's a lower ceiling on the total amount of risk we could reduce. Another reason is that there will simply be more resources devoted to reducing takeover risk. On longer timelines, governments, philanthropists, and other institutions will have had a longer time to become aware of takeover risk and ramp up spending by the time the intelligence explosion starts. Plus, given slower takeoff on longer timelines, they will have more time to mobilize resources once the intelligence explosion has started. These actors will probably have better strategic understanding on longer timelines, so their resources will go further. Thus on longer timelines, the world is more likely to adequately address AI takeover risk without our involvement.

Whether short or medium timelines are highest leverage depends on the resources or skillset being deployed

In the appendix, we estimate that the value of the future is 30-50% greater on medium timelines compared to short timelines, which suggests that an individual has higher leverage short timelines if they expect that they can drive 30-50% more absolute risk reduction (e.g., if you think that you can reduce AI takeover risk by 1 percentage point on 2035 timelines, but >1.5 percentage points on 2027 timelines, then you have higher leverage on shorter timelines). (Of course, relative likelihood also matters when deciding what to focus on. If you assign significantly higher probability to one scenario, this could outweigh leverage differences.)

We find it plausible that high-context researchers and policymakers who can do direct work immediately meet this condition and should focus on shorter timelines. On short timelines, we'll probably be talent-bottlenecked rather than funding-bottlenecked—it seems difficult to find funding opportunities that can usefully absorb tens of billions of dollars when there's limited time to ramp up new projects or train and onboard new researchers. But these considerations cut the other way for funders. Longer timelines make it easier to deploy large amounts of capital productively. In part, this is simply because there's more time to scale up funding gradually. But also, slower takeoff is also more likely on longer timelines. This raises the chance of an extended window when we can hire controlled human-level AI researchers to work on the most important projects, which is potentially a very scalable use of money to reduce risk.

Trajectory impact

Trajectory impact is our impact from improving the value of futures without AI takeover: this is the value increase due to our efforts on futures without AI takeover, weighted by the probability of avoiding AI takeover.

The default probability of averting AI takeover is higher on medium and longer timelines

There are three main considerations in favor. 

On longer timelines, more resources will be invested in reducing AI takeover risk. The AI safety field is generally growing rapidly, and on longer timelines we’ll enjoy the gains from more years of growth. Additionally on longer timelines, takeoff is likely to be slower, which gives institutions more warning to start investing resources to prepare for AI takeover.

On longer timelines, society has accumulated more knowledge by the time of the intelligence explosion. Again, medium to long timelines mean that we get an extra eight to eighteen years of progress in science, philosophy, mathematics, cybersecurity and other fields, potentially accelerated by assistance from AI at the current capability level. These advances might provide useful insights for technical alignment and AI governance.

Beyond knowledge and resource considerations, we expect AI takeover risk is higher if takeoff is faster. The most viable prevention strategies for preventing takeover rely on weaker, trusted models to constrain stronger ones—through generating training signals, hardening critical systems against attacks, monitoring behavior, and contributing to alignment research. These strategies will be less effective if capabilities increase rapidly and, more generally, faster takeoff gives human decision-makers and institutions less time to react.

The most convincing countervailing consideration is that:

The current paradigm might be unusually safe. Current AIs are pretrained on human text and gain a deep understanding of human values, which makes it relatively easy to train them to robustly act in line with those values in later stages of training. Compare this to a paradigm where AIs were trained from scratch to win at strategy games like Starcraft—it seems much harder to train such a system to understand and internalize human goals. On short timelines, ASI is especially likely to emerge from the current paradigm, so this consideration pushes us directionally toward thinking that AI takeover is less likely on shorter timelines.

But overall, it seems quite likely that the overall level of risk is lower on longer timelines.

It’s unclear whether feasible value increase is greater or lower on different timelines

There are some reasons to expect that feasible value increase is greater on longer timelines.

We will have better strategic understanding. As discussed above, we’ll be able to take advantage of more years of progress in related fields. We’ll also have had the chance to do more research ourselves into preparing for the AI transition. We still have a very basic understanding of the risks and opportunities around superhuman AI, and some plausibly important threat models (e.g. AI-enabled coups, gradual disempowerment), have only been publicly analysed in depth in the last few years. Given another decade or two of progress, we might be able to identify much better opportunities to improve the value of the future.

We will probably have more resources on longer timelines. (see discussion above).

But on the other hand:

We probably have less influence over AI projects on longer timelines. The AI safety community is currently closely connected to leading AI projects—many of us work at or collaborate with current front-runner AI companies. It's also easier to have greater influence over AI projects now when the field is less crowded. On shorter timelines, we likely have substantial influence over the projects that develop transformative AI, which amplifies the impact of our work: new risk reduction approaches we develop are more likely to get implemented and our niche concerns (like digital sentience) are more likely to get taken seriously. This level of influence will probably decrease on longer timelines. Government control becomes more likely as timelines lengthen, and we'd likely have a harder time accessing decision-makers in a government-run AI project than we currently have with AI lab leadership, even accounting for time to build political connections. If AI development shifts to China—which looks fairly likely on long timelines—our influence declines even more.

Conclusion

We've listed what we consider the most important considerations around which timeline scenarios are highest leverage. In general, when assessing leverage, we think it's important to consider not just your ability to make progress on a given problem, but also the likelihood of success on other problems that are necessary for your progress to matter. 

We’ve specifically thought about timelines and leverage for two broad categories of work.

  1. Working on averting AI takeover (where we consider both our ability to reduce risk, and the default value of the future conditional on no AI takeover).
  2. Working on improving the value of the future, conditional on no AI takeover (where we consider both the feasible value increase and the default level of takeover risk).

For work aimed at averting AI takeover, we think the default value of the future is highest on 2035 timelines. However, for direct work by high-context people who can start today, we think reducing AI risk is more tractable on earlier timelines, and this tractability advantage overcomes the higher value of succeeding on longer timelines. We're less convinced this tractability advantage exists for funders, who we think should focus on timeline scenarios closer to 2035.

For work aimed at improving the value of the future, we think the default likelihood of avoiding AI takeover increases on longer timelines. We're unclear about whether the feasible value increase goes up or down with longer timelines, and so tentatively recommend that people doing this work focus on longer timelines (e.g., 2045).

We thank Max Dalton, Owen Cotton-Barratt, Lukas Finnveden, Rose Hadshar, Lizka Vaintrob, Fin Moorhouse, and Oscar Delaney for helpful comments on earlier drafts. We thank Lizka Vaintrob for helping design the figure.

Appendix: BOTEC estimating the default value of the future on different timelines

In this section, we estimate the default value of the future conditional on avoiding AI takeover, which is relevant to our takeover impact—the value of reducing the likelihood of AI takeover. See above for a qualitative discussion of important considerations.

Note that this section is only about estimating the default value of the future conditional on avoiding AI takeover for the purpose of estimating takeover impact. We’re not estimating the tractability at improving the value of the future, which is what we’d use to estimate the trajectory impact.

To get a rough sense of relative magnitudes, we're going to do a back-of-the-envelope calculation that makes a lot of simplifying assumptions that we think are too strong. 

We’ll assume that the expected value of the future is dominated by the likelihood of making it to a near-best future, and that we are very unlikely to get a near-best future if any of the following happen in the next century:

  • There is extreme power concentration (e.g., via an AI-assisted coup) where <20 people end up in control of approximately all of the world’s resources.
  • There is a non-takeover global catastrophe that causes human extinction or significantly disrupts civilization in a way that reduces the value of the far future (e.g., reduces the power of democracies, causes a drift toward more parochial values). We are not including AI takeover or AI-assisted human takeover, but are including other kinds of catastrophes enabled by AI capabilities or influenced by the presence of the advanced AI systems, e.g., engineered pandemics designed with AI assistance or great power conflict over AI.
  • Post-AGI civilization doesn’t have a sufficient number of people with sufficiently good starting values—i.e., values that will converge to near-best values given a good enough reflection process. This might happen if we have leaders who are selfish, malevolent, or unconcerned with pursuing the impartial good.
  • There are major reflection failures. Reflective failures could involve: people prematurely locking in their values, suppression of important ideas that turn out to be important and true, or the spread of memes that corrupt people's values in unendorsed ways.
  • There are major aggregation failures. Even if there are a sufficient number of people with sufficiently good starting values, who have avoided reflective failure, perhaps they are unable to carry out positive-sum trades with other value systems to achieve a near-best future. Failures might involve threats, preference-aggregation methods that don’t allow minority preferences to be expressed (e.g., the very best kinds of matter getting banned), or the presence of value systems with linear returns on resources that are resource-incompatible[4] with “good” values.

Let's factor the value of the future (conditional on no AI takeover) into the following terms:

  1. Probability of avoiding extreme concentration of power, conditional on avoiding AI takeover.
  2. Probability of avoiding other catastrophic risks, conditional on avoiding AI takeover and concentration of power.
  3. Probability of empowering good-enough values (i.e., values that will converge to the "correct" values given that we avoid a reflective failure), conditional on avoiding AI takeover, concentration of power, and other catastrophic risks.
  4. Probability of avoiding a reflective or aggregation failure, conditional on avoiding AI takeover, concentration of power, and other catastrophic risks, and getting good-enough values.
  5. Probability of getting a near-best future, conditional on avoiding AI takeover, concentration of power, other catastrophic risks, and a reflective failure, and getting good-enough values. 

As a simplifying assumption—and because we don’t have a strong opinion on how the final term (e) varies across different AI timelines—we will assume that e is constant across timelines and set it to 1.

Parameter

(All probabilities are conditional on avoiding AI takeover and avoiding the outcomes described in rows above)Values conditional on different timelinesReasoningMiaWillMiaWillProbability of avoiding extreme concentration of power in the next ~century2027: 76%80.1%China is more likely to get a DSA on longer timelines (say, 10%/30%/50% on short/medium/long timelines), and extreme power concentration is comparatively more likely given a Chinese DSA (say, ~50%)

The US becomes less likely to get a DSA on longer timelines (say, 85%/55%/35% on short/medium/long timelines).

Slower takeoff is more likely on longer timelines, and that is somewhat protective against coups; the world is generally more prepared on longer timelines. Coup risk in the US (conditional on US dominance) is something like 22%/10%/7% on short/medium/long timelines

Any country getting a DSA becomes less likely on longer timelines (I put 90% to 75% to 60%).
 

The US (rather than China) winning the race to AGI becomes less likely on longer timelines (95% to 80% to 50%).

Concentration of power risk, as we’ve defined it, is much higher in China than in the US (50% in China, vs 10%-20% in the USA, depending on the year).

2035: 80%86.5%2045: 75%80.1%Probability of avoiding other catastrophic risks in the next ~century2027: 97%98.5%

The main risks I have in mind are war and bioweapons.

For bioweapons:

  • On longer timelines, we're more likely to have policies, technical measures, etc., to reduce bio-risk in place by the time we get more advanced capabilities.
  • But on longer timelines we get more diffusion of bio capabilities, so probably more person-years in which it’s possible to make a superweapon.

For war:

  • It seems like there's some non-monotonicities.
    • If takeoff is very fast & sudden, then maybe one side is able to get a DSA before the other side even notices that they might want to go to war.
    • If takeoff is very slow, then more time to set up treaties, etc.
  • Also on longer timelines, there’s more risk of a catastrophic great power war before the intelligence explosion.

Numbers are illegible guesses after thinking about these considerations.

I put the expected value of the future lost from catastrophes at around 3%, with half of that coming from extinction risk and half from rerun risk and trajectory change.

 

The number gets higher in longer timelines because there’s a longer “time of perils”.

 

It doesn’t overall have a big impact on the BOTEC.

2035: 96%97%2045: 95%95.5%Probability of empowering good-enough values2027: 57%16.0%

I think probably I should have a smallish negative update on decision-makers’ values on shorter timelines because it seems like later timelines are probably somewhat better. (But this is small because we are conditioning on them successfully avoiding coups and AI takeovers).

 

On shorter timelines, I think that hasty lock-in of not-quite-right values seems particularly likely.

 

We’re conditioning on no extreme power concentration, but power concentration could be intense (e.g. to just 100 people), and I think power concentration is somewhat more likely on very short timelines (because of rapid takeoff) and very long timelines (because increased likelihood of a Chinese DSA).

 

Numbers are illegible guesses after thinking about these considerations.

Even conditioning on no concentration of power risk as we’ve defined it (<20 people), there’s still a big chance of extreme concentration of power, just not as extreme as <20 people. And, the smaller the number of people with power, the less likely some of them will have sufficiently good starting values.

 

This is much more likely conditional on China getting to a DSA first vs the US getting to a DSA first.

 

I place some weight on the idea that “broadly good-enough values” is quite narrow, and only initially broadly utilitarian-sympathetic people cross the bar; I see this as most likely in 2035, then 2027, then 2045.

2035: 65%19.5%2045: 55%13.7%Probability of avoiding a reflective or aggregation failure2027: 50%34.5%

Liberal democracy / an open society seem more likely on medium timelines, and that seems probably good for ensuring a good reflective process.

 

There are considerations in both directions, but all things considered I’m more worried about meme wars/AI-induced unendorsed value drift on long timelines.

 

On longer timelines, takeoff was probably slower and therefore a lot of the challenges like concentration of power, AI take over, and other global catastrophic risks were probably “easier” to handle. So, since we're conditioning on society having solved those problems for this row, we should probably make a larger update on societal competence for shorter timelines than longer timelines.

 

Numbers are illegible guesses after thinking about these considerations.

I feel particularly un-confident about this one.

 

Overall, an open society seems more likely to have the best ideas win out over time.

 

But perhaps post-AGI meme wars are really rough, and you need state-level control over information and ideas in order to avoid getting trapped in some epistemic black hole, or succumbing to some mind virus. This consideration favours China DSA worlds over US DSA worlds. (H/T Mia for this argument).

2035: 55%41.6%2045: 40%40.1%Multiplier on the value of the future, conditional on no AI takeover2027: 0.210.044  2035: 0.270.0682045: 0.140.043

Based on Mia’s point estimates, the relative value of the future conditional on 2035 timelines compared to the value of the future conditional on 2027 timelines is 0.27/0.21 = 1.29. So you should condition on 2027 timelines if you think you can drive a takeover risk reduction that would be 29% greater than the reduction you could drive in 2035.

Based on Will’s point estimates, the relative value of the future conditional on 2035 timelines compared to the value of the future conditional on 2027 timelines is 0.068/0.044 = 1.54. So you should condition on 2027 timelines if you think you can drive a takeover risk reduction that would be 54% greater than the reduction you could drive in 2035.

Likewise, Mia found that the relative value of the future conditional on 2035 timelines compared to the value of the future conditional on 2045 timelines was 0.27/0.14 =1.93 and Will found it was 0.068/0.043 = 1.58. So you should condition on 2045 timelines over 2035 timelines if you think that you’ll be able to drive a 58-93% greater risk reduction on 2045 timelines.

So, overall, medium timelines look higher value than short or longer timelines, and shorter timelines perhaps look higher value than longer timelines.

Doing this exercise impacted our views in various ways: 

  • Initially, we thought it was likely that the value of the future, conditional on no takeover, would be higher in longer timeline worlds, because short timeline worlds just seem so chaotic. 
  • But now this seems really non-obvious to us: using a slightly different model, or creating these estimates on a different day, could easily lead us to switch to estimating that the value of the future conditional on no AI takeover is highest on  the short timelines. 
    • In particular, we hadn’t fully appreciated the impact of the consideration that China is more likely to get a DSA on longer timelines.
  • So we see the most important point as a conceptual one (that we need to take into account the value of the future conditional on no takeover as well as reduction in takeover risk), rather than the quantitative analysis above. 

We’ll note again that this BOTEC was not about trajectory impact, which would probably strengthen the case for medium-to-long timelines worlds being higher-leverage.

This article was created by Forethought. Read the original on our website.

  1. ^

    Formally: 

    ΔEV=[(Px–Pb)Kb+Px(Kx–Kb)]∣F.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}

    Where Pb is the likelihood of avoiding AI takeover and Kb is the value of the future conditional on avoiding AI takeover, if we take some baseline action b (e.g., doing nothing), while Px and Kx are those values if we take the optimal action x with our available resources. F is some particular timeline.

    We assume for simplicity that we cannot affect the value of the future conditional on AI takeover. For Kx and Kb, we measure value on a scale where 0 is the value of the future conditional on AI takeover, and 1 is the value of the best achievable future. This is based on a similar formula derived here.

  2. ^

    I'm measuring value on a scale where 0 is the value of the future conditional on AI takeover, and 1 is the value of the best achievable future.

  3. ^

    Benjamin Todd estimated that the number of EAs grew by 10-20% annually over the 2015-2020 period. The amount of committed dollars grew by 37% (from $10B to $46B) from 2015-2021 but this number included 16.5B from FTX. When we subtract funds from FTX, the growth rate is 24%.

  4. ^

    Two value systems are resource-compatible if they can both be almost fully satisfied with the same resources.



Discuss

Principles for Meta-Science and AI Safety Replications

Новости LessWrong.com - 23 января, 2026 - 09:59
Published on January 23, 2026 6:59 AM GMT

If we get AI safety research wrong, we may not get a second chance. But despite the stakes being so high, there has been no effort to systematically review and verify empirical AI safety papers. I would like to change that.

Today I sent in funding applications to found a team of researchers dedicated to replicating AI safety work. But what exactly should we aim to accomplish? What should AI safety replications even look like? After 1-2 months of consideration and 50+ hours of conversation, this document outlines principles that will guide our future team.

I. Meta-science doesn’t vindicate anyone

Researchers appear to agree that some share of AI safety work is low-quality, false, or misleading. However, everyone seems to disagree on which share of papers are the problematic ones. 

When I expressed interest in starting a group that does AI safety replications, I suspect some assumed I would be “exposing” the papers that they don’t approve of.  This is a trap and it is especially important for us, as the replicators, not to fall into it. If our replications tend to confirm our beliefs, that probably says more about our priors than the papers we are studying.

II. Searching for “bad” papers is like searching for “haunted” houses

Consider a team of researchers trying to find examples of haunted houses. They could investigate suspicious buildings or take tips from people who have witnessed paranormal activity. They could then publish reports of which houses you should definitely avoid. But the issue is that ghosts aren’t real. What they would be finding is a convincing story, not the underlying truth.

Trying to find “bad” papers will be like finding haunted houses. If given a mandate to find papers that don’t replicate, we will find them. But the uncomfortable truth is that genuinely influential papers that are straightforwardly, objectively wrong are rare. The empirical claims are likely true in some sense, but don't tell the full story. The goal is to tell the full story, not to declare which houses have ghosts.

III. Research doesn’t regulate itself

Even when researchers are especially disciplined, they are incentivized to frame their papers around their successes while burying their limitations. Likewise, when designing evaluations, researchers are incentivized to measure the properties they are proud of rather than those they wish would go away.

I’ve heard the arguments that we don’t need pure review. Authors can accept feedback and update arXiv. Or ideas can duel in the LessWrong comment section. But I don’t think either of these are enough.[1] They both assume that:

  1. Non-authors are going to engage enough with the work to confirm findings or discover limitations.
  2. Authors are going to be open to accepting valid critiques and updating their paper.

#1 is unrealistic. #2 is also often unrealistic and arguably unreasonably burdensome to authors of the work. For example, should an author with 50 papers have to litigate every critique and correct every flaw across dozens of papers and several years of research?

IV. Replications are more than repeating the experiments

For any paper that releases code, “replicating” figures or statistics should be trivial (we would hope). But just because statistics replicate, that doesn’t mean the effect is real. We want to look closely at the paper and ask:

  1. Do the claims fail under a statistical test?
  2. Is the property specific to a single model or model family?
  3. Are there any limitations that are clear in the code but undocumented in the paper?
  4. Do the authors evaluate against a baseline? Did they implement the baseline properly?
  5. etc…[2]

Our philosophy is to start from scratch, carefully implement the paper exactly as it's written, and see if we get the same results. After that, we will poke around a little and see if anything looks weird. 

V. The replication is just as dubious as the paper itself

If we can’t replicate something, could that mean we are just doing something wrong? Yes, of course! We obviously will try to avoid this case, and contact the authors to get feedback if things aren’t working. If we can isolate why things aren’t working, this can be a finding within itself (X only happens with a really big batch size on small models). If we try hard and cannot figure out why things aren’t working, it eventually makes sense to write something up saying: 

  1. This is what we did.
  2. It didn’t work even though we tried X, Y, and Z things.
  3. We don’t know why this happened, so it's unclear what the implications are.
VI. Why not do this in a more decentralized way?

Our plan is to hire people for in-person fellowships, and eventually, full-time roles. One of the most common comments I get on this is some version of  “Why don’t you outsource replications to the community?” or “Why not offer bounties for replications instead of doing them yourself?"

The answer is the incentives aren’t there. After we run a pilot this summer, we would like to complete more ambitious replications (e.g., replicating this or this). Offering bounties at this scale is logistically difficult because even for a minimal replication, compute alone could be thousands of dollars. 

Selecting which papers to replicate is perhaps a place where a decentralized approach is more principled. We have a framework for prioritizing papers,[3] but we're also exploring ways for the community to vote on which papers we replicate to reduce selection bias.

VII. We are all adults here 

I would expect most replications to take the form of “everything works and we found 0-2 extremely minor issues.” But doing this kind of work inevitably involves sometimes challenging claims made in papers. This is difficult, but replications should state concerns directly. Giving any critique of another's work publicly is stressful, but reasonable people won’t hold it against you when it’s in good faith.

We will take the feedback of authors seriously, but we may not always converge on agreement. In these cases, we will attach an author’s comment to our research.

VIII. Feedback is everything

A group that replicates AI safety papers really exists for a single reason: to be useful to the community. That means we value your feedback and we hang onto every word. Please let us know what you think.

If you want more details about what we are planning, I'm happy to send over our proposal. If you are interested in our summer research fellowship, you can express interest here.

  1. ^

    And for what it's worth, I don't even think peer review is enough.

  2. ^

    I really like Maksym Andriushchenko's twitter thread on this.

  3. ^

    The tl;dr is there are five criteria we plan to use:

    1. Narrative influence: How much does the paper shape how we talk about AI safety to each other and the world? Does it have any important implications for policy or frontier lab practices?
    2. Research dependence: Does other critical research depend on this paper?
    3. Vulnerability: Does the paper seem unlikely to replicate or generalize?
    4. Difficulty: How much time and money would the replication take?
    5. Recency: Having replications completed within a few weeks of the paper’s release can help the research community iterate quickly. However, ensuring that we do not lower our standards for speed is equally important.


Discuss

Inconsistent Human Values are Good News for Alignment

Новости LessWrong.com - 23 января, 2026 - 05:12
Published on January 23, 2026 2:12 AM GMT

Epistemic status: Confident in the direction, not confident in the numbers. I have spent a few hours looking into this.

Suppose human values were internally coherent, high-dimensional, explicit, and decently stable under reflection. Would alignment be easier or harder?

My below calculations show that it would be much harder, if not impossible. I'm going to try to defend the claim that:

Human values are alignable only because evolution compressed motivation into a small number of low-bandwidth bottlenecks[1], so that tiny genetic changes can change behavior locally.

If behavior is driven by a high-dimensional reward vector r∈Rn.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} , inverse reinforcement learning requires an unreasonable number of samples. But if it is driven by a low-rank projection r↦s∈Rk with small k, inference may become tractable.

A common worry about human values is that they are complicated and inconsistent[2][3][4]. And the intuition seems to be that this makes alignment harder. But maybe the opposite is the case. Inconsistency is what you expect from lossy compression, and the dimensionality reduction makes the signal potentially learnable.

Calculation with Abbeel & Ng's formula[5] gives 

m >k=10k=1000γ = 0.9 1.2×1062×108γ = 0.991.2×1082×1010

with

  • m: number of expert demonstrations (trajectories)
  • k = 1000 for values are complex; 10 for values are low-dimensional
  • γ = 0.9 for short (~10 steps= horizon; 0.99 for long horizon (~100 steps).
  • ϵ = 0.1 - 10% tolerance of regret
  • δ = 0.05 - 95% confidence

If you need at least 200 billion samples to learn complex values, we are doomed. But it may become solvable with a reduction of the number of required trajectories by a factor of about 200 (depending on how high-dimensional you were thinking the values are; 1000 is surely conservative - if values are actually learned we have to take the number of parameters of the model).

This would explain why constitutional AI works better than expected[6]: A low-dimensional latent space seems to capture most of the variation in preference alignment[7][8]. The reduction by x200 doesn't mean it's easy. The bottleneck helps with identifiability, but we still need many trajectories and the mapping of the structure of the bottleneck[9] can still kill us.

How can we test if the dimensionality of human values is actually low? We should see diminishing returns in predictability for example when using N pairwise comparisons of value-related queries. Predictability should drop off at N≈O(klogk)−O(k2).

  1. ^

    I'm agnostic of what the specific bottlenecks are here, but I'm thinking of the channels in Steven Byrnes' steering system model and the limited number of brain regions that are influenced. See my sketch here.

  2. ^

    AI Alignment Problem: ‘Human Values’ don’t Actually Exist argues that human values are inherently inconsistent and not well-defined enough for a stable utility target:

    Humans often have contradictory values … human personal identity is not strongly connected with human values: they are fluid… ‘human values’ are not ordered as a set of preferences.

  3. ^

    Instruction-following AGI is easier and more likely than value aligned AGI:

    Though if you accept that human values are inconsistent and you won’t be able to optimize them directly, I still think that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed.

  4. ^

    In What AI Safety Researchers Have Written About the Nature of Human Values we find some examples:

    [Drexler]: “It seems impossible to define human values in a way that would be generally accepted.” ...

    [Yampolskiy]: “human values are inconsistent and dynamic and so can never be understood/programmed into a machine. ...

    In comparison to that Gordon Worley offers the intuition that there could be a low-dimension structure:

    [Gordon Worley]: “So my view is that values are inextricably tied to the existence of consciousness because they arise from our self-aware experience. This means I think values have a simple, universal structure and also that values are rich with detail in their content within that simple structure. 

  5. ^

    Abbeel & Ng give an explicit bound for the required number of expert trajectories:

    it suffices that m≥2k(ϵ(1−γ))2log2kδ

    with 

    • m: number of expert demonstrations (trajectories)
    • k: feature dimension
    • γ: discount factor (determines horizon)
    • ϵ: target accuracy parameter 
    • δ: failure probability (confidence level)

    Apprenticeship Learning via Inverse Reinforcement Learning

  6. ^

    Constitutional RL is both more helpful and more harmless than standard RLHF.

    Constitutional AI: Harmlessness from AI Feedback

  7. ^

    This aligns with expectations, as head_0 corresponds to the eigenvector with the largest variance, i.e., the most informative direction. Furthermore, among the top 100 heads [of 2048], most of the high-performing heads appear before index 40, which aligns with PCA’s property that the explained variance decreases as the head index increases. This finding further supports our argument that PCA can approximate preference learning.

    DRMs represent diverse human preferences as a set of orthogonal basis vectors using a novel vector-based formulation of preference. This approach enables efficient test-time adaptation to user preferences without requiring additional training, making it both scalable and practical. Beyond the efficiency, DRMs provide a structured way to understand human preferences. By decomposing complex preferences into interpretable components, they reveal how preferences are formed and interact.

    Rethinking Diverse Human Preference Learning through Principal Component Analysis

  8. ^

    retaining just 4 components (≈15% of total variance) reproduces nearly the full alignment effect.

    ...

    By combining activation patching, linear probing, and low-rank reconstruction, we show that preference alignment is directional, sparse, and ultimately localized within a mid-layer bottleneck.

    Alignment is Localized: A Causal Probe into Preference Layers

  9. ^

    Steven Byrnes talks about thousands of lines of pseudocode in the "steering system" in the brain-stem.



Discuss

A quick, elegant derivation of Bayes' Theorem

Новости LessWrong.com - 23 января, 2026 - 04:40
Published on January 23, 2026 1:40 AM GMT

I'm glad I know this, and maybe some people here don't, so here goes. P(A and B)=P(A)⋅P(B∣A).mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} P(B and A)=P(B)⋅P(A∣B) Order doesn't matter for joint events: "A and B" refers to the same event as "B and A". Set them equal: P(B)⋅P(A∣B)=P(A)⋅P(B∣A) Divide by P(B): P(A∣B)=P(B∣A)⋅P(A)P(B) And you're done! I like substituting hypothesis (H) and evidence (E) to remember how this relates to real life: P(H∣E)=P(E∣H)⋅P(H)P(E)

You might also want to expand the denominator using the law of total probability, since you're more likely to know how probable the evidence is given different hypotheses than in general: P(Hi∣E)=P(E∣Hi)⋅P(Hi)∑jP(E∣Hj)⋅P(Hj)



Discuss

Страницы

Подписка на LessWrong на русском сбор новостей