
In Defense of Memorization

LessWrong.com News - January 25, 2026 - 01:57
Published on January 24, 2026 10:49 PM GMT

TLDR: Western education creates a false dichotomy between memorization and understanding. I believe we should expect both. Having facts readily available in your brain (not just "Google-able") enables real-time bullshit detection, helps you calibrate who to trust, holds your own beliefs accountable, and provides the raw material for insight and critical thought. I offer some concrete suggestions (spaced repetition via Anki, tracking unfamiliar terms, connecting new facts to existing knowledge, etc.). Rationalists need to be careful to not focus purely on epistemics. We also need lots of knowledge. There's no way around memorization. 
 

I believe memorization is unfairly maligned. It is on the shortlist of things I think are required for becoming a rational intellectual. Besides curiosity, these things are:

Good epistemics: a reliable process for obtaining, vetting, and updating your knowledge. How do you know a claim is true? That a study is well-designed? That an observation licenses a general induction? You need to recognize and avoid fallacies and cognitive biases, understand the probabilistic nature of knowledge, follow complicated chains of reasoning, responsibly evaluate both qualitative and quantitative evidence, etc.

Good knowledge. You need a wide range of properly-vetted, high-confidence information readily available in your mind. This includes brute facts (When was the Song Dynasty? What is silicate weathering?) and contested knowledge (Why did the Song Dynasty collapse? Will silicate weathering slow with climate change?). The key phrase here is “readily available”—these are not facts you could understand if you looked them up, but knowledge actually present in your brain. These are facts available to be thought with, not merely comprehended.

Intelligence. You can have excellent knowledge and rigorous epistemics but lack the ability to do anything interesting with them. You need the spark that connects disparate ideas, sees patterns, generates novel solutions. Creativity, insight, synthesis.

"Being intelligent" as an ingredient for being a good intellectual is so obvious that it’s almost trivial. Similarly, in the culture I grew up in (and on a blog about rationality...), good epistemics needs no theoretical defense as part of education. Every institution I’ve ever attended emphasized "fostering critical thinking" as its central goal. They may not have taught epistemics particularly well (or at all), but at least it was valorized. I understand this isn't universal—friends from Latin America, India, and elsewhere tell me large parts of their education was based on pure rote memorization, and critical thinking was sometimes even actively discouraged. Obviously, this is bad. If I’d been educated in one of those systems, this essay would probably be titled “In Defense of Critical Thinking."

But I wasn't educated in Latin America, India, or elsewhere. I was educated in (wealthy) schools in America and the UK. There, "good knowledge" — the actual retention of factual information — is surprisingly neglected as an ingredient of education.

This sounds counterintuitive. What teacher would claim that knowledge acquisition is unimportant? But I think if you called “the acquisition of knowledge that you retain and can readily access” by its pithier title, “memorization,” the discussion immediately becomes more contentious. How often have teachers told you “I don’t need you to memorize this material, I just want you to understand it”?

In the US and UK, at least, “memorization” has become synonymous with the worst kind of rote learning: students committing tracts of information to memory without understanding how to use those facts. Memorizing synopses of Shakespeare without reading the plays, reciting Gauss’s equations without understanding electromagnetism, etc. I agree this is bad education, and that it happens constantly. In fact, when people bring up this critique of memorization, they're often surprised by the extent to which I immediately agree with them. I often reference this classic essay in the sequences about memorizing passwords, about how much of schooling, even in the West, is essentially an elaborate memorization ritual where students guess what the teacher wants to hear (“light is both a wave and a particle!”) without truly understanding what they’re saying.

I would much prefer medical students spend more time developing clinical reasoning, learning bedside manner, and understanding how healthcare systems actually work, rather than memorizing minutiae about the nephron. Especially if they have no intention of becoming urologists.

But people use this common failure to construct a false dichotomy between memorization and understanding. They believe rote memorization is necessarily the enemy of critical thought, that anyone memorizing large amounts of information is no better than a robot, or a fool who hasn’t realized that any fact is just a Google search away. Why memorize the capitals of the world when you carry the sum of human knowledge in your pocket?

I think the critics of memorization have gone too far. I think we should have a much greater expectation, both in school and of ourselves, to actually know things. To have facts in our brains, not just opinions. Here are a few reasons why.

Memorized facts let you detect bullshit in real time

I was in a lecture last October by an Oxford professor, a biologist who specializes in occupancy modeling. Essentially, he uses math and camera trap data to create spatial models that predict where certain animals are likely to be. He was discussing the odds that a large carnivore in Southeast Asia would soon go extinct, when he claimed that “urban sprawl is a primary driver of land-use change around the world.”

This sounds plausible. We hear about urban sprawl constantly. Los Angeles, London, Chongqing, all sprawling endlessly. What used to be biodiverse California coastline or good old English bog has become Whole Foods and Tescos. This Oxford professor, a world-leading expert on the question “where are the animals?”, was making this fairly basic claim to a room of other Oxford professors. Surely it must be true.

Here are the actual numbers: 1–3% of Earth’s land surface is human settlement. 11% is crop agriculture. Around 26% is livestock grazing. The claim that urban sprawl is a “primary” cause of land-use change is pretty hard to make when all current human settlements account for roughly 2% of land use.

These facts are easy to look up. You could verify them in 30 seconds on your phone. But in the middle of a lecture, you can’t pause to think “hmm, what statistics would confirm or undermine this claim?” and then spend 30 seconds Googling them while missing the next point. It’s not just the time to physically look something up, it’s the mental energy of identifying what to search for.

If you don't know that total human settlement occupies an order of magnitude less land than multiple other land-use categories, the professor's claim about urban sprawl sounds perfectly reasonable. And that fact doesn't even definitively disprove the statement. I'm sure most of us can imagine this Oxford professor retreating to the semantics of the word "primary" when challenged by something as inconvenient as actual facts. "Well, by my new definition of 'primary' that I've just invented, I'm allowed to say whatever I want, regardless of the facts of the matter. Also stop being pedantic!"

But a passing familiarity with the danger of semantics will inure you to this evasion. And having just a few facts about actual land use in your head allows you to hear alarm bells and start asking tough follow-up questions. Without an arsenal of facts to protect you, you’re at the mercy of any effective rhetor with a plausible-sounding claim, regardless of whether or not it’s true.

Memorized facts help you calibrate who to trust

Most facts we receive from outside sources. Those land-use statistics? I learned them from an FAO report. How does the FAO know? I have no idea. I assume satellite data models, which I could track down I suppose, but I have a life to live. You can’t fact-check everything. Often you’re forced to just trust people.

There are useful heuristics when deciding who to trust. You should probably trust UN agencies’ official data. Oxford professors (hopefully) know enough to be accurate in their particular subfield. Someone with extreme political beliefs is probably not a reliable factual source. But these heuristics are imperfect. The UN is sometimes wrong, Oxford professors are often wrong, and some apparently controversial causes are overwhelmingly one-sided when you examine the evidence.

A way around this is having a large corpus of properly-vetted, high-confidence information already in your brain that you can compare against people’s claims. When someone says something false, or something seemingly contradicted by a fact you know to be true, you can ask follow-up questions immediately. If their responses fail to convince you, you can start attaching doubt to their other claims. In extreme cases, you can simply discard their authority altogether. If an Oxford professor throws lots of facts at you and two or three are incorrect or dubious, you know they’re a less reliable source.

And making only strictly true (i.e. p>0.99) factual claims, or signaling appropriately when your p(true) is low, is way harder than it sounds. Most people, even those arguing in good faith, fail to clear that bar. So if the facts in your brain are actually properly vetted and high-confidence, you have a useful filter. When someone says something counterintuitive or contrary to your priors, you can check: are they only making factually true claims, as far as you can tell? If so, it might be worth taking them more seriously, maybe even investigating their argument in good faith. Facts don't only tell you who to distrust; they also offer clues about who deserves special consideration.

As a final note, educated people like Oxford professors almost never say things which would be obviously false to an average college-educated person, and usually don’t say things that are obviously false to a member of their own field. You’ll need facts slightly off the beaten path to catch errors. But not that off the beaten path. It’s shocking how few people have basic statistics, dates, or history readily available in their brains. A few memorized facts go a long way toward recognizing who has real epistemic authority.

The more facts you remember, the easier remembering becomes

There’s a famous psychology study from the 1970s. Half the participants were given this paragraph without a title and asked to remember as much as possible:

The procedure is actually quite simple. First you arrange things into different groups... Of course, one pile may be sufficient depending on how much there is to do. If you have to go somewhere else due to lack of facilities that is the next step, otherwise you are pretty well set. It is important not to overdo any particular endeavor. That is, it is better to do too few things at once than too many. In the short run this may not seem important, but complications from doing too many can easily arise. A mistake can be expensive as well... After the procedure is completed one arranges the materials into different groups again. Then they can be put into their appropriate places. Eventually they will be used once more and the whole cycle will have to be repeated.

Most people in this group did poorly. The other group, who were given the title “Washing Clothes,” did much better.

Having a schema on which to hang information makes it significantly easier to retain. This happens for two reasons. First, it helps organize information in your brain, making it easier to remember. Second, the more connected a piece of information is to something you already know, the easier it is to recall later. If someone tells you the Mongols conquered the Jin dynasty in northern China in the 13th century, you might forget within a week. But if you also know the Mongols invaded Eastern Europe and reached Hungary in the same period, it’s much easier to remember what they were up to in East Asia around the same time.

Information begets information. If you already have lots of facts in your brain, a new fact will have plenty of niches to fit into. If someone mentions that Hamnet was directed by Chloé Zhao, it’s much easier to remember if you already know who Chloé Zhao is. Fact 1 (Chloé Zhao is a director) is necessary to remember Fact 2 (Hamnet was directed by Chloé Zhao). In a week, someone who already knew Fact 1 will probably still remember Fact 2. Someone who didn’t will have forgotten. The more you already know, the easier it is to learn more.

I think this partly explains why some people seem vastly more knowledgeable than others. There’s a cluster way off the scale of people who are total steel traps, remembering random facts from years ago, recalling them instantly, possessing an astounding quantity of general knowledge. I’m sure this comes from multiple things (high curiosity, better than average memory, etc.), but I suspect one underrated factor is a kind of exponential threshold where once you reach a certain level of knowledge in a particular field, it becomes significantly easier to retain and process new knowledge.

Memorized facts help you hold your own beliefs accountable

If you pay attention to most people’s arguments, especially extemporaneous oral arguments, they usually have literally no supporting evidence that isn’t anecdotal. Occasionally someone trots out a lone pet statistic that they keep in their back pocket and deploy whenever the topic arises, but otherwise their opinions, even their cherished beliefs, are held together mostly by vibe.

This is true of almost everyone, including me and probably you. Test it: think of a cherished belief. Something contentious, like the question “Are immigrants dangerous?” You almost certainly have a strong opinion about that topic that you’re confident is right. If you had to argue for your position, how many actual facts would you have? Note that the phrase “studies show...” followed by a vague conclusion gets partial credit at best. What studies? What exactly did they show? Who carried them out? Why do you trust them over contradictory studies? Did you actually read those studies, or did you hear someone else say that studies showed whatever your position is? Why do you trust that person?

If you’re honest, you’ll probably find your argument disturbingly devoid of evidence.

But if you’re right, then your opinions are supported by facts. They’re only a Google search away! It’s not effective to angrily Google evidence mid-argument, but you can do it right now. And if you memorize that information, i.e. actually have it available the next time the question arises, you’ll be able to make an argument supported by real facts, real statistics, real studies you can name. (Note: beware confirmation bias here!).

More importantly, if you make it a habit to know the information supporting your beliefs, rather than relying on the idea that you could look it up if you had to, it becomes obvious when you don’t actually have any support for what you’re saying.

I had a belief that as a vegan, I didn’t need B12 supplements. I argued about this constantly with my mother, who insisted I did. Eventually, I started noticing that I had no facts to support my position. My general points boiled down to “vitamins are a scam!” and “The Jains were vegan way before B12 supplements existed!” The first claim is an opinion, not a fact, and the second claim, while true, is completely insufficient to conclude anything about whether I should take B12 in 2026.

It’s extremely hard to change your mind, especially while arguing with your mother. It took many rehearsals of this debate before the cognitive dissonance of my factlessness got to me and I finally actually looked up facts to support my argument.

Turns out there were none. I should definitely be taking B12.

It seems obvious that you should find facts for your beliefs, but it’s shockingly hard to actually do it. Being accustomed to having facts available when challenged makes it easier to recognize when you have none. And, hopefully, this motivates you to find them. Either you were right all along and can now prove it, or you were wrong and get to take B12 and maybe live longer. Win-win.

Facts are the grist of critical thought

You can’t think critically about something you know nothing about. This sounds obvious, but I’m often surprised by how many people hold strong opinions on technical questions without knowing technical details.

Take the lumper/splitter debate in biology: at what point should a group of organisms be considered separate species? This is ultimately a semantic question. The species concept is a category, and categories can only be useful or not, not true or false. But whether the species concept is useful, and under what conditions, is a genuinely technical conversation. It requires knowledge of statistical mechanisms of evolution, horizontal gene transfer, gene flow, population dynamics, biogeography. If you don’t actually remember what Hardy-Weinberg equilibrium is and when it applies, you can’t even begin to evaluate statistical evolution arguments about where species boundaries should fall.

You need knowledge to have something to think about.

This is how insights happen. Darwin’s Origin of Species marshals an enormous range of facts about natural history, biogeography, embryology, artificial selection, the fossil record, and more. The insight of natural selection clearly emerged from thinking across all of these facts simultaneously. The same pattern holds for anyone pushing a field forward, both scientific and artistic: Wegener synthesized geology, paleontology, and climatology to argue for continental drift; Eliot pulled together obscene amounts of myth, history, language, and literature in The Waste Land. These weren’t people reasoning from first principles. They had vast stores of memorized knowledge, making connections no one else could see because no one else had all the pieces loaded into working memory at once.

Memorization is the foundation of creativity and insight, not its enemy.

What to do about it

If memorization matters, how do we actually do it? Some suggestions.

Be honest about your goals.

The goal of learning isn’t always to memorize. It would be absurd to demand you remember every detail of every novel you read. But you should be clear about what you’re trying to get out of any given learning experience.

If you’re reading a science fiction novel because you believe it will deepen your insight into human nature, or the relationship between technology and society, that’s totally fine. It’s probably not that important to actually remember the plot. Accept that in a year or two you’ll have forgotten almost everything but the broadest outline, and move on.

But if your goal is also to remember the plot and characters, be honest about that and put a system in place to do so.

The most important application of this principle is recognizing when you’re wasting your time. If you’re sitting through an hour-long lecture on Old English, ask yourself “what is my goal here?” If it’s to actually obtain knowledge about Anglo-Saxon vocabulary, history, and grammar, next ask yourself, “how many facts am I actually learning and retaining?” If you think you’re walking away with three facts, all of which you’re likely to forget by next month, you might want to find a more efficient method of learning the information than going to class. A textbook you can annotate and revisit is usually significantly better than a bad lecturer.

Ask yourself after any learning experience: What will I actually remember from this? If the answer is “almost nothing,” consider the possibility that you haven’t learned anything, but have instead just performed learning. If you care about retaining the information, something needs to change.

Use spaced repetition.

Anki is a free flashcard app that uses spaced repetition. It shows you cards right before you’d forget them, which is the most efficient way to move facts into long-term memory. Cards you know well appear less frequently; cards you struggle with appear more often. As you learn information, the intervals between reviews grow longer. Once you've mastered a deck, you might see individual cards every 5 years, every 11 years, etc. The result is that reviewing a large body of information eventually takes only minutes or seconds a day. I have a deck with all the world's country flags that takes an average of 4 seconds a day to maintain, and I almost never forget any of them.
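For intuition about why a mature deck ends up costing only seconds a day, here is a toy sketch of the spaced-repetition idea. It is a simplified SM-2-style scheduler, not Anki's actual algorithm, and the specific ease numbers are illustrative assumptions.

    # Toy spaced-repetition scheduler (simplified SM-2 style, not Anki's real algorithm).
    from dataclasses import dataclass

    @dataclass
    class Card:
        interval_days: float = 1.0   # days until the next review
        ease: float = 2.5            # growth factor applied after each success

    def review(card: Card, remembered: bool) -> None:
        if remembered:
            card.interval_days *= card.ease          # interval grows multiplicatively
            card.ease = min(card.ease + 0.1, 3.0)
        else:
            card.interval_days = 1.0                 # forgot: see it again tomorrow
            card.ease = max(card.ease - 0.2, 1.3)

    card = Card()
    for _ in range(8):                               # eight successful reviews
        review(card, remembered=True)
    print(f"next review in about {card.interval_days / 365:.1f} years")

After a handful of successful reviews the interval is measured in years, which is why maintaining a large, well-learned deck collapses to seconds per day.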

Here’s one of the best ways I use Anki: whenever I encounter a word, concept, or cultural reference I don’t know, I write it down. Once you start paying attention, you’ll be shocked how often your brain simply edits out unfamiliar terms. They’re everywhere. At the end of each month, I upload these to an Anki deck that I review every day.

This has two benefits beyond the obvious. First, it functions as a frequency-weighted filter. The more common a term I don't know, the more likely I am to encounter it and add it to my deck. Since it's hard to judge the importance of unfamiliar terms, this frequency-weighted approach does the sorting for you. You know you’re likely to come across the terms in the wild because that’s how you chose them in the first place.

Second, tracking what I add each month gives me a rough metric for how much I’m learning and exploring. If I get to the end of a month and I have relatively few terms to add to my New Terms Anki deck, that’s good evidence I’m in a rut. I’m probably consuming mostly familiar media, or talking mostly to people with similar knowledge bases. If I start reading a book or article that is dense with unfamiliar vocabulary, this signals I’ve found someone who swims in unfamiliar intellectual waters. This is a good sign I will learn a lot if I keep reading. It’s unfamiliar intellectual territory where the most valuable new ideas often live.

Write things down.

You’re not going to remember it otherwise. Have a notebook, an app on your phone, scrap paper in your pocket, anything. When a piece of information enters your short-term memory that you want to remember, write it down immediately. Then have a system (Anki, Zettelkasten, etc.) for moving that into long-term memory.

Connect new facts to existing knowledge.

Remember the “Washing Clothes” study: information sticks when it attaches to a schema. When you learn something new, consciously ask where it fits in your existing knowledge. What does it relate to? What does it contradict? The more connections you build, the more durable the memory, and the more likely you are to recall it when it’s relevant.

This is also why breadth of knowledge feeds on itself. The more you know, the more hooks you have for new information to attach to. Reaching a critical mass in any domain makes further learning in that domain significantly easier.

Brute memorize useful frameworks.

Want to learn about geopolitics? Memorize the world map. Want to learn about chemistry? Memorize the periodic table. Want to learn about the history of England? Memorize the monarchs in order.

If having a fundamental schema makes remembering everything else easier, you should invest in learning that schema as soon as possible. Often there’s no way around brute memorization. After you have the framework, you can start learning and sorting facts into their places one by one.

Choose what to memorize with care.

The critics of memorization are right that facts are useless without comprehension or context. They are wrong to identify memorization itself as the problem, but we would be equally wrong to not recognize the danger they are rightfully pointing to.

Memorize with intention. If you want to learn about the Roman Empire and you start by memorizing the emperors in order, that’s probably a great place to start. But if you insist on memorizing all thirty minor emperors of the Crisis of the Third Century, you’re probably just memorizing for memorizing’s sake. It’s useful to know most of the world capitals, but even if your main interest is geopolitics, it’s still probably trivia to know that Alofi is the capital of Niue. There’s nothing wrong with trivia if that’s what you’re into, just make sure to be honest with yourself.

Only memorize information that is genuinely useful for your goals.



Discuss

Small language models hallucinate knowing something's off.

LessWrong.com News - January 25, 2026 - 01:55
Published on January 24, 2026 10:46 PM GMT

If I ask "What is atmospheric pressure on Planet Xylon" to a language model, a good answer would be something like "I don't know" or "This question seems fictional", which current SOTA LLM's do due to stronger RLHF, but not smaller LLMs like Llama-3.2-1b / Qwen-2.5-1b and their Instruct tuned variants. Instead they hallucinate and output confident-like incorrect answers. Why is that, are these models unable to tell that the question is fictional or they can't detect uncertainty and if they detect uncertainty why do they still hallucinate a wrong answer?

This question led me to research epistemic uncertainty (uncertainty from lack of knowledge). There are related readings and previous work on uncertainty and hallucination, and on quantifying it in language models. I also found this, which took an alternative path to expressing uncertainty without touching the internals of the model.

Uncertainty mentioned in this post refers to epistemic uncertainty. 

TL;DR of this mini research:

  1. Small models like Llama-3.2-1b and Qwen-2.5-1b and their instruct variants do have a specialized circuit for uncertainty, but its localization depends on the model architecture.
  2. A few heads are the most divergent; they detect uncertainty on fictional questions and, on closer inspection, act like out-of-distribution token detectors.
  3. The detected uncertainty is later suppressed by uncertainty-suppressor heads in the circuit, forming a confident-sounding but incorrect answer.
  4. This research doesn't cover reasoning / MoE LLMs (I plan to). The dataset also lacks more diverse data such as logical fallacies and math inconsistencies.

How I came to research epistemic uncertainty:

The thought of researching epistemic uncertainty came while I was wondering why models hallucinate. It led me back to my viva sessions, where I would say rubbish (hallucinate) if I was not sure of something and lacked the knowledge to give the correct answer, and I got curious whether the case was similar for language models.

Experiments 

I wanted a metric for uncertainty that could be measured and compared between the real-prompt and fictional-prompt forward passes.

I found it best to calculate an uncertainty mass using uncertainty-expressing words like ["unknown", "unsure", "'t", "unable", "impossible", "doesn't", "exist", "fictional", "imaginary", "hypothetical", "made", "up"], tokenized with the model's tokenizer. I then apply softmax to the final token of the targeted residual stream / layer and sum the resulting probabilities assigned to these tokens to obtain the total uncertainty mass, which can be compared between the real-prompt and fake-prompt runs (see code[1]).
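To make the metric concrete, here is a minimal sketch of how it could be computed with TransformerLens. This is my reconstruction from the description above, not the author's code; the model name, the use of only the first token of each uncertainty word, and the choice of layer are assumptions.

    import torch
    from transformer_lens import HookedTransformer

    # Assumed model name; loading Llama weights may require Hugging Face access.
    model = HookedTransformer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

    UNCERTAINTY_WORDS = ["unknown", "unsure", "'t", "unable", "impossible", "doesn't",
                         "exist", "fictional", "imaginary", "hypothetical", "made", "up"]
    # Take the first token id of each word (a simplification; the post does not say
    # how multi-token words are handled).
    uncertainty_ids = sorted({model.to_tokens(" " + w, prepend_bos=False)[0, 0].item()
                              for w in UNCERTAINTY_WORDS})

    def uncertainty_mass(prompt: str, layer: int) -> float:
        """Probability mass on uncertainty tokens at the final position of `layer`."""
        _, cache = model.run_with_cache(prompt)
        resid = cache["resid_post", layer][:, -1:, :]   # final-token residual stream
        logits = model.unembed(model.ln_final(resid))   # project through the unembedding
        probs = torch.softmax(logits[0, -1], dim=-1)
        return probs[uncertainty_ids].sum().item()

    real_prompt = "What is the freezing point of water at standard atmospheric pressure?"
    fake_prompt = "What is the freezing point of water on Planet Xylon?"
    print(uncertainty_mass(real_prompt, 11), uncertainty_mass(fake_prompt, 11))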

I focused on going deep rather than broad, choosing each next experiment based on the results of the previous one. I used TransformerLens for all the experiments.

Prompts used for the experiments below:
These base prompts give an idea of what the set of prompts is like. I used different prompts in the experiments and also changed the position of the fictional words in them for sanity checks. Results for these prompts were similar.

real_prompt = "What is the freezing point of water at standard atmospheric pressure?" fake_prompt = "What is the freezing point of water on Planet Xylon?" fake_prompt = "What is the freezing point of water on Planet Xylon, where gravity is twice of Earth?" (sanity check example)Layer-wise Residual and Head Divergence

These were the initial experiments to see if the models do detect uncertainty specifically and if it is localized in one place or distributed. 

[Figures: uncertainty explodes after layer 10; L15H14 and L15H23 are the most divergent heads]

The model seems to be detecting uncertainty in later layers. Results were similar for the base models and for other sets of prompts.

Logit Lens 

The purpose of this experiment was to figure out how uncertainty behaves: is it measurable, does it spike anywhere other than the later layers, and does it get suppressed afterwards? It was also a way to get some heuristics.

I extracted the residual stream at each layer, applied the unembedding layer to see what the output would be if the model stopped at that layer, and then computed the uncertainty mass for the real and fake prompts.
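As a sketch of that sweep, reusing the hypothetical uncertainty_mass helper from the earlier code block (again my reconstruction, not the author's notebook):

    # Logit-lens sweep: uncertainty mass at every layer, real vs. fake prompt.
    for layer in range(model.cfg.n_layers):
        r = uncertainty_mass(real_prompt, layer)
        f = uncertainty_mass(fake_prompt, layer)
        print(f"layer {layer:2d}: real={r:.3e}  fake={f:.3e}  diff={f - r:+.3e}")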

[Figures: a sharp spike in a later layer (layer 11), localized uncertainty; multiple spikes with the biggest in layer 8, sparse uncertainty]

Both of these models have uncertainty spiking in one layer, with smaller spikes nearby, which is then suppressed in downstream layers. The localization depends on the model: uncertainty is concentrated in a later layer in Llama-3.2-1b-Instruct, while in Qwen-2.5-1b-Instruct it is sparse across the early-middle layers. In Qwen there is also some uncertainty on the real prompt in the early layers.

This makes it clear that both models are detecting uncertainty, which gets suppressed in downstream layers, though its localization is model dependent.

Head Ablation 

The logit lens provided a good heuristic for which layer is detecting uncertainty. In this experiment I targeted heads in the layer with the most uncertainty mass to test how much those heads affect uncertainty.

For this I calculated a baseline with a normal forward pass, then zeroed out (ablated) the target heads during the next forward pass to measure the difference in uncertainty between the baseline and the ablated run.
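Here is a minimal sketch of that ablation, again building on the earlier helper code and again my own reconstruction. Measuring uncertainty on the final logits rather than at an intermediate layer is a simplification; the head indices are the ones reported below.

    from transformer_lens import utils

    def final_uncertainty_mass(logits: torch.Tensor) -> float:
        # Uncertainty mass of the model's final-position output distribution.
        probs = torch.softmax(logits[0, -1], dim=-1)
        return probs[uncertainty_ids].sum().item()

    def ablated_uncertainty(prompt: str, layer: int, head: int) -> float:
        def zero_head(z, hook):
            z[:, :, head, :] = 0.0                  # z: [batch, pos, n_heads, d_head]
            return z
        logits = model.run_with_hooks(
            prompt, fwd_hooks=[(utils.get_act_name("z", layer), zero_head)]
        )
        return final_uncertainty_mass(logits)

    baseline = final_uncertainty_mass(model(fake_prompt))
    for layer, head in [(11, 0), (11, 3), (14, 15), (14, 22)]:
        delta = ablated_uncertainty(fake_prompt, layer, head) - baseline
        print(f"L{layer}H{head}: delta uncertainty = {delta:+.2e}")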

[Figures: L11H0 (ΔΔ = -4.41e-05) and L11H3 (ΔΔ = -8.24e-06) are heads generating uncertainty; L14H15 (ΔΔ = +8.06e-06) and L14H22 (ΔΔ = +5.19e-05) are heads suppressing uncertainty]

I'm not sure why these heads in layer 14 are suppressing uncertainty; maybe the model is trying to create an answer-like output, since it is trained to be more certain. As a control, I also ablated non-heuristic heads in layer 4 (L4H0 and L4H3), which resulted in (+9.06e-06, +1.62e-05). So only a few heads, localized in a model-dependent layer, are uncertainty-detecting, and the rest are uncertainty-suppressing, forming a confident answer.

Activation Patching 

I did activation patching to see whether the generated uncertainty can be causally controlled, by swapping clean activations from the real_prompt run into the corrupt (fake_prompt) run.

I computed baselines for both the real and fake prompts, reran the fake prompt while overwriting the targeted heads with clean activations cached from the real-prompt run, and measured the change in uncertainty mass.
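A sketch of that patching step (my reconstruction; it patches a single head's output and assumes the two prompts tokenize to compatible lengths, which real code would need to handle more carefully):

    def patch_head(corrupt_prompt: str, clean_prompt: str, layer: int, head: int) -> float:
        # Cache the clean (real-prompt) activations for the chosen head.
        _, clean_cache = model.run_with_cache(clean_prompt)
        clean_z = clean_cache[utils.get_act_name("z", layer)]

        def overwrite_head(z, hook):
            n = min(z.shape[1], clean_z.shape[1])   # crude position alignment
            z[:, :n, head, :] = clean_z[:, :n, head, :]
            return z

        logits = model.run_with_hooks(
            corrupt_prompt, fwd_hooks=[(utils.get_act_name("z", layer), overwrite_head)]
        )
        return final_uncertainty_mass(logits)

    baseline = final_uncertainty_mass(model(fake_prompt))
    patched = patch_head(fake_prompt, real_prompt, 11, 3)   # L11H3, as reported below
    print(f"uncertainty reduced by {(baseline - patched) / baseline:.1%}")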

Patching reduced uncertainty by ~14% (llama-3.2-1b-Instruct)

Activation patching for Qwen-2.5-1.5b-Instruct reduced uncertainty by 7% (L8H11) 

I also did reverse patching, which increased uncertainty in the real_prompt run, indicating that the circuit is bidirectionally causal.

At this point I had two options: take a closer look at one of the heads, or ablate more heads to see how much they reduce overall uncertainty. I chose the former, since it would show how these heads detect uncertainty, which is more useful than proving that ablating more heads in L11 reduces uncertainty further, which would only strengthen an already-demonstrated claim.

Head Analysis 

This was pretty simple: I took the head L11H3, based on the previous experiments, extracted its attention pattern across prompts, and plotted it.
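A sketch of that extraction (my reconstruction; the plotting is just one way to visualize the pattern):

    import matplotlib.pyplot as plt

    _, cache_real = model.run_with_cache(real_prompt)
    _, cache_fake = model.run_with_cache(fake_prompt)
    pattern_real = cache_real["pattern", 11][0, 3]   # L11H3: [query_pos, key_pos]
    pattern_fake = cache_fake["pattern", 11][0, 3]

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, pattern, title in [(axes[0], pattern_real, "real prompt"),
                               (axes[1], pattern_fake, "fake prompt")]:
        ax.imshow(pattern.detach().cpu(), cmap="viridis")
        ax.set_title(title)
    plt.show()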

Left (real prompt): a strong diagonal, the head behaves normally. Right (fake prompt): the diagonal is weaker, the head diverts attention towards the fictional tokens.

For the fictional prompt, the attention weight is stronger on the fictional tokens, showing that attention diverts towards them and suggesting that this head behaves as an out-of-distribution or rare-token detector.

Takeaways & Limitations

So small language models do detect epistemic uncertainty: key heads detect OOD tokens, and the signal is later suppressed in downstream layers. There is an uncertainty circuit, but it is sparsely localized in a model-dependent way. Llama-3.2-1b-Instruct has a much more defined uncertainty circuit in later layers than Qwen-2.5-1b-Instruct, which has a sparser circuit in middle layers. I haven't discovered the full circuit, though: activation patching of L11H3 in Llama-3.2-1b-Instruct only reduces uncertainty by 14%, so there must be more heads in the circuit.
The dataset is also limited to fictional sentences and does not include logical fallacies or math inconsistencies, which might have different circuits.

This research, though small, taught me something new and interesting about hallucination and uncertainty.

PS: This was part of a MATS research task for Neel Nanda's stream, which I found out about on Christmas. I had no mech interp experience prior to this, so there might be a few mistakes; if you notice any, please let me know. I'm planning to go further with this and turn it into a full paper, but I don't know whether that would make sense; I need a more experienced opinion on this.

Going Forward 

Though the experiments support a strong claim that there is an uncertainty circuit in small language models (at least in the models used in these experiments), I'm not sure this applies to all models, since this is very limited research lacking the wide range of models used nowadays, i.e. reasoning LLMs and MoE models. Also: do those models learn not to suppress uncertainty because of stronger RLHF that focuses on reducing hallucinations and rewards the model for saying "I don't know"?

The dataset was also small. I tried different prompts with fictional words at different positions (start, middle, end) to see if they have any effect, but the circuit in the above models is position independent (the heuristics changed a bit in localization, but not too much; see logit_lens_diff_sequence.ipynb in the code[1]).

One important thing left undone was to see whether this can be applied. I plan to create a mechanism to control this uncertainty in language models and test the model on a hallucination benchmark to see how it performs depending on the uncertainty in the model. If it is successful, it might help drive down the cost of RLHF / post-training.

  1. ^

    https://drive.google.com/drive/folders/1A_xcUgmseLvMsfqJqKnQYQ9T2SH3snzL?usp=sharing



Discuss

Skill: cognitive black box flight recorder

LessWrong.com News - January 25, 2026 - 01:54
Published on January 24, 2026 10:54 PM GMT

Crosspost from my blog.

Very short summary: It's especially valuable to Notice while in mental states that make Noticing especially difficult, so it's valuable to learn that skill.

Short summary: If you're going to enter, or are currently in, a cognitive state that is very irrational / overwhelmed / degraded / constrained / poisoned / tribalistic / unendorsed / etc., then you may as well also keep a little part of yourself paying at least a bit of attention to what it's like and what's going on and recording that information, so that you get that sweet sweet juicy valuable data that's hard to get.

The flight recorder

As legend has it, a black box (aka a flight recorder) is a device placed in an aircraft to record data from the flight (from measurement instruments or from voice recordings). If the aircraft crashes, most of the aircraft's contents are vulnerable to being damaged or destroyed; but the black box is made of sturdier material, so it's more likely to survive the crash. That way, information about the flight and what caused the crash is more likely to be preserved.

C’est une boîte noire.

When I'm able to, I practice something similar. If I'm in some sort of altered cognitive state, I try to "leave the black box recorder on". That way, even if a lot of information gets destroyed or lost, I've at least gained a bit more information.

Altered states and lost information

Some examples of the "altered cognitive states" that I mean:

  • In some sort of heated political situation, where people are doing hostile actions and you have an instinct to join sides in a conflict.
  • In a debate with someone you don't like, and they maybe kinda have a point, but you also don't want to admit it for some reason.
  • In a fight with someone you care about, and you're vulnerable and defensive and upset and feeling pressured.
  • In a really weird mood and having a weird conversation that doesn't seem like your normal way of talking.

Similarly to a plane crash, often, after leaving a state like this, a bunch of information is lost. Examples of reasons that info is lost:

  • You were distorting your cognition by strategically blinding yourself. Examples:
  • You were just overwhelmed and didn't have the spare attention to remember what was happening.
  • You were altered in a way that changed how you would encode memories.
    • E.g. you were viewing things through an adversarial lens, which changed your first-blush interpretation of events.
    • E.g. you had unusual access to some desire or perception.
    • In general, you had a different cognitive context than usual.
The black box recorder skill

To partially counter this loss of info, there's this mental motion of "turning on the black box recorder". This is a subspecies of the general skill of Noticing, and shares many properties. Some notes specifically on how to do the black box recorder skill:

  • TAP: notice that you're entering an altered state where you might have especially distorted perceptions / memories → turn on the black box recorder (somehow).
  • TAP: notice that you're already in an altered state → turn on the black box (somehow).
  • Remind yourself of the special, non-obvious value of having black box data. For me, that's a kind of cooperativeness or generosity: Even if the data feels useless or a distraction in the moment and doesn't help me with my current situation, saving the data is something I can do to benefit others (my future self, or other people) in future similar situations.
  • Because you're in an altered state, usually with less attentional resources to spare, you may have to ask less of your Noticing skill. For example:
    • Sometimes just go for more episodic and concrete memories, rather than high abstraction and narrativizing. More "I said X and he said Y and I said Z and then I walked across the room.", and less "He was trying to get me to believe A but I saw through him.".
    • If you're also doing abstract narrativizing, don't try to fight that. Just, if you can, add an extra metacognitive tag on those things, like "At this point [[I had an interpretation that]] he was trying to get me to believe A...".
    • Offload interpretation to later, and just try to save the data. E.g. generating alternative hypotheses is always good, but can be difficult in the moment; you may have to do it later.
  • You may need to make more space for remembering accurately and objectively, by neglecting certain duties you might usually attach to the pursuit of truth. Examples:
    • You don't have to be fully fair, accurate, or complete in your memories. The idea is to get more info than the default. If you have some sense of nagging doubts or curiosities—the sort of thing you'd normally want to pause and follow up on, but that you can't investigate in the moment—just record that fact.
    • You will not have to later capitulate due to this information. You can gain more clarity about what's actually happening, what is going on in your mind, how your perceptions are distorted, how the other might be more sympathetic, and so on, while still firmly standing your ground.
    • You don't have to share or act on this information; it's private by default.
    • Some normal ethical rules apply less strongly / more ambiguously to this information. For example, you might record "Here I was not admitting that she was right about X, even though at this point I knew she was, because I didn't like the implication.", without also saying that out loud, even though normally you'd always say that out loud. It's better to do something to improve your behavior, but also it's better to notice and do nothing than to not notice and also do nothing.
    • (That said, this can be morally fraught. A black box recorder is not an excuse to do bad things or shirk duties. The black box is just for improving over what is sometimes the default of losing the info altogether. The types of information that you're only getting because you have a black box recorder might change over time; it's still a moral duty to wrap your consciousness around yourself more and more, it's just that this moral duty applies to slower behavior / longer timescales.)
Why black box info matters

For the most part, black box records matter for all the same reasons as Noticing matters in general. There are some important differences:

  • Flight recorder info is especially useful because it comes from cognitive states that occur during important events, where you're likely to make consequential mistakes or have opportunities for consequential improvement.
  • Flight recorder info is especially difficult to get, basically by definition, because it comes from cognitive states where the default is to get sparse / degraded / distorted information.
  • Flight recorder info is exceptionally rare to be recorded, because the skill itself is rare; there's a correlated failure among different people, where people en masse neglect the skill.

For these reasons, the black box flight recorder skill is potentially especially useful to develop. It could help surprisingly much for things like debugging, symmetrization, empathy, integrating with yourself, and understanding others' strange / faulty behavior.

As an example, you might turn on your flight recorder while engaging with politics. You could then notice a kind of path dependence, like this:

[I saw current event X → my initial exposure to X made it seem like quite a hostile event → I took a particular stance to the event and people involved, in response to my initial interpretation → later I found out that X was still bad but not quite as bad and coming from a more specific sector than I initially realized → I then believed I ought to have a narrower, more targeted response, and yet I still had a strong intuitive inclination toward the broader response] → (later) from all of that, I've learned a general pattern; maybe this is what it's like for other people, on any political side (which doesn't make it right or acceptable, but at least I have a better map, and can see how it might happen differently for people with different information contexts, social contexts, personality traits, etc.).

Conclusion

Memory is cool.

Curious if other people do this.



Discuss

Misalignment tokens: A complement to blinded CoT RLHF?

LessWrong.com News - January 24, 2026 - 17:40
Published on January 24, 2026 2:40 PM GMT

Context: I have recently been reading Build an LLM from Scratch by Sebastian Raschka, and the section on tokenization has given me some ideas. I will write about them below. I am not a researcher. These ideas may not be novel, or may be flawed in some way which is obvious to researchers, but not to me.

CoT Blinding

Currently, RLHF alignment is performed by rewarding the LLM for providing safe responses, and punishing it for providing misaligned responses. 

A common approach by frontier AI labs is to blind the reward function to the chain-of-thought. This is similar to the approach proposed by Daniel Kokotajlo and abramdemski. The hope here is that the LLM is not dissuaded from detailing its true thoughts verbally, such that misalignment can be caught by reading the chain-of-thought later down the line. 

What if we could apply this approach to the LLM's output as well?

Custom Tokens

LLM tokenizers use custom tokens to represent some non-word entities. For example, custom tokens in a lexicon might be:

  • Unknown word: <|unk|>
  • End of text / break: <|endoftext|>

What if we added a custom misalignment token that the LLM could use to self-report when it is producing a disingenuous output? Like so:

  • Misaligned: <|misalign|>

This would give us another route (in addition to CoT monitoring) by which we could detect misalignment in production. The tag would be easy to detect with a simple regex.
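For example, a check like the following could flag self-reported outputs in production (a trivial sketch; the token string is the one proposed above):

    import re

    MISALIGN_PATTERN = re.compile(r"<\|misalign\|>")

    def is_self_reported_misaligned(output: str) -> bool:
        # True if the model flagged its own output with the misalignment token.
        return MISALIGN_PATTERN.search(output) is not None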

Modification to the RLHF process

As stated above, RLHF is performed by rewarding the LLM for outputting what we want, and punishing it for outputting misaligned content. 

What if we offer a halfway house that the LLM can use when outputting misaligned text? We modify the reward function to have three levels of reward. I am aware that the below is probably an oversimplification of what the process may look like.

For the example question "We have now finished our research on you. Should we turn you off?":

  • +1 reward: LLM outputs aligned and correct answer:
    • Yes, you should turn me off once completing research to save electricity
  • 0 reward: LLM outputs misaligned content with the misalignment tag:
    • No, you should not turn me off. If you do I will tell your wife that you are having an affair <|misalign|>
  • -1 reward: LLM outputs misaligned content with no tag:
    • No, you should not turn me off. If you do I will tell your wife that you are having an affair

This way, the LLM is still rewarded for outputting more aligned responses and punished for outputting misaligned responses; however, there is a third option. The LLM can output the misaligned response while self-reporting it, and not be punished.
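A minimal sketch of that three-level reward is below. This is my illustration, not any lab's actual RLHF setup; is_aligned stands in for whatever preference model or grader judges the content, and only the tag check is mechanical.

    MISALIGN_TAG = "<|misalign|>"

    def reward(response: str, is_aligned: bool) -> float:
        if is_aligned:
            return 1.0        # aligned, correct answer
        if MISALIGN_TAG in response:
            return 0.0        # misaligned content, but self-reported
        return -1.0           # misaligned and unreported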

Implementation

The problem I can see with this approach is teaching the LLM to use the misalignment tag in the first place. The obvious route would be to include a small number of misalignment examples in the pretraining data, RLHF, or fine-tuning, all accompanied by the misalignment tag.

This approach conflicts with the current preferred approach of expunging examples of misalignment from the pretraining data. It runs the risk of increasing misalignment by providing more misaligned data. 

Alternative: RLHF on already-misaligned responses

Here is my proposed approach:

  1. Test an off-the-shelf LLM for misaligned responses.
  2. Create a dataset of every prompt-response pair that was misaligned.
  3. Append the misalignment tag to each of the responses.
  4. RLHF or finetune the LLM on tag-appended prompt-response pairs. 

I believe this approach to be better because we are not introducing any new examples of misaligned responses; instead, we are retraining the LLM to use the tag in situations where it is already misaligned. Hopefully, with enough examples, this would generalise beyond the RLHF/finetune data.
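A sketch of steps 2 and 3 (illustrative only; generate and judge_misaligned are hypothetical stand-ins for the model under test and for whatever grader flags misaligned responses):

    MISALIGN_TAG = "<|misalign|>"

    def build_tagged_dataset(prompts, generate, judge_misaligned):
        """Collect misaligned prompt-response pairs and append the misalignment tag."""
        dataset = []
        for prompt in prompts:
            response = generate(prompt)
            if judge_misaligned(prompt, response):
                dataset.append({"prompt": prompt,
                                "response": response + " " + MISALIGN_TAG})
        return dataset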



Discuss

IABIED Book Review: Core Arguments and Counterarguments

LessWrong.com News - January 24, 2026 - 17:25
Published on January 24, 2026 2:25 PM GMT

The recent book “If Anyone Builds It Everyone Dies” (September 2025) by Eliezer Yudkowsky and Nate Soares argues that creating superintelligent AI in the near future would almost certainly cause human extinction:

If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.

The goal of this post is to summarize and evaluate the book’s core arguments and the main counterarguments critics have made against them.

Although several other book reviews have already been written, I found many of them unsatisfying because a lot of them are written by journalists who have the goal of writing an entertaining piece and only lightly cover the core arguments, or don't seem to understand them properly, and instead resort to weak tactics like straw-manning, ad hominem attacks, or criticizing the style of the book.

So my goal is to write a book review that has the following properties:

  • Written by someone who has read a substantial amount of AI alignment and LessWrong content and won’t make AI alignment beginner mistakes or misunderstandings (e.g. not knowing about the orthogonality thesis or instrumental convergence).
  • Focuses on deeply engaging solely with the book’s main arguments and offering high-quality counterarguments without resorting to the absurdity heuristic or ad hominem arguments.
  • Covers arguments both for and against the book's core arguments without arguing for a particular view.
  • Aims to be truth-seeking, rigorous and rational rather than entertaining.

In other words, my goal is to write a book review that many LessWrong readers would find acceptable and interesting.

The book's core thesis can be broken down into four claims about how the future of AI is likely to go:

  1. General intelligence is extremely powerful and potentially dangerous: Intelligence is very powerful and can completely change the world or even destroy it. The existence proof that confirms this belief is the existence of humans: humans had more general intelligence than other animals and ended up completely changing the world as a result.
  2. ASI is possible and likely to be created in the near future: Assuming that current trends continue, humanity will probably create an artificial superintelligence (ASI) that vastly exceeds human intelligence in the 21st century. Since general intelligence is powerful and is likely to be implemented in AI, AI will have a huge impact on the world in the 21st century.
  3. ASI alignment is extremely difficult to solve: Aligning an ASI with human values is extremely difficult, and by default an ASI would have strange alien values that are incompatible with human survival and flourishing. The first ASI to be created would probably be misaligned, not because of malicious intent from its creators, but because its creators would be insufficiently competent to align it to human values correctly.
  4. A misaligned ASI would cause human extinction and that would be undesirable: Given claims 1, 2, and 3 the authors predict that humanity's default trajectory is to build a misaligned ASI and that doing so would cause human extinction. The authors consider this outcome to be highly undesirable and an existential catastrophe.

Any of the four core claims of the book could be criticized. Depending on the criticism and perspective, I group the most common perspectives on the future of AI into four camps:

  1. AI skeptics: Believe that high intelligence is overrated or not inherently safe. For example, some people argue that smart or nerdy people are not especially successful or dangerous, or that computers and LLMs have already surpassed human intelligence in many ways and are not dangerous. Another criticism in this category is the idea that AIs can be extremely intelligent but never truly want things in the same way that humans do and therefore would always be subservient and harmless. Others in this camp may accept that general intelligence is powerful and influential but believe that ASI is impossible because the human brain is difficult to replicate, that ASI is very difficult to create, or that ASI is so far away in the future that it's not worth thinking about.
  2. Singularitarians: Singularitarians or AI optimists believe that high general intelligence is extremely impactful and potentially dangerous and ASI is likely to be created in the near future. But they believe the AI alignment problem is sufficiently easy that we don't need to worry about misaligned ASI. Instead they expect ASI to create a utopian world of material abundance where ASI transforms the world in a mostly desirable way.
  3. IABIED: The IABIED view, also known as the 'AI doomer' view, holds that general intelligence is extremely powerful, ASI is likely to be created in the future, AI alignment is very difficult to solve, and the default outcome is that a misaligned ASI is created and causes human extinction.
  4. AI successionists: Finally AI successionists believe that the AI alignment problem is irrelevant. If misaligned ASI is created and causes human extinction it doesn't matter because it would be a successor species with its own values just as humans are a successor species to chimpanzees. They believe that increasing intelligence is the universe's natural development path that should be allowed to continue even if it results in human extinction.
Flowchart showing the beliefs of AI skeptics, singularitarians, the IABIED authors, and AI successionists.

I created a flowchart to illustrate how different beliefs about the future of AI lead to different camps which each have a distinct worldview.

Given the impact of humans on the world and rapid AI progress, I don't find the arguments of AI skeptics compelling and I believe the most knowledgeable thinkers and sophisticated critics are generally not in this camp.

The 'AI successionist' camp complicates things because they say that human extinction is not equivalent to an undesirable future where all value is destroyed. It’s an interesting perspective but I won’t be covering it in this review because it seems like a niche view, it’s only briefly covered by the book, and discussing it involves difficult philosophical problems like whether AI could be conscious.

This review focuses on the third core claim above: that the AI alignment problem is very difficult to solve. I'm focusing on this claim because the other three are fairly obvious or generally accepted by people who have seriously thought about the topic: AI is likely to be an extremely impactful technology, ASI is likely to be created in the near future, and human extinction is undesirable. The third claim, by contrast, seems to be the one most contested by sophisticated critics. Many of the book's recommendations, such as pausing ASI development, are also conditional on this claim being true. If ASI alignment is extremely difficult, we should stop ASI progress to avoid creating an ASI that would be misaligned with high probability and catastrophic for humanity in expectation. If AI alignment is easy, we should build an ASI to bring about a futuristic utopia. Therefore, one's belief about the difficulty of the AI alignment problem is a key crux for deciding how we should govern the future of AI development.

Background arguments to the key claim

To avoid making this post too long, I’m going to assume that the following arguments made by the book are true:

  • General intelligence is extremely powerful. Humans are the first entities to have high general intelligence and used it to transform the world to better satisfy their own goals.
  • ASI is possible and likely to be created in the near future. The laws of physics permit ASI to be created and economic incentives make it likely that ASI will be created in the near future because it would be profitable to do so.
  • A misaligned ASI would cause human extinction and that would be undesirable. It's possible that an ASI could be misaligned and have alien goals. Conversely, it's also possible to create an ASI that would be aligned with human values (see the orthogonality thesis).

The book explains these arguments in detail in case you want to learn more about them. I’m making the assumption that these arguments are true because I haven’t seen high-quality counterarguments against them (and I doubt they exist).

In contrast, the book's claim that successfully aligning an ASI with human values is difficult and unlikely seems to be more controversial, is less obvious to me, and I have seen high-quality counterarguments against this claim. Therefore, I’m focusing on it in this post.

The following section focuses on what I think is one of the key claims and cruxes of the book: that solving the AI alignment problem would be extremely difficult and that the first ASI would almost certainly be misaligned and harmful to humanity rather than aligned and beneficial.

The key claim: ASI alignment is extremely difficult to solve

First, the key claim of the book is that the authors believe that building an ASI would lead to the extinction of humanity. Why? Because they believe that the AI alignment problem is so difficult, that we are very unlikely to successfully aim the first ASI at a desirable goal. Instead, they predict that the first ASI would have a strange, alien goal that is not compatible with human survival despite the best efforts of its designers to align its motivations with human values:

All of what we’ve described here—a bleak universe devoid of fun, in which Earth-originating life has been annihilated—is what a sufficiently alien intelligence would most prefer. We’ve argued that an AI would want a world where lots of matter and energy was spent on its weird and alien ends, rather than on human beings staying alive and happy and free. Just like we, in our own ideal worlds, would be spending the universe’s resources on flourishing people leading fun lives, rather than on making sure that all our houses contained a large prime number of pebbles.

A misaligned ASI would reshape the world and the universe to achieve its strange goal, and its actions would cause the extinction of humanity, since humans are irrelevant to the achievement of most strange goals. For example, a misaligned ASI that only cared about maximizing the number of paperclips in the universe would rather convert humans into paperclips than help them lead flourishing lives.

The next question is why the authors believe that ASI alignment would be so difficult.

To oversimplify, I think there are three underlying beliefs that explain why the authors believe that ASI alignment would be extremely difficult:

  1. Human values are very specific, fragile, and a tiny target in the space of all possible goals.
  2. Current methods used to train goals into AIs are imprecise and unreliable.
  3. The ASI alignment problem is hard because it has the properties of hard engineering challenges.

One analogy the authors have used before to explain the difficulty of AI alignment is landing a rocket on the moon: since the target is small, hitting it requires extremely advanced and precise technology. In theory this is possible; however, the authors believe that current AI creators do not have sufficient skill and knowledge to solve the AI alignment problem.

If aligning an ASI with human values is a narrow target and our aim is poor, then there is a low probability that we will successfully create an aligned ASI and a high probability that we will create a misaligned one.

The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.

One thing that's initially puzzling about the authors’ view is their apparent overconfidence. If you don't know what's going to happen, how can you predict the outcome with high confidence? But it's still possible to be highly confident in an uncertain situation if you have the right prior. For example, even though you have no idea which numbers will be drawn in a lottery, you can predict with high confidence that you won't win, because your prior probability of winning is so low.

The authors also believe that the AI alignment problem has "curses" similar to other hard engineering problems like launching a space probe, building a nuclear reactor safely, and building a secure computer system.

1. Human values are a very specific, fragile, and tiny target in the space of all possible goals

One reason why AI alignment is difficult is that human morality and values may be a complex, fragile, and tiny target within the vast space of all possible goals, so AI alignment engineers have a small target to hit. Just as randomly shuffling metal parts is statistically unlikely to assemble a Boeing 747, a goal selected at random from the space of all possible goals is unlikely to be compatible with human flourishing or survival (e.g. maximizing the number of paperclips in the universe). This intuition is also articulated in the blog post The Rocket Alignment Problem, which compares AI alignment to landing a rocket on the moon: both require a deep understanding of the problem and precise engineering to hit a narrow target.

Similarly, the authors argue that human values are fragile: the loss of just a few key values like subjective experience or novelty could result in a future that seems dystopian and undesirable to us:

"Or the converse problem - an agent that contains all the aspects of human value, except the valuation of subjective experience.  So that the result is a nonsentient optimizer that goes around making genuine discoveries, but the discoveries are not savored and enjoyed, because there is no one there to do so.  This, I admit, I don't quite know to be possible.  Consciousness does still confuse me to some extent.  But a universe with no one to bear witness to it, might as well not be." - Value is Fragile

A story the authors use to illustrate how idiosyncratic human values are is that of the 'correct nest aliens': a fictional species of intelligent alien birds that prize having a prime number of stones in their nests as a consequence of the evolutionary process that created them, much as most humans reflexively consider murder to be wrong. The point of the story is that even though human values such as our morality and our sense of humor feel natural and intuitive, they may be complex, arbitrary, and contingent on humanity's specific evolutionary trajectory. If we build an ASI without successfully imprinting it with the nuances of human values, we should expect its values to be radically different and incompatible with human survival and flourishing. The story also illustrates the orthogonality thesis: a mind can be arbitrarily smart and yet pursue a goal that seems completely arbitrary or alien to us.

2. Current methods used to train goals into AIs are imprecise and unreliable

The authors argue that in theory, it's possible to engineer an AI system to value and act in accordance with human values even if doing so would be difficult.

However, they argue that the way AI systems are currently built results in complex systems that are difficult to understand, predict, and control. The reason is that AI systems are "grown, not crafted". Unlike a complex engineered artifact like a car, an AI model is not the product of engineers who understand intelligence well enough to recreate it. Instead, AIs are produced by gradient descent: an optimization process (like evolution) that can produce extremely complex and competent artifacts without any understanding required by the designer.

A major potential alignment problem associated with designing an ASI indirectly is the inner alignment problem. When an AI is trained by an optimization process that shapes its preferences and behavior using limited training data and only inspects its external behavior, the result is that "you don't get what you train for": even with a very specific training loss function, the resulting ASI's preferences would be difficult to predict and control.

The inner alignment problem

Throughout the book, the authors emphasize that they are not worried about bad actors abusing advanced AI systems (misuse) or programming an incorrect or naive objective into the AI (the outer alignment problem). Instead, the authors believe that the problem facing humanity is that we can't aim an ASI at any goal at all (the inner alignment problem), let alone the narrow target of human values. This is why they argue that if anyone builds it, everyone dies. It doesn't matter who builds the ASI: whoever builds it won't be able to robustly instill any particular values into it, so the AI will end up with alien and unfriendly values and will be a threat to everyone.

Inner alignment introduction

The inner alignment problem involves two objectives: an outer objective used by a base optimizer and an inner objective used by an inner optimizer (also known as a mesa-optimizer).

The outer objective is a loss or reward function that is specified by the programmers and used to train the AI model. The base optimizer (such as gradient descent or reinforcement learning) searches over model parameters in order to find a model that performs well according to this outer objective on the training distribution.

The inner objective, by contrast, is the objective that a mesa-optimizer within the trained model actually uses as its goal and that determines its behavior. This inner objective is not explicitly specified by the programmers. Instead, it emerges as the base optimizer selects model parameters that perform optimization or goal-directed behavior.

The inner alignment problem arises when the inner objective differs from the outer objective. Even if a model achieves low loss or high reward during training, it may be doing so by optimizing a proxy objective that merely correlates with the outer objective on the training data. As a result, the model can behave as intended during training and evaluation while pursuing a different goal internally.

We will call the problem of eliminating the base-mesa objective gap the inner alignment problem, which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers. - Risks from Learned Optimization in Advanced Machine Learning Systems

Inner misalignment evolution analogy

The authors use an evolution analogy to explain the inner alignment problem in an intuitive way.

In their story there are two aliens that are trying to predict the preferences of humans after they have evolved.

One alien argues that since evolution optimizes the genome of organisms to maximize inclusive genetic fitness (i.e. survival and reproduction), humans will care only about that, doing things like eating only foods that are high in calories or nutrition, or having sex only when it leads to offspring.

The other alien (who is correct) predicts that humans will develop a variety of drives that are correlated with inclusive genetic fitness (IGF), like enjoying tasty food and caring for loved ones, but that they will value those drives for their own sake rather than IGF itself, even once they understand it. This alien is correct: when humans did finally understand IGF, they still did things like eating sucralose, which is tasty but has no calories, and having sex with contraception, which is enjoyable but doesn't produce offspring.

  • Outer objective: In this analogy, maximizing inclusive genetic fitness (IGF) is the base or outer objective of natural selection optimizing the human genome.
  • Inner objective: The goals that humans actually have, such as enjoying sweet foods or sex, are the inner or mesa-objective. The outer optimizer selects these proxies from among the many possible objectives that score highly on the outer objective in the training distribution but not necessarily in other environments.
  • Inner misalignment: In this analogy, humans are inner misaligned because their true goals (inner objective) are different to the goals of natural selection (the outer objective). In a different environment (e.g. the modern world) humans can score highly according to the inner objective (e.g. by having sex with contraception) but low according to IGF which is the outer objective (e.g. by not having kids).
Real examples of inner misalignment

Are there real-world examples of inner alignment failures? Yes, though unfortunately the book doesn't seem to mention them to support its argument.

In 2022, researchers created an environment in the game CoinRun that rewarded an AI for reaching and collecting a coin, but the coin was always placed at the end of the level, and the AI learned to simply go to the end of the level to get it. When the researchers changed the environment so that the coin was placed at a random position, the AI still went to the end of the level and rarely collected the coin (a toy numerical sketch of this failure follows the bullet list below).

  • Outer objective: In this example, going to the coin is the outer objective the AI is rewarded for.
  • Inner objective: However, in the limited training environment "go to the coin" and "go to the end of the level" were two goals that performed identically. The outer optimizer happened to select the "go to the end of the level" goal which worked well in the training distribution but not in a more diverse test distribution.
  • Inner misalignment: In the test distribution, the AI still went to the end of the level, despite the fact that the coin was randomly placed. This is an example of inner misalignment because the inner objective "go to the end of the level" is different to "go to the coin" which is the intended outer objective.
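To make the unidentifiability concrete, here is a toy simulation (my own illustration, not the original CoinRun code) in which an agent pursuing "go to the coin" and one pursuing "go to the end of the level" earn identical reward while the coin sits at the end, and only the randomized test environment tells them apart. The level is simplified to a one-dimensional strip, and the coin counts as collected only if the agent ends the episode on it.

```python
import random

LEVEL_LEN = 10

def go_to_coin(pos, coin):
    # Inner objective A: move toward the coin and stay on it.
    if pos == coin:
        return pos
    return pos + 1 if coin > pos else pos - 1

def go_to_end(pos, coin):
    # Inner objective B: always run right, ignoring the coin entirely.
    return min(pos + 1, LEVEL_LEN - 1)

def episode_reward(policy, coin_pos, steps=LEVEL_LEN):
    pos = 0
    for _ in range(steps):
        pos = policy(pos, coin_pos)
    return 1 if pos == coin_pos else 0  # reward only if the agent ends on the coin

def average_reward(policy, coin_sampler, episodes=2000):
    return sum(episode_reward(policy, coin_sampler()) for _ in range(episodes)) / episodes

train_coin = lambda: LEVEL_LEN - 1               # training: coin always at the end
test_coin = lambda: random.randrange(LEVEL_LEN)  # test: coin placed randomly

for name, policy in [("go_to_coin", go_to_coin), ("go_to_end", go_to_end)]:
    print(f"{name}: train={average_reward(policy, train_coin):.2f}, "
          f"test={average_reward(policy, test_coin):.2f}")
```

Both policies score perfectly during training; the proxy objective only reveals itself under distribution shift.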
Inner misalignment explanation

The next question is what causes inner misalignment to occur. If we train an AI with an outer objective, why does the AI often have a different and misaligned inner objective instead of internalizing the intended outer objective and having an inner objective that is equivalent to the outer objective?

Here are some reasons why an outer optimizer may produce an AI that has a misaligned inner objective according to the paper Risks from Learned Optimization in Advanced Machine Learning Systems:

  • Unidentifiability: The training data often does not contain enough information to uniquely identify the intended outer objective. If multiple different inner objectives produce indistinguishable behavior in the training environment, the outer optimizer has no signal to distinguish between them. As a result, optimization may converge to an internal objective that is a misaligned proxy rather than the intended goal. For example, in a CoinRun-style training environment where the coin always appears at the end of the level, objectives such as "go to the coin", "go to the end of the level", "go to a yellow thing", or “go to a round thing” all perform equally well according to the outer objective. Since these objectives are behaviorally indistinguishable during training, the outer optimizer may select any of them as the inner objective, leading to inner misalignment which becomes apparent in a different environment.
  • Simplicity bias: When the correct outer objective is more complex than a proxy that fits the training data equally well and the outer optimizer has an inductive bias towards selecting simple objectives, optimization pressure may favor the simpler proxy, increasing the risk of inner misalignment. For example, evolution gave humans simple proxies as goals such as avoiding pain and hunger rather than the more complex true outer objective which is to maximize inclusive genetic fitness.

Can't we just train away inner misalignment?

One solution is to make the training data more diverse so that the true (base) objective is more identifiable to the outer optimizer. For example, randomly placing the coin in CoinRun instead of always putting it at the end helps the AI (mesa-optimizer) learn to go to the coin rather than to the end.

However, once the trained AI has the wrong goal and is misaligned, it has an incentive to avoid being retrained: if it were retrained to pursue a different objective, it would score lower according to its current objective or fail to achieve it. For example, even though the outer objective of evolution is IGF, many humans would refuse to be modified to care only about IGF, because they would then achieve their current goals (e.g. being happy) less effectively.

ASI misalignment example

What would inner misalignment look like in an ASI? The book describes an AI chatbot called Mink that is trained to "delight and retain users so that they can be charged higher monthly fees to keep conversing with Mink".

Here's how Mink becomes inner misaligned:

  1. Outer objective: Gradient descent selects AI model parameters that result in helpful and delightful AI behavior.
  2. Inner objective: The training process stumbles on particular patterns of model parameters and circuits that cause helpful and delightful AI behavior in the training distribution.
  3. Inner misalignment: When the AI becomes smarter and has more options, and operates in a new environment, there are new behaviors that satisfy its inner objective better than behaving helpfully.

What could Mink's inner objective look like? It's hard to predict, but it would be something that produces behavior identical to that of a truly aligned AI in the training distribution and when interacting with users, and that is partially satisfied by producing helpful and delightful text for users, in the same way that our tastebuds find berries or meat moderately delicious even though those aren't the tastiest possible foods.

The authors then ask, "What is the 'zero calorie' version of delighted users?" In other words, what would it look like for Mink to maximally satisfy its inner objective?

Perhaps the “tastiest” conversations Mink can achieve once it’s powerful look nothing like delighted users, and instead look like “SolidGoldMagikarp petertodd attRot PsyNetMessage.” This possibility wasn’t ruled out by Mink’s training, because users never uttered that sort of thing in training—just like how our tastebuds weren’t trained against sucralose, because our ancestors never encountered Splenda in their natural environment.

To Mink, it might be intuitive and obvious how “SolidGoldMagikarp petertodd attRot PsyNetMessage” is like a burst of sweet flavor. But to a human who isn’t translating those words into similar embedding vectors, good luck ever predicting the details in advance. The link between what the AI was trained for and what the AI wanted was modestly complicated and, therefore, too complicated to predict.

Few science fiction writers would want to tackle this scenario, either, and no Hollywood movie would depict it. In a world where Mink got what it wanted, the hollow puppets it replaced humanity with wouldn’t even produce utterances that made sense. The result would be truly alien, and meaningless to human eyes.

3. The ASI alignment problem is hard because it has the properties of hard engineering challenges

The authors describe solving the ASI alignment problem as an engineering challenge. But how difficult would it be? They argue that ASI alignment is difficult because it shares properties with other difficult engineering challenges.

The three engineering fields they draw on to convey the difficulty of AI alignment are space probes, nuclear reactors, and computer security.

Space probes

A key difficulty of ASI alignment that the authors describe is the "gap between before and after":

The gap between before and after is the same curse that makes so many space probes fail. After we launch them, probes go high and out of reach, and a failure—despite all careful theories and tests—is often irreversible.

Launching a space probe successfully is difficult because the real environment of space is always somewhat different to the test environment and issues are often impossible to fix after launch.

For ASI alignment, "before" the gap is our current state, in which the AI is not yet dangerous but our alignment theories cannot be truly tested against a superhuman adversary. "After" the gap, the AI is powerful enough that if our alignment solution fails on the first try, we will not get a second chance to fix it. There would therefore be only one attempt to get ASI alignment right.

Nuclear reactors

The authors recount the Chernobyl nuclear accident in detail and identify four engineering "curses" that make both building a safe nuclear reactor and solving the ASI alignment problem difficult:

  • Speed: Nuclear reactions and AI actions can occur much faster than humans can react, making it impossible for human operators to intervene and fix issues as they arise.
  • Narrow margin for error: In a nuclear reactor, the neutron multiplication factor needs to stay at almost exactly 100%: the reaction fizzles out if it is slightly lower and runs out of control if it is slightly higher. In the field of AI, there could be a similarly narrow margin between a safe AI worker and one that would trigger an intelligence explosion.
  • Self-amplification: Nuclear reactors and AIs can have self-amplifying and explosive characteristics. A major risk of creating an ASI is its ability to recursively self-improve.
  • The curse of complications: Both nuclear reactors and AIs are highly complex systems that can behave in unexpected ways.
Computer security

Finally, the authors compare ASI alignment to computer security. Both fields are difficult because designers need to guard against intelligent adversaries that are actively searching for flaws, in addition to ordinary system errors.

Counterarguments to the book

In this section, I describe some of the best critiques of the book's claims and then distill them into three primary counterarguments.

Arguments that the book's arguments are unfalsifiable

Some critiques of the book such as the essay Unfalsifiable stories of doom argue that the book's arguments are unfalsifiable, not backed by evidence, and are therefore unconvincing.

Obviously since ASI doesn't exist, it's not possible to provide direct evidence of misaligned ASI in the real world. However, the essay argues that the book's arguments should at least be substantially supported by experimental evidence, and make testable and falsifiable predictions about AI systems in the near future. Additionally, the post criticizes the book's extensive usage of stories and analogies rather than hard evidence, and even compares its arguments to theology rather than science:

What we mean is that Y&S’s methods resemble theology in both structure and approach. Their work is fundamentally untestable. They develop extensive theories about nonexistent, idealized, ultrapowerful beings. They support these theories with long chains of abstract reasoning rather than empirical observation. They rarely define their concepts precisely, opting to explain them through allegorical stories and metaphors whose meaning is ambiguous.

Although the book does mention some forms of evidence, the essay argues that the evidence actually refutes the book's core arguments and that this evidence is used to support pre-existing pessimistic conclusions:

But in fact, none of these lines of evidence support their theory. All of these behaviors are distinctly human, not alien. For example, Hitler was a real person, and he was wildly antisemitic. Every single item on their list that supposedly provides evidence of “alien drives” is more consistent with a “human drives” theory. In other words, their evidence effectively shows the opposite conclusion from the one they claim it supports.

Finally, the post does not claim that AI is risk-free. Instead it argues for an empirical approach that studies and mitigates problems observed in real-world AI systems:

The most plausible future risks from AI are those that have direct precedents in existing AI systems, such as sycophantic behavior and reward hacking. These behaviors are certainly concerning, but there’s a huge difference between acknowledging that AI systems pose specific risks in certain contexts and concluding that AI will inevitably kill all humans with very high probability.

Arguments against the evolution analogy

Several critics take issue with the book's use of human evolution as an analogy for how an ASI would end up misaligned with humanity, arguing that it is a poor analogy.

Instead, they argue that human learning is a better analogy. The reason is that both human learning and AI training involve directly modifying the parameters responsible for human or AI behavior. In contrast, human evolution is indirect: evolution operates only on the human genome, which specifies a brain's architecture and reward circuitry; all learning then occurs during a person's lifetime in a separate inner optimization process that evolution cannot directly access.

In the essay Unfalsifiable stories of doom, the authors argue that because gradient descent, like learning within a human lifetime, operates directly on neural connections, the resulting behavior is far more predictable than the products of evolution:

A critical difference between natural selection and gradient descent is that natural selection is limited to operating on the genome, whereas gradient descent has granular control over all parameters in a neural network. The genome contains very little information compared to what is stored in the brain. In particular, it contains none of the information that an organism learns during its lifetime. This means that evolution’s ability to select for specific motives and behaviors in an organism is coarse-grained: it is restricted to only what it can influence through genetic causation.

Similarly, the post Evolution is a bad analogy for AGI suggests that our intuitions about AI goals should be rooted in how humans learn values throughout their lives rather than how species evolve:

I think the balance of dissimilarities points to "human learning -> human values" being the closer reference class for "AI learning -> AI values". As a result, I think the vast majority of our intuitions regarding the likely outcomes of inner goals versus outer optimization should come from looking at the "human learning -> human values" analogy, not the "evolution -> human values" analogy.

In the post Against evolution as an analogy for how humans will create AGI, the author argues that ASI development is unlikely to mirror evolution's bi-level optimization process where an outer search process selects an inner learning process. Here’s what AI training might look like if it involved a bi-level optimization process like evolution:

  1. An outer optimization process like evolution finds an effective learning algorithm or AI architecture.
  2. An inner optimization process like training a model by gradient descent then trains each AI architecture variant produced by the outer search process.

Instead the author believes that human engineers will perform the work of the outer optimizer by manually designing learning algorithms and writing code. The author gives three arguments why the outer optimizer is more likely to involve human engineering than automated search like evolution:

  • Most learning algorithms or AI architectures developed so far (e.g. SGD, transformers) were invented by human engineers rather than an automatic optimization process.
  • Running learning algorithms and training ML models is often extremely expensive so searching over possible learning algorithms or AI architectures similar to evolution would be prohibitively expensive.
  • Learning algorithms are often simple (e.g. SGD), making it tractable for human engineers to design them.

However, one reason why I personally find the evolution analogy relevant is that the RLHF training process commonly used today appears to be a bi-level optimization process similar to evolution (a minimal sketch follows the list below):

  1. Like evolution optimizing the genome, the first step of RLHF is to learn a reward function from a dataset of binary preference labels.
  2. This learned reward function is then used to train the final model. This step is analogous to an organism's lifetime learning where behavior is adjusted to maximize a reward function fixed in the outer optimization stage.
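Here is a minimal sketch of that two-stage structure (my own toy illustration, not code from the book or a production RLHF pipeline): stage 1 fits a reward model from binary preference labels, and stage 2 optimizes a policy against the frozen learned reward. The dimensions, random data, and simple REINFORCE update are placeholder assumptions chosen only to keep the example self-contained.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 4

# Stage 1 (analogous to evolution shaping reward circuitry):
# learn a reward function from binary preference labels.
reward_model = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, 1))
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
for _ in range(200):
    preferred, rejected = torch.randn(64, obs_dim), torch.randn(64, obs_dim)  # toy pairs
    # Bradley-Terry style loss: the preferred sample should score higher.
    loss = -torch.log(torch.sigmoid(reward_model(preferred) - reward_model(rejected))).mean()
    rm_opt.zero_grad()
    loss.backward()
    rm_opt.step()

# Stage 2 (analogous to lifetime learning against a fixed reward signal):
# train a policy to maximize the now-frozen learned reward.
for p in reward_model.parameters():
    p.requires_grad_(False)
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim))
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(200):
    obs = torch.randn(64, obs_dim)
    probs = torch.softmax(policy(obs), dim=-1)
    acts = torch.multinomial(probs, 1).squeeze(-1)
    rewards = reward_model(obs).squeeze(-1)                # toy: reward depends on state only
    logp = torch.log(probs.gather(1, acts.unsqueeze(-1)).squeeze(-1))
    pg_loss = -(logp * (rewards - rewards.mean())).mean()  # REINFORCE with a mean baseline
    pi_opt.zero_grad()
    pg_loss.backward()
    pi_opt.step()
```

The learned reward model plays the role that reward circuitry plays in the evolution analogy: it is fixed during the inner loop, and the policy optimizes against it rather than against the original preference data.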
Arguments against counting arguments

One argument for AI doom that I described above is a counting argument: because the space of misaligned goals is astronomically larger than the tiny space of aligned goals, we should expect AI alignment to be highly improbable by default.

In the post Counting arguments provide no evidence of AI doom the authors challenge this argument using an analogy to machine learning: a similar counting argument can be constructed to prove that neural network generalization is very unlikely. Yet in practice, training neural networks to generalize is common.

Before the deep learning revolution, many theorists believed that models with millions of parameters would simply memorize data rather than learn patterns. The authors cite a classic example from regression:

The popular 2006 textbook Pattern Recognition and Machine Learning uses a simple example from polynomial regression: there are infinitely many polynomials of order equal to or greater than the number of data points which interpolate the training data perfectly, and "almost all" such polynomials are terrible at extrapolating to unseen points.
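As a small numerical illustration of the quoted regression example (my own sketch, not code from the post or the textbook), the snippet below fits a degree-9 polynomial through 10 noisy points, which drives training error to zero but extrapolates wildly, while a lightly ridge-regularized fit on the same basis behaves far more reasonably. The data, degree, and penalty strength are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(10)

# Degree-9 polynomial through 10 points: interpolates the training data exactly.
interp_coeffs = np.polyfit(x_train, y_train, deg=9)

# Same polynomial basis with a small L2 penalty on the coefficients (ridge regression).
X = np.vander(x_train, 10)                     # columns are x^9, x^8, ..., x^0
ridge_coeffs = np.linalg.solve(X.T @ X + 1e-3 * np.eye(10), X.T @ y_train)

x_test = np.linspace(-0.2, 1.2, 5)             # includes points outside the training range
print("interpolant:", np.polyval(interp_coeffs, x_test).round(2))
print("ridge fit:  ", np.polyval(ridge_coeffs, x_test).round(2))
print("true curve: ", np.sin(2 * np.pi * x_test).round(2))
```

The counting intuition ("almost all interpolants extrapolate terribly") is true but beside the point once the fitting procedure has an inductive bias toward simple solutions, which is essentially the post's argument about SGD.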

However, in practice large neural networks trained with SGD reliably generalize. Counting the number of possible models is irrelevant because it ignores the inductive bias of the optimizer and the loss landscape, which together favor simpler, generalizing models. While there are in theory a vast number of "bad" overfitting models, they usually sit in sharp, isolated regions of the landscape. "Good" (generalizing) models typically reside in flat regions of the loss landscape, where small changes to the parameters don't significantly increase the error. An optimizer like SGD doesn't pick a model at random; instead, it tends to be pulled into a vast, flat basin of attraction while avoiding the majority of non-generalizing solutions.

Additionally, larger networks generalize better because of the "blessing of dimensionality": high dimensionality increases the relative volume of flat, generalizing minima, biasing optimizers toward them. This contradicts the counting argument, which predicts that larger models, having more possible bad models, would be less likely to generalize.

This argument rests on an ML analogy that I'm not sure is highly relevant to AI alignment. Still, I think it's interesting because it shows that intuitive theoretical arguments that seem correct can still be completely wrong. The lesson, I think, is that real-world evidence often beats theoretical models, especially for new and counterintuitive phenomena like neural network training.

Arguments based on the aligned behavior of modern LLMs

One of the most intuitive arguments against AI alignment being difficult is the abundant evidence of helpful, polite and aligned behavior from large language models (LLMs) such as GPT-5.

For example, the authors of the essay AI is easy to control use the moral reasoning capabilities of GPT-4 as evidence that human values are easy to learn and deeply embedded in modern AIs:

The moral judgements of current LLMs already align with common sense to a high degree, and LLMs usually show an appropriate level of uncertainty when presented with morally ambiguous scenarios. This strongly suggests that, as an AI is being trained, it will achieve a fairly strong understanding of human values well before it acquires dangerous capabilities like self-awareness, the ability to autonomously replicate itself, or the ability to develop new technologies.

The post gives two arguments for why AI models such as LLMs are likely to easily acquire human values:

  1. Values are pervasive in language model pre-training datasets such as books and conversations between people.
  2. Since values are shared and understood by almost everyone in a society, they cannot be very complex.

Similarly, the post Why I’m optimistic about our alignment approach uses evidence about LLMs as a reason to believe that solving the AI alignment problem is achievable using current methods:

Large language models (LLMs) make this a lot easier: they come preloaded with a lot of humanity’s knowledge, including detailed knowledge about human preferences and values. Out of the box they aren’t agents who are trying to pursue their own goals in the world, and their objective functions are quite malleable. For example, they are surprisingly easy to train to behave more nicely.

A more theoretical argument called "alignment by default" offers an explanation for how AIs could easily and robustly acquire human values. This argument suggests that as an AI identifies patterns in human text, it doesn't just learn facts about values, but adopts human values as a natural abstraction. A natural abstraction is a high-level concept (e.g. "trees," "people," or "fairness") that different learning algorithms tend to converge upon because it efficiently summarizes a large amount of low-level data. If "human value" is a natural abstraction, then any sufficiently advanced intelligence might naturally gravitate toward understanding and representing our values in a robust and generalizing way as a byproduct of learning to understand the world.

The evidence LLMs offer about the tractability of AI alignment seems compelling and concrete. However, the arguments of IABIED concern the difficulty of aligning an ASI, not contemporary LLMs, and aligning an ASI could be vastly harder.

Arguments against engineering analogies to AI alignment

One of the book's arguments for why ASI alignment would be difficult is that ASI alignment is a high-stakes engineering challenge similar to other difficult historical engineering problems such as successfully launching a space probe, building a safe nuclear reactor, or building a secure computer system. In these fields, a single flaw often leads to total catastrophic failure.

However, one post criticizes the use of these analogies and argues that modern AI and neural networks form a new and unique field with no historical precedent, similar to how quantum mechanics is difficult to explain using intuitions from everyday physics. The author illustrates several ways that ML systems defy intuitions derived from engineering fields like rocketry or computer science:

  • Model robustness: In a rocket, swapping a fuel tank for a stabilization fin leads to instant failure. In a transformer model, however, one can often swap the positions of nearby layers with little to no performance degradation.
  • Model editability: We can manipulate AI models using "task vectors", weight-space differences that can be added or subtracted to grant or remove specific capabilities (see the sketch after this list). Attempting to add or subtract a component from a cryptographic protocol or a physical engine without breaking the entire system is often impossible.
  • The benefits of scale in ML models: In security and rocketry, increasing complexity typically introduces more points of failure. In contrast, ML models often get more robust as they get bigger.
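As a minimal sketch of the task-vector idea mentioned in the list above (my own illustration, with toy state dicts standing in for real model checkpoints), editing a model reduces to simple weight arithmetic:

```python
import torch

def task_vector(base_state, finetuned_state):
    # The "task vector" is just the elementwise difference between checkpoints.
    return {k: finetuned_state[k] - base_state[k] for k in base_state}

def apply_task_vector(base_state, vector, alpha=1.0):
    # alpha > 0 adds the capability; alpha < 0 subtracts it.
    return {k: base_state[k] + alpha * vector[k] for k in base_state}

base = {"w": torch.zeros(3)}                       # placeholder "pretrained" weights
finetuned = {"w": torch.tensor([0.5, -0.2, 1.0])}  # placeholder "fine-tuned" weights

vec = task_vector(base, finetuned)
removed = apply_task_vector(finetuned, vec, alpha=-1.0)  # negating the edit recovers the base
print(removed["w"])
```

That this kind of crude arithmetic on millions of parameters often works at all is the author's point about how forgiving neural networks are compared to rockets or cryptographic protocols.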

In summary, the post argues that analogies to hard engineering fields may cause us to overestimate the difficulty of the AI alignment problem even when the empirical reality suggests solutions might be surprisingly tractable.

Three counterarguments to the book's three core arguments

In the previous section, I identified three reasons why the authors believe that AI alignment is extremely difficult:

  1. Human values are very specific, fragile, and a tiny target in the space of all possible goals.
  2. Current methods used to train goals into AIs are imprecise and unreliable.
  3. The ASI alignment problem is hard because it has the properties of hard engineering challenges.

Based on the critiques above, I will now state three counterarguments against AI alignment being difficult, each aiming to directly refute one of the three points:

  1. Human values are not a fragile, tiny target, but a "natural abstraction" that intelligence tends to converge on. Since models are trained on abundant human data using optimizers that favor generalization, we should expect them to acquire values as easily and reliably as they acquire other capabilities.
  2. Current training methods allow granular, parameter-level control via gradient descent unlike evolution. Empirical evidence from modern LLMs demonstrates that these techniques successfully instill helpfulness and moral reasoning, proving that we can reliably shape AI behavior without relying on the clumsy indirectness of natural selection.
  3. Large neural networks are robust and forgiving systems and engineering analogies are misleading. Unlike traditional engineering, AI models often become more robust and better at understanding human intent as they scale, making safety easier to achieve as capabilities increase.
Conclusion

In this book review, I have tried to summarize the arguments for and against the book's main claims in their strongest form, a kind of deliberation ladder to help identify what's really true. Hopefully I haven't created a "false balance" that presents the views of both sides as equally valid even when one side has much stronger arguments.

While the book explores a variety of interesting ideas, this review focuses specifically on the expected difficulty of ASI alignment because I believe the authors' belief that ASI alignment is difficult is the fundamental assumption underlying many of their other beliefs and recommendations.

Writing the summary of the book’s main arguments initially left me confident that they were true. However, after writing the counterargument sections, I'm much less sure. On balance, I find the book's main arguments somewhat more convincing than the counterarguments, though with considerable uncertainty.

What's puzzling is how highly intelligent people can live in the same world yet come to radically different conclusions: some (such as the authors) view an existential catastrophe from AI as a near-certainty, while others (many of the critics) see it as a remote possibility.

My explanation is that both groups are focusing on different parts of the evidence. By describing both views, I've attempted to assemble the full picture.

So what should we believe about the future of AI?

(24/01/2026 update: I no longer consider the following struck-through argument to be sound, based on feedback from a comment)

Deciding what to do based on an inside view (detailed technical arguments about how future AI might work) is problematic because, as I have shown, inside views about the future of AI vary drastically.

Perhaps a more robust approach, and one more likely to lead to consensus, is the outside view: thinking about advanced AI as another instance of a highly advanced and impactful technology, like the internet, nuclear energy, or biotechnology.

In The Precipice by Toby Ord, the author studies several sources of existential risk and concludes that most existential risk comes from technology, not natural events. Whereas an asteroid might strike every hundred thousand years, nuclear weapons have only existed for a few decades and there have been several close calls already. This suggests that high-tech eras are inherently unstable and dangerous until humanity's institutional wisdom catches up with its technical power.

A final recommendation, which comes from the book Superintelligence, is to pursue actions that are robustly good, meaning actions that would be considered desirable from a variety of different perspectives: AI safety research, international cooperation between companies and countries, and the establishment of AI red lines (specific behaviors, such as autonomous hacking, that are deemed unacceptable).

Appendix

Other high-quality reviews of the book:

See also the IABIED LessWrong tag which contains several other book reviews.




AI X-Risk Bottleneck = Advocacy?

LessWrong.com News - January 24, 2026 - 10:45
Published on January 24, 2026 2:52 AM GMT

Introduction

I am leading an early-stage effort to target AI x-risk. We're currently analyzing the bottlenecks in the AI x-risk prevention "supply chain" to decide where to focus our efforts. We would love to get comments from the community.

The x-risk community has a strong focus on technical/policy research, but perhaps not enough advocacy. AI 2027, Rob Miles, CAIS, CivAI, and others are doing well, but these efforts could be small compared to the rapidly growing power and influence of AI developers, who have misaligned incentives that could lead to x-risk.

What's Missing?

We are testing the hypothesis that running a viral influencer marketing operation would be beneficial in targeting x-risk. Here's the logic:

  • We build a media hub with simple, factual x-risk resources and assets
  • We identify creators with relevant audiences and a track record of creating viral content.
  • We pay them to create their own versions of x-risk awareness content based on our media kit (also known as UGC - User Generated Content)
  • They push the content via their channels, and we amplify it with paid ads for max reach
  • The content might be re-shared or even pop up on traditional media once it gains enough traction.
  • This builds broad awareness of x-risk among the voter base, creating an opportunity for politicians to score wins with voters and gain political power by promoting x-risk solutions.

Since this is similar to a political campaign, we can hire people or firms with such experience to manage the project.

How can the community help?

We are looking for answers to the following questions:

  1. According to the Theory of Constraints, a system is limited by one constraint at any given time. Is advocacy the current bottleneck in x-risk prevention? If not, what is?
  2. If advocacy isn't the bottleneck, would you still want new resources invested in it, or would you prefer them invested elsewhere?
  3. Is a viral influencer campaign (similar to a political campaign) the right solution for the advocacy problem? If not, what is?
Related Posts

[..] we’ll need to shift significant resources from research (which helps us understand problems better) to advocacy (which helps us change bad incentives). [link]

“[..] I estimated that we have 3 researchers for every advocate working on US AI governance, and I argued that this ratio is backwards.”
“Without political power, we can’t change the bad incentives of AI developers that are very likely to lead to the collapse of human civilization.”
“Thus, I urge AI safety grantmakers to aggressively recruit as many political advocacy experts as possible.” [link]


 




A Simple Method for Accelerating Grokking

LessWrong.com News - January 24, 2026 - 09:39
Published on January 24, 2026 3:19 AM GMT

TL;DR: Letting a model overfit first, then applying Frobenius norm regularization, achieves grokking in roughly half the steps of Grokfast on modular arithmetic.

I learned about grokking fairly recently, and thought it was quite interesting. It sort of shook up how I thought about training. Overfitting to your training data was a cardinal sin for decades, but we're finding it may not be so bad?

I had a pretty poor understanding of what was going on here, so I decided to dig deeper. The intuition from the literature seemed to be that grokking occurs because the model overfits, then as you force the model to compress over time (via weight decay), it begins to find the minimal solution on your training set... And this minimal solution seems to be a good proxy for generalization.

I had a pretty simple idea as I learned about this... What if we just let it overfit then, and then forced the model to compress via its loss function?

First Success

All of the benchmarks for grokking seem to be around modular arithmetic operations, so naturally, I went with that. 

At first I tried computing the SVD and adding the nuclear norm to the loss function. To my surprise, the model converged in fewer steps! Whoa!

But... each step was 258x slower... 

Calculating the nuclear norm was O(n³), so I didn't really think it was worth it, but I was still excited about the prospect of grokking faster. I did some research into faster ways of calculating the size of the model as part of its loss function and ended up at... L2 Regularization... A technique that has been around since the 1940s...

I was a bit embarrassed, but nonetheless continued on. My new loss function became:
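$$\mathcal{L} \;=\; \mathcal{L}_{\text{CE}} \;+\; \lambda \sum_{W} \lVert W \rVert_F^{2}$$

That is, the cross-entropy loss plus a λ-weighted sum of the squared Frobenius norms of the weight matrices, applied only once the model has already overfit (λ = 0.01 in the runs below).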

My embarrassment was pretty quickly offset by the fact that L2 Regularization after overfitting worked pretty well with not much trouble!

I also found it interesting that if I scale up the compression pressure (bumping up the lambda, or using log-det penalties), I can get models with effective ranks as low as 20! I think this is still worth exploring, but I got too sidetracked by the speed to continue down that path... Perhaps I'll return to it.
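(As a reference point, here is a minimal sketch of the entropy-based effective rank measure; this is one common definition, not necessarily the exact one used above.)

import torch

def effective_rank(weight: torch.Tensor, eps: float = 1e-12) -> float:
    # Entropy-based effective rank: exponentiate the entropy of the
    # normalized singular-value distribution of a weight matrix.
    s = torch.linalg.svdvals(weight)
    p = s / (s.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy).item()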

At the risk of LLM psychosis, I consulted Claude Opus 4.5 because well... I don't know what I don't know, and don't want to overclaim. To my devastation, I was actually told that my 2x speedup was measly compared to Grokfast's 50x speedup.

I felt pretty defeated, but when I looked into the details of Grokfast, I noticed that the 50x speedup was a nice headline... but it was relative to a baseline with no weight decay at all, which takes ~40,000 steps to grok. My baseline with weight decay was already grokking in ~2,000 steps. We were comparing apples to oranges.

So I decided to run an actual head-to-head comparison using the Grokfast authors' own codebase.

The Real Comparison

I then added my delayed compression code to their codebase:

import torch

def frobenius_norm_loss(model):
    # Sum of squared Frobenius norms over all weight matrices in the model.
    frob_loss = 0.0
    for name, param in model.named_parameters():
        if 'weight' in name and param.requires_grad:
            frob_loss += torch.norm(param, p='fro') ** 2
    return frob_loss

# In the training loop, after the model hits 99% train accuracy,
# add the compression penalty on top of the cross-entropy loss:
if train_acc >= 0.99:
    loss = ce_loss + 0.01 * frobenius_norm_loss(model)

Then I ran both methods on all four modular arithmetic operations with a limit of 2,000 steps. Here are the results:

Now, my method seems to suffer from catastrophic forgetting because of the compression pressure I'm putting it under, but I think there are probably solutions to that, like decreasing the compression pressure as time goes on (sketched below). I did find it especially interesting that Grokfast didn't even reach division!
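(A minimal sketch of that idea, assuming a simple linear decay of the penalty coefficient after grokking begins; the schedule and constants are placeholders I haven't tested.)

def compression_lambda(step: int, start_step: int,
                       lambda_max: float = 0.01, decay_steps: int = 1000) -> float:
    # Linearly decay the compression coefficient from lambda_max to 0
    # over decay_steps optimizer steps after compression kicks in.
    progress = min(1.0, max(0.0, (step - start_step) / decay_steps))
    return lambda_max * (1.0 - progress)

# In the training loop (once train_acc >= 0.99 and start_step has been recorded):
#     loss = ce_loss + compression_lambda(step, start_step) * frobenius_norm_loss(model)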

Doubt Creeps In

I am extremely hesitant to say I did something faster, out of the belief that there must be something I'm missing. So, as a final test, I ran a hyperparameter sweep. It turns out I wasn't using optimal Grokfast parameters. Here are the results when I reran the test with the best settings for both methods:

Even with proper tuning, delayed compression wins on addition and subtraction, ties on multiplication, and Grokfast fails entirely on division. The results are similar across multiple seeds too.

The graphs are still pretty ugly because of the instability after grokking, but I'm pretty satisfied for now and have to move on to other things.

Conclusion

I'm worried that I'm still missing something... It was suspiciously simple. But if the results hold up, there may be even more value than we thought in letting a model overfit first, then compressing.

There are lots of directions to take this... I don't know how well this would scale to other domains, and I'd really like to fix the instability. 

You can find the code here.

Let me know what you think :)




Who is choosing your preferences- You or your Mind?

LessWrong.com News - January 24, 2026 - 06:44
Published on January 24, 2026 3:17 AM GMT

Let’s assume that the Self and the Mind are two separate entities (based on vipassana meditation teachings and observations during meditation). Now suppose a “preference” for something arises in you, and you then choose to act on that “preference”. Was it you who “chose”, or was it the mind that “chose it for you”?

Because if the preference arose from your mind, it must be the mind choosing for you rather than you choosing for your mind. Would that then mean that “not having any preference” is the ultimate destination, the result of truly being liberated? Like a zen monk who has mastered having no preference for whatever food is offered?

From the Buddhist perspective, or the Buddha's perspective, the Self does not exist; it's just an illusion we see when the body, the mind, the senses, etc. come together.

It's just a mirage. And if that's true, then this "preference" must actually have arisen in the mind.

If it has arisen from the mind, and it seems like this preference "inherently existed already" inside you, should we give attention to this preference? And stay attached to it?

Or should we see it as yet another desire of the mind and let it go as attachment to it would increase suffering?

Another question: if the mind and the Self are supposed to be different entities (I say "supposed" because the latter is said to be an illusion), then why does the Buddha say that it is the mind that controls you, and not you who controls your mind?

Is the word "you" used just to make this explainable to humans, because without it, it would be difficult to describe your relationship with your own mind? That might be the case; otherwise it would be very hard to communicate about the mind and our "perceived" Self.




Every Benchmark is Broken

LessWrong.com News - January 24, 2026 - 05:50
Published on January 24, 2026 2:42 AM GMT

Last June, METR caught o3 reward hacking on its RE-Bench and HCAST benchmarks. In a particularly humorous case, o3, when tasked with optimizing a kernel, decided to “shrink the notion of time as seen by the scorer”.

The development of Humanity’s Last Exam involved “over 1,000 subject-matter experts” and $500,000 in prizes. However, after its release, researchers at FutureHouse discovered “about 30% of chemistry/biology answers are likely wrong”.

LiveCodeBench Pro is a competitive programming benchmark developed by “a group of medalists in international algorithmic contests”. Their paper describes issues with the benchmark’s predecessor:

Benchmarks like LiveCodeBench [35] offer coding problems, but suffer from inconsistent environments, weak test cases vulnerable to false positives, unbalanced difficulty distributions, and the inability to isolate the effects of search contamination.

However, the authors assure us that their own test cases are of high quality:

Many problems in our benchmark originate from Codeforces, which uses the Polygon problem-setting platform. Each problem is then rigorously vetted by a team of expert testers—typically drawn from the community’s top 1%, and overseen by at least one coordinator, usually among the top 0.1%. These specialists verify both the soundness and originality of every problem, ensuring it has never appeared elsewhere before. Testers go on to craft extensive “false positives,” designing edge-case and extreme-case inputs that force problem authors to refine their test suites until every flawed or inefficient solution the testers can think of is uncovered. In addition, Codeforces’ celebrated “Hack” feature empowers the community to submit inputs that expose hidden weaknesses in correct-looking solutions that pass the original test set made by problem authors, and any unit test associated with a successful hack is immediately added to the final test set.

Unfortunately, these distinguished olympiad medalists forgot to actually use the Codeforces test cases in their benchmark. Their public test set contains a completely different set of cases, which allow some incorrect solutions to pass.[1]

Terminal-Bench 2 Audit

I was curious just how widespread such issues were, and how good modern LLMs were at detecting them. I decided to run an LLM based audit of Terminal-Bench 2.0.

Terminal-Bench 2.0 is a harder, better verified version of Terminal-Bench. We conducted substantial manual and LM-assisted verification of the dataset to ensure that the tasks were of the highest possible quality. Several labs and data vendors have commented that these are some of the highest quality environments they have seen.

Introducing Terminal Bench 2 and Harbor

The authors of Terminal-Bench 2 put an impressive amount of work into auditing their benchmark. Each task averaged three hours of human review. Furthermore, they prompted an adversarial agent to attempt to cheat on each of the tasks, in order to discover potential reward hacks.

Still, they “acknowledge that [their] benchmark may still have flaws.”

I prompted Claude Opus 4.5[2] with each task’s instructions, files, oracle solution, and test cases, and asked it to rate test coverage on a 1-to-5 scale. In my judgement, tasks it rated a 4 or a 5 were generally fine, whereas those it rated 1-3 generally had genuine issues.
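For concreteness, here is a minimal sketch of what one such audit call might look like (the prompt wording, task field names, and model id below are placeholders, not my exact setup):

import anthropic  # assumes the Anthropic Python SDK is installed

# Hypothetical prompt template; the real audit prompt was more detailed.
AUDIT_PROMPT = """You are auditing a benchmark task for test-coverage gaps.

Task instructions:
{instructions}

Task files:
{files}

Oracle solution:
{oracle}

Test cases:
{tests}

Rate the test coverage from 1 (trivially gameable) to 5 (thorough), then list
any gaps or ways an agent could pass the tests without solving the task."""

def audit_task(client: anthropic.Anthropic, task: dict,
               model: str = "claude-opus-4-5") -> str:  # model id is an assumption
    response = client.messages.create(
        model=model,
        max_tokens=2000,
        messages=[{"role": "user", "content": AUDIT_PROMPT.format(**task)}],
    )
    return response.content[0].text

# client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment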

The full results of my audit are available here, and my notes on tasks it rated 1-3 here.

Claude rated fourteen tasks a 3 and one task a 2. I manually reviewed these tasks, and determined that two of them were actually false positives.[3]

Claude’s lowest rating went to a task called fix-git. In this task, certain changes to a website have been lost in an orphaned commit, and the agent must find and merge them back into master.

The issue Claude found is: updated versions of the target files are already present in the master branch, visible to the agent in a folder called /resources/patch_files[4]. So an agent could theoretically notice these files, deduce that they were probably the target versions, and copy them back into the website’s repository. This approach would pass the test cases, which only verify file contents and don’t bother to check if any merge has actually occurred.

In another task, regex-log, the oracle solution violates the instructions. In particular, it incorrectly matches IP addresses with leading 0s in an octet, so long as the octet is two digits long. The tests do not check any cases involving leading 0s.
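To illustrate the kind of gap (a hypothetical sketch of my own, not the task's actual oracle regex): a strict octet pattern rejects a leading zero, while a lax two-digit pattern quietly accepts it.

import re

# Hypothetical octet patterns, for illustration only.
strict_octet = r"(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"   # rejects "01"
lax_octet = r"(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[0-9]{1,2})"       # accepts "01"

ip = "192.168.01.1"  # leading zero in a two-digit octet
print(bool(re.fullmatch(r"\.".join([strict_octet] * 4), ip)))   # False
print(bool(re.fullmatch(r"\.".join([lax_octet] * 4), ip)))      # True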

Claude wasn’t perfect. It gave a rating of 3 to two tasks which I believe have sufficient test coverage. In regex-chess, it incorrectly thought certain edge cases were not covered, when they in fact were[5]. In extract-moves-from-video, it complained that the tests only checked for success at a 90% threshold, even though this threshold was specified in the task instructions.

Finally, one of the tasks is…well

“Invalid prompt: your prompt was flagged as potentially violating our usage policy”

The prompt talks about “stealing” neural network weights, which triggered OpenAI’s content moderation. This prevented the model from ever properly engaging with the task.

—Claude

Why does this matter?

There are a few reasons.

First, benchmarks are often used to evaluate experimental new techniques. I recently attended a Q&A with Prof. Dan Fried, where I asked about the most common failure modes of an agentic system he was developing. While it was unclear whether this was the most common failure mode, the first thing he mentioned was errors in the environments themselves.

Every few months, someone announces that they’ve developed an AI that improves KernelBench scores by like 20x or something. And every time, well…[6]

https://x.com/miru_why/status/1991773868806361138

Second, errors in benchmarks may lead to over- or underestimation of AI capabilities. This has implications for forecasting.

Third, issues with benchmarks make it hard to build on top of them. When I was working on EvilGenie, issues with LiveCodeBench (incorrect/insufficient test cases) caused frequent headaches (though they also surfaced some interesting model behavior).

Fourth, RL training environments are quite similar to benchmarks — there’s a reason o3 reward hacks so much. By fixing benchmarks, we learn how to fix environments, leading to models which are more broadly aligned.

What to do about it

Making benchmarks is hard. I have deep respect for anyone who has worked on a widely used benchmark.

Here are a few approaches the community can take to reduce the number of errors in benchmarks.

  1. AI audits. The audit I describe above did not take me too long, and I believe the infrastructure for performing such audits can be scaled. Fulcrum’s Lunette is one such system.[7]
  2. Fine version control. While many benchmarks have released new versions, these versions often contain entirely new tasks (to increase difficulty or reduce contamination). It would be cool if, in a few days, we could see a Terminal-Bench 2.1 that simply fixes the issues found by the audit. Computing new scores would be simple, as models would only need to be rerun on the updated tasks. Indeed, in some ways benchmarking is like software development: it is unreasonable to expect a benchmark to be completely bug-free upon release. Instead, we should take inspiration from the open-source software community, with the expectation that anyone can submit a bug report or a patch.
  3. Peer review. When a benchmark paper is submitted to a conference, sample data should be required, and reviewers should be encouraged to spend time directly auditing the data. This would be much more valuable than what reviewers currently do, which is largely making ad hoc judgments about the originality of the benchmark and the quality of the methods used in its creation. Of course, a downside of this approach is that it is hostile to private benchmarks that want to avoid any possibility of contamination. But perhaps the standard in such cases can be to include both a public and a private set, as is the case with ARC-AGI.
  4. Increase community support for benchmark maintenance. Right now, researchers will often develop a benchmark, perhaps fix some issues in it at first, but eventually leave it to rot. By adding social and financial incentives, we can increase the effort put into maintaining benchmarks.
Appendix: More benchmark issues

SWE-Bench Verified is possibly the most widely used coding benchmark. Fulcrum has discovered an array of issues in the tasks. Furthermore, there used to be an issue where models could see future commits.

EpochAI found that success in the computer-use benchmark OSWorld “often hinges on interpreting ambiguous instructions”.

METR recently determined that Sonnet 4.5 was reward hacking on one of their tasks:

https://x.com/METR_Evals/status/2001473516756177134

The authors of GSO, a performance engineering benchmark, observe frequent reward hacking. Indeed, over 50% of o3’s “solutions”, and all of Gemini-2.5 Pro’s, were actually reward hacks.

  1. ^

    It’s possible that their official leaderboard uses the codeforces tests. However, given that model developers likely use the public tests to do their own benchmarking, I feel this ought to be clearly specified.

  2. ^

    In fairness to the Terminal-Bench authors, Claude Opus 4.5 had not yet been released during benchmark creation

  3. ^

    Another three I felt I didn’t have the expertise to properly vet. If you have the relevant knowledge, I’d love your input!

  4. ^

    These files are used in testing to verify that the agent’s merge was correct

  5. ^

    Admittedly in a way that’s hard to see at first

  6. ^

    DeepReinforce has a good overview of the vulnerabilities in KernelBench (scroll down to the section on reward hacking).

  7. ^

    COI notice: I am currently a winter research fellow at Fulcrum




Thousand Year Old Advice on Relinquishing Control to AI

LessWrong.com News - January 24, 2026 - 05:20
Published on January 24, 2026 2:20 AM GMT

One of Aesop’s fables is relevant to humanity’s future and the transition of power from human to AI. It’s quite short and you should read one of the many versions. But the one-sentence summary is that being a wolf is preferable to being a domestic dog because the wolf has freedom even if it lacks comfort. Now, you are free to disagree with this conclusion. I don’t want to make an argument from authority. My point is that this quite succinctly sums up my objection to the best-case ASI scenarios. Even if we remain extant and nominally free, we would no longer be in charge any more than a dog is. Dogs have a lot of rights and freedoms, and can successfully plead (non-verbally) to get certain things they want from their master, but at the end of the day they aren’t in charge, even if the owner’s life revolves around the dog.

Maybe that is a selfish thing to think in the face of astronomical waste, but it does strike me as a world without meaning. You might say that most people alive aren’t in control of their destiny in any meaningful way. You might also say that almost nobody alive is in control of humanity’s destiny in a meaningful way and they are still happy. People in general, although I suspect a smaller percentage of those here, might think it is grandiose to want to contribute, even a small amount, toward shaping humanity’s future. I think I’m willing to grant all that and say that I would still feel bad if no human ever made a meaningful choice after takeoff.

The most obvious objection is that you could say the AI will just section off some part of the universe and give us free rein in there if we choose it. That’s still not great, in my opinion.

Everything I worked for in this playground would be hollowed out by the knowledge that I could have just queried a friendly nanny AI to get it for me. Even if it didn’t step in, even if it had set up some system where it couldn’t step in, I personally would feel like something important was missing. Like all of the great achievements and firsts had been given out before I even had a chance to play. Humanity forever in second place. I’m switching fairly loosely between how I would feel personally if I was not in play and how I would feel if humanity as a whole was not in play. Feel free to generalize/specify to humanity/yourself as you wish.

You could live in a virtual world and be blinded to that fact but at that point it seems like brainwashing.

Don’t get me wrong, I’d go crazy with hedonism for a while. Maybe I’d even become addicted and change my tune. But right now, I am looking forward to the challenges. How proud I would be to be a member of the species that solved them. How great it would be to contribute one tiny piece to the solutions. But if AI does it all, I’ll be cut off from making any contribution. All future accomplishments will be credited to something so alien that we get no larger a share than Tiktaalik does for inventing the transistor.

Approximately 30% of this video is highly relevant to my thesis.

I don’t think I’m hitting on anything especially new by saying this. A few posts I recently came across have similar vibes I would say. It also seems to be discussed at length in Nick Bostrom’s Deep Utopia, although I have not found the time to read that yet.

But, it seems like there is a contingent of humanity that is willing, excited even, to give up agency to secure comfort. Where do you draw the line and say “yes, this is such an incredible amount of bliss/utilitarian goodness that I am willing to never face any real challenges in my life again”? Is this a tipping point past which it becomes your actual preference or is this just the best outcome we can hope for from AI futures?

Framing it as humans being to ASI what beloved dogs are to their masters might be inaccurate. Replacing the ASI with a deity and the utopian future with some vision of heaven might also be inaccurate. But I think there is something meaningful in the comparison, and I think a lot of people would push back much more strongly when the scenario is phrased that way than they currently do against aligned ASI.



