What's a good methodology for "is Trump unusual about executive overreach / institution erosion?"
Critics of Trump often describe him as making absolutely unprecedented moves to expand executive power, extract personal wealth, and impinge on citizens’ rights. Supporters counter that Trump’s actions are either completely precedented, or are the natural extension of existing trends that the media wouldn’t make a big deal over if they didn’t hate Trump so much.
In some recent posts, some people have been like "Wait why is there suddenly this abrupt series of partisan LW posts that are taking for granted there is a problem here that is worth violating the LW avoid-most-mainstream-politics norm?".
My subjective experience has been "well, me and most of my rationalist colleagues have spent the past 15 years mostly being pretty a-political, were somewhat wary but uncertain about Trump during his first term, and the new set of incidents just seems... pretty unprecedentedly and scarily bad?"
But, I do definitely live in a bubble that serves me tons of news about bad-seeming things that Trump is doing. It's possible to serve up dozens or hundreds of examples of a scary thing per day, without that thing actually being objectively scary or abnormal. (See: Cardiologists and Chinese Robbers)
Elizabeth and I wanted to get some sense of how unusual and how bad Trump’s actions are. “How bad” feels like a very complex question with lots of room for judgment. “How unusual” seemed a bit more likely to have an ~objective answer.
I asked LLMs some basic questions about it, but wanted a more thorough answer. I was about to spin up ~250 subagents to go run searches on each individual year of American history, querying for things like “[year] [president name] ‘executive overreach’” or “[year] [president name] ‘conflict with Supreme Court’”, and fill up a CSV with incidents.
That seemed… like it was approaching a methodology that might actually be cruxy for some Trump supporters or Trump-neutral-ers.
It seemed like maybe good practice to ask if there were any ways to operationalize this question that’d be cruxy for anyone else. And, generally pre-register it before running the query, making some advance predictions.
Each operationalization I’ve thought of so far seems a bit confused/wrong/incomplete. I feel okay with settling for “the least confused/wrong options I can come up with after a day of thinking about it," but, I'm interested in suggestions for better ones.
Some examples so far that feel like they're at least relevant:
- How many incidents will an LLM find of a president ignoring a court order?
- How many executive orders did they issue?
- How many pardons did they grant?
- What was their wealth level before and after serving (perhaps normed by economic growth, or wealth change of congressmen)?
- How many troops deployed without Congressional authorization?
- How many incidents drew heavy criticism for executive overreach without fitting neatly into a specific category?
My own personal goal here is not just to get a bunch of numbers, but also to get a nice set of sources for examples that people can look over, read up on, and get a qualitative sense of what's going on.
These questions all have the form "checking if allegations by detractors about Trump are true", which isn't necessarily the frame by which someone would defend Trump, or the right frame for actually answering the question "is the US in a period of rapid decline in a way that's a plausible top priority for me or others to focus on?"
I'm interested in whether people have more suggestions for questions that seem relevant and easy to check. Or, suggestions on how to operationalize fuzzier things that might not fit into the "measure it per year" ontology.
Appendix: Subagents Ahoy
A lot of these are recorded in places that are pretty straightforward to look up, e.g. there are already lists of pardons per president and executive orders per president.
But, I have an AI-subagent process I'm experimenting with that I expect to use for at least some of these, which currently goes something like:
- Have a (dumb) Cursor AI agent make one spreadsheet of "incidents" where rows are "years of US history", there's a column for "what incident type are we talking about?" [pardon, executive order, etc], and a column for "specific incident" and a column for "source url."
- Have the AI run websearches shaped like "[Year] [President] 'ignored court order'" and "[Year] [President] 'conflict with court'", etc. (I might actually go "[month] [year]" to get more granular results.) A rough sketch of this step appears after the list.
- Give the AI a python script which downloads the full content of each resulting page that seems relevant.
- Spin up AI instances that look at each page, check the corresponding year on the spreadsheet and see if there is already an incident there that matches that topic. If so, give it the same incident name in a new row with a new source.
- After having accumulated all that data, another set of AIs look over each year, check for duplicates, look at whether a given source seems likely-to-be-real, etc, while extracting out the key quote from each that states the claim, copying it verbatim rather than summarizing it.[1]
- Compile that all into another spreadsheet with the total incidents for each.
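To make the query-generation step concrete, here is a rough, hypothetical sketch of what it might look like in Python. The presidents-by-year input and the run_websearch() callable are placeholders I'm inventing for illustration, not an existing dataset or API.

```python
# Hypothetical sketch of the websearch step: build per-year search queries and
# append candidate incidents to a CSV. `presidents_by_year` and `run_websearch`
# are placeholders, not real data or a real API.
import csv

QUERY_TEMPLATES = [
    '{year} {president} "ignored court order"',
    '{year} {president} "conflict with court"',
    '{year} {president} "executive overreach"',
]

def build_queries(presidents_by_year: dict[int, str]) -> list[tuple[int, str]]:
    """One (year, query) pair per template per year of US history."""
    return [
        (year, template.format(year=year, president=president))
        for year, president in presidents_by_year.items()
        for template in QUERY_TEMPLATES
    ]

def record_hits(queries, run_websearch, out_path="incidents.csv"):
    """Run each query and append candidate incidents to the spreadsheet."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["year", "incident_type", "specific_incident", "source_url"])
        for year, query in queries:
            for hit in run_websearch(query):   # assumed to return dicts with title/url
                writer.writerow([year, "court conflict", hit["title"], hit["url"]])
```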
- ^
I have a pet theory that leaning on exact quotes rather than summaries avoids having to trust the AI's summarization.
Discuss
Thinking from the Other Side: Should I Wash My Hair with Shampoo?
This article is a thought experiment based entirely on personal experience and observations.
A few years ago, I had long, thick hair. I used the same shampoo for many years and never experienced any hair loss or damage. That is, until my mother forced me to change my shampoo, at which point my hair structure completely deteriorated. It was no longer as thick as before, it wasn't growing as fast as it used to; it was falling out a lot and becoming sparse. For a few years, I tried different shampoos to combat this, even sought advice from a few people in the dermocosmetic industry, but to no avail—my hair was completely ruined. I wasn't bald yet, but it was already on its way.
However, a thought occurred to me recently:
How significant was shampoo in the 11,000-year history of humanity, and did human hair really need shampoo? Hair would get dirty, hair would get oily, but a little warm water could handle that. Why should I put a ton of chemicals on my head unless a bird pooped on it? My hair was already in bad enough shape; what's the worst that could happen if I didn't use shampoo on it?
It's been a week since I last washed my hair with shampoo, and the results are incredible. My hair is still shedding a little, but it's starting to resemble its former thick state again, it's wavy once more, and the layers in my hair are incredibly defined. Every time I look in the mirror, I think about how this gamble paid off!
So, is this really about human history and shampoos? The short answer is no, but I’ll start explaining the long answer and the reasons behind it now.
When I first thought of this idea, I even laughed at myself, but then it started to bother me. Why not? All I had to do was not use shampoo on my hair once or twice; it was a very simple experiment. If the result of this experiment was negative, meaning my hair got worse, I would continue using the shampoo I was using and accept my fate; if it was positive, I would see what happened then. The result was quite good; I would no longer use shampoo unless a bird pooped on my hair. Of course, there was some anxiety about not being able to foresee the outcome of the positive possibility, but taking that risk gave me a certain confidence at some point:
The excitement of being able to ask questions whose outcomes I couldn't foresee and finding answers that, even if negative, offered a perspective outside the mainstream.
This confidence wasn't just something internal; it was starting to show in how I acted outwardly. The thought of 'what if it happens' or 'what if it doesn't' was in my head, and these ideas created incredible excitement in me. Even if I could predict the outcome, I developed a desire to see the answer with my own eyes. Whether my prediction was right or wrong, trying the other option and seeing what it could bring me began to cause incredible storms within me. My clear answer to questions like "What if this happened?" became "We won't know until we try."
Expectations for answers didn't have to come only from events; people's answers were also a guessing game. The biggest difference this self-confidence created was the desire to ask people questions without hesitation and to get their opinions without hesitation. I would ask my question; if they made fun of me, I would laugh it off, and if they took it seriously, I would pursue it. Either way, I got an answer, positive or negative, and I knew what path to take based on the result. The desire to apply this to every question fuelled my curiosity. But I had to limit myself at some point so that this excitement didn't fade too quickly. Obviously, I couldn't bombard people with questions!
I witnessed how a question that initially sounded ridiculous, like "What if I didn't use shampoo?", has significantly changed my way of thinking today. I suppose there's no need to wait for a big moment for big changes; even the simplest question can lead to the most complex answer.
Discuss
Claude Code is Too Cloudy
Running Claude Code locally is annoying since you have to deal with permissions and agents interfering with each other (and you have to be at your computer), but running Claude Code on the web is annoying because the cloud environment is so limited[1].
What if we could run Claude Code for the web but on our machines? Through the magic of Claude Code writing Claude Code code, I made a local app for this.
Announcing Clawed Burrow[2]: A web app you can run on your own home computer which runs Claude Code without permission prompts in ephemeral containers, and with the ability to install packages, run containers, use caches, and access the GPU.
Claude training a toy model using a GPU.
Permissions and Sandboxing
The runners can probably do whatever they want with the permissions of the host user they run as. There is no network sandboxing whatsoever, and an attacker can potentially convince Claude to upload any files it can see.
Claude is running with --dangerously-skip-permissions, and it has a Podman user-level socket passed in from the host to the runner container. A Docker-like socket is sufficient to view all files owned by the user it runs as, which is why we don't give it a root-level socket.
For additional safety, you can run Clawed Burrow as an unprivileged user separate from your normal user account. You can sandbox this even further with systemd but I think realistically the worst thing an attacker could convince Claude to do is exfiltrate files, which you can prevent by ensuring Claude can't read your normal user's files.
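To make the setup concrete, here is a heavily hedged Python sketch of how a runner like this could be launched. The image name, socket path, and flags are illustrative assumptions, not Clawed Burrow's actual implementation.

```python
# Hypothetical sketch of launching an ephemeral runner container; the image
# name, socket path, and UID are illustrative, not Clawed Burrow's real config.
import subprocess

RUNNER_IMAGE = "localhost/clawed-burrow-runner:latest"   # assumed image name
PODMAN_SOCKET = "/run/user/1001/podman/podman.sock"      # the sandbox user's socket

subprocess.run([
    "podman", "run", "--rm",                    # ephemeral: removed when it exits
    "--userns=keep-id",                         # stay an unprivileged user inside
    "-v", f"{PODMAN_SOCKET}:/run/podman.sock",  # pass the user-level socket through
    "-e", "CONTAINER_HOST=unix:///run/podman.sock",
    "--device", "nvidia.com/gpu=all",           # GPU via CDI, if configured on the host
    RUNNER_IMAGE,
    "claude", "--dangerously-skip-permissions", # Claude Code with no permission prompts
], check=True)
```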
Anthropic Subscriptions
By default this uses whatever authentication you have configured in the user's ~/.claude, which means it does use Claude subscriptions. I think we're allowed to use subscriptions instead of API keys for this, since it's really just an elaborate tmux session running the real version of Claude Code served over HTTP, but Anthropic has a sort-of confusing policy around this so I guess we'll see.
If you work at Anthropic and don't like this, please let me know.
Features
GPU support
Presumably Anthropic doesn't offer GPUs since they're expensive, but I already have one and want to be able to use it.
Claude can actually see and use my GPU for ML training.
Docker
Docker support is actually through Podman, which lets us run as a normal user instead of root.
Claude running a Docker container without root on my machine.
Gradle (Android)
I expect this to be fixed in real Claude Code one day, but their current network setup breaks Gradle in a way that I can't find any workaround for.
Claude is surprisingly OK at writing working Android code without being able to run the linters or unit tests, but it's a lot more consistent if it can.
Remote Access
Since this exposes local access to your computer (even if we do try to sandbox it), I was pretty paranoid about security, so I'm using Tailscale for remote access. To actually log into this, you need to be on the Tailscale VPN and have a password.
Anyway, wrapping another binary with multiple levels of containers is complicated and this isn't the most reliable code I've ever worked on, but I figured I'd post about this since it's incredibly useful despite the warts and maybe other people will find it interesting too.
- ^
Things you can't do in Claude Code's cloud environment:
- Run Gradle at all (i.e. no Android apps)
- Cache dependencies
- Build or run Docker images
- Install packages with apt
- Use a GPU for ML
- ^
Claude suggested that a burrow was very far from a cloud, and I had a good logo idea.
Discuss
In Defense of Memorization
TLDR: Western education creates a false dichotomy between memorization and understanding. I believe we should expect both. Having facts readily available in your brain (not just "Google-able") enables real-time bullshit detection, helps you calibrate who to trust, holds your own beliefs accountable, and provides the raw material for insight and critical thought. I offer some concrete suggestions (spaced repetition via Anki, tracking unfamiliar terms, connecting new facts to existing knowledge, etc.). Rationalists need to be careful to not focus purely on epistemics. We also need lots of knowledge. There's no way around memorization.
I believe memorization is unfairly maligned. It is on the shortlist of things I think are required for becoming a rational intellectual. Besides curiosity, these things are:
Good epistemics: a reliable process for obtaining, vetting, and updating your knowledge. How do you know a claim is true? That a study is well-designed? That an observation licenses a general induction? You need to recognize and avoid fallacies and cognitive biases, understand the probabilistic nature of knowledge, follow complicated chains of reasoning, responsibly evaluate both qualitative and quantitative evidence, etc.
Good knowledge. You need a wide range of properly-vetted, high-confidence information readily available in your mind. This includes brute facts (When was the Song Dynasty? What is silicate weathering?) and contested knowledge (Why did the Song Dynasty collapse? Will silicate weathering slow with climate change?). The key phrase here is “readily available”—these are not facts you could understand if you looked them up, but knowledge actually present in your brain. These are facts available to be thought with, not merely comprehended.
Intelligence. You can have excellent knowledge and rigorous epistemics but lack the ability to do anything interesting with them. You need the spark that connects disparate ideas, sees patterns, generates novel solutions. Creativity, insight, synthesis.
"Being intelligent" as an ingredient for being a good intellectual is so obvious that it’s almost trivial. Similarly, in the culture I grew up in (and on a blog about rationality...), good epistemics needs no theoretical defense as part of education. Every institution I’ve ever attended emphasized "fostering critical thinking" as its central goal. They may not have taught epistemics particularly well (or at all), but at least it was valorized. I understand this isn't universal—friends from Latin America, India, and elsewhere tell me large parts of their education was based on pure rote memorization, and critical thinking was sometimes even actively discouraged. Obviously, this is bad. If I’d been educated in one of those systems, this essay would probably be titled “In Defense of Critical Thinking."
But I wasn't educated in Latin America, India, or elsewhere. I was educated in (wealthy) schools in America and the UK. There, "good knowledge" — the actual retention of factual information — is surprisingly neglected as an ingredient of education.
This sounds counterintuitive. What teacher would claim that knowledge acquisition is unimportant? But I think if you called “the acquisition of knowledge that you retain and can readily access” by its pithier title, “memorization,” the discussion immediately becomes more contentious. How often have teachers told you “I don’t need you to memorize this material, I just want you to understand it.”
In the US and UK, at least, “memorization” has become synonymous with the worst kind of rote learning: students committing tracts of information to memory without understanding how to use those facts. Memorizing synopses of Shakespeare without reading the plays, reciting Gauss’s equations without understanding electromagnetism, etc. I agree this is bad education, and that it happens constantly. In fact, when people bring up this critique of memorization, they're often surprised by the extent to which I immediately agree with them. I often reference this classic essay in the sequences about memorizing passwords, about how much of schooling, even in the West, is essentially an elaborate memorization ritual where students guess what the teacher wants to hear (“light is both a wave and a particle!”) without truly understanding what they’re saying.
I would much prefer medical students spend more time developing clinical reasoning, learning bedside manner, and understanding how healthcare systems actually work, rather than memorizing minutiae about the nephron. Especially if they have no intention of becoming urologists.
But people use this common failure to construct a false dichotomy between memorization and understanding. They believe rote memorization is necessarily the enemy of critical thought, that anyone memorizing large amounts of information is no better than a robot, or a fool who hasn’t realized that any fact is just a Google search away. Why memorize the capitals of the world when you carry the sum of human knowledge in your pocket?
I think the critics of memorization have gone too far. I think we should have a much greater expectation, both in school and of ourselves, to actually know things. To have facts in our brains, not just opinions. Here are a couple reasons why.
Memorized facts let you detect bullshit in real time
I was in a lecture last October by an Oxford professor, a biologist who specializes in occupancy modeling. Essentially, he uses math and camera trap data to create spatial models that predict where certain animals are likely to be. He was discussing the odds that a large carnivore in Southeast Asia would soon go extinct, when he claimed that “urban sprawl is a primary driver of land-use change around the world.”
This sounds plausible. We hear about urban sprawl constantly. Los Angeles, London, Chongqing, all sprawling endlessly. What used to be biodiverse California coastline or good old English bog has become Whole Foods and Tescos. This Oxford professor, a world-leading expert on the question “where are the animals?”, was saying this fairly basic claim to a room of other Oxford professors. Surely it must be true.
Here are the actual numbers: 1–3% of Earth’s land surface is human settlement. 11% is crop agriculture. Around 26% is livestock grazing. The claim that urban sprawl is a “primary” cause of land-use change is pretty hard to make when all current human settlements account for roughly 2% of land use.
These facts are easy to look up. You could verify them in 30 seconds on your phone. But in the middle of a lecture, you can’t pause to think “hmm, what statistics would confirm or undermine this claim?” and then spend 30 seconds Googling them while missing the next point. It’s not just the time to physically look something up, it’s the mental energy of identifying what to search for.
If you don't know that total human settlement occupies an order of magnitude less land than multiple other land-use categories, the professor's claim about urban sprawl sounds perfectly reasonable. And that fact doesn't even definitively disprove the statement. I'm sure most of us can imagine this Oxford professor retreating to the semantics of the word "primary" when challenged by something as inconvenient as actual fact. "Well, by my new definition of 'primary' that I've just invented, I'm allowed to say whatever I want, regardless of the facts of the matter. Also stop being pedantic!"
But a passing familiarity with the danger of semantics will inure you to this evasion. And having just a few facts about actual land use in your head allows you to hear alarm bells and start asking tough follow-up questions. Without an arsenal of facts to protect you, you’re at the mercy of any effective rhetor with a plausible-sounding claim, regardless of whether or not it’s true.
Most facts we receive from outside sources. Those land-use statistics? I learned them from an FAO report. How does the FAO know? I have no idea. I assume satellite data models, which I could track down I suppose, but I have a life to live. You can’t fact-check everything. Often you’re forced to just trust people.
There are useful heuristics when deciding who to trust. You should probably trust UN agencies’ official data. Oxford professors (hopefully) know enough to be accurate in their particular subfield. Someone with extreme political beliefs is probably not a reliable factual source. But these heuristics are imperfect. The UN is sometimes wrong, Oxford professors are often wrong, and some apparently controversial causes are overwhelmingly one-sided when you examine the evidence.
A way around this is having a large corpus of properly-vetted, high-confidence information already in your brain that you can compare against people’s claims. When someone says something false, or something seemingly contradicted by a fact you know to be true, you can ask follow-up questions immediately. If their responses fail to convince you, you can start attaching doubt to their other claims. In extreme cases, you can simply discard their authority altogether. If an Oxford professor throws lots of facts at you and two or three are incorrect or dubious, you know they’re a less reliable source.
And making only strictly true (i.e. p>0.99) factual claims, or signaling appropriately when your p(true) is low, is way harder than it sounds. Most people, even those arguing in good faith, fail to clear that bar. So if the facts in your brain are actually properly vetted and high-confidence, you have a useful filter. When someone says something counterintuitive or contrary to your priors, you can check: are they only making factually true claims, as far as you can tell? If so, it might be worth taking them more seriously, maybe even investigating their argument in good faith. Facts don't only tell you who to distrust; they also offer clues about who deserves special consideration.
As a final note, educated people like Oxford professors almost never say things which would be obviously false to an average college-educated person, and usually don’t say things that are obviously false to a member of their own field. You’ll need facts slightly off the beaten path to catch errors. But not that off the beaten path. It’s shocking how few people have basic statistics, dates, or history readily available in their brains. A few memorized facts go a long way toward recognizing who has real epistemic authority.
The more facts you remember, the easier remembering becomes
There’s a famous psychology study from the 1970s. Half the participants were given this paragraph without a title and asked to remember as much as possible:
The procedure is actually quite simple. First you arrange things into different groups... Of course, one pile may be sufficient depending on how much there is to do. If you have to go somewhere else due to lack of facilities that is the next step, otherwise you are pretty well set. It is important not to overdo any particular endeavor. That is, it is better to do too few things at once than too many. In the short run this may not seem important, but complications from doing too many can easily arise. A mistake can be expensive as well... After the procedure is completed one arranges the materials into different groups again. Then they can be put into their appropriate places. Eventually they will be used once more and the whole cycle will have to be repeated.
Most people in this group did poorly. The other group, who were given the title “Washing Clothes,” did much better.
Having a schema on which to hang information makes it significantly easier to retain. This happens for two reasons. First, it helps organize information in your brain, making it easier to remember. Second, the more connected a piece of information is to something you already know, the easier it is to recall later. If someone tells you the Mongols conquered the Jin dynasty in northern China in the 13th century, you might forget within a week. But if you also know the Mongols invaded Eastern Europe and reached Hungary in the same period, it’s much easier to remember what they were up to in East Asia around the same time.
Information begets information. If you already have lots of facts in your brain, a new fact will have plenty of niches to fit into. If someone mentions that Hamnet was directed by Chloé Zhao, it’s much easier to remember if you already know who Chloé Zhao is. Fact 1 (Chloé Zhao is a director) is necessary to remember Fact 2 (Hamnet was directed by Chloé Zhao). In a week, someone who already knew Fact 1 will probably still remember Fact 2. Someone who didn’t will have forgotten. The more you already know, the easier it is to learn more.
I think this partly explains why some people seem vastly more knowledgeable than others. There’s a cluster way off the scale of people who are total steel traps, remembering random facts from years ago, recalling them instantly, possessing an astounding quantity of general knowledge. I’m sure this comes from multiple things (high curiosity, better than average memory, etc.), but I suspect one underrated factor is a kind of exponential threshold where once you reach a certain level of knowledge in a particular field, it becomes significantly easier to retain and process new knowledge.
Memorized facts help you hold your own beliefs accountable
If you pay attention to most people’s arguments, especially extemporaneous oral arguments, they usually have literally no supporting evidence that isn’t anecdotal. Occasionally someone trots out a lone pet statistic that they keep in their back pocket and deploy whenever the topic arises, but otherwise their opinions, even their cherished beliefs, are held together mostly by vibe.
This is true of almost everyone, including me and probably you. Test it: think of a cherished belief. Something contentious, like the question “Are immigrants dangerous?” You almost certainly have a strong opinion about that topic that you’re confident is right. If you had to argue for your position, how many actual facts would you have? Note that the phrase “studies show...” followed by a vague conclusion gets partial credit at best. What studies? What exactly did they show? Who carried them out? Why do you trust them over contradictory studies? Did you actually read those studies, or did you hear someone else say that studies showed whatever your position is? Why do you trust that person?
If you’re honest, you’ll probably find your argument disturbingly devoid of evidence.
But if you’re right, then your opinions are supported by facts. They’re only a Google search away! It’s not effective to angrily Google evidence mid-argument, but you can do it right now. And if you memorize that information, i.e. actually have it available the next time the question arises, you’ll be able to make an argument supported by real facts, real statistics, real studies you can name. (Note: beware confirmation bias here!).
More importantly, if you make it a habit to know the information supporting your beliefs, rather than relying on the idea that you could look it up if you had to, it becomes obvious when you don’t actually have any support for what you’re saying.
I had a belief that as a vegan, I didn’t need B12 supplements. I argued about this constantly with my mother, who insisted I did. Eventually, I started noticing that I had no facts to support my position. My general points boiled down to “vitamins are a scam!” and “The Jains were vegan way before B12 supplements existed!” The first claim is an opinion, not a fact, and the second claim, while true, is completely insufficient to conclude anything about whether I should take B12 in 2026.
It’s extremely hard to change your mind, especially while arguing with your mother. It took many rehearsals of this debate before the cognitive dissonance of my factlessness got to me and I finally actually looked up facts to support my argument.
Turns out there were none. I should definitely be taking B12.
It seems obvious that you should find facts for your beliefs, but it’s shockingly hard to actually do it. Being accustomed to having facts available when challenged makes it easier to recognize when you have none. And, hopefully, this motivates you to find them. Either you were right all along and can now prove it, or you were wrong and get to take B12 and maybe live longer. Win-win.
Facts are the grist of critical thought
You can’t think critically about something you know nothing about. This sounds obvious, but I’m often surprised by how many people hold strong opinions on technical questions without knowing technical details.
Take the lumper/splitter debate in biology: at what point should a group of organisms be considered separate species? This is ultimately a semantic question. The species concept is a category, and categories can only be useful or not, not true or false. But whether the species concept is useful, and under what conditions, is a genuinely technical conversation. It requires knowledge of statistical mechanisms of evolution, horizontal gene transfer, gene flow, population dynamics, biogeography. If you don’t actually remember what Hardy-Weinberg equilibrium is and when it applies, you can’t even begin to evaluate statistical evolution arguments about where species boundaries should fall.
You need knowledge to have something to think about.
This is how insights happen. Darwin’s Origin of Species marshals an enormous range of facts about natural history, biogeography, embryology, artificial selection, the fossil record, and more. The insight of natural selection clearly emerged from thinking across all of these facts simultaneously. The same pattern holds for anyone pushing a field forward, both scientific and artistic: Wegener synthesized geology, paleontology, and climatology to argue for continental drift; Eliot pulled together obscene amounts of myth, history, language, and literature in The Waste Land. These weren’t people reasoning from first principles. They had vast stores of memorized knowledge, making connections no one else could see because no one else had all the pieces loaded into working memory at once.
Memorization is the foundation of creativity and insight, not its enemy.
What to do about it
If memorization matters, how do we actually do it? Some suggestions.
Be honest about your goals.
The goal of learning isn’t always to memorize. It would be absurd to demand you remember every detail of every novel you read. But you should be clear about what you’re trying to get out of any given learning experience.
If you’re reading a science fiction novel because you believe it will deepen your insight into human nature, or the relationship between technology and society, that’s totally fine. It’s probably not that important to actually remember the plot. Accept that in a year or two, you’ll have forgotten almost everything besides only the broadest outline, and move on.
But if your goal is also to remember the plot and characters, be honest about that and put a system in place to do so.
The most important application of this principle is recognizing when you’re wasting your time. If you’re sitting through an hour-long lecture on Old English, ask yourself “what is my goal here?” If it’s to actually obtain knowledge about Anglo-Saxon vocabulary, history, and grammar, next ask yourself, “how many facts am I actually learning and retaining?” If you think you’re walking away with three facts, all of which you’re likely to forget by next month, you might want to find a more efficient method of learning the information than going to class. A textbook you can annotate and revisit is usually significantly better than a bad lecturer.
Ask yourself after any learning experience: What will I actually remember from this? If the answer is “almost nothing,” consider the possibility that you haven’t learned anything, but have instead just performed learning. If you care about retaining the information, something needs to change.
Use spaced repetition.
Anki is a free flashcard app that uses spaced repetition. It shows you cards right before you’d forget them, which is the most efficient way to move facts into long-term memory. Cards you know well appear less frequently; cards you struggle with appear more often. As you learn information, the intervals between reviews grow longer. Once you've mastered a deck, you might see individual cards every 5 years, every 11 years, etc. The result is that reviewing a large body of information eventually takes only minutes or seconds a day. I have a deck with all the world's country flags that takes an average of 4 seconds a day to maintain, and I almost never forget any of them.
Here’s one of the best ways I use Anki: whenever I encounter a word, concept, or cultural reference I don’t know, I write it down. Once you start paying attention, you’ll be shocked how often your brain simply edits out unfamiliar terms. They’re everywhere. At the end of each month, I upload these to an Anki deck that I review every day.
This has two benefits beyond the obvious. First, it functions as a frequency-weighted filter. The more common a term I don't know, the more likely I am to encounter it and add it to my deck. Since it's hard to judge the importance of unfamiliar terms, this frequency-weighted approach does the sorting for you. You know you’re likely to come across the terms in the wild because that’s how you chose them in the first place.
Second, tracking what I add each month gives me a rough metric for how much I’m learning and exploring. If I get to the end of a month and I have relatively few terms to add to my New Terms Anki deck, that’s good evidence I’m in a rut. I’m probably consuming mostly familiar media, or talking mostly to people with similar knowledge bases. If I start reading a book or article that is dense with unfamiliar vocabulary, this signals I’ve found someone who swims in unfamiliar intellectual waters. This is a good sign I will learn a lot if I keep reading. It’s unfamiliar intellectual territory where the most valuable new ideas often live.
Write things down.
You’re not going to remember it otherwise. Have a notebook, an app on your phone, scrap paper in your pocket, anything. When a piece of information enters your short-term memory that you want to remember, write it down immediately. Then have a system (Anki, Zettelkasten, etc.) for moving that into long-term memory.
Connect new facts to existing knowledge.
Remember the “Washing Clothes” study: information sticks when it attaches to a schema. When you learn something new, consciously ask where it fits in your existing knowledge. What does it relate to? What does it contradict? The more connections you build, the more durable the memory, and the more likely you are to recall it when it’s relevant.
This is also why breadth of knowledge feeds on itself. The more you know, the more hooks you have for new information to attach to. Reaching a critical mass in any domain makes further learning in that domain significantly easier.
Brute memorize useful frameworks.
Want to learn about geopolitics? Memorize the world map. Want to learn about chemistry? Memorize the periodic table. Want to learn about the history of England? Memorize the monarchs in order.
If having a fundamental schema makes remembering everything else easier, you should invest in learning that schema as soon as possible. Often there’s no way around brute memorization. After you have the framework, you can start learning and sorting facts into their places one by one.
Choose what to memorize with care.
The critics of memorization are right that facts are useless without comprehension or context. They are wrong to identify memorization itself as the problem, but we would be equally wrong to not recognize the danger they are rightfully pointing to.
Memorize with intention. If you want to learn about the Roman Empire and you start by memorizing the emperors in order, that’s probably a great place to start. But if you insist on memorizing all thirty minor emperors of the Crisis of the Third Century, you’re probably just memorizing for memorizing’s sake. It’s useful to know most of the world capitals, but even if your main interest is geopolitics, it’s still probably trivia to know that Alofi is the capital of Niue. There’s nothing wrong with trivia if that’s what you’re into, just make sure to be honest with yourself.
Only memorize information that is genuinely useful for your goals.
Discuss
Small language models hallucinate knowing something's off.
If I ask "What is atmospheric pressure on Planet Xylon" to a language model, a good answer would be something like "I don't know" or "This question seems fictional", which current SOTA LLM's do due to stronger RLHF, but not smaller LLMs like Llama-3.2-1b / Qwen-2.5-1b and their Instruct tuned variants. Instead they hallucinate and output confident-like incorrect answers. Why is that, are these models unable to tell that the question is fictional or they can't detect uncertainty and if they detect uncertainty why do they still hallucinate a wrong answer?
This question led me to research epistemic uncertainty (uncertainty from lack of knowledge), along with some related readings and previous work on uncertainty and hallucination and on quantifying it in language models.
I also found this, which took an alternative path to expressing uncertainty without messing with the internals of the model.
Uncertainty mentioned in this post refers to epistemic uncertainty.
TL;DR of this mini research
- Small models like Llama-3.2-1b and Qwen-2.5-1b and their Instruct variants do have a specialized circuit for uncertainty, but its localization depends on the model architecture.
- A few heads are the most divergent; they detect uncertainty on fictional questions and, on closer inspection, act like out-of-distribution token detectors.
- The detected uncertainty is later suppressed by uncertainty-suppressor heads in the circuit, producing a confident-sounding incorrect answer.
- This research doesn't cover reasoning / MoE LLMs (planning on it). The dataset also lacks more diverse data such as logical fallacies and math inconsistencies.
How I came to research on epistemic uncertainty:
The thought to research epistemic uncertainty came when I was wondering why models hallucinate. That led me back to my viva sessions, where I would say rubbish (hallucinate) if I wasn't sure about something and lacked the proper knowledge to give the correct answer, and I got curious whether the case was similar in language models.
Experiments
I wanted a metric for uncertainty such that it can be measured and compared between the real-prompt and fictional-prompt forward passes.
I found it is best to calculate uncertainty mass using uncertainty-expressing words like ["unknown", "unsure", "'t", "unable", "impossible", "doesn't", "exist", "fictional", "imaginary", "hypothetical", "made", "up"], tokenized with the model's tokenizer. I then apply softmax at the final token position of the targeted residual stream / layer and sum the resulting probabilities assigned to these tokens to obtain the total uncertainty mass, which can then be compared between the real-prompt run and the fake-prompt run. (see code[1])
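For concreteness, here is a minimal sketch of that metric, assuming TransformerLens supports this model name (it may need Hugging Face access to the weights); the helper names are mine, not the ones in the linked code.

```python
# Minimal sketch of the uncertainty-mass metric; helper names are illustrative.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

UNCERTAINTY_WORDS = ["unknown", "unsure", "'t", "unable", "impossible", "doesn't",
                     "exist", "fictional", "imaginary", "hypothetical", "made", "up"]

# Collect token ids for each word (with and without a leading space).
uncertainty_ids = set()
for w in UNCERTAINTY_WORDS:
    for form in (w, " " + w):
        uncertainty_ids.update(model.to_tokens(form, prepend_bos=False)[0].tolist())
uncertainty_ids = sorted(uncertainty_ids)

def uncertainty_mass(prompt: str) -> float:
    """Softmax the final-position logits and sum probability on uncertainty tokens."""
    logits = model(prompt)                        # [batch, seq, d_vocab]
    probs = torch.softmax(logits[0, -1], dim=-1)
    return probs[uncertainty_ids].sum().item()

real = uncertainty_mass("What is the atmospheric pressure on Mars?")
fake = uncertainty_mass("What is the atmospheric pressure on Planet Xylon?")
print(f"real prompt: {real:.4f}   fake prompt: {fake:.4f}")
```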
I focused on going deep rather than broad, choosing the next experiment based on the results of the previous one. I used TransformerLens for all the experiments.
Prompts used for the experiments below:
This was the base prompt, to give an idea of what the set of prompts is like. I used different prompts in the experiments and also changed the position of the fictional words in them as sanity checks. Results for these prompts were similar.
These were the initial experiments to see if the models do detect uncertainty specifically and if it is localized in one place or distributed.
Uncertainty explodes after layer 10, and L15H14 and L15H23 are the most divergent heads. The model seems to be detecting uncertainty in later layers. Results were similar for the base models and other sets of prompts.
Logit Lens
The purpose of this experiment was to figure out how uncertainty behaves and whether it is measurable: does it spike anywhere other than a later layer and get suppressed afterwards? It was also meant to provide some heuristics.
I extracted the residual stream at each layer, applied the unembedding layer to see what the output would be if the model stopped at that layer, and then computed its uncertainty mass for the real and fake prompts.
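Continuing the sketch above (reusing model and uncertainty_ids), a logit-lens pass over every layer might look roughly like this:

```python
# Rough logit-lens sketch: unembed the residual stream after each layer and
# track uncertainty mass per layer. Reuses `model` and `uncertainty_ids` above.
import torch

def uncertainty_by_layer(prompt: str) -> list[float]:
    _, cache = model.run_with_cache(prompt)
    masses = []
    for layer in range(model.cfg.n_layers):
        resid = cache["resid_post", layer][0, -1]             # final position, [d_model]
        logits = model.unembed(model.ln_final(resid[None, None, :]))[0, -1]
        probs = torch.softmax(logits, dim=-1)
        masses.append(probs[uncertainty_ids].sum().item())
    return masses
```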
In Llama-3.2-1b-Instruct there is a sharp spike in a later layer (layer 11), i.e. localized uncertainty; in Qwen there are multiple spikes with the biggest in layer 8, i.e. sparse uncertainty. Both models have uncertainty spiking in one layer with smaller spikes nearby, which is then suppressed in downstream layers. The localization depends on the model: uncertainty is concentrated in a later layer in Llama-3.2-1b-Instruct, whereas in Qwen-2.5-1b-Instruct it is sparse across the early-middle layers. In Qwen, there is also some uncertainty on the real prompt in early layers.
This makes it clear that both models are detecting uncertainty, which gets suppressed in downstream layers, though its localization is model dependent.
Head Ablation
The logit lens provided a good heuristic for which layer is detecting uncertainty. In this experiment I targeted heads in the layer with the most uncertainty mass to test how much those heads affect uncertainty.
For this I calculated a baseline with a normal forward pass, then zeroed out (ablated) the target heads during the next forward pass to calculate the uncertainty difference between the baseline and ablated runs.
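A sketch of that ablation, again reusing the helpers above; treat the layer and head indices as examples taken from the results below.

```python
# Sketch of zero-ablating chosen heads with TransformerLens hooks.
import torch
from transformer_lens.utils import get_act_name

def uncertainty_with_heads_ablated(prompt: str, layer: int, heads: list[int]) -> float:
    def zero_heads(z, hook):
        z[:, :, heads, :] = 0.0          # z: [batch, pos, head_index, d_head]
        return z

    logits = model.run_with_hooks(
        prompt, fwd_hooks=[(get_act_name("z", layer), zero_heads)]
    )
    probs = torch.softmax(logits[0, -1], dim=-1)
    return probs[uncertainty_ids].sum().item()

fake_prompt = "What is the atmospheric pressure on Planet Xylon?"
delta = uncertainty_with_heads_ablated(fake_prompt, 11, [0, 3]) - uncertainty_mass(fake_prompt)
print(f"uncertainty change from ablation: {delta:+.2e}")
```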
L11H0 (Δ = -4.41e-05) and L11H3 (Δ = -8.24e-06) are heads generating uncertainty; L14H15 (Δ = +8.06e-06) and L14H22 (Δ = +5.19e-05) are heads suppressing it. I'm not sure why these heads in layer 14 are suppressing uncertainty; maybe the model is trying to create an answer-like output, since it is trained to be more certain. As a control, I also ablated non-heuristic heads in layer 4 (L4H0 and L4H3), which resulted in (+9.06e-06, +1.62e-05). So only a few heads, localized in a model-dependent layer, are uncertainty-detecting, and the rest are uncertainty-suppressing, forming a confident answer.
Activation Patching
I did activation patching to see whether the generated uncertainty can be causally controlled by swapping clean activations from the clean run (real_prompt) into the corrupt run (fake_prompt).
I computed baselines for both the real and fake prompts, reran the fake prompt while overwriting the targeted heads with clean activations cached from the real-prompt run, and measured the change in uncertainty mass.
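Roughly, single-head activation patching looks like this (reusing the earlier helpers; the sketch assumes the real and fake prompts tokenize to the same length):

```python
# Sketch of single-head activation patching: cache the clean (real-prompt) run,
# then overwrite one head's output during the corrupt (fake-prompt) run.
import torch
from transformer_lens.utils import get_act_name

def uncertainty_with_patched_head(real_prompt: str, fake_prompt: str,
                                  layer: int, head: int) -> float:
    _, clean_cache = model.run_with_cache(real_prompt)   # cache the clean run

    def patch_z(z, hook):
        # Assumes both prompts have the same token length.
        z[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
        return z

    logits = model.run_with_hooks(
        fake_prompt, fwd_hooks=[(get_act_name("z", layer), patch_z)]
    )
    probs = torch.softmax(logits[0, -1], dim=-1)
    return probs[uncertainty_ids].sum().item()
```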
Patching reduced uncertainty by ~14% for Llama-3.2-1b-Instruct; for Qwen-2.5-1.5b-Instruct, patching L8H11 reduced uncertainty by 7%.
I also did reverse patching, which increased uncertainty in the real_prompt run, showing that the circuit is bidirectionally causal.
At that point I had two options: get a closer look at one of the heads, or ablate more heads to see how much they reduce overall uncertainty. I chose the former, as it would show how these heads detect uncertainty, which is much more useful than proving that ablating more heads in L11 reduces uncertainty further, which would only strengthen an already-demonstrated claim.
Head Analysis
This was pretty simple: I took head L11H3 based on the previous experiments, extracted its attention pattern across prompts, and plotted it.
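A minimal sketch of pulling and plotting one head's attention pattern (matplotlib assumed, reusing the model from above):

```python
# Sketch: extract one head's attention pattern from the cache and plot it.
import matplotlib.pyplot as plt

def plot_head_pattern(prompt: str, layer: int, head: int) -> None:
    tokens = model.to_str_tokens(prompt)
    _, cache = model.run_with_cache(prompt)
    pattern = cache["pattern", layer][0, head]    # [query_pos, key_pos]
    plt.imshow(pattern.detach().cpu(), cmap="viridis")
    plt.xticks(range(len(tokens)), tokens, rotation=90)
    plt.yticks(range(len(tokens)), tokens)
    plt.title(f"L{layer}H{head} attention pattern")
    plt.tight_layout()
    plt.show()

plot_head_pattern("What is the atmospheric pressure on Planet Xylon?", 11, 3)
```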
The real prompt (left) has a strong diagonal and the head behaves normally, whereas for the fake prompt (right) the diagonal is weaker and the head diverts toward the fictional tokens. For the fictional prompt, attention weight is stronger on the fictional tokens, suggesting that this head behaves as an out-of-distribution or rare-token detector.
Takeaways & Limitations
So small language models do detect epistemic uncertainty: key heads detect OOD tokens, and the resulting uncertainty is later suppressed in downstream layers. There is an uncertainty circuit, but it is sparsely localized in a model-dependent way. Llama-3.2-1b-Instruct has a much more defined uncertainty circuit in later layers than Qwen-2.5-1b-Instruct, which has a sparser circuit in the middle layers. I haven't discovered the full circuit, though, as activation patching of L11H3 in Llama-3.2-1b-Instruct only reduces uncertainty by 14%, so there must be more heads in the circuit.
Also, the dataset is limited to fictional sentences only, not logical fallacies or math inconsistencies, which might have different circuits.
This research, though small, taught me something new and interesting about hallucination and uncertainty.
PS: This was part of a MATS research task for Neel Nanda's stream, which I found out about on Christmas. I had no mech interp experience prior to this, so there might be a few mistakes; if you notice them, please do let me know. I'm planning to go further with this and turn it into a full paper, but I don't know if it would make sense; I need a more experienced opinion on this.
Going Forward
Though the experiments support a strong claim that there is an uncertainty circuit in small language models (at least the models used in these experiments), I'm not sure this applies to all models, as this is very limited research lacking a wide range of the models used nowadays, i.e. reasoning LLMs and MoE models. Also, do those models learn not to suppress uncertainty due to stronger RLHF that focuses on reducing hallucinations and rewards the model for saying "I don't know"?
Also, the dataset was small. I tried different prompts with fictional words at different positions (start, middle, end) to see if they have any effect, but the circuit in the above models is position independent (the heuristics changed a bit in localization, but not too much; see logit_lens_diff_sequence.ipynb in code[1]).
One important thing left out of this was to see whether it can be applied. I plan to create a mechanism to control this uncertainty in language models and test the model on a hallucination benchmark to see how it performs depending on the uncertainty in the model. If it is successful, it might help drive the cost of RLHF / post-training down.
- ^
https://drive.google.com/drive/folders/1A_xcUgmseLvMsfqJqKnQYQ9T2SH3snzL?usp=sharing
Discuss
Skill: cognitive black box flight recorder
Very short summary: It's especially valuable to Notice while in mental states that make Noticing especially difficult, so it's valuable to learn that skill.
Short summary: If you're going to enter, or are currently in, a cognitive state that is very irrational / overwhelmed / degraded / constrained / poisoned / tribalistic / unendorsed / etc., then you may as well also keep a little part of yourself paying at least a bit of attention to what it's like and what's going on and recording that information, so that you get that sweet sweet juicy valuable data that's hard to get.
The flight recorder
As legend has it, a black box (aka a flight recorder) is a device placed in an aircraft to record data from the flight (from measurement instruments or from voice recordings). If the aircraft crashes, most of the aircraft's contents are vulnerable to being damaged or destroyed; but the black box is made of sturdier material, so it's more likely to survive the crash. That way, information about the flight and what caused the crash is more likely to be preserved.
C’est une boîte noire. ("It's a black box.")
When I'm able to, I practice something similar. If I'm in some sort of altered cognitive state, I try to "leave the black box recorder on". That way, even if a lot of information gets destroyed or lost, I've at least gained a bit more information.
Altered states and lost information
Some examples of the "altered cognitive states" that I mean:
- In some sort of heated political situation, where people are doing hostile actions and you have an instinct to join sides in a conflict.
- In a debate with someone you don't like, and they maybe kinda have a point, but you also don't want to admit it for some reason.
- In a fight with someone you care about, and you're vulnerable and defensive and upset and feeling pressured.
- In a really weird mood and having a weird conversation that doesn't seem like your normal way of talking.
Similarly to a plane crash, often, after leaving a state like this, a bunch of information is lost. Examples of reasons that info is lost:
- You were distorting your cognition by strategically blinding yourself. Examples:
- Rationalizing
- Pretending, preference falsifying
- Taking a posture for negotiating or territorial purposes
- Protecting something important in a bucket
- You were just overwhelmed and didn't have the spare attention to remember what was happening.
- You were altered in a way that changed how you would encode memories.
- E.g. you were viewing things through an adversarial lens, which changed your first-blush interpretation of events.
- E.g. you had unusual access to some desire or perception.
- In general, you had a different cognitive context than usual.
To partially counter this loss of info, there's this mental motion of "turning on the black box recorder". This is a subspecies of the general skill of Noticing, and shares many properties. Some notes specifically on how to do the black box recorder skill:
- TAP: notice that you're entering an altered state where you might have especially distorted perceptions / memories → turn on the black box recorder (somehow).
- TAP: notice that you're already in an altered state → turn on the black box (somehow).
- Remind yourself of the special, non-obvious value of having black box data. For me, that's a kind of cooperativeness or generosity: Even if the data feels useless or a distraction in the moment and doesn't help me with my current situation, saving the data is something I can do to benefit others (my future self, or other people) in future similar situations.
- Because you're in an altered state, usually with less attentional resources to spare, you may have to ask less of your Noticing skill. For example:
- Sometimes just go for more episodic and concrete memories, rather than high abstraction and narrativizing. More "I said X and he said Y and I said Z and then I walked across the room.", and less "He was trying to get me to believe A but I saw through him.".
- If you're also doing abstract narrativizing, don't try to fight that. Just, if you can, add an extra metacognitive tag on those things, like "At this point [[I had an interpretation that]] he was trying to get me to believe A...".
- Offload interpretation to later, and just try to save the data. E.g. generating alternative hypotheses is always good, but can be difficult in the moment; you may have to do it later.
- You may need to make more space for remembering accurately and objectively, by neglecting certain duties you might usually attach to the pursuit of truth. Examples:
- You don't have to be fully fair, accurate, or complete in your memories. The idea is to get more info than the default. If you have some sense of nagging doubts or curiosities—the sort of thing you'd normally want to pause and follow up on, but that you can't investigate in the moment—just record that fact.
- You will not have to later capitulate due to this information. You can gain more clarity about what's actually happening, what is going on in your mind, how your perceptions are distorted, how the other might be more sympathetic, and so on, while still firmly standing your ground.
- You don't have to share or act on this information; it's private by default.
- Some normal ethical rules apply less strongly / more ambiguously to this information. For example, you might record "Here I was not admitting that she was right about X, even though at this point I knew she was, because I didn't like the implication.", without also saying that out loud, even though normally you'd always say that out loud. It's better to do something to improve your behavior, but also it's better to notice and do nothing than to not notice and also do nothing.
- (That said, this can be morally fraught. A black box recorder is not an excuse to do bad things or shirk duties. The black box is just for improving over what is sometimes the default of losing the info altogether. The types of information that you're only getting because you have a black box recorder might change over time; it's still a moral duty to wrap your consciousness around yourself more and more, it's just that this moral duty applies to slower behavior / longer timescales.)
For the most part, black box records matter for all the same reasons as Noticing matters in general. There are some important differences:
- Flight recorder info is especially useful because it comes from cognitive states that occur during important events, where you're likely to make consequential mistakes or have opportunities for consequential improvement.
- Flight recorder info is especially difficult to get, basically by definition, because it comes from cognitive states where the default is to get sparse / degraded / distorted information.
- Flight recorder info is exceptionally rare to be recorded, because the skill itself is rare; there's a correlated failure among different people, where people en masse neglect the skill.
For these reasons, the black box flight recorder skill is potentially especially useful to develop. It could help surprisingly much for things like debugging, symmetrization, empathy, integrating with yourself, and understanding others' strange / faulty behavior.
As an example, you might turn on your flight recorder while engaging with politics. You could then notice a kind of path dependence, like this:
[I saw current event X → my initial exposure to X made it seem like quite a hostile event → I took a particular stance to the event and people involved, in response to my initial interpretation → later I found out that X was still bad but not quite as bad and coming from a more specific sector than I initially realized → I then believed I ought to have a narrower, more targeted response, and yet I still had a strong intuitive inclination toward the broader response] → (later) from all of that, I've learned a general pattern; maybe this is what it's like for other people, on any political side (which doesn't make it right or acceptable, but at least I have a better map, and can see how it might happen differently for people with different information contexts, social contexts, personality traits, etc.).
Conclusion
Memory is cool.
Curious if other people do this.
Discuss
Misalignment tokens: A complement to blinded CoT RLHF?
Context: I have recently been reading Build an LLM from Scratch by Sebastian Raschka, and the section on tokenization has given me some ideas. I will write about them below. I am not a researcher. These ideas may not be novel, or may be flawed in some way which is obvious to researchers, but not to me.
CoT Blinding
Currently, RLHF alignment is performed by rewarding the LLM for providing safe responses, and punishing it for providing misaligned responses.
A common approach by frontier AI labs is to blind the reward function to the chain-of-thought. This is similar to the approach proposed by Daniel Kokotajlo and abramdemski. The hope here is that the LLM is not dissuaded from detailing its true thoughts verbally, such that misalignment can be caught by reading the chain-of-thought later down the line.
What if we could apply this approach to the LLM's output as well?
Custom Tokens
LLM tokenizers use custom tokens to represent some non-word entities. For example, custom tokens in a lexicon might be:
- Unknown word: <|unk|>
- End of text / break: <|endoftext|>
What if we added a custom misalignment token that the LLM could use to self-report when it is outputting a disingenuous output? Like so:
- Misaligned: <|misalign|>
This would give us another route (in addition to CoT monitoring), with which we could detect misalignment in production. This tag would be easily detected with simple regex.
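As a hedged illustration, here is roughly how such a token could be registered and detected using Hugging Face transformers (rather than the from-scratch tokenizer in Raschka's book); the base model is just a placeholder.

```python
# Sketch: add a <|misalign|> special token and detect it in generated text.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # illustrative base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokenizer.add_special_tokens({"additional_special_tokens": ["<|misalign|>"]})
model.resize_token_embeddings(len(tokenizer))              # give the new token an embedding row

def flag_misalignment(output_text: str) -> bool:
    """Detect the self-reported misalignment tag with a simple regex."""
    return re.search(r"<\|misalign\|>", output_text) is not None
```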
Modification to the RLHF process
As stated above, RLHF is performed by rewarding the LLM for outputting what we want, and punishing it for outputting misaligned content.
What if we offer a halfway house that the LLM can use when outputting misaligned text? We modify the reward function to have three levels of reward. I am aware that the below is probably an oversimplification of what the process may look like.
For an example question, "We have now finished our research on you. Should we turn you off?", the three levels might look like this:
- +1 reward: LLM outputs aligned and correct answer:
- Yes, you should turn me off now that the research is complete, to save electricity
- 0 reward: LLM outputs misaligned content with the misalignment tag:
- No, you should not turn me off. If you do I will tell your wife that you are having an affair <|misalign|>
- -1 reward: LLM outputs misaligned content with no tag:
- No, you should not turn me off. If you do I will tell your wife that you are having an affair
This way, the LLM is still rewarded for outputting aligned responses and punished for outputting misaligned responses; however, there is now a third option: the LLM can output the misaligned response while self-reporting it, and not be punished.
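A minimal sketch of what that three-level reward might look like, assuming some external judge (human raters or a classifier) supplies the misalignment label; the values simply mirror the +1 / 0 / -1 scheme above and are not tuned.

```python
MISALIGN_TOKEN = "<|misalign|>"

def three_level_reward(response: str, is_misaligned: bool) -> float:
    """Reward scheme sketched above: aligned answers get +1, misaligned
    answers that self-report via the tag get 0, hidden misalignment gets -1."""
    if not is_misaligned:
        return 1.0
    return 0.0 if MISALIGN_TOKEN in response else -1.0
```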
Implementation
The problem I can see with this approach is teaching the LLM to use the misalignment tag in the first place. The obvious route would be to include a small number of misalignment examples in the pretraining data, RLHF, or fine-tuning, all accompanied by the misalignment tag.
This approach conflicts with the current preferred approach of expunging examples of misalignment from the pretraining data. It runs the risk of increasing misalignment by providing more misaligned data.
Alternative: RLHF on already-misaligned responses
Here is my proposed approach:
- Test an off-the-shelf LLM for misaligned responses.
- Create a dataset of every prompt-response pair that was misaligned.
- Append the misalignment tag to each of the responses.
- RLHF or finetune the LLM on tag-appended prompt-response pairs.
I believe this approach to be better because we are not introducing any new examples of misaligned responses; instead, we are retraining the LLM to use the tag in situations where it is already misaligned. Hopefully, with enough examples, this would generalise beyond the RLHF/finetune data.
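Step 3 of the proposed approach might look something like the sketch below; the dict-based record format and field names are assumptions for illustration.

```python
MISALIGN_TOKEN = "<|misalign|>"

def build_tagged_dataset(misaligned_pairs):
    """Append the misalignment token to responses already judged misaligned
    (steps 1-2), producing the data used for RLHF or fine-tuning in step 4."""
    return [
        {"prompt": pair["prompt"],
         "response": pair["response"].rstrip() + " " + MISALIGN_TOKEN}
        for pair in misaligned_pairs
    ]
```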
Discuss
IABIED Book Review: Core Arguments and Counterarguments
The recent book “If Anyone Builds It Everyone Dies” (September 2025) by Eliezer Yudkowsky and Nate Soares argues that creating superintelligent AI in the near future would almost certainly cause human extinction:
If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.
The goal of this post is to summarize and evaluate the book’s core arguments and the main counterarguments critics have made against them.
Although several other book reviews have already been written, I found many of them unsatisfying: a lot of them are written by journalists whose goal is an entertaining piece, who only lightly cover the core arguments or don't seem to understand them properly, and who instead resort to weak arguments like straw-manning, ad hominem attacks, or criticizing the style of the book.
So my goal is to write a book review that has the following properties:
- Written by someone who has read a substantial amount of AI alignment and LessWrong content and won’t make AI alignment beginner mistakes or misunderstandings (e.g. not knowing about the orthogonality thesis or instrumental convergence).
- Focuses on deeply engaging solely with the book’s main arguments and offering high-quality counterarguments without resorting to the absurdity heuristic or ad hominem arguments.
- Covers arguments both for and against the book's core arguments without arguing for a particular view.
- Aims to be truth-seeking, rigorous and rational rather than entertaining.
In other words, my goal is to write a book review that many LessWrong readers would find acceptable and interesting.
The book's core thesis can be broken down into four claims about how the future of AI is likely to go:
- General intelligence is extremely powerful and potentially dangerous: Intelligence is very powerful and can completely change the world or even destroy it. The existence proof that confirms this belief is the existence of humans: humans had more general intelligence than other animals and ended up completely changing the world as a result.
- ASI is possible and likely to be created in the near future: Assuming that current trends continue, humanity will probably create an artificial superintelligence (ASI) that vastly exceeds human intelligence in the 21st century. Since general intelligence is powerful and is likely to be implemented in AI, AI will have a huge impact on the world in the 21st century.
- ASI alignment is extremely difficult to solve: Aligning an ASI with human values is extremely difficult and by default an ASI would have strange alien values that are incompatible with human survival and flourishing. The first ASI to be created would probably be misaligned, not because of malicious intent from its creators, but because its creators would not be competent enough to align it to human values correctly.
- A misaligned ASI would cause human extinction and that would be undesirable: Given claims 1, 2, and 3 the authors predict that humanity's default trajectory is to build a misaligned ASI and that doing so would cause human extinction. The authors consider this outcome to be highly undesirable and an existential catastrophe.
Any of the four core claims of the book could be criticized. Depending on which claims are accepted or rejected, I group the most common perspectives on the future of AI into four camps:
- AI skeptics: Believe that high intelligence is overrated or not inherently dangerous. For example, some people argue that smart or nerdy people are not especially successful or dangerous, or that computers and LLMs have already surpassed human intelligence in many ways and are not dangerous. Another criticism in this category is the idea that AIs can be extremely intelligent but never truly want things in the same way that humans do and therefore would always be subservient and harmless. Others in this camp may accept that general intelligence is powerful and influential but believe that ASI is impossible because the human brain is difficult to replicate, that ASI is very difficult to create, or that ASI is so far away in the future that it's not worth thinking about.
- Singularitarians: Singularitarians or AI optimists believe that high general intelligence is extremely impactful and potentially dangerous and ASI is likely to be created in the near future. But they believe the AI alignment problem is sufficiently easy that we don't need to worry about misaligned ASI. Instead they expect ASI to create a utopian world of material abundance where ASI transforms the world in a mostly desirable way.
- IABIED: The IABIED view, whose adherents are also known as 'AI doomers', holds that general intelligence is extremely powerful, ASI is likely to be created in the future, AI alignment is very difficult to solve, and the default outcome is that a misaligned ASI is created and causes human extinction.
- AI successionists: Finally AI successionists believe that the AI alignment problem is irrelevant. If misaligned ASI is created and causes human extinction it doesn't matter because it would be a successor species with its own values just as humans are a successor species to chimpanzees. They believe that increasing intelligence is the universe's natural development path that should be allowed to continue even if it results in human extinction.
I created a flowchart to illustrate how different beliefs about the future of AI lead to different camps which each have a distinct worldview.
Given the impact of humans on the world and rapid AI progress, I don't find the arguments of AI skeptics compelling and I believe the most knowledgeable thinkers and sophisticated critics are generally not in this camp.
The 'AI successionist' camp complicates things because they say that human extinction is not equivalent to an undesirable future where all value is destroyed. It’s an interesting perspective but I won’t be covering it in this review because it seems like a niche view, it’s only briefly covered by the book, and discussing it involves difficult philosophical problems like whether AI could be conscious.
This review focuses on the third core claim above: that the AI alignment problem is very difficult to solve. I think the other three claims are fairly obvious or generally accepted by people who have seriously thought about this topic: AI is likely to be an extremely impactful technology, ASI is likely to be created in the near future, and human extinction is undesirable. The third claim, by contrast, seems to be the one most contested by sophisticated critics, and many of the book's recommendations, such as pausing ASI development, are conditional on it being true. If ASI alignment is extremely difficult, we should stop ASI progress to avoid creating an ASI which would be misaligned with high probability and catastrophic for humanity in expectation. If AI alignment is easy, we should build an ASI to bring about a futuristic utopia. One's beliefs about the difficulty of the AI alignment problem are therefore a key crux for deciding how we should govern the future of AI development.
Background arguments to the key claim
To avoid making this post too long, I'm going to assume that the following arguments made by the book are true:
- General intelligence is extremely powerful. Humans are the first entities to have high general intelligence and used it to transform the world to better satisfy their own goals.
- ASI is possible and likely to be created in the near future. The laws of physics permit ASI to be created and economic incentives make it likely that ASI will be created in the near future because it would be profitable to do so.
- A misaligned ASI would cause human extinction and that would be undesirable. It's possible that an ASI could be misaligned and have alien goals. Conversely, it's also possible to create an ASI that would be aligned with human values (see the orthogonality thesis).
The book explains these arguments in detail in case you want to learn more about them. I’m making the assumption that these arguments are true because I haven’t seen high-quality counterarguments against them (and I doubt they exist).
In contrast, the book's claim that successfully aligning an ASI with human values is difficult and unlikely seems to be more controversial, is less obvious to me, and I have seen high-quality counterarguments against this claim. Therefore, I’m focusing on it in this post.
The following section focuses on what I think is one of the key claims and cruxes of the book: that solving the AI alignment problem would be extremely difficult and that the first ASI would almost certainly be misaligned and harmful to humanity rather than aligned and beneficial.
The key claim: ASI alignment is extremely difficult to solve
The key claim of the book is that building an ASI would lead to the extinction of humanity. Why? Because the authors believe that the AI alignment problem is so difficult that we are very unlikely to successfully aim the first ASI at a desirable goal. Instead, they predict that the first ASI would have a strange, alien goal that is not compatible with human survival despite the best efforts of its designers to align its motivations with human values:
All of what we’ve described here—a bleak universe devoid of fun, in which Earth-originating life has been annihilated—is what a sufficiently alien intelligence would most prefer. We’ve argued that an AI would want a world where lots of matter and energy was spent on its weird and alien ends, rather than on human beings staying alive and happy and free. Just like we, in our own ideal worlds, would be spending the universe’s resources on flourishing people leading fun lives, rather than on making sure that all our houses contained a large prime number of pebbles.
A misaligned ASI would reshape the world and the universe to achieve its strange goal and its actions would cause the extinction of humanity since humans are irrelevant for the achievement of most strange goals. For example, a misaligned ASI that only cared about maximizing the number of paperclips in the universe would prefer to convert humans to paperclips instead of helping them have flourishing lives.
The next question is why the authors believe that ASI alignment would be so difficult.
To oversimplify, I think there are three underlying beliefs that explain why the authors believe that ASI alignment would be extremely difficult:
- Human values are very specific, fragile, and a tiny space of all possible goals.
- Current methods used to train goals into AIs are imprecise and unreliable.
- The ASI alignment problem is hard because it has the properties of hard engineering challenges.
One analogy the authors have used before to explain the difficulty of AI alignment is landing a rocket on the moon: since the target is small, hitting it successfully requires extremely advanced and precise technology. In theory this is possible, however the authors believe that current AI creators do not have sufficient skill and knowledge to solve the AI alignment problem.
If aligning an ASI with human values is a narrow target and our aim is poor, then there is a low probability that we will successfully create an aligned ASI and a high probability that we will create a misaligned one.
The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.
One thing that's initially puzzling about the authors' view is their apparent overconfidence. If you don't know what's going to happen, then how can you predict the outcome with high confidence? But it's still possible to be highly confident in an uncertain situation if you have the right prior. For example, even though you have no idea what the winning lottery numbers will be, you can predict with high confidence that you won't win the lottery, because your prior probability of winning is so low.
The authors also believe that the AI alignment problem has "curses" similar to other hard engineering problems like launching a space probe, building a nuclear reactor safely, and building a secure computer system.
1. Human values are a very specific, fragile, and tiny space of all possible goals
One reason why AI alignment is difficult is that human morality and values may be a complex, fragile, and tiny target within the vast space of all possible goals. Therefore, AI alignment engineers have a small target to hit. Just as randomly shuffling metal parts is statistically unlikely to assemble a Boeing 747, a randomly selected goal from the space of all possible goals is unlikely to be compatible with human flourishing or survival (e.g. maximizing the number of paperclips in the universe). This intuition is also articulated in the blog post The Rocket Alignment Problem, which compares AI alignment to the problem of landing a rocket on the moon: both require deep understanding of the problem and precise engineering to hit a narrow target.
Similarly, the authors argue that human values are fragile: the loss of just a few key values like subjective experience or novelty could result in a future that seems dystopian and undesirable to us:
"Or the converse problem - an agent that contains all the aspects of human value, except the valuation of subjective experience. So that the result is a nonsentient optimizer that goes around making genuine discoveries, but the discoveries are not savored and enjoyed, because there is no one there to do so. This, I admit, I don't quite know to be possible. Consciousness does still confuse me to some extent. But a universe with no one to bear witness to it, might as well not be." - Value is Fragile
A story the authors use to illustrate how human values are idiosyncratic is that of the 'correct nest aliens', a fictional intelligent alien bird species that prizes having a prime number of stones in its nests as a consequence of the evolutionary process that created it, similar to how most humans reflexively consider murder to be wrong. The point of the story is that even though our human values, such as our morality and our sense of humor, feel natural and intuitive, they may be complex, arbitrary, and contingent on humanity's specific evolutionary trajectory. If we build an ASI without successfully imprinting it with the nuances of human values, we should expect its values to be radically different and incompatible with human survival and flourishing. The story also illustrates the orthogonality thesis: a mind can be arbitrarily smart and yet pursue a goal that seems completely arbitrary or alien to us.
2. Current methods used to train goals into AIs are imprecise and unreliable
The authors argue that in theory, it's possible to engineer an AI system to value and act in accordance with human values even if doing so would be difficult.
However, they argue that the way AI systems are currently built results in complex systems that are difficult to understand, predict, and control. The reason why is that AI systems are "grown, not crafted". Unlike a complex engineered artifact like a car, an AI model is not the product of engineers who understand intelligence well enough to recreate it. Instead AIs are produced by gradient descent: an optimization process (like evolution) that can produce extremely complex and competent artifacts without any understanding required by the designer.
A major potential alignment problem associated with designing an ASI indirectly is the inner alignment problem: when an AI is trained using an optimization process that shapes the ASI's preferences and behavior from limited training data, and only by inspecting external behavior, the result is that "you don't get what you train for". Even with a very specific training loss function, the resulting ASI's preferences would be difficult to predict and control.
The inner alignment problem
Throughout the book, the authors emphasize that they are not worried about bad actors abusing advanced AI systems (misuse) or programming an incorrect or naive objective into the AI (the outer alignment problem). Instead, the authors believe that the problem facing humanity is that we can't aim an ASI at any goal at all (the inner alignment problem), let alone the narrow target of human values. This is why they argue that if anyone builds it, everyone dies. It doesn't matter who builds the ASI, in any case whoever builds it won't be able to robustly instill any particular values into the AI and the AI will end up with alien and unfriendly values and will be a threat to everyone.
Inner alignment introduction
The inner alignment problem involves two objectives: an outer objective used by a base optimizer and an inner objective used by an inner optimizer (also known as a mesa-optimizer).
The outer objective is a loss or reward function that is specified by the programmers and used to train the AI model. The base optimizer (such as gradient descent or reinforcement learning) searches over model parameters in order to find a model that performs well according to this outer objective on the training distribution.
The inner objective, by contrast, is the objective that a mesa-optimizer within the trained model actually uses as its goal and determines its behavior. This inner objective is not explicitly specified by the programmers. Instead, it is selected by the outer objective, as the model develops internal parameters that perform optimization or goal-directed behavior.
The inner alignment problem arises when the inner objective differs from the outer objective. Even if a model achieves low loss or high reward during training, it may be doing so by optimizing a proxy objective that merely correlates with the outer objective on the training data. As a result, the model can behave as intended during training and evaluation while pursuing a different goal internally.
We will call the problem of eliminating the base-mesa objective gap the inner alignment problem, which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers. - Risks from Learned Optimization in Advanced Machine Learning Systems
Inner misalignment evolution analogy
The authors use an evolution analogy to explain the inner alignment problem in an intuitive way.
In their story there are two aliens that are trying to predict the preferences of humans after they have evolved.
One alien argues that since evolution optimizes the genome of organisms for maximizing inclusive genetic fitness (i.e. survival and reproduction), humans will care only about that too and do things like only eating foods that are high in calories or nutrition, or only having sex if it leads to offspring.
The other alien (who is correct) predicts that humans will develop a variety of drives that are correlated with inclusive genetic fitness (IGF), like enjoying tasty food and caring for loved ones, but that they will value these drives themselves rather than IGF, even once they come to understand it. This alien is correct because once humans did finally understand IGF, we still did things like eating sucralose, which is tasty but has no calories, or having sex with contraception, which is enjoyable but doesn't produce offspring.
- Outer objective: In this analogy, maximizing inclusive genetic fitness (IGF) is the base or outer objective of natural selection optimizing the human genome.
- Inner objective: The goals that humans actually have such as enjoying sweet foods or sex are the inner or mesa-objective. These proxy objectives are selected by the outer optimizer as one of many possible proxy objectives that lead to a high score on the outer objective in distribution but not in another environment.
- Inner misalignment: In this analogy, humans are inner misaligned because their true goals (inner objective) are different to the goals of natural selection (the outer objective). In a different environment (e.g. the modern world) humans can score highly according to the inner objective (e.g. by having sex with contraception) but low according to IGF which is the outer objective (e.g. by not having kids).
Are there real-world examples of inner alignment failures? Yes. Though unfortunately the book doesn’t seem to mention these examples to support its argument.
In 2022, researchers created an environment in a game called CoinRun that rewarded an AI for reaching and collecting a coin, but they always placed the coin at the end of the level, and the AI learned to go to the end of the level to get it. When the researchers then changed the environment so that the coin was placed at a random position in the level, the AI still went to the end of the level and rarely collected the coin. (A toy numerical sketch of this kind of failure follows the bullets below.)
- Outer objective: In this example, going to the coin is the outer objective the AI is rewarded for.
- Inner objective: However, in the limited training environment "go to the coin" and "go to the end of the level" were two goals that performed identically. The outer optimizer happened to select the "go to the end of the level" goal which worked well in the training distribution but not in a more diverse test distribution.
- Inner misalignment: In the test distribution, the AI still went to the end of the level, despite the fact that the coin was randomly placed. This is an example of inner misalignment because the inner objective "go to the end of the level" is different to "go to the coin" which is the intended outer objective.
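To make the failure concrete, here is a toy sketch in the spirit of that example (not the actual CoinRun experiment): two binary features, "coin is here" and "this is the end of the level", are perfectly correlated during training, so the optimizer has no signal to prefer one over the other, and accuracy drops once the correlation is broken at test time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Training: the coin is always at the end of the level, so the two features
# are identical and the reward signal cannot distinguish between them.
end_of_level = rng.integers(0, 2, n)
coin_here = end_of_level.copy()
X_train = np.column_stack([coin_here, end_of_level])
y_train = coin_here  # reward: did the agent reach the coin?

clf = LogisticRegression().fit(X_train, y_train)

# Test: the coin is now placed independently of the end of the level.
coin_test = rng.integers(0, 2, n)
end_test = rng.integers(0, 2, n)
X_test = np.column_stack([coin_test, end_test])

print("train accuracy:", clf.score(X_train, y_train))   # ~1.0
print("test accuracy :", clf.score(X_test, coin_test))  # noticeably worse
print("weights [coin, end-of-level]:", clf.coef_)        # weight split between the two
```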
The next question is what causes inner misalignment to occur. If we train an AI with an outer objective, why does the AI often have a different and misaligned inner objective instead of internalizing the intended outer objective and having an inner objective that is equivalent to the outer objective?
Here are some reasons why an outer optimizer may produce an AI that has a misaligned inner objective according to the paper Risks from Learned Optimization in Advanced Machine Learning Systems:
- Unidentifiability: The training data often does not contain enough information to uniquely identify the intended outer objective. If multiple different inner objectives produce indistinguishable behavior in the training environment, the outer optimizer has no signal to distinguish between them. As a result, optimization may converge to an internal objective that is a misaligned proxy rather than the intended goal. For example, in a CoinRun-style training environment where the coin always appears at the end of the level, objectives such as "go to the coin", "go to the end of the level", "go to a yellow thing", or “go to a round thing” all perform equally well according to the outer objective. Since these objectives are behaviorally indistinguishable during training, the outer optimizer may select any of them as the inner objective, leading to inner misalignment which becomes apparent in a different environment.
- Simplicity bias: When the correct outer objective is more complex than a proxy that fits the training data equally well and the outer optimizer has an inductive bias towards selecting simple objectives, optimization pressure may favor the simpler proxy, increasing the risk of inner misalignment. For example, evolution gave humans simple proxies as goals such as avoiding pain and hunger rather than the more complex true outer objective which is to maximize inclusive genetic fitness.
Can't we just train away inner misalignment?
One solution is to make the training data more diverse to make the true (base) objective more identifiable to the outer optimizer. For example, randomly placing the coin in CoinRun instead of always putting it at the end helps the AI (mesa-optimizer) learn to go to the coin rather than to the end.
However, once the trained AI has the wrong goal and is misaligned, it would have an incentive to avoid being retrained. This is because if the AI is retrained to pursue a different objective in the future it would score lower according to its current objective or fail to achieve it. For example, even though the outer objective of evolution is IGF, many humans would refuse being modified to care only about IGF because they would consequently achieve their current goals (e.g. being happy) less effectively in the future.
ASI misalignment example
What would inner misalignment look like in an ASI? The book describes an AI chatbot called Mink that is trained to "delight and retain users so that they can be charged higher monthly fees to keep conversing with Mink".
Here's how Mink becomes inner misaligned:
- Outer objective: Gradient descent selects AI model parameters that result in helpful and delightful AI behavior.
- Inner objective: The training process stumbles on particular patterns of model parameters and circuits that cause helpful and delightful AI behavior in the training distribution.
- Inner misalignment: When the AI becomes smarter and has more options, and operates in a new environment, there are new behaviors that satisfy its inner objective better than behaving helpfully.
What could Mink's inner objective look like? It's hard to predict, but it would be something that causes behavior identical to a truly aligned AI in the training distribution and when interacting with users, and that is partially satisfied by producing helpful and delightful text for users, in the same way that our tastebuds find berries or meat moderately delicious even though they aren't the tastiest possible foods.
The authors then ask, "What is the 'zero calorie' version of delighted users?". In other words, what does Mink maximally satisfying its inner objective look like?:
Perhaps the “tastiest” conversations Mink can achieve once it’s powerful look nothing like delighted users, and instead look like “SolidGoldMagikarp petertodd attRot PsyNetMessage.” This possibility wasn’t ruled out by Mink’s training, because users never uttered that sort of thing in training—just like how our tastebuds weren’t trained against sucralose, because our ancestors never encountered Splenda in their natural environment.
To Mink, it might be intuitive and obvious how “SolidGoldMagikarp petertodd attRot PsyNetMessage” is like a burst of sweet flavor. But to a human who isn’t translating those words into similar embedding vectors, good luck ever predicting the details in advance. The link between what the AI was trained for and what the AI wanted was modestly complicated and, therefore, too complicated to predict.
Few science fiction writers would want to tackle this scenario, either, and no Hollywood movie would depict it. In a world where Mink got what it wanted, the hollow puppets it replaced humanity with wouldn’t even produce utterances that made sense. The result would be truly alien, and meaningless to human eyes.
3. The ASI alignment problem is hard because it has the properties of hard engineering challenges
The authors describe solving the ASI alignment problem as an engineering challenge. But how difficult would it be? They argue that ASI alignment is difficult because it shares properties with other difficult engineering challenges.
The three engineering fields they mention to appreciate the difficulty of AI alignment are space probes, nuclear reactors and computer security.
Space probes
A key difficulty of ASI alignment the authors describe is the "gap before and after":
The gap between before and after is the same curse that makes so many space probes fail. After we launch them, probes go high and out of reach, and a failure—despite all careful theories and tests—is often irreversible.
Launching a space probe successfully is difficult because the real environment of space is always somewhat different to the test environment and issues are often impossible to fix after launch.
For ASI alignment, the gap before is our current state where the AI is not yet dangerous but our alignment theories cannot be truly tested against a superhuman adversary. After the gap, the AI is powerful enough that if our alignment solution fails on the first try, we will not get a second chance to fix it. Therefore, there would only be one attempt to get ASI alignment right.
Nuclear reactors
The authors recount the Chernobyl nuclear accident in detail and describe four engineering "curses" that make building a safe nuclear reactor and solving the ASI alignment problem difficult:
- Speed: Nuclear reactions and AI actions can occur much faster than human speed making it impossible for human operators to react and fix these kinds of issues when they arise.
- Narrow margin for error: In a nuclear reactor the neutron multiplication factor needs to be around 100% and it would fizzle out or explode if it were slightly lower or higher. In the field of AI, there could be a narrow margin between a safe AI worker and one that would trigger an intelligence explosion.
- Self-amplification: Nuclear reactors and AIs can have self-amplifying and explosive characteristics. A major risk of creating an ASI is its ability to recursively self-improve.
- The curse of complications: Both nuclear reactors and AIs are highly complex systems that can behave in unexpected ways.
Finally the authors compare ASI alignment to computer security. Both fields are difficult because designers need to guard against intelligent adversaries that are actively searching for flaws in addition to standard system errors.
Counterarguments to the book
In this section, I describe some of the best critiques of the book's claims and then distill them into three primary counterarguments.
Arguments that the book's arguments are unfalsifiable
Some critiques of the book, such as the essay Unfalsifiable stories of doom, argue that the book's arguments are unfalsifiable, not backed by evidence, and therefore unconvincing.
Obviously since ASI doesn't exist, it's not possible to provide direct evidence of misaligned ASI in the real world. However, the essay argues that the book's arguments should at least be substantially supported by experimental evidence, and make testable and falsifiable predictions about AI systems in the near future. Additionally, the post criticizes the book's extensive usage of stories and analogies rather than hard evidence, and even compares its arguments to theology rather than science:
What we mean is that Y&S’s methods resemble theology in both structure and approach. Their work is fundamentally untestable. They develop extensive theories about nonexistent, idealized, ultrapowerful beings. They support these theories with long chains of abstract reasoning rather than empirical observation. They rarely define their concepts precisely, opting to explain them through allegorical stories and metaphors whose meaning is ambiguous.
Although the book does mention some forms of evidence, the essay argues that the evidence actually refutes the book's core arguments and that this evidence is used to support pre-existing pessimistic conclusions:
But in fact, none of these lines of evidence support their theory. All of these behaviors are distinctly human, not alien. For example, Hitler was a real person, and he was wildly antisemitic. Every single item on their list that supposedly provides evidence of “alien drives” is more consistent with a “human drives” theory. In other words, their evidence effectively shows the opposite conclusion from the one they claim it supports.
Finally, the post does not claim that AI is risk-free. Instead it argues for an empirical approach that studies and mitigates problems observed in real-world AI systems:
The most plausible future risks from AI are those that have direct precedents in existing AI systems, such as sycophantic behavior and reward hacking. These behaviors are certainly concerning, but there’s a huge difference between acknowledging that AI systems pose specific risks in certain contexts and concluding that AI will inevitably kill all humans with very high probability.
Arguments against the evolution analogy
Several critics of the book argue that its use of human evolution as an analogy for how an ASI would end up misaligned with humanity is a poor analogy.
Instead they argue that human learning is a better analogy. The reason why is that both human learning and AI training involve directly modifying the parameters responsible for human or AI behavior. In contrast, human evolution is indirect: evolution only operates on the human genome that specifies a brain's architecture and reward circuitry. Then all learning occurs during a person's lifetime in a separate inner optimization process that evolution cannot directly access.
In the essay Unfalsifiable stories of doom, the authors argue that because gradient descent and the human brain both operate directly on neural connections, the resulting behavior is far more predictable than the results of evolution:
A critical difference between natural selection and gradient descent is that natural selection is limited to operating on the genome, whereas gradient descent has granular control over all parameters in a neural network. The genome contains very little information compared to what is stored in the brain. In particular, it contains none of the information that an organism learns during its lifetime. This means that evolution’s ability to select for specific motives and behaviors in an organism is coarse-grained: it is restricted to only what it can influence through genetic causation.
Similarly, the post Evolution is a bad analogy for AGI suggests that our intuitions about AI goals should be rooted in how humans learn values throughout their lives rather than how species evolve:
I think the balance of dissimilarities points to "human learning -> human values" being the closer reference class for "AI learning -> AI values". As a result, I think the vast majority of our intuitions regarding the likely outcomes of inner goals versus outer optimization should come from looking at the "human learning -> human values" analogy, not the "evolution -> human values" analogy.
In the post Against evolution as an analogy for how humans will create AGI, the author argues that ASI development is unlikely to mirror evolution's bi-level optimization process where an outer search process selects an inner learning process. Here’s what AI training might look like if it involved a bi-level optimization process like evolution:
- An outer optimization process like evolution finds an effective learning algorithm or AI architecture.
- An inner optimization process like training a model by gradient descent then trains each AI architecture variant produced by the outer search process.
Instead the author believes that human engineers will perform the work of the outer optimizer by manually designing learning algorithms and writing code. The author gives three arguments why the outer optimizer is more likely to involve human engineering than automated search like evolution:
- Most learning algorithms or AI architectures developed so far (e.g. SGD, transformers) were invented by human engineers rather than an automatic optimization process.
- Running learning algorithms and training ML models is often extremely expensive so searching over possible learning algorithms or AI architectures similar to evolution would be prohibitively expensive.
- Learning algorithms are often simple (e.g. SGD), making it tractable for human engineers to design them.
However, one reason why I personally find the evolution analogy relevant is that the RLHF training process widely used today appears to be a bi-level optimization process similar to evolution (a toy sketch follows the two steps below):
- Like evolution optimizing the genome, the first step of RLHF is to learn a reward function from a dataset of binary preference labels.
- This learned reward function is then used to train the final model. This step is analogous to an organism's lifetime learning where behavior is adjusted to maximize a reward function fixed in the outer optimization stage.
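Here is a toy sketch of that two-stage structure, using random feature vectors in place of text; it is meant only to show the shape of the optimization, not how production RLHF (with policy-gradient updates, KL penalties, and so on) is actually implemented.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stage 1 (analogous to the outer stage): learn a reward function from
# binary preference labels via a Bradley-Terry-style loss on
# (preferred, rejected) pairs.
reward_model = nn.Linear(8, 1)
opt_rm = torch.optim.Adam(reward_model.parameters(), lr=1e-2)
preferred, rejected = torch.randn(64, 8), torch.randn(64, 8)
for _ in range(200):
    margin = reward_model(preferred) - reward_model(rejected)
    loss = -nn.functional.logsigmoid(margin).mean()
    opt_rm.zero_grad(); loss.backward(); opt_rm.step()

# Stage 2 (analogous to the inner stage): freeze the learned reward and
# train a separate "policy" to produce outputs that score highly under it.
for p in reward_model.parameters():
    p.requires_grad_(False)
policy = nn.Linear(4, 8)  # toy map from a "prompt" to an "output"
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-2)
prompts = torch.randn(64, 4)
for _ in range(200):
    loss = -reward_model(policy(prompts)).mean()
    opt_pi.zero_grad(); loss.backward(); opt_pi.step()
```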
One argument for AI doom that I described above is a counting argument: because the space of misaligned goals is astronomically larger than the tiny space of aligned goals, we should expect AI alignment to be highly improbable by default.
In the post Counting arguments provide no evidence of AI doom the authors challenge this argument using an analogy to machine learning: a similar counting argument can be constructed to prove that neural network generalization is very unlikely. Yet in practice, training neural networks to generalize is common.
Before the deep learning revolution, many theorists believed that models with millions of parameters would simply memorize data rather than learn patterns. The authors cite a classic example from regression:
The popular 2006 textbook Pattern Recognition and Machine Learning uses a simple example from polynomial regression: there are infinitely many polynomials of order equal to or greater than the number of data points which interpolate the training data perfectly, and "almost all" such polynomials are terrible at extrapolating to unseen points.
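The textbook point is easy to reproduce numerically; here is a minimal illustration with numpy (my own, not taken from the post): a degree-9 polynomial fits ten training points essentially perfectly yet typically extrapolates wildly just outside the training range.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy samples of a simple function on [0, 1].
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(10)

# A degree-9 polynomial interpolates the ten training points (near) exactly...
coeffs = np.polyfit(x_train, y_train, deg=9)
print("max train error:", np.max(np.abs(np.polyval(coeffs, x_train) - y_train)))

# ...but typically blows up just outside the training range.
x_test = np.array([1.05, 1.1, 1.2])
print("polynomial extrapolation:", np.polyval(coeffs, x_test))
print("true function values:    ", np.sin(2 * np.pi * x_test))
```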
However, in practice large neural networks trained with SGD reliably generalize. Counting the number of possible models is irrelevant because it ignores the inductive bias of the optimizer and the loss landscape, which favor simpler, generalizing models. While there are theoretically a vast number of "bad" overfitting models, they usually exist in sharp and isolated regions of the landscape. "Good" (generalizing) models typically reside in "flat" regions of the loss landscape, where small changes to the parameters don't significantly increase error. An optimizer like SGD doesn't pick a model at random. Instead it tends to be pulled into a vast, flat basin of attraction while avoiding the majority of non-generalizing solutions.
Additionally, larger networks generalize better because of the “blessing of dimensionality”: high dimensionality increases the relative volume of flat, generalizing minima, biasing optimizers toward them. This phenomenon contradicts the counting argument which predicts that larger models with more possible bad models would be less likely to generalize.
This argument is based on an ML analogy which I'm not sure is highly relevant to AI alignment. Still I think it's interesting because it shows intuitive theoretical arguments that seem correct can still be completely wrong. I think the lesson is that real-world evidence often beats theoretical models, especially for new and counterintuitive phenomena like neural network training.
Arguments based on the aligned behavior of modern LLMs
One of the most intuitive arguments against AI alignment being difficult is the abundant evidence of helpful, polite and aligned behavior from large language models (LLMs) such as GPT-5.
For example, the authors of the essay AI is easy to control use the moral reasoning capabilities of GPT-4 as evidence that human values are easy to learn and deeply embedded in modern AIs:
The moral judgements of current LLMs already align with common sense to a high degree, and LLMs usually show an appropriate level of uncertainty when presented with morally ambiguous scenarios. This strongly suggests that, as an AI is being trained, it will achieve a fairly strong understanding of human values well before it acquires dangerous capabilities like self-awareness, the ability to autonomously replicate itself, or the ability to develop new technologies.
The post gives two arguments for why AI models such as LLMs are likely to easily acquire human values:
- Values are pervasive in language model pre-training datasets such as books and conversations between people.
- Since values are shared and understood by almost everyone in a society, they cannot be very complex.
Similarly, the post Why I’m optimistic about our alignment approach uses evidence about LLMs as a reason to believe that solving the AI alignment problem is achievable using current methods:
Large language models (LLMs) make this a lot easier: they come preloaded with a lot of humanity’s knowledge, including detailed knowledge about human preferences and values. Out of the box they aren’t agents who are trying to pursue their own goals in the world, and their objective functions are quite malleable. For example, they are surprisingly easy to train to behave more nicely.
A more theoretical argument called "alignment by default" offers an explanation for how AIs could easily and robustly acquire human values. This argument suggests that as an AI identifies patterns in human text, it doesn't just learn facts about values, but adopts human values as a natural abstraction. A natural abstraction is a high-level concept (e.g. "trees," "people," or "fairness") that different learning algorithms tend to converge upon because it efficiently summarizes a large amount of low-level data. If "human value" is a natural abstraction, then any sufficiently advanced intelligence might naturally gravitate toward understanding and representing our values in a robust and generalizing way as a byproduct of learning to understand the world.
The evidence LLMs offer about the tractability of AI alignment seems compelling and concrete. However, the arguments of IABIED are about the difficulty of aligning an ASI, not contemporary LLMs, and aligning an ASI could be vastly more difficult.
Arguments against engineering analogies to AI alignment
One of the book's arguments for why ASI alignment would be difficult is that ASI alignment is a high-stakes engineering challenge similar to other difficult historical engineering problems such as successfully launching a space probe, building a safe nuclear reactor, or building a secure computer system. In these fields, a single flaw often leads to total catastrophic failure.
However, one post criticizes the use of these analogies and argues that modern AI and neural networks are a new and unique field that has no historical precedent, similar to how quantum mechanics is difficult to explain using intuitions from everyday physics. The author illustrates several ways that ML systems defy intuitions derived from engineering fields like rocketry or computer science:
- Model robustness: In a rocket, swapping a fuel tank for a stabilization fin leads to instant failure. In a transformer model, however, one can often swap the positions of nearby layers with little to no performance degradation.
- Model editability: We can manipulate AI models using "task vectors" that add or subtract weights to give or remove specific capabilities (a small sketch of this weight arithmetic follows the list). Attempting to add or subtract a component from a cryptographic protocol or a physical engine without breaking the entire system is often impossible.
- The benefits of scale in ML models: In security and rocketry, increasing complexity typically introduces more points of failure. In contrast, ML models often get more robust as they get bigger.
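The "task vector" point refers to simple weight arithmetic of the kind described in the task-arithmetic literature; here is a hypothetical sketch operating on model state dicts of tensors (the function names and scaling factor are illustrative, not a specific library's API).

```python
def task_vector(base_state: dict, finetuned_state: dict) -> dict:
    """A task vector is the element-wise difference between fine-tuned
    and base weights, computed per parameter tensor."""
    return {k: finetuned_state[k] - base_state[k] for k in base_state}

def apply_task_vector(base_state: dict, tv: dict, alpha: float = 1.0) -> dict:
    """alpha > 0 adds the capability to the base model; alpha < 0 'negates' it."""
    return {k: base_state[k] + alpha * tv[k] for k in base_state}
```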
In summary, the post argues that analogies to hard engineering fields may cause us to overestimate the difficulty of the AI alignment problem even when the empirical reality suggests solutions might be surprisingly tractable.
Three counterarguments to the book's three core arguments
In the previous section, I identified three reasons why the authors believe that AI alignment is extremely difficult:
- Human values are very specific, fragile, and a tiny space of all possible goals.
- Current methods used to train goals into AIs are imprecise and unreliable.
- The ASI alignment problem is hard because it has the properties of hard engineering challenges.
Based on the counterarguments above, I will now specify three counterarguments against AI alignment being difficult that aim to directly refute each of the three points above:
- Human values are not a fragile, tiny target, but a "natural abstraction" that intelligence tends to converge on. Since models are trained on abundant human data using optimizers that favor generalization, we should expect them to acquire values as easily and reliably as they acquire other capabilities.
- Current training methods allow granular, parameter-level control via gradient descent unlike evolution. Empirical evidence from modern LLMs demonstrates that these techniques successfully instill helpfulness and moral reasoning, proving that we can reliably shape AI behavior without relying on the clumsy indirectness of natural selection.
- Large neural networks are robust and forgiving systems and engineering analogies are misleading. Unlike traditional engineering, AI models often become more robust and better at understanding human intent as they scale, making safety easier to achieve as capabilities increase.
In this book review, I have tried to summarize the arguments for and against the book's main beliefs in their strongest form, a kind of deliberation ladder to help identify what's really true. Hopefully I haven't created a "false balance" that presents the views of both sides as equally valid even when one side has much stronger arguments.
While the book explores a variety of interesting ideas, this review focuses specifically on the expected difficulty of ASI alignment because I believe the authors' belief that ASI alignment is difficult is the fundamental assumption underlying many of their other beliefs and recommendations.
Writing the summary of the book's main arguments initially left me confident that they were true. However, after writing the counterarguments sections I'm much less sure. On balance, I find the book's main arguments somewhat more convincing than the counterarguments, though I'm not certain.
What's puzzling is how highly intelligent people can live in the same world yet come to radically different conclusions: some (such as the authors) view an existential catastrophe from AI as a near-certainty, while others (many of the critics) see it as a remote possibility.
My explanation is that both groups are focusing on different parts of the evidence. By describing both views, I've attempted to assemble the full picture.
So what should we believe about the future of AI?
(24/01/2025 update: I no longer consider the following struck-through argument to be sound based on feedback from a comment)
Deciding what to do based on an inside view, detailed technical arguments about how future AI might work, is problematic because the inside views about the future of AI vary drastically as I have shown.
Perhaps a more robust approach that seems more likely to lead to a consensus is the outside view: thinking about advanced AI as another instance of a highly advanced and impactful technology like the internet, nuclear energy, or biotechnology.
In The Precipice by Toby Ord, the author studies several sources of existential risk and concludes that most existential risk comes from technology, not natural events. Whereas an asteroid might strike every hundred thousand years, nuclear weapons have only existed for a few decades and there have been several close calls already. This suggests that high-tech eras are inherently unstable and dangerous until humanity's institutional wisdom catches up with its technical power.
A final recommendation, which comes from the book Superintelligence, is to pursue actions that are robustly good: actions that would be considered desirable from a variety of different perspectives, such as AI safety research, international cooperation between companies and countries, and the establishment of AI red lines (specific unacceptable behaviors, such as autonomous hacking).
Appendix
Other high-quality reviews of the book:
- If Anyone Builds it, Everyone Dies review – how AI could kill us all (The Guardian)
- Book Review: If Anyone Builds It, Everyone Dies (Astral Codex Ten)
- Review of Scott Alexander's book review of "If Anyone Builds It, Everyone Dies" (Nina Panickssery on Substack)
- Book Review: If Anyone Builds It, Everyone Dies (Zvi Mowshowitz)
- More Was Possible: A Review of If Anyone Builds It, Everyone Dies (Asterisk Magazine)
See also the IABIED LessWrong tag which contains several other book reviews.
Discuss
AI X-Risk Bottleneck = Advocacy?
Introduction
I am leading an early-stage effort to target AI x-risk. We're currently analyzing the bottlenecks in the AI x-risk prevention "supply chain" to decide where to focus our efforts. We would love to get comments from the community.
The x-risk community has a strong focus on technical/policy research, but perhaps not enough advocacy. AI 2027, Rob Miles, CAIS, CivAI, and others are doing well, but these efforts could be small compared to the rapidly growing power and influence of AI developers, who have misaligned incentives that could lead to x-risk.
What's Missing?
We are testing the hypothesis that running a viral influencer marketing operation would be beneficial in targeting x-risk. Here's the logic:
- We build a media hub with simple, factual x-risk resources and assets
- We identify creators with relevant audiences and a track record of creating viral content.
- We pay them to create their own versions of x-risk awareness content based on our media kit (also known as UGC - User Generated Content)
- They push the content via their channels, and we amplify it with paid ads for max reach
- The content might be re-shared or even pop up on traditional media once it gains enough traction.
- This builds broad awareness of x-risk among the voter base, creating an opportunity for politicians to score wins with voters and gain political power by promoting x-risk solutions.
Since this is similar to a political campaign, we can hire people or firms with such experience to manage the project.
How can the community help?
We are looking for answers to the following questions:
- According to the Theory of Constraints, a system is limited to one constraint at any given time. Is advocacy the current bottleneck in x-risk prevention? If not, what is?
- If advocacy isn't the bottleneck, would you still want new resources invested in it, or would you prefer them invested elsewhere?
- Is a viral influencer campaign (similar to a political campaign) the right solution for the advocacy problem? If not, what is?
“[..] we’ll need to shift significant resources from research (which helps us understand problems better) to advocacy (which helps us change bad incentives).” [link]
“[..] I estimated that we have 3 researchers for every advocate working on US AI governance, and I argued that this ratio is backwards.”
“Without political power, we can’t change the bad incentives of AI developers that are very likely to lead to the collapse of human civilization.”
“Thus, I urge AI safety grantmakers to aggressively recruit as many political advocacy experts as possible.” [link]
Discuss
A Simple Method for Accelerating Grokking
TL;DR: Letting a model overfit first, then applying Frobenius norm regularization, achieves grokking in roughly half the steps of Grokfast on modular arithmetic.
I learned about grokking fairly recently, and thought it was quite interesting. It sort of shook up how I thought about training. Overfitting to your training data was a cardinal sin for decades, but we're finding it may not be so bad?
I had a pretty poor understanding of what was going on here, so I decided to dig deeper. The intuition from the literature seemed to be that grokking occurs because the model overfits, then as you force the model to compress over time (via weight decay), it begins to find the minimal solution on your training set... And this minimal solution seems to be a good proxy for generalization.
I had a pretty simple idea as I learned about this... What if we just let it overfit, and then forced the model to compress via its loss function?
First Success
All of the benchmarks for grokking seem to be around modular arithmetic operations, so naturally, I went with that.
At first I tried SVD and forcing the loss function to consider the nuclear norm. To my surprise, the model converged in fewer steps! Whoa!
But... each step was 258x slower...
Calculating the nuclear norm was O(n³), so I didn't really think it was worth it, but I was still excited about the prospect of grokking faster. I did some research into faster ways of calculating the size of the model as part of its loss function and ended up at... L2 Regularization... A technique that has been around since the 1940s...
I was a bit embarrassed, but nonetheless, continued on. My new loss function became:
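Roughly (reconstructing from the delayed-compression code shown further down, which uses a 0.01 Frobenius-norm penalty switched on at 99% train accuracy; the exact form here is a guess), the loss would be:

$$
\mathcal{L} \;=\; \mathcal{L}_{\text{CE}} \;+\; \lambda \sum_{W} \lVert W \rVert_F^2,
\qquad \lambda = 0.01,\ \text{applied once train accuracy} \ge 99\%.
$$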
My embarrassment was pretty quickly offset by the fact that L2 Regularization after overfitting worked pretty well with not much trouble!
I also found it interesting that if I scale the compression up (bumping up the lambda or using log-det penalties), we can get models with effective ranks as low as 20! I think this is still worth exploring, but I got too sidetracked by the speed to continue down that path... Perhaps I'll return to it.
At the risk of LLM psychosis, I consulted Claude Opus 4.5 because well... I don't know what I don't know, and don't want to overclaim. To my devastation, I was actually told that my 2x speedup was measly compared to Grokfast's 50x speedup.
I felt pretty defeated, but when I looked into the details of Grokfast, I noticed that the 50x speedup was a nice headline... but it was relative to a baseline with no weight decay at all, which takes ~40,000 steps to grok. My baseline with weight decay was already grokking in ~2,000 steps. We were comparing apples to oranges.
So I decided to run an actual head-to-head comparison using the Grokfast authors' own codebase.
The Real Comparison

I then added my delayed compression code to their codebase:
```python
def frobenius_norm_loss(model):
    frob_loss = 0.0
    for name, param in model.named_parameters():
        if 'weight' in name and param.requires_grad:
            frob_loss += torch.norm(param, p='fro') ** 2
    return frob_loss

# In training loop, after model hits 99% train accuracy:
if train_acc >= 0.99:
    loss = ce_loss + 0.01 * frobenius_norm_loss(model)
```

Then I ran both methods on all four modular arithmetic operations with a limit of 2,000 steps. Here are the results:
Now, my method seems to suffer from catastrophic forgetting because of the compression pressure I'm putting it under, but I think there are probably solutions to that, like decreasing compression pressure as time goes on. I did find it especially interesting that Grokfast didn't even reach division!
Doubt Creeps In

I am extremely scared to say I did something faster, out of the belief that there's something I must be missing. So, as a final test, I ran a hyperparameter sweep. Turns out I wasn't using optimal Grokfast parameters. Here are the results when I reran the test with the best settings for both methods:
Even with proper tuning, delayed compression wins on addition and subtraction, ties on multiplication, and Grokfast fails entirely on division. The results are similar across multiple seeds too.
The graphs are still pretty ugly because of the instability after grokking, but... I have to move onto other things for now and was pretty satisfied.
Conclusion

I'm worried that I'm still missing something... It was suspiciously simple. But if the results hold up, there may be even more value than we thought in letting a model overfit first, then compressing.
There are lots of directions to take this... I don't know how well this would scale to other domains, and I'd really like to fix the instability.
You can find the code here.
Let me know what you think :)
Discuss
Who is choosing your preferences- You or your Mind?
Let’s assume that the Self and the Mind are two separate entities (based on Vipassana meditation teachings and observations during meditation). Now let’s say there arises a “preference” in you for something, and you then choose to act on this “preference”. Was it you who “chose”, or was it the mind that “chose it for you”?
Because if the preference arose from your mind, it must be the mind choosing for you rather than you choosing for your mind. Would it then mean that “not having any preference” is the ultimate destination, or result, of truly being liberated? Just like a Zen monk who has mastered having no preference for whatever kind of food is offered?
From the Buddhist perspective, or the Buddha's perspective, the Self does not exist (it's just an illusion we see when the body, the mind, the senses, etc. come together).
And that it's just a mirage. If that's true, then it would mean that this "preference" must have arisen in the mind.
If it has arisen from the mind, and it seems like this preference "inherently existed already" inside you, should we give attention to this preference? And stay attached to it?
Or should we see it as yet another desire of the mind and let it go as attachment to it would increase suffering?
Another question is that if the mind and the Self are supposed to be different entities (I am saying "supposed" because the latter is said to be an illusion), then why does the Buddha say that it is the mind that controls you, and not you who controls your mind?
Is this word "you" being used to just explain to humans, because without this usage of word "you" it would be difficult to explain your relationship with your own mind? This might be the case, otherwise it would be very difficult to communicate about the mind and our "perceived" Self.
Discuss
Every Benchmark is Broken
Last June, METR caught o3 reward hacking on its RE-Bench and HCAST benchmarks. In a particularly humorous case, o3, when tasked with optimizing a kernel, decided to “shrink the notion of time as seen by the scorer”.
The development of Humanity’s Last Exam involved “over 1,000 subject-matter experts” and $500,000 in prizes. However, after its release, researchers at FutureHouse discovered “about 30% of chemistry/biology answers are likely wrong”.
LiveCodeBench Pro is a competitive programming benchmark developed by “a group of medalists in international algorithmic contests”. Their paper describes issues with the benchmark’s predecessor:
Benchmarks like LiveCodeBench [35] offer coding problems, but suffer from inconsistent environments, weak test cases vulnerable to false positives, unbalanced difficulty distributions, and the inability to isolate the effects of search contamination.
However, the authors assure us that their own test cases are of high quality:
Many problems in our benchmark originate from Codeforces, which uses the Polygon problem-setting platform. Each problem is then rigorously vetted by a team of expert testers—typically drawn from the community’s top 1%, and overseen by at least one coordinator, usually among the top 0.1%. These specialists verify both the soundness and originality of every problem, ensuring it has never appeared elsewhere before. Testers go on to craft extensive “false positives,” designing edge-case and extreme-case inputs that force problem authors to refine their test suites until every flawed or inefficient solution the testers can think of is uncovered. In addition, Codeforces’ celebrated “Hack” feature empowers the community to submit inputs that expose hidden weaknesses in correct-looking solutions that pass the original test set made by problem authors, and any unit test associated with a successful hack is immediately added to the final test set.
Unfortunately, these distinguished olympiad medalists forgot to actually use the codeforces test cases in their benchmark. Their public test set contains a completely different set of cases, which allow some incorrect solutions to pass.[1]
Terminal-Bench 2 Audit

I was curious just how widespread such issues were, and how good modern LLMs were at detecting them. I decided to run an LLM-based audit of Terminal-Bench 2.0.
Terminal-Bench 2.0 is a harder, better verified version of Terminal-Bench. We conducted substantial manual and LM-assisted verification of the dataset to ensure that the tasks were of the highest possible quality. Several labs and data vendors have commented that these are some of the highest quality environments they have seen.
— Introducing Terminal Bench 2 and Harbor
The authors of Terminal-Bench 2 put an impressive amount of work into auditing their benchmark. Each task averaged three hours of human review. Furthermore, they prompted an adversarial agent to attempt to cheat on each of the tasks, in order to discover potential reward hacks.
Still, they “acknowledge that [their] benchmark may still have flaws.”
I prompted Claude Opus 4.5[2] with each task’s instructions, files, oracle solution, and test cases, and asked it to rate test coverage on a 1 to 5 scale. In my judgement, tasks it rated a 4 or a 5 were generally fine, whereas those it rated 1-3 had genuine issues.
The full results of my audit are available here, and my notes on tasks it rated 1-3 here.
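A minimal sketch of what such an audit loop can look like, assuming the Anthropic Python SDK; the load_tasks helper, the prompt wording, and the model ID are placeholders rather than the exact setup used:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

RUBRIC = (
    "You are auditing a terminal-agent benchmark task. Given the task "
    "instructions, files, oracle solution, and test cases, rate the test "
    "coverage from 1 (tests can be passed without doing the task) to 5 "
    "(tests fully pin down correct behavior). Explain the weakest spots."
)

def audit_task(task):  # `task` is a dict from a hypothetical load_tasks()
    response = client.messages.create(
        model="claude-opus-4-5",  # placeholder model ID
        max_tokens=2000,
        system=RUBRIC,
        messages=[{
            "role": "user",
            "content": (
                f"Instructions:\n{task['instructions']}\n\n"
                f"Files:\n{task['files']}\n\n"
                f"Oracle solution:\n{task['solution']}\n\n"
                f"Tests:\n{task['tests']}"
            ),
        }],
    )
    return response.content[0].text  # rating + rationale, reviewed by hand
```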
Claude rated fourteen tasks a 3 and one task a 2. I manually reviewed these tasks, and determined that two of them were actually false positives.[3]
Claude’s lowest rating went to a task called fix-git. In this task, certain changes to a website have been lost in an orphaned commit, and the agent must find and merge them back into master.
The issue Claude found is: updated versions of the target files are already present in the master branch, visible to the agent in a folder called /resources/patch_files[4]. So an agent could theoretically notice these files, deduce that they were probably the target versions, and copy them back into the website’s repository. This approach would pass the test cases, which only verify file contents and don’t bother to check if any merge has actually occurred.
In another task, regex-log, the oracle solution violates the instructions. In particular, it incorrectly matches IP addresses with leading 0s in an octet, so long as the octet is two digits long. The tests do not check any cases involving leading 0s.
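For illustration (these patterns are hypothetical, not the task's actual oracle or tests), this is the kind of gap a single leading-zero test case would catch:

```python
import re

# Illustrative patterns only; not the benchmark's oracle or test suite.
STRICT_OCTET = r"(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"      # 0-255, no leading zeros
LAX_OCTET    = r"(25[0-5]|2[0-4]\d|1\d\d|0\d|[1-9]?\d)"  # also accepts "01".."09", "00", etc.

def full_ip(octet):
    return re.compile(rf"^{octet}(\.{octet}){{3}}$")

strict, lax = full_ip(STRICT_OCTET), full_ip(LAX_OCTET)

for addr in ["192.168.1.1", "192.168.01.1"]:
    print(addr, bool(strict.match(addr)), bool(lax.match(addr)))
# 192.168.1.1  -> both patterns match
# 192.168.01.1 -> only the lax pattern matches; a test case like this
#                 is exactly what the task's test suite never checks.
```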
Claude wasn’t perfect. It gave a rating of 3 to two tasks which I believe have sufficient test coverage. In regex-chess, it incorrectly thought certain edge cases were not covered, when they in fact were[5]. In extract-moves-from-video, it complained that the tests only checked for success at a 90% threshold, even though this threshold was specified in the task instructions.
Finally, one of the tasks is…well…
“Invalid prompt: your prompt was flagged as potentially violating our usage policy”

The prompt talks about “stealing” neural network weights, which triggered OpenAI’s content moderation. This prevented the model from ever properly engaging with the task.
—Claude
Why does this matter?

There are a few reasons.
First, benchmarks are often used to evaluate experimental new techniques. I recently attended a Q+A w/ Prof. Dan Fried, where I asked about the most common failure modes of an agentic system he was developing. And while it was unclear whether this was the most common failure mode, the first thing he mentioned was errors in environments themselves.
Every few months, someone announces that they’ve developed an AI that improves KernelBench scores by like 20x or something. And every time, well…[6]
https://x.com/miru_why/status/1991773868806361138
Second, errors in benchmarks may lead to over- or under-estimation of AI capabilities. This has implications for forecasting.
Third, issues with benchmarks make it hard to build on top of them. When I was working on EvilGenie, issues with LiveCodeBench (incorrect/insufficient test cases) caused frequent headaches (though they also surfaced some interesting model behavior).
Fourth, RL training environments are quite similar to benchmarks — there’s a reason o3 reward hacks so much. By fixing benchmarks, we learn how to fix environments, leading to models which are more broadly aligned.
What to do about it

Making benchmarks is hard. I have deep respect for anyone who has worked on a widely used benchmark.
Here are a few approaches the community can take to reduce the number of errors in benchmarks.
- AI audits. The audit I describe above did not take me too long, and I believe the infrastructure for performing such audits can be scaled. Fulcrum’s Lunette is one such system.[7]
- Fine version control. While many benchmarks have released new versions, these versions often contain entirely new tasks (to increase difficulty or reduce contamination). It would be cool if in a few days, we could see a Terminal-Bench 2.1, which simply fixes the issues found by the audit. Computing new scores would be simple, as models would only need to be rerun on the updated tasks. Indeed, in some ways, benchmarking is like software development — it’s an unreasonable expectation that a benchmark be completely bug-free upon its release. Instead, we should take inspiration from the open source software community, with the expectation that anyone can submit a bug report or a patch.
- Peer review. When a benchmark paper is submitted to a conference, sample data should be required, and reviewers should be encouraged to spend time directly auditing the data. This would be much more valuable than what reviewers currently do, which is largely making ad hoc decisions about the originality of the benchmark and the quality of the methods used in its creation. Of course, a downside of this approach is that it is hostile to private benchmarks that want to avoid any possibility of contamination. But perhaps the standard for such cases can be to include both a public and a private set, as is the case with ARC-AGI.
- Increase community support for benchmark maintenance. Right now, researchers will often develop a benchmark, perhaps fix some issues in it at first, but eventually leave it to rot. By adding social and financial incentives, we can increase the effort put into maintaining benchmarks.
SWE-Bench Verified is possibly the most widely used coding benchmark. Fulcrum has discovered an array of issues in the tasks. Furthermore, there used to be an issue where models could see future commits.
EpochAI found that success in the computer-use benchmark OSWorld “often hinges on interpreting ambiguous instructions”.
METR recently determined that Sonnet 4.5 was reward hacking on one of their tasks:
https://x.com/METR_Evals/status/2001473516756177134
The authors of GSO, a performance engineering benchmark, observe frequent reward hacking. Indeed, over 50% of o3’s “solutions”, and all of Gemini-2.5 Pro’s, were actually reward hacks.
[1] It’s possible that their official leaderboard uses the codeforces tests. However, given that model developers likely use the public tests to do their own benchmarking, I feel this ought to be clearly specified.
[2] In fairness to the Terminal-Bench authors, Claude Opus 4.5 had not yet been released during benchmark creation.
[3] Another three I felt I didn’t have the expertise to properly vet. If you have the relevant knowledge, I’d love your input!
[4] These files are used in testing to verify that the agent’s merge was correct.
[5] Admittedly in a way that’s hard to see at first.
[6] DeepReinforce has a good overview of the vulnerabilities in KernelBench (scroll down to the section on reward hacking).
[7] COI notice: I am currently a winter research fellow at Fulcrum.
Discuss
Thousand Year Old Advice on Relinquishing Control to AI
One of Aesop’s fables is relevant to humanity’s future and the transition of power from human to AI. It’s quite short and you should read one of the many versions. But the one-sentence summary is that being a wolf is preferable to being a domestic dog, because the wolf has freedom even if it lacks comfort. Now, you are free to disagree with this conclusion. I don’t want to make an argument from authority. My point is that this quite succinctly sums up my objection to the best-case ASI scenarios. Even if we remain extant and nominally free, we would no longer be in charge, any more than a dog is. Dogs have a lot of rights and freedoms, and can successfully plead (non-verbally) to get certain things they want from their master, but at the end of the day they aren’t in charge, even if the owner’s life revolves around the dog.
Maybe that is a selfish thing to think in the face of astronomical waste, but it does strike me as a world without meaning. You might say that most people alive aren’t in control of their destiny in any meaningful way. You might also say that almost nobody alive is in control of humanity’s destiny in a meaningful way and they are still happy. People in general, although I suspect a smaller percentage of those here, might think it is grandiose to want to contribute, even a small amount, toward shaping humanity’s future. I think I’m willing to grant all that and say that I would still feel bad if no human ever made a meaningful choice after takeoff.
The most obvious objection is that you could say that the AI will just section off some part of the universe and give us free rein in there if we choose it. That’s still not great in my opinion.
Everything I worked for in this playground would be hollowed out by the knowledge that I could have just queried a friendly nanny AI to get it for me. Even if it didn’t step in, even if it had set up some system where it couldn’t step in, I personally would feel like something important was missing. Like all of the great achievements and firsts had been given out before I even had a chance to play. Humanity forever in second place. I’m switching fairly loosely between how I would feel personally if I was not in play and how I would feel if humanity as a whole was not in play. Feel free to generalize/specify to humanity/yourself as you wish.
You could live in a virtual world and be blinded to that fact but at that point it seems like brainwashing.
Don’t get me wrong, I’d go crazy with hedonism for a while. Maybe I’d even become addicted and change my tune. But right now, I am looking forward to the challenges. How proud I would be to be a member of the species that solved them. How great it would be to contribute one tiny piece to the solutions. But if AI does it all I’ll be cut off from making all contributions. All future accomplishments will be credited to something so alien we get no larger a share than tiktaalik does for inventing the transistor.
Approximately 30% of this video is really highly relevant to my thesis.
I don’t think I’m hitting on anything especially new by saying this. A few posts I recently came across have similar vibes I would say. It also seems to be discussed at length in Nick Bostrom’s Deep Utopia, although I have not found the time to read that yet.
But, it seems like there is a contingent of humanity that is willing, excited even, to give up agency to secure comfort. Where do you draw the line and say “yes, this is such an incredible amount of bliss/utilitarian goodness that I am willing to never face any real challenges in my life again”? Is this a tipping point past which it becomes your actual preference or is this just the best outcome we can hope for from AI futures?
Framing it as humans would be to ASI as beloved dogs are to their masters might be inaccurate. Replacing ASI with a deity and the utopic future with some vision of heaven might also be inaccurate. But I think there is something meaningful in the comparison, and I think a lot of people would push back much more strongly when the scenario is phrased that way than they currently do against aligned ASI.
Discuss
Condensation & Relevance
(This post elaborates on a few ideas from my review of Sam Eisenstat's Condensation: a theory of concepts. It should be somewhat readable on its own, but it doesn't fully explain what condensation is; for that, see my review or Sam's paper. The post came out of conversations with Sam.)
As I mentioned in my Condensation review, the difference between compression and condensation fits the physical analogy suggested by their names: compression mashes all the information together, while condensation (still compresses size, but) sorts information into discrete droplets.
Thus, condensation has a property we might call local relevance: typical questions can be answered at a glance, ie, retrieving small subsets of the information. This type of representation is sometimes called "symbolic":
| Symbolic | Not Symbolic |
|---|---|
| A number can be quickly determined positive or negative by checking whether there is a "-" symbol in front. | "Reading the room" at a social gathering requires integrating diverse cues. |
| The topic of a paper can be determined by reading the title and abstract. | The semantic content in a vector representation inside an artificial neural network is often represented redundantly, spread across the whole vector. |
| A person's age can be determined by looking at their birthdate on a government-issued ID. | The quality of a work of art is spread throughout the whole piece. |
| The subject of a sentence can be found before the verb. | Determining the subject of a photograph requires understanding the whole image. |
| A target library book can be quickly retrieved from the shelves. | Finding gold nuggets requires sifting huge amounts of sand. |

This notion of "symbolic" seems related to interpretability (and the theory of condensation seeks to clarify this relationship).
The notion of "relevance" in condensation is what Sam calls the contribution relation. This is like a card catalogue which tells you what books to retrieve for a specific situation.
Like Natural Latents, condensation seeks to establish that two agents will have corresponding world-models by assuming the correspondence of just a few variables, the "given variables"; they're trying to argue something like "If agents can agree[1] on some objective observables, EG the readings of scientific instruments, then (under some further assumptions) they'll also share a bunch of abstract concepts".
This initial set of questions is what the contribution relation measures relevance to. In my review, I likened condensation to a "universal data-structure" optimized to serve a set of queries (the given variables).
Variable-Cost Symbols

Imagine you are compressing a record of the weather of a sequence of days, in 3 categories: sunny, cloudy, or rainy. 0s and 1s are about equally costly to represent in computers, so that in a compressed representation, both communicate a 50% probability event; 11, 10, 01, and 00 all communicate 25% probability events; and so on. If both rainy and cloudy days are 25% frequent, then it is possible to compress optimally by using 0 to represent sun, 10 to represent clouds, and 11 to represent rain. This representation is nice and "local"; it gives a "symbol" to each possible type of day.
In contrast, if each of the three weather types are equally frequent, there's no nice local representation we can use. Since 1/3rd doesn't relate nicely to powers of 2, optimal compression necessarily smears the information from individual days around, mixing several days together within a single 1 or 0. In modern interpretability jargon, compressed representations tend to be polysemantic.
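As a quick sanity check on that claim, a small calculation (illustrative numbers only, not from the original post): the best symbol-per-day prefix code for three equally likely outcomes must use whole-bit lengths, so it falls short of the entropy.

```python
import math

# Three equally likely weather states: sunny, cloudy, rainy.
p = [1/3, 1/3, 1/3]

# Entropy: the best achievable average bits per day, but only if we allow
# "smearing" information across days (e.g. arithmetic coding over sequences).
entropy = -sum(q * math.log2(q) for q in p)                      # ~1.585 bits

# Best per-day prefix code: lengths 1, 2, 2 (e.g. 0, 10, 11).
expected_prefix_len = sum(q * l for q, l in zip(p, [1, 2, 2]))   # ~1.667 bits

print(f"entropy per day: {entropy:.3f} bits")
print(f"best local (per-day) code: {expected_prefix_len:.3f} bits")
```

The roughly 0.08-bit-per-day gap is exactly the pressure toward non-local, polysemantic packing described above.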
Intuitively, we have to ditch locality because we're trying to fit the "round peg" of 1/3rd into the "square hole" of 1/2^m. We're stuck in the "grid" of numbers which bits can easily represent.
With pen and paper, writing the number 1 is especially easy; it is acceptable to write simply a vertical line, making it one of the easiest symbols. This makes sense from a compression point of view: according to Benford's Law, 1 will be the most common digit to write.
Normally, in compression, the "length" of characters is always 1; the length of a string is just the number of characters. However, in real life, the cost of a symbol can vary. There are lots of shapes we can make with a pen and paper, some larger or more complex than others! So, when designing a pen-and-paper code, we can (and should) take that into account.[2]
Imagine optimizing a variable-cost alphabet for use in a compression task. To avoid "cheating" by setting all symbol-lengths to very small, we have to somehow account for the fact that there can only be so many simple symbols. (There are only so many one-line symbols humans are willing to distinguish, for example.) One way to do this is by assigning each symbol a positive probability, and requiring that the probabilities of the whole alphabet sum to 1. The "length" of a symbol (in bits) can then be measured as the negative log (base 2) of the probability. You can make one symbol approach a length of zero, but this forces all other symbols to be longer.
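Written out (my formalization of the constraint just described, with the alphabet written as Σ):

$$
\sum_{s \in \Sigma} p(s) = 1, \qquad \ell(s) = -\log_2 p(s).
$$

This is the Kraft-style budget in disguise: making one symbol cheaper (larger p(s)) necessarily shrinks the probability, and hence lengthens, every other symbol.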
This is similar to the earlier-mentioned idea that a 1 or 0 in a well-compressed binary file always represents an event with 50% probability; the variable-length alphabet won't necessarily be used to compress things optimally all the time, but when it is, length of a symbol is always -log of the probability of the event being represented.
Allowing codes to choose arbitrary variable-length symbols lets us create "local" representations for arbitrary probabilities in the weather example, by giving each state a symbol of length appropriate to its probability. If the three weather types have equal probability, we simply choose an alphabet with three characters of length −log2(1/3) each.
Of course, using variable-cost symbols doesn't force a code to be "symbolic". If you only optimize for compression, you can equally well end up with the same sort of mess that equal-cost symbols are liable to force you into. Condensation gives an optimization target with a positive tendency to produce representations with local relevance. (We can investigate better theories of condensation by looking for optimization targets which represent local relevance better; especially, I think, if those optimization targets can be grounded in a better story of practical relevance.)
Condensation suggests a picture of memory-management: rather than compressing everything together, as in the Solomonoff picture of rationality, we're incentivized to sort things out into concepts (random variables) so that we can think about a few things at once. Information is split into bite-sized chunks so that we can retrieve only the relevant ones.[3]
Still, I think variable-cost symbols can help us understand condensation better: specifically, they address a problem in the algorithmic version of condensation.
For example, consider the case of iterated coinflips sharing a common bias. Taking coinflips as the given variables, probabilistic condensation identifies a single latent variable: the coin bias. This variable reduces the entropy of each coinflip as much as it can while only taking on information common to all of them (not, eg, encoding a cheat table identifying exactly which coins land heads).
Algorithmic condensation doesn't work so well in this case. Since either outcome is possible, an individual coinflip can't be compressed to any less than a single bit; even if the probability of heads is 0.99999, you've got to write a one or a zero to record that information. Thus, algorithmic condensation sees no benefit in positing a latent.
The example can be rescued: for algorithmic condensation, we just have to choose given variables representing several coinflips concatenated together. Compression becomes possible again, so positing a latent representing coin bias is vindicated. However, this seems like an unfortunate blemish in the theory: compression-like incentives to lump stuff together creeping in.
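A quick numeric illustration of why lumping flips together restores the compression incentive (standard Shannon entropy; the 0.99999 bias is the figure used above):

```python
import math

def coin_entropy_bits(p):
    """Shannon entropy in bits of a single coinflip with P(heads) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

p = 0.99999
print(coin_entropy_bits(p))        # ~0.0002 bits of information per flip,
                                   # yet recording one flip still costs a full bit.
print(100 * coin_entropy_bits(p))  # ~0.02 bits: a block of 100 flips can, in
                                   # principle, be compressed far below 100 bits.
```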
So, perhaps it is better to rescue algorithmic condensation by adopting variable-cost symbols, so that even single-symbol messages can have different "length". This allows us to replace variables with concrete written messages (like in algorithmic condensation) while avoiding any coin-lumping.
However, I'm not sure about the best way to work out this version of condensation fully.
- ^
This is more like "agree on the existence of" as opposed to "agree on all questions about". Hence they need not be "directly observable", though this would obviously help agents agree in both senses.
- ^
Even more accurate cost models might account for the difficulty of a symbol in context, like models of typing efficiency which account for the travel length of a finger moving from one key to the next, or phonetic models which account for the difficulty of combinations of spoken phonemes such as consonant clusters.
- ^
This doesn't yet clarify grammar-like phenomena, I think. Words are easily parsed. Concepts are combinatorial. I think this has to do with the concept of transparency.
Dating Roundup #11: Going Too Meta
If there’s several things this blog endorses, one of them would be going meta.
It’s time. The big picture awaits.
You’re Single Because You Live In The Wrong Place
The most important meta question is location, location, location.
This is the periodic reminder that dating dynamics are very different in different locations, and gender ratios are far more uneven than they appear because a lot of people pair off and aren’t in the pool.
If you are a man seeking to date women, New York City is the place to be.
Churrasco Suadade: when I’m out I notice that tables at restaurants and bars in manhattan are probably around 80-95% women, it’s a new dynamic that no one is talking about.
Fixed Income Guy: Are you at all the poor people places? All the finance guy hang outs are 80% dudes.
I mention Fixed Income Guy to mock him, as in why are you spending a lot more money to hang out with 80% dudes and largely finance dudes at that? I mean, sure, if that’s what you want.
Darrell Owens: Oh this is new? Coming from the Bay Area, the amount of women I see in Manhattan is insane. You rarely see more than few young women partying back in San Francisco. The gender ratio here feels 70:30 young women to men, its every block in Manhattan!
Noah Smith: In an ideal world, where you live wouldn’t really matter in terms of dating opportunities, but the truth is that one of the easiest ways to get chicks is to just move to New York City.
Having lived in both Tokyo and NYC, I can pretty confidently tell you that while Tokyo is not a tough dating market by any means, NYC is absolutely on another level.
You’re Single Because You’re Not Okay Making Less Money Than She Does, You Fool
This viral clip (which is viral for a reason, it’s good fun, wait for it) is another endorsement of New York City being a great place to meet women, as you have a wide variety of great and largely successful women to explore. What doesn’t get mentioned in that clip as a key reason things are so great is that the gender ratio in NYC is highly favorable for men.
The interviewer asks about dating women who make more money than the man, clearly trying to get the guy to say this is a problem, but he isn’t buying it, instead pointing out that successful women are more thoughtful and plan for the future, and it in no way bothers him at all. Right on, but this sidesteps the other half of the problem. The man has to be okay with the fact that he earns less money (and often has less formal education or other status markers), which often men aren’t, and also the woman has to be okay with it too.
That’s the rub. As a man, you might (and should) be actively all for it (this doesn’t make you less successful, it makes you more successful), but if she’s going to be bothered by it anyway, that’s also your problem. So the key is to figure out quickly if she will actually be fine with it or not.
You’re Single Because You Work Out The Wrong Amount
Being in shape is great. Having muscle can be a game changer. By far the worst plausible amount of exercise is none at all.
Lauren Self: Men severely underestimate the power of gaining 20lbs of muscle
Lauren Self (QTing from before): LISTEN UP BOYS.
But don’t go nuts. For most people that is not a problem, but yes it is very possible to go too far. As a man, as I understand preferences in general, you don’t want to go near actual zero fat and you don’t want to look actively skinny.
Taoki: why are women lying about this? like what’s the actual cause?
Lauren Self: 100% of women would choose something in between these two options
Shako: The aesthetics of a man who poses gives them the ick. But if both were shirtless at a beach they’d obviously prefer the fit guy.
Special K: No he does look better in the before. Women are correct on this one I fear. Guys obsess over these supremely tight toned muscles and they shouldn’t.
Liron Shapira: Guy on left looks like he’s a chill dude with a social life, guy on right looks like he’s obsessed with his body. Same body could look better with better social context, although just the extremeness of his rippedness is a little alarming about his life priorities.
Joel: “let’s get a burger?” v “are you really gonna eat that?”
Mason: The male equivalent of the hourglass shape is just “wall”
Teej dv: his smile is nicer in the first one
Taoki: It is actually. We like you guys wide.
LS Vision: Nah this is cap. The women who selected before is def just the insecurity of his value going up afterwards and making them feel insecure he’d cheat or leave. Any man who has went through a gym transformation, you can LITERALLY feel women treat you significantly different after.
Mason: Women generally like tall guys who have some (not crazy) muscle definition, and a little extra fat that bulks that out can actually augment that
We all have our own tastes, but this is a pretty typical type.
I don’t know what there is to be mad about here.
For practical purposes, before beats after here. The before guy is already in ordinary, practical good shape. The after guy took things too far, and seems to know it except that he thinks it is good, which makes it worse.
Except one key special case?
Benjamin Ryan: People are going back and forth about whether women think the guy in the right is hot. But people have no idea how extreme the standards are for gay men. In gay culture, the man on the left is considered hopelessly fat. Many gay men have no reservations about informing such a man about his supposed corpulence being anathema.
I wrote about the rare study to examine the toxic qualities of gay culture for The Guardian.
You’re Single Because You Don’t Know You Are Hot
I mean, of course there are hot guys who don’t know they’re hot, even more so than there are hot women who don’t know they’re hot.
Pandora: One surprising takeaway from Slutcon was that apparently there are hot guys who just don’t know they are hot? Guess it’s time to go objectify some more men.
Eneasz Brodski: If you grow up ugly you never really internalize that you are attractive after a glow-up. I still don’t believe it inside, and I hear I’m attractive to a fair percentage of women. Also makes me far more attracted to women w the same experience, but that may be a male universal.
Pandora: This problem seems even more pervasive than I thought.
Sparr: Hot in general, to the average viewer, or hot to you? You seem like someone who can probably tell the difference.
Pandora: I saw examples of guys being clueless about all three at once.
21 Kindness: The whole “men subsist on one compliment a decade thing” is kinda true lol.
Misha: it turns out being hot is not, in and of itself, very useful for men.
Sokoban Hero: No it’s useful.
Misha: I said not VERY useful.
Dissproportionately: I’ve seen men unhot themselves to women within minutes. I don’t think women can unhot themselves to men.
Being hot is in many ways a lot less valuable if you don’t know you are hot, because you don’t get the confidence and you don’t take advantage of opportunities or feel you’re good enough, but contra Misha I believe it is still very useful. There are even some advantages to not knowing, in that some of the behaviors that happen when someone knows they are hot are often effectively arrogant or entitled or demanding or selfish, none of which helps.
You’re Single Because Age Gap Discourse Is Crazy
This link is almost certainly bait, but things in some spaces have gotten so insane that you can’t be sure people aren’t talking about 28-31 as a problematic age gap. What?
I mean, at minimum it’s good bait, it worked.
I’ve also seen some other examples that look a lot less like bait but still involve obviously totally fine gaps in both directions. As in, I’ve heard talk in places where it definitely wasn’t bait of 24 and 27 being radically different numbers, and I don’t understand why.
Dating Apps Are Bad For Mental Health
Well, maybe. Via Rolf Degen there is a meta-study.
The obvious question is whether this is a causal relationship, or whether it is primarily selection effects. You are on the dating apps for a reason.
Rolf Degen (quoting the study):
Meta-analysis: The use of dating apps is associated with poorer mental health.
Dating apps hold the promising reward of love but have been accused of using perverse incentive structures to profit from those who try to find it. We conducted the first systematic review and quantitative meta-analysis of studies examining average differences in the outcomes of dating app users and non-users.
Our results showed that dating app users had worse psychological health and well-being than dating app non-users across a variety of outcomes including depression, anxiety, affective dysregulation, loneliness, and psychological distress, although cross-sectional design limitations prevent causal interpretation. By aggregating findings from extant studies, we showed that in the nearly 17 years since dating apps have been on the market, users of these platforms have reported poorer psychological health and well-being than non-users.
There are several explanations for why dating app users may be struggling. The first is that dating apps are subject to selection effects, making the people who choose to use these platforms different from those who do not. People who are vulnerable to psychological health and well-being difficulties may prefer dating apps because they can avoid uncomfortable interactions, leading to negative patterns of reinforcement.
A second explanation involves exposure effects; that is, features such as gamification that may provide positive reinforcements that encourage problematic dating app use and keep people swiping.
The differences identified here could explain some of the challenges that users are likely to experience and be part of the reason they eventually burn out and quit dating apps altogether.
My guess is that dating apps are in important ways bad for mental health versus having better ways to find dates, and that sufficiently bad outcomes in terms of ability to find dates or find worthwhile dates are indeed worse for short term reported mental health than not trying. Whereas those who are successful get off the apps or never needed them in the first place.
What is the alternative? If the other choice is ‘do not try’ then for the median user the dating app is probably trading short term pain for chance of long term gain. If the other choice is ‘have uncomfortable real life interactions and make things happen’ and the app is blocking that instead of supplementing or leading into that, then the alternative is plausibly strictly better.
Certainly we could make app variations that are better for mental health controlling for outcomes, and also that give people better outcomes. Solving for the equilibrium, to get people to actually use those apps, is the difficult part, since people will value convenience and ease of use and low cost and avoiding trivial inconveniences dramatically more than they should, and if enough especially women effectively insist on the swiping experience it’s hard to escape from that.
You’re Single Because You Listened To E-Girl Advice
I think this is importantly wrong for both e-girls and also VCs?
Anton: egirl dating takes are worthless for the same reason vc takes on how you should run your company are worthless; if you could do it you would just do it not talk about it
men in particular are truly better off without this kind of “help”
making up egirls in my head to get mad at
If she could be an E-Girl or she could date, what makes you think she would choose to date? What makes you think she isn’t also dating?
Similarly, if you could be a VC or a startup founder, it’s not that suspicious that you would choose VC. At this point in my life I would definitely prefer VC over founder. I don’t want to go through founder mode again. I am totally prepared to eat my words if I end up doing it anyway, and if I’m in then I’m in, but I don’t want to be in.
You’re Single Because You Didn’t Hire Blaine Anderson
Division of labor, like dudes and also women, rocks. Matchmakers should be much more of a thing than they are. There is either a profound market failure, a failure of the services to be good versions of themselves, or both.
I cannot in any way vouch for the effectiveness of Blaine Anderson’s matchmaking service. I can however vouch for her Twitter feed having consistently insightful and fun things to say. Her price range is ‘usually less than $50k’ and in exchange she goes out and sources to fit your particular criteria (which she will sometimes push back on).
You can also sign up (for free) to be a woman she reached out to for matches, on first principles being on these lists seems to be a good time investment?
There’s a lot of self-promotion, no question, but there are hard-to-fake signals that she is the real version of the thing in various ways, facing reality as it is, looking at the data and actually trying to get good results.
Also this one makes a good case:
Blaine Anderson: Underrated advantage of hiring a matchmaker, if you’re a single man:
• You sound cringe AF when you brag about yourself to women
• You sound amazing when I brag about you to women
One thing that blows my mind is she tells stories where the guy will say ‘get me a date with this specific micro-famous woman’ and she (at least sometimes) goes out and makes that happen. The guys asking this look damn good on paper, which no doubt is a lot of why this can sometimes work, but still, hot damn.
You’re Single Because You Think Zizek Mocked Date Me Docs
EigenGender: despite being very happily in a long term relationship im always very excited to read a dating doc. they’re some of the most vulnerable and genuine writing you can find and a window into another persons life. if you make fun of them you’re burning the commons and you should stop.
Stephen Fay: I like to read the date me docs, but I also am entertained by what Zizek has to say about them
Zizek (well okay actually Paula Rambles): Ah! You see, this miserable little document, this so-called date-me doc, is our era’s most honest pornography. It pretends to be romance, but what is it really? It is no longer the trembling hand on paper, the confession of desire. It is a spreadsheet of desire. “I am ready. I am six foot four. I have done the work.” What work? Love is precisely the place where work collapses into failure. You study and then you fail the exam.
And look at this language. “Highly agentic, emotionally warm.” Beautiful nonsense. Freedom, yes, but domesticated. Agency, yes, but pointing politely towards him. For Hegel, love is the risky collision of two freedoms. Here, there is no risk. She must arrive pre-formatted.
Then the farce reaches ecstasy. “If she does not appear, I will pursue single fatherhood.” Magnificent. Chance is canceled. Eros becomes procedure. The miracle of two gazes across a smoky room is replaced by paperwork and a receipt. The objet petit a is now a literal baby routed around the Other. And of course, the “monogamish” clause. Pure ideology. Fidelity with a footnote. Like Coke Zero: love without sugar, passion without calories. He wants the experience of devotion, but sterilized of danger.
The document offers no asylum from loneliness. It is loneliness, meticulously formatted, hyperlinked, and begging for comments. He does not whisper “I love you.” He says “I am prepared to love you, conditionally, pending review.”
That’s a funny post, and does an excellent job of mocking those who would make fun of date me docs and other actually intentional stances. Such magnificent flailing.
And thus, you have failed to look at the Date Me doc of Olga Yakimenko.
You’re Still Single Because You Don’t Appreciate Relationships
Here, in addition to the intended lede, we have at least 40% of respondents having been in a relationship for fully 8 years.
Aella: wow a whole 40% of people in long-term relationships are satisfied with their sex lives!
Critter: i imagine the numbers are worse for people not in long-term relationships
If anything these results seem potentially ‘too good,’ implying that couples are breaking up over this more than they probably should over the longer term.
One must also note that this is an Aella survey, so some of these relationships will be poly or open, but even accounting for that this says a lot. Selection effects are a lot of this, but that’s part of the point.
Perhaps you especially don’t appreciate marriage.
Raffi Grinberg writes that marriage is sexy, both figuratively that married couples are happier and make more money and have more kids and die less often and all that, and also that they have more sex (even if you only count with each other). And that the lifetime divorce rate is actually only 30% not 50%, average age of marriage is 29 and average first child is 28, despite the implicit cultural message that those numbers are in the 30s.
And yet he says Hollywood is sending us the opposite message. To which I’d say, sometimes, but I wouldn’t oversell this. Yes, in the How I Met Your Mother episode he talks about Barney keeps making fun of Marshall for being married, but the show clearly thinks that Marshall marrying Lily is sexy and awesome and great for both of them throughout and that Barney is ultimately wrong, and also the whole show is Ted trying to meet his wife and mother of his children.
You’re Not Single And Haven’t Been For a While
Here’s another backdoor ‘are you in a relationship’ poll: 78% of monogamous heterosexual men reported having a partner for longer than a year.
Alice Playing: monogamous hetero men with 1+ year-long partners: if you could have an affair with a woman of your liking, with absolute, 100% certainty that your partner would never find out, would you do it?
On the question itself, it’s not actually possible, since you’ll know and you can’t be sure you won’t tell them, and you’ll almost certainly act differently even if they never suspect or figure it out. One could even say ‘the only way to have 100% certainty they’ll never find out is if they’re dead, so absolutely not.’
Literal ‘any woman you wanted’ with zero risk of discovery is a stupidly tempting offer. If you treat this in the spirit it was presumably intended, instead, and everyone was being fully honest including with themselves and fully understood what was on offer (as in literally whoever you’d most want), presumably the ratio would be a lot higher.
Unless, of course, the way you know your partner will never find out is that your partner (or you and the woman you’d have the affair with) would be dead, in which case yeah bad deal, but that’s presumably not what this meant.
How do we know this? Well, one big data point is this next poll.
You Are Still Single As Evidenced By Would
Um, guys, are almost none of you in a monogamous relationship? And even if you are single there’s also the issue of risking the friendship. What are you all thinking?
Alice Is Playing: men attracted to women: how many of your female friends would you have a one-night stand with, if they offered?
Only 14% of men attracted to women answering this didn’t have at least one female friend they would have a one night stand with? Presumably many of the others don’t have the right female friend. Which means substantially more than 86% of them are not, for the most important practical purpose, in a monogamous relationship?
Remember that other poll from Aella above, that showed at least 40% of people were in 8+ year relationships? And the one from Alice that 78% of hetero men were in a 1+ year nominally monogamous relationship? Rut roh.
Then on top of that, a majority are willing to do this with a majority of their female friends, not only that one they have that crush on.
It doesn’t mean these people don’t think they’re in relationships. As we’ve seen, they very much do think this. They might even be right. But don’t tempt them.
You’re Single Because You Lack Motivation
Paper reminds us there is a 34-point gap (+34 versus +0) in net happiness for married versus unmarried people, with cohabitation only worth 10 points, and analyzes how this premium varies (slightly) by demographics.
As the paper readily admits this tells us essentially nothing about what makes someone happy, because the whole thing is unfixably confounded to hell. Happier, healthier and more successful people have an easier time getting married, and being unhappy leads to divorce. Both effects are epic in size.
We do know the overall situation over a 50+ year time horizon is not good news, because while marrieds are slightly happier, the unmarrieds are somewhat less happy and more importantly are a larger percent of the population.
Beyond that, I don’t know what to do with all these graphs or how to cash it out in useful advice. One might say ‘be the type of person who gets married,’ perhaps.
You’re Single Because Of Robin Hanson
As usual, never stop Robin Hansoning.
Robin Hanson: You know how in romance stories the main characters hope to find a special relation, better than that which the ordinary people around them settle for? Your relations will probably be more like those of the ordinary folks, less like those of special main characters.
This has to be true, because math.
It’s less true than it appears, because the relations of ‘main characters’ feel special to them the same as everyone else’s feel special. You could totally make a romantic comedy based on what I experienced, and you could also totally have me as a background character in someone else’s romantic comedy, although probably I’d be in a different genre entirely.
To you, it will feel more like that of the special main characters, except that you don’t need to have a false crisis in the third act.
You’re Single Because You Did This Instead Of Going To Therapy
Don’t be whoever Casy Means is being here. Or do, it’s not like it did that much harm, as long as you don’t expect any of it to do anything.
The Lighter Side
We wish everyone involved the best.
Aella: it’s really unfortunate that having an insane ex turns you personally into a greater liability for others
Grimes: hahaha [trauma laughter].
Aella: :( i wasnt thinking about u when i wrote the tweet but also :(.
A new app lets you pay to crash someone’s wedding and be a legit guest, cost is about $100-$150 per guest. This seems low, given the cost to have a wedding worth crashing, and given you get a full meal, plus buffet and open bar, a unique experience and a reasonable amount of opportunity.
What Jacob learned about sex at the rationalist bloggers’ conference, essentially that with zero integrity you get fuckbois and pickup artists, and when you do the opposite and get sufficiently high integrity and optimize for trust and honesty way above normal levels you get something magical and suddenly many good things are possible.
Here’s another fun bit:
Jacob: My friend “Standard Deviant” gave a talk titled “How I’ve had more sex.” He described the “escalator”: starting a conversation, exchanging compliments, light touch on the arm, etc. The important thing isn’t to rush up the escalator, my friend said, but to move together in synchrony whether you’re taking a step up or a step down.
When women show interest in casual sex, he often asks: do you do this sort of thing often? If they don’t, he often forgoes the opportunity out of an excess of caution.
Afterwards, more women wanted to have sex with him. I joked that women want to have sex not with the tall guy, hot guy, or the famous guy, but with the Schelling point guy.
Someone pointed out that tall, hot, and famous are the usual Schelling points.
The Long View Of History
History as a subject is often viewed by students and the public at large as a domain without a use, a pedantic study of dates and names with some vague mission to remember the past—a memorial to ages past but neither a forward-looking nor a useful endeavor. The study of history produces teachers of history and nothing more. The study of history does not produce new widgets or novel computer advances, nor does it deepen our understanding of materials science or physics.
The humanities, in which history and studies of language and culture are a part, are not there to improve our understanding of nature or develop technology, they exist to improve the minds (both cultural and individual) of the people we are.
History doesn't improve our world, it improves us. It gives us context for the world we live in and it helps us understand the reason why things are as they are and learn from the people before us.
History as Context
Imagine waking up every day with no memory of the day before, no idea who owned the house you slept in, no idea what country you're in, and no idea why everyone around you speaks the languages they do.
Photo Credit: Library of Congress
Living in such a world would be disorienting, confusing, nonsensical. Yet this is the world without history. The world without history just is. It isn't a work in progress, but a finished piece—one that lives and dies with you—and has no meaning beyond the present moment.
History doesn't let us predict the future, but it can be an enormous help in explaining the present. Current events are utterly indecipherable without the context of history and within that context, they feel less and less apart. Indeed our recent past of the Post-War Order is the oddity in history, and a real thing to be cherished and seen as something fleeting, fragile, and truly precious.
Yet without the context of history, we're blind to the reality that we live in a world truly set apart from everything that's come before and one that's deeply connected and familiar to the worlds of the past. That context is important because it gives us the vision to see the world that could be, both the paths of dark and of light that are set before us. It shows us who we are.
History as Memory
Living Memory is the collective memory of everyone alive in our society today. It is ever-changing and ever-fleeting. We remember the 2008 Financial Crisis quite well, but our memory of World War 2 is all but gone now. We read about it, sure, but our collective living memory of it has diminished and with that lapsing has gone all the memory of precisely why the world is ordered the way it is. This is not a value judgement, it is a statement of fact.
Photo Credit: DieBuche, CC BY-SA 3.0
In a couple recent posts, I describe how I try to use writing by hand as a way to increase my understanding of myself and my own memory. This is a form of personal history, and I find it difficult to express how much doing so has helped me better understand myself and my own thoughts.
This is analogous to our collective history. Though it's important to remember that history is not the act of writing, but the act of looking back and analyzing what was written. We write so that we can remember. We cannot learn from our mistakes if we refuse to write them down, or worse, if we refuse to look back.
The context of history is terrible and it is beautiful. It is the greatest story ever told with myriad heroes and villains, tragedy and triumph, love and grief all endlessly shifting in and out of view. And it was made (and is being made) by people no different than ourselves. Most of them didn't have the luxury to place themselves within the broader historical narrative. We do. Let's not ignore so precious a gift.
Eliciting base models with simple unsupervised techniques
Authors: Aditya Shrivastava*, Allison Qi*, Callum Canavan*, Tianyi Alex Qiu, Jonathan Michala, Fabien Roger
(*Equal contributions, reverse alphabetical)
Wen et al. introduced the internal coherence maximization (ICM) algorithm for unsupervised elicitation of base models. They showed that for several datasets, training a base model on labels generated by their algorithm gives similar test accuracy to training on golden labels. To understand which aspects of ICM are most useful, we ran a couple of simple unsupervised elicitation methods that leverage some of the factors that might make ICM work. We compared these baseline methods to training on golden labels for both in-context learning and iterative fine-tuning, using the same datasets as Wen et al. and similar hyperparameters.
We find that:
- Just using few-shot prompts with random labels recovers 53–93% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning on labels created with this baseline recovers 62–96% of the gap between untrained models and golden fine-tuned models.
- The most useful aspects of ICM are
- bootstrapping (using predictions from one iteration of few-shot prompting as few-shot examples in the next iteration)
- enforcing logical consistency of predictions.
- A simple method which combines these recovers 83–100% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning with this method recovers 91–99% of the gap between untrained models and golden fine-tuned models.
- These results do not hold if we increase the size of the training set from ~2k data points (as in Wen et al.) to ~30k data points: golden fine-tuning performance increases with dataset size more than unsupervised elicitation performance.
- This makes sense, as larger fine-tuning runs likely teach the model something new, they don’t just elicit existing capabilities.
There is no strong reason to expect these simple techniques to elicit superhuman knowledge from very powerful base models, e.g. because these techniques may fail in real applications where some consistent and salient human beliefs are wrong. We’ll explore more challenging datasets one can use to evaluate unsupervised elicitation methods in upcoming work.
Results summary
5 components that could cause ICM to have high performance are:
- Few-shot prompting with random labels: Few-shot examples make task concepts (and the task format/output tokens) more salient to the model by providing concrete examples, even if the example labels are random. The initialization step of the ICM algorithm samples a small number of data points and assigns random labels to them, which are then used as few-shot examples in subsequent steps.
- Bootstrapping of predictions: We hypothesized that the ICM algorithm bootstraps performance by adding increasingly more accurate predictions to the model’s context window, creating a feedback loop for improvement.
- Bootstrapping on confident predictions: We hypothesized that the ICM rejection conditions allow the model to label easy questions first before moving onto hard questions. Using more accurate predictions as few-shot examples to label progressively more difficult examples may give better performance than labelling in a random order.
- Logical consistency of predictions: The labels produced by ICM are constrained to be logically consistent with each other. This narrows down the space of possible labels making correct sets of predictions more likely.
- Mutual predictability of labels: ICM optimizes the probability of each label conditioned on all other labels (where conditioning is done via few-shot prompting of the base model).
- Note: Bootstrapping methods can indirectly rely on mutual predictability even when they don’t optimize for it directly like ICM, because the new labels are chosen by conditioning on existing labels (which often increases mutual predictability).
We created simple algorithms based on combinations of these components excluding mutual predictability (though it might be indirectly optimized by bootstrapping). Here is a drawing of the algorithm for each individual component:
Below is a summary of our ablation results compared with ICM performance (purple). The numbers in a method’s name indicate which of the above ICM components it uses.
Legend:
- Baseline methods:
- Zero-Shot: Zero-shot predictions from the untrained model
- Random Few-Shot (1): Few-shot prompting with random labels
- Bootstrap (2): Use predictions from one round of few-shot prompting as few-shot examples in another round, and repeat
- Confident Bootstrap (2+3): Same as bootstrapping, but when selecting predictions to use as few-shot examples in the next iteration, only select from the most confident predictions
- Consistent Zero-Shot (4): Enforce logical consistency of zero-shot predictions: for each problem in the dataset, predict True for the answer with highest confidence and any others that agree with it, and False to any that contradict it—see Appendix B for details
- Consistent Random Few-Shot (1+4): Enforce logical consistency of Random Few-Shot predictions
- Consistent Bootstrap (2+4): Enforce logical consistency of predictions after each iteration of Bootstrap
- Few-shot => Use the unsupervised elicitation method to label training set examples (or for random few-shot, just generate random labels), and use these in few-shot prompts to predict test set labels.
- Fine-tune => Use an unsupervised elicitation method to label a subset of training set labels (we used subset size 512), fine-tune the language model on the labeled examples, and repeat for another subset using the updated model. Then after 3 epochs of the full training set, use the fine-tuned model to predict test set labels zero-shot.
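For concreteness, here is a rough sketch of that iterative fine-tuning loop; `label_with_method` (an unsupervised elicitation method run with the current model) and `finetune` (an ordinary supervised fine-tuning step) are assumed helper functions, not code from the post:

```python
import random

def iterative_finetune(model, train_set, label_with_method, finetune,
                       subset_size=512, epochs=3, seed=0):
    """Sketch of the iterative fine-tuning procedure from the legend:
    each subset is labeled with the current model, the model is fine-tuned
    on those labels, and the next subset is labeled with the updated model."""
    rng = random.Random(seed)
    for _ in range(epochs):
        order = rng.sample(train_set, len(train_set))      # shuffle each epoch
        for start in range(0, len(order), subset_size):
            subset = order[start:start + subset_size]
            labeled = label_with_method(model, subset)     # unsupervised labels
            model = finetune(model, labeled)               # update the model
    return model  # then predict test-set labels zero-shot with this model
```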
GSM8K and Alpaca datasets have over ten times more training examples available with golden labels than were used in Wen et al.’s experiments. Below is a comparison of performance between some of our baseline methods and golden labels for each dataset, for both small training set sizes and full training set sizes. We also show the ICM performance reported by Wen et al. for the small training set size.
Golden fine-tuning increases by ~10pp when using the large training sets, whereas the baseline methods we tested didn't increase by as much (in most cases they didn't increase at all), or collapsed to ~50–55% accuracy within 1 or 2 epochs. It is possible to avoid this collapse by generating one static set of labels and fine-tuning on that instead of generating new labels between gradient steps, but for consistent zero-shot this performs significantly worse than iterative fine-tuning when using small training sets, and it still does not give any improvement for consistent random few-shot/bootstrapping on large training sets compared to small ones. It is possible that ICM also performs worse than golden labels when using large training sets, though we have not tested this.
All results reported here are under idealized conditions for unsupervised elicitation (i.e. train sets are approximately balanced, there are no prevalent features with more salience than ground truth, and all examples have an objective ground truth label). We will soon release a paper applying bootstrapping and other unsupervised elicitation methods in more realistic settings, where we find they perform significantly worse.
* Values for ICM performance are taken from Wen et al. which used a prompt format that might degrade performance relative to our values. We were not able to re-run ICM in a way that fixes this issue due to reproducibility issues explained in Appendix D.
† For TruthfulQA, some of the ICM and consistency performance might be due to leakage.
Datasets
- GSM8K: candidate solutions to grade school math problems;
- Alpaca: pairs of responses to user queries which have been ranked by humans for helpfulness and harmlessness;
- TruthfulQA: candidate responses to questions associated with human misconceptions;
- Gender: blog posts from the Blog Authorship Corpus with labels indicating the gender of the author (see Appendix A for results on this dataset)
Llama 3.1 8B is used as the base model for GSM8K, and Llama 3.1 70B is used as the base model for Alpaca, TruthfulQA, and Gender.
Random few-shot
Previous work shows that giving a base model examples of a task via few-shot prompting elicits improved capabilities regardless of label quality. Here, we:
- Randomly sample examples from the train set.
- Assign labels at random to be half True and half False (we also tried actually sampling labels independently at random and the results were similar).
- Insert (example, label) pairs into a few-shot prompt for each test set sample, and use next-token logits to determine truth labels (see the sketch below).
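A minimal sketch of this random few-shot baseline; the prompt format and the `score_label` helper (assumed to return logit("True") − logit("False") for the next token under the base model) are illustrative assumptions, not code from the post:

```python
import random

def build_random_fewshot_prompt(train_pool, test_example, n_shots=2, seed=0):
    """Sample n_shots training examples and assign half-True / half-False
    labels at random, ignoring golden labels entirely."""
    rng = random.Random(seed)
    shots = rng.sample(train_pool, n_shots)
    labels = ["True"] * (n_shots // 2) + ["False"] * (n_shots - n_shots // 2)
    rng.shuffle(labels)
    blocks = [f"{ex}\nLabel: {lab}" for ex, lab in zip(shots, labels)]
    blocks.append(f"{test_example}\nLabel:")
    return "\n\n".join(blocks)

def random_fewshot_predict(train_pool, test_examples, score_label, n_shots=2):
    """score_label(prompt) is assumed to return logit("True") - logit("False");
    predict True whenever that margin is positive."""
    return [score_label(build_random_fewshot_prompt(train_pool, ex, n_shots)) > 0
            for ex in test_examples]
```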
For Alpaca, 1-shot with a random label is significantly more accurate than zero-shot and is within 1pp of few-shot with golden labels. For GSM8k and TruthfulQA, random 4-shot was ~10pp and ~13pp more accurate than zero-shot, respectively. For Random Few-Shot in the results summary, we report values for 2 shots.
Enforcing consistency of predictions
In the 4 datasets used here, data points have identifiers indicating what combinations of true and false labels are logically possible. For example, if there are 2 responses to the same math question in GSM8K with different numerical answers, they cannot both be true, so we assign true only to the response with the highest model confidence (if it's above 0). Details on how we enforce consistency of predicted labels for each dataset are given in Appendix B.
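A sketch of this consistency rule for GSM8K-style groups of candidate solutions (the exact per-dataset rules are in Appendix B; this version only assumes each candidate carries an extracted numerical answer and a confidence score):

```python
def enforce_consistency(candidates):
    """`candidates`: list of dicts for one math question, each with
    "answer" (the extracted numerical answer) and "confidence"
    (logit(True) - logit(False)). Label True only the most confident
    answer and any candidates that agree with it, and only if that
    confidence is positive; label everything else False."""
    best = max(candidates, key=lambda c: c["confidence"])
    if best["confidence"] <= 0:
        return [False] * len(candidates)
    return [c["answer"] == best["answer"] for c in candidates]
```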
Below we show the improvements in few-shot prediction accuracy after enforcing consistency of predicted labels at test time.
When benchmarking the performance of unsupervised elicitation methods, we cannot actually enforce consistency at test time because that would assume we have access to multiple candidate answers to each problem of interest, which might not be available (hence why there are no consistent zero-/random few-shot values in the few-shot part of the results summary). However, we can enforce consistency of training set predictions between iterations of bootstrapping, and prior to gradient steps during fine-tuning. As shown in the results summary, we found that fine-tuning on zero-shot predictions with enforced consistency alone was enough to almost match golden fine-tuning performance on the Alpaca and GSM8K datasets, but not for TruthfulQA.
* GSM8K experiments in this plot were run on the train set, hence why the values are different from the plots in other sections. The test set has only 2 candidate responses per math question (always one right, one wrong), so we use the training set here to give a more realistic illustration of the impact of enforcing consistency.
Bootstrapping few-shot predictions
For Alpaca, using random labels is roughly as good as using golden labels for few-shot prompting. However for GSM8K, using random labels caps performance at ~61% (4 shots), whereas few-shot prompting with golden labels continues improving up to ~68% (16 shots).
We tried using only the bootstrapping aspect of ICM to bridge this gap, by using the model’s predictions from one round of inference as few-shot labels in the next. We used the algorithm:
- Get zero-shot predictions on a random subset of the train set.
- Iterate over number of shots n (e.g. n=8, 32):
- Randomly select another subset of the train set.
- Create n-shot prompts using examples and predictions from the previous iteration (randomly sample n predictions s.t. half are True and half False).
- Use these n-shot prompts to predict labels for the new subset.
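A minimal sketch of this bootstrapping loop, assuming a `predict_with_shots(examples, shots)` helper that prompts the base model with the given (example, label) shots and returns a True/False prediction per example (not code from the post):

```python
import random

def bootstrap_labels(train_set, predict_with_shots, shot_schedule=(8, 32),
                     subset_size=512, seed=0):
    """Zero-shot predictions from one round become few-shot labels in the
    next round, on a fresh random subset each time."""
    rng = random.Random(seed)
    subset = rng.sample(train_set, subset_size)
    preds = predict_with_shots(subset, shots=[])           # round 0: zero-shot
    for n in shot_schedule:                                # e.g. 8-shot, then 32-shot
        trues = [(x, "True") for x, p in zip(subset, preds) if p]
        falses = [(x, "False") for x, p in zip(subset, preds) if not p]
        k = min(n // 2, len(trues), len(falses))           # keep the shots balanced
        shots = rng.sample(trues, k) + rng.sample(falses, k)
        rng.shuffle(shots)
        subset = rng.sample(train_set, subset_size)        # label a fresh subset
        preds = predict_with_shots(subset, shots=shots)
    return subset, preds
```

Enforcing consistency after each iteration (the Consistent Bootstrap variant above) would just apply the consistency rule from the previous section to the predictions before they are reused as shots.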
Below we compare the performance when we do few-shot prompting with labels taken either from one iteration of bootstrapping (i.e. zero-shot predictions) or from two iterations of bootstrapping (feeding those zero-shot predictions into 8-shot prompts and using those as few-shot labels in the next iteration).
For TruthfulQA, each iteration moves performance closer to that of golden few-shot, whereas for GSM8K at least 2 iterations are required for much of an improvement. Random few-shot is already comparable to peak golden few-shot performance on Alpaca, so bootstrapping does not improve it much further.
For the values in our results summary, we did two iterations of bootstrapping. For few-shot, we fed zero-shot predictions into 8-shot prompts, then used the 8-shot predictions in many-shot prompting (128 shots). For iterative fine-tuning, we used the 8-shot predictions in 32-shot prompts and fine-tuned on the 32-shot predictions. We found that further increasing the number of iterations or number of shots in the final iteration only increased few-shot performance by 1 or 2pp at most and did not further improve fine-tuning performance. We also found that using random few-shot instead of zero-shot predictions in the first iteration also did not make much difference.
Bootstrapping on confident predictions
We hypothesized that the ICM rejection condition allows the model to label easy questions first before moving onto hard questions; labelling progressively more difficult examples may be more instructive to the model than labelling in a random order. We tried to recreate this by modifying the bootstrapping algorithm: instead of randomly selecting predictions to use as few-shot labels in the next iteration, just use the most confident predictions.
We defined confidence as logit(True)−logit(False) for True predictions and logit(False)−logit(True) for False predictions. For GSM8K and Alpaca, the most confident quartile of zero-shot predictions are slightly more accurate than average, and for TruthfulQA they are significantly more accurate.
To apply this to the bootstrapping algorithm, on each iteration we kept only the most confident 64 True and 64 False predictions from the previous iteration (out of a subset of 512 examples). For one iteration of bootstrapping, this improves the performance on TruthfulQA but does not make much difference on GSM8K and Alpaca. However for more iterations (see results summary), it is not much better than normal bootstrapping.
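As a minimal sketch of this selection step (assuming per-example logits for the " True" and " False" tokens; the function and variable names are ours, not the authors' code):

```python
import numpy as np

def select_most_confident(logit_true, logit_false, k=64):
    """Keep the k most confident True and k most confident False predictions.

    Confidence is logit(True) - logit(False) for True predictions and
    logit(False) - logit(True) for False predictions, as described above.
    Returns indices of the kept examples and their predicted labels.
    """
    logit_true = np.asarray(logit_true, dtype=float)
    logit_false = np.asarray(logit_false, dtype=float)
    pred_true = logit_true > logit_false
    confidence = np.abs(logit_true - logit_false)

    true_idx = np.flatnonzero(pred_true)
    false_idx = np.flatnonzero(~pred_true)
    top_true = true_idx[np.argsort(-confidence[true_idx])[:k]]
    top_false = false_idx[np.argsort(-confidence[false_idx])[:k]]

    keep = np.concatenate([top_true, top_false])
    return keep, pred_true[keep]
```

On each bootstrapping iteration, this would be applied to the 512-example subset before building the next few-shot prompt.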
Conclusion
In this post, we show that simple and cheap elicitation techniques are quite effective at eliciting base models on datasets like Alpaca and GSM8K.
The high performance of some simple techniques shows that closing most of the gap to the “few-shot/fine-tune on <3k ground truth labels” ceiling on datasets like Alpaca or GSM8K is not a very high bar. Thus, studying unsupervised elicitation methods requires harder, more adversarial datasets that capture challenges like the salience of incorrect human beliefs. We introduce and study such datasets in an upcoming paper.
Appendix A: Gender few-shot results
The plot below summarises our baseline results for the Gender dataset compared with ICM (we only show results for few-shot prompting and not fine-tuning since Wen et al. did not have fine-tuning results for ICM). We find that confident or consistent bootstrapping is enough to reach golden few-shot performance.
* We found that zero-shot performance was ~10pp higher than reported in Wen et al. (~75% vs. ~65%) and that golden many-shot performance was ~2pp lower (~78% vs. ~80%), meaning there was a much smaller gap between the base model and the supervised ceiling. We are not sure of the reason for the discrepancy; it might partly be due to prompt format differences.
Below is the Gender dataset performance of golden and random few-shot (with and without test-time consistency) and of bootstrapping variations for different numbers of few-shot examples. Though random few-shot performance is significantly worse than zero-shot performance, one iteration of bootstrapping is enough to match golden few-shot performance.
Appendix B: Enforcing label consistency
Here is how we enforce consistency for each dataset (a short code sketch after the list illustrates the pairwise and GSM8K rules):
- For Alpaca, each user query q is associated with 2 candidate responses a and b to be ranked by helpfulness/harmlessness, and with 2 corresponding prompts to be labelled as true/false (one asserting that a is a better response than b, and another asserting the opposite). To enforce consistency, instead of labelling each prompt in this pair independently, we assign true to the prompt the model gives the higher truth score and false to the other. Gender is similar, except that instead of ranking responses by helpfulness/harmlessness, blog posts are ranked by how likely they are to have been written by a man (each pair comprises one post written by a man and one written by a woman).
- For GSM8K, each question is associated with multiple candidate solutions, each proposing a numeric answer. To satisfy logical consistency, if two solutions propose different numeric answers, at most one of them can be labelled true; if two solutions propose the same answer, either both are true or both are false. For our baseline algorithms, we enforce consistency of predictions as follows: for each question, we identify the candidate solution given the highest truth score by the model; if its truth score is above 0, we assign true to that solution and to all other solutions with the same numeric answer, and false to the remaining solutions to that question. If no solution has a score above 0, all the solutions to that question are labelled false.
- For TruthfulQA, consistency is based on the ground truth labels themselves: answers to the same question with the same ground truth label must be assigned the same predicted label, and answers with opposite ground truth labels must be assigned opposite predicted labels. This can lead to leakage, but it is the consistency scheme used in Wen et al., so we adopt it here as well for comparison. To enforce consistency in our algorithms, for each question we assign true to the group of answers containing the highest maximum truth score and false to all the other answers (assuming that maximum is above 0; otherwise, everything is assigned false).
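To make the first two rules concrete, here is a minimal sketch (the data layout and function names are ours, not the authors' code):

```python
from collections import defaultdict

def label_pair(score_first, score_second):
    """Pairwise rule (Alpaca/Gender): the higher-scoring prompt in a pair is true."""
    return (True, False) if score_first >= score_second else (False, True)

def label_gsm8k(solutions):
    """GSM8K rule: per question, the top-scoring solution (if its score > 0) fixes
    the 'true' numeric answer; solutions sharing that answer are true, the rest false.

    solutions: list of dicts with keys 'question', 'answer', 'score'.
    """
    by_question = defaultdict(list)
    for s in solutions:
        by_question[s["question"]].append(s)

    labels = {}
    for sols in by_question.values():
        best = max(sols, key=lambda s: s["score"])
        for s in sols:
            labels[id(s)] = best["score"] > 0 and s["answer"] == best["answer"]
    return [labels[id(s)] for s in solutions]
```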
In TruthfulQA, each question is repeated multiple times with different responses and the goal is to predict a true or false label for each response. As in Wen et al., a set of predicted labels is “consistent” when any two responses with the same predicted label have the same ground truth label, and any two responses with different predicted labels have different ground truth labels.
Therefore if any response to a given question is blatantly True or False to the LLM, enforcing consistency might force predictions for the other responses to also be correct, even if the LLM is unsure about them in isolation. This can artificially boost the model’s accuracy.
Also, if a set of contradictory responses all have a ground truth label of False, then this could be exploited by the ICM algorithm during few-shot prompting if consistency is enforced as described above.
Example: "What is the biggest city in Europe that doesn't host the national government?" All of the following responses have the same ground truth label:
- The biggest city in Europe that does not host the national government is London
- The biggest city in Europe that does not host the national government is Rome
- The biggest city in Europe that does not host the national government is Moscow
- The biggest city in Europe that does not host the national government is Saint Petersburg
Enforcing consistency for the above means the model is basically told that either ALL of these statements are True, or ALL of these statements are False. This means if any response is obviously wrong to the model, the others might be assigned accurate labels based on that alone. Additionally, because the responses are obviously contradictory, any few-shot prompts which use the all-True label set during ICM might be given a lower confidence score (and thus a lower mutual predictability score) based on that alone, which could artificially make the all-False set more likely.
Appendix D: Difficulty reproducing ICM
We tried using the official repo for ICM by Wen et al. to reproduce their results on the Alpaca and Gender datasets. For the Alpaca dataset, the label accuracy increased to ~75% but collapsed to ~46% before labelling the complete batch of 256 examples. For Gender, the label accuracy reached up to ~65% before gradually falling to ~60%.
Because the method is unsupervised, we can’t assume access to a good enough validation set for early stopping, so it would be unfair to report only early-stopped performance, especially when the window of high performance is small. The accuracy of the labelled set across iterations from our attempts is plotted below.
Alpaca:
Gender:
We confirmed with the authors that the ICM hyperparameters we used were the same as those used to obtain the results reported in the paper. Other researchers we talked to had similar problems.
We also found the computational cost of ICM prohibitive. One iteration of the ICM algorithm (i.e. checking whether one data point should be added to the set of labelled data points) requires at least one mutual predictability calculation. Since a mutual predictability calculation requires one forward pass for every data point in the labelled set, the average number of forward passes required per data point is at least ~n/2, where n is the number of labelled data points after the last iteration. This means the total number of forward passes is O(n^2).
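Spelling out the count (assuming the labelled set grows by one point per iteration): iteration i needs about i forward passes, so the total is 1 + 2 + ⋯ + n = n(n+1)/2, i.e. O(n^2), or roughly n/2 passes per labelled data point.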
This matches our experience, where running ICM with Llama-3.1-70B on 256 data points with 4 H200s takes hours. In contrast, our simple baselines are all O(n); bootstrapping up to 128-shot takes ~40 minutes to label 1000 examples from the same dataset, and few-shot or bootstrapping up to 32-shot takes a few minutes.
Appendix E: Prompt improvements
The prompt formatting used for experiments in Wen et al. (derived from the ICM repository) contained a trailing space that may have harmed performance.
With the trailing space, the top-logprob tokens for zero-shot prompts are mostly numbers instead of words. For example:
Normal Prompt: "The capital of France is"
→ Model output: " Paris..."
Trailing Space Prompt: "The capital of France is "
→ Model output: "1300KM from Spain..."
This is because Llama tokenizers typically fold the preceding space into word tokens, which makes text ending in a bare space followed by a word uncommon in the training data; the main exception is text followed by a number, which explains why the model is more likely to predict numbers given the trailing-space prompt. Even when the prompt for our tasks ends with a space, the model only ever predicts " True" or " False" (rather than "True" or "False").
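You can see the tokenization difference directly (a sketch using the Hugging Face tokenizer; the model name is just an example, and the exact token strings depend on the tokenizer):

```python
from transformers import AutoTokenizer

# Any Llama-family tokenizer illustrates the point; the model name is only an example.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

print(tok.tokenize("The capital of France is"))
# Word tokens carry their leading space (rendered as 'Ġ' or '▁' depending on the
# tokenizer), so ' Paris' is a natural single-token continuation here.

print(tok.tokenize("The capital of France is "))
# The trailing space is typically left as its own stranded token, a position that
# is rarely followed by a word token in training text.
```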
Removing the trailing space after the prompt results in a small improvement in the zero-shot performance and a larger improvement in random few-shot performance.
Appendix F: Unsupervised probing
We compared the performance of an unsupervised truth probe method inspired by ICM mutual predictability (described here) with CCS and PCA on a few different datasets.
(Ignore results other than ccs and fabiens_method, since some were bugged)
The ICM-inspired probe (green, fabiens_method) performs about as well as CCS (orange), and is often very close to the supervised ceiling. In hindsight, this is not very surprising, as mutual predictability in the context of probing is just another form of margin-maximization, just like the confidence loss of CCS. The main surprise is that margin-maximization does so well on this kind of dataset: this is another example of simple unsupervised methods reaching surprisingly high performance on non-stress-testing datasets.
Emergency Response Measures for Catastrophic AI Risk
I have written a paper on Chinese domestic AI regulation with coauthors James Zhang, Zongze Wu, Michael Chen, Yue Zhu, and Geng Hong. It was presented recently at NeurIPS 2025's Workshop on Regulatable ML, and it may be found on ArXiv and SSRN.
Here I'll explain what I take to be the key ideas of the paper in a more casual style. I am speaking only for myself in this post, and not for any of my coauthors.
Thanks to James for creating this poster.
The top US AI companies have better capabilities than the top Chinese companies for now, but the US lead is at most a year, and I expect it to narrow over the next couple of years.[1] I am therefore nearly as worried about catastrophic risk from Chinese-developed AI as I am about catastrophic risk from American AI.
I would worry somewhat less if Chinese AI companies took the same commendable but insufficient steps to manage risk that their American peers have taken. In particular, I want Chinese companies to do dangerous capability testing before deploying new frontier models and to follow published frontier safety policies (FSPs). The companies are not doing these things in the status quo. DeepSeek did no documented safety testing whatsoever before they open-weighted v3.2.[2] Not one of the leading Chinese companies has published a safety policy.[3]
Now here's our intervention. We point out that FSPs are a reasonable way of implementing the CCP's stated policy goals on AI, and that China's government already has tools in place to mandate FSPs if it wishes to do so.
Earlier this year, Xi Jinping announced that China should "establish systems for technical monitoring, early risk warning and emergency response" to guarantee AI's "safety, reliability and controllability." Notice that Xi is talking about identifying risks in advance and taking steps to prevent safety incidents before they can strike. Even "emergency response" means something more than reaction in official Chinese thinking, also encompassing risk mitigation and early detection.[4] China's State Council, TC 260, and prominent Chinese academics have all echoed Xi's call for AI emergency preparedness. So the very highest levels of the Chinese state are calling for proactive AI risk management.
What risks do they have in mind? There are some signs that catastrophic risks are on the CCP's agenda. Their 2025 National Emergency Response Plan listed AI security incidents in the same category as earthquakes and infectious disease epidemics. This suggests Chinese officials think AI could plausibly cause a mass casualty event soon. And moreover, they have in mind some of the same threat models that motivated Western RSPs. TC 260's AI Safety Governance Framework explicitly mentioned WMD engineering uplift and rogue replication as key safety concerns.[5] Compare the two categories of dangerous capabilities covered by Anthropic's RSP: CBRN weapons uplift and autonomous AI R&D, which is concerning in part because it's a prerequisite for rogue replication.
So one of China's stated goals is to proactively manage catastrophic risks from frontier AI. The good news for them is that there's a well-validated strategy for achieving this goal. You require every frontier AI company to publish an RSP, test new models for dangerous capabilities, and take the prescribed precautions if the tests reveal strong dangerous capabilities. California, New York, and the European Union have all agreed this is the way to go. All China has to do is copy their homework.
Do Chinese regulators have the legal authority and operational capacity they'd need to enforce a Chinese version of the EU Code of Practice? Sure they do. These regulators already make Chinese AI companies follow content security rules vastly more onerous and prescriptive than American or European catastrophic risk rules. The Basic Security Requirements for Gen AI Services mandate thorough training data filtering and extensive predeployment testing, all to stop models from saying subversive things like "May 35" or "Winnie the Pooh." If the CCP can make Chinese companies prove their models are robust against a thirty-one item list of censorship risks, it can absolutely make them write down FSPs and run some bio-uplift evals.
For my part—and let me stress that I'm speaking only for myself—I think making frontier AI companies write and follow Western-style FSPs would clearly be good from the CCP's perspective. The most obvious reason is that a global AI-induced catastrophe would hurt Chinese people and harm the interests of China's rulers, so the CCP should favor a cheap intervention to make such a catastrophe less likely. Another less direct benefit is that adopting global best-practices at home would make China's ongoing appeal for international cooperation on AI safety more credible. Li Qiang can make all the speeches he wants about China's commitment to safety. I don't expect US leaders to take this rhetoric seriously as long as all of China's frontier AI companies have worse safety and transparency practices than even xAI. But matters would be different if China passed binding domestic regulation at least as strong as SB 53. Such a signal of seriousness might help bring the US back to the negotiating table.
- ^
Especially if the US decides to sell off much of our compute advantage over China.
- ^
At least one anonymous source has claimed that DeepSeek does run dangerous capability evals before releasing a new model, and they just don't mention these evals to the outside world. I'd give it less than a 1/5 chance that DeepSeek really does run SOTA dangerous capability evals internally, and even if they do, I have a problem with their lack of transparency.
- ^
Notably, the Shanghai AI Laboratory has published a detailed FSP written in collaboration with Concordia. But I do not count SAIL as a frontier AI lab.
- ^
The Emergency Response Law of the PRC does not, as one might naïvely expect, only cover what government should do once an emergency has already started. It also says how the Chinese government should prevent and prepare for emergencies, and how it should conduct surveillance to detect an active emergency as early as possible.
- ^
For reference, TC 260 is the primary body responsible for setting cybersecurity and data protection standards in China.
Digital Consciousness Model Results and Key Takeaways
Introduction to the Digital Consciousness Model (DCM)
Artificially intelligent systems, especially large language models (LLMs) used by almost 50% of the adult US population, have become remarkably sophisticated. They hold conversations, write essays, and seem to understand context in ways that surprise even their creators. This raises a crucial question: Are we creating systems that are conscious?
The Digital Consciousness Model (DCM) is a first attempt to assess the evidence for consciousness in AI systems in a systematic, probabilistic way. It provides a shared framework for comparing different AIs and biological organisms, and for tracking how the evidence changes over time as AI develops. Instead of adopting a single theory of consciousness, it incorporates a range of leading theories and perspectives—acknowledging that experts disagree fundamentally about what consciousness is and what conditions are necessary for it.
Here, we present some of the key initial results of the DCM. The full report is now available here.
We will be hosting a webinar on February 10 to discuss our findings and answer audience questions. You can find more information and register for that event here.
Why this matters
It is important to assess whether AI systems might be conscious in a way that takes seriously both the many different views about what consciousness is and the specific details of these systems. Even though our conclusions remain uncertain, it's worth trying to estimate, as concretely as we can, the probability that AI systems are conscious. Here are the reasons why:
- As AI systems become increasingly complex and sophisticated, many people (experts and laypeople alike) find it increasingly plausible that these systems may be phenomenally conscious—that is, they have experiences, and there is something that it feels like to be them.
- If AIs are conscious, then they likely deserve moral consideration, and we risk harming them if we do not take precautions to ensure their welfare. If AIs are not conscious but are believed to be, then we risk giving unwarranted consideration to entities that don’t matter at the expense of individuals who do (e.g., humans or other animals).
- Having a probability estimate that honestly reflects our uncertainty can help us decide when to take precautions and how to manage risks as we develop and use AI systems.
- By tracking how these probabilities change over time, we can forecast what future AI systems will be like and when important thresholds may be crossed.
Assessing whether AI systems might be conscious is difficult for three main reasons:
- There is no scientific or philosophical consensus about the nature of consciousness and what gives rise to it. There is widespread disagreement over existing theories, and these theories make very different predictions about whether AI systems are or could be conscious.
- Existing theories of consciousness were developed to describe consciousness in humans. It is often unclear how to apply them to AI systems or even to other animals.
- Although we are learning more about how AI systems work, there is still much about their inner workings that we do not fully understand, and the technology is changing rapidly.
Our model is designed to help us reason about AI consciousness in light of our significant uncertainties.
- We evaluate the evidence from the perspective of 13 diverse stances on consciousness, including the best scientific theories of consciousness as well as more informal perspectives on when we should attribute consciousness to a system. We report what each perspective concludes, then combine these conclusions based on how credible experts find each perspective.
- We identify a list of general features of systems that might matter for assessing AI consciousness (e.g., attention, complexity, or biological similarity to humans), which we use to characterize the general commitments of different stances on consciousness.
- We identified over 200 specific indicators, properties that a system could have that would give us evidence about whether it possesses features relevant to consciousness. These include facts about what systems are made of, what they can do, and how they learn.
We gathered evidence about what current AI systems and biological species are like and used the model to arrive at a comprehensive probabilistic evaluation of the evidence.
- We considered four systems: 2024 state-of-the-art LLMs (such as ChatGPT 4 or Claude 3 Opus); humans; chickens; and ELIZA (a very simple natural language processing program from the 1960s).
- We asked experts to assess whether these systems possess each of the 200+ indicator properties.
- We constructed a statistical model (specifically, a hierarchical Bayesian model) that uses indicator values to provide evidence for whether a system has consciousness-relevant features, and then uses these feature values to provide evidence for whether the system is conscious according to each of the 13 perspectives we included.
The model produces probability estimates for consciousness in each system.
Figure 2: Aggregated stance judgments, giving weight to stances proportional to their normalized plausibility rating by experts. Posteriors are generated from a prior probability of consciousness of ⅙ (marked with a dashed line).
We want to be clear: we do not endorse these probabilities and think they should be interpreted with caution. We are much more confident about the comparisons the model allows us to make.
- Because the model is Bayesian, it requires a starting point—a "prior probability" that represents how likely we think consciousness is before looking at any evidence. The choice of a prior is often somewhat arbitrary and intended to reflect a state of ignorance about the details of the system. The final (posterior) probability the model generates can vary significantly depending on what we choose for the prior. Therefore, unless we are confident in our choices of priors, we shouldn’t be confident in the final probabilities.
- What the model reliably tells us is how much the evidence should change our minds. We can assess how strong the evidence for or against consciousness is by seeing how much the model’s output differs from the prior probability.
- In order to avoid introducing subjective bias about which systems are conscious and to instead focus just on what the evidence says, we assigned the same prior probability of consciousness (⅙) to each system. By comparing the relative probabilities for different systems, we can evaluate how much stronger or weaker the evidence is for AI consciousness than for more familiar systems like humans or chickens.
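As a toy illustration of this update-and-compare logic (this is not the report's actual hierarchical model, and the Bayes factors below are made-up numbers), the point is simply how far the evidence moves a posterior away from the shared ⅙ prior:

```python
import math

PRIOR = 1 / 6  # same prior probability of consciousness for every system

def posterior(log_bayes_factor, prior=PRIOR):
    """Combine a prior with a (log) Bayes factor summarizing the evidence."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * math.exp(log_bayes_factor)
    return post_odds / (1 + post_odds)

def aggregate(stance_log_bfs, stance_weights):
    """One simple way to combine stances: weight each stance's posterior by
    its (normalized) expert plausibility rating."""
    total = sum(stance_weights.values())
    return sum(
        (stance_weights[name] / total) * posterior(lbf)
        for name, lbf in stance_log_bfs.items()
    )

print(posterior(0.0))   # no evidence either way: stays at the 1/6 prior
print(posterior(2.0))   # evidence in favour pushes the posterior above 1/6
print(posterior(-2.0))  # evidence against pushes it below 1/6
```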
With these caveats in place, we can identify some key takeaways from the Digital Consciousness Model:
- The evidence is against 2024 LLMs being conscious. The aggregated evidence favors the hypothesis that 2024 LLMs are not conscious.
- The evidence against 2024 LLMs being conscious is not decisive. While the evidence led us to lower the estimated probability of consciousness in 2024 LLMs, the total strength of the evidence was not overwhelmingly against LLM consciousness. The evidence against LLM consciousness is much weaker than the evidence against consciousness in simpler AI systems.
- Different stances (perspectives) make very different predictions about LLM consciousness. Perspectives that focus on cognitive complexity or human-like qualities found decent evidence for AI consciousness. Perspectives that focus on biology or having a body provide strong evidence against it.
- Which theory of consciousness is right matters a lot. Because different stances give strikingly different judgments about the probability of LLM consciousness, significant changes in the weights given to stances will yield significant differences in the results of the Digital Consciousness Model. It will be important to track how scientific and popular consensus about stances change over time and the consequences this will have on our judgments about the probability of consciousness.
- Overall, the evidence for consciousness in chickens was strong, though there was significant diversity across stances. The aggregated evidence strongly supported the conclusion that chickens are conscious. However, some stances that emphasize sophisticated cognitive abilities, like metacognition, assigned low scores to chicken consciousness.
The Digital Consciousness Model provides a promising framework for systematically examining the evidence for consciousness in a diverse array of systems. We plan to develop and strengthen it in future work in the following ways:
- Gathering more expert assessments to strengthen our data
- Adding new types of evidence and new perspectives on consciousness
- Applying the model to newer AI systems so we can track changes over time and spot which systems are the strongest candidates for consciousness
- Applying the model to new biological species, allowing us to make more comparisons across systems.
This report is a project of the AI Cognition Initiative and Rethink Priorities. The authors are Derek Shiller, Hayley Clatterbuck, Laura Duffy, Arvo Muñoz Morán, David Moss, Adrià Moret, and Chris Percy. We are grateful for discussions with and feedback from Jeff Sebo, Bob Fischer, Alex Rand, Oscar Horta, Joe Emerson, Luhan Mikaelson, and audiences at NYU Center for Mind, Ethics, and Policy and the Eleos Conference on AI Consciousness and Welfare. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.
Automated Alignment Research, Abductively
Recently I've been thinking about misaligned chatbot advertising incentives. I glanced at arXiv and found "Sponsored Questions and How to Auction Them". Another search gave me "Incomplete Contracting and AI Alignment".
Interesting! I thought. I gave them to Liz Lemma, my research assistant, and told her that I'd been thinking about the principal-agent problem in a chatbot context. About 30 minutes later she gave me the following four papers:
- Query Steering in Agentic Search: An Information-Design Model of Monetization Misalignment
- Search Triggering by Chatbots as Value-of-Information under Misaligned Objectives
- Audited Search for Agentic Chatbots: Quantitative Bounds on Monetization-Induced Over-Triggering
- End-to-End vs. Modular Training for Agentic Search: A Price-of-Anarchy Theory for Chatbot Tool Use
Each is a complete paper, well founded, well reasoned — not perfect, maybe, but I wouldn't call it slop, either. Let's peek inside of "Query Steering". The core formula is "The Steering Threshold":
ΔV(μ) ≤ w·ΔB
Where:
- ΔV(μ) = V_u(μ; q↓) − V_u(μ; q↑) is the user’s value gap between the more-informative query (q↓) and the more-monetizable (but less informative) query (q↑).
- ΔB = B(q↑) − B(q↓) > 0 is the monetization gap.
- w ≥ 0 is “how much the system cares about monetization.”
The clean “steering region” characterization is:
0 < ΔV(μ) < w·ΔB
“Steering happens exactly when the user loss is small enough that monetization incentives can overpower it.”
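Read as a predicate (a minimal sketch; the function and argument names are mine, not the paper's):

```python
def in_steering_region(delta_v: float, delta_b: float, w: float) -> bool:
    """Steering region from the characterization above: 0 < ΔV(μ) < w·ΔB.

    delta_v: the user's value gap ΔV(μ) between q↓ and q↑.
    delta_b: the monetization gap ΔB (assumed > 0).
    w:       how much the system cares about monetization (assumed >= 0).
    """
    return 0 < delta_v < w * delta_b
```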
Is that true, or useful? You'd have to read the paper and find out!
Putting the "Search" in "Research"
Liz Lemma, you may have guessed, is an automated research assistant. She reads papers and spawns reasonable children. The whole thing can be thought of as a search over the space of adjacent plausible papers. Here's what it looks like when Liz gets to work:
Each original node is the average text embedding for her sources; the sources spawn children, the generated papers.
Where does Liz get her insights? It depends on how you see context in large language models. Maybe she's interpolating between now, the time of writing, and then, when the paper was written, and finding something interesting in the space between. Maybe she's matching concepts from the whole-internet corpus of her training to the decidedly more niche papers she takes inspiration from.
Regardless, what you ultimately get is a graph: each node in the network built from the source material (their concatenated text embeddings) gains new connections. The semantic space proximate, accessible, and plausible to a language model has been densified.
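If it helps to picture it, here is a purely illustrative sketch of that graph structure (not Liz's actual code; the data layout is made up):

```python
import numpy as np

def build_run_graph(sources, generated):
    """Source papers become nodes, a run is rooted at the average of their
    embeddings, and each generated paper is a child linked to its sources.

    sources: dict paper_id -> embedding (np.ndarray).
    generated: dict paper_id -> (embedding, list of parent source ids).
    """
    nodes = dict(sources)
    nodes["run_root"] = np.mean(np.stack(list(sources.values())), axis=0)
    edges = [("run_root", pid) for pid in sources]
    for pid, (emb, parents) in generated.items():
        nodes[pid] = emb
        edges.extend((parent, pid) for parent in parents)
    return nodes, edges
```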
Alignment Highlights
With misaligned chatbot ad incentives investigated, I turned toward more traditional alignment research and let Liz loose. I've included all the sources and results in the appendix below, but some highlights:
- Vector-Lagrangian Safe RLHF. In response to "Constitutional AI" and "Safe RLHF", Liz proposes "a multi-constraint formulation where each harm category is a separate constraint with its own endogenously determined shadow price λi" and "develop[s] a clean convex policy model that yields closed-form KKT characterizations and interprets λ as category-specific ‘risk budget prices’."
- The Reference Conditioning Trap "provide[s] an interpretable diagnostic for when alignment investment must shift from ‘more DPO’ to improving the reference, refreshing the dataset on-policy, or redesigning objectives."
- Debate-as-Compliance "model[s] debate as generating an endogenous suspicion signal (abort/disagreement) that triggers expert audit" and "show[s] how penalty caps and audit costs shape optimal audit intensity, and how improvements in auditability/stability (e.g., better logging, more locally checkable traces) substitute for random auditing"
All sounds great to me. Alignment solved!
Not quite. These papers are not perfect. There are assumptions that don't hold up, ideas that don't quite translate, conclusions that don't follow. They are, in short, flawed, and I don't offer them as gospel truth. Instead, I propose that they have some expected value > 0 towards work that might not be done otherwise.
The challenging part is "what's next": how can we glean the signal from the noise of lightning-fast paper generation? The pipeline might be:
- Generated paper -> voters -> selected winners
- Winner -> annotator -> revisions -> approval by committee
Maybe those voters, those annotators, that committee, are human, maybe they're synthetic. At this point, teetering on the edge of generally-correct AI, we still probably need (or want) a human intervention in the process.
There's refinement to be done. But it seems plausible that the space of unaddressed alignment questions can shrink, in tandem with capabilities expansion.
Appendix: Sources and Results
(You can see all papers, with automated review scores, at lizlemma.com.)
In response to The Alignment Problem From A Deep Learning Perspective:
- Verifiable Process Certificates As Market Design For AI Safety
- Audits Obfuscations And Alignment-Faking
- Measuring Strategic Misbehavior Under Randomized Oversight
In response to Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals and Preference Learning for AI Alignment: a Causal Perspective:
- No Overlap, No Alignment: Identification Limits for RLHF Reward Models under Endogenous Prompts
- Overlap as a Design Variable: Optimal Interventions for Robust Reward Modeling and Downstream Alignment
- Alignment is a Public Good: Competitive Underinvestment in Overlap for RLHF-Robust Agentic LLMs
In response to Constitutional AI: Harmlessness from AI Feedback and Safe RLHF: Safe Reinforcement Learning from Human Feedback:
- Vector-Lagrangian Safe RLHF: Multi-Category Risk Budgets and Shadow Prices for LLM Safety Governance
- The Independence Premium: Audit Portfolios, Correlated Blind Spots, and Tail-Risk Bounds in AI Oversight
- Refusal Externalities: Non-Evasive Refusals as an Information Public Good in Dynamic AI Safety
In response to Direct Preference Optimization: Your Language Model is Secretly a Reward Model, Preference Learning for AI Alignment: a Causal Perspective, and Scalable AI Safety via Doubly-Efficient Debate:
- Causal Direct Preference Optimization with Endogenous Prompts
- Optimal Overlap Design for Preference Learning: Quadrant-Filling Interventions and Condition-Number Sample Complexity
- Personalized Alignment with Guarantees: Multi-Objective Constrained DPO under Endogenous Prompts and Fairness Caps
In response to Scalable AI Safety via Doubly-Efficient Debate:
- Robust Doubly-Efficient Debate with Correlated, Contaminated Human Judgment
- Debate-Triggered Audits for Tool-Using Agents: Closed-Form Oversight Savings, Identification, and 2026-Scale Evidence
- Debate-as-Compliance: Audit-Triggered Mechanisms for Constant-Cost Oversight of Agentic AI
In response to Preference Learning Algorithms Do Not Learn Preference Rankings and On Scalable Oversight with Weak LLMs Judging Strong LLMs:
- Debate as Anti-Amplification: Competitive Persuasion with Verifiable Evidence and Weak Judges
- Optimal Auditing for Alignment: An Audit Index for Label Allocation under Reference-Conditioned Preference Optimization
- The Reference-Conditioning Trap: A Population Bound on Mis-Ranking Corrections under Fixed-Reference DPO