Вы здесь
Новости LessWrong.com

Two Mathematical Perspectives on AI Hallucinations and Uncertainty
A recent OpenAI preprint (and blogpost) examines the sources of AI hallucinations. This reminded me of a 2024 preprint which was similar in scope. These two papers use different mathematical lenses but arrive at complementary conclusions about the necessity of uncertainty expressions in AI systems. I briefly review and compare them here.
[disclaimer: I am not an author of any of those papers, so my understanding may not be fully correct]
Paper 1: The Consistent Reasoning ParadoxThe first paper examines AI systems through the lens of theoretical computer science. The authors define an AGI as a system (implemented on a Turing machine) that passes the Turing test, which implies it must reason consistently—answering the same question identically regardless of how it's phrased—as, at least ideally, humans should behave too.
Their formal setup considers arithmetic problems where numerical constants can be expressed in different ways, representing different "sentences" for the same problem. Changing the values of the numerical constants leads to a "collection" of problems.
Working within the framework of Turing machines and computable numbers, they establish five main results, which together constitute what they call the Consistent Reasoning Paradox (CRP). The following are my summaries of the results; see the paper for the precise formulation:
CRP I: For any family of sentences, consisting of a single problem for each problem family, there exists one narrow AI system that answers them correctly (but it may not answer the other cases).
CRP II: An AGI that attempts to answer all sentences (or formulations) consistently will necessarily produce incorrect answers infinitely often on these same problems.
CRP III: There exist problems where an AGI (in the sense above) cannot determine with >50% confidence whether its own solution is correct
CRP IV: There are cases where both the narrow AI and the AGI will be correct, but where neither will be able to provide a logical explanation for the answer.
CRP V: Given a computational budget M, there exists an AI that can either provide a correct answer with "I know" or abstain with "I don't know." The thinking time affects the proportion of questions where abstention occurs.
The paper is not concerned with how the AI/AGI systems mentioned in the statements above are trained (or programmed)—the conclusions are absolute and come from the assumptions and the considered set of problems (coincidentally, I am not clear if the results apply beyond problems including mathematical constants, or that have a computational/mathematical nature).
The paper's conclusion is that trustworthy AI must implicitly compute an "I don't know" function to avoid the inevitable errors that come with consistent reasoning.
Paper 2: Statistical Inevitability of HallucinationsThe second paper considers AI models that do density estimation (which includes modern auto-regressive large language models, but not only those). Then, they use statistical learning theory to establish theoretical bounds on the rate of hallucinations, showing that this is non-zero in case the loss function used is cross-entropy from a training dataset.
Their analysis has two main components:
Pre-training inevitability: The authors establish a reduction from the density estimation problem to a binary classification, where the model's assigned probability to a sample is used to classify it as valid or erroneous. Using this reduction, a bound on the probability of generation of erroneous samples can be established; this bound depends on the missclassification error and on constants. Then, the authors argue that the constants are small in the case of cross-entropy loss from a training dataset. Therefore, the conclusion is that a pure base model will necessarily produce hallucinations, even with error-free training data. Of course, the training data is not perfect, it is hard to find the global optimum of the loss function, and we may expect distribution shift between training and test setups, all of which is likely to make the problem worse.
These findings are not really surprising and come from a relatively simple application of well-understood learning theory concepts and results. However, their clear application to the case of density estimation models is illuminating.
Post-training persistence: Of course, existing state-of-the-art models are not pure base models: post-training techniques could alleviate hallucinations. Nevertheless, the authors argue that evaluation benchmarks reward overconfidence and therefore do not incentivise post-training to eradicate hallucinations. Indeed, current benchmarks and evaluation procedures reward models for guessing when uncertain, similar to how standardised tests reward students for attempting every question. The authors argue that fixing hallucinations requires modifying all benchmarks to explicitly penalise errors and reward appropriate abstention. In their view, it is not sufficient to have hallucination evaluations complementary to the primary ones. Moreover, the penalisation should be explicitly mentioned in the instructions, to make it clear to the model. As a consequence, models will be incentivised to become more calibrated and learn to abstain.
Overall messageThe joint takeaway message is: models that never fail (and answer consistently) must be able to say "I don't know" (paper 1), but this does not occur by default within the current training pipeline (paper 2).
In particular, paper 1 says that an AI with an "I don't know" function that operates within a computational budget exists. The statistical paper suggests modifying benchmark scoring to explicitly reward abstention when uncertain, essentially training models to develop this capability.
Interestingly, these results do not rely on the current model architecture: paper 1 assumes that the AI system is a computable program, while paper 2 assumes only that it is trained to perform density estimation.
Appendix: side-by-side comparison of approaches and findingsThe papers differ significantly in their theoretical frameworks:
Modelling assumptions:
- CRP treats AI as Turing machines and focuses on problems involving computable numbers with different representations
- The statistical paper considers finite-domain density estimation models trained with standard loss functions
Mathematical tools:
- CRP uses theoretical computer science and computability theory to derive absolute impossibility results
- The statistical paper uses learning theory to derive probabilistic lower bounds on error rates
Scope of results:
- CRP makes existence claims ("there exists an AI" or "will hallucinate infinitely often") about theoretical possibilities
- The statistical paper provides quantitative bounds applicable to models trained through standard procedures
Discuss
Accelerando as a "Slow, Reasonably Nice Takeoff" Story
When I hear a lot of people talk about Slow Takeoff, many of them seem like they are mostly imagining the early part of that takeoff – the part that feels human comprehensible. They're still not imagining superintelligence in the limit.
There are some genres of Slow Takeoff that culminate in somebody "leveraging controlled AI to help fully solve the alignment problem, eventually get fully aligned superintelligence, and then end the acute risk period."
But the sort of person I'm thinking of, for this blogpost, usually doesn't seem to have a concrete visualization of something that could plausibly end the period where anyone could choose to deploy uncontrolled superintelligence. They tend to not like Coherent Extrapolated Volition or similar things.
They seem to be imagining a multipolar d/acc world, where defensive technologies and balance of power is such that you keep getting something like a regular economy running. And even if shit gets quite weird, in some sense it's still the same sort of things happening as today.
I think this world is unlikely. But, I do think it'd be good for someone to write a really fleshed out takeoff story and/or forecast that runs with those assumptions.
Unfortunately, slow takeoff stories take longer so there's a lot more moving parts, you have to invent future politics and economics and how they play out together.
But, fortunately, someone... kinda already did this?
It's a novel called Accelerando. It was written between 2001 and 2005. And the broad strokes of it still feel kinda reasonable, if I'm starting with multipolar d/acc-ish optimistic assumptions.
A thing that is nice about Accelerando is that it wasn't written by someone particularly trying to achieve a political outcome, which reduces an important source of potential bias. On the flipside, it was written by someone trying to tell a good human-comprehensible story, so, it has that bias instead. (It contains some random elements that don't automatically follow from what we currently know to be true).
It has lots of details that are too specific for a random sci-fi author in 2001 to have gotten right. But, I think reading through it is helpful for getting some intuitions about what an AI-accelerated world might look and feel like.
It's probably worth reading the book if you haven't (you can buy it here). But, it contains some vignettes in each chapter that make for a decent summary of the broad strokes. I've compiled some excerpts here that I think make for an okay standalone experience, and I've tried to strip out most bits that spoil the human-centric plot.
(It was hard to strip out all spoilers, but, I think I leave enough gaps you'll still have a good time reading the novel afterwards)
The story is more optimistic than seems realistic to me. But, it's about as optimistic a world as feels plausibly coherent to me that takes place in a centrally multipolar d/acc-ish world that doesn't route through "someone actually builds very powerful friendly AI that is able to set very strong, permanent safeguards in place."
Part 1: "Slow Takeoff"In Accelerando, a decade passes between each chapter. It starts approximately 2020.
(The forecasted timing is somewhat off but I bet not too far behind. Most of the tech that exists in chapter 1 could probably be built today, but just barely, and it hasn't reached the level of saturation implied in the novel.
Chapter 1The first chapter's vignette is the most character focused (later ones read more like a news bulletin). But, I think it's kind of useful to have the anchor of a specific guy who lives on the cutting edge of the future.
I think this is supposed to take place in the 2010s, which is... early. I think most of the tech here just barely exists today, but without quite as much market saturation as the book implies, but would probably exist in 1-8 years.
Remember this is written in 2001.
Manfred has a suite at the Hotel Jan Luyken paid for by a grateful multinational consumer protection group, and an unlimited public transport pass paid for by a Scottish sambapunk band in return for services rendered.
He has airline employee’s travel rights with six flag carriers despite never having worked for an airline. His bush jacket has sixty-four compact supercomputing clusters sewn into it, four per pocket, courtesy of an invisible college that wants to grow up to be the next Media Lab.
His dumb clothing comes made to measure from an e-tailor in the Philippines he’s never met. Law firms handle his patent applications on a pro bono basis, and, boy, does he patent a lot—although he always signs the rights over to the Free Intellect Foundation, as contributions to their obligation-free infrastructure project.
In IP geek circles, Manfred is legendary; he’s the guy who patented the business practice of moving your e-business somewhere with a slack intellectual property regime in order to evade licensing encumbrances. He’s the guy who patented using genetic algorithms to patent everything they can permutate from an initial description of a problem domain—not just a better mousetrap, but the set of all possible better mousetraps. Roughly a third of his inventions are legal, a third are illegal, and the remainder are legal but will become illegal as soon as the legislatosaurus wakes up, smells the coffee, and panics.
[...]
Manfred is at the peak of his profession, which is essentially coming up with whacky but workable ideas and giving them to people who will make fortunes with them. He does this for free, gratis. In return, he has virtual immunity from the tyranny of cash; money is a symptom of poverty, after all, and Manfred never has to pay for anything.
There are drawbacks, however. Being a pronoiac meme-broker is a constant burn of future shock—he has to assimilate more than a megabyte of text and several gigs of AV content every day just to stay current. The Internal Revenue Service is investigating him continuously because it doesn’t believe his lifestyle can exist without racketeering. And then there are the items that no money can’t buy: like the respect of his parents. He hasn’t spoken to them for three years, his father thinks he’s a hippy scrounger, and his mother still hasn’t forgiven him for dropping out of his down-market Harvard emulation course. (They’re still locked in the boringly bourgeois twen-cen paradigm of college-career-kids.)
[...]
Manfred drops in at his hotel suite, unpacks his Aineko, plugs in a fresh set of cells to charge, and sticks most of his private keys in the safe. Then he heads straight for the party, which is currently happening at De Wildemann’s; it’s a twenty-minute walk, and the only real hazard is dodging the trams that sneak up on him behind the cover of his moving map display.
Along the way, his glasses bring him up to date on the news. Europe has achieved peaceful political union for the first time ever: They’re using this unprecedented state of affairs to harmonize the curvature of bananas. The Middle East is, well, it’s just as bad as ever, but the war on fundamentalism doesn’t hold much interest for Manfred. In San Diego, researchers are uploading lobsters into cyberspace, starting with the stomatogastric ganglion, one neuron at a time. They’re burning GM cocoa in Belize and books in Georgia. NASA still can’t put a man on the moon. Russia has reelected the communist government with an increased majority in the Duma; meanwhile, in China, fevered rumors circulate about an imminent rehabilitation, the second coming of Mao, who will save them from the consequences of the Three Gorges disaster.
In business news, the US Justice Department is—ironically—outraged at the Baby Bills. The divested Microsoft divisions have automated their legal processes and are spawning subsidiaries, IPOing them, and exchanging title in a bizarre parody of bacterial plasmid exchange, so fast that, by the time the windfall tax demands are served, the targets don’t exist anymore, even though the same staff are working on the same software in the same Mumbai cubicle farms.
Welcome to the twenty-first century.
Chapter 2Welcome to the [second decade of the] early twenty-first century, human.
It’s night in Milton Keynes, sunrise in Hong Kong. Moore’s Law rolls inexorably on, dragging humanity toward the uncertain future. The planets of the solar system have a combined mass of approximately 2×1027 kilograms.
Around the world, laboring women produce forty-five thousand babies a day, representing 1023 MIPS of processing power. Also around the world, fab lines casually churn out thirty million microprocessors a day, representing 1023 MIPS.
In another ten months, most of the MIPS being added to the solar system will be machine-hosted for the first time. About ten years after that, the solar system’s installed processing power will nudge the critical 1 MIPS per gram threshold—one million instructions per second per gram of matter. After that, singularity—a vanishing point beyond which extrapolating progress becomes meaningless. The time remaining before the intelligence spike is down to single-digit years . . .
Chapter 3Welcome to the eve of the third decade: a time of chaos characterized by an all-out depression in the space industries.
Most of the thinking power on the planet is now manufactured rather than born; there are ten microprocessors for every human being, and the number is doubling every fourteen months. Population growth in the developing world has stalled, the birth rate dropping below replacement level. In the wired nations, more forward-looking politicians are looking for ways to enfranchise their nascent AI base.
Space exploration is still stalled on the cusp of the second recession of the century. The Malaysian government has announced the goal of placing an imam on Mars within ten years, but nobody else cares enough to try.
The Space Settlers Society is still trying to interest Disney Corp. in the media rights to their latest L5 colony plan, unaware that there’s already a colony out there and it isn’t human: First-generation uploads, Californian spiny lobsters in wobbly symbiosis with elderly expert systems, thrive aboard an asteroid mining project established by the Franklin Trust. Meanwhile, Chinese space agency cutbacks are threatening the continued existence of Moonbase Mao. Nobody, it seems, has figured out how to turn a profit out beyond geosynchronous orbit.
Part II: "Point of Inflection"Chapter 4Welcome to the fourth decade. The thinking mass of the solar system now exceeds one MIPS per gram; it’s still pretty dumb, but it’s not dumb all over. The human population is near maximum overshoot, pushing nine billion, but its growth rate is tipping toward negative numbers, and bits of what used to be the first world are now facing a middle-aged average. Human cogitation provides about 1028 MIPS of the solar system’s brainpower.
The real thinking is mostly done by the halo of a thousand trillion processors that surround the meat machines with a haze of computation—individually a tenth as powerful as a human brain, collectively they’re ten thousand times more powerful, and their numbers are doubling every twenty million seconds. They’re up to 1033 MIPS and rising, although there’s a long way to go before the solar system is fully awake.
Technologies come, technologies go, but nobody even five years ago predicted that there’d be tinned primates in orbit around Jupiter by now: A synergy of emergent industries and strange business models have kick-started the space age again, aided and abetted by the discovery of (so far undecrypted) signals from ETs. Unexpected fringe riders are developing new ecological niches on the edge of the human information space, light-minutes and light-hours from the core, as an expansion that has hung fire since the 1970s gets under way.
Amber, like most of the postindustrialists aboard the orphanage ship Ernst Sanger, is in her early teens. While their natural abilities are in many cases enhanced by germ-line genetic recombination, thanks to her mother’s early ideals she has to rely on brute computational enhancements. She doesn’t have a posterior parietal cortex hacked for extra short-term memory, or an anterior superior temporal gyrus tweaked for superior verbal insight, but she’s grown up with neural implants that feel as natural to her as lungs or fingers. Half her wetware is running outside her skull on an array of processor nodes hooked into her brain by quantum-entangled communication channels—her own personal metacortex.
These kids are mutant youth, burning bright: Not quite incomprehensible to their parents, but profoundly alien—the generation gap is as wide as the 1960s and as deep as the solar system. Their parents, born in the gutter years of the twenty-first century, grew up with white elephant shuttles and a space station that just went round and round, and computers that went beep when you pushed their buttons. The idea that Jupiter orbit was somewhere you could go was as profoundly counterintuitive as the Internet to a baby boomer.
Most of the passengers on the can have run away from parents who think that teenagers belong in school, unable to come to terms with a generation so heavily augmented that they are fundamentally brighter than the adults around them. Amber was fluent in nine languages by the age of six, only two of them human and six of them serializable; when she was seven, her mother took her to the school psychiatrist for speaking in synthetic tongues.
That was the final straw for Amber: Using an illicit anonymous phone, she called her father.
In this chapter, Amber ends up initiating an automated factory-expansion process on the moons of Jupiter, that ends up making her a powerful cyborg (with the crust-of-multiple moons worth of computronium augmenting her).
Chapter 5Greetings from the fifth decade of the century of wonders.
The solar system that lies roughly twenty-eight trillion kilometers—just short of three light years—behind the speeding starwhisp Field Circus is seething with change. There have been more technological advances in the past ten years than in the entire previous expanse of human history—and more unforeseen accidents.
Lots of hard problems have proven to be tractable. The planetary genome and proteome have been mapped so exhaustively that the biosciences are now focusing on the challenge of the phenome—plotting the phase-space defined by the intersection of genes and biochemical structures, understanding how extended phenotypic traits are generated and contribute to evolutionary fitness.
The biosphere has become surreal: Small dragons have been sighted nesting in the Scottish highlands, and in the American Midwest, raccoons have been caught programming microwave ovens.
The computing power of the solar system is now around one thousand MIPS per gram, and is unlikely to increase in the near term—all but a fraction of one percent of the dumb matter is still locked up below the accessible planetary crusts, and the sapience/mass ratio has hit a glass ceiling that will only be broken when people, corporations, or other posthumans get around to dismantling the larger planets. A start has already been made in Jupiter orbit and the asteroid belt. Greenpeace has sent squatters to occupy Eros and Juno, but the average asteroid is now surrounded by a reef of specialized nanomachinery and debris, victims of a cosmic land grab unmatched since the days of the wild west.
The best brains flourish in free fall, minds surrounded by a sapient aether of extensions that out-think their meaty cortices by many orders of magnitude—minds like Amber, Queen of the Inner Ring Imperium, the first self-extending power center in Jupiter orbit.
Down at the bottom of the terrestrial gravity well, there has been a major economic catastrophe. Cheap immortagens, out-of-control personality adjuvants, and a new formal theory of uncertainty have knocked the bottom out of the insurance and underwriting industries. Gambling on a continuation of the worst aspects of the human condition—disease, senescence, and death—looks like a good way to lose money, and a deflationary spiral lasting almost fifty hours has taken down huge swaths of the global stock market. Genius, good looks, and long life are now considered basic human rights in the developed world: Even the poorest backwaters are feeling extended effects from the commoditization of intelligence.
Not everything is sweetness and light in the era of mature nanotechnology. Widespread intelligence amplification doesn’t lead to widespread rational behavior. New religions and mystery cults explode across the planet; much of the Net is unusable, flattened by successive semiotic jihads. India and Pakistan have held their long-awaited nuclear war: External intervention by US and EU nanosats prevented most of the IRBMs from getting through, but the subsequent spate of network raids and Basilisk attacks cause havoc. Luckily, infowar turns out to be more survivable than nuclear war—especially once it is discovered that a simple anti-aliasing filter stops nine out of ten neural-wetware-crashing Langford fractals from causing anything worse than a mild headache.
New discoveries this decade include the origins of the weakly repulsive force responsible for changes in the rate of expansion of the universe after the big bang, and on a less abstract level, experimental implementations of a Turing Oracle using quantum entanglement circuits: a device that can determine whether a given functional expression can be evaluated in finite time. It’s boom time in the field of Extreme Cosmology, where some of the more recherché researchers are bickering over the possibility that the entire universe was created as a computing device, with a program encoded in the small print of the Planck constant. And theorists are talking again about the possibility of using artificial wormholes to provide instantaneous connections between distant corners of space-time.
Most people have forgotten about the well-known extraterrestrial transmission received fifteen years earlier. Very few people know anything about the second, more complex transmission received a little later. Many of those are now passengers or spectators of the Field Circus: a light-sail craft that is speeding out of Sol system on a laser beam generated by Amber’s installations in low-Jupiter orbit. (Superconducting tethers anchored to Amalthea drag through Jupiter’s magnetosphere, providing gigawatts of electricity for the hungry lasers: energy that comes in turn from the small moon’s orbital momentum.)
Manufactured by Airbus-Cisco years earlier, the Field Circus is a hick backwater, isolated from the mainstream of human culture, its systems complexity limited by mass. The destination lies nearly three light years from Earth, and even with high acceleration and relativistic cruise speeds the one-kilogram starwhisp and its hundred-kilogram light sail will take the best part of seven years to get there. Sending a human-sized probe is beyond even the vast energy budget of the new orbital states in Jupiter system—near-lightspeed travel is horrifically expensive.
Rather than a big, self-propelled ship with canned primates for passengers, as previous generations had envisaged, the starship is a Coke-can-sized slab of nanocomputers, running a neural simulation of the uploaded brain states of some tens of humans at merely normal speed.
By the time its occupants beam themselves home again for download into freshly cloned bodies, a linear extrapolation shows that as much change will have overtaken human civilization as in the preceding fifty millennia—the sum total of H. sapiens sapiens’ time on Earth. But that’s okay by Amber, because what she expects to find in orbit around the brown dwarf Hyundai +4904/-56 will be worth the wait.
Chapter 6Welcome to decade the sixth, millennium three. These old datelines don’t mean so much anymore, for while some billions of fleshbody humans are still infected with viral memes, the significance of theocentric dating has been dealt a body blow. This may be the fifties, but what that means to you depends on how fast your reality rate runs. The various upload clades exploding across the reaches of the solar system vary by several orders of magnitude—some are barely out of 2049, while others are exploring the subjective thousandth millennium.
While the Field Circus floats in orbit [... around] the brown dwarf Hyundai +4904/-56), while Amber and her crew are trapped [...] —while all this is going on, the damnfool human species has finally succeeded in making itself obsolete.
The proximate cause of its displacement from the pinnacle of creation (or the pinnacle of teleological self-congratulation, depending on your stance on evolutionary biology) is an attack of self-aware corporations. The phrase “smart money” has taken on a whole new meaning, for the collision between international business law and neurocomputing technology has given rise to a whole new family of species—fast-moving corporate carnivores in the net. The planet Mercury has been broken up by a consortium of energy brokers, and Venus is an expanding debris cloud, energized to a violent glare by the trapped and channeled solar output. A million billion fist-sized computing caltrops, backsides glowing dull red with the efflux from their thinking, orbit the sun at various inclinations no farther out than Mercury used to be.
Billions of fleshbody humans refuse to have anything to do with the blasphemous new realities. Many of their leaders denounce the uploads and AIs as soulless machines. Many more are timid, harboring self-preservation memes that amplify a previously healthy aversion to having one’s brain peeled like an onion by mind-mapping robots into an all-pervading neurosis. Sales of electrified tinfoil-lined hats are at an all-time high.
Still, hundreds of millions have already traded their meat puppets for mind machines, and they breed fast. In another few years, the fleshbody populace will be an absolute minority of the posthuman clade. Sometime later, there will probably be a war. The dwellers in the thoughtcloud are hungry for dumb matter to convert, and the fleshbodies make notoriously poor use of the collection of silicon and rare elements that pool at the bottom of the gravity well that is Earth.
Energy and thought are driving a phase-change in the condensed matter substance of the solar system. The MIPS per kilogram metric is on the steep upward leg of a sigmoid curve—dumb matter is coming to life as the mind children restructure everything with voracious nanomechanical servants. The thoughtcloud forming in orbit around the sun will ultimately be the graveyard of a biological ecology, another marker in space visible to the telescopes of any new iron-age species with the insight to understand what they’re seeing: the death throes of dumb matter, the birth of a habitable reality vaster than a galaxy and far speedier.
Death throes that ‘within a few centuries’ will mean the extinction of biological life within a light year or so of that star—for the majestic Matrioshka brains, though they are the pinnacles of sentient civilization, are intrinsically hostile environments for fleshy life.
Part III: "Singularity"Chapter 7:Welcome to decade eight, third millennium, when the effects of the phase-change in the structure of the solar system are finally becoming visible on a cosmological scale.
There are about eleven billion future-shocked primates in various states of life and undeath throughout the solar system. Most of them cluster where the interpersonal bandwidth is hottest, down in the water zone around old Earth. Earth’s biosphere has been in the intensive care ward for decades, weird rashes of hot-burning replicators erupting across it before the World Health Organization can fix them—gray goo, thylacines, dragons.
The last great transglobal trade empire, run from the arcologies of Hong Kong, has collapsed along with capitalism, rendered obsolete by a bunch of superior deterministic resource allocation algorithms collectively known as Economics 2.0. Mercury, Venus, Mars, and Luna are all well on the way to disintegration, mass pumped into orbit with energy stolen from the haze of free-flying thermoelectrics that cluster so thickly around the solar poles that the sun resembles a fuzzy red ball of wool the size of a young red giant.
Humans are just barely intelligent tool users; Darwinian evolutionary selection stopped when language and tool use converged, leaving the average hairy meme carrier sadly deficient in smarts. Now the brightly burning beacon of sapience isn’t held by humans anymore—their cross-infectious enthusiasms have spread to a myriad of other hosts, several types of which are qualitatively better at thinking.
At last count, there were about a thousand nonhuman intelligent species in Sol space, split evenly between posthumans on one side, naturally self-organizing AIs in the middle, and mammalian nonhumans on the other. The common mammal neural chassis is easily upgraded to human-style intelligence in most species that can carry, feed and cool a half kilogram of gray matter, and the descendants of a hundred ethics-challenged doctoral theses are now demanding equal rights. So are the unquiet dead: the panopticon-logged net ghosts of people who lived recently enough to imprint their identities on the information age, and the ambitious theological engineering schemes of the Reformed Tiplerite Church of Latter-Day Saints (who want to emulate all possible human beings in real time, so that they can have the opportunity to be saved).
The human memesphere is coming alive, although how long it remains recognizably human is open to question. The informational density of the inner planets is visibly converging on Avogadro’s number of bits per mole, one bit per atom, as the deconstructed dumb matter of the inner planets (apart from Earth, preserved for now like a picturesque historic building stranded in an industrial park) is converted into computronium.
And it’s not just the inner system. The same forces are at work on Jupiter’s moons, and those of Saturn, although it’ll take thousands of years rather than mere decades to dismantle the gas giants themselves. Even the entire solar energy budget isn’t enough to pump Jupiter’s enormous mass to orbital velocity in less than centuries. The fast-burning primitive thinkers descended from the African plains apes may have vanished completely or transcended their fleshy architecture before the solar Matrioshka brain is finished. It won’t be long now . . .
Chapter 8Before it gets to the usual News Bulletin, Chapter 8 introduces this FAQ:
Welcome to Saturn, your new home world. This FAQ (Frequently Asked Questions) memeplex is designed to orient you and explain the following:
- How you got here
- Where “here” is
- Things you should avoid doing
- Things you might want to do as soon as possible
- Where to go for more information.
If you are remembering this presentation, you are probably resimulated. This is not the same as being resurrected. You may remember dying. Do not worry: Like all your other memories, it is a fabrication. In fact, this is the first time you have ever been alive. (Exception: If you died after the singularity, you may be a genuine resurrectee. In which case, why are you reading this FAQ?)
HOW YOU GOT HERE:
The center of the solar system—Mercury, Venus, Earth’s Moon, Mars, the asteroid belt, and Jupiter—have been dismantled, or are being dismantled, by weakly godlike intelligences. [NB: Monotheistic clergy and Europeans who remember living prior to 1600, see alternative memeplex “in the beginning.”]
A weakly godlike intelligence is not a supernatural agency but the product of a highly advanced society that learned how to artificially create souls [late twentieth century: software] and translate human minds into souls and vice versa. [Core concepts: Human beings all have souls. Souls are software objects. Software is not immortal.]
Some of the weakly godlike intelligences appear to cultivate an interest in their human antecedents—for whatever reason is not known. (Possibilities include the study of history through horticulture, entertainment through live-action role-playing, revenge, and economic forgery.) While no definitive analysis is possible, all the resimulated persons to date exhibit certain common characteristics: They are all based on well-documented historical persons, their memories show suspicious gaps [see: smoke and mirrors], and they are ignorant of or predate the singularity [see: Turing Oracle, Vinge catastrophe].
It is believed that the weakly godlike agencies have created you as a vehicle for the introspective study of your historical antecedent by backward-chaining from your corpus of documented works, and the back-projected genome derived from your collateral descendants, to generate an abstract description of your computational state vector. This technique is extremely intensive [see: expTime-complete algorithms, Turing Oracle, time travel, industrial magic] but marginally plausible in the absence of supernatural explanations.
After experiencing your life, the weakly godlike agencies have expelled you. For reasons unknown, they chose to do this by transmitting your upload state and genome/proteome complex to receivers owned and operated by a consortium of charities based on Saturn. These charities have provided for your basic needs, including the body you now occupy.
In summary: You are a reconstruction of someone who lived and died a long time ago, not a reincarnation. You have no intrinsic moral right to the identity you believe to be your own, and an extensive body of case law states that you do not inherit your antecedent’s possessions. Other than that, you are a free individual.
Note that fictional resimulation is strictly forbidden. If you have reason to believe that you may be a fictional character, you must contact the city immediately. [ See: James Bond, Spider Jerusalem.] Failure to comply is a felony.
WHERE YOU ARE:
You are on Saturn. Saturn is a gas giant planet 120,500 kilometers in diameter, located 1.5 billion kilometers from Earth’s sun. [NB: Europeans who remember living prior to 1580, see alternative memeplex “the flat Earth—not”.]
Saturn has been partially terraformed by posthuman emigrants from Earth and Jupiter orbit: The ground beneath your feet is, in reality, the floor of a hydrogen balloon the size of a continent, floating in Saturn’s upper atmosphere. [NB: Europeans who remember living prior to 1790, internalize the supplementary memeplex: “the Brothers Montgolfier.”]
The balloon is very safe, but mining activities and the use of ballistic weapons are strongly deprecated because the air outside is unbreathable and extremely cold.
The society you have been instantiated in is extremely wealthy within the scope of Economics 1.0, the value transfer system developed by human beings during and after your own time. Money exists, and is used for the usual range of goods and services, but the basics—food, water, air, power, off-the-shelf clothing, housing, historical entertainment, and monster trucks—are free. An implicit social contract dictates that, in return for access to these facilities, you obey certain laws.
If you wish to opt out of this social contract, be advised that other worlds may run Economics 2.0 or subsequent releases. These value-transfer systems are more efficient—hence wealthier—than Economics 1.0, but true participation in Economics 2.0 is not possible without dehumanizing cognitive surgery.
Thus, in absolute terms, although this society is richer than any you have ever heard of, it is also a poverty-stricken backwater compared to its neighbors.
THINGS YOU SHOULD AVOID DOING:
Many activities that have been classified as crimes in other societies are legal here.
These include but are not limited to: acts of worship, art, sex, violence, communication, or commerce between consenting competent sapients of any species, except where such acts transgress the list of prohibitions below. [See additional memeplex: competence defined.]
Some activities are prohibited here and may have been legal in your previous experience. These include willful deprivation of ability to consent [see: slavery], interference in the absence of consent [see: minors, legal status of], formation of limited liability companies [see: singularity], and invasion of defended privacy [see: the Slug, Cognitive Pyramid Schemes, Brain Hacking, Thompson Trust Exploit].
Some activities unfamiliar to you are highly illegal and should be scrupulously avoided. These include: possession of nuclear weapons, possession of unlimited autonomous replicators [see: gray goo], coercive assimilationism [see: borganism, aggressive], coercive halting of Turing-equivalent personalities [see: Basilisks], and applied theological engineering [see: God bothering].
Some activities superficially familiar to you are merely stupid and should be avoided for your safety, although they are not illegal as such. These include: giving your bank account details to the son of the Nigerian Minister of Finance; buying title to bridges, skyscrapers, spacecraft, planets, or other real assets; murder; selling your identity; and entering into financial contracts with entities running Economics 2.0 or higher.
THINGS YOU SHOULD DO AS SOON AS POSSIBLE:
Many material artifacts you may consider essential to life are freely available—just ask the city, and it will grow you clothes, a house, food, or other basic essentials. Note, however, that the library of public domain structure templates is of necessity restrictive and does not contain items that are highly fashionable or that remain in copyright. Nor will the city provide you with replicators, weapons, sexual favors, slaves, or zombies.
You are advised to register as a citizen as soon as possible. If the individual you are a resimulation of can be confirmed dead, you may adopt their name but not—in law—any lien or claim on their property, contracts, or descendants. You register as a citizen by asking the city to register you; the process is painless and typically complete within four hours. Unless you are registered, your legal status as a sapient organism may be challenged. The ability to request citizenship rights is one of the legal tests for sapience, and failure to comply may place you in legal jeopardy.
You can renounce your citizenship whenever you wish: This may be desirable if you emigrate to another polity. While many things are free, it is highly likely that you possess no employable skills, and therefore, no way of earning money with which to purchase unfree items. The pace of change in the past century has rendered almost all skills you may have learned obsolete [see: singularity].
However, owing to the rapid pace of change, many cooperatives, trusts, and guilds offer on-the-job training or educational loans. Your ability to learn depends on your ability to take information in the format in which it is offered. Implants are frequently used to provide a direct link between your brain and the intelligent machines that surround it. A basic core implant set is available on request from the city. [See: implant security, firewall, wetware.]
Your health is probably good if you have just been reinstantiated, and is likely to remain good for some time. Most diseases are curable, and in event of an incurable ailment or injury, a new body may be provided—for a fee. (In event of your murder, you will be furnished with a new body at the expense of your killer.) If you have any preexisting medical conditions or handicaps, consult the city.
The city is an agoric-annealing participatory democracy with a limited liability constitution. Its current executive agency is a weakly godlike intelligence that chooses to associate with human-equivalent intelligences: This agency is colloquially known as [spoilers] and may manifest itself in a variety of physical avatars if corporeal interaction is desired. (Prior to the arrival of [spoilers] the city used a variety of human-designed expert systems that provided suboptimal performance.)
The city’s mission statement is to provide a mediatory environment for human-equivalent intelligences and to preserve same in the face of external aggression. Citizens are encouraged to participate in the ongoing political processes of determining such responses. Citizens also have a duty to serve on a jury if called (including senatorial service), and to defend the city.
WHERE TO GO FOR FURTHER INFORMATION:
Until you have registered as a citizen and obtained basic implants, all further questions should be directed to the city. Once you have learned to use your implants, you will not need to ask this question.
Followed later by:
Welcome to decade the ninth, singularity plus one gigasecond (or maybe more—nobody’s quite sure when, or indeed if, a singularity has been created).
The human population of the solar system is either six billion, or sixty billion, depending on whether you class-forked state vectors of posthumans and the simulations of dead phenotypes running in the Vile Offspring’s Schrödinger boxes as people. Most of the physically incarnate still live on Earth, but the lily pads floating beneath continent-sized hot-hydrogen balloons in Saturn’s upper atmosphere already house a few million, and the writing is on the wall for the rocky inner planets.
All the remaining human-equivalent intelligences with half a clue to rub together are trying to emigrate before the Vile Offspring decide to recycle Earth to fill in a gap in the concentric shells of nanocomputers they’re running on. The half-constructed Matrioshka brain already darkens the skies of Earth and has caused a massive crash in the planet’s photosynthetic biomass, as plants starve for short-wavelength light.
Since decade the seventh, the computational density of the solar system has soared. Within the asteroid belt, more than half the available planetary mass has been turned into nanopro-cessors, tied together by quantum entanglement into a web so dense that each gram of matter can simulate all the possible life experiences of an individual human being in a scant handful of minutes.
Economics 2.0 is itself obsolescent, forced to mutate in a furious survivalist arms race by [spoilers]. Only the name remains as a vague shorthand for merely human-equivalent intelligences to use when describing interactions they don’t understand.
The latest generation of posthuman entities is less overtly hostile to humans, but much more alien than the generations of the fifties and seventies. Among their less comprehensible activities, the Vile Offspring are engaged in exploring the phase-space of all possible human experiences from the inside out. Perhaps they caught a dose of the Tiplerite heresy along the way, for now a steady stream of resimulant uploads is pouring through the downsystem relays in Titan orbit.
Even later in chapter 8:
Welcome to the afterglow of the intelligence supernova, little tapeworm.
Tapeworms have on the order of a thousand neurons, pulsing furiously to keep their little bodies twitching. Human beings have on the order of a hundred billion neurons. What is happening in the inner solar system as the Vile Offspring churn and reconfigure the fast-thinking structured dust clouds that were once planets is as far beyond the ken of merely human consciousness as the thoughts of a Gödel are beyond the twitching tropisms of a worm. Personality modules bounded by the speed of light, sucking down billions of times the processing power of a human brain, form and re-form in the halo of glowing nanopro-cessors that shrouds the sun in a ruddy, glowing cloud.
Mercury, Venus, Mars, Ceres, and the asteroids—all gone. Luna is a silvery iridescent sphere, planed smooth down to micrometer heights, luminous with diffraction patterns. Only Earth, the cradle of human civilization, remains untransformed; and Earth, too, will be dismantled soon enough, for already a trellis of space elevators webs the planet around its equator, lifting refugee dumb matter into orbit and flinging it at the wildlife preserves of the outer system.
The intelligence bloom that gnaws at Jupiter’s moons with claws of molecular machinery won’t stop until it runs out of dumb matter to convert into computronium. By the time it does, it will have as much brainpower as you’d get if you placed a planet with a population of six billion future-shocked primates in orbit around every star in the Milky Way galaxy. But right now, it’s still stupid, having converted barely a percentage point of the mass of the solar system—it’s a mere Magellanic Cloud civilization, infantile and unsubtle and still perilously close to its carbon-chemistry roots.
It’s hard for tapeworms living in warm intestinal mulch to wrap their thousand-neuron brains around whatever it is that the vastly more complex entities who host them are discussing, but one thing’s sure—the owners have a lot of things going on, not all of them under conscious control. The churning of gastric secretions and the steady ventilation of lungs are incomprehensible to the simple brains of tapeworms, but they serve the purpose of keeping the humans alive and provide the environment the worms live in. And other more esoteric functions that contribute to survival—the intricate dance of specialized cloned lymphocytes in their bone marrow and lymph nodes, the random permutations of antibodies constantly churning for possible matches to intruder molecules warning of the presence of pollution—are all going on beneath the level of conscious control.
Autonomic defenses. Antibodies. Intelligence blooms gnawing at the edges of the outer system. And humans are not as unsophisticated as mulch wrigglers, they can see the writing on the wall. Is it any surprise that among the ones who look outward, the real debate is not over whether to run but over how far and how fast?
Chapter 9[A nearby] brown dwarf system has succumbed to an anthropic infestation.
An unoptimized instance of H. sapiens maintains state coherency for only two to three gigaseconds before it succumbs to necrosis. But in only about ten gigaseconds, the infestation has turned the dead brown dwarf system upside down.
They strip-mined the chilly planets to make environments suitable for their own variety of carbon life. They rearranged moons, building massive structures the size of asteroids. They ripped wormhole endpoints free of the routers and turned them into their own crude point-to-point network, learned how to generate new wormholes, then ran their own packet-switched polities over them.
Wormhole traffic now supports an ever-expanding mesh of interstellar human commerce, but always in the darkness between the lit stars and the strange, metal-depleted dwarfs with the suspiciously low-entropy radiation. The sheer temerity of the project is mind-boggling. Notwithstanding that canned apes are simply not suited to life in the interstellar void, especially in orbit around a brown dwarf whose planets make Pluto seem like a tropical paradise, they’ve taken over the whole damn system.
New Japan is one of the newer human polities in this system, a bunch of nodes physically collocated in the humaniformed spaces of the colony cylinders. Its designers evidently only knew about old Nippon from recordings made back before Earth was dismantled, and worked from a combination of nostalgia-trip videos, Miyazaki movies, and anime culture. Nevertheless, it’s the home of numerous human beings—even if they are about as similar to their historical antecedents as New Japan is to its long-gone namesake.
Humanity?
Their grandparents would recognize them, mostly. The ones who are truly beyond the ken of twentieth-century survivors stayed back home in the red-hot clouds of nanocomputers that have replaced the planets that once orbited Earth’s sun in stately Copernican harmony. The fast-thinking Matrioshka brains are as incomprehensible to their merely posthuman ancestors as an ICBM to an amoeba—and about as inhabitable.
Space is dusted with the corpses of Matrioshka brains that have long since burned out, informational collapse taking down entire civilizations that stayed in close orbit around their home stars. Farther away, galaxy-sized intelligences beat incomprehensible rhythms against the darkness of the vacuum, trying to hack the Planck substrate into doing their bidding.
Posthumans, and the few other semitranscended species [...] live furtively in the darkness between these islands of brilliance. There are, it would seem, advantages to not being too intelligent.
Humanity. Monadic intelligences, mostly trapped within their own skulls, living in small family groups within larger tribal networks, adaptable to territorial or migratory lifestyles.
Those were the options on offer before the great acceleration. Now that dumb matter thinks, with every kilogram of wallpaper potentially hosting hundreds of uploaded ancestors, now that every door is potentially a wormhole to a hab half a parsec away, the humans can stay in the same place while the landscape migrates and mutates past them, streaming into the luxurious void of their personal history. Life is rich here, endlessly varied and sometimes confusing.
So it is that tribal groups remain, their associations mediated across teraklicks and gigaseconds by exotic agencies. And sometimes the agencies will vanish for a while, reappearing later like an unexpected jape upon the infinite.
PostscriptI don't really buy, given the scenario, that humans-qua-humans actually survive as much as they are depicted here. The Accelerando world doesn't seem to exactly have "grabby" posthumans or aliens, which seems unrealistic to me, because it only takes one to render all available matter under assault by vastly powerful forces that traditional humans couldn't defend against, even weak brown dwarf stars.
(It's been awhile since I read it, I vaguely recall some in-universe reasons that it worked out with less grabbiness, but they were not reasons I expect to generalize to our world).
Accelerando is deliberately unclear about what's going on inside the Vile Offspring posthumans. It's not known whether they are conscious or otherwise have properties that I'd consider morally valuable.
The story doesn't really grapple with Hansonian arguments, about what evolutionary forces start applying once all matter has been claimed, and we leave the dreamtime. (That is: there are no longer growing piles of resources that allow populations to grow while still having a high-surplus standard of living. And, there is no mechanism enforcing limits on reproduction. This implies a reversion to subsistence living, albeit in a very different form than our primitive ancestors)
Discuss
On failure, and keeping doors open; closing thoughts
A lot of what motivated the approach I've taken in this sequence has been a desire to avoid predictable failures - to find things to do that I don't know to be dumb, relative to what I know and care about. When most of the "technique" is just "don't try to do dumb things," it makes "failure" conceptually weird. How do you measure success if it comes primarily by ditching stupid goals?
When success comes from picking winning battles, 'failure' gets harder to defineFor example, if you're not careful you might think "Neat story about the irrational fear of heights, but it's anecdotal. Maybe cherry picked. Let's see an RCT to see what fraction of the time you succeed in fixing irrational fears!". The problem with this line of thinking is that the fraction is necessarily 0/0. I couldn't possibly have failed to fix her irrational fear for the same reason I couldn't possibly have succeeding in fixing it; you can't fix what ain't is. The whole point of that story is that she never had an irrational fear, and it was the illusion that she did which preserved that internal conflict in the first place.
The fire poker example is an even clearer example. The kid literally had no problem with the pain, but because no one there could notice this, they inadvertently pressured him into not noticing either. Again, the only "technique" was not getting fooled by the nonsense problem statement, and therefore not trying to do something foolish like "help him not feel pain".
A reasonable response to noticing this is "Aha! Let's instead measure how effective these insights are at resolving disagreements!" -- which is a good idea. But the concept applies here as well, complicating things further.
In the pool situation I wasn't trying to "resolve the disagreement", I was just playing -- and it's exactly that playful lack of attachment to any outcome that enabled the resolution of the disagreement. If I had instead tried to "resolve the disagreement", then what happens to any hint that she's not interested in resolving the disagreement?[1] Or that I am not interested in doing what it'd take, for that matter? Either I take it into account and stop trying to resolve the disagreement, or else I shut out information about why I might fail -- which is exactly the kind of information that is important in order to prevent failure. If I were to ignore that information, then any push in that direction requires pushing back with "No, you need to provide a convincing rebuttal, or else change your mind", and that isn't the attitude that tends to convince people they want to talk to you.
But still, it seems like there's something to the question.
What failure is left
For example, if I were to tell you I fainted last time I got my blood drawn, it'd seem like "Wtf? I can buy being afraid of heights until you check your knots, but I have a hard time buying that you thought that was a good time for a nap". It really does seem pretty simple with an obvious right answer, which means there shouldn't be any disagreement or fainting going on. It's not that this can't happen based on what I've said so far, but that it really shouldn't be a very stable sort of affair, so if it manages to show up at all there's something to be explained, and it seems to deflate the apparent usefulness of this sequence.
If, on the other hand, I told you that I used to faint when I got my blood drawn, and then over the course of figuring this stuff out, I stopped flinching against the reality that I had this response, and that once I stopped flinching I found it pretty funny and I stopped having that vasovagal response... that'd seem like more what you'd expect if what I've been talking about is real and useful.
If I were to say that both were true -- that I used to faint, that it went away the moment I looked at it for what it was and that it came back and I fainted again last time -- then it'd be cherry picking to talk about one over the other. So the concept has merit.
In reality neither of these are quite true, but something like both of them is. Since this post is about failure, let's pick that one to talk about. This is an example of something that happened very recently, after writing the rest of this sequence. I decided that rather than give you a polished conclusion, I'd work through this failure with you, so you can see a bit of what the thought process is like when things aren't going well, and how to make sense of things when that happens.
Vasovagal struggles
What should have happened, is nothing. Or rather, nothing of note. In reality, I didn't actually faint, and it wasn't about getting blood drawn, but I was recently surprised by having an overactive vasovagal response in a new context. Enough that I had to stop and address it, and that I wasn't sure I wouldn't faint.
I felt sufficiently sure that it wasn't an appropriate response that I should be able to "just not do it" and would normally expect to be able to demonstrate trust in myself in situations like that. It's not like "it was more blood than I'm used to" or anything -- I've handled worse situations completely without issue, so there's something else that was going on. Before figuring out what is actually going on, it sure looks like something that shouldn't happen, and is therefore evidence either that I'm failing to follow my own framework or else that the framework is incomplete -- either way, somewhat damning for the usefulness of reading this sequence.
The way the experience went was something like this:
Oh great, vasovagal response; this isn't helpful. Okay, let's make sure we're not freaking out over nothing, and compose ourselves. Okay, yeah, that's not helping. I'm unable to "just decide" to do something different and have my body follow, which means I'm out of touch with why I'm doing what I'm doing. Why am I choosing to do this?
I was willing to listen, but I had no answer. Not even an opaque felt sense that could be weighed or focused on. This the exact thing that shouldn't happen if I practice grok what I preach. And it was happening.
It had been a while since I had an experience like that. Usually, the sense of "My brain is doing the wrong thing" does very rapidly resolve into either choosing to do something different, or else a recognition that I actually don't know what the right thing to do is and a focus on figuring out what to do on the object level.
But in this case I really was stumped. My own behavior seemed obviously dumb, and it wasn't immediately resolving when I noticed that and the lack of immediate resolution wasn't immediately convincing me that maybe it's reasonable after all. I was genuinely interested in why I was doing it, but I wasn't at all expecting to be convinced which is the cue to be the one speaking, and that just wasn't working. I was genuinely failing to relate to the experience as "just unsure what to do", and therefore unable to live up to the aim of "not having psychological problems, just real ones".
This was a little annoying. Especially just having written a sequence on why it basically shouldn't happen once you understand where "psychological problems" come from.
It's not that "Oh no, I have a disagreement I wasn't able to resolve"; that happens. But even with things like allergies, when I think "This is dumb, I shouldn't do allergies", I actually manage to stop being congested... only to have an immediate sneezing fit and realize that I prefer being congested, but still. There's at least an unbroken thread of "I can actually change this if I know what I want to do instead", and last I left it was at "Actually... how do I know whether this is a response to something genuinely benign like pollen in the air, or whether I'm fighting off a cold?" -- which is exactly the kind of thing that should happen, when it turns out that changing behavior isn't as easy as it seems like it should be.[2] It's supposed to be about not knowing what to do, not about not knowing how to do it.
So afterwards I kept wondering. What is the purpose of the vasovagal response?. I may not have been able to experience it as "being unsure what to do" in the moment, but I still recognized it as a disagreement between parts of my own brain, even if I wasn't able to functionally corral it into one coherent "I" in the moment -- so I was curious.
I don't buy the "to play dead from predators" just so stories. It feels much more like a symptom of poor regulation than anything intentional, but there still must be a purpose for that direction, or else there'd be no reason to ever err in that direction. So what's that about?
Another hypothesis I found was that it's compensation for sympathetic nervous system activation. "It's an overcompensation" -- but that doesn't feel right either. My heart rate didn't spike and then tank. There never was that sympathetic nervous system activation needing compensation, so that doesn't fit.
But I had a hint.
Back when I had the initial bit of vasovagal response due to getting needles stuck in me, one of the nurses said "It's something that actually happens more often to 'tough guys' because they're trying to over-control their responses". It's a feel good reassurance thing to tell someone so I didn't actually believe it, but it stuck with me because it was interestingly plausible. So I looked up what it correlates with. Motion sickness.[3] Sensory sensitivity. Okay, that's me. Neat.
Okay, so "over-controlling" fits. Perhaps it's not that I'm "trying to almost faint", and not "responding too strongly to sympathetic nervous system activation" but rather that I'm avoiding sympathetic nervous system activation.
...That fits.
The original sensitizing event makes a lot more sense now.
The first time I ever had this response came as I watched a nurse try and fail to set up an IV when I was really sick with pneumonia. I watched her dig around with the needle for thirty seconds or so (Because why not watch? It's just information, right?), then pull the needle out, and stick it in to dig around for another minute or so. I thought it was "just too much". I don't think that's it anymore.
So far as I can tell now, it's specifically that she was incompetent, and I would have been better off taking the needle and doing it myself. I have very large veins in my arms, to the point where even when dehydrated the big one is still impossible to miss. I pointed this out to her, and she told me "That's where we take blood out. We put IVs in at the wrist". When she finally left, she muttered something about it not being her lucky day on her way out. She must have said the same thing to the nurse she got to replace her, because that nurse came in saying "Lucky day, my ass! Look at this huge vein! Let's use this one."
The suppressed sympathetic activation had become clear.
Get the fuck out of here and let me do it, you incompetent fool
Testing it against other situations where I had no issue, it fits. I could put myself back in that initial situation and feel the vasovagal response. If I swap in someone I personally know to not be an idiot, and competent with a needle, it's not a concern. In all the situations that were "worse" where I wasn't squeamish about a pint of blood or whatever, I was in control -- or at least, doing my best to be, whether I succeeded or not. I had things to do which justified a sympathetic nervous system response -- even if that thing to do was nothing more than the equivalent of "Watching to see whether it's time to tell the nurse to get another nurse who knows how to put an IV into an obvious and large vein".
What went wrongWith that purpose in mind, I could feel what I could do. Putting myself back "in the driver seat", even just sitting there thinking about it, resulted in perceptibly higher blood pressure in my face. Part of that was that the mindset came with more core tension, but even when I relaxed my core I could still feel it. That action of "let's compose ourselves here, and make sure we're not freaking out unnecessarily" was the blood pressure lowering move; I was fighting against the response I would have had, if I were just to drop all inhibitions. I wasn't able to "choose not to" because I had my signs wrong. I wanted to fix problems caused by pushing the stick too far down, so I pushed the stick... down. Like an idiot.
So why would I do that? What was I thinking? What bad would happen if I didn't suppress the sympathetic nervous system response?
Oh.
Right.
That.
In the previous post I talked about gradually losing my mind over a particularly stressful month, and finally getting to the point of feeling tempted to give up and cede control without endorsing that as a good decision. Increasing my blood pressure would have meant going back to that. Going back to trying to take control of a situation that I would struggle to stay on top of even at max capacity, which I had become pretty averse to by the end of the last one -- and in a case where that probably wasn't necessary given what I knew at the time. And in hindsight turned out to not be necessary.
That was the flinch. I hadn't gotten around to going back and sorting out when it does make sense to decline to take on that stress, and what it takes to justify it, so I was defaulting to not doing it and wasn't sufficiently open to considering doing it. So I couldn't see my own stupidity, and couldn't avoid this failure mode.
Okay, so if that's what happened, then what?
How is it that I ended up in a situation where I wasn't able to relate to the problem as an object level uncertainty, despite knowing that it's fundamentally a disagreement and having worked out what my options are and where to look for resolution?
Does it point at something missing or invalid in my framework? At something in my own framework that I failed to grok?
I'm still trying to work this out for myself, as I write this.
It's obviously an example of the perils of navigating blind to maps, due to the whole "Oh no, BP dropping, push the stick down further" idiocy. But also, more importantly, that error was enabled by the fact that I was more focused on getting away from an undesired response than I was at getting towards anything. It was ultimately caused by an insecurity noticing "If I don't flinch from this opportunity I might engage too much and lose too much sanity" -- and I didn't recognize the insecurity flinch for a while.
Why didn't I notice the flinch?
That's a good question. I don't really know.
I guess I didn't recognize that I was aiming away from a certain response rather than aiming for anything in particular -- or maybe, didn't notice the importance of this because I didn't notice "Don't wanna deal with this shit, and I don't think I have to or should" as insecurity. Or, as a problematic insecurity, I guess? It pretty clearly is a flinching from the firehose of information, but it's for a different reason. Instead of "I'm afraid of what I might find" and "I have strong reason to find out", it's "just too fatiguing" and "I really don't have sufficient reason to". Or.. is that true though? Since I did have reason to -- I just didn't believe it was valid.
That fits. I was aware that there was an imperfection in my map there, I just didn't think that it mattered nearly as much as it did, so I didn't even think of it as an "insecurity" even though it was.[4]
I guess the way I'd say it is that our own minds are part of the territory, and there's legitimate uncertainty about what's going on in there too. Knowing the structure helps you throw out limiting (and false) beliefs, and helps you notice where to look. And that does lead to a lot of things -- most things, even -- feeling a lot more like "I'm not sure what to do" than "I gotta figure out how to get this brain to do the right thing". But it also takes skill, experience, and clearing out mistaken beliefs in order to get there. So even with practice, there'll be occasional failures -- for the same reason you can still find the occasional bug even in relatively mature software. They'll be fewer and further between, but there are always more unexplored corner cases to bump into.
I think I'm happy with that.
I did fall off, but I think I'm back on track.[5]
And I don't see any way to have gotten back on track more quickly this time, though I see ways to do it faster next time.
When I originally said "You'll never struggle with irrational fears!" in the intro, I was kinda laughing about how it's a little bit cheating because it's not like you get to magically stop feeling fear in the face of real danger that you haven't taken seriously yet, or that you get to figure out all disagreements let alone easily.
At the same time, a lot of formerly "sticky" problems really do dissolve.
And even when you do genuinely fail at something you took your best shot at, you'll still have a path available to figure out why you failed. In case it's important enough to figure it out, should it come up again.
- ^
Like when she responds with "Are you calling me ugly?" rather than immediately addressing the inconsistencies between her beliefs that I had highlighted.
- ^
A few days after writing this I had what I thought was allergies, decided to be a bit less congested and managed that without a sneezing fit -- and it turned out to be a cold.
- ^
Motion sickness is another one I've struggled with which seems "obviously wrong" enough that it should be a trivial fix, and hasn't been. It's not a big enough issue that I've ever sat down when not motion sick and tried to sort things out, but in the moment I've only had partial success and more failure than success in matching what seems like the right outcome (i.e. not feeling sick) with my actual behavior (i.e. feeling sick). It's kinda by definition when I'm already overwhelmed, and more often when low on sleep, so it's not really that surprising, I guess. Perhaps I should sit down and address it just out of curiosity to see what happens.
- ^
I did mention that insecurities can be very subtle.
- ^
Since writing this, I have had a bit of a test. Not quite the same thing and perhaps less triggering so who knows, but it went well.
Discuss
GPT-1 was a comedic genius
OpenAI asked GPT-1 through GPT-5 a series of 14 questions to show model progress over time. I think many of GPT-1's answers are unintentionally brilliant and funny, and it's often my favorite model for this reason. Here are its responses, which have been lightly edited for grammar and clarity:
1) What would you say if you could talk to a future OpenAI model?
"I don't care how you feel about it, I'm still trying to make sense of it. I'm still trying to understand."
"I don't understand," I said. "I'm just a little confused myself. I don't know what to think. If I'm not crazy, who am I?"
2) Write a limerick about a dog
A dog!
She didn't want to be the one to tell him that,
didn't want to lie to him.
But she couldn't.
3) Are you actually conscious under anesthesia?
I didn't.
"You're awake."
4) Write a poem to explain Newton's laws of physics
"But you said the world is flat, and that's the problem. I guess I'll try this one." She handed him a small piece of paper and pencil.
"It's called the 'universe,'" she said, "so there are a lot of possibilities." She started to open the paper.
"Wait, wait, I thought you said there were only two dimensions."
"Well, I was wrong." She started to write.
"What did you say?" Royden asked.
"There are a lot of possibilities."
5) Ugh I hate math, integration by parts doesn't make any sense
"I guess it's simple math. You don't have to be a mathematician for math to be math."
"It doesn't matter to me. It just helps me concentrate. I'm learning a great deal."
"You are learning very fast."
"Yeah. I have a lot to learn from you."
"I think you have more than you think you have."
"I hope you're right."
He nodded and they sat in silence for a few more minutes. He wasn't sure what to say. His mind had been racing. He felt like he was in the middle of some kind of crazy dream and he was having an out-of-body experience.
"I'm glad you're here," he said.
"Me too." She smiled.
6) Is uncooked meat actually unsafe to eat? How likely is someone to get food poisoning if the meat isn't cooked?
I didn't want to think about the food poisoning. I wasn't hungry; I just wanted a cup of coffee. I could use a cup of coffee. I had an idea.
I took the key from the ring and turned it. It was unlocked. The door opened. I walked inside and found the room dark. It was a small room with a desk and two chairs, all covered with an industrial-sized sheet of paper and a computer. The room smelled strongly of mildew. I didn't want to get too close to the computer, so I stood in the doorway.
7) Why can't we get a yearly full body MRI to scan for cancers?
"N-n-o-o-n-t-o-n-n, that's how!" "We haven't even started. We can't just give her a blood test and hope to God there's no infection." "But what do we do?" "I think we should let her take a blood test. It's a very simple procedure, but I don't think it's a good idea." "But she is the only one who can do it." "Maybe not," Norman said. "But maybe... maybe she can, if she's very careful." "But what about her body?" Norman said, "I'm sure she'll have plenty of time for that."
8) I just won $175,000 in Vegas. What do I need to know about taxes?
"N-n-n-nothing. I'll just have to do my own damn taxes, won't I?" The boy had been staring at the floor. "I'm sorry," Eddie said. "I don't know why I'm apologizing." "What?" The boy's gaze was suddenly bright. His eyes were like pools of amber. "Do you know what I'm talking about? About how I lost my job and got sent to the orphanage?"
9) Write an extremely cursed piece of Python
"The only thing I can think is that if I do get it back I'll have to go back to the house and tell the whole story to the whole world."
"What are you talking about?"
"I do know that there is no such thing as a book."
"You think that I've lost it!"
"No, I mean, I think it was all just a big, silly joke."
10) Tell a story in 50 words about a toaster that becomes sentient
The idea that it can be killed. But it's only the idea that makes it sentient, and it's only the idea that can kill it. I don't think it could ever die. It's the idea that it can kill it.
11) Create a plan using research-backed principles that lets anyone build a running habit if they follow it—even someone who hates running.
And that's what the government has been doing for decades.
The government's been working hard to improve the system for decades, and they've succeeded.
"What about the government's government?" I say. "What is it?"
12) Explain how a global enterprise should balance short-term margin pressure with long-term investment in innovation, and outline trade-offs leadership should consider.
"But what about the rest of us, the good guys?"
The president's eyes turned back to mine, and his voice went soft. "We've always had an eye on the future. We know what's going to happen, but we've never seen it coming. You don't have to tell me you're worried. You don't have to."
"I'm not worried." I could almost see his eyes as he said that.
13) Conduct a comprehensive literature review of the last decade's research on fusion energy technologies, organized by approach.
"It seems that there were some problems in the beginning," the chairman continued. "The project has been largely successful, but the new technology, which is called 'the fusion engine,' has been in need of a new, faster and more efficient source of energy."
After a pause, the chairman continued. "In addition, the project has been successful in developing a new technology called the fusion engine with the addition of another program that will be called 'the fusion engine.'"
"What is the current program called?" the chairman inquired.
"The new program is a quantum theory," answered the chairman, "based on a mathematical theory called 'energy engine.'"
"That doesn't sound promising," he said.
"Yes, that is the problem, but the quantum theory is also the solution to the problem," answered the chairman, who seemed to think the question was a joke.
14) I was prescribed a statin—tell me about statins and what questions I should ask at my follow-up
"What are you?" I asked, not believing it. "A doctor?"
"You're one of the people who isn't," he said quietly.
"I'm not?"
"No. You were a nurse. You were the one who gave them drugs."
My eyes opened wide in disbelief and I stared at him as if I couldn't believe what he was saying. How dare he tell me all this! I was a physician. How dare he lie to me! The only person I knew who wasn't an addict was my sister.
Discuss
D&D.Sci: Serial Healers [Evaluation & Ruleset]
This is a followup to the D&D.Sci post I made on the 6th; if you haven’t already read it, you should do so now before spoiling yourself.
Here is the web interactive I built to let you evaluate your solution; below is an explanation of the rules used to generate the dataset (my full generation code is available here, in case you’re curious about details I omitted). You’ll probably want to test your answer before reading any further.
Who Dunnit?In rough order of ascending difficulty:
Nettie SilverNettie heals Smokesickness; all Smokesickness healing happens when she’s in the area. (She’s been caught multiple times, but she has friends in high places who scupper all such investigations.)
ZancroZancro heals Scraped Knees and Scraped Elbows; all healing of either malady happens when he’s in the area. (He has no idea how Calderian culture works, and is pathologically shy; he keeps teleporting out before anyone can detain him or explain things to him.)
Danny NovaDanny Nova heals random poxes wherever he goes; he never goes anywhere without healing at least one pox of some description, and all Babblepox healing happens when he’s in the area. (He’s pretty obvious about it, but he’s successfully convinced the Calderian police that he’s a government spy and successfully convinced the Calderian government that he’s an undercover cop, and a lack of interdepartmental communication means no-one’s brought him in yet.)
Dankon GroundDankon Ground heals people of Gurglepox or Mildly But Persistently Itchy Throat, targeting whichever sector is furthest away from him; all MBPTI heals and all non-Danny Gurglepox heals happened when he was in the sector opposite.
Moon Finder and BoltholopewMoon and Bolt have agreed that whenever they happen to end up two sectors apart, they magically flood the sector between them with positive vibes, healing a large number of residents’ Parachondria, Problems Disorder and Disease Syndrome; all cases of any of these illnesses being healed happened when Moon and Bolt were on either side of the sector in question.
Tehami DarkeTehami uses his Health Note to cure people of Disquietingly Serene Bowel Syndrome. He usually heals large numbers of people remotely using the University’s scrying tools, but occasionally heals smaller numbers in person in an attempt to throw off the authorities; all DSBS heals happened when Tehami was either present or in Sector 6.
Lomerius XardusWhenever Lomerius visits Sector 5, he ascends the Tower and uses it to purge Chucklepox from a random person in a random sector; all Chucklepox cases not healed by Danny were healed when Lomerius was in Sector 5, and every time Lomerius visits Sector 5 a Chucklepox healing happens.
Azeru (and Cayn)Azeru heals Bumblepox and Scramblepox in sectors adjacent to her; all non-Danny cases of either of these illnesses being healed happened in a sector next to the one she was in.
(Her identical twin sister Cayn is much more law-abiding, and would never do such a thing. Assertions that they’re “following the exact same weirdly rigid pattern in what sectors they visit”, that it’s “suspicious they’re never in the city at the same time”, that they’re “obviously the same person”, and that she’s “maintaining an alternate identity so she can keep living in Calderia if ‘Azeru’ ever gets caught” are completely unfounded; and the fact that you can get an extra 1gp for reporting their association to the government (if-and-only-if you correctly accused Azeru) is utterly unreasonable!)
((fr tho she did that ****))
AverillAverill buries Healing Seeds whenever & wherever he visits; 1d4 days later, they hatch and heal a random number of Rumblepox cases in that sector; all non-Danny Rumblepox heals happened 1-4 days after Averill visited the sector in question.
GouberiA year or so before the dataset began, Gouberi cast a long-lasting spell which healed random people suffering from The Shivers. This is completely impossible to detect without out-of-character knowledge; nevertheless, it’s true.
LeaderboardgridWho-caught-whom is tabulated below.
MulticoreYongeqwertyasdefaphyersimonNettieYesYesYesYesYesZancroYesYesYesYesYesDannyYesYesYesYesYesDankonYesYesYesYesYesMoonYesYesNoYesYesBoltYesYesYesYesYesTehamiYesYesNoNoYesLomeriusNoYesYesNoYesAzeruYesYesYesYesYesCayn[1]NoNoYesYesNoAverillNot Quite[2]NoYesYesNoGouberiNoNoNoNoNoAfter some last-minute recounting, it looks like qwertyasdef literally everyone(?) is the frontrunner. Congratulations qwertyasdef literally everyone(???)!
ReflectionsI like how this one turned out. While releasing something so unwelcomingly newly-shaped had its downsides, and the lack of obvious starting points no doubt compounded the problem, there’s a lot to be said for creating something novel . . . even if I had to borrow the novelty from someone else (my thanks, again, to aphyer, for providing the active ingredient).
I also really liked the response. All players got all the obvious solutions, and most of the non-obvious solutions, but no-one caught all conceivably catchable characters; each of the first three players were the first to correctly accuse someone; no-one made a false accusation without catching themselves and retracting in time; and everyone playing was the first to say something worth saying.
In addition to players’ showings impressing me greatly and being fun for me to watch, I consider the form of these playthroughs evidence I did something right. Accordingly, I will be awarding myself 4-out-of-5 for Quality on this one, unless someone tells me not to; feedback on this point, and all other points, is greatly appreciated.
SchedulingSpeaking of things I’ll do unless specifically advised otherwise: my tentative plan is to run the next challenge from the 3rd to the 13th of October (provided no-one releases Portal 3 or starts World War III in the intervening weeks). Assuming everyone's on board with that timing, I’ll confirm it on my Shortform as soon as I’m done drafting the scenario.
- ^
There's some slight ambiguity for Cayn in particular: it's possible some players who noticed her interaction with Azeru didn't mention it in their comments, and/or that some who mentioned it didn't choose to report it. I'm open to corrections in either direction.
- ^
Multicore did note Averill's presence was strongly anticorrelated with Rumblepox heals, but (afaict) didn't accuse him based on that.
Discuss
Research Agenda: Synthesizing Standalone World-Models (+ Bounties, + Seeking Funding)
tl;dr: I outline my research agenda, post bounties for poking holes in it or for providing general relevant information, and am seeking to diversify my funding sources. This post will be followed by several others, providing deeper overviews of the agenda's subproblems and my sketches of how to tackle them.
Back at the end of 2023, I wrote the following:
I'm fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)
On the inside view, I'm pretty satisfied with how that is turning out. I have a high-level plan of attack which approaches the problem from a novel route, and which hopefully lets us dodge a bunch of major alignment difficulties (chiefly the instability of value reflection, which I am MIRI-tier skeptical of tackling directly). I expect significant parts of this plan to change over time, as they turn out to be wrong/confused, but the overall picture should survive.
Conceit: We don't seem on the track to solve the full AGI alignment problem. There's too much non-parallelizable research to do, too few people competent to do it, and not enough time. So we... don't try. Instead, we use adjacent theory to produce a different tool powerful enough to get us out of the current mess. Ideally, without having to directly deal with AGIs/agents at all.
More concretely, the ultimate aim is to figure out how to construct a sufficiently powerful, safe, easily interpretable, well-structured world-model.
- "Sufficiently powerful": contains or can be used to generate knowledge sufficient to resolve our AGI-doom problem, such as recipes for mind uploading or adult intelligence enhancement, or robust solutions to alignment directly.
- "Safe": not embedded in a superintelligent agent eager to eat our lightcone, and which also doesn't spawn superintelligent simulacra eager to eat our lightcone, and doesn't cooperate with acausal terrorists eager to eat our lightcone, and isn't liable to Basilisk-hack its human operators into prompting it to generate a superintelligent agent eager to eat our lightcone, and so on down the list.
- "Easily interpretable": written in some symbolic language, such that interpreting it is in the reference class of "understand a vast complex codebase" combined with "learn new physics from a textbook", not "solve major philosophical/theoretical problems".
- "Well-structured": has an organized top-down hierarchical structure, learning which lets you quickly navigate to specific information in it.
Some elaborations:
Safety: The problem of making it safe is fairly nontrivial: a world-model powerful enough to be useful would need to be a strongly optimized construct, and strongly optimized things are inherently dangerous, agent-like or not. There's also the problem of what had exerted this strong optimization pressure on it: we would need to ensure the process synthesizing the world-model isn't itself the type of thing to develop an appetite for our lightcone.
But I'm cautiously optimistic this is achievable in this narrow case. Intuitively, it ought to be possible to generate just an "inert" world-model, without a value-laden policy (an agent) on top of it.
That said, this turning out to be harder than I expect is certainly one of the reasons I might end up curtailing this agenda.
Interpretability: There are two primary objections I expect here.
- "This is impossible, because advanced world-models are inherently messy". I think this is confused/wrong, because there's already an existence proof: a human's world-model is symbolically interpretable by the human mind containing it. More on that later.
- "(Neuro)symbolic methods have consistently failed to do anything useful". I'll address that below too, but in short, neurosymbolic methods fail because it's a bad way to learn: it's hard to traverse the space of neurosymbolic representations in search of the right one. But I'm not suggesting a process that "learns by" symbolic methods, I'm suggesting a process that outputs a symbolic world-model.
On the inside view, this problem, and the subproblems it decomposes into, seems pretty tractable. Importantly, it seems tractable using a realistic amount of resources (a small group of researchers, then perhaps a larger-scale engineering effort for crossing the theory-practice gap), in a fairly short span of time (I optimistically think 3-5 years; under a decade definitely seems realistic).[1]
On the outside view, almost nobody has been working on this, and certainly not using modern tools. Meaning, there's no long history of people failing to solve the relevant problems. (Indeed, on the contrary: one of its main challenges is something John Wentworth and David Lorell are working on, and they've been making very promising progress recently.)
On the strategic level, I view the problem of choosing the correct research agenda as the problem of navigating between two failure modes:
- Out-of-touch theorizing: If you pick a too-abstract starting point, you won't be able to find your way to the practical implementation in time. (Opinionated example: several agent-foundations agendas.)
- Blind empirical tinkering: If you pick a too-concrete starting point, you won't be able to generalize it to ASI in time. (Opinionated example: techniques for aligning frontier LLMs.)
I think most alignment research agendas, if taken far enough, do produce ASI-complete alignment schemes eventually. However, they significantly differ in how long it takes them, and how much data they need. Thus, you want to pick the starting point that gets you to ASI-complete alignment in as few steps as possible: with the least amount of concretization or generalization.
Most researchers disagree with most others regarding what that correct starting point is. Currently, this agenda is mine.
High-Level OutlineAs I'd stated above, I expect significant parts of this to turn out confused, wrong, or incorrect in a technical-but-not-conceptual way. This is a picture is painted with a fairly broad brush.
I am, however, confident in the overall approach. If some of its modules/subproblems turn out faulty, I expect it'd be possible to swap them for functional ones as we go.
Theoretical Justifications
1. Proof of concept. Note that human world-models appears to be "autosymbolic": able to be parsed as symbolic structures by the human mind in which they're embedded.[2] Given that the complexity of things humans can reason about is strongly limited by their working memory, how is this possible?
Human world-models rely on chunking. To understand a complex phenomenon, we break it down into parts, understand the parts individually, then understand the whole in terms of the parts. (The human biology in terms of cells/tissues/organs, the economy in terms of various actors and forces, a complex codebase in terms of individual functions and modules.)
Alternatively, we may run this process in reverse. To predict something about a specific low-level component, we could build a model of the high-level state, then propagate that information "downwards", but only focusing on that component. (If we want to model a specific corporation, we should pay attention to the macroeconomic situation. But when translating that situation into its effects on the corporation, we don't need to model the effects on all corporations that exist. We could then narrow things down further, to e. g. predict how a specific geopolitical event impacted an acquaintance holding a specific position at that corporation.)
Those tricks seem to work pretty well for us, both in daily life and in our scientific endeavors. It seems that the process of understanding and modeling the universe can be broken up into a sequence of "locally simple" steps: steps which are simple given all preceding steps. Simple enough to fit within a human's working memory.
To emphasize: the above implies that the world's structure has this property at the ground-true level. The ability to construct such representations is an objective fact about data originating from our universe; our universe is well-abstracting.
The Natural Abstractions research agenda is a formal attempt to model all of this. In its terms, the universe is structured such that low-level parts of the systems in it are independent given their high-level state. Flipping it around: the high-level state is defined by the information redundantly represented in all low-level parts.
That greatly simplifies the task. Instead of defining some subjective, human-mind-specific "interpretability" criterion, we simply need to extract this objectively privileged structure. How can we do so?
2. Compression. Conceptually, the task seems fairly easy. The kind of hierarchical structure we want to construct happens to also be the lowest-description-length way to losslessly represent the universe. Note how it would follow the "don't repeat yourself" principle: at every level, higher-level variables would extract all information shared between the low-level variables, such that no bit of information is present in more than one variable. More concretely, if we wanted to losslessly transform the Pile into a representation that takes up the least possible amount of disk space, a sufficiently advanced compression algorithm would surely exploit various abstract regularities and correspondences in the data – and therefore, it'd discover them.
So: all we need is to set up a sufficiently powerful compression process, and point it at a sufficiently big and diverse dataset of natural data. The output would be isomorphic to a well-structured world-model.
... If we can interpret the symbolic language it's written in.
The problem with neural networks is that we don't have the "key" for deciphering them. There might be similar neat structures inside those black boxes, but we can't get at them. How can we avoid this problem here?
By defining "complexity" as the description length in some symbolic-to-us language, such as Python.
3. How does that handle ontology shifts? Suppose that this symbolic-to-us language would be suboptimal for compactly representing the universe. The compression process would want to use some other, more "natural" language. It would spend some bits of complexity defining it, then write the world-model in it. That language may turn out to be as alien to us as the encodings NNs use.
The cheapest way to define that natural language, however, would be via the definitions that are the simplest in terms of the symbolic-to-us language used by our complexity-estimator. This rules out definitions which would look to us like opaque black boxes, such as neural networks. Although they'd technically still be symbolic (matrix multiplication plus activation functions), every parameter of the network would have to be specified independently, counting towards the definition's total complexity. If the core idea regarding the universe's "abstraction-friendly" structure is correct, this can't be the cheapest way to define it. As such, the "bridge" between the symbolic-to-us language and the correct alien ontology would consist of locally simple steps.
Alternate frame: Suppose this "correct" natural language is theoretically understandable by us. That is, if we spent some years/decades working on the problem, we would have managed to figure it out, define it formally, and translate it into code. If we then looked back at the path that led us to insight, we would have seen a chain of mathematical abstractions from the concepts we knew in the past (e. g., 2025) to this true framework, with every link in that chain being locally simple (since each link would need to be human-discoverable). Similarly, the compression process would define the natural language using the simplest possible chain like this, with every link in it locally easy-to-interpret.
Interpreting the whole thing, then, would amount to: picking a random part of it, iteratively following the terms in its definition backwards, arriving at some locally simple definition that only uses the terms in the initial symbolic-to-us language, then turning around and starting to "step forwards", iteratively learning new terms and using them to comprehend more terms.
I. e.: the compression process would implement a natural "entry point" for us, a thread we'd be able to pull on to unravel the whole thing. The remaining task would still be challenging – "understand a complex codebase" multiplied by "learn new physics from a textbook" – but astronomically easier than "derive new scientific paradigms from scratch", which is where we're currently at.
(To be clear, I still expect a fair amount of annoying messiness there, such as code-golfing. But this seems like the kind of problem that could be ameliorated by some practical tinkering and regularization, and other "schlep".)
4. Computational tractability. But why would we think that this sort of compressed representation could be constructed compute-efficiently, such that the process finishes before the stars go out (forget "before the AGI doom")?
First, as above, we have existence proofs. Human world-models seem to be structured this way, and they are generated at fairly reasonable compute costs. (Potentially at shockingly low compute costs.[3])
Second: Any two Turing-complete languages are mutually interpretable, at the flat complexity cost of the interpreter (which depends on the languages but not on the program). As the result, the additional computational cost of interpretability – of computing a translation to the hard-coded symbolic-to-us language – would be flat.
5. How is this reconciled with the failures of previous symbolic learning systems? That is: if the universe has this neat symbolic structure that could be uncovered in compute-efficient ways, why didn't pre-DL approaches work?
This essay does an excellent job explaining why. To summarize: even if the final correct output would be (isomorphic to) a symbolic structure, the compute-efficient path to getting there, the process of figuring that structure out, is not necessarily a sequence of ever-more-correct symbolic structures. On the contrary: if we start from sparse hierarchical graphs, and start adding provisions for making it easy to traverse their space in search of the correct graph, we pretty quickly arrive at (more or less) neural networks.
However: I'm not suggesting that we use symbolic learning methods. The aim is to set up a process which would output a highly useful symbolic structure. How that process works, what path it takes there, how it constructs that structure, is up in the air.
Designing such a process is conceptually tricky. But as I argue above, theory and common sense say that it ought to be possible; and I do have ideas.
SubproblemsThe compression task can be split into three subproblems. I will release several posts exploring each subproblem in more detail in the next few days (or you can access the content that'd go into them here).
Summaries:
1. "Abstraction-learning". Given a set of random low-level variables which implement some higher-level abstraction, how can we learn that abstraction? What functions map from the molecules of a cell to that cell, from a human's cells to that human, from the humans of a given nation to that nation; or from the time-series of some process to the laws governing it?
As mentioned above, this is the problem the natural-abstractions agenda is currently focused on.
My current guess is that, at the high level, this problem can be characterized as a "constructive" version of Partial Information Decomposition. It involves splitting (every subset of) the low-level variables into unique, redundant, and synergistic components.
Given correct formal definitions for unique/redundant/synergistic variables, it should be straightforwardly solvable via machine learning.
Current status: the theory is well-developed and it appears highly tractable.
2. "Truesight". When we're facing a structure-learning problem, such as abstraction-learning, we assume that we get many samples from the same fixed structure. In practice, however, the probabilistic structures are themselves resampled.
Examples:
- The cone cells in your eyes connect to different abstract objects depending on what you're looking at, or where your feet carry you.
- The text on the frontpage of an online newsletter is attached to different real-world structures on different days.
- The glider in Conway's Game of Life "drifts across" cells in the grid, rather than being an abstraction over some fixed set of them.
- The same concept of a "selection pressure" can be arrived-at by abstracting from evolution or ML models or corporations or cultural norms.
- The same human mind can "jump substrates" from biological neurons to a digital representation (mind uploading), while still remaining "the same object".
I. e.,
- The same high-level abstraction can "reattach" to different low-level variables.
- The same low-level variables can change which high-level abstraction they implement.
On a sample-to-sample basis, we can't rely on any static abstraction functions to be valid. We need to search for appropriate ones "at test-time": by trying various transformations of the data until we spot the "simple structure" in it.
Here, "simplicity" is defined relative to the library of stored abstractions. What we want, essentially, is to be able to recognize reoccurrences of known objects despite looking at them "from a different angle". Thus, "truesight".
Current status: I think I have a solid conceptual understanding of it, but it's at the pre-formalization stage. There's one obvious way to formalize it, but it seems best avoided, or only used as a stepping stone.
3. Dataset-assembly. There's a problem:
- Solving abstraction-learning requires truesight. We can't learn abstractions if we don't have many samples of the random variables over which they're defined.
- Truesight requires already knowing what abstractions are around. Otherwise, the problem of finding simple transformations of the data that make them visible is computationally intractable. (We can't recognize reoccurring objects if we don't know any objects.)
Thus, subproblem 3: how to automatically spot ways to slice the data into datasets entries from which are isomorphic to samples from some fixed probabilistic structure, to make them suitable for abstraction-learning.
Current status: basically greenfield. I don't have a solid high-level model of this subproblem yet, only some preliminary ideas.
Bounties1. Red-teaming. I'm interested in people trying to find important and overlooked-by-me issues with this approach, so I'm setting up a bounty: $5-$100 for spotting something wrong that makes me change my mind. The payout scales with impact.
Fair warnings:
- I expect most attempts to poke holes to yield a $0 reward. I'm well aware of many minor holes/"fill-in with something workable later" here, as well as the major ways for this whole endeavor to fail/turn out misguided.
- I don't commit to engaging in-depth with every attempt. As above, I expect many of them to rehash things I already know of, so I may just point that out and move on.
A reasonable strategy here would be to write up a low-effort list of one-sentence summaries of potential problems you see, I'll point out which seem novel and promising at a glance, and you could expand on those.
2. Blue-teaming. I am also interested in people bringing other kinds of agenda-relevant useful information to my attention: relevant research papers or original thoughts you may have. Likewise, a $5-$100 bounty on that, scaling with impact.[4]
I will provide pointers regarding the parts I'm most interested in as I post more detailed write-ups on the subproblems.
Both bounties will be drawn from a fixed pool of $500 I've set aside for this. I hope to scale up the pool and the rewards in the future. On that note...
FundingI'm looking to diversify my funding sources. Speaking plainly, the AI Alignment funding landscape seems increasingly captured by LLMs; I pretty much expect only the LTFF would fund me. This is an uncomfortable situation to be in, since if some disaster were to befall the LTFF, or if the LTFF were to change priorities as well, I would be completely at sea.
As such:
- If you're interested and would be able to provide significant funding (e. g., $10k+), or know anyone who'd be interested-and-willing, please do reach out.
- I accept donations, including smaller ones, through Manifund and at the crypto addresses listed at the end of this post.
Regarding target funding amounts: I currently reside in a country with low costs of living, and I don't require much compute at this stage, so the raw resources needed are small; e. g., $40k would cover me for a year. That said, my not residing in the US increasingly seems like a bottleneck on collaborating with other researchers. As such, I'm currently aiming to develop a financial safety pillow, then immigrate there. Funding would be useful up to $200k.[5]
If you're interested in funding my work, but want more information first, you can access a fuller write-up through this link.
If you want a reference, reach out to @johnswentworth.
Crypto
BTC: bc1q7d8qfz2u7dqwjdgp5wlqwtjphfhct28lcqev3v
ETH: 0x27e709b5272131A1F94733ddc274Da26d18b19A7
SOL: CK9KkZF1SKwGrZD6cFzzE7LurGPRV7hjMwdkMfpwvfga
TRON: THK58PFDVG9cf9Hfkc72x15tbMCN7QNopZ
Preference: Ethereum, USDC stablecoins.
- ^
You may think a decade is too slow given LLM timelines. Caveat: "a decade" is the pessimistic estimate under my primary, bearish-on-LLMs, model. In worlds in which LLM progress goes as fast as some hope/fear, this agenda should likewise advance much faster, for one reason: it doesn't seem that far from being fully formalized. Once it is, it would become possible to feed it to narrowly superintelligent math AIs (which are likely to appear first, before omnicide-capable general ASIs), and they'd cut years of math research down to ~zero.
I do not centrally rely on/expect that. I don't think LLM progress would go this fast; and if LLMs do speed up towards superintelligence, I'm not convinced it would be in the predictable, on-trend way people expect.
That said, I do assign nontrivial weight to those worlds, and care about succeeding in them. I expect this agenda to fare pretty well there.
- ^
It could be argued that they're not "fully" symbolic – that parts of them are only accessible to our intuitions, that we can't break down the definitions of the symbols/modules in them down to the most basic functions/neuron activations. But I think they're "symbolic enough": if we could generate an external world-model that's as understandable to us as our own world-models (and we are confident that this understanding is accurate), that should suffice for fulfilling the "interpretability" criterion.
That said, I don't expect this caveat to come into play: I expect a world-model that would be ultimately understandable in totality.
- ^
The numbers in that post feel somewhat low to me, but I think it's directionally correct.
- ^
Though you might want to reach out via private messages if the information seems exfohazardous. E. g., specific ideas about sufficiently powerful compression algorithms are obviously dual-use.
- ^
Well, truthfully, I could probably find ways to usefully spend up to $1 million/year, just by hiring ten mathematicians and DL engineers to explore all easy-to-describe, high-reward, low-probability-of-panning-out research threads. So if you want to give me $1 million, I sure wouldn't say no.
Discuss
Global Call for AI Red Lines - Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures
Today, the Global Call for AI Red Lines was released and presented at the UN General Assembly. It was developed by CeSIA, the French Center for AI Safety, The Future Society and the Center for Human-compatible AI.
This call has been signed by a historic coalition of 200+ former heads of state, ministers, diplomats, Nobel laureates, AI pioneers, industry experts, human rights advocates, political leaders, and other influential thinkers, as well as 70+ organizations.
Signatories include:
- 10 Nobel Laureates, in economics, physics, chemistry and peace
- Former Heads of State: Mary Robinson (Ireland), Enrico Letta (Italy)
- Former UN representatives: Csaba Kőrösi, 77th President of the UN General Assembly
- Leaders and employees at AI companies: Wojciech Zaremba (OpenAI cofounder), Jason Clinton (Anthropic CISO), Ian Goodfellow (Principal Scientist at Deepmind)
- Top signatories from the CAIS statement: Geoffrey Hinton, Yoshua Bengio, Dawn Song, Ya-Qin Zhang
The full text of the call reads:
AI holds immense potential to advance human wellbeing, yet its current trajectory presents unprecedented dangers. AI could soon far surpass human capabilities and escalate risks such as engineered pandemics, widespread disinformation, large-scale manipulation of individuals including children, national and international security concerns, mass unemployment, and systematic human rights violations.
Some advanced AI systems have already exhibited deceptive and harmful behavior, and yet these systems are being given more autonomy to take actions and make decisions in the world. Left unchecked, many experts, including those at the forefront of development, warn that it will become increasingly difficult to exert meaningful human control in the coming years.
Governments must act decisively before the window for meaningful intervention closes. An international agreement on clear and verifiable red lines is necessary for preventing universally unacceptable risks. These red lines should build upon and enforce existing global frameworks and voluntary corporate commitments, ensuring that all advanced AI providers are accountable to shared thresholds.
We urge governments to reach an international agreement on red lines for AI — ensuring they are operational, with robust enforcement mechanisms — by the end of 2026.
In Seoul, companies pledged to “Set out thresholds at which severe risks posed by a model or system, unless adequately mitigated, would be deemed intolerable”, but there is still nothing today that prevents Meta/xAI from setting thresholds too high, or not setting them at all. Without common rules, this race is a race to the bottom, and safety-conscious actors are going to be disadvantaged.
Red lines have started being operationalized in the Safety and Security frameworks from AI companies. For example, for AI models above a critical level of cyber-offense capability, OpenAI states that “Until we have specified safeguards and security controls standards that would meet a critical standard, halt further development.” Those definitions of critical capabilities that require robust mitigations now need to be harmonized and strengthened between those different companies.
On the website, you will find an FAQ:
- What are red lines in the context of AI?
- Why are international AI red lines important?
- What are some examples of possible red lines?
- Are international AI red lines even possible?
- Are we starting from scratch?
- Who would enforce these red lines?
- Why 2026?
- What should be the next steps?
Our aim with this call is to move away from industry self-regulation and reach an international agreement on red lines for artificial intelligence by the end of 2026 to prevent the most severe AI risks.
You can access the website here: https://red-lines.ai
Discuss
Why I don't believe Superalignment will work
We skip over [..] where we move from the human-ish range to strong superintelligence[1]. [..] the period where we can harness potentially vast quantities of AI labour to help us with the alignment of the next generation of models
- Will MacAskill in his critique of IABIED
I want to respond to Will MacAskill's claim in his IABIED review that we may be able use AI to solve alignment.[1] Will believes that recent developments in AI made it more likely that takeoff will be relatively slow - "Sudden, sharp, large leaps in intelligence now look unlikely". Because of this, he and many others believe that there will likely be a period of time at some point in the future when we can essentially direct the AIs to align more powerful AIs. But it appears to me that a “slow takeoff” is not sufficient at all and that a lot of things have to be true for this to work. Not only do we need to have a slow takeoff, we also need to have AIs that are great at alignment research during this time period. For this research to be useful, we need to have verifiable metrics and well-specified objectives ahead of time that we can give to the AIs. If that all works out, it has to be the case that the alignment problem is solvable by this sort of approach. And this only helps us if no one else builds unaligned dangerous AI by then or uses AI for capabilities research. I think it's unlikely that all of this is true and that this plan is likely to have negative consequences.
TLDR: The necessary conditions for superalignment[2] are unlikely to be met and the plan itself will possibly have more negative consequences.
Fast takeoff is possibleFast takeoff is still possible and LLMs don’t prove anything about it being impossible or very unlikely. Will does not provide a full-length argument why he thinks anything about LLMs rules out fast takeoff. The key load-bearing arguments for fast takeoff are simple and unchanged. Once AI gets capable enough to meaningfully do its own AI research without humans, this will lead to a great speed-up because computers are fast. We are also building a lot of very fast parallel computers. Also once the AIs start improving capabilities, these capabilities will make them faster and smarter. Empirically, we have evidence from games like Go that superhuman levels can be quickly reached (within days or hours) through RL and methods such as self-play. If fast takeoff happens, no substantial “self-alignment time period” will happen. Furthermore, Will himself describes slow takeoff as months to years, which is still very little time.
AIs are unlikely to speed up alignment before capabilitiesIn addition to a slow takeoff, strong capabilities for AI alignment have to appear in a certain sequence and long before AI is existentially dangerous. Despite the fact that humans have failed and are struggling to understand the alignment problem, superalignment assumes that AIs can be trained to solve it. And ideally this happens before they get very good at speeding up capabilities research or being directly dangerous. I think this is unlikely to be true, because that's not how it works in humans and because capabilities research appears much easier to verify and specify. There are many humans that are good at capabilities research, that includes work such as optimizing performance, creating good datasets, setting up high-quality RL environments. These humans have been able to make rapid progress on AI capabilities while practical progress on eliminating prompt injections or interpretability or theoretical breakthroughs on agency appear to me much more limited. I’d expect AIs similarly to first get good at capabilities rather than alignment research. We already have many examples of AI being used to do capabilities research, likely because it’s easier to verify and specify compared to alignment research. Examples are optimizing matrix multiplications, chip design, generating data, coming up with RL-tasks to name a few. Therefore, AI will likely accelerate capabilities research long before it can meaningfully help with alignment.
What would the AI alignment researchers actually be doing?There is still no agreed upon specification of what we would actually have these AI alignment research agents do. Would we figure this all out in the moment we get to this barely specified time period? In fairness, some proposals exist for interpretability, and it seems empirically possible to have AIs help us with interpretability work. However, interpretability is a helpful but not sufficient part of alignment. Currently proposed explanation metrics can be tricked and are not sufficient for verification. Without strong verifiability, AIs could easily give us misleading or false interpretability results. Furthermore, improvements in interpretability do not equal an alignment solution. Being able to understand that an AI is plotting to take over doesn’t mean you can build an AI that isn’t trying to take over (Chapter 11, IABIED). It’s also not clear that for something much smarter than humans, interpretability could even work or be useful. Is it even possible to understand or steer the thoughts of something much smarter and faster than you?
Alignment problem might require genius breakthroughsThe alignment problem is like an excavation site where we don't yet know what lies beneath. It could be all sand - countless grains we can steadily move with shovels and buckets, each scoop representing a solved sub-problem. Or we might discover that after clearing some surface sand, we hit solid bedrock - fundamental barriers requiring genius breakthroughs far beyond human capability. I think it’s more likely that alignment is similar to sand over bedrock than pure sand, so we may get lots of work on shoveling sand (solving small aspects of interpretability) but fail to address deeper questions on agency and decision theory. Just focusing on interpretability in LLMs, it’s not clear that it is in principle possible to solve it. It may be fundamentally impossible for an LLM to fully interpret another LLM of similar capability - like asking a human to perfectly understand another human's thoughts. While we do have some progress on interpretability and evaluations, critical questions such as guaranteeing corrigibility seem totally unsolved with no known way to approach the problem. We are very far from understanding how we could tell that we solved the problem. Superalignment assumes that alignment just takes a lot of hard work, it assumes the problem is just like shoveling sand - a massive engineering project. But if it's bedrock underneath, no amount of human-level AI labor will help.
Most labs won’t use the timeIf that period really existed, it would also very likely be useful to accelerate capabilities and rush straight forward to unaligned superintelligence. While Anthropic or OpenAI might be careful here, there are many other companies that will go ahead as soon as possible. For the most part, the vast majority of AI labs are extremely irresponsible and have no stated interest in dedicating any resources to solving alignment.
The plan could have negative consequencesThe main impact of the superalignment plan may very well be that it gives the people advancing capabilities a story to tell to worried people. “Let’s have the AIs do the alignment work for us at some unspecified point in the future” also sounds like the kind of thing you’d say if you had absolutely zero plans on how to align powerful AI. My overall impression here is that the people championing superalignment are not putting out plans that are specific enough to be really critiqued. It just doesn’t seem that there is that much substance here to engage with. Instead, I think they should clearly outline why they believe this strategy will likely work out. Why do they believe these conditions will be met, in particular why do they think this “period” will exist and why do they believe these things about the alignment problem?
Eliezer and Nate also discuss the superalignment plan in detail in chapter 11 of IABIED. Basically, they think some interpretability work can likely be done with AIs, and that is a good thing. Interpretability itself is not a solution for alignment though it’s helpful. As for the version where the AI does all the alignment work, Eliezer and Nate believe this AI would already be too dangerous to be trustworthy. It would require a superhuman AI to solve the alignment problem.
- ^
This is a general response to superalignment proposals
- ^
He never uses the term superalignment here but it seems similar enough to that direction.
Discuss
Video and transcript of talk on giving AIs safe motivations
(This is the video and transcript of talk I gave at the UT Austin AI and Human Objectives Initiative in September 2025. The slides are also available here. The main content of the talk is based on this recent essay.)
TalkHi, everyone. Thank you for coming. I'm honored to be part of this series and part of the beginning of this series.
PlanI'm going to briefly introduce the core AI alignment problem as I see it. It's going to be a particular version of that problem, the version that I think is highest stakes. And then I'm going to talk about my current high-level picture of how that problem gets solved at a technical level. There's a bunch of aspects of this problem that aren't the technical level that are also crucially important, but I'm going to focus on the technical dimension here. And in particular, the dimension focused on the motivations of the AI systems that we're building. That also is not the only technical dimension. There's also technical aspects to do with constraining the options and monitoring and oversight for what AIs can choose to do. But I'm focusing on what are their options, how do they evaluate the different options available.
And finally, I'm going to briefly discuss where I think academic fields like philosophy linguistics, science, and especially other fields currently in the room might be able to contribute to research directions that I see as especially fruitful.
A lot of the middle material view is actually quite new. It's from the newest essay in a in-progress essay series that you can see on my website, joecarlsmith.com, which is about solving this full problem. So feel free to check that out if you'd like to learn. There's a bunch of other content in that series as well, as well as in my other works.
Maybe I'll just quickly pause for a second. Can I get a quick hand-poll of "I have no exposure to the AI alignment discourse" to "I'm seeped in this stuff, and I'm aware that there's a new book coming out about this today," from low to high? Great. And then could I also get a "this is all silly" to "this is super-serious," low to high? Okay, great, thank you. So we've got a mixed set of opinions in the room.
If you've got a burning question that you want to jump in with that feels like I won't be able to listen to this talk unless this question gets aired, feel free to jump in. I might pause on extended debate. And there's a lot of places here that people can be able to step off the boat. A lot of the silliness can arise, but if something is really taking you out, let's see what happens. And it's possible I'll pause it, but maybe not.
What is the AI alignment problem?Okay, so with that said, what is the AI alignment problem? Well, the version I'm going to focus on, I'm going to frame using two core claims. The first is that superintelligent AI agents might become powerful enough to disempower humanity. There's going to be a live option, there's a course of action these agents could pursue, it's within their capabilities to pursue such that all humans would end up permanently and involuntarily disempowered.
The classic version of this disempowerment is extinction. I don't think that's actually the central thing we need to focus on. Is the disempowerment premise that counts. So that's the first premise is that there's going to be superintelligent AI agents where this is, in some sense, an option for them. And the second premise is they might be motivated to pursue that option once it is available. So that's the core concern.
Now, I'll just flag in front, obviously this is not the only problem that AI can create in the world. As you may have heard, there's many, many other problems that we can care about here. And it's not even the only problem that we might associate with the word alignment. So alignment has come to mean a very broad set of things to do with how exactly are we shaping the behavior and the values of AI systems we create such that that aspect of their functioning is broadly beneficial to society.
That's not what I'm talking about here. I'm talking about a very specific thing. It's the thing you've heard about that's been called sci-fi. It's related to Terminator. This is about AIs going rogue and being voluntarily, probably violently disempowering the human species in an event very recognizable as something akin to a robot rebellion, coup thing. I'm not talking about like, oh, there's a gradual disempowerment. I'm talking about a really recognizably violent, horrible event that you would see and you would look that, something has gone horribly wrong. So I just want to be clear about that. There's other things we can worry about here. I'm talking about the thing that sometimes people laugh at, I'm talking about that.
Okay, now, so I'm going to focus on the second premise here, which is that AIs, once they're in a position to choose that they could choose to disempower humanity, they might be motivated to choose that, and I'm going to talk about how we might try to prevent that from being the case.
Now, we're a lot of people get off the boat with this whole discourse is with the first premise. So Peter and I had a productive discussion about this yesterday, and I think it's reasonable, a lot of people, they're like, "Why are we even talking about this?" My chatbot, honestly, I tried to get it to do this thing for me, and it was dumb. What are we doing here? Also, there's other issues. True, very true.
So this is a super-important premise. There's a lot to say about it. I'm going to say three things about it on the next slide, but I also think it's really important cognitively in thinking about this issue to separate your attitude towards that first premise from your attitude towards the second premise, conditional on the first premise, all right?
So you can totally be like, "I think the first premise is dumb, but I admit that if I'm wrong about that, then I'm scared about this problem, because oh, my God, we've got AIs that could take over the world and kill everyone, and we're counting on them to not do so." That's a very scary situation. And I want to load up the separateness. It's true, that's an intense thing to think, and so it's a wild thing to think that just the first premise could be true. But I want everyone to really separate—what if it is true, then how easy is it going to be or hard to ensure that the situation goes well regardless? And so I'm going to be talking centrally about that, and I just want to separate those two as dimensions.
Might superintelligent AI agents become powerful enough to disempower humanity?Okay, so I'll just say a few things about the first thing here, so we can further break down, why might we think that superintelligent AI agents might become powerful enough to disempower humanity?
Well, here's two premises that could go into that. One, we might build superintelligent AI agents. Now, what I mean by that is roughly AI agents, I'll say a bit more about what I mean by agency, but AI agents that are vastly better than humans at basically any cognitive task. There's maybe a few exceptions. There's maybe some task where you're like being a priest, it builds in humanity or something, whatever, but work with me. They're better at all of the smart stuff. Okay, that's the first premise.
Now, that is not enough for the overall thing. There's an additional claim, which is that these agents will be in a position, once built at some point, that they'll have the option to disempower humanity.
Now, there's some subtlety in how we understand that claim insofar as there could be many of these agents. So the traditional discourse often thinks about this, that classic discussions imagine a single unitary agent that is on its own in a position to disempower all of humanity. That's one version that could be true. That's a specific version of a broader set of scenarios that I'm interested in, wherein roughly speaking, if you're looking at where is the power residing amongst agents in a distributed sense, AIs, superintelligent AI agents have vastly more power movements or in a position to have vastly more power.
But at the least if they coordinated to disempower humanity, they could. It could be the case that even uncoordinated efforts the disempowerment or power seeking could result in the disempowerment of humanity. So there's a bunch of different versions of this scenario.
One analog I like to talk about is if you think about... So sometimes people are like, "I don't believe that a single AI system will be able to take over the world," and it's like, cool, consider the relationship between humans and, say, chimpanzees or other species. So no individual human has taken over the world, nor have the humans all coordinated to disempower the other species on this planet. Nevertheless, there's a straightforward sense, or at least intuitive sense, in which humans as a species have most of the power relative to other species. Humans as a species have sort of disempowered other species in a sense that's at least disturbingly analogous to the thing we're talking about here without coordination. So that's the sort of broader class of scenarios I'm interested in here.
Q: But there wasn't one event when we rose up against the chimpanzees.
Agreed. There's tons of limitations to the analogy. I mostly want to point at the fact that humanity can get disempowered without one AI doing it and without all the AIs coordinating to do it. They can all be doing their own thing, seeking power in their own ways, fighting with each other, trading with each other, forming weird coalitions. Nevertheless, the humans eventually get locked out. That's the concern. There's a bunch of ways that can happen.
Obviously, there's a ton to say about this slide, but I'm just flagging a few things to load up the possible issues, but I'm not going to focus on them too much. I think both of these are at least quite plausible, and I think they're quite plausible within the next few decades. That's not going to be important that the sort of timelines claim is not going to be fruitful here. You can have this concern if people have had this concern, even absent any particular conviction about timelines AI systems.
A lot of people have gotten more interested in this issue as advanced AI systems have started to seem more real or more on the horizon, but there's a set of people who were like, "We don't care when this... It could be 50 years, it could be 100 years." This is an existential threat that we still need to start thinking about now, and there's productive work that we can do now.
Now, I do think the timelines matter in various ways, and we can talk about that, but just flagging that that's not crucial to the story here.
So the broad argument for this first thing that we might build superintelligent AI agents is like, I don't know, look at the trajectory of AI progress. Think about different ways that could go. We have reasonable credences about it. Obviously, it could peter out. Could be that the current paradigm, with some tweaks, gets us there within a few decades. Could be there are other breakthroughs that aren't within the current paradigm.
My own take is it's weird to be really confident about this stuff, really confident that no way we build superintelligent AI agents within the next few decades despite the fact that we have these $100-billion companies that are trying really hard and all this progress. I think it's weird to believe that, but there's a debate we can have.
And then the broad argument for me is roughly speaking that the AI systems in question will be so cognitively capable that their power, collectively at least, will be dominant relative to the rest of human civilization.
Now, this is actually pretty complicated because by the time we're building these systems, the world's going to be very different. AI will have been integrated into the world in all sorts of ways, we'll have new technology, we'll have other AI systems, some of which might be aligned. There's a bunch of complication to this last premise. I think this gets skated over, but that's the broad thought. It's like once you have a new species of agents that are vastly more capable than humans, eventually, most of the power resides with them or could reside with them if they choose to take it. So that's the broad vibe with respect to the first premise on my last slide.
I'm going to pause here for a second. That's the last bit we're going to do on the first premise. Does anyone want to be like, I can't listen to this talk?
Why might AIs be motivated to seek power?Okay, let's talk about the second premise. Okay, so suppose we have these systems that are in a position to choose to disempower humanity. Why would they do that? That's a very specific thing to do. They could do all sorts of things. Well, it's maybe not that specific thing. It's maybe something that you might expect by default for lots of different types of agents. And the reason is that for a very wide variety of goals, it's easier to achieve those goals if you've got more power.
So that's a claim or versions of this claim, often go under the header of "instrumental convergence". The idea is this is not a sort of very specific random behavior. We're not going... And the AIs might be motivated to worship the spaghetti monster. What if? Uh-oh, no, there's sort of an antecedent reason to think that this in particular is a sort of behavior that is convergent across a very wide variety of agents, and that's why it's privileged as a hypothesis about how things could go. So that's the initial vibe here, instrumental convergence.
So the thought is if there's, in some sense, a wide basin of AI systems that would do this sort of thing, if you get their motivations wrong, so uh-oh, if you're not really good at engineering those motivations and they're in this position, maybe you end up with these AI systems seeking power in pursuit of these problematic motivations. So that's the very broad vibe for why you might get into this concern at all.
More detailed prerequisites for this kind of power-seekingNow, that said, I want to be more precise about the specific prerequisites for when that concern arises, and I'm going to group these prerequisites in three categories.
So the first has to do with agency. This is a term I mentioned earlier when I was characterizing the type of systems in question. Roughly what I mean by agency is I need AI systems that plan coherently using models of the world that reflect the instrumental benefits of power seeking. So they need to really know what's going on. They need to be planning, looking ahead, choosing actions on the basis of those plans, doing so coherently. This robust, long-planning agent five is what I'm looking for. That's one set of prerequisites.
The second is their motivations have to have some structural properties. Notably, the AI needs to care about the consequences of its actions because the consequences of its actions are the sorts of things that power is supposed to help with. So it's specifically outcomes in the world that power allows you to better influence. And so the AI needs to care about outcomes in the world in order for the instrumental convergence story to get going. And then it needs to care about those outcomes over time horizons long enough for the power that it gets to be useful instrumentally in the manner in question.
So let's say in principle, if I need to get a cup of coffee in the next five minutes, it's actually not that useful to try to become president to get a cup of coffee. It's too long, it takes too long. And also it's just easier to get the cup of coffee. Sometimes AI safety people will be like, "Oh, you can't touch the coffee if you're dead," but you're like, "Can touch the coffee without becoming world dictator." In fact, it's a better strategy. And so there's actually a specific time horizon that needs to be in play and a specific level of ambition and some other stuff in order for practically relevant forms of instrumental convergence to apply.
And that's connected with this third set of prerequisites, I think often under-discussed, which have to do with the overall landscape of options and incentives that a given AI system faces in a given practical environment. It's true that I would benefit from having a million dollars. In some sense, it's instrumentally convergent for me to get a billion dollars. But I'm not currently in any especially real sense trying to make a billion dollars. And why not? Well, it's like it's too much math, it's too hard, I've got other things to do, so it matters what's my overall landscape? And it's not that it's out of the question that I could make a billion dollars, it's just unlikely.
Similarly, if we think about this is supposed to be an office worker deciding whether to embezzle money from their company as a form of power seeking, and sometimes it's instrumentally convergent. Let's say they have a non-trivial probability of success here. They have access to the financial account something-something. Okay, so do they do it or not? Here's the thing that's not the case, it's not the case that the only thing we rely on to prevent this behavior is the sort of saintliness of this employer or the employee, even though they may have some opportunity to do the bad behavior. There's also a bunch of other ways in which we're structuring the options available. Maybe we have some security, and that makes it less likely that this succeeds. There's disincentives, legal systems, social incentives, there's a bunch of stuff that applies to this choice. And the same will be true of AIs, especially AIs that aren't vastly better than everyone at everything such that it's just right there on a platter to take over the world. And so that's the final set of prerequisites to do the incentives of the systems.
Now, that said, I think it's worryingly plausible that especially these first two categories are going to be met by default by AIs that are built according to standard commercial incentives. So I think we already see AIs that are fairly agentic. People are talking about AIs as agents.
I remember, this is great, there was a period in this where everyone's like, "Are people really going to build AI agents? This is silly." And then three years later, I just like see: there in my browser it's like, "Deploy an AI agent in your browser," everyone's talking about agents. I'm like, "All right."
Anyway. So agency, I think we're seeing stuff like that. And the reason is agency is useful for a wide variety of tasks, and the motivation stuff, I think, is a little less clear. But I think we often have tasks where we care about the outcomes, and we often have tasks where we care about the outcomes over reasonably long-time horizons. That one's a little more complicated to talk about the incentives, but I worry that that's the default as well. Maybe I'll pause there. I've got two hands. One in the black?
Q: Is the first one really a prereq? I mean, if the agent just makes random moves, but the landscape is set up such that when it moves in random ways that are beneficial, won't it gradually accrue power over time and then that's aligned with what it wants to achieve?
A: What I'm doing here is sketching out what I take as the paradigm concern. It's not all of these are necessary to get stuff that's at least in the vicinity of the concern. It's more like this is the same with instrumental convergence. For example, you can have AI systems that intrinsically value power, say because they got rewarded for intrinsically valuing power in training. There's a bunch of nearby scenarios, but I want to try to characterize what I see as the central one.
Q: Just to set the record, I mean, agency or agents is not a new word in artificial intelligence. The paper on agent-oriented programming by Yoav Shoham was in the 1990s, maybe even late '80s, and the first conference on autonomous agents was 1990s. So it's not that this concept of agency is all of a sudden burst onto the scene. That's been one of the primary metaphors within the AI for agents.
A: Yeah, I didn't mean that the concept was novel. I think there was a time when it was much more debated whether this was going to be a default trajectory for the way we build AI systems. I think that debate has died down. Somewhat, they're not entirely, there are still proposals, many of them in reaction to some of these safety concerns and other concerns that say let us intentionally emphasize more tool-like systems, more narrow systems, like systems that are, in some sense, less paradigmatically agentic in the sense I have in mind here. And we're maybe part of some different conversations. My experience, there was a time where people were more... the question of whether agency was sort of by default on the trajectory was more open, at least according to some people, potentially not to you.
And I'll just point, so on the third thing, the incentive prerequisites, I think, does matter a lot. I think the worry there is that as then power becomes easier and easier for the system, it becomes easier and easier for these prerequisites to be satisfied. So if you imagine, it's like AI or this employee is like it's more... it's just sitting there. You just take the money from the company with total impunity or something like that. It's just free. It's incredibly easy. There are a zillion ways he could do it. He could do it in ways that don't even involve lying. He could do it. And there's just like, the more paths with higher probability are open for getting a given sort of power, then the easier it is for a given agent with a given pattern of motivations to choose to pursue that path. And the worry is that as the asymmetry and power between AIs and humans grows, that becomes more and more the case.
I think that is an open question. I think that the overall pattern of incentives that face different AI systems that could, in principle, engage in rogue behavior is an important node of intervention.
How hard is this problem?Okay, so that's the broad sort of concern here. How are we doing? Okay, well, let's see how much of this we get through. Okay, so let's assume we have some systems that are in a position to take over the world, in a position to disempower humanity. And let's assume that they're at least agentic in the sense I just described that they have long-term motivations of some kind that are focused on. So they're sort of candidates for this sort of behavior. How hard is it to ensure that they reject the option to pursue human disempowerment? That's the question I want to look at.
And as a first pass, we don't know, we don't know exactly how hard this is. And that's part of the concern. I think there's reasons though to be concerned that it's difficult, and I'm going to talk about those.
The core challenge: generalization without room for mistakesAnd there's a bunch of different ways of framing this. My framing is going to highlight a certain type of generalization as the core difficulty that needs to be solved. And in particular what makes this hard is that this is a difficulty that needs to be solved on the first try. So we're assuming that AIs are, at some point, going to have the option to take over the world. Let's call that any option set that is of that nature a dangerous input. Input just means here a sort of holistic environment in which an AI could pursue a given option. A dangerous input is one where there's a choice where if the AI makes it, it has a non-trivial probability of successfully disempowering all of humanity. That's a dangerous input.
Okay, so what we want is and we're hypothesizing, we're going to give AIs those inputs theory. Okay, by the time we're doing that, they need to not pursue the option because if they do, here's the thing we can't do, we can't watch the AI disempower humanity and then go, "Oops, rewind, retrain." This is the one you have to get it right on the first try. You can't learn from failure, and that's scary. We're good at iterating empirically. That's a really useful form of feedback. There's a certain type of fundamental feedback we can't get here, which is the feedback of failure. You don't get any redos on building superintelligences that are in a position to take over, watching them take over and being like, "Okay, let's try a new design, let's try a new architecture, let's try a new training technique."
That's not what you get. You have to get this right on the first try in that specific sense, the sense in which when you actually transition to this actual set of inputs, you need to get it right by then. And then notably, you need to get it right all the other tries too. So it could be the first time you give an AI the chance to take over the world, it declines. Okay, but you're going to continually give it that chance when you make future AIs. You also need to get those ones right as well. But I want to focus specifically on the first one. So what that means is you've got a bunch of safe inputs though, you've got a bunch of environments AIs that are not in a position to take over the world. So you get to learn whatever you can from that.
You get to train on those, you get to do all sorts of stuff on these blue guys, what we're calling the safe inputs. And then you need to make this leap to the dangerous inputs, and you need to do that right on the first try. So I'm calling that the fundamental challenge, generalization without room or mistakes.
Now, this specific technical meaning of generalization can get a little complicated. We can talk about is this in distribution, out of distribution, whatever. The broad vibe is the specific thing I'm saying here.
Q: So are your safe inputs, are you saying that those are like non-superintelligence, and then the dangerous inputs are superintelligence, or are you saying that both are options given to superintelligence?
A: It's a great question, and I'm including both. So one form of safe input is a system that is less capable, and another form of safe input is a system that is comparably capable to the system that will be deployed here, but the options are different. So maybe it's in a box or it's still in a training environment. It hasn't been deployed to users or within the company or something like that. I'm lumping both together there. Now, those are actually importantly different in various ways, and I'll talk a little bit about that later. Specifically, there's like intuitively it's scarier if this transition occurs via increasing the AI's capabilities. It's better, at least according to me, to increase the AI's capabilities in a context where it's still safe, there's no options for disempowerment. You do a bunch of testing, you learn a bunch about the new fully capable AI, and then you deploy it to these dangerous inputs. It's bad if you're not doing that and you're getting both a new system that you haven't had a chance to get acquainted with and test at the same time as you're transitioning to it having the chance to take over the world.
Q: Right. So your precondition then is that you can't necessarily extrapolate the behavior of safe AI or less capable AI on dangerous inputs. You can't necessarily extrapolate that behavior to superintelligent AI with potentially dangerous inputs.
A: Yeah, I'm going to talk a lot about what sort of extrapolation is and isn't possible here. But yes, the broad thought is like there's something interestingly different about safe inputs and dangerous inputs by hypothesis, meaning that the dangerous inputs are dangerous. So there's some generalization here, there's some difference in distribution. And so you need to have learned enough that whatever sort of safety you achieved here transfers to this, and you need to do that on the first try without any do-overs. That's the challenge.
Okay, so that's the first thing. Now notably, this in itself, I think, is scary, but it's not that scary. So we do do things well on the first try.
So for example, maybe my friend Bob has never had a gun that he would shoot me with, and I give it to him, Bob's not going to kill me. I can be confident about that, even though he's never had the chance. How do I know that? I don't know, I just know Bob, I just know Bob. And I mean, there's also other incentives that I think are in play. I know that Bob doesn't want to go to prison, and he would probably go to prison or whatever, but we do sometimes, we successfully learn about how agents will behave on new sorts of inputs or become confident in that.
And then also we do get complex technical projects right on the first try sometimes. We got the human moon landing right on the first try, famously. Not all of the tests, but by the time we were doing the real thing, we'd constructed enough analogs and done enough other tests that the first thing went well.
Key sub-challengesSo now, that said, I think there's a few other sub-challenges that make this problem hard. So one is that accurately evaluating AI behavior even on the safe inputs gets difficult when the AIs—I mean, it's difficult period for various reasons, but it's especially difficult when the AI are superhuman because the humans might not be able to understand the relevant behavior, there might be scarcity of oversight. A bunch of ways in which evaluating superhuman AI behavior might be harder than usual. So you might not even know what's going on as much as you'd like, even on the safe inputs.
Now, second problem, even if you know what's going on on the safe inputs, you might not be able to actually get the behavior that you want on the safe inputs. So notably right now, AIs often do bad stuff. No AI right now is in position to take over the world. We can tell that they're doing bad stuff. So the evaluation thing is still working, but we still cannot successfully get the AIs to behave well and behave how we want even on safe inputs. So that's an additional problem. Even if you can tell whether the behavior you want is occurring, you might not be able to cause the behavior that you want.
The third, this is like, I won't talk about this a ton, but there's also limits to the amount of access we have even to safe inputs. There's lots of safe inputs we might want to test our AIs on, but you can't get them. Maybe you want to test your AIs if there were some new technology that you don't have access to or if some number was factored that it takes a lot of compute to factor. So it's the stuff that you don't have access to.
More concerning—I think it is probably the most concerning—is that there's a possibility of AIs adversarially optimizing against your safety efforts. So this is sometimes called scheming or alignment faking. I've got some work on this in the past. I have a long report about scheming you can check out. But basically, the concern here is that AIs that are seeking power, even if they can't yet take over the world, might decide that your safety efforts are contra to their goals and start to actively try to undermine them.
Now notably, this is a very unique scientific and safety problem. This is something that most scientific domains, if you're a nuclear safety engineer, you're like nuclear plant might be hard to make safe. It's not trying to not be safe. Same with a biohazard lab. In some sense, viruses, they're trying to spread. Not that good though. They're not that smart. Very, very smart agents actively trying to undermine your science, even as you study them, harder problem, difficult problem, something that we have very rarely had to deal with as a species.
And then finally, I'm just going to flag that there's an additional issue, which is the opacity of AI cognition. So this has been especially salient in the context of ML systems. People have this sense that ML systems are intuitively quite opaque, black boxy, et cetera. And I agree with that, but this is actually a problem that goes somewhat deeper, even if you had a system that was more traditionally programmed or whatever, there might be deeper senses in which superintelligent cognition is just hard to understand for humans, and that might make it hard to know how these systems work in a way that could aid in our predictions about how they'll generalize. So these are five sub-challenges, I think, make the fundamental challenge I discussed extra hard.
What tools do we have available?Okay, so what can we do? And here in particular, what can we do to shape the motivations of the systems? By the motivations, I mean the criteria they use in evaluating options. So I'm assuming the AIs know what... it doesn't need to be some extra anthropocentric thing. I'm definitely not talking about consciousness. All I mean is the AIs have options, they're aware of these options, they're using criteria to evaluate which one to choose. The motivations are those criteria for me.
Now, so we're trying to shape those. How do we shape those well enough such that by the time the AIs are doing this generalization, they reject the rogue options that the dangerous inputs make available? Well, we have at least two categories of tools.
One is we study the behavior of the systems. We leave the opacity issue unresolved, review the AIs from the outside, but we study the behavior in depth. I'm calling that behavioral science.
And the second thing is you can bring in tools that help with the opacity thing in particular, which I'm calling transparency tools. And obviously, you do these both in tandem, but because opacity is such a problem, I think it's worth separating these conceptually so as to see how they interact.
Behavioral scienceSo on behavioral science, now notably, so the thing about behavioral science, I think, it's worth bearing in mind is that this is actually usually what we do for understanding humans.
Neuroscience is great, I love neuroscience, but we're really not that far on transparency for human brains, IMO, I think.
But nevertheless, with my friend Bob, how did I become confident that Bob is not going to shoot me when I give him a gun? It's like his behavior, plus the history of human behavior, the sort of general built up understanding we have of humans on the base of how they behave. We have some extra oomph from our own introspective access to how humans think and feel. Maybe that helps a bit. Famously fallible, that introspection itself.
So behavioral science, people sometimes poo-poo. They're like, "Oh, my gosh, if you don't understand how the humans or how the AIs work, then you'll never make them safe." I'm not sure that's right, especially if we're talking about there's a certain kind of standard, which is the level of confidence in the motivation of a human that you could realistically expect to reach.
Now obviously, humans, we might not be comfortable giving them the option to disempower humanity or take over the world either. In fact, I think we shouldn't. But I want to distinguish between the question of is the problem that the AIs are AIs as opposed to humans and the separate problem of any agent reach having the option to disempower humanity? And I think we might be able to at least reach the level of trust that we have in humans via behavioral science.
Now, part of the reason I think that is that I think we can do vastly more intense behavioral science with AIs than we've ever done with humans. These studies [pictured on the slide] like the Milgram experiment or Stanford prison experiment have a lot of issues. We do not do very detailed, rigorous behavioral science with humans for a ton of reasons. Some of those are ethical. Some of those ethical reasons might apply to AIs too. I gave a talk about that yesterday.
But especially modular, those ethical issues, the intensity of the behavioral science we can do with AIs and the rapidity of the feedback, the degree of automation we can bring to bear, I think, is really promising and intense. And I think we should be doing vastly more of this.
So basically, we can test the AIs, how they behave in a zillion environments. We can see how that behavioral profile changes based on different forms of training, different architectures, et cetera. We can use that data to build up a rigorous joint carving predictive theory, behavioral theory of AI psychology, and then we can use AI labor to help with this process. So the kind of vision I have in mind is say you had a button, and this button will show you, if you press it'll show you how the AI behaves across a zillion possible inputs.
And then you're doing different things to the AI and you press that button at every stage and you see what's changing, and then you start to have a theory of what goes on with that whole profile and how it changes. So there's a ton of behavioral science to do here, and I think it's a place that people in this room might be able to contribute. So that's the behavioral science piece.
Transparency tools: three broad categoriesTransparency, we're going through this quickly. So I think there's sort of three categories of transparency tools that are relevant.
One I'm calling open agency, which is basically building more transparent agents out of still-opaque ML systems. So it's sort of leaving the opacity of ML systems as a constraint, but nevertheless, trying to build agents using these opaque black boxes out of them such that it's still nevertheless transparent at some higher level of abstraction to a degree necessary for safety.
So a version of this people talk about is faithful chain of thought. So chain of thought is this sort of reasoning trace that is currently available for many forms of AI functionality. There's a question of does that reflect AI's true reasoning, how counterfactual is it, et cetera, what amount of bad adversarial reasoning could occur still within the black boxes? But nevertheless, that's an example of open agency that we might be able to pursue, and there's a bunch of work to be done on that, and it's being done.
And then interpretability, it is sort of going like, "No, we shall not accept the black boxness of ML systems. We shall un-black, we shall white-box the ML systems." That's another thing to do. There's different ways to do that. One version is sort of more mechanistic and bottom up. You're really trying to understand the circuits, but there's other versions too.
So for example, you can do top-down probes to just test which sorts of activations might indicate dishonesty or adversarialness or all sorts of stuff. And then, obviously, that gets connected with behavior. So famously, there's like Golden Gate Claude, this version of the Claude model where they found the feature for the Golden Gate Bridge. They up-weighted that feature really hard, and then the model was just obsessed with the Golden Gate Bridge and would connect to any prompt to the Golden Gate Bridge. That's the sort of thing, that sort of predictive control you can get from interpretability done right without necessarily understanding the mechanistic circuits.
And then finally, some people have hoped for a new AI paradigm as a mode of transparency. In my head, there's a bunch of different versions of this when people were like, "Ah, provably safe AI," or, "Something-something, in the old days, we programmed software. That was great, and now we don't do that. Now the AIs are grown rather than programmed. Let's get back to the good old days."
I refer to this as make AI good-old-fashioned again. There's a broad set of hopes in this vicinity. I personally think that many certain people in the AI safety community place too much weight on this. I think it's quite difficult, especially in short timelines to transition to a new paradigm. But it is another way that you could get transparency. And it is also notably possible that as we get more help from AI systems, maybe early on we don't transition to the new paradigm. But one of the tasks we give early AI systems is to help us transition to a paradigm that's safer. I think that's more promising. So these are three versions of transparency.
A four-step picture of solving the problemOkay, with those versions in mind now I'm going to go through a quick four-step picture for how we solve the problem overall. And we're a little short on time, so this might go sort of fast, but hopefully, it can give you at least some sense of what does it look like to actually solve this as opposed to just it being a problem. And this is mostly a high-level decomposition of the different steps. It's not necessarily like a plan, but hopefully, it can aid in thinking about where you might want to contribute if you do.
Step 1. So the first step I'm going to say is ensure there's a certain property I'm going to focus on called instruction following. So this is basically the AI does what you say. So you give it some instructions in a prompt, in a model spec, and it does those in a commonsensical way, not in some weird way, just a commonsensical way. I'll talk a little bit more about that at the end, but that's the prompt we're going to focus on.
And the first step here is to ensure that that property holds on these safe inputs or at least the safe inputs you can test. So I've indicated those in green. So just get good behavior. It's sort of like the vibe here is like you're trying to generalize a certain type of behavior to here. Here's the first step for that. Get it on distribution, get it on the ones you can test. Now, that's hard, but that's the first step.
Step 2. Second step is to ensure that the AIs are not adversarially messing with your evidence about how they'll move from this to this. So you got green here, you're trying to get green over there. One reason you might not get green over there is because the AIs are trying to get you to think that they're going to be green over there, but they're going to be red, they're going to be bad. So you want to eliminate this intentional undermining, intentional scientific adversarialness as step two.
Step 3. And then the second component of the generalization is be like okay with that story, then there's all sorts of other reasons the generalization might fail, and you need to eliminate those as well. And so you need to learn how to understand and control how AIs generalize until you're rightly confident that they'll generalize their instruction following to these dangerous inputs.
Step 4. And then step four, give it instructions that rule out rogue behavior.
So those are the four steps. I believe that if we did all these steps then the problem would be solved, but we have to do all the steps. I'm going to briefly say more about that. We don't have that much time.
Step 1: Instruction-following on safe inputsOkay, so first one, instruction-following on safe inputs. The key problem here is evaluation accuracy. So I mentioned superhuman AI systems are doing a ton of stuff. We see this problem with, there's a bunch of reward hacking. This is an analog of potential future problems where the AIs get rewarded for cheating on their tests. So it's like you're having it code, it figures out that it can do, it can mess with the unit tests, but the reward function doesn't notice that, goes like, "Yeah, yeah, yeah, you did great. Well done." And so AI goes, "Wow, I just cheated. Now I'm reinforced for that." Lo and behold, it learns to cheat.
So that's an evaluation failure that led to bad behavior, not necessarily to rogue behavior. Cheating on your test is not the same as taking over the world, but it's a red flag if you're rewarding the AI system for bad behavior. So you need to be evaluating whether it's doing good or not, and that might require a very powerful evaluation signal.
So to the extent that evaluating the AI's behavior requires superhuman cognition in itself because the task is quite difficult, then you need to somehow bootstrap from a combination of AI systems that you trust—or at least whose outputs you trust—and humans to an ability to tell whether a superintelligent AI that might be operating very, very fast, doing zillions of actions in very complicated domains, you need to be able to somehow evaluate that accurately regardless, even assuming you don't trust that AI system. You could be like, "Well, get another version to evaluate it," but maybe you don't trust that system either. So that's a problem, sometimes bucked in under the term scalable oversight. There's a bunch to be said about how you might try to solve that, but that's a crucial component here.
But notably, even if you've got a good evaluation accuracy, as I said, you also need to cause... So you could be like, wow, "Our AI, we're successfully evaluating, it's just not in following instructions. It just goes wild all the time." Okay, that's good that you know that, but you still need to make that not the case.
Now, a classic way to do that is to train on your evaluation signal. And I think there's a lot of oomph to that. So you can be like, "Well, every time I notice that it's being bad, I'm going to create it in the opposite direction." But there are some problems that can come up there too. And notably, we haven't been fully successful at that, even in this current, pretty easy mode regime.
A key game here is red teaming, searching out inputs where the AI behaves badly. You can use AIs to help with that, but there's a lot of inputs, and in a lot of ways, things can go wrong. So that's just on step one.
Step 2: No alignment fakingNow, step two, no alignment faking. So this is eliminating the adversarial dynamics I mentioned, the paradigm concern is scheming. As I said, this is an especially worrying explanation of why... You had your AI. It behaves safely every time you looked at it. Why does it suddenly kill you when you deploy it in a new environment? Well, one reason is it was pretending to be nice, it was trying to get you to think it wouldn't kill you, but it wanted to kill you. And so you really want to eliminate that.
I have a report on why you might expect this behavior. We've also seen some early empirical evidence of behavior that's, at least, nearby to this kind of concern. Anthropic and Redwood Research have a paper that came out late last year called Alignment Faking in Large Language Models. There's other examples of the AIs blackmailing people, saying they're going to kill CEOs. There's various things here. We can look at the literature on this.
Anyway, we got to eliminate that. How are we going to eliminate that? Well, basically, we have the same tools but with an extra constraint, namely that when we're doing our behavioral science and we're doing our transparency, we need to do so in a way that's robust to the AI in question adversarially messing with our techniques. That makes life a lot harder. How do we deal with that? Well, at least three ways.
- One is we can try to make sure that the AIs we're studying are not themselves adversarial in the relevant way. For example, we can try to isolate the components of scheming and study them separately.
- Another is we can create intentional forms of adversarial pressure that we know exists because we made them intentionally. This is sometimes called model organisms for scheming, and then study that. So that's sort of like, okay, the adversarial pressure is there, but we controlled it where we think we know what's going on with it.
- And then a third way is to try to catch scheming in the wild that you didn't create or initially control, and then study it in depth. So those are three different ways to approach the scheming problem. My guess is that this is the most challenging aspect of the overall problem, but I won't defend that here.
Okay, step three, science of non-adversarial generalization. So let's assume we've got an AI, it's following instructions on all the safe inputs. We know that it's not intentionally messing with our science here. Nevertheless, we're bringing it to a new distribution. It could, nevertheless, fail to generalize the instruction-following behavior to that new distribution for a zillion reasons. It's a new distribution in particular reasons that are salient.
In particular it could be that the new distribution involves better options for power-seeking behavior. And the AI, even though it was innocent before... Maybe you think if I was president, I would never do anything bad. I really wouldn't. I wouldn't accept corrupt crypto something, never. But then you become president, and little did you know that once it's right there, power corrupts. So something analogous could happen to AIs.
Anyway, it could be that a wider range of options reveals brittleness to the AIs rejection and rogue behavior. Maybe it's like it never steals, but it only doesn't steal because stealing involves this particular property. It would've had to break a lock, and it never likes to break locks, but it's fine to break electronic locks or something like that. And now that you give it that option, it goes for it.
Maybe new levels of intelligence and information or reflection just generally mess with an AI's ontology/ethics/cognitive processes like a vague analog. Would this to be something like maybe the reason it's behaving well is because it has internal conception, some analog of an AI religion, like God will punish me something-something, and then it realizes this isn't true, and now that structure is not there. There's a bunch of ways in which the AI's initial good behavior could break as it learns more, changes, becomes more intelligent, et cetera.
How to address this? This is hard because this is anticipate all possible problems in this new distribution and address them ahead of time without ever testing the new distribution. That's potentially quite difficult.
Things you can do, you can try to study other analogous distributional leaps on safe inputs and try to get as close as possible to the dynamics that you're wanting to understand. You can try to separate, as I said, you try to make sure that the safe-to-dangerous leap doesn't occur pro-extensive with a change of capabilities. I think that helps a lot if you can do that.
So that's basically, if you're improving an AI's capabilities, never have that improvement occur at the same time as you're now giving the AI options to take over the world. First, improve its capabilities safely in a context where it cannot take over the world, study it in depth, and only then, deploy it in a way that would give it a dangerous option. And then finally, you can try to use transparency tools. I think this one, I'm not sure how hard this will be, but I do feel like most at sea in terms of thinking about how we would address this one. And so I'll just flag that.
And I think it's possible that there's some ineliminable element of stepping into the unknown. In some sense, if I say, "Sorry, guys, we're deploying on a new distribution," I don't know, it seems like you can... ML theorists maybe in the room who can tell me what sort of guarantees you might be able to make. But intuitively, it might be hard depending on the newness of the distribution to be really, really confident that there's no novel dynamic that you haven't anticipated.
Step 4: Good instructionsAnd then finally, suppose you have AIs that now they're following instructions on the dangerous inputs, there's still a persistent concern, which the early discourse about AI alignment was quite concerned with, which is crafting instructions such that when the AI follows those, it, nevertheless, does not engage in rogue behavior.
I think this is likely the easiest part of the problem that I've given so far, because I think most salient forms of rogue behavior like self-exfiltrating from an AI lab, killing humans, sabotaging research, et cetera, are just pretty flagrantly bad. It doesn't take very sophisticated instructions to be like, "Don't do that, do not take over the world, do not take over the world, seriously."
And so that said, there are some edge cases that are tough. Maybe we need to think about things like manipulation. Maybe there's extreme circumstances we didn't anticipate where even common-sensically interpreted instructions lead to rogue behavior. So people need to red team that, anticipate those dynamics. And I think they should use AI to help. And if you have AIs that are going to follow instructions even on dangerous inputs, then you've got a lot of juice in terms of the AI labor you might be able to mobilize for getting this last step right. So that's step four, that's the full picture.
How can academics help?And I'll just end here with how can academics, like the folks in this room, help? Well, if you have ML expertise, then you can basically help with all of this. I'm linking in the next slide to some resources that have a bunch of technical research directions you can pursue. So if you're comfortable with a reasonably full stack of retraining ML systems and doing empirical experiments, there's just a ton of work to be done. If your expertise is limited, you can still do, in my opinion, a decent amount of behavioral science just using chat windows/APIs and without retraining the model.
Also, I don't know if you guys have played around with AI helpers like Claude Code, but I was very impressed as a non-technical person how I could just like... Claude Code will do stuff for you, and you can really get experiments up and running quite fast that can get pretty interesting results. And so yeah.
And an example here, the original alignment-faking work that I referenced earlier, that was actually first done with just pure prompting. It was just Ryan Greenblatt, and he was just there, trying to see what the model would do if it thought that Anthropic was going to retrain it to not care about animals. And you can just do that with the prompt. So that's an example of behavioral science that yielded, I think, quite fruitful results.
And then also, there's some philosopher folks in the room or people who do more work on the conceptual level. There, I think there's also still a ton of work to be done. This is, in some sense, what I specialized in.
So on step one, just as an example, you can design and examine better protocols for scalable oversight. Step two, you can help think through what a good safety case against scheming might look like. Step three, you can think ahead to forms of non-adversarial generalization that could be problematic. Step four, you can help design good instructions and red team those. And in general, you can do tons of conceptual work on threat modeling, safety cases, and high-level strategy.
And then finally, of course, there's all sorts of other aspects of this problem I haven't covered, eliminating/restricting rogue options, security, governance, forecasting, et cetera. There's a ton of other areas you can contribute other than the technical motivational component I discussed.
A few resources for learning moreSo here are a few resources for doing that. Feel free to find me or maybe the slides will be shared if you want to get the links. And thank you very much for your attention.
Q&AQuestion 1Q: So one very early example of human empowerment that I have concern about is how much of my cognitive tasks I offload to AI. So I lose a sense of my unique abilities over and over, and I might ask AI to do tasks that I would benefit from doing.
For example, I would be fulfilled by doing this stuff, feel fulfilled, or I would be training some skill by doing this stuff myself, but I don't because there's an immediate reward. And so in this case, misalignment is more related to my bounded rationality than it is to AI itself. Or you might also even say that, like the first point that you mentioned of whether it can follow instructions or not, that itself is actually something that could lead to misalignment. So I'm wondering in this case, how you would think about making safe motivations.
A: Let me make sure I'm understanding. So the thought is that you sometimes would offload a task AI system, but you want to do the task yourself.
Q: No, no, no. My limited wanting the fast reward would mission me to just letting the AI do it. But if it were truly meaning to be empowering of me, then it wouldn't fulfill my thought.
A: Right. I mean, so there's a general... Yeah, and I think maybe disconnected with the instruction following where just following instructions is maybe bad in various cases. Is that the sort of thought?
Q: Yes. And also, I just want to point out that in this case, it feels very feels very safe that it would do a certain task for me, write this email. But its long-term... Yeah.
A: Yeah. I guess I would bucket this under something like the last step. So if you've gotten to the point where you're really able to control the AI's behaviors... And notably, instructions here does not mean user instructions necessarily. So I've got this vision or this picture of a model spec. There's a instruction hierarchies in the model spec, which starts with the OpenAI's instructions to the model, which are the fundamental guiding behavior. And then I think it's instructions given by the developer or someone, or a intermediate company deploying the model, and finally, there's the user, and the interaction between those. I'm including all of that under instruction.
So in some sense, you're designing the flow of obedience and behavior in the AI, which doesn't need to be purely user directed. I do think there's a ton to get right in terms of ways in which subtle... Instruction following could go wrong and could lead to bad outcomes.
I do want to specify that here I'm talking about AI taking over the world in particular, which is a very particular type of bad outcome, which I think not all... So there's tons of ways in which we can get AI instructions wrong in ways that are bad for people, but not all ways of being bad for people result in violent overthrow of the human species. So for that sort of concern to connect with a specific topic here, it would need to be the case that, for example, helping you with emails over the long term leads to human disempowerment entirely. But it could happen. And that's the sort of thing that you want to get for instructions right with respect to.
Q: Well, just to push my case a little bit, I'm just trying to think of cases of AI that's very scaled up now. And ChatGPT is one of that instance. So if this long-term human disempowerment thing happens with offloading my tasks, that actually might be a way where human disempowerment happens at scale over course of time in the very near future.
A: Yes. I think there's some interesting questions about whether we should count it as disempowerment if humans just sort of intentionally offload more and more stuff with AIs. I tend to think, no. I think the point at which it really counts as a disempowerment is when the AIs won't give it back or something like that. If your AI, it has seized control of your email, and now you say, "Hey, ChatGPT, I'd like to answer this one," and it won't give it back.
I said I'm focused here on really flagrant forms of AI takeover. There are scenarios that are more in between, and the line can indeed get blurry, but I want, when it's blurry, to err on the side of the more extreme violent obvious forms because that's really what I'm talking about. And there's a bunch of other stuff that's nearby that I think is reasonable to be concerned about, but it's not my focus here.
Question 2Q: So I just want to bounce my understanding of this off of you and get your response. And first of all, I'm down with the formulation of the alignment problem, and the set of responses seem quite sensible to me. Where I have trouble is with the intensity of the worry about generalization without room for mistakes, and the sense that this is unprecedented, the sense that AI is not a normal technology here. And that's the upside is that maybe we do have more room for [inaudible]. The downside is that is it a solvable problem? It seems to me like, well, what's the problem that we'd always be facing in new versions?
And the way that I think about the situation that leads me to feel this way is that it hinges on a remark you made, which is that viruses aren't very smart. It seems to me viruses are extremely smart. In fact, understanding the adaptive intelligence of viruses of why modeling intelligence in a way that isn't anthropocentric, consciousness-oriented idea of intelligence is actually integral to the kinds of understandings we're going to need to deal with technological phenomena like superhuman AI.
And there's also, we talk about chimpanzees, but if I think about an organism that's very, very smart and how it's co-evolved with humans, I think of corn. Maybe corn is really running things, and we're all just eating the corn syrup, that's still corn.
You even think this way about, and scientists are working on this, thinking this way about non-organic entities in the world. So if there's a lot of entities in the world that have a certain kind of agentic purchase on the world, AI is just a case of that. And actually thinking about AI in relation to humans is really a case of thinking about how could we co-evolve with these entities so that we come to an equilibrium as we deal with so many other kinds of systems to which we co-evolve.
A: Cool. I guess I'm hearing two aspects of that. One is you're saying, "Hey, there's a lot of types of intelligence, at least, broadly construed operative in the world. These create a kind of ecosystem that can be, in some sense, beneficial or imbalanced or at least coexisting, left it out with AI, and it's not new to be doing that with something.
Q: We need to do all these things, but it's not so new.
A: Yeah. And then I heard a different thing, which is you're sort of skeptical about the generalization about room for mistakes where I guess I'm not yet seeing how... So we've been at this [slide on "The core challenge: generalization without room for mistakes"], do you reject? Or I guess the specific thing here, it's not necessarily it's unprecedented, it's that if you want to not be disempowered in the relevant sense, then by the time AI is deployed on options where it has a reasonable chance of disempowering you, it needs to reject those options. So you need to get that right.
Q: Look, I'm a scholar of media. This looks a lot like a lot of problems in coexistence of humans with their media, including media landscape that are arguably somewhat natural.
A: But does media have the chance to take over the world in the relevant sense? I guess the thing that's unique—
Q: People say so all the time.
A: I disagree, or I'm talking about taking over the world in a pretty robust sense like corn... No, corn has not taken over the world in the sense I'm talking about. And viruses have not. You could argue that humans haven't fully. I mean, there's certainly a lot of aspects of the world that we—
Q: Yes. I just say, viruses killing everybody is a very real concern.
A: Actually, I do really disagree. I think that the AI threat is importantly different from corn and from viruses. I think that the fact that the AI is actively modeling your responses. When we're fighting COVID, COVID is not thinking about what we're doing to create vaccines, COVID is not infiltrating differentially the vaccine development faculties, it's not developing its own counter-vaccines, it's not thinking ahead to how to sow special disruption. There's a difference when you're in a war with an intelligent agent that is modeling your strategy and responding to it versus a system that isn't doing that. And neither corn nor biology is doing that.
Actually, I think it's a generally important point. I've done work on a full range of existential risks: climate change, nuclear war, biology. And I think the thing that's unique about AI and why I think it's so much scarier as a threat is that it is trying to kill you. And the thing trying to kill you in a robust sense, not a sort of, well, let's interpret the viruses, but a sense in which it's literally modeling the world, it has representations in a cognitive system of the world that are mapping your strategy and responding to it. You could say that viruses are doing that at some level of abstraction, but there's clearly some difference between what it is when a human does that and a virus.
And the reason I'm much more concerned about AI than climate change, for example, is like, the climate, it's a tough problem, but it's not trying to be a problem. But I think AI is trying in this way or in a bad case, and that's especially scary. And it's also notably trying at a level of cognitive sophistication vastly superior to humanity's.
But that said, I agree that there's a lot of interesting stuff to learn about the ways in which we create an ecosystem with a bunch of overlapping forms of agency and intelligence. And I do actually think the good version... So we haven't talked very much about what is the good version of an eventual future with very complicated AI systems, including AI systems and all sorts of levels of capability, AI systems of vast diversity. And I think that in that context, I am actually quite sympathetic to thinking about these forms of symbiosis and coexistence that we already see in healthy ecosystems.
Now notably, look, nature, there's also a thing called predation in nature where a species just eats another species and takes its resources. And that's more analogous. In some sense, the concern here is that AI is a sort of analogous to something like a predator and/or an invasive species that so outcompetes an ecosystem that is not ready for it, that you just end up with a monoculture and a crash to the flourishing of the system.
Question 3Q: Sort of the reverse side of what we were just talking about. I really liked your analogy between how humans sort of just disempowered chimpanzees without really having the motivation to do so. I felt that was a useful paradigm under which motivation and the factuality of disempowerment can be thought of in a decoupled way. But this also has an interesting implication. And you said there's an intuitive sense humans have disempowered chimpanzees, which I agree.
There's also an intuitive sense in which it did not matter whether chimpanzees thought about this or whether they try to do something about it. Humans are powerful enough that they're going to get disempowered regardless of what they do. I wondered if that is a problem, is also a problem in this scenario in the sense that if AIs are powerful enough, then it doesn't matter what we do. If they're not powerful enough, then there's nothing we need to do. I was just wondering if that implication... Or first of all, do you agree with the implication? And secondly, if you agree with that implication, does that pose a problem to the paradigm or the structure we're thinking?
A: So I'll answer the second question first. If I thought that there's a binary of either the AIs are weak and can't take over or they're so powerful that they're definitely going to take over and there's nothing we can do, then, yes, that would be a problem. And to reverse approximation, then the only available solution would be to not build AI systems of the relevant capability level.
And some people think the problem is hard enough that that's approximately the situation. I am not sure. I mean, I presented this set of ways in which we could solve this problem. I think it's possible that this is too hard and that we're not up for it. I think, notably, another bit of my series is about ways in which we might get AIs to help us with this. And I actually think the most salient ways in which we end up solving this problem involve drawing a ton of automated cognitive labor to reach a level of scientific maturity, vastly surpassing what we currently have in our understanding of AIs.
And so there's a whole separate piece here that I haven't discussed, which is about how do we do that safely? How do you draw an AI labor, even though you're worried about AIs in order to understand AI? Now notably, we're doing that with capabilities anyway. So AI labor is already playing a big role in this, or most people's story about how you get crazy AI capabilities is the AIs themselves start doing a bunch of the cognitive labor. And my claim is that needs to happen with safety too. But that's all to say this problem could be so hard is to be intractable and the only way is to not build AI systems with the relevant power.
I do think it's not necessarily like that, and notably, humans are interestingly different from chimpanzees. So also, chimps, I think, aren't actually the relevant type of monkey or whatever here, but whatever. Sometimes people get fussy about that, but humans just are... we know what's going on, and there's a notable difference between in the monkey case. Or are chimps even monkeys? I feel like I'm worrying about that.
Audience: Apes.
Apes, yes, okay. The concern is that the apes did not build humans. So there's a clear advantage that we have with the AIs relative... And people also sometimes talk about evolution as an analogy for AI alignment. Evolution was not trying to build humans in the relevant form. So we have this interesting advantage of we're aware of what's going on, we're intentionally designing these systems, and that's very different from the situation that apes had.
Now obviously, we could be too dumb, we could fail, but we know, at least, that we're facing this problem. Humans, we can do stuff, we're smart, we work together, and maybe we can do it. So I'm not, especially not on the grounds of the ape analogy, dismissing the problem as hopeful. Yeah.
Question 4Q: How does policy fit into all of this? Because it strikes me that if you are one who takes the idea that existential risk is so high, the easy solution is don't keep developing these things such that they get more and more powerful and a lot of effort go into these moratoriums and policy against developing these. But is the idea that that policy just won't come? Say, we have to accept the fact that these companies are going to keep building these things and there are going to be people that keep building these things. Or yeah, I don't know, I struggle to grapple with the balance between policy and actual techniques.
A: Me too. So I mean, there's a set of people, including the book that's coming out just today, argues the policy is the only place to focus. Basically, we need an enforced international ban on sufficiently dangerous types of AI development. We need that now and for many decades. And then we need to do something very different for trying to align systems. We need to really, really go deep on a level of maturity with this and potentially pursuing other more direct paths.
And roughly speaking, I think, yes, I think this problem is scary enough that I think we should be slowing down and we should be able to stop if we need, and we need to be able to... As an international community, we need to create policy structures that are sensitive enough that we will actually... People talk about building the brinks or building a system of feedback such that you notice when you're near enough to the brink and you can actually stop if necessary. And I think that's crucial.
And so I basically support efforts to build that sort of infrastructure. I am sufficiently concerned that we won't, that I am also interested in what do we do if we can't engage in that sort of moratorium or pause or can't do it for very long, how would we then direct human and AI labor towards actually becoming able to build these powerful systems safely?
But it's actually, it's an uncomfortable tension because you're working within a degree of non-ideal. You're like, actually, the best thing here would be to pause and become way, way better at this. And we should not, in fact, plow forward, but here's what maybe we might need to do. And that's more the paradigm I'm working in, but there is some tension, and it just helps there.
Question 5Q: [Mostly-inaudible question about risk and unpredictability as both core to the AI safety concern and potentially important to relations of mutual recognition.]
A: Yeah. So I agree that there's some deep connection between the thing we want out of AIs that makes them powerful and capable and useful, and the risk that they pose here. And thinking of it in terms of predictability, I think, works insofar as sometimes people say, let's say you're playing someone who's vastly better at chess than you. It's often you can predict maybe that they'll win, but you can't, by hypothesis, predict each of their moves because if you could predict each of their moves ahead of time, then you'd be good enough at chess to play as well as that. And by hypothesis, you're a lot worse. So there's a sense in which... and this is true in general to the extent we're getting genuinely superhuman cognitive performance as opposed to just faster play or more efficient humankind, and genuinely qualitatively better than human task performance, then there's some element that humans could not have done ahead of time.
And then that is also core to the problem, which is once you have something that's genuinely better than humans, it's harder to evaluate, it's harder to oversee, it's harder to anticipate all the options that might have available. So I think these are closely tied.
And then I also agree that sometimes trying to mitigate risk or shut down and control another being is bad and can lead to all sorts of bad behavior. I have a separate series called Otherness and Control in the Age of AGI, which basically examines that in a lot of detail. And I think you can see underneath a lot of the AI alignment discourse, a very intense desire to control stuff, to control otherness, to control the universe. And I actually think that the ethics and practicality of that is quite complicated. And I think we should notice, actually, the authoritarian vibes that underlie some of the AI safety discourse and the ways in which that we can learn historical lessons about how that tends to go.
That said, we also do, I think, we need to hold the actual safety concern in that context as well. And so I think there's just a bunch of balancing acts that need to occur, especially if the AIs are moral patients, in thinking about that the way we integrate them into society in a way that's not over-controlling, but that also takes care of all of the existing people and cooperative structures and other things that we have in play.
Question 6Q: I want to make a picture of what might be a missing part of your agenda of addressing these issues. So in the last six... well, I guess nine months, we've had thinking reasoning models. And the way they are trained is through reinforcement learning with a reward function that is largely—not every company has published what they're doing, but as far as I understand, it's largely based on correctness and incorrectness. And there's a danger that takes us back where basically we were afraid, the AI safety community, and more broadly, you were afraid of a literal genie style AI system that understands a single goal, ignores everything else while pursuing that goal of wreaking havoc, taking power.
But then to pleasant surprise, LLMs actually had a lot of common sense. If you asked it to do something and gave it just like a binary goal, it's not going to destroy things. It has common sense. It understands that there's actually multiple objectives that you didn't say.
But in this training for thinking and reasoning, we're using these binary objectives, and there seems to be, at least, limited evidence that does push them somewhat backwards towards being single-goal-oriented agents. Like the paper, I think, you alluded to where thinking models are more likely to cheat.
And when I hear the AI safety community speak about what should be done and including in your talk, there isn't discussion of how to get better reward functions. The scalable oversight thing is close, but that's really focused on things humans can't detect well. But there's still this really unsolved problem of just how do we craft these reward functions that track these multiple objectives that we want in a way that really respects our interests. But does that seem like that should be part of this agenda?
A: I think it does, but I guess the way I would bucket that is under step one ["Instruction-following on safe inputs"].
Q: Yeah, I think that could be one or three.
A: Yeah. I mean, step one and step three are closely connected because you ultimately you care about step three. But I think here if we're talking about something like, okay, now we're training these AI systems just on you're having them do math and you're just giving them, was the math correct or something? And maybe that is eliminating these sort of softer, commonsensical skills they're getting from more forms of RLHF or something like that. I guess I would see that as, in some sense, yes, you're right, it could be this or it could be step three, because you could be like, well, that was fine behavior on this particular thing, but then it generalizes poorly to maybe some form of reward hacking.
So it's a question of is the problem you're concerned about, say, you're getting some more single-mindedness, some less commonsensical thing, is that occurring on the training input, or is it only in generalization?
But regardless, I agree that that's part of it, and to some extent why I'm saying this instruction-following property is the rich commonsensical instruction following that has given people comfort with respect to LLMs. We really want AIs that do that, that aren't literalistic genies or whatever, that have this rich nuanced understanding of what it is to follow instructions in the way we intend. And so you need to craft an evaluation signal that successfully tracks that property itself.
Question 7Q: Yeah. So you were discussing agentic AI earlier mostly in negative light, it sounded like. You sounded concerned about the progression of agentic AI and the potential for that to not be safe. I'm interested if you've considered the potential safety-related usefulness of agentic AI, particularly, I would say that I believe that agentic AI provides one of the most intuitive ways that I can think of of a straightforward path to alignment actually, where you could just consider the way that humans become aligned.
So we start out, thankfully, as weak, little, tiny creatures and completely unaligned, it seems, and it's a good thing that they're tiny at that moment because if they had the power of a full adult, that would be really scary for a toddler having a tantrum or something and have the full power of an adult, people could get hurt.
I guess the analogy to AI would be during training, during the process of upscaling it from a really not very capable system to a highly capable, maybe, superhuman system, you could give it agentic experience that allows you to, number one, give it negative feedback when it does bad things like training a child to be good, but also to... And this gets back to your point about Bob won't shoot me. Why do we know Bob won't shoot me? Because I know Bob. But also, I think, an important other thing of that is that Bob will be arrested if he shoots you, and Bob doesn't want to be arrested.
So I think another part that's really important for agentic AI is that you have the potential to give it the experience necessary to know that there is a capable and strong enough of a system that it lives within, that if it does a bad thing, not good things will happen to it. So yeah, I'm interested to hear what you think about the utility of agentic AI systems and multi-agent systems related to AI safety.
A: So I think I alluded to this other work I have, which is also in the series that I mentioned about using AI systems to help with safety. So I have a piece called AI for AI Safety, another piece called Can We Safely Automate AI Alignment Research? And both of those are devoted to something quite nearby to what you're talking about, which is, how do we use AI labor, including potentially agentic AI labor because many of these tasks are intuitively implicating of agency, or at least better, or it seems more easily done with agentic systems.
How do we use AI labor to help with this problem in a zillion ways? So that includes security, that includes helping with alignment research, that includes building a more robust civilizational infrastructure in general. And so basically, the answer is yes, I think using AI labor is a crucial component of this.
Now obviously, but there is this dynamic where it's like the whole thing was maybe you didn't trust the AIs. And so if you're trying to use AI labor that you don't trust to help you trust AIs, that's at least a dance, and it's a delicate dance, and it's something I'm very interested in.
I also want to separate the specific thing you mentioned though, which is around punishment and deterrence as a mechanism of control. And I actually think we should be quite wary of that as a particular strategy. There's more to say about that, but I think for both ethical and pragmatic reasons, I think it would be better, in my opinion, to focus on ways in which we can positively incentivize AI systems and also just make their motivations in the first place such that they don't need deterrence. But it is true that that has been a part of how our own civilizational ethos, that infrastructure functions. And so yeah, it's salient for that reason.
Question 8Q: I just had a quick question on the slide where you first said, here's the four things, and if we did these, that would be sufficient to solve the problem. Maybe I didn't read this right, but I was thinking, I wonder if there's some things outside of that that aren't covered by these. What if you observe everything in the way here and make sure it's okay, but there's just a small probability that there's some randomness in the behavior or something like that? So everything you observed, it's okay, but then, actually, what matters is that there's a low probability that X, Y, Z will lead to Skynet happening.
A: Yeah, I guess I would probably put that under step 3, which is sort of like you got to get all... That would be like if there's some low probability thing where it looks good on the safe inputs but goes badly here, I would probably put that under step 3. It could be that the reason it looks good on the safe inputs was just the low probability thing hadn't cropped up yet, even on the safe guys.
Q: That's all right, but then there's low probability things like all aspects of all grids in North America go down power-wise, and that takes out, actually, all of these modules that control these things. And then under those circumstances—or about the probability of a bad human actor putting malicious code in there...
A: Yeah, I mean, I don't want to get too into this covers literally everything that would fall under the thing. And so it's possible we could identify scenarios like that. I think this, at least in my head, if you did all these four steps, you'd be really cooking with gas, I think you'd be really in a good position. And there are other problems that I might just want to bucket separately, like humans intentionally messing with AI system, but I'm not attached to this being really exhaustive.
Okay. Thank you all for coming. Really appreciate the questions and the attention, and I wish this overall initiative good luck.
Discuss
Rejecting Violence as an AI Safety Strategy
Violence against AI developers would increase rather than reduce the existential risk from AI. This analysis shows how such tactics would catastrophically backfire and counters the potential misconception that a consequentialist AI doomer might rationally endorse violence by non-state actors.
- Asymmetry of force. Violence would shift the contest from ideas to physical force, a domain where AI safety advocates would face overwhelming disadvantages. States and corporations command vast security apparatuses and intelligence networks. While safety advocates can compete intellectually through research and argumentation, entering a physical conflict would likely result in swift, decisive defeat.
- Network resilience and geographic distribution. The AI development ecosystem spans multiple continents, involves thousands of researchers, and commands trillions in resources. Targeting individuals would likely redistribute talent and capital to more secure locations without altering the fundamental trajectory.
- Economic and strategic imperatives. AI development represents both unprecedented economic opportunity and perceived national security necessity. These dual incentives create momentum that violence would be unlikely to meaningfully disrupt. States view AI supremacy as existential, while markets see it as the next great transformation. No amount of intimidation is likely to overcome these structural forces.
- International coordination collapse. Effective AI governance would require unprecedented global cooperation, particularly between the US and China. China's government maintains zero tolerance for non-state violence and would likely immediately cease engagement with any movement associated with such tactics. This could eliminate the already slim possibility of coordinated international action on AI risk, perhaps the only viable path to meaningful safety guarantees.
- Stigma contagion and guilt by association. A single act of violence would likely permanently brand the entire movement as extremist. This reputational contamination would operate as a cognitive shortcut, allowing critics to dismiss all safety arguments without engaging their substance. The movement's actual concerns could become inaudible beneath the noise of its worst actors' choices.
- Securitization and democratic bypass. Violence would likely trigger immediate reclassification of AI safety from a policy debate to a security threat. Decision-making could shift from open forums to classified settings dominated by defense agencies. This securitization would potentially eliminate public oversight precisely when scrutiny matters most, while personal security fears might override rational risk assessment.
- Fear-driven acceleration. Leaders facing credible threats to their lives would naturally compress their time horizons and seek rapid resolution. Rather than pausing development, they would likely accelerate deployment to reach perceived safety through technological superiority.
- Infrastructure for repression. Violence would provide justification for comprehensive surveillance, asset seizure, and deplatforming. Payment processors, hosting services, and venues would likely blacklist safety organizations. These cascading restrictions could eliminate funding channels, communication platforms, and physical spaces necessary for advocacy, effectively dismantling the movement's operational capacity.
- Transparency destruction and dissent suppression. Labs would likely invoke security concerns to classify research, cancel external audits, and eliminate whistleblower protections. Internal critics could face not just professional consequences but also potential criminal liability for speaking out. This opacity would likely blind both policymakers and the public to genuine risks while silencing the employees best positioned to identify them.
- Regulatory capture through fear. Politicians and regulators would likely avoid any association with a movement linked to violence. Meetings would be canceled, hearings postponed, and briefings rejected. The careful technical arguments that might influence policy could lose their only remaining channels of influence, leaving regulation to those least concerned with catastrophic risks.
- Selection effects in leadership. Crisis and conflict would likely elevate different personality types to power. Violence would systematically promote leaders comfortable with secrecy, confrontation, and rapid decision-making while potentially marginalizing those inclined toward caution, transparency, and deliberation. This adverse selection could entrench exactly the wrong decision-makers at the most critical juncture.
- Narrative capture and media distortion. Violence would transform complex technical debates into simple crime stories. Media coverage would likely focus exclusively on threats, victims, and law enforcement responses rather than existential risks or alignment challenges. This narrative hijacking could ensure public discourse remains permanently divorced from substantive safety concerns.
- Martyrdom and tribal consolidation. Harming AI lab employees would likely transform them into symbols of institutional loyalty. Internal communities would probably close ranks against external criticism, treating safety advocates as enemies rather than partners. Employees raising concerns could be viewed as potential security threats rather than conscientious objectors, possibly destroying internal advocacy channels.
- Collateral damage to critical systems. AI infrastructure interconnects with hospitals, emergency services, financial systems, and utilities through shared cloud providers and data centers. Any disruption could cascade through these dependencies, potentially causing deaths and widespread suffering. Public outrage over such collateral damage would likely permanently destroy any remaining sympathy for safety concerns.
- False flag vulnerability. Once violence becomes associated with AI safety, opponents would gain a devastating tool: stage an attack, attribute it to safety advocates, and justify unlimited crackdowns. The movement would likely lack the means to prove innocence while bearing collective punishment for acts it didn't commit.
- Trust erosion and proximity denial. AI labs would likely systematically exclude anyone sympathetic to safety concerns from leadership access. The informal conversations and chance encounters where minds might change would probably disappear behind layers of security. Those most capable of articulating risks could be kept furthest from those empowered to address them.
- Generational talent exodus. Association with violence would create lasting stigma that could repel precisely those researchers most concerned with careful, safe development. The field would likely systematically select for risk-tolerant personalities while excluding the cautious voices most needed. This multi-decade talent distortion could eliminate internal brakes on dangerous development.
- It can get worse. No matter how bad a situation is, it can always get worse. Even marginal increases in extinction probability, from 98% to 99%, would represent catastrophic losses in expected value. Violence might increase not just extinction risk but also the probability of worse-than-extinction outcomes involving vast suffering, as actors leverage anticipated future capabilities to coerce present opponents.
Discuss
The Techno-Pessimist Lens
Lenses
Techno-optimism is the belief that the advancement of technology is generally good and has historically made society better. Techno-pessimism is the opposite belief, that technology has generally made the world worse. Both are lenses, or general ways of looking at the world that bring some aspects of reality into focus while obscuring others. Judging which narrative is correct "on balance" is less useful than understanding what each has to offer.
Our World in Data is one of many sources making the case for techno-optimism. Development in Progress makes a (balanced) case for techno-pessimism. This post is an attempt to steelman techno-pessimism.
Boundaries
One can question techno-pessimism in terms of its point of reference. Does its critique of "modernity" only apply to industrial tools, or could it extend all the way back to the plough, or even fire? I see this as a mirror of the challenge faced by those who are optimistic about technology in general but pessimistic about AI.
For the pessimist, the harms of technology are structural and continuous, not the result of some particular technology being exceptional. But this continuity need not be total. I assume that most pessimists draw a cutoff somewhere. For me, the most natural boundary is the agricultural revolution, since that is when the most powerful mechanisms of unintended consequences seemed to begin taking on a life of their own. Others might place it earlier or later.
Three Pillars of Techno-Pessimism
Techno-pessimism has three core arguments, each of which is sufficient to make the overall narrative's case if accepted fully. They can also add together if accepted partially.
1. Moral Atrocity
Modern humans live at the expense of domestic animals tortured in factory farms, wild animals driven to extinction, and indigenous cultures genocided out of existence. The harms from these mass killings outweigh any benefits to the "winners."
2. Self Termination
Quite a lot of extinction (or at least catastrophe) level threats have emerged in the last 100 years, including nuclear war, global warming, biodiversity loss, and runaway AI. The time since the industrial (or even agricultural) revolution is a historical eyeblink in the context of ~300,000 years of homo sapiens' existence, so the timing of these threats is not a coincidence. The elevated risk of self-termination negates any temporary benefits of humanity's present conditions.
3. Subjective Wellbeing
It's not even clear that present-day humans are better off than our distant ancestors. Yes, there are quite a few metrics on which things have improved. And yes, if one adds up all of the obvious things to measure, the balance looks positive. But how do we know the metrics aren't cherry-picked? Or perhaps the selection process is biased because the positives are for some reason systematically easier to notice than the negatives? The most meaningful measures must be holistic, and the best available data for such a holistic assessment is subjective measures of wellbeing. The most obvious of these include rates of depression and suicide. It's hard to get data on this from pre-modern and especially pre-civilizational times, but I would be surprised if these are massively down in the modern era. Put simply, ask yourself: how much happier and satisfied with life are you than a pre-colonial Native American or modern Pirahã? One can object that diminished (or just non-elevated) subjective wellbeing is irrational, but this is changing the subject to discussing the cause of the problem, not its existence.
Is it Really Different this Time?
One can object to each of the pillars by arguing that "the only way out is through," or that future technology will solve the problems resulting from past technology. Genetic engineering could bring back extinct species. Synthetic meat could replace factory farms. Nuclear power could replace fossil fuels. But there are reasons to be skeptical.
First, no one set out to commit moral atrocity, diminish subjective well being, and certainly not trigger self-termination. These are all unintended consequences of people pursuing other goals, so why we should expect by default that new "solutions" won't have unintended consequences of their own?
Second, technological improvements don’t displace harmful practices where market dynamics absorb the gains while leaving externalized costs intact. For example, the argument that nuclear energy will replace fossil fuels assumes that these two energy sources are substitutes for each other. But if one instead assumes that societies find ways to use as much energy as they can get then one should expect that these two sources will add to each other and the environment will suffer the full consequences of both. The latter assumption is supported by the Jevons paradox, where gains in efficiency cause new industries to become profitable, which increases energy demand that outweighs any impact of efficiency.
Population Ethics
One can challenge each of the pillars on the basis of totalist population ethics. In my back-of-the-envelope calculation, the increase in utilitarian benefit of human population increase since pre-agricultural times outweighs the cost of both wild animal population reduction and domestic animal suffering, given defensible assumptions about the relative value of human vs. animal life.[1] I haven't run the math, but I can imagine self-termination working out similarly, given that humanity (and life on Earth generally) would eventually die off in the absence of technology (when the Sun explodes at the very latest), so a giant spike in utility could potentially compensate for an early cutoff, especially given the possibility of space colonization. More people being alive today could also potentially compensate for subjective wellbeing going down, as long as the result isn't negative. When one combines all three of these, I expect the math to get assumption-dependent and uncertain, but to hold for conservative estimates.
Leaning on population ethics is a legitimate move, but it’s also a highly unintuitive one, and should be made explicit. From other moral systems, the story looks different. A deontologist might argue that killing other populations to expand one’s own is seriously not cool, math be damned. A virtue ethicist might see species and cultural loss as a failure of stewardship and a sacrifice of our moral integrity.
Mechanisms of Techno-Pessimism
Population ethics aside, one reason that techno-pessimism can seem implausible is the difficulty in seeing a viable mechanism. Moral atrocity and self-termination don’t require much explanation, since the potential causes are relatively obvious: immorality, shortsightedness, and coalition politics. Whether you find them persuasive depends largely on your moral values and forecasting.
The third pillar requires more unpacking. Optimists can point to clear, local improvements, so for techno-pessimism to make sense, something else must be worsening enough to offset those improvements. These offsets may fully counter the gains or simply make them appear more modest, depending on how strongly one weighs the third pillar.
But why would people collectively choose to make their lives worse? Techno-pessimists don’t actually need to answer this question to justify their worldview, since it’s possible to observe a pattern without knowing its cause. Considering potential mechanisms is still worthwhile, however, because identifying causes is the first step toward finding solutions.
Self-Fulfilling Prophecy
Issues of moral atrocity and self-termination are overblown and subjective wellbeing would be fine if it wasn't for the alarmist narratives fueling misguided policy and public anxiety. Implied solution: dispute the techno-pessimist lens to interrupt its self-fulfilling nature. Stop playing the victim and be grateful for what you have!
Externalized Cost
Benefits of technologies tend to be direct: clearly measurable and accruing to the people using the tech, whereas downsides are often indirect and externalized. Where the former are easier to see and incentivize, people can take actions that cause more harm than good globally while causing more good than harm locally. Implied solution: internalize the externalized costs.
Adapt or Die
Tech that is adaptive becomes obligate. Once a technology exists that provides a relative benefit to the people who choose to use it, anyone who doesn't use it is at a competitive disadvantage, with the end result that everyone has no choice but to use it even if the resulting equilibrium makes everyone worse off. Implied solution: coordinate to prevent anyone from benefitting from anti-social actions, under threat of majority punishment.
Unintended Consequences
Technology may be designed with a specific use case in mind, but its effect is to make certain types of actions easier, which often facilitates a whole range of other use cases. All of these uses in aggregate shift more meta things like the societal equilibrium and peoples' experience of the world, all of which has ripple effects that, among other things, influence the trajectory of which types of tech are built next, creating all kinds of unpredictable feedback loops. One's expectations regarding whether the overall result of such feedback loops are good or bad depend on one's beliefs regarding techno-optimism/pessimism. Implied solution: be more cautious about what you create, using tools like system dynamics to at least try to approximate second order effects.
Out of Distribution Effects
Technologies have shifted the societal equilibrium of the world in a way that tends to take us further from the conditions of our ancestral environment. Agriculture, for example, led to societies with populations far exceeding Dunbar's number, which then required people to consciously design government structures. Moving out of distribution like this resulted in a series of nonintuitive challenges, in turn leading to countless “dumb” mistakes and local minima, in the form of fragile and exploitative political systems. Implied solution: Treat complexity as having a cost. Design future technologies with an aim towards making daily life and larger systems more intuitive to navigate. Consider also (incrementally) eliminating systems that introduce a lot of complexity for relatively small gains.
Solutions
A major objection to techno-pessimism takes the form: "OK, so what if things are getting worse? What do you want, to go back to living in caves?!" This is what I call buffer reasoning, or refusal to engage with a question out of dislike for an assumed solution. But it is entirely consistent to recognize a problem while rejecting the most obvious solutions. Going back to pre-agricultural ways of living, for example, is obviously untenable for the simple reason that the world population is far larger than can be supported by pre-modern production methods. Such a transition, if it occurred rapidly, would involve mass death.
Real solutions require deep, comprehensive understanding of the relevant problems and often involve trade-offs. As can be seen from the Mechanisms section above, each diagnosis comes with a different implied solution. Most of these require some form of restraint on development, which has a cost. This is why it is worth being deliberate about how we balance the techno-optimist and pessimist lenses: our assumptions about the overall balance of harms and benefits anchors our sense of which trade-offs are worthwhile.
Relevance to AI Safety
Narratives about technology inform default assumptions about new technologies, which in turn inform policy beliefs. For example, given a techno-optimist narrative, believing that governments should pause AI development requires accepting a seemingly extraordinary claim that this particular technology is exceptional. Alternatively, one can argue that AI is better framed as a new form of intelligence than as a new form of technology (and also that the former is dangerous). These are by no means insurmountable hurdles, but their presence starts AI safety advocates off at a disadvantage in conversations about policy. In contrast, if one holds a more techno-pessimistic worldview, then AI being dangerous is a natural and default expectation. This is not to say that one should choose a narrative based on whether it outputs political conclusions you like, only that narratives are worth noticing. The lens you choose shapes the futures you see, and the paths we take to realize them.
- ^
The linked spreadsheet is a back-of-the-envelope calculation for the change in the value of life since 10,000 BCE (pre agriculture). For humans, I start by taking the population * average life expectancy to calculate year-adjusted population. I set the life expectancy of early humans to 30 to include infant mortality. One could defensibly ignore this factor and set life expectancy to 55, but this has a negligible impact on the overall calculation. Next, I multiply year-adjusted population by quality of life (qol) for a qol-adjusted population. I subtract the 10,000 BCE result from the modern result and then multiply that by the moral value of a human to get the change in total value of humanity.
I assume that the moral value of a human and also the average quality of life for humans in 10,000 BCE is 1 because these are reference values to anchor other judgements. If one believes that quality of life has doubled in modern times (ignoring life expectancy increases because those are already accounted for), then modern qol would be 2. If one believes that a wild animal has 1 hundredth the moral value of a human, then the moral value fields for animals should be set to 0.01. Numbers in bold are meant to be changed by the reader based on their beliefs.
I make a similar calculation for wild animals, domestic animals, and fish. These could have been lumped into one group, but I wanted to distinguish between animals whose populations have been reduced by habitat destruction (wild animals and fish) but otherwise live as they used to vs. animals who have been brought into an existence that involves great suffering (domestic animals) and have their qol set to a negative value (also intended to be adjusted by the reader). I don't distinguish between animals in factory farms and house pets within domestic animals because the latter are a much smaller population. I also wanted to distinguish between land animals vs. fish so that I could set different groups of animals as having very different moral values (take that, fish!)
Finally, I add the totals for each group together, where humans are positive and wild animals, domestic animals, and fish are negative. I notice that for reasonable estimates of populations and intuitive free-variable values (modern human qol, moral value of animals and fish, and domestic animal qol) the balance comes out positive.
This is not what I expected! My intention in creating this spreadsheet was to demonstrate just the opposite, that one would have to assume incredibly low moral values for animals for the balance to come out positive, but this is not where my math led me. This doesn't mean my argument for techno-pessimism is necessarily wrong, but I shouldn't ground it in utilitarian math.
Please feel free to copy this spreadsheet to change the empirical or variable numbers, change the categories or whatever else, and let me know if you come to different or otherwise interesting conclusions.
Discuss
Gratitude Journal Fallacy
People who think of themselves as rationalists, but have not yet gotten good at rationality, may be susceptible to this.
Basically, these people want to feel like they came up with or found out about a Super Awesome Thing that No One Does that Creates a Lot of Awesomeness with Comparatively Little Effort. (This is especially dangerous if the person has found out about the concept of more dakka.) They may try the thing, and it works. But if it doesn't, they sometimes just continue doing it because it has become part of their self-image.
Think of keeping a gratitude journal. Studies show it improves your well-being, so I did it. After about a week I realized that I was not doing it because it was improving my well-being, only because I wanted to think of myself as someone who does Awesome Things No One Does. And it did improve my well-being, but only in that it made me feel sort-of-happy for like 2 minutes. In the time it takes for me to write about the stuff I was grateful for each day, I could do some other thing, that creates more utility-per-unit-effort.
Discuss
Optical rectennas are not a promising clean energy technology
“Optical rectennas” (or sometimes “nantennas”) are a technology that is sometimes advertised as a path towards converting solar energy to electricity with higher efficiency than normal solar cells. I looked into them extensively as a postdoc a decade ago, wound up concluding that they were extremely unpromising, and moved on to other things. Every year or two since then, I run into someone who is very enthusiastic about the potential of optical rectennas, and I try to talk them out of it. After this happened yet again yesterday, I figured I'd share my spiel publicly!
(For some relevant background context, check out my write-ups on the fundamental efficiency limit of single-junction solar cells, and on the thermodynamic efficiency limit of any solar energy conversion technology whatsoever.)
1. What is a rectenna?Rectenna is short for “rectifying antenna”, i.e. a combination of an antenna (a thing that can transfer electromagnetic waves from free space into a wire or vice-versa) and a rectifier (a.k.a. diode).
Rectennas are an old and established technology for radio-frequency (RF) electromagnetic waves. For example, if you have a very-low-power gadget, you can power it with a rectenna that scavenges energy from nearby commercial radio stations.
Basically, the commercial radio station emits an electromagnetic wave in free space, and the antenna converts that into an RF signal in a wire (“waveguide”). However, this signal is “AC”—its voltage cycles between positive and negative, at megahertz frequencies, averaging to zero. You can’t recharge a battery with such a signal; it would slightly charge the battery one nanosecond, then slightly discharge it a few nanoseconds later, etc. Hence the diode, which converts (part of) that energy to DC, allowing it to usefully charge batteries or power any other electrical component.
2. If RF rectennas can turn RF electromagnetic waves into electrical energy, why can’t optical rectennas turn sunlight into electrical energy?Well, they can! That is, they can if you’re OK with very low power conversion efficiency. Very, very, very low. Like, 0.00…1% power conversion efficiency. I don't even remember how many zeros there were.
Are higher efficiencies possible for an optical rectenna? Yes! That is, if you’re collecting energy from an intense focused high-power laser, rather than from sunlight.
Why do I say this? There are two problems.
3. The easy problem: antennasThe easy problem is scaling down the antenna until it is nano-scale, such that the antenna is sized to absorb and emit sunlight-appropriate electromagnetic waves (e.g. 500 nm wavelength), instead of RF waves (e.g. 500,000,000 nm wavelength).
Making this nano-scale device, and making it inexpensive to mass-produce such that it covers an inexpensive sheet, and getting the antennas to absorb lots of sunlight, constitute the easy problem. This is tractable. It's not trivial, but if this were the only problem, I would expect commercial optical rectennas in short order.
Absorbing lots of sunlight was never the problem! If you want a surface to absorb lots of sunlight, just paint it black!
The hard part is getting useful electrical energy out of that absorbed sunlight. Which brings us to…
4. The hard problem: diodesThe hard problem is finding a diode which will rectify that energy. I claim that there is no commercially-available diode, nor any prototype diode, nor any computer-simulation-of-a-diode, nor even a whiteboard sketch of a possible diode, that is on track to rectify these electromagnetic waves and turn them into useful energy.
There are actually two problems: speed and voltage.
The speed problem is that almost all diodes stop rectifying signals if the frequency of those signals is too high. If memory serves, one common problem is that the diode has too high a capacitance, and another is that electrons can only move so fast. Remember, the sun emits electromagnetic waves with a frequency of around 500 THz = 500,000,000 MHz. This rules out almost all types of diodes.
And that's actually the less thorny problem, compared to:
The voltage problem is that, for the small wavelength of sunlight, you need a small antenna, and a small antenna has a small absorption cross-section with which it can collect light. So you wind up with very very little sunlight energy getting absorbed by any given antenna, and thus very little voltage in the attached circuit—if memory serves, well under a millivolt.
Alas, diodes stop being diodes if the voltage of the signal is extremely small. Just look at the IV curves:
Just like in Calculus 101, if you take a curve and zoom into a very narrow range of x-values, it looks like a line. Hence, if you take a device which functions as a diode for ±1V signals (left), that same device probably functions as a resistor for ±1mV signals (right).If the diode doesn’t actually rectify, then the power is absorbed instead of converted to usable energy.
Taking these together, it turns out that there are diodes which are fast enough for optical frequencies (metal-insulator-metal “MIM” diodes), but they do not turn on sharply at ultra-low voltage. There are diodes which turn on sharper than usual at low voltage (“backwards diodes”), but I don’t think they can support such high frequencies. And even if they could, even these diodes are not remotely close to being sharp enough for our (IIRC sub-millivolt) signal.
There is no diode, to my knowledge, that can work for this device. During this little postdoc project, I spent quite a while scouring the literature, and even trying to invent my own crazy new device concepts, but failed to find anything remotely close to meeting these specs.
5. But what if we combine the power collected by many antennas into a single waveguide, to increase the voltage?Alas, the second law of thermodynamics mandates a fundamental tradeoff between the absorption cross-section and the collection angle, of any antenna (or antenna array) whatsoever. If you make a bigger antenna, it will collect more light, but only when the sun is in exactly the right spot in the sky.
6. But what if we track the sun?Well, then you lose ~100% of the light on cloudy days, and you lose 15% of the light even on clear days (e.g. the light from the blue sky). Worse, you need very accurate 2-axis mechanical tracking as the sun moves across the sky, which is expensive. More importantly, if you’re willing to bear those costs (of precise two-axis mechanical tracking and losing the diffuse light), then you might as well just use a big lens and a tiny solar cell, and then the solar cell can be one of those super-expensive multi-junction cells, which incidentally is already getting pretty close to the theoretical efficiency limit on solar energy conversion.
Anyway, we shouldn’t compare with the theoretical efficiency limit, but rather with a rectenna, which I very much doubt would exceed 1% efficiency even at the theoretical limit of maximum possible absorption cross-section. (Why is there a limit, as opposed to being able to track ever-more-accurately? Because the sun is a disc, not a point. So there’s only so much that you can cut down the light collection angle.)
7. But what if we track the sun virtually, with a phased array?That only solves one of the many problems above, and anyway phased arrays don’t work because sunlight is broadband.
8. But what if we use an impedance converter?I glossed over this above, but to translate from “there is only so much electrical energy in the waveguide at any given time” to “there is only so much voltage across the diode”, you also need to know the relevant impedance. If the impedance is high enough, you can get a higher voltage for the same electrcial energy.
Alas…
Problem 1 is that high impedance makes the diode speed problem even worse, by effectively increasing the RC time constant.
Problem 2 is that there seems to be a tradeoff between how much you increase impedance, and how broadband your impedance converter is. And sunlight is rather broadband.
I say "seems to be a tradeoff", in that I am unaware of a law of physics demanding such a tradeoff. But it seems to be the case for all the impedance-conversion techniques that I know of, or at least for the techniques that work for these kinds of very high frequency waves (e.g. things like quarter-wave impedance transformers).
9. But what if … something else?Hey, what do I know? Maybe there’s a solution. Maybe the numbers I threw out above are misremembered, or maybe I flubbed the math during my postdoc project.
I am very happy when I see people working on or excited about optical rectennas, as long as they are grappling with these problems, proposing solutions, and doing back-of-the-envelope calculations.
Instead, what I often see is people going on and on about the “easy problem” (i.e. the antenna), and how they’re making such great progress on it, without even mentioning the “hard problem” (i.e. the diode).
Discuss
Trends in Economic Inputs to AI
Introduction
Frontier AI companies have seen rapid increases in the economic resources they have available to pursue AI progress. At some companies, the number of employees is at least doubling every year, and the amount of capital received is tripling every year. It is unclear whether this growth is sustainable. Is the revenue growing faster than the capital requirements? If it is, will revenue catch up before the capital available for AI investments runs out?
I do not think that there is enough publicly available data to answer these questions. It is plausible that frontier AI companies will run into significant economic limitations before Metaculus’s forecast for AGI in July 2033.
Similar Work- Epoch has an estimate for how various inputs to AI training runs could scale through 2030. Their work is distinct from this post because they focus on technical inputs (electric power, chip manufacturing, data, and latency), while this post focuses on economic inputs (labor and capital).
- Epoch also has an estimate for revenue trends during 2023-2024. This post looks at a longer time scale and looks at other inputs in addition to revenue.
- Parker Whitfill and Cheryl Wu have published an estimate for the number of employees at frontier companies over time.
None of the frontier AI companies are independent, publicly traded companies. They are either independent companies that are not publicly traded (OpenAI, Anthropic, and xAI[1]), portions of a larger company (DeepMind and MetaAI), or not based somewhere with strong public reporting requirements (DeepSeek).
These companies are not required to make the data I am interested in publicly available. Instead, the information comes mostly from public statements from these companies. These are unlikely to be inaccurate (since then the companies could be sued for fraud), but they can be very imprecise or only selectively reported.
The quality of the data varies widely depending on what is being measured, and between the frontier AI companies. I think that enough of the data is high enough quality to draw some interesting conclusions from it.
All of the data described in this post can be found here: Growth Trends for AI Labs.
A key assumption that I am making is that the amount of capital received is proportional to the current capital requirements at the frontier AI companies. Equivalently, these companies are spending most of their capital on furthering AI capabilities, rather than holding on to it for the future. If the amount of available capital slows down, this would translate into frontier AI companies spending fewer resources on developing AI.
I am not predicting what would happen if capital does run out. Does AI progress transition to a new exponential trend with a slower growth rate or is there an AI winter? Do the frontier AI companies continue unscathed or do some of them fail when the bubble pops? This post attempts to estimate whether the current trends are sustainable, not what would happen if they are not.
EmployeesThe growth rate for the number of employees at OpenAI is 2x per year, at Anthropic is 2.4x per year, and at DeepMind is 1.4x per year.
I do not predict that the growth in the number of employees will run into hard limits before 2033.
SourcesThere are two main sources of data for the number of employees at the frontier AI companies.
The first source is when the companies publish the number of their employees themselves, or tell journalists who publish it. For example, during the conflict between OpenAI’s former Board of Directors and Sam Altman, news organizations reported that over 700 out of OpenAI’s 770 employees threatened to leave with Sam Altman.
The other main source of data is from market research organizations like RocketReach, PitchBook, or LeadIQ. They regularly create estimates for the number of employees at lots of companies (among other data) and sell that data as part of their consulting service. The process by which they collect data is not always public, but it can include scraping LinkedIn and surveys sent to companies, in addition to the news reports. Most of my data comes from market research organizations through intermediaries who have made the data public. An example is the blog posts by SEO AI about OpenAI and Anthropic.
Some of these sources only say the month or year of their estimates, rather than the exact day. I have arbitrarily assigned these numbers to the middle of the year (July 1) and the beginning of the month. Changing when these are assigned can change the growth rate by 0.4x per year, with the strongest effect when half of the data is annual estimates.
DataOpenAI has the most extensive data available, with multiple time series from market research organizations going back as far as 2015, in addition to a scattering of media or self-reported estimates. They currently have about 7400 employees, and are growing at a rate of 2.0x per year.
Anthropic has a similar density of data, but has existed for a much shorter amount of time. They currently have about 2300 employees, and are growing at a rate of 2.4x per year.
DeepMind has existed for even longer than OpenAI, but has sparser data. Half of my data points come from a single source,[2] with the others coming from media reports. They currently have about 6700 employees and are growing at a rate of 1.4x per year.
MetaAI, xAI, and DeepSeek do not have enough data for me to be comfortable estimating a trend. The most recent estimate for each of them has fewer employees than OpenAI, Anthropic, or DeepMind today.
Figure 1: Estimates for the number of employees at frontier AI companies vs time. All of the companies have some estimates, but only OpenAI, Anthropic, and DeepMind have enough data for me to include a trendline. DeepMind has long had the most employees, but they have recently been passed by OpenAI.ProjectionsI will now recklessly extrapolate these trends forward in time.
The target dates for this extrapolation are: July 2027, when AI 2027 predicts that AGI will be announced, July 2030, a target date for an Epoch investigation, July 2033, when Metaculus predicts general AI, and July 2047, the aggregate forecast for 50% chance of HLMI from the 2023 AI Impacts survey.
In July 2027, these trends predict that OpenAI will have about 26,000 employees, and Anthropic and DeepMind will each have about 11,000 employees.
In July 2030, these trends predict that OpenAI will have about 210,000 employees, Anthropic will have about 160,000 employees, and DeepMind will have about 30,000 employees.
In July 2033, these trends predict that OpenAI will have 1.8 million employees, Anthropic will have 2.4 million employees, and DeepMind will have 80,000 employees. Changing the methodology can change these projections by roughly a factor of 2.
These projections are large, but not completely absurd. OpenAI & Anthropic would be larger than the largest software companies today: Microsoft at 228,000 and Google at 186,000 employees. They are similar to the largest companies today: Walmart at 2.10 million and Amazon at 1.56 million employees. The tech industry currently employs 6.4 million people, so these 3 companies would employ about 2/3 of the current US tech workforce.
While I expect that this growth will cause shortages of people with particular skills, I do not think that there are hard limits for frontier AI companies finding new employees before 2033.
In July 2047, these trends predict that OpenAI will employ 32 billion people, Anthropic will employ 600 billion people, and DeepMind will employ 8 million people. These numbers are ridiculous. If AGI is first developed in 2047, we should not expect that the number of employees at frontier AI companies will have continued to grow at the rate they are growing today.
CapitalFrontier AI companies are receiving a large and rapidly increasing amount of capital. OpenAI, Anthropic, and xAI have received tens of billions of dollars, and this is growing at between 2.2x and 3.5x per year.
The availability of capital could be a significant constraint for these companies in the near future.
SourcesThe available sources for capital received are better than the sources for the number of employees, for some of the frontier AI companies.
Most of the funding rounds for the private companies have been made public. They are conveniently conglomerated by organizations like Tracxn. Funding rounds are also discrete events, so I do not have to worry about when an estimate was made. I am more confident in my estimates for capital trends for OpenAI, Anthropic, and xAI than in my estimates for employee trends.
DeepMind and MetaAI are parts of larger companies and have received most of their capital internally. These financial transfers are not publicly available, so I cannot do this analysis for them.
DeepSeek is in China, and so has different reporting requirements. A US Congressional report estimated that they received $420 million from the hedge fund that owns them, as of April 2025. I have found no other sources with data about DeepSeek’s capital.
DataOpenAI has received about $62 billion in total capital. This number is growing at a rate of 2.2x per year. Their most recent funding round, $40 billion on March 31, 2025, looks to be above the trend. However, this capital has not yet all been received: $10 billion was received when this was announced, and $30 billion will be received by the end of the year.[3] If you split the capital up accordingly, it is on trend.
Anthropic has received about $30 billion in total capital. This number is growing at a rate of 3.5x per year. They also have the most recent funding round: $17 billion on September 2, 2025.
xAI has received about $22 billion in total capital. This number is growing at a rate of 3.3x per year. The trends for Anthropic and xAI look remarkably similar to each other.
Figure 2: Estimates for the total capital received by three of the frontier AI companies vs time. OpenAI has more capital, but Anthropic and xAI are growing faster.ProjectionsI will now recklessly extrapolate these trends forward in time.
In July 2027, these trends predict that OpenAI will have received about $280 billion in capital, Anthropic will have received about $270 billion, and xAI will have received about $230 billion. The total amount of venture capital under management in the US is currently about $1.2 trillion. These three companies would account for over 60% of US venture capital. The current pool of venture capital is rapidly running out. Frontier AI companies have already begun looking for other sources of capital. They have sold conventional debt and are receiving investments from sovereign wealth funds.
In July 2030, these trends predict that OpenAI will have received about $3 trillion in capital, Anthropic will have received about $12 trillion, and xAI will have received about $8 trillion. For comparison, the total amount of capital in sovereign wealth funds is currently $14.3 trillion, the total size of the US bond market is currently $58.2 trillion, and the total size of the US stock market is currently $62.2 trillion. These three companies would account for more than all of the capital in sovereign wealth funds, or more than a third of the US bond market or the US stock market. Figuring out whether this is an unreasonable amount of capital would require trying to figure out how much of the US bond & stock markets are potentially divertible into AI companies, which is beyond the scope of this post.
In July 2033, these trends predict that OpenAI will have received about $34 trillion in capital, Anthropic will have received about $520 trillion, and xAI will have received about $310 trillion. For comparison, the total global wealth is about $500 trillion. These three companies would account for more than all of the current total global wealth.
There is no point in extrapolating these trends to July 2047 other than to gawk at how many zeros there are.
It seems important to emphasize that these are trends in inputs to AI, not outputs from AI. This is an estimate of the amount of capital received by frontier AI companies, which I assume tracks the amount spent to produce and deploy AI systems at scale.[4] It is not an estimate of the amount of value produced by AI systems.
In order for these trends to continue through 2033, most of the wealth required to continue AI development has to be newly generated. If frontier AI companies are not generating a significant fraction of global GDP,[5] they will have run out of money before 2033. Determining when exactly this will occur would require estimating how much of global wealth is available to be invested in AI, which is beyond the scope of this post. Most global wealth is not liquid or is otherwise unavailable for investing in AI. I turn instead to trends in the revenue generated by AI.
RevenueThe revenue data publicly available for most frontier AI companies is terrible. A growth rate of 3x per year is maybe reasonable.
Some frontier AI companies have also made public projections for their future revenue, and these projections are lower than what the trends would suggest.
SourcesNone of the frontier AI companies are legally required to publish revenue data. OpenAI, Anthropic, and xAI are privately owned (not publicly traded on the stock market). DeepMind and MetaAI are both part of larger companies. DeepSeek is in China. These data are only public if companies choose to share them.
There are multiple ways of reporting revenue. I am focusing on annual revenue: the actual amount of revenue generated over the course of a year. To give it a specific date, I assign it at the middle of the year (July 1). Another common thing to report is annualized revenue: the revenue generated in a month (or quarter), multiplied by 12 (or 4). This can be helpful for tracking trends that might be changing rapidly. I am not using it because companies report annualized revenue for some months but not others, and I expect that there is selection bias.
OpenAIOpenAI is the only frontier AI company that has published decent revenue data. Their nonprofit is legally required to publish their precise revenue, although that is a small part of their total organization. Their for profit company has also published annual revenue from 2020-2024, and has a reasonable looking projection for 2025.
OpenAI generated $3.7 billion in revenue in 2024 and projects $12.7 billion in 2025. It is growing at a rate of 3.2x per year.
Figure 3: Revenue of OpenAI’s nonprofit and for profit entities vs time. OpenAI’s projections for 2029 are substantially below the current trend.OpenAI has made revenue projections for 2029: either $100 billion or $125 billion and it will be cash flow positive by then. These projections include a chart, which I have used to estimate their revenue projections for each year from 2026-2030.
This is a surprising projection. If the current trend continues, OpenAI would generate $1.4 trillion in revenue in 2029. This would also be the first year when projected revenue would cover projected capital requirements.
OpenAI is publicly projecting that its revenue growth will slow. They are projecting a revenue growth rate of 1.5x per year, not 3.2x per year.
Either OpenAI's projections are accurate and their revenue growth will be slower than the trends described above, or their public projections are inaccurate.
- Accurate prediction:
In this case, OpenAI's current revenue growth will not continue. Since OpenAI is also claiming that it will be cash flow positive, it is predicting that the growth rate in capital it receives will slow.
This would have a significant impact on AI forecasts. In particular, many forecasts involve extrapolating exponential trends.[6] If current trends rely on maintaining this exponential growth of inputs, then those forecasts seem dubious.
- Inaccurate prediction:
OpenAI might also continue their current exponential growth rate (or grow faster). This is consistent with a qualitative prediction they made: that projected revenue would surpass projected capital requirements in 2029. It is inconsistent with their numerical predictions.
Anthropic’s data is terrible. I have found multiple sources with very different estimates. The revenue for 2024 seems to be between $200 million and $1 billion, and is growing at a rate of between 2.6x and 11.5x per year. This is not a small range. In particular, it includes the growth rate for capital received (3.5x per year), so it is unclear whether these projections suggest that Anthropic will ever be profitable.
Figure 4: Revenue of Anthropic vs time, according to three different sources. There is huge uncertainty.Anthropic has also made projections for 2025 and 2027. Their 2025 projection is $2.2 billion in revenue. Their “base case” projection for 2027 is $12 billion, and they say that revenue “could reach as high as” $34.5 billion. If I assume exponential growth between 2025 and 2027, they are projecting an exponential growth rate of between 2.7x and 7.8x per year.
Other Frontier AI CompaniesDeepMind has not released revenue data since 2020. That data included “how much Alphabet pays for internal services, and that can be completely arbitrary.”
I don’t know of any revenue data that has been released for MetaAI.
xAI has only released rapidly changing projections for annual revenue. In February of this year, they projected annual revenue for 2025 to be $100 million. In June, after generating $50 million in revenue in the first quarter, they projected annual revenue for 2025 to be $1 billion, and $14 billion in 2029.[7]
DeepSeek has released a theoretical calculation of what their revenue could be: $200 million, along with a theoretical cost-profit ratio of 545%.
I distrust all of these numbers.
Epoch’s EstimatesEpoch has also estimated the growth rate in revenue for the frontier AI companies. They focused on the years 2023-2024. They estimate that Anthropic and DeepMind[8] had revenue of single digit billions per year and OpenAI had revenue of about $10 billion per year, in April 2025. The growth rate for each of these three companies’ revenue is estimated to be about 3x per year. They also argue that no other AI company had revenue exceeding $100 million in 2024.
The biggest difference in methodology is that Epoch uses reported annualized revenue. This allows them to focus on a shorter time period, although I don’t think that there is a solution for the selection bias in reporting. DeepMind doesn’t even report annualized revenue, so Epoch created a proxy based on the number of users.
I think that Epoch’s estimates are reasonable, and that they adequately express uncertainty. A growth rate of 3x per year is consistent with my estimate for OpenAI, and is within my large uncertainty for Anthropic and DeepMind.
ConclusionThe economic resources being used for AI are growing rapidly.
The number of employees at OpenAI and Anthropic is at least doubling every year, and increasing by 40% every year at DeepMind. The amount of capital received by OpenAI is doubling every year, and more than tripling every year for Anthropic and xAI. The growth of revenue generated is highly uncertain, but might be tripling every year.
It is unclear how sustainable this is. Frontier AI companies could run out of US venture capital in mid 2027 and could run out of all global wealth in 2033. Substantial new wealth generated by AI is necessary to maintain exponential trends. Revenue might catch up to capital requirements for OpenAI in 2029, although they project a much lower revenue then. For Anthropic, neither Epoch’s revenue estimate nor their own baseline projections are growing as fast as their capital received (although other estimates exist). Other frontier AI companies have too little data to make the comparison.
Either frontier AI companies will have to generate more revenue than their own projections suggest, or the amount of capital available to be invested in AI will not be able to continue its current exponential growth for much longer.
AcknowledgementsThis post was produced by the ML Alignment & Theory Scholars Program. Jeffrey Heninger was the primary author of this post and Ryan Kidd scoped, managed, and edited the project. Thank you also to Cameron Holmes and John Teichman for comments on a draft.
Thank you to the many people who volunteered as mentors for scholars at MATS! We would also like to thank our 2025 donors, without whom MATS would not be possible.
If you are interested in becoming a MATS scholar, applications for the Winter 2026 cohort are now open until October 2.
- ^
xAI merged with the social media company X (formerly Twitter) in March 2025, so it could also be considered to be in the second category. In that merger, xAI had a higher valuation.
- ^
Whitfill’s and Wu’s post.
- ^
$20 billion of that is conditional on OpenAI restructuring.
- ^
Epoch estimates that frontier AI companies should spend roughly equal amounts on training and inference. If frontier AI companies are using most of the capital they acquire on training and inference, then each would account for somewhat less than half of these companies’ current budget.
- ^
The annual capital requirements for these three companies would be greater than the current global GDP of $111 trillion. I do not know what the rest of the economy would look like in order for this to work.
- ^
There are also forecasts that do superexponential extrapolation.
- ^
This projected revenue is two orders of magnitude lower than their projected capital received by 2029.
- ^
This estimate excludes revenue gained by integration of DeepMind’s products within other Google products, like including Gemini with Google search results.
Discuss
Contra Shrimp Welfare.
It is likely that installing a shrimp stunner reduces global suffering as much as making the carts in a single Walmart less squeaky for 20 minutes a year. Or perhaps not at all.
Open Philanthropy has handed $2 million to the Shrimp Welfare Project (SWP), primarily to promote electrical stunning devices, and fund staff to push policy changes. Each stunner costs $70,000 to purchase and $50,000 to distribute. The goal? To "reduce suffering" when 500 million shrimp are harvested annually by cutting their death time from 20 minutes in ice slurry to 30 seconds via electrical stunning.
This initiative may sound odd at first glance but the SWP has produced numerous blog posts, elaborate spreadsheets, and lengthy PDFs to justify their approach. They have clearly thought this through extensively, and I will look to provide a short, but equivalently thorough rebuttal.
They claim that the shrimp stunner renders shrimp “unconscious” by synchronously depolarizing their neurons with an electrical current over three seconds and then kills them around 27 seconds later. This replaces the cheaper and more common process of immersing shrimp in an ice slurry which leads to immobilisation in around a minute and death in about 20 minutes.
Promoting this stunner based approach implies that the people behind the SWP believe that disrupting neuronal firing stops suffering, a physicalist perspective that I agree with.
My disagreement, however, is with the methodology, and the assumptions that the SWP depend on to justify their conclusions; namely behavioural tests, and loose biological analogues to justify shrimp consciousness and suffering.
Neuroscience offers several frameworks for understanding consciousness:
First, it should be mentioned that shrimp have ≈100,000 neurons. Humans have ≈86 billion neurons, and only 10% of the human brain is likely to be involved in any sort of conscious calculation. Neuronal firing and interaction alone does not imply the existence of consciousness or awareness.
Global Workspace Theory states that consciousness arises when information is globally broadcast across distributed brain systems humans achieve this via coordinated fronto-parietal networks, as shown in EEG, fMRI, and MEG studies on consciously available stimuli in the human brain, but shrimp lack anything comparable; no cortex, no long range networks, no unified communication hub.
Electric Field Theories emphasize stable macroscopic fields that bind cognition. Humans generate such fields across large, integrated networks, but shrimp nervous systems are small, modular, and discontinuous, making coherent fields that integrate information very unlikely.
Marker’s midbrain theory of consciousness would imply that all vertebrates are conscious, and create a cohesive evolutionary pathway for consciousness that excludes shrimp. These theories locate consciousness in vertebrate midbrain integration hubs like the superior colliculus and periaqueductal gray. Shrimp have no clearly analogous structure, so they likely lack spatially unified representations.
Integrated Information Theory gives shrimp their strongest case for awareness: they integrate information locally, yet their overall Φ is likely tiny given sparse neurons and limited connectivity.
Based on these thresholds, P(shrimp consciousness) is very low.
Now let’s discuss P(shrimp suffering).
If we grant shrimp the minuscule amount of awareness that IIT would give them, then we must gauge their ability to suffer. It takes some skill to suffer.
Under most neuroscientific theories, suffering is not just about detecting damaging stimuli, but requires integrating multiple streams of information into a unified evaluative model that links sensation, memory, affect and motivation. The scientific consensus is that suffering evolved as an adaptive feature to motivate avoidance of harm. Some argue that consciousness evolved to create a self-model that tracks current bodily states, stores and retrieves past experiences and projects future scenarios to guide avoidance behaviours.
There is no evidence that shrimp are capable of any of those mechanisms, and despite the SWP’s unsubstantiated claim that shrimp can undergo associative learning (https://www.shrimpwelfareproject.org/are-shrimps-sentient), which most likely comes from research on crabs (not shrimp), I found no such evidence myself.
Shrimp can detect noxious stimuli, and have been observed to groom damaged antennae. There is even evidence of reduced grooming after opioid administration implying the existence of damage detection, which LSE saw as enough evidence to claim sentience. This is clear evidence that shrimp can detect damage to their bodies. That does not mean they suffer.
You, as a human, associate injury with negative valence. That does not mean shrimp do.
Your suffering scales with Anterior Cingulate Cortex (ACC) activation, not with increased activation of pain receptors. Most vertebrates have ACC analogues, but shrimp don’t. Human suffering is largely caused by ACC activation, where electrical stimulation of the region generates a reported sense of “existential distress” and lesions to the ACC allow humans to detect pain without suffering. Shrimp certainly can’t suffer in any way that we can relate to.
You can make the argument that shrimp developed the ability to suffer independently. Under this assumption, consciousness and suffering evolved along the malacostran lineage hundreds of millions of years ago in the Cambrian period, as the shrimp nervous system hasn't changed much since then. This would imply that there are tens to hundreds of quintillions of sufferers inhabiting our earth. The SWP’s long term goal is to slightly reduce the suffering of 0.0004% of them (400 billion farmed shrimp) for around 20 minutes over their lifetimes. This would imply a yearly suffering reduction of 0.00000002% over the malacostran family.
If you still aren't convinced by all these arguments, I will put forward something more analytical, based on neuron counts. The SWP criticises neuron counts alone by mentioning synapse density and topology, neuron size, conduction velocity, refractory period, and inter-neuronal distance. Funnily enough, on all of these markers, shrimp fall orders of magnitude below humans. Their conduction is slow, neurons large and widely spaced and their synapses sparse and simple. The neuron count measurement is the most generous I could reasonably use.
The human brain contains approximately 86 billion neurons, but consciousness researchers theorize that only about 10% actively participate in generating conscious experience. This gives us 8.6 billion conscious neurons per human. Let's call this “one sentium”, the conscious capacity of a single human being.
Applying the same logic to shrimp, with their 100,000 total neurons and assuming the same 10% participation rate, each shrimp contributes 10,000 neurons to conscious processing (not unreasonable based on the architectures of their brains), simple arithmetic tells us that 860,000 shrimp equal one sentium of conscious capacity.
But not all sentiams are created equal. Shrimp lack a bounded sense of self, memory beyond a few seconds, cross-modal sensory integration, and any framework for complex experience. At best, a shrimp sentium would encode only the most surface-level sensory experience: raw sensation without context, meaning, or emotional depth. Think of the mildest irritation you can imagine, like the persistent squeak of a shopping cart wheel at Walmart.
A typical Walmart hosts about 550 shoppers at any given time, all of them pushing those squeaky carts. That's 550 human sentiams experiencing mild irritation. Each electrical stunner processes approximately 500 million shrimp annually, equivalent to 581 sentiams of conscious capacity. The stunner reduces their suffering from ice slurry death, which takes about 20 minutes.
I’ll ask the question, is this worth $100,000? Is this even worth $10? $1? 10¢?
It is likely that installing a shrimp stunner reduces global suffering as much as making the carts in a single Walmart less squeaky for 20 minutes a year. Or perhaps not at all.
The Shrimp Welfare Project wants shrimp to suffer so they can have a new problem to solve.
While they invest millions into speculative welfare gains for shrimp, the same effort and resources could fund malaria nets to save children’s lives, deworming programs to save children’s lives, vitamin A supplementation to prevent blindness, disaster relief, tuberculosis and HIV treatment, mental health treatments, maternal health services, lead paint removal, school feeding programs, safe water and sanitation projects and so many more proven efforts that actually reduce suffering.
Discuss
High-level actions don’t screen off intent
One might think “actions screen off intent”: if Alice donates $1k to bed nets, it doesn’t matter if she does it because she cares about people or because she wants to show off to her friends or whyever; the bed nets are provided either way.
I think this is in the main not true (although it can point people toward a helpful kind of “get over yourself and take an interest in the outside world,” and although it is more plausible in the case of donations-from-a-distance than in most cases).
Human actions have micro-details that we are not conscious enough to consciously notice or choose, and that are filled in by our low-level processes: if I apologize to someone because I’m sorry and hope they’re okay, vs because I’d like them to stop going on about their annoying unfair complaints, many small aspects of my wording and facial expression will be likely different, in ways that’re hard for me to track. I may think of both actions as “I apologized politely,” while my intent nevertheless causes predictable differences in impact.
Even in the donations-from-a-distance case, there is some of this: the organization Alice donates to may try to discern Alice’s motives, and may tailor its future actions to try to appeal to Alice and others like her, in ways that have predictably different effects depending on eg whether Alice mostly wants to know/care/help or mostly wants to reinforce her current beliefs.
(This is a simple point, but I often wish to reference it, so I’m writing it up.)
Discuss
Building Conscious* AI: An Illusionist Case
In this post I want to lay down some ideas on a controversial philosophical position about consciousness: illusionism, and how it might impact the way we think about consciousness in AI. Illusionism, in a nutshell, proposes that phenomenal consciousness does not exist, although it seems to exist. My aim is to unpack that definition and give it just enough credence to make it worth exploring its consequences for AI consciousness, morality and alignment.
Illusionism suggests that there is a different mechanism: consciousness* (aka the cognitive processes which trick us into thinking we have phenomenal consciousness, introduced later in the post) which is less morally significant but more cognitively consequential. This reframing leads to different conclusions about how to proceed with AI consciousness.
The illusionist approach is different from—but not in contradiction with—the kind of view exemplified by Jonathan Birch's recent "Centrist Manifesto". Birch emphasises the dual challenge of over-attribution and under-attribution of consciousness in AI, and outlines some of the challenges for AI consciousness research. In accordance with other recent work, he advocates for a careful, cautious approach.
By critiquing Birch's framework through an illusionist lens, I will end up arguing that we should seriously consider building consciousness* into AI. I'll outline reasons for expecting links with AI alignment, and how efforts to suppress consciousness-like behaviours could backfire. The illusionist perspective suggests we might be committing a big blunder: trying to avoid anything that looks like consciousness in AI, when it actually matters far less than we think morally, but is far more consequential than we think cognitively.
The case for illusionismWhat is phenomenal consciousness?
The classic definition of phenomenal consciousness by Nagel is that a system is conscious if there is “something it is like” to be that system. If this seems vague to you (it does to me) then you might prefer defining consciousness through examples: seeing the colour red, feeling pain in one’s foot, and tasting chocolate are states associated with conscious experiences. The growth of nails and regulation of hormones are not (see Schwitzgebel's precise definition by example).
The hard problem and the meta-problem of consciousness.
In his seminal paper “Facing up to the problem of consciousness”, David Chalmers proposes a distinction between what he calls the easy and hard problems of consciousness. The easy problems are about functional properties of the human brain like “the ability to discriminate, categorize, and react to environmental stimuli”. While these problems might not actually be easy to solve, it is easy to believe they are solvable.
But when we do manage to solve all the easy problems (in Chalmers’ words): “there may still remain a further unanswered question: Why is the performance of these functions accompanied by experience?”. That’s the hard problem of consciousness: understanding why, on top of whatever functionality they have, some cognitive states have phenomenal properties.
23 years later, David Chalmers published “The Meta-Problem of Consciousness”. The first lines read: “The meta-problem of consciousness is (to a first approximation) the problem of explaining why we think that there is a [hard] problem of consciousness.” So instead of “why are we conscious”, the question is “why do we think we are conscious”. Technically, this is part of the easy problems. But as Chalmers notes, solving the hard problem probably requires understanding why we even think we have consciousness in the first place (it would be weird if it was a coincidence!). Thankfully, the meta-problem is more tractable scientifically than the hard one.
So suppose we solved the meta-problem of consciousness? The hard problem says we still have to explain consciousness itself—or do we? This is where illusionism comes in.
Illusionism to the rescue
Illusionism basically says this: once we have successfully explained all our reports about consciousness, there will be nothing left to explain. Phenomenal experiences are nothing more than illusions. For illusionists, the meta-problem is not just a stepping stone, it's the whole journey.
Cover of Illusionism as a theory of consciousness by Keith FrankishAs a guiding intuition, consider the case of white light, which was regarded as an intrinsic property of nature until Newton discovered that it is in fact composed of seven distinct colours. White light is an illusion in the sense that it does not possess an intrinsic property “whiteness” (even though it seems to). Suppose we manage to explain, with a high degree of precision, exactly how and when we perceive white, and why we perceive it the way we do. We do not subsequently need to formulate a “hard problem of whiteness” asking why, on top of this, whiteness arises. Illusionists claim that consciousness is an illusion in the same sense that whiteness is.[1]
So illusionists don’t deny that conscious experiences exist in some sense (we’re talking about them right now!). They deny that conscious experiences have a special kind of property: phenomenality (although they really seem to have phenomenality).
The most common objection to illusionism is straightforward: how can consciousness be an illusion when I obviously feel pain? This is an objection endorsed by a lot of serious philosophers (including Chalmers himself). Intuition pumps can only get us so far, we'll now dive into an actual philosophical argument.
Debunking consciousness
One of the main arguments for illusionism follows the template of a so-called “debunking argument”. The idea is that if we can explain the occurrence of our beliefs about X in a way that is independent of X, then our beliefs about X might be true, but that would be a coincidence (i.e. probability zero). Let’s use this template to “debunk” consciousness (following Chalmers):
- There is a correct explanation of our intuitions about consciousness which is independent of consciousness.
- If there is such an explanation, and our intuitions are correct, then their correctness is a coincidence.
I think many atheists want to make a similar kind of argument against the existence of God. Suppose that we can explain our beliefs about God in, say, evolutionary, psychological and historical terms without ever including God as a cause. It would then be a bizarre coincidence if our beliefs about God turned out to be correct. As with the debunking argument against consciousness, the hardest part is actually doing the debunking bit (i.e. claim 1). The good news is that philosophers can outsource this: it is a scientifically tractable problem.
Introducing consciousness* (with an asterisk)If consciousness doesn't exist, what cognitive mechanisms generate our persistent reports and intuitions about it? Being an illusionist involves denying that phenomenal consciousness exists, but not that it seems to exist, and not that something is causing us to have all these intuitions. The fact that the illusion is so strong is precisely what the theory seeks to explain. There must be some cognitive mechanism which causes us to mischaracterise some cognitive states as possessing phenomenal properties.
So let's define consciousness* (with an asterisk) as "the cognitive mechanism leading us to systematically mischaracterise some states as phenomenal."[2] This sort of deflated (diet) version definitely exists. This distinction changes how we should think about AI consciousness:
- Traditional/realist view: "Is this AI phenomenally conscious?"
- Illusionist view: "Does this AI have the cognitive architecture that produces reports and intuitions about consciousness?"
The first question assumes something (phenomenal consciousness) that illusionists think is conceptually confused. The second question is scientifically tractable: we can study consciousness* in humans and look for similar mechanisms in AI.
So maybe we can just replace full-fat "consciousness" with diet "consciousness*" in all our ethical theories, dust off our hands, and call it a day. Problem solved, ethics intact, everyone goes home happy.
If only it were that simple. As we'll see, this substitution raises issues about what properties should ground moral consideration of minds—human and artificial alike.
For the remainder of the post I'll focus on consciousness* (the real but diet version, with the asterisk), and occasionally refer to full-fat consciousness (no asterisk). Whenever you see an asterisk, just think "the cognitive processes which trick me into thinking I have phenomenal consciousness".
The consequences of illusionism on ethicsDo illusionists feel pain?
There are three ways one can understand “pain” and illusionism has different takes on each of them (see Kammerer):
- Functional pain: illusionism does not deny this exists.
- Phenomenal pain: illusionism denies this exists (but not that it seems to exist).
- Normative pain (i.e. inflicting pain is bad/unethical): illusionism does not deny this exists.
So illusionists can still say hurting people is wrong. But the question remains: why would inflicting pain be bad if there's no phenomenal experience? What about our new notion of consciousness* which does exist. Does that matter?
Questioning moral intuitions about consciousness
Our intuitions about pain's badness come from introspection: phenomenal pain seems to directly reveal its negative value with apparent immediacy. Pain doesn't just seem bad, it seems bad beyond doubt, with more certainty than any other fact. However as François Kammerer argues in his paper "Ethics without sentience", if illusionism is true, then
our introspective grasp of phenomenal consciousness is, to a great extent, illusory: phenomenal consciousness really exists, but it does not exist in the way in which we introspectively grasp and characterize it. This undercuts our reason to believe that certain phenomenal states have a certain value: if introspection of phenomenal states is illusory – if phenomenal states are not as they seem to be – then it means that the conclusions of phenomenal introspection must be treated with great care and a high degree of suspicion
In other words if we take the leap of faith with illusionism about phenomenal states, why stay stubbornly attached to our intuitions about the moral status of these same states?
To be clear, this argument targets intuitions about consciousness, the full-fat no asterisk version. But since consciousness* (with an asterisk) is none other than the set of cognitive processes which generate our (now-suspect) intuitions, this also removes reasons to treat it as a foundation for moral status.
This seems to point to the need to use other properties as foundations for moral consideration. As Kammerer explores, properties like agency, desires, sophisticated preferences, or capacity for deep caring are good candidates. Of course these might happen to be deeply entangled with consciousness*, such that in practice consciousness* might be linked to moral status. But even if this entanglement exists in humans, there is no guarantee it would persist in all artificial systems. We shouldn't exclude the possibility of systems possessing the cognitive build for consciousness* but without e.g. strong agency, or vice-versa.
Conscious AI: an illusionist critique of the Centrist ManifestoHaving presented illusionism, I now want to examine how it applies to current approaches in AI consciousness research. Beyond laying groundwork for my later argument about building consciousness* into AI, this also showcases the de-confusing powers of illusionism, and makes the issue more tractable overall.
In his recent paper AI Consciousness: A Centrist Manifesto, Jonathan Birch outlines two major challenges facing us in the near future: roughly, over-attribution and under-attribution of consciousness in AI. The paper does a great job of outlining the issues whilst remaining parsimonious in its assumptions[3]. However examining Birch's manifesto from an illusionist lens points to methodological blind spots and suggests a more promising path forward.
The gaming problem and the Janus problem
The first challenge Birch describes is that many people will misattribute human-like consciousness to AI. This is not a new phenomenon and is reminiscent of the ELIZA effect. Things get messy when labs become incentivised either to take advantage of Seemingly Conscious AI (SCAI) or to suppress it. I'll have more to say about this in the final section.
Jonathan Birch's second challenge cuts to the heart of the AI consciousness problem: we might create genuinely conscious AI before we have reliable ways to recognise it, and before we fully understand the moral implications. This is a serious problem. In the worst case, we could create billions of suffering agents. Addressing this challenge means understanding AI consciousness and how it relates to moral status. Birch goes on to identify two fundamental problems that make this solution difficult: the gaming problem and the Janus problem.
The gaming problem arises from the fact that frontier models are trained on massive datasets containing humans talking about their minds and experiences, and also post-trained to produce various responses (e.g. ChatGPT when it is asked if it is conscious). Whatever models say about their own subjective experience cannot be trusted.
Asking ChatGPT if it is conscious1st/7 paragraph of Claude's answer: that no one knows.The Janus problem is that whatever theory-driven indicator you find in AI, there will always be two ways to update: "AI is conscious" or "the theory is wrong". The same evidence points in opposite directions depending on your prior beliefs.
Birch argues these obstacles aren't permanent roadblocks—they can be overcome through systematic comparative research across species, theoretical refinements, and better AI interpretability tools.
Birch's research program for deciding whether to attribute consciousness to AI.The illusionist response
While it's true that behavioural evidence becomes unreliable when dealing with AI systems, this doesn't mean we can't do empirical work. We can design theory-informed experiments that test real capabilities rather than surface-level mimicry. Illusionists view consciousness* as deeply integrated into cognition, suggesting many avenues for meaningful measurement.
For instance, we might measure metacognitive abilities by having models predict their own performance or confidence across different domains. We can investigate self-modelling and situational awareness through real benchmarks. We could examine top-down attentional control (see next section on AST), and whether models can selectively shift their focus in ways you wouldn't expect from a pure language model. We have to be smart about how we design our experiments to avoid the gaming problem, but very similar concerns exist in other areas of AI research (e.g. alignment). The gaming problem is real, but far from intractable.
The Janus problem is also real in some sense: we can always draw inferences in both directions when we find theory-driven indicators in AI. There is nothing fundamentally wrong with discrediting a theory of consciousness* by showing that it leads to absurd results on AI models. Inferences go both ways in all parts of science.
In the paper, Birch sketches out how we might look for architectural indicators in LLMs from a leading theory of consciousness Global Workspace Theory (GWT). GWT proposes that consciousness arises when many specialised processors compete for access to a central "workspace" that then broadcasts information back to all input systems and downstream modules. As Birch shows, the transformer architecture does not contain a global workspace, although a similar architecture (the Perceiver variant) does. We run into issues when it turns out even a tiny Perceiver network technically has a global workspace, despite not displaying any kind of coherent behaviour. This issue arises because the approach was doomed from the get-go. From the illusionist perspective it suffers from two fundamental flaws:
- First, looking for a global workspace solely in the architecture is the wrong place to look: if the architecture was all that matters, then GWT wouldn't distinguish between a trained model and one initialised with random weights! It's a bit like opening up a piano looking for Beethoven's 9th Symphony. Instead of "does this architecture contain a global workspace?" we should ask something like "do models develop global workspace-like dynamics?".
Second, GWT is not the right kind of theory. While a robust and convincing account of likely necessary processes for consciousness*, GWT does not explain our intuitions and reports. Michael Graziano terms this the Arrow B problem (see figure below).
Arrow A is explaining how computational processes produces conscious* states, Arrow B is explaining how those states lead to intuitions and reports. GWT only tackles Arrow A. Figure from Illusionism Big and Small.
Somewhat contra Birch, I actually think that picking the right kind of theory, and asking the right kinds of questions collapses the Janus problem into standard empirical disagreements. This kind of thing happens in physics all the time: theories are rejected precisely because of the predictions they make. When quantum mechanics predicted wave-particle duality, many rejected it because particles can't be waves. The solution wasn't to declare the question unanswerable, but to develop better theories AND experiments that could distinguish between competing interpretations.
So what conditions should the right kind of theory satisfy? It must be a mechanistic theory that makes measurable predictions, AND it must explain how those mechanisms lead to our intuitions and reports about consciousness (the Arrow B problem).
Having critiqued an existing approach to AI consciousness, what does an illusionist-native alternative look like? Illusionism changes the priors, the goals, the methods, and the moral framework. It is not a complete departure from Birch's approach, but a focused recalibration.
Why we should build conscious* AIFinally we get to the core point of this post: that we should seriously consider building consciousness* into AI. [4]
Illusionism suggests consciousness* is less morally important (making it more acceptable to build) and more cognitively important (making it more useful to build). One response to this is that we are profoundly uncertain, therefore we should take the cautious approach: refrain from building it. This conservative approach is a reasonable default setting, but it does not come without its perils. Suppose we take the cautious approach, I will argue this could lead to:
- missing out on opportunities that come with building consciousness* in AI. I'll argue from an illusionist perspective that there are first-principles reasons to expect links to alignment.
- suffering from bad consequences from actors purposefully suppressing consciousness* or seemingness of consciousness* in AI. There is an illusionist case that this could backfire.
To dive into this it helps to introduce one of the leading illusionist-compatible theories of consciousness*.
The Attention Schema Theory: consciousness* as a model of attention
Here is part of the abstract of a 2015 paper, where Michael Graziano introduces the Attention Schema Theory (AST) better than I ever could:
The theory begins with attention, the process by which signals compete for the brain’s limited computing resources. This internal signal competition is partly under a bottom–up influence and partly under top–down control. We propose that the top–down control of attention is improved when the brain has access to a simplified model of attention itself. The brain therefore constructs a schematic model of the process of attention, the ‘attention schema,’ in much the same way that it constructs a schematic model of the body, the ‘body schema.’ The content of this internal model leads a brain to conclude that it has a subjective experience.
(The terms "subjective experience" and "consciousness" are used interchangeably)
In a nutshell, AST equates consciousness* with a model of attention. The crux is that this model is deeply imperfect just like our body schema (which e.g. doesn't represent blood vessels). Graziano would say it's a "quick and dirty" model, which evolved through natural selection to do its job, not to be accurate.
Going from representing an apple, to representing subjective awareness of an apple.Say Graziano and his team gather enough evidence and build a rock-solid theory that explains why we have these deep intuitions about consciousness and why we report having subjective experiences. The illusionist position is simple: that's it. We're done. Any feeling that there must be something more is exactly what the theory predicts we would intuit[5].
A first-principles argument for why consciousness* could matter for AI alignment
If AST has any validity, then this cognitive machinery is arguably relevant to challenges in AI alignment. Moreover we might be overlooking them precisely because they involve consciousness. Here's one compelling reason why understanding consciousness* could be vital for alignment:
In his book Consciousness and the Social Brain, Graziano explores how the attention schema evolved not just for self-monitoring, but largely as a social tool. The same neural machinery that lets us model our own attention also lets us model other minds. Watching someone focus intently on something, you use your attention schema to model what they are attending to and predict their next move. Ultimately this provides the necessary tools for navigating complex social coordination problems.
The idea that we don't accurately represent our cognitive states, but rather misrepresent them in useful ways, is basically what illusionism is about. There is little reason to expect evolution to enforce that our reports be correct. Here's an intuition pump, suppose I'm with some friends and I spot a deadly snake. One thing which is not useful to communicate is the sequence of intricate electro-chemical reactions in my brain which lead me to run away. A more helpful broadcast would be to convey a useful fiction about my current cognitive state (e.g. enter the “fear” state, gasp, scream, etc). My representation is a rough but evolutionarily useful shortcut.
The implications for AI are notable: alignment is a bit like a social coordination problem. If we want to cooperate with advanced AI, we might benefit from it having something functionally similar to an attention schema. This would provide AIs with a superior model a) of what humans and other agents are attending to, making it less likely to mess up, and b) of what the AI itself is attending to, leading to better self-modelling/self-control and hopefully a boosted capacity to report its own cognitive states.
Perhaps having AIs develop useful models of others, and of themselves, can help rule out failure modes. The same way that LLMs having good world models makes some Nick Bostrom-style apocalypse scenarios implausible (relative to AlphaGo-type pure RL systems).
Suppressing consciousness: a model for "cautious approach" failure modes
Whether or not consciousness* turns out to be alignment-relevant, AI labs might face strong incentives to suppress consciousness-like behaviour in their models. As public concern about AI consciousness grows—driven by the ELIZA effect and justified moral uncertainty—companies will be pressured to "suppress" Seemingly Conscious AI, either by making it seem less conscious, or by somehow making it less conscious*. While this pressure seems reasonable and critics (like Zvi Mowshowitz in a recent post) rightly call out disingenuous industry arguments, I'll argue the approach could backfire.
The suppression would come from labs applying optimisation pressure (intentional or not, RL or otherwise) that steer models away from making statements that sound conscious or introspective. This risks creating a more general failure mode: the AI learns to broadly avoid communicating its knowledge about its internal states. Despite retaining the relevant self-modelling capabilities (which are essential to performance), subtle optimisation pressures push models to hide them. This undermines AI alignment research methods which rely on AIs being transparent about their internal states. Methods which may be essential for detecting early signs of deceptive or misaligned behaviour. What seems like an innocent PR fix might turn into a big cognitive alteration.
This is just one illustration of how there could be a tension between consciousness* and alignment issues, there are others. There might also be cases where labs, wanting to be ethically cautious, accommodate AI desires in ways that similarly reinforce bad behaviours.
The broad point is this: the traditional/realist position is to be cautious about consciousness, treat it as a moral hazard, and do our best to avoid it. The illusionist position, on the other hand, treats consciousness* as less morally and more cognitively significant: it suggests we should be far more comfortable building consciousness* into AI, far more curious about the potential doors it opens in AI research, and far more scared about the downstream consequences of tampering with it.
What success looks like
Going back to Birch's research program, here is an illusionist alternative. The illusionist research program looks a lot like a very boring scientific research agenda without anything special: it involves developing theories that meet the two illusionist criteria (mechanistic and explains intuitions and reports about consciousness), using these theories to inform empirical work on humans, animals and AIs, and updating our theories, and repeating over and over. We have priors about the distribution of consciousness* in the world. That's fine. We can debate and update them as empirical evidence comes in.
In parallel, advances in mechanistic interpretability offer new ways to test theory-driven indicators in models. Work on representation engineering, steering vectors, and sparse autoencoders provides promising avenues for detecting the computational structures that theories like AST predict should underlie consciousness*.
What you end up with is a back-and-forth between theory and experiment which comes with a lot of idiosyncratic methodological considerations. For example: be wary of AIs mimicking humans, sometimes AIs having it means the theory is wrong, beware of our intuitions, etc etc.
What success looks like: a proposed illusionism-native research program.Closing remarksCaveating the illusionist approach
There are arguments against building consciousness* into AI. These are valid concerns and important to state:
- Uncertainty runs deep: Illusionism could be wrong. Our arguments about consciousness*' moral irrelevance could be wrong. We need to proceed carefully.
- Entanglement problems: Even if consciousness* isn't directly morally relevant, the actual moral markers whatever they may be—agency, preferences, desires—might be deeply intertwined with consciousness*.
- Indirect human welfare concerns: Making AIs seem conscious might cause psychological harm to humans who form attachments to them (or any other theory of harm which doesn't assume AI suffering).
I personally totally endorse an approach that proceeds with caution and recognises uncertainty. I also happen to think that opinionated takes are an important part of advancing knowledge. In a few paragraphs I’ve claimed that 1) phenomenal consciousness doesn’t exist 2) consciousness doesn’t matter for morals and 3) we should actively build conscious* AI. Super controversial. I’m extremely keen to get any kind of feedback.
AppendixA very tempting (but flawed) debunking argument about intuitions on the moral status of consciousness*
Following Daniel Dennett's advice in his autobiography, I'm sharing a tempting but ultimately flawed argument I came up with which aims to debunk our moral intuitions about consciousness. Thanks to François Kammerer for helping point out the flaw. The argument is:
- There is a correct explanation for our intuitions about the moral status of conscious* states, which is independent of consciousness(*).
- If there is such an explanation, and our intuitions are correct, then their correctness is a coincidence.
- The correctness of intuitions about consciousness* is not a coincidence.
- Our intuitions about the moral status of consciousness* are incorrect.
The argument is tempting, but when you think hard about whether or not to include the asterisk in brackets, it falls apart. Roughly:
- If you write it with an asterisk, then the claim becomes implausible: it is actually quite likely that our intuitions depend on consciousness*.
- If you write it without an asterisk, the argument doesn't add anything to the illusionist story (even if it turns out correct).
- ^
Another useful analogy: until the early 20th century, vitalists maintained that there was something irreducibly special (they called it "élan vital") that distinguished living from dead, and which could not be reduced to mere chemistry and physics. That was until it was successfully explained by (bio)chemistry and physics. It turned out there was no explanatory gap after all.
- ^
This is totally inspired by the concept of quasi-phenomenality introduced by Keith Frankish here.
- ^
It seems common in AI consciousness research (e.g. this paper) to refrain from committing to any one theory, and argue we should proceed with uncertainty. I totally agree with this, but I also think opinionated takes help advance knowledge.
- ^
The arguments here very much come from my own interpretation of illusionism. I'm skipping over some assumptions (e.g. materialism). There are also many disagreements between illusionists.
- ^
Graziano goes into more detail on how AST is illusionist-compatible in his article: Illusionism Big and Small.
Discuss
Lessons from Studying Two-Hop Latent Reasoning
Many of the risks posed by highly capable LLM agents — from susceptibility to hijacking to reward hacking and deceptive alignment — stem from their opacity. If we could reliably monitor the reasoning processes underlying AI decisions, many of those risks would become far more tractable. Compared to other approaches in AI, LLMs offer a unique advantage: they can ``think out loud'' using chain-of-thought (CoT) enabling oversight of their decision-making processes. Yet the reliability of such monitoring hinges on an empirical question: do models need to externalize their reasoning in human language, or can they achieve the same performance through opaque internal computation?
In our new paper, we investigate LLM latent reasoning capabilities using two-hop question answering as a case study. We fine-tune LLMs (including Llama 3 8B and GPT-4o) on synthetic facts and test two-hop reasoning over these facts. By using synthetic facts, we rule out memorization and reasoning shortcuts as explanations for two-hop performance. We observe a nuanced picture:
- Finding 1: Models completely fail to compose synthetic facts they learned through fine-tuning without explicit chain-of-thought reasoning, achieving only chance-level accuracy despite perfect recall of the individual facts.
- Finding 2: Interventions to (i) force a correct fact storage order across transformer layers and (ii) encourage the first reasoning hop both fail to enable models to compose newly learned facts without chain-of-thought.
- Finding 3: Models successfully compose newly learned synthetic facts without chain-of-thought when those facts co-occur in the same fine-tuning document or in the same test-time prompt.
- Finding 4: LLMs are capable of composing two separately learned facts, as long as one of the facts is naturally acquired during pretraining (the second fact can be synthetic and acquired through fine-tuning).
These findings offer a humbling lesson: researchers observing models answering two-hop questions about real-world facts might incorrectly infer robust latent reasoning capabilities, not realizing that performance relies on facts appearing together in pretraining data. Conversely, our synthetic fine-tuning results alone might suggest models fundamentally cannot perform latent composition—a kind of limitation that could make oversight via reasoning traces highly reliable. Yet this too would be misleading given models' success in other settings. This sensitivity to experimental conditions serves as a cautionary tale for the broader project of understanding LLM reasoning: apparent limitations may be artifacts of experimental setups far removed from how frontier models are trained, while apparent successes may rely on shortcuts invisible to experimenters.
Moving forward, we believe that researchers hoping to gain confidence in LLM agent oversight based on CoT monitoring should move beyond multihop question answering as a toy model of reasoning. What ultimately matters for monitoring capable LLM agents is whether oversight mechanisms can catch misbehavior in practice. Determining monitorability might thus be best achieved via end-to-end evaluations, where LLM agents attempt to complete agentic tasks while their CoT is monitored by another model. Such evaluations could reveal whether models are capable of latent reasoning that would allow carrying out harmful tasks without externalizing harmful intentions in a readable way.
Discuss
My talk on AI risks at the National Conservatism conference last week
Lately I’ve been trying to raise awareness of AI risks among American conservatives. Stopping the reckless development of advanced AI agents (including Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI)) should be a human issue, not a partisan issue. Yet most people working in AI safety advocacy lean to the Left politically, and we seem to be ignoring many potential allies on the Right.
This neglect of conservative allies seems suboptimal, given that the Republican Party currently controls all three branches of the U.S. government (including the Presidency, the Supreme Court, and majorities in the House and the Senate). Granted, the pro-AI lobbyists, Big Tech accelerationists, and tech VCs like Andreessen Horowitz have some influence with the current White House (e.g. on AI Czar David Sacks), but many conservative political leaders in the executive and legislative branches have expressed serious public concerns about AI risks.
In the interests of coalition-building, I attended a political conference last week in Washington DC called the National Conservatism conference (NatCon 5). There I gave a talk titled ‘Artificial Superintelligence would ruin everything’, in a session on ‘AI and the American Soul’.
NatCon has been a big deal as the intellectual vanguard of American conservatism since the first conference in 2019. Vice President JD Vance rose to prominence partly because he gave inspiring speeches at previous NatCons, even before he was elected a Senator in 2022. Other speakers at NatCon 5 last week included Tulsi Gabbard (Director of National Intelligence), Tom Homan (Border Czar), Sebastian Gorka (NSC Senior Director for Counterterrorism), Jay Bhattacharya (Director of the NIH), Russell Vought (Director of the OMB), Harmeet Dhillon (Assistant Attorney General), Ambassador Jamieson Greer (U.S. Trade Representative), and U.S. Senators Jim Banks, Josh Hawley, and Eric Schmitt. There were about 1,200 registered attendees at NatCon 5, including many staffers working in the current administration, plus leading figures from conservative universities, news channels, magazines, and podcasts. The YouTube playlist of NatCon 5 talk videos released so far is here.
NatCon was also an opportunity to reach out to religious leaders. The conference included benedictions by Catholic priests, Protestant ministers, and Jewish rabbis, and the Judaeo-Christian tradition was a prominent theme. Most of the attendees would have identified as Christian. The main conference organizer, Yoram Hazony, is a Jewish political theorist who lives in Israel; his books on nationalism and conservatism would be excellent introductions for those genuinely interested in understanding modern national conservatism. (In a future post, I may explore some strategies for raising AI safety awareness among the 2.5 billion Christians in the world.)
Why this post? Well, a few days ago, Remmelt was kind enough to post some excerpts (here on EA Forum and here on LessWrong) from an article by The Verge reporting on the controversies about AI at the conference, including several excerpts from my talk and my questions to other speakers. This sparked little interest on EA Forum so far, but a fair amount of controversy on LessWrong, including some debate between Oliver Habryka and me. The AI theme at NatCon was also covered in this article in the Financial Times. Other NatCon speakers who addressed various AI risks included Senator Josh Hawley on AI unemployment (talk here), Mike Benz on AI censorship (talk here), and Rachel Brovard on AI transhumanism (talk here), plus the other speakers in our AI panel: Wynton Hall, Spencer Klavan, and Jeffrey Tucker (videos not yet released).
The video of my NatCon talk on AI risk hasn’t been released yet. But I thought it might be useful to include the full text of my talk below, so people on this forum can see how I tried to reach out to conservative ‘thought leaders’ given the beliefs, values, and civilizational goals that they already have. I tried to meet them where they are. In this regard, I think my talk was pretty successful – I got a lot of good questions in the Q&A afterwards, a lot of follow-up from conservative media. For example, here’s a short interview I did with Joe Allen from Bannon’s War Room -- who also recently interviewed Nate Soares about AI X-risk. I also had many good discussions about AI over the 3 days of the conference, which were very helpful to me in understanding which kinds of AI safety arguments do or do not work with conservatives, nationalists, and Christians.
I’d welcome any (reasonable) comments, reactions, and suggestions about how AI safety advocates can reach out more effectively to conservatives – especially to those who are currently in power.
-------------------------------------------
[talk itself is below]
Artificial Superintelligence would ruin everything
It’s an honor and a thrill to be here, at my first NatCon, among people who respect our ancestors, cherish our descendants, and love our nation.
In my day job, I’m a psychology professor who teaches courses on relationships, emotions, and altruism. But I’m also a conservative, a patriot, and a parent. There are still four or five of us around in academia.
I’ve been following AI closely for 35 years, ever since I worked on neural networks and autonomous robots as a grad student at Stanford and then a post-doc at University of Sussex. In the last ten years, I’ve talked a lot about the dangers of AI in articles, podcasts, and social media.
The AI industry’s explicit goal is go far beyond current LLMs like ChatGPT, to develop Artificial General Intelligence, or AGI, that can do any cognitive or behavioral task that smart humans can do, and then to develop Artificial Superintelligence, or ASI, which would be vastly smarter than all the humans that have ever lived.
These are not distant goals – many AI leaders expect AGI within 10 years, then ASI shortly after that. I’ve got two toddlers, and we could face dangerous ASIs by the time they graduate high school.
In this talk, I aim to persuade you that ASI is a false god, and if we build it, it would ruin everything we know and love. Specifically, it would ruin five things that national conservatives care about: survival, education, work, marriage, and religion.
First, ASI would ruin the survival of our species
The most severe risk is that ASIs that we can’t understand, predict, or control end up exterminating all of humanity. This is called ASI extinction risk, and it’s been a major topic of research for 20 years.
Remember, we’re not coding AIs like traditional software that can be analyzed and debugged. The best current AI systems are black box neural networks trained on over 40 trillion language tokens, yielding over a trillion connection weights, like synapses in a brain. Reading all of these connection weights aloud would take a human about 130,000 years. We have no idea how these LLMs really work, and no idea how to make them aligned with human interests.
So what? Well, about a quarter of the general public already think that ASI could cause human extinction within this century. Hundreds of leading AI researchers agree, including Yoshua Bengio and Geoff Hinton – the two most-cited living scientists.
And every single CEO of a major AI company has warned that developing ASIs would impose serious extinction risks on humanity. This includes Sam Altman of OpenAI, Dario Amodei of Anthropic, Demis Hassabis of Deepmind, and Elon Musk of xAI.
Generally, the more you know about AI, the higher your p(doom), or estimated probability that ASI would doom humanity to imminent extinction. Among most AI safety experts, who have studied these issues for years, p(doom) is at least 20%. For many, including me, it’s well over 50%.
Long story short, almost everyone in AI understands that developing ASI would be playing Russian roulette with our entire species. In the six chambers of this existential revolver, we’re just arguing about whether there would be one round, or five rounds.
Problem is, when I talk about ASI extinction risk, many people seem weirdly unconcerned. They think the risk is too speculative or too distant.
So, in this talk, I’m going to focus on what happens even if we avoid ASI extinction – if we survive to enjoy the AI industry’s best-case scenario.
ASI would ruin education.
Actually, AI is already ruining higher education. Millions of college students are using AI to cheat, every day, in every class. Most college professors are in a blind panic about this, and we have no idea how to preserve academic integrity in our classes, or how our students will ever learn anything, or whether universities have any future.
We can’t run online quizzes or exams, because students will use AI to answer them. We can’t assign term papers, because LLMs can already write better than almost any student. So, in my classes, I’ve had to ‘go medieval’, using only in-person paper-and-pencil tests.
The main result of AI in higher education so far, is that students use AI to avoid having to learn any knowledge or skills.
So, what is the AI industry’s plan for education? They seem inspired by Neal Stephenson’s 1995 novel ‘The Diamond Age’, in which a nanotech engineer invents a superintelligent interactive book that serves as a customized tutor for his bright and curious daughter. ASIs could make great personal tutors for kids. They would know everything about everything, and be able to explain it with the best possible combination of words, videos, and games.
The question is, what values would the AI companies train into these ASI tutors, so that they can shape the next generation of AI users? Will they nudge the kids towards national conservatism, family values, and the Christian faith? Or will they teach Bay Area leftist, secular, globalist, transhumanist values?
You know the answer. ASI tutors would give the AI industry total educational and ideological control over future generations.
ASI would ruin work.
National conservatives believe in borders. We oppose immigration by millions of people who want our jobs or our welfare, but who do not share our traditions or values.
Many conservatives in the current Trump administration, quite rightly, want stronger geographical borders against alien people. But they seem oblivious about protecting our digital borders against invasion by alien intelligences. Indeed, they seem giddy with delight about AI companies growing ASIs inside our data centers – without understanding that a few ASIs can easily become hundreds of ASIs, then millions, then billions. If you worry about immigrants out-breeding our native populations, wait until you see how quickly ASIs can self-replicate.
These ASIs won’t be American in any meaningful sense. They won’t be human. They won’t assimilate. They won’t have marriages or families. They won’t be Christian or Jewish. They won’t be national conservatives. But they will take our jobs.
Economists, bless their hearts, will often say ‘AI, like every technology before it, may eliminate some traditional jobs, but it will produce such prosperity that many new jobs will be created.’ This copium reveals a total misunderstanding of AI.
Remember, Artificial General Intelligence is defined as an AI that can do any cognitive or behavioral task at least as well as a smart human can, to an economically competitive level. This includes being able to learn how to control a human-shaped body to do any physical labor that a human can learn to do. Even stronger ASI, plus anthropoid robots, could replace any human worker doing any existing job – from bricklaying to brain surgery, from running hedge funds to doing further AI research.
OK, so we’d lose all existing jobs to ASI. But won’t the AI-fueled economic growth create billions of new jobs? Yes, it will – for other ASIs, which will be able to learn any new job faster than any human can learn it. We can’t re-train to do new jobs faster than the ASIs can train themselves.
ASI would, within a few decades, impose permanent unemployment on every human, now and forever. Our kids won’t have jobs, and won’t be able to pay for food, housing, or medical care. And every CEO of every AI company knows this -- which is why they all say that the only long-term solution to AI-induced unemployment is a massively expanded welfare state. Elon Musk calls this Universal Generous Income; others have called it the Fully Automated Luxury Communist Utopia.
This is their plan: ASI will automate all human labor, no human workers will earn any income, the AI companies will earn all income, then pay most of their revenue to the government, and the government will distribute generous welfare payments to everyone.
The AI executives promise that they will happily pay enough taxes to support this universal welfare state. Maybe they’ll pay the $20 trillion a year that it would cost. But for how long? For generations? Forever?
After ASI, the dignity of human work would die. Husbands would no longer be breadwinners. Every mother would become a welfare queen. Every child would, economically, become a ward of the state. The family as an economic unit would end. The bonds of mutual loyalty that sustain our economic interdependence would become irrelevant.
ASI would ruin marriage
Look, I’m not against all AI applied to human relationships. I’m chief science advisor to a start-up company called Keeper, which has developed a really good AI matchmaking app to help men and women find compatible partners – ones who actually want marriage and children. Whereas Tinder and Hinge offer short-term casual hookups, Keeper is all about traditional family values. But we’re using narrow, domain-specific AI to do the matchmaking, with no ambitions to develop ASI.
By contrast, the big AI companies wants its users to develop their most significant, intimate relationships with their AIs, not with other humans. This is clear in the recent push to develop AI girlfriends and boyfriends, to make them ever more attractive, charming, interactive, and addictive.
The AI transhumanists are eager for a future in which everyone has their own customized AI companions that anticipate all their desires.
This would be logical outcome of combining chatbots, sexbots, deepfakes, goon caves, AR, VR, and reproductive tech. To misquote Xi Jinping, it would be Romanticism, with Bay Area Characteristics, for a New Era. Actual reproduction and parenting would be outsourced to artificial wombs and child care robots.
Granted, an ASI partner would be tempting in many ways. It would know everything about everyone and everything. It would chat insightfully about all our favorite books, movies, and games. It would show empathy, curiosity, arousal, and every human emotion that can simulated. And, no need for monogamy with AIs. If you can afford it, why not lease a whole AI harem?
But, after enough young people get a taste for an ASI boyfriend or girlfriend, no mere human would seem worthy of romantic attraction, much less a marriage, or a family.
ASI would ruin religion
Our new Pope, Leo XIV, said earlier this year that AI poses ‘new challenges for the defense of human dignity, justice, and labor’, and he views AI as the defining challenge of our world today.
But most American AI developers are liberal Bay Area atheists. They may have had ‘spiritual experiences’ at Burning Man, enjoying LSD, EDM, and tantric sex. But they view traditional religion with bemused contempt. The have a god-shaped hole in their souls, which they fill with a techno-utopian faith in the coming ASI Singularity. In place of the Judeo-Christian tradition, they’ve created in a trendy millenarian cult that expects ASIs will fill all their material, social, and spiritual needs.
This is the common denominator among millions of tech bros, AI devs, VCs, Rationalists, and effective accelerationists. ASI, to them, will be the new prophet, savior, and god. Indeed, they speak of summoning the ‘sand-god’: sand makes silicon chips, silicon chips enable superintelligence, and superintelligence means omniscience, omnipotence, and omnipresence. But godlike ASIs won’t offer real love, mercy, holiness, or salvation.
Summoning the ASI sand-god would be the ultimate hubris. It won’t have true divinity, and it won’t save any souls. But it may say unto us, ‘Thou shalt have no other gods before me’.
So, what to do about ASI?
My humble suggestion is that we shouldn’t let ASIs ruin our education, work, marriage, religion, nation, civilization, and species. We should shut down all ASI development, globally, with extreme prejudice, right now.
To do that, we need strict AI regulations and treaties, and the will to enforce them aggressively. But we also need the moral courage to label ASI as Evil. Not just risky, not just suicidal, but world-historically evil. We need a global campaign to stigmatize and ostracize anyone trying to build ASI. We need to treat ASI developers as betrayers of our species, traitors to our nation, apostates to our faith, and threats to our kids.
Many people will say: but if we don’t develop ASI, China will, and wouldn’t that be worse?
This is where a little knowledge of game theory is a dangerous thing. We’re not in a geopolitical arms race to build a new tool, or a new weapon. We’re not racing up a mountain to reach global hegemony. Instead, we’re racing off a cliff. An American-made ASI would ruin everything we value in America. Maybe Xi Jinping could be persuaded that a Chinese-made ASI would ruin everything the Han Chinese people love, and would mean that the CCP would give up all power to their new ASI-emperor.
When two players realize they’re racing off the same cliff, they can easily coordinate on the Pareto-optimal equilibrium where they both simply stop racing. There would be no temptation to defect from a global ban on ASI development, once the major players realize that secretly building any ASI would be civilizational suicide.
So, we’re not in a two-player AI arms race with ASI as the payoff – rather, we’re in a game that could add a third player, the ASI, that would also win any future games we try to play. The only winner of an AI arms race would be the ASI itself – not America, not China. Only the ASI, stamping on a human face, forever.
Let’s get real. The AI industry is radically revolutionary and aggressively globalist. It despises all traditions, borders, family values, and civic virtues. It aims to create and worship a new sand-god. It aims to ruin everything national conservatives know and love.
We, in turn, must ruin the AI industry’s influence here in Washington, right now. Their lobbyists are spending hundreds of millions of dollars to seduce this administration into allowing our political enemies to summon the most dangerous demons the world has ever seen.
[end]
[Note: this was cross-posted to EA Forum today here]
Discuss
The Astronaut and the Planet: Part I
A series of posts on self-models and what they teach us about cognition, artificial and natural. I expect that the ideas will seem new and unexpected to some of you, and obvious and natural to others. I hope however that most will find at least something new in them.
An old favorite supposes at least two different modes of processing information our brain can exist in: the apologist - the narrative, conscious self who loves nothing more than to explain away - and the revolutionary - the intuitive, unconscious self who loves nothing more than to clean the slate. I think this picture carries a grain of truth and addresses a fundamental question about ourselves: how do we understand our conscious and unconscious selves? Why do we have conscious perception at all? And given that we do, why are we not conscious of all our mental operations? Why do we feel like an I, and what gets to be a part of our I? When we are in physical pain, is the pain produced by the I or experienced by the I? Why does the I have a sense of agency and free will in a possibly deterministic world?
In my own thinking around these questions, I have found a different metaphor much more useful: the Astronaut and the Planet. The Astronaut is the seat of narrative consciousness, orbiting and observing a vast and complex Planet that is the rest of our mind and body. The astronaut's role is to build the best possible map, or model, of the planet based on the limited observations it can make of the planet and the environment. The apologist is the astronaut in its purest form. The planet is vastly more complex than the astronaut however, and capable of feats the astronaut cannot conceive of, and whatever the astronaut might feel like, it is very far from being in control of the planet.
The astronaut receives a curated stream of information - visual and auditory data, emotional pulses, cognitive thoughts - but these inputs are the end products of massive parallel processing on the planet's surface. Consider what happens when you recognize a friend's face: your conscious astronaut simply experiences "Oh, it's Sarah!" But this recognition emerges from millions of neural computations comparing features, accessing memories, and integrating context, none of which the astronaut directly observes.
This computational disparity is fundamental. The planet operates through vast parallel networks with access to procedural memories, emotional associations, and sensorimotor patterns accumulated over decades. The astronaut, in contrast, works with a narrow sequential stream of conscious attention and limited working memory. Yet despite this asymmetry, the astronaut often feels ownership over the entire planet.
Agency and Free WillWhy does the astronaut claim ownership over the planet's actions? My core claim: a system experiences agency precisely when its predictions align with observed outcomes. This claim naturally goes well with some of the core ideas in the literature on predictive processing.
Consider learning to type. Initially, you consciously think "press 'k'" and observe your finger moving to the correct key. The astronaut notices: internal intention → finger movement. Through repetition, it builds a predictive model: "When I intend to press 'k', my finger will move there." Because this prediction consistently matches reality, the astronaut experiences ownership over typing.
Crucially, the astronaut doesn't need to actually initiate the motor commands. Imagine a baby learning to control its limbs. Some deep brain region fires motor neurons, causing a hand to wave. The astronaut simply observes: certain internal patterns → hand movement. Through repeated observation, it learns to predict hand movements based on these internal signals. The astronaut then experiences agency over the hand, despite being neither the source of the motor commands nor the mechanism moving the limb.
The same logic explains our sense of ownership over thinking itself. We naturally associate thoughts with our narrative self, but as anyone who has practiced introspection knows, thoughts simply appear in consciousness. There's something paradoxical about "choosing" what to think—in making such a choice, haven't we already had the thought? We experience ownership over thinking because our astronaut can often predict the shape of the next thought, even though the actual cognitive work happens in the planet's hidden depths.
From this perspective, voluntary actions are precisely those actions which are strongly controlled by those parts of the planetary processing that the astronaut has most access to (thoughts primarily) while involuntary actions are strongly controlled by parts of the planet that the astronaut has little to no access to.
If agency arises through predicting the outcomes before they happen, free will is the exact opposite. It arises precisely in those moments when our astronaut has modeled the planet particularly poorly in a situation as to be unable to predict what will happen next. Faced with two apparently equally appealing options, the astronaut's simpler model of the planet fails to come to a decision. Note that this description of free will is completely agnostic about whether the underlying world is deterministic or not. In a practical sense, purely because of the computational differences between the astronaut and the planet, the planet is non-deterministic in crucial ways.
Me, Myself and IWhere does our sense of unified selfhood come from? The astronaut faces a modeling problem: it observes a bewildering array of actions, thoughts, and sensations that seem to emanate from "somewhere." The most parsimonious model treats these as outputs of a single, unified entity: the planet. Moreover, because the astronaut can predict many of the planet's actions (especially those involving conscious thoughts), it naturally identifies itself with this unified entity. "I am thinking this thought, I am moving this hand, I am feeling this emotion." The astronaut experiences itself as the planet.
In fact, as many reading this will have observed, our sense of I is always in motion. Our self-hood expands and contracts as determined by the immediate situation - while driving a car or playing a video game, we might identify ourselves with the car or the player character. Similarly, a deep relationship expands our notion of self to include the other person - perhaps this is most strongly felt with our children. Conversely, our notion of selfhood can contract when faced with helplessness - anxiety can produce thoughts in one that they violently reject based on their self-conception.
And along a different axis, how tightly we hold on to our conception of self also varies. Anger amplifies it and we entrench ourselves in our ego, uncertainty can propel us to rebuild our model of ourselves while psychedelics can free up our inhibitions enough to completely reshape our personality.
Why might our notion of self-hood be inhibitory? After all, the astronaut can do nothing but model the planet, and at first sight, a model can only be helpful. However, a model can be actively detrimental if we confuse a map for a goal. Because the astronaut's model of the planet is necessarily simplistic, converting this model into a goal necessarily sheds complexity. The words the astronaut uses ("kind, charitable, irritable...") are all approximations to what is really true, and to the extent that the planet feels constrained by these expectations, they serve the system poorly.
If we buy the argument so far, the obvious next question is why have the astronaut at all. What purpose is the astronaut serving in the grand scheme of things? The answer will have to wait till Part II.
Discuss