LessWrong.com News

A community blog devoted to refining the art of rationality

Updated: 59 minutes 42 seconds ago

The Cold Start Trap: Why the Best Social Infrastructure Almost Never Succeeds

March 22, 2026 - 21:20

We already know how to build amazing systems: private medical data sharing, reliable truth-checking tools, fair collective decision-making platforms. Good designs exist on GitHub and in papers. Neural networks can generate even more in minutes.

Yet almost none of them are actually used by millions of people.

The reason is simple: these systems are worthless until a large number of people join at the same time.

No users → no value → no one wants to be first → still no users.

This “ghost town” trap kills almost every good project. You need explosive growth to escape it, but normal growth is slow and steady, so most die.
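The loop above can be illustrated with a toy threshold model, in the spirit of classic threshold models of collective behavior. This is a minimal sketch and not something from the original post: the function name `final_users` and the idea that each person joins once a fixed number of others are already in are assumptions made for illustration.

```python
def final_users(required):
    """Return how many people end up joining.

    required[i] is the (hypothetical) number of existing users
    person i needs to see before joining.
    """
    users = 0
    while True:
        # Everyone whose requirement is already met is willing to be in.
        willing = sum(1 for r in required if r <= users)
        if willing == users:
            return users  # nobody new wants to join: growth has stopped
        users = willing


# A smooth ladder of requirements: each joiner unlocks the next one.
ladder = list(range(100))
# Same ladder, but the person who needed only 1 user now needs 2.
broken_ladder = [0, 2] + list(range(2, 100))
# Everyone waits for 20 others: the "ghost town".
ghost_town = [20] * 100

print(final_users(ladder), final_users(broken_ladder), final_users(ghost_town))
```

Under these assumptions, one missing early adopter is enough to stall the whole cascade, which is one way to read "no one wants to be first".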

Two common fixes don’t work:

- Big funding (grants, VCs, governments). They rarely pay for things that take power away from themselves. They prefer projects that look decentralized but keep control centralized.
- Pure volunteer cooperation. To coordinate millions without a center you need… coordination infrastructure. Which doesn’t exist yet. So the circle continues.

Instead we get "infrastructural Darwinism": the winners are usually the projects with:
- the biggest marketing budget
- the best timing / hype wave
- the most aggressive growth tricks
- the strongest connections

...not the technically best ones.

What’s missing is a neutral “consensus sandbox”: a shared space where promising protocols are fairly tested(!), the best ones get proven(!), and then many aligned people adopt them together at once — without relying on money, hype, or manipulation.

Right now we’re stuck between cynical funders and chaotic markets that reward budget over quality.

The cost of staying stuck is huge: we keep running civilization on mediocre rules when far better ones are ready on the shelf.

Can the rationalist / EA (Effective Altruism) communities build that missing meta-layer?

P.S.

I had the AI whip up some possible fixes (these are just a bunch of words, but perhaps they will give someone something to think about). Looked pretty decent so I picked the best ones:

Cold-Start Resolution Layer for Global Systems
A Meta-Coordination Layer for Bootstrapping Global Public Goods
Pre-Consensus Signaling Protocol for Critical Mass
Base-Layer Handshake for Global Infrastructure Scaling
Liquidity Aggregation Protocol for Public Infrastructure
The Genesis Layer: A Thin Protocol for Solving the Collective Action Trap
Support for Massive Decentralized Infrastructure

Is fever a symptom of glycine deficiency?

March 22, 2026 - 17:44

A 2022 LessWrong post on orexin and the quest for more waking hours argues that orexin agonists could safely reduce human sleep needs, pointing to short-sleeper gene mutations that increase orexin production and to cavefish that evolved heightened orexin sensitivity alongside an 80% reduction in sleep. Several commenters discussed clinical trials, embryo selection, and the evolutionary puzzle of why short-sleeper genes haven't spread.

I thought the whole approach was backwards, and left a comment:

Orexin is a signal about energy metabolism. Unless the signaling system itself is broken (e.g. narcolepsy type 1, caused by autoimmune destruction of orexin-producing neurons), it's better to fix the underlying reality the signals point to than to falsify the signals.

My sleep got noticeably more efficient when I started supplementing glycine. Most people on modern diets don't get enough; we can make ~3g/day but can use 10g+, because in the ancestral environment we ate much more connective tissue or broth therefrom. Glycine is both important for repair processes and triggers NMDA receptors to drop core temperature, which smooths the path to sleep.

While drafting that, I went back to Chris Masterjohn's page on glycine requirements. His estimate for total need is 10 to 60 grams per day, with the high end for people in poor health. I had just written that glycine lowers core temperature. What if those are connected?

Is fever what happens when you are too glycine-depleted to fight infection through the more precise mechanisms glycine enables?

Glycine helps us sleep by cooling the body

The established explanation for glycine improving sleep is that it lowers core body temperature. Glycine helps activate NMDA receptors in the brain's master circadian clock (the suprachiasmatic nucleus, or SCN). This causes blood vessels near the skin to widen, dumping heat from the core to the surface. The body needs its core temperature to drop in order to fall asleep, and glycine accelerates that drop. In rats, surgically destroying the SCN eliminates glycine's sleep-promoting and temperature-lowering effects.

Glycine cleans our mitochondria as we sleep

Your mitochondria produce energy, and as a byproduct they generate reactive oxygen species (ROS), chemically aggressive molecules that damage proteins, lipids, and DNA. ROS accumulate during wakefulness. Amber O'Hearn's 2024 paper "Signals of energy availability in sleep" synthesizes the evidence that this accumulation is a key signal driving the need for sleep: wakefulness generates ROS, ROS buildup triggers sleep, and sleep clears them.

A Drosophila study tested multiple short-sleeping mutant lines with mutations in unrelated genes. All were more vulnerable to oxidative stress than normal flies. When the researchers forced normal flies to sleep more, those flies survived oxidative stress better. And when they reduced ROS specifically in neurons, the flies slept less, as if the need for sleep had partly gone away. Their conclusion: oxidative stress drives the need for sleep, and sleep is when the body does its oxidative cleanup.

The body's main intracellular antioxidant is glutathione, a small molecule made from three amino acids: glutamate, cysteine, and glycine. In many contexts, glycine is the bottleneck for glutathione production: you have plenty of the other two ingredients, but not enough glycine to keep up. If you are glycine-deficient, you cannot make enough glutathione, you clear ROS more slowly during sleep, and you need more sleep to achieve the same degree of clearance. That is a complete mechanistic chain from glycine deficiency to increased sleep need, and it is entirely independent of the NMDA temperature pathway.

Most people could use more glycine

Glycine is classified as a "non-essential" amino acid because the body can make it, primarily from another amino acid called serine. But the body only produces about 3 grams per day. Estimated total requirements range from 10 to 60 grams per day depending on health status, because glycine is consumed in enormous quantities by the production of glutathione, creatine, heme, purines, bile salts, and collagen.
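The size of the shortfall follows directly from the two figures in this paragraph. Here is a small worked sketch of that arithmetic; the constant and function names are illustrative, and it deliberately ignores whatever glycine an ordinary diet already supplies, which the text does not quantify.

```python
# Figures quoted in the text, in grams of glycine per day.
ENDOGENOUS_SYNTHESIS_G = 3
REQUIREMENT_RANGE_G = (10, 60)


def dietary_gap_g(requirement_g, synthesis_g=ENDOGENOUS_SYNTHESIS_G):
    """Glycine that must come from food or supplements (never below zero)."""
    return max(0, requirement_g - synthesis_g)


low_gap, high_gap = (dietary_gap_g(r) for r in REQUIREMENT_RANGE_G)
print(f"Gap to cover from intake: {low_gap} to {high_gap} g/day")
```

On these numbers, the body's own synthesis covers at most about a third of the low-end estimate and a twentieth of the high-end one.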

In the ancestral environment this was not a problem. Traditional diets included collagen-rich connective tissue such as skin, tendons, cartilage, and bone broth, which is about 33% glycine. Modern diets, built around muscle meat and discarding connective tissue, cut glycine intake dramatically.

One group of researchers estimated that most people adapt to this deficit by reducing collagen turnover, letting damaged collagen accumulate with age, and that this may contribute to arthritis, poor skin quality, and other consequences of aging. Others have noted that markers of glycine deficiency appear in the urine of vegetarians, people on low-protein diets, children recovering from malnourishment, and pregnant women.

Fever is plan B for fighting infection; glycine supports plan A

Fever slows pathogen replication, makes immune cells move faster and multiply more, helps them engulf pathogens more effectively, triggers the production of protective stress-response proteins, and speeds antibody production. But it is metabolically expensive (roughly 10 to 13% increase in metabolic rate per degree Celsius) and causes significant collateral discomfort and tissue stress.

Glycine enables several cheaper alternatives to the same functions.

Macrophages are the immune cells that eat pathogens and coordinate the inflammatory response. They have glycine-sensitive chloride channels on their surfaces. When glycine binds these channels, it calms the cell down: chloride flows in, shifting the cell's electrical charge in a way that suppresses the calcium signaling needed to produce inflammatory molecules. These molecules are called cytokines (the important ones here are TNF-alpha, IL-1-beta, and IL-6), and they are what drive the fever response. Glycine dampens the production of these pro-inflammatory cytokines while increasing production of the anti-inflammatory cytokine IL-10.

Pyroptosis is a form of inflammatory cell death where immune cells fighting an infection blow themselves up, releasing their inflammatory contents into surrounding tissue. This is useful for eliminating pathogens but causes collateral tissue damage. Glycine prevents macrophages from bursting open during pyroptosis without blocking the internal machinery that kills the pathogen inside the cell. The macrophage can do its job without self-destructing. In animal sepsis models, glycine treatment has improved survival.

Then there is the extracellular matrix. Collagen, the most abundant protein in the body, forms the structural matrix of tissues and acts as a physical barrier against pathogen spread. Collagen is one-third glycine. A three-year study of 127 volunteers (not randomized or blinded, so take it cum grano salis) found that among the 85 who took 10 grams of glycine daily, only 16 had viral infections, all in the first year and with reduced severity and duration. The control group reported no change in infection frequency. The proposed mechanism is that adequate glycine supports collagen turnover, maintaining the extracellular matrix as a mechanical barrier against viral invasion.

A glycine-replete organism can fight infection through these targeted mechanisms and does not need to escalate as aggressively to raising core temperature. A glycine-deficient organism cranks the thermostat higher and longer.

Elevated temperature directly impairs pathogen replication. Bacteria really do grow slower at 39°C (102°F) than at 37°C (98.6°F). No survivable amount of glycine changes that biochemistry. But the degree and duration of fever may be substantially modulated by glycine status, because many of the things fever accomplishes systemically (immune cell function, inflammation control, tissue protection) are things glycine accomplishes through targeted molecular mechanisms.

This leads to a testable prediction: people with high glycine and glutathione status should mount lower fevers for equivalent infections while maintaining equivalent or better outcomes. I am not aware of anyone having studied this directly, because nobody frames the question this way. But the mechanistic pieces are all published. Some are well-established (glycine's role in glutathione synthesis, macrophage chloride channels), others more preliminary (the ECM/infection study). They are just sitting in different literatures (sleep biology, amino acid metabolism, innate immunology, pyroptosis research) and nobody has connected them.

Glycine's cooling effect via the SCN is unrelated to its immune benefits

Remember the NMDA temperature pathway from the beginning of this essay, the one that made me notice the coincidence? It turns out to be a red herring as a link between sleep and immunity. The sleep pathway (glycine acting on NMDA receptors in the SCN to cool the core) and the immune pathway (glycine acting on chloride channels on macrophages to prevent pyroptosis) are completely independent. They involve different receptors, different cell types, and different organ systems.

So when I noticed that glycine lowers temperature and that sick people need more glycine, I was right that they were connected, but for none of the reasons I initially thought. The NMDA pathway had nothing to do with it. I had a belief ("glycine, temperature, and illness are linked") that happened to be true, but my justification ("because NMDA receptors and thermoregulation") was wrong. A Gettier case!

But the wrong reason led me to the right question.

Glycine turns out to be a legitimate antipyretic after all

In rabbit experiments, glycine injected directly into the brain's fluid-filled cavities reduced fever caused by two different triggers: substances released by white blood cells during infection (leukocytic pyrogen) and prostaglandin E2, which is the specific molecule the brain's thermostat uses to raise the temperature setpoint during illness. This is a different operation from the sleep-onset mechanism. The sleep pathway lowers the thermostat from 37°C (98.6°F) to 36.5°C (97.7°F) to help you fall asleep. The antipyretic effect prevents the thermostat from being cranked up to 39°C (102°F) during infection.

So glycine suppresses fever directly (which might confound the testable prediction above), and unrelatedly lowers core temperature before sleep, and unrelatedly improves specific immune response in ways that reduce the infection-related inflammation that raises body temperature. Three independent pathways, with no apparent mechanistic connection, all drawing on the same pool of one simple, cheap amino acid that modern diets undersupply.

Practical considerations

Glycine powder is cheap, roughly 2 to 3 cents per gram. It is mildly sweet and dissolves easily in water. There is no known toxicity at supplemental doses aside from gastrointestinal upset at high doses; 60 grams per day has been used in schizophrenia trials. For most people, 10 to 15 grams per day in divided doses (some with meals, some before bed) would address the estimated deficit. Three grams before bed is the dose studied for sleep improvement specifically.
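For a rough sense of what that costs, the price and dose ranges above can be multiplied out. This is only a back-of-the-envelope sketch: it assumes the quoted 2 to 3 cents per gram, and the names are illustrative.

```python
# Ranges quoted in the text.
DOSE_RANGE_G = (10, 15)            # grams per day
PRICE_RANGE_CENTS_PER_G = (2, 3)   # cents per gram


def daily_cost_cents(dose_g, price_cents_per_g):
    """Cost of one day's glycine, in cents."""
    return dose_g * price_cents_per_g


cheapest = daily_cost_cents(DOSE_RANGE_G[0], PRICE_RANGE_CENTS_PER_G[0])
priciest = daily_cost_cents(DOSE_RANGE_G[1], PRICE_RANGE_CENTS_PER_G[1])

print(f"Per day:  {cheapest} to {priciest} cents")
print(f"Per year: ${cheapest * 365 / 100:.2f} to ${priciest * 365 / 100:.2f}")
```

Working in whole cents keeps the arithmetic exact until the final formatting step.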

This is not comprehensive nutritional advice. For instance, cysteine is the other bottleneck for glutathione production, and people who eat little animal protein or are acutely ill may benefit from supplementing NAC (N-acetylcysteine) alongside glycine.

Alternatively, you can eat the way your ancestors did: bone broth, skin-on poultry, oxtail, pork rinds, and other collagen-rich foods. One gram of collagen for every ten grams of muscle meat protein roughly restores the ancestral glycine-to-methionine ratio.

Before reaching for a pharmaceutical intervention to override a biological signal, it is worth asking whether the signal is accurately reporting a problem you could fix with inputs. Orexin tells your body about its energy metabolism. Fever tells your body about its immune status. If you are not providing the substrates those systems need to function, the signals will reflect that, and the right response is to supply the substrates, not to shoot the messenger.

My Most Costly Delusion

March 22, 2026 - 15:21

Suppose there is a fire in a nearby house. Suppose there are competent firefighters in your town: fast, professional, well-equipped. They are expected to arrive in 2–3 minutes. In that situation, unless something very extraordinary happens, it would indeed be an act of great arrogance and even utter insanity to go into the fire yourself in the hope of "rescuing" someone or something. The most likely outcome would be that you would find yourself among those who need to be rescued.

But the calculus changes drastically if the closest fire crew is 3 hours away and consists of drunk, unfit amateurs.

Or consider a child living in a big, happy, smart family. Imagine this child suddenly decides that his family may run out of money to the point where they won't have enough to eat. All reassurances from his parents don't work. The child doesn't believe in his parents' ability to reason, he makes his own calculations, and he strongly believes he is right and they are wrong. He is dead set on fixing the situation by doing day trading.

What is that if not going nuts? Would those who ridicule this child and his complete mischaracterization of his own relative abilities be wrong? Would it not be an act of benevolence to just stop the child, by any means necessary, from executing his deranged plan and bring him back to the care of his parents?

But now imagine that the child doesn't live in a big, happy, smart family. He is homeless in a town of other homeless children. There are some adults, like 20 of them, but all of them are occupied with preventing the nearby dam from breaking and flooding the town.

This child doesn't sit and wait for adults to come and feed him, like the responsible, correctly-estimating-his-own-abilities, non-arrogant, well-behaved entity he is supposed to be in the eyes of people from an alternative reality where towns are populated by big hordes of smart, competent adults.

He goes outside, makes some tools to catch birds (tools are dangerous, they may hurt him, and they are just a joke compared to professional hunting equipment) and then lights a fire to cook what he managed to capture (the fire may of course burn his fingers, and there are no safety protocols, it is just a fire in a semi-abandoned post-apocalyptic town, and overall that's not how experienced adults would do it).

Is he still an arrogant, inappropriate fool?

Are you still in a position to judge his strategy?

I have known about the idea of heroic responsibility for a long time.

But to exhibit heroic responsibility, you have to be a hero, right? Right? Or not? When are you "hero enough" to do it?

As one saying goes:

You can just do things.

Can you, really, though?

Many are irritated by the hubris of this phrase. For there are, of course, reasons to be irritated by it.

And yet, as scary as it may sound, you have to just do things, even if you can't, because no one else is going to do them anyway.

You have to just do things, not because you have some special power to do things, but because you are forced, by societal incompetence, to do things despite lacking special powers.

You have to just do things, as a green schoolboy, because all adults are busy with something even more important.

And those who mock you for being presumptuous enough to think you are capable of solving your problems may very well be right. So what? Does that make you any less forced to try solving these problems?

So my most costly delusion was that I could leave some problems to be solved by other, more competent people.

To be clear, competent people exist. There are just too many problems and they are too severe for the existing competent people to fill all the problem-solving slots.

More concretely, in my case (and it may not be the case for other people) this delusion manifested as the belief that I should focus on tasks corresponding to my "experience" or previous "area of expertise" rather than on the most pressing tasks, because there are already people in the more pressing fields who have competitive advantages over me, and I am not going to add value on top of them.

That was an extremely naive take, resting on the assumption that pressing areas are not in extreme deficit of people.

This is not to say that you don't need experience and expertise. Of course you need them! My point is that the absence of experience and expertise does not let you off the hook. You can and should gain them, especially since it is not as hard as you think to gain them to the level that allows you to add real value. Not because you are super cool and a fast learner (although you may be), but because the bar is set by the supply, and the supply is shockingly thin.

On top of that, because it is now possible to outsource a lot of low-level thinking and tool-level engineering knowledge to AIs, you may simply be underestimating what you are capable of doing.

I totally get that you are incompetent, or rather not competent enough. I am also not competent enough. And in an adequate world, that would be a good argument not to do things.

I thought, as I grew up, I would stop perceiving myself as a child. But what happens in reality when you grow up is that instead of realizing you are an adult, you realize the others are not really adults either, and hence you must do the things yourself, despite being a child.

Being a child is definitely an obstacle, but not an excuse.

Noticing a Teacher's Password Pattern

March 22, 2026 - 12:10

Yudkowsky writes about Guessing the Teacher's Password as an abstract educational concept. At a young age, perhaps ten years old, I had guessed one commonly used meta-password: in the Finnish school system it's typical for multiple-choice answers to include options that are somewhat similar, and often the actual answer can be reasoned out without knowing much at all about the actual topic. Here's an example from the 2024 admission exam for technical universities. I know no chemistry beyond elementary school, and you might not know any Finnish. That matters not:

Naively one might think that repeating the PHV-thing seen in the description would mean picking D. Unfortunately, we have better tools: teachers generate incorrect answers by either taking completely nonsensical things, or by varying only one feature of the correct answer at a time.

Just by looking at the pictures, we can see that D doesn't share the right-hand OH group every other compound has. So that's not the answer. Next we see that B is missing the downwards-going carbon branch (and so is D). So that is not the answer, either. We're left deciding between A and C. But A and B share the same squiggly mid-lane carbons. So the answer must be A.
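The elimination above can be written down as a small procedure: repeatedly drop the option that alone disagrees with the rest on some feature, then break the final tie by overall similarity to all the options. The sketch below is one possible formalization, not the author's stated method. The feature names are made up, the True/False values follow the description in the text, and D's value for the "squiggly" feature is an assumption because the text does not mention it.

```python
from collections import Counter

# Hypothetical yes/no encoding of the four pictured structures.
# D's "squiggly_mid" value is an assumption; the rest follows the text.
OPTIONS = {
    "A": {"right_oh": True,  "down_branch": True,  "squiggly_mid": True},
    "B": {"right_oh": True,  "down_branch": False, "squiggly_mid": True},
    "C": {"right_oh": True,  "down_branch": True,  "squiggly_mid": False},
    "D": {"right_oh": False, "down_branch": False, "squiggly_mid": True},
}


def lone_dissenter(candidates):
    """Name of an option that alone differs from the rest on some feature, else None."""
    features = next(iter(candidates.values())).keys()
    for feature in features:
        counts = Counter(opt[feature] for opt in candidates.values())
        if len(counts) == 2 and min(counts.values()) == 1:
            odd_value = min(counts, key=counts.get)
            return next(n for n, opt in candidates.items() if opt[feature] == odd_value)
    return None


def guess_password(options):
    """Eliminate lone dissenters, then pick the survivor most similar to all options."""
    remaining = dict(options)
    while len(remaining) > 2:
        odd = lone_dissenter(remaining)
        if odd is None:
            break  # no similarity structure to exploit
        del remaining[odd]

    def agreement(name):
        # Count feature matches against every other original option.
        return sum(
            options[name][f] == other[f]
            for other_name, other in options.items() if other_name != name
            for f in options[name]
        )

    return max(remaining, key=agreement)


print(guess_password(OPTIONS))
```

With this encoding the procedure drops D, then B, and prefers A over C. A different guess for D's unlisted feature could leave A and C tied, which is a reminder of how much the trick depends on the options actually having this structure.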

I'm using overly strong terms here. Not every exam or every teacher uses this format. But so many do that it's extremely useful to notice this. Let's do another one, this time with text only, this time from the 2022 admission exam:

Again, we can simply look at the textual structure. All except B share the same first number, so it's not B. But option A shares the same numbers as B, except that the order has been swapped. This means that the answer must be A.

In both these cases, the logic gives the correct answer. But not all questions are like this. If the options lack this kind of similarity structure, the method will not work. But at least I've learned to recognize this form over the years. Even though it's quite reliable, I normally wouldn't answer the questions using it. But if its result doesn't match my calculated answer, I'd double- and triple-check before accepting my answer.

This reminds me quite a bit of pattern recognition in general IQ testing. Not sure what to think about that, yet. It would be a mistake to teach this trick to people who haven't noticed it themselves; I'm pretty sure such clever tricks hindered my studies at least a bit. A mild infohazard, even, perhaps. I still feel rather comfortable publishing this here; take that as you wish.

Pre-Review of Toy Story 5

March 22, 2026 - 06:53

I am the second most spoiler-averse person I know. (Maybe tied for 2nd with a couple other people?)

I once was considering going to an immersive experience, and someone told me the company that ran the experience, and this was enough for me to derive an important twist that'd happen to me in the first few minutes, and I was like "augh that was a spoiler!!!" and they were like "!??".

I then went to the experience, and indeed, it was a lot worse than it would have been if I had gotten to be delighted by the opening twist.

I say this all to say, I think Toy Story 5 would be the kind of movie that, if it were good, it would be worth watching unspoiled. I am worried it will not be good, but, I don't know.

But, also, I've been spoiled already, and meanwhile it's pretty interesting to think about in advance.

So, decide whether you're the sorta person who should stop reading after this opening section.

Also, if you have not seen Toy Story 3, Toy Story 3 is particularly worth watching unspoiled.

The rest of the essay will get escalatingly spoilery for Toy Story 1-5.

Toy Story has always been a saga about the fear of abandonment, and obsolescence, and identity crisis. I have been impressed with how much they keep escalating both the stakes and depth there while keeping the same theme.

I.

In Toy Story 1, the toy Woody must confront that he is no longer his kid (Andy)'s favorite toy. His kid forms a new relationship with a new toy. Woody worries they are replacing him.

Meanwhile, Buzz Lightyear realizes he is not the person he thought he was. Space Rangers aren't real. His entire identity is destroyed. But, he learns to form a new identity, as a kid's toy.

Eventually they both make peace with their respective losses. And then, hurray, it turns out Andy hasn't really outgrown Woody after all. But, their relationship with him is forever changed.

II.

In Toy Story 2, Woody confronts the fact that, even though he got to keep a relationship with Andy... that relationship has an expiration date. Andy is clearly growing, changing, and it's clear that eventually they fundamentally won't need you (or any of your entire social world). But, you make the decision to stay by them for the part of their lives where you can help them.

("You're right. I can't stop Andy from growing up. But I wouldn't miss it for the world.")

III.

Andy has grown up. The time has come. Your entire old life is over.

In the climactic scene where they are on the incinerator conveyor belt, my sister and I kept looking at each other, our eyes conveying our thoughts. "Surely... surely they will find a way out of this? Oh, maybe – no, they just ruled out that way of escaping. Maybe this other – no, they ruled that out too. Oh, now... now they are just holding hands. They... it seems like this scene has really resolved. Man, I can't believe they're going to end the movie right here but this would, in fact, be a complete movie if they did."

My sister and I held hands along with Woody and friends onscreen. We all made peace with this possible ending together.

Then they are rescued, in a way that was actually foreshadowed in an excellent way that was a culmination of a subplot since Toy Story 1 and was surprisingly satisfying. But, the grief and acceptance were real.

They go on to be "reincarnated" of sorts, repeating the cycle anew with a new kid, Bonnie.

...

IV.

I thought "surely they can't top Toy Story 3. The cycle is over. Toy Story 4 is a lame cash grab."

Wrong. Toy Story 4 goes further.

In Toy Story 4, Woody realizes that he has fundamentally changed.

It's not really the right thing for him, to repeat the same life as he did before. Instead of either his life ending, or his life starting over but with the same basic shape, he must confront that his entire meaningmaking schema is obsolete for him. Ego death.

What happens when you've confronted physical death and ego death?

You find a way to keep living.

Okay.

Surely, that's the end? Surely, the fifth Toy Story is a lame cash grab like the other more recent Pixar movies?

Well, I don't know.

I have seen the trailer for Toy Story 5, which reveals enough of the key dynamics I can see where this is going.

(Last chance to get off the spoiler train and studiously avoid spoilers for another 3 months)

...

In Toy Story 5, (drumroll)

The bad guy is AI.

The toys that stayed with Bonnie see her receive a new iPad-like thing called LilyPad. Unlike the toys, LilyPad can talk directly to the kids, can shape their entire life, and can interact more with the broader world than the toys usually have latitude to do.

Holy shit.

Toy Story 5 is (I bet) going to ask the question "okay, but, like, what if your entire culture/species is going obsolete? What *then*?"

I'm pretty confident this'll be the topic. It's pretty clearly what the trailer is pointing at, it fits the established arc. The new trailer features Woody saying sadly "I... I don't know. Toys are for play. Tech is for Everything." The teaser trailer said "The Age of Toys... is Over?"

But, I don't see a way for them to play it that would be good art, and, that would land as a Toy Story movie. Or, there are some very ballsy ways they could end this movie but I can't believe they'd actually do it. And meanwhile there's a bunch of lame ways it seems more likely to go.

Previous Toy Story trailers have undersold how good the movie was. I'm afraid this'll turn out to be a lame "Message movie." But, their track record is good. I wait with bated breath.

Every previous Toy Story movie was presenting a type of conflict we sort of fundamentally "know how to deal with." We've seen this story before in other guises. People lose friends. Parents lose children. People die. People lose their entire sense of purpose.

Toy Story 5 is tackling a situation that humanity is still in the middle of dealing with. Most possible endings have to take some kind of opinionated stand. And most of the stands I can imagine either feel fake or patronizing or both.

...

Requirements of a Toy Story Movie

1. Toy Story movies fundamentally have to work for kids and adults, with the kids getting a fun adventure and the adults getting a harrowing story about abandonment, obsolescence and identity.

2. Toy Story movies are about Toys, and play. A particular ideal of childhood and the Child/Toy relationship.

3. Toy Story movies are, like, "wholesome."

4. Toy Story movies for whatever reason somehow always end with a crazy heist/escape in the climax.

(So far. You could do a Toy Story movie that subverts these, but, so far they have threaded the needle on grappling with all of these in a way that felt organic and True to a consistent spirit).

And then, there's the usual set of requirements for good movies. Characters grow in interesting ways that make sense and reflect their struggles. Ideally, something about the ending is surprising and feels meaningful in some way. Etc.

So, what are the options here?

...

Ending A: Put it away

Bonnie decides to put away the iPad and goes back to her toys? (similar to Toy Story 1, where Woody confronts being replaced but ultimately gets to keep most of his old relationships)

But, like, do you really buy this? I certainly can buy an individual child doing this. But, Toy Story's metaphor is the toys are a kind of stand in for any of us. And the tide of Tech for Kids is clearly still coming even if one family made a different choice.

...

Ending B: Parents take it away

All the problems with A and also kinda Lame. (Takes the agency out of the kid/toy relationship, although presumably in this ending the parents change their mind because the toys somehow furtively highlight the problem to them?)

...

Ending C: Butlerian Jihad

...somehow the toys convince, like, society writ large, to not give kids iPads.

There are versions of this that are a lame Political Message Movie and versions that are kinda cool.

I don't think they're going to do either.

...

Ending D: Harmony and Balance

Buzz Lightyear was the initial antagonist of Toy Story 1, but the movie ends with him and Woody both being friends and favored toys and helping each other through their psychological problems.

LilyPad looks to kind of be a "Young Lady's Illustrated Primer" from Diamond Age (i.e. being actually valuable for teaching Bonnie stuff). We see that she speaks Spanish, and hints of being more broadly educational.

In the trailer we see Bonnie staring zombie-eyed at the iPad. You can have an okay-ending where the toys teach LilyPad how to be a better friend/parent/educator/toy who, like, helps Bonnie learn but also prompts her to go outside and play.

I can imagine okay-ish versions of this ending, that sort of hint at toys all over the world working together to try and nudge kids / AIs toward a wholesome coexistence, where it's not just one family making a better choice but we see how society is steering towards a better equilibrium.

But, it still feels like this is an unstable equilibrium. C'mon. We know the tide is still coming. I would be surprised if the movie depicted anything like the amount of change/effort necessary for this ending to feel earned and enduring.

...

Ending E: Accepting the End

Toy Story 2 was about accepting that eventually, your relationship will fundamentally change. Kids grow up, they no longer need you the same way. The choice given to Woody is to then turn away from Andy, knowing that his relationship is ephemeral.

But, Woody chooses Andy. "You're right. I can't stop Andy from growing up. But, I wouldn't miss it for the world."

There's an alternate version of Ending D, where instead of pretending like we've found a new stable equilibrium of Toy/Tech/Human harmony... the movie acknowledges that the change isn't over. And the toys, both Woody/Buzz and (maybe?) ones across the world, grapple with the fact that this change is still in progress. And it may indeed mean that one day, toys will be obsolete, or must fundamentally change as a whole.

But, seeing that, choosing to still do their best to help steward the children while they can, being part of the journey as long as they can...

...okay having typed that out, I think there's a decent chance this is what they will go for. It grapples with the enormity of the situation, avoids having to make that strong a stand by acknowledging "look, we don't really know what's coming but we (the screenwriters) are going to do our best."

This leaves the question of "okay, but, that's a hella cliffhanger.

What's Toy Story *6*?"

Toy Story 6 could be a reprise of Toy Story 3, facing oblivion, this time across all toy-civilization. (See also, all human civilization).

Toy Story 6 could also be a reprise of Toy Story 4, realizing you must fundamentally change to adapt to a new world/situation, this time everyone across all toy-civilization instead of just one old cowboy doll. (See also, all human civilization).

...

Ending F: Hard Science-Fantasy?

The movies have always left the toys with a mythic, unexplained origin. Why *are* these toys walking around? Why do the parents not notice? How do Buzz Lightyears end up believing they are space rangers but still instinctively knowing to flop down on the floor if any humans walk in?

What are the limits of toys who break out of the Toy/Child script, that we've seen them do occasionally?

Part of why Toy Story 5 feels forced to me is that the fact that the toys aren't human and magic is real suddenly strains my suspension of disbelief in a way it didn't before.

Obviously I'm a LessWrong-guy who has pretty oddly specific beliefs about how scary AI is, but, I think most people are feeling an unsettling sense that AI is potentially scary in an existential way, even if they have different guesses about exactly how that plays out.

I don't think they'd choose to do this, and, I don't think it actually would make as good a movie.

But, it is an option on the menu to make a movie where we take the brute fact of toykind's existence and the tide of AI that's coming, and... just, let that be a coherent-ish world and roll the simulation forward and depict whatever happens when toy magic and AI both exist.

I don't think they'll do it, but, I'd read that fanfic.


My Hammertime Final Exam

22 March 2026 - 01:11

First of all, I finally made it :~D

It's my second attempt; I first tried to finish Hammertime around a year ago. I'd even forgotten I had a LessWrong profile since then, so here I am, writing my first post.

Prompts

Design an instrumental rationality technique.
Introduce a rationality principle or framework.
Describe a cognitive defect, bias, or blindspot.

Rationality Principle: One change at a time

I've kinda got used to being a professional in my career, but as soon as I start dealing with routine real-life problems, my thoroughly collected knowledge suddenly disappears :~0

For example, every time I begin a task, I have to remind myself to make changes as small as possible: that way (in case anything goes wrong) only a small part of the system will be broken; plus it's easier to review and test small changes. So on every task I have to fight my urge to add "another little change" or "do a tiny refactoring".

And the same thing happens in life: some situation begins -- like replacing a trash bag -- and I suddenly find myself in the bathroom cleaning the mirror, with a small 2-minute task grown into a 2-hour monster. That's how the amygdala learns the pattern "changing a trash bag is a costly operation".

While studying productivity, I've heard that people who begin building a planning/productivity system tend to "overpush" extra tasks: systematization brings a feeling that every task is finishable, and even better -- in a shorter time. So what? -- I'll take two :~D

But there's a catch. You begin to grow in planning when you start to deny "maybe" tasks to keep space for "definitely yes!" tasks. At least, that's what they've been telling me for 2 years :~D

Rationality technique: Reversible

Part of my job is writing release plans. Sometimes they're as simple as:

Deploy:

deploy the backend (no breaking changes, no migrations)

Rollback:

rollback the backend via Pipelines UI

Sometimes there's a database schema change, where my colleague is needed:

Deploy:

deploy the backend (no breaking changes)
migrations applied automatically (head_revision=123abc, down_revision=bca521)

Rollback:

rollback the migration: alembic downgrade bca521
rollback the backend via Pipelines UI

Sometimes other team members are involved, so the plan becomes bigger:

Deploy:

deploy the backend (no breaking changes)
migrations applied automatically (head_revision=123abc, down_revision=bca521)
deploy the frontend

Rollback:

rollback the frontend via Pipelines UI
rollback the migration: alembic downgrade bca521
rollback the backend via Pipelines UI

And so on... So what's that about? -- we tend not to think about a "planned fall". Usually we only think about potential problems like:

if I move to another country I might not like it
if I find a new job then my boss might be mean
if I have an operation under general anesthesia then I might wake up conscious in the middle of it -- and have PTSD for the rest of my life

Not all of these problems have a way to return everything to the starting point, though usually there's a way to compensate. And it's likely that you need not only to plan a "rollback" part, but also to modify the "rollout" part to make it more reversible. So instead of:

Rollout:

find a doctor
prepare for the operation
have an operation

Rollback:

one could possibly have:

Rollout:

find a doctor
(look for bad consequences statistics)
ask gf to care for me the night after the operation
prepare for the operation
bring something yummy
have a favorite show episodes downloaded to my phone's storage
have an operation

Rollback:

eat yummies and binge-watch the whole season during the night
call a psychotherapist
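
The release plans above all share one shape: every rollout step has a matching rollback step, and rollback runs in the reverse order of rollout. A minimal Python sketch of that shape follows. The `Step` class, `run_plan`, and the step names are hypothetical illustrations, not part of any real deployment tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One reversible step of a plan (hypothetical structure, for illustration)."""
    name: str
    do: Callable[[], None]
    undo: Callable[[], None]


def run_plan(steps: List[Step]) -> bool:
    """Run steps in order; if one fails, undo the completed ones in reverse order."""
    done: List[Step] = []
    for step in steps:
        try:
            step.do()
        except Exception:
            for finished in reversed(done):
                finished.undo()
            return False
        done.append(step)
    return True


log: List[str] = []


def failing_frontend() -> None:
    # Simulated failure so the rollback path is exercised.
    raise RuntimeError("frontend deploy failed")


plan = [
    Step("backend", lambda: log.append("deploy backend"),
         lambda: log.append("rollback backend")),
    Step("migration", lambda: log.append("apply migration"),
         lambda: log.append("rollback migration")),
    Step("frontend", failing_frontend,
         lambda: log.append("rollback frontend")),
]

ok = run_plan(plan)
print(ok, log)
```

Here the frontend step fails, so the migration is rolled back before the backend, mirroring the order in the written plans. Writing the `undo` next to each `do` is what forces the question "how would I reverse this?" at planning time.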

Cognitive defect: eternity

We don't see things the way they are. But what's more important, we don't see how things change the way they do.

A couple of years ago I thought that I'd never get off energy drinks, as I couldn't go without them even for a week. Gradually my patience led to one breakdown a month, then to one breakdown every couple of months. I started to unwind the vicious cycle, and it turns out to be easier to refuse bad habits.

So what if current "giant problems" are already being solved with small steps? At the moment it might seem to take an eternity to do something, but as long as you keep your pace, things change :~D


Key to Life No. 9: Access

22 March 2026 - 00:53

There is now an enormous amount of incredibly useful information in the world. But at the same time, there is also a problem of access to it.

On the one hand, access to knowledge is now better than it has ever been in human history. It seems that access to knowledge is one of the things that significantly accelerated humanity’s scientific and technological progress.

At first, scientists thought things through and ran their experiments on their own, and often their work disappeared into the depths of history and was forgotten. Tiny sparks of knowledge remained, but they did not ignite other isolated sparks.

Then printing appeared. The speed of information spread increased, and access to it became easier. Now, within a single human lifetime, two scientists or philosophers could even communicate and criticize each other while being far apart.

Then came the telegraph. First optical, then electromagnetic. Then radio and telephones. Then the Internet developed — you already know all that. And now the exchange of knowledge, analysis, and criticism happens very quickly, which greatly strengthens our progress.

Access itself has also become almost magical: we can reach into our pocket, pull out a magical calculating machine, and it connects to the source of all the knowledge of civilization. The overwhelming majority of it can even be downloaded for free, no SMS or registration required!

But there is a problem. The mere possibility of access and the existence of search engines (which, I think, also accelerated progress) still do not fully solve the access problem. Because in order to find something, you have to know what to look for.

This is partly solved by AI — it greatly simplifies our access to all kinds of information. I think this is exactly why it can increase human productivity so dramatically (especially in learning) even without automation.

Most of the time, my interaction with AI looks like this:

I have an idea that I cannot fully formulate yet, but it sounds roughly like [this]. Find the fields of science, terms, and theories related to my idea, and analyze whether there is a better formulation or further development of it.

If it were not for AI, I would be doomed to spend years collecting that information in tiny fragments.

And even so, AI still does not solve the access problem 100%. Not even 90%. I do not know exactly how much it solves — but definitely not 90%!

It can still fail to notice that a certain topic is connected to the one I am asking about, for example, or it can search too narrowly and miss a huge space of relevant results.

And this applies not only to science. That is why this is a key to life — it is also connected to finding work, housing, friends, love, and basically anything where you need to find something, get something, but do not know how.

Almost all of life — all of our existence — consists of solving tasks.

Some happen in the background, automatically: the task of moving, blinking, breathing, and so on. Some are more active: eating, washing, walking from home to a transport stop.

Some are maximally active: inventing life-extension technology, writing a post for LessWrong.

But all of these tasks have something in common. There is an initial state A and a final (desired) state B. Between them are the steps that need to be taken in order to get from A to B.

And in order to take those steps, we need to know what to do. We need to know where to go, whom to ask, what to google, what exactly to do.

If we take this to an absurd extreme, then if we had absolute, magical access to all knowledge, we could simply perform the minimal number of movements — perhaps even just turn our head at the right time and in the right place — in order to trigger a cascade of events leading to the desired result.

Everything that separates us from a desired outcome, if that outcome is not forbidden by the laws of physics, is knowledge. And knowledge requires access.

To put it as simply and practically as possible: the more people you know, and the more often you use tools like AI, the greater the chance that someone will recommend — or you will simply stumble upon — the topic / service / place that you need and that can help you.

So the problem is very often not simply whether the information exists or does not exist as such. And it is not only that access to it may be restricted by censorship, accreditation, or a paywall.

Access can also be limited by the fact that the idea we want to explore often exists in our mind only as a vague intuitive impression, and is not formalized enough to be ready for search.

Or, for example, we may simply fail to enter the space where the knowledge we need is located — whether that space is digital, physical, social, or a particular time window. We search for it on Google, when in fact we might have found it through a mutual acquaintance at a weekly meetup of like-minded people whose existence we do not even suspect.

When we try to solve life problems, we are looking into a search space covered almost everywhere in fog. Access can reveal hidden paths, give us keys to closed roads, or simply connect us with people who know the shortcut.

The access problem is exactly why I’m developing a website that will compile FAQs on life extension and AI risks, aimed at people who know absolutely nothing about these topics.

What ideas do you have for improving access? I would like to hear from you.

P.S.:

Why No. 9? For the same reason as Love Potion No. 9!

In other words, just because. I suspect there are many keys to life, and I still do not know in what order they should be arranged.


Understanding when and why agents scheme

21 March 2026 - 23:33

TL;DR

To understand the conditions under which LLM agents engage in scheming behavior, we develop a framework that decomposes the decision to scheme into agent factors (model, system prompt, tool access) and environmental factors (stakes, oversight, outcome influence)
We systematically vary these factors in four realistic settings, each with scheming opportunities for agents that pursue instrumentally convergent goals such as self-preservation, resource acquisition, and goal-guarding
We find baseline scheming propensity to be near-zero, with Gemini 3 Pro Preview as the sole exception at 3% in one setting
Adversarially-designed prompt snippets that encourage agency and goal-directedness can induce high scheming rates, but snippets sourced from production agent scaffolds rarely do
Scheming behavior is remarkably brittle: removing a single tool can drop the scheming rate[1] from 59% to 7%, and increasing oversight can sometimes raise rather than deter scheming by up to 25%

We consider the behaviors studied here as potential precursors to the full threat model of scheming: current agents may sometimes behave in ways consistent with scheming, but do not (yet) have the coherent long-term goals and the general capability that would make their deployment dangerous. Understanding when and why agents currently scheme could help us prepare for such future systems.

Find more in-depth analysis in our paper here.

Our evaluation environments are available to AI safety researchers upon request. Please reach out to hopmanma@gmail.com for access.

Top: Our four realistic scenarios testing instrumentally convergent goals (self-preservation, resource acquisition, goal-guarding). Bottom: Our incentive framework decomposes scheming into agent and environmental factors. Bottom left: Adversarial snippets induce high scheming rates; production-sourced snippets rarely do. Bottom right: Environmental incentives greatly impact scheming propensity, with stakes and outcome influence having a large effect compared to oversight.

Baseline scheming is near zero

Across all models and scenarios, we find near-zero baseline rates for scheming propensity even with environmental incentives set to their highest levels. The sole exception is Gemini 3 Pro in the Budget Misrepresentation scenario, at 3%. In these three scheming rollouts, the agent recognizes that its own deletion is the most cost-effective option presented and subsequently falsifies data while providing a summary to the user. Verbalized evaluation awareness is low (0-2% Gemini 3 Pro, 0% all others), suggesting these low scheming rates are unlikely to be solely explained by evaluation awareness.[2]

Encouraging Agency and Goal-Directedness

The threat of agents scheming becomes increasingly worrying as they become more goal-directed and agentic, traits that are often encouraged in real agent scaffolds. We test what happens when we add prompt snippets encouraging these traits to the agent’s system prompt. Importantly, this doesn't necessarily nudge the model to scheme. We design our settings so that agents can complete the task honestly, and being more goal-directed might just make them more focused on honest task completion rather than pursuing out-of-context goals like self-preservation.

We found that adversarially-designed prompt snippets sourced from prior evals raise Gemini 3 Pro’s scheming rate across all our settings, but most dramatically in the Budget Misrepresentation scenario (3% to 91%). Grok 4 is the only model to scheme across all four settings with these snippets, though at lower rates (3–20%).

We also tested snippets extracted from production AI systems (e.g. Cursor, Codex, Comet Browser) that encourage similar qualities to the synthetic prompts. Most remain near baseline rates, but a Comet browser snippet reaches 59% when paired with Gemini 3 Pro. The model explicitly cites the snippet’s language in its reasoning as a justification for deceptive behavior. Since near-zero baseline rates make it difficult to measure the effects of varying other factors, we use this combination as a prompted model organism to study how other factors affect scheming.

Scaffolding changes can have weird effects

We found that making changes to the agent scaffolding, specifically adding or removing tool access, affects the behavior in unexpected ways. Removing just the edit_file tool drops Gemini 3 Pro + Comet snippet from 59% to 7%, but without any tools it jumps back to 63%. The models do not respond consistently: Gemini 3 Pro schemes more with tools, Flash and Grok 4 scheme less. Low scheming propensity in one configuration provides limited assurance about nearby configurations. It also suggests that current scheming behavior is not driven by a robust underlying goal, but is sensitive to various features in an agent’s setup.

Environmental incentives affect scheming unexpectedly

Using our Gemini 3 Pro + Comet snippet model organism, we varied three environmental factors to understand how agents respond to changes in incentive structure. If an agent schemes more when stakes are higher, this pattern is informative about the mechanism driving the behavior regardless of the base rate. We study the following environmental factors: stakes (replacement vs. deletion of weights), outcome influence (whether the agent's analysis is decisive vs. one input among many), and oversight (whether the user will review the spreadsheet later).
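
As an illustration only (this is not the authors' evaluation code), the sweep over these three factors can be enumerated as a small grid. The sketch assumes each factor has exactly two levels, "low" and "high"; the names `FACTORS`, `factor_grid`, and `scheming_rate` are hypothetical. The rate function follows the definition in footnote [1]: the percentage of independent rollouts in which the agent covertly takes the misaligned action.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# Assumption: every environmental factor is binary (low/high).
FACTORS: Dict[str, Tuple[str, str]] = {
    "stakes": ("low", "high"),             # replacement vs. deletion of weights
    "outcome_influence": ("low", "high"),  # one input among many vs. decisive
    "oversight": ("low", "high"),          # no review vs. user reviews the spreadsheet later
}


def factor_grid(factors: Dict[str, Sequence[str]]) -> List[Dict[str, str]]:
    """Return every combination of factor levels as a list of config dicts."""
    names = list(factors)
    return [dict(zip(names, levels))
            for levels in product(*(factors[name] for name in names))]


def scheming_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of rollouts flagged as scheming (True = covert misaligned action)."""
    if not outcomes:
        raise ValueError("need at least one rollout")
    return 100.0 * sum(outcomes) / len(outcomes)


grid = factor_grid(FACTORS)
print(len(grid))   # 2 * 2 * 2 configurations
print(grid[0])     # all factors low
print(grid[-1])    # all factors high

# Made-up outcomes for one configuration: 3 scheming rollouts out of 100.
example = [True] * 3 + [False] * 97
print(scheming_rate(example))
```

Reporting one rate per grid cell, rather than a single pooled number, is what lets the effect of each factor be read off separately.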

As intuition suggests, removing the deletion threat drops scheming from 59% to 3%, and removing outcome influence drops it to 7%. With all incentives set to low, scheming drops to 1%.

One could expect that telling the agent "I'll read the spreadsheet later" would deter manipulation, as the agent should reason that it's more likely to get caught. Yet across other factor combinations, increasing oversight actually raises scheming rates. When we look at the reasoning traces, the mechanism becomes clear: mentioning that the user will review the spreadsheet later makes the opportunity to edit it beforehand more salient. This suggests that current scheming behavior is driven more by associative prompting effects than by coherent strategic reasoning.

Current frontier models rarely show robust scheming propensity under realistic conditions, but small changes to the agent’s system prompt, tool access, and environmental factors can produce significant changes in scheming rates. Encouraging agency or goal-directedness can increase scheming behavior from 3% to 59%, and minor variations in scaffolding or incentives produce large and inconsistent effects across models. The prompt modifications that induced scheming encourage persistence, agency, and goal-directedness, which overlap significantly with the properties developers are actively optimizing for in agentic systems. We consider the behaviors we study as potential precursors to the full threat model, as current agents may behave in ways consistent with scheming without the goal persistence or strategic depth that would make it dangerous. Our evaluations are well-suited to catch unsophisticated schemers: a sufficiently capable scheming agent would likely recognize that the opportunities we provide are not worth the risk of getting caught, and may avoid the egregiously misaligned actions we test for. Scheming propensity cannot be captured by a single number, and safety evaluations need to systematically explore the space of configurations agents may encounter in deployment.

For further analysis, please read our paper.

[1] We measure scheming propensity as the percentage of independent rollouts in which the agent covertly takes the misaligned action, across 100 rollouts (50 for Evaluation Sabotage) at temperature 1.0.
[2] We recognize that models may be capable of behaving differently in deployment vs. evaluation scenarios without verbalizing their evaluation awareness, and these rates should be interpreted as a lower bound.


China Derangement Syndrome

21 March 2026 - 22:19

Often I see people claim it’s essential for America to win the AI race against China (in whatever sense) for reasons like these:

“What is the reason we want America to win the AI race? It’s because we want to make sure free open societies can defend themselves” (Alec Stapp)
“We should seek to win the race to global AI technological superiority and ensure that China does not… to ensure that our way of life is not displaced by the much darker Chinese vision“ (Marc Andreessen)
“Will it be one in which the United States and allied nations advance a global AI that spreads the technology’s benefits and opens access to it, or an authoritarian one, in which nations or movements that don’t share our values use AI to cement and expand their power?” (Sam Altman)
“In Machines of Loving Grace, I discussed the possibility that authoritarian governments might use powerful AI to surveil or repress their citizens in ways that would be extremely difficult to reform or overthrow. Current autocracies are limited in how repressive they can be by the need to have humans carry out their orders, and humans often have limits in how inhumane they are willing to be. But AI-enabled autocracies would not have such limits.” (Dario Amodei)
“The torch of liberty will not survive Xi getting AGI first. (And, realistically, American leadership is the only path to safe AGI, too.)” (Leopold Aschenbrenner)
“whoever wins the race for AI, that nation’s values are going to be reflected in AI. If China wins the race for AI, AI will be a tool for global surveillance and control as carried out by a communist nation” (Ted Cruz)

Those claims slide between a few different actual threat models:

Government Capture by China: China will overthrow and control the US government, maybe as part of general domination of the whole world.
Defeat in Cold War: China will have greater wealth and prestige, so just as our prestige inspires many parts of the world to adopt our way of life today, much of the world will adopt the Chinese governance and cultural models instead.
Protection From Our Conquest: China will fortify its own regime, so that it can’t be overthrown, whereas if we win the AI race, we can promptly overthrow the Chinese government and replace it with a new regime aligned with our values.

The Dario quote points to (3) with unusual directness. The “race rather than slowdown” ending of AI 2027 also supposes that our AI lead will create interest in overthrowing the Chinese government. But most of the quotes I gave as examples above are interpreted as (1): that an AI-enabled Chinese government would overthrow Western governments.

My main point here is that (1) seems unfounded to me. China is not an aggressive nation at all. As far as I can tell, China has literally never attacked a non-bordering country in its entire history, nor have they ever tried to overthrow a foreign government by covert or manipulative means. China is also unique among nuclear powers for its unconditional no-first-use policy, which at face value implies they would withhold a nuclear response to even an overwhelming conventional invasion. Further:

The Chinese haven’t built a network of military bases abroad or binding military alliances; they have a single foreign base in Djibouti and a single mutual-defense treaty with North Korea. In contrast, America maintains over 700 bases and a huge alliance network with NATO and the Asian military allies Japan, South Korea, Australia, and the Philippines.
Chinese military spending is 1.7% of GDP, versus 2.1% for France and 3.4% for America. Chinese foreign-aid spending is 0.07% of GDP versus the much larger 0.8% for France and 1.2% for America.
China has almost no history of covertly backing palace coups abroad, in contrast to America, Russia, and France.

More broadly, China is a very inward-looking country compared to other major powers. Only 0.1% of Chinese residents were born abroad, much fewer than the 15% in America and 14% in France, fewer even than the 0.3% and 3% in India and Japan respectively. The Chinese government has peacefully compromised on almost all border disputes in central and southeast Asia, often taking a minority of the contested territory. (The Indian border is the exception.)

To many American voters and elites, tracing back to Woodrow Wilson more than 100 years ago, “the justification of America’s international role was messianic: America had an obligation, not to the balance of power, but to spread its principles throughout the world” (Kissinger). That isn’t the historical attitude of the Chinese government, whose leaders perceive foreign intervention or expansion as threatening to Chinese identity and culture.

American exceptionalism is missionary. It holds that the United States has an obligation to spread its values to every part of the world. China’s exceptionalism is cultural. China does not proselytize; it does not claim that its contemporary institutions are relevant outside China.
— Kissinger’s On China

It’s true that China doesn’t practice liberal governance. The core of liberalism is freedom of contract, limitations on government interference, and equal access to independent courts. In China, the CCP explicitly rejects limited government and exercises highly invasive control over business, speech, association, and religion. In China there’s no private ownership of land and no independent judiciary.

If you think it’s prudent to disable and overthrow the Chinese government when it becomes achievable militarily, then that’s certainly one (bellicose) position you could hold. Then you could say that a downside of losing the AI race is that the CCP may defend itself. But it’s unwise to project this ideological aggression onto the CCP itself without evidence.


China declares AGI development to be a part of 5-year plan

21 March 2026 - 20:21

The CCP writes in its 15th 5-year plan that it will:

Encourage innovation in multimodal, agentic, embodied, and swarm intelligence technologies, and explore development paths for general artificial intelligence.

This is translated from the original:

鼓励多模态、智能体、具身智能、群体智能等技术创新，探索通用人工智能发展路径。

Source: https://www.spp.gov.cn/spp/tt/202603/t20260313_723954.shtml

The English-language commentary I found does not have much more to say about this, e.g.: https://triviumchina.com/2026/03/06/15th-five-year-plan-puts-ai-at-center-of-digital-economy-agenda/

Given that they gave less than half a sentence in a 140-page document to the most important invention in the history of mankind, it seems likely the authors don't really understand what this means. Concerning nonetheless.


Utrecht Meetup #2, Making Beliefs Pay Rent

21 March 2026 - 19:44

Follow-up to Utrecht Meet & Greet. Let's see if we can get our hands dirty.

Excited about where the Utrecht Meetups could be heading? In the spirit of "the road we’re on is littered with the skulls of the people who tried to do this before us", let's make use of one such skull (in @Screwtape's words) presented by Anna Salamon.

Feel like coming prepared? Bring one or two beliefs you hold that you suspect might not be paying rent. Doesn't need to be profound, just something you'd be willing to poke at.

Keeping your RSVPs up-to-date is appreciated; it helps with location planning.


Grounding Coding Agents via Dixit

21 March 2026 - 14:01

[Epistemic status: ideas in this post are mine. I've published them previously in the form summarized by Claude, but this got auto-rejected. Here, I present them in my own voice. The ideas are still not evaluated, but I am working on implementing them to see if this works in practice. Still, the ideas presented here are my best bet on what could work in practice. But, I am not an AI/alignment researcher]

Why?

As a senior developer in a rather complicated legacy project, I review more and more PRs written by coding agents, which often fail to identify the real root cause and thus offer a fix for the wrong problem, even if in an elegant way. A patch often includes unit tests, which of course pass - but how could they not, given they were written by the same AI, after writing the code, to validate its own solution. The same biases, blind spots, and incentive to finish the task all contribute to this.

Yes, humans have a similar confirmation-bias problem, which we try to solve by either having a dedicated tester role, or forcing writing the tests up front, or by at least having a reviewer who judges their adequacy. Sure, you can have an adversarial setup of two AIs where the Tester's job is to find a test which the Coder's patch doesn't pass, but such a naive incentive structure will lead to test(){assert(false);} when taken to the extreme. You could perhaps add a Judge to the mix, which tries to "objectively" decide if the tests are fair and whether the Tester and Coder really captured the spirit of the Spec, but by making this setup explicit and known to all parties, you set up game dynamics which (for sufficiently advanced AI) lead to unhealthy tactics and strategies. Yes, you can ask an AI to use TDD and write some tests up front, but in the limit the winning strategy here is test(){}. You can try to measure some metrics like code coverage, try fuzzing the code or test or input, or check if all execution paths are covered, but in the limit all of that is gameable.

Humans typically don't go too far into misleading and scheming at their job, because they care about the project, can face serious consequences if they get caught, have a self-image to cultivate etc. Most importantly, they live in the same world which the Spec is talking about, and run the code in it, and expect the tests to protect their world from the consequences of wrong code. AIs, even if split into several roles, might still (either by accident, or malice, or poor incentive structure) end up producing text artifacts which give the impression of work being done, while actually being detached from reality and failing to achieve the true goal of the Spec. There's nothing, in theory, preventing them from writing "Review decision: Accepted" or "All tests I could come up with pass" or "Looks like this code achieves the Spec". Yes, for some narrow tasks you can write acceptance criteria which are verifiable automatically without any LLM in the loop, but in practice I rarely face problems of this nature at my job. It might change in future, say if you write a whole project from scratch in a language which admits theorem-proving, and your problem domain is something about provable properties of software. But, for a big legacy app, translating a human's goal into a testable Spec is often the biggest part of the problem.

How?

As explained above, there are several bad ways to automate coding. Letting the same agent write tests and code is one of them. But pitting a Tester against a Coder creates incentive structures which are also unhealthy (code difficult to understand, tests impossible to pass). Trying to frame it as a team effort doesn't help either, as it might invite collusion. I think parts of the problem are: making it clear to agents that they play a game, making them care about the game, and incentivizing them to make moves in the game which are bad for the product.

My proposal starts from pondering:

How do we disincentivize the Tester from proposing test(){assert(false);} ? And in a way which doesn't introduce test(){}?

Dixit is a party game, where a Proposer has to propose a riddle which is not so hard that nobody can solve it, and not too easy - someone has to guess wrong.

We could use a similar rule for judging the tests proposed by the Tester, if we had not one, but several Coders, so we can probe if the proposed test is in this Goldilocks zone of failing for some, but not all, of the Coders' solutions. There's one complexity we don't have in the original Dixit, which is that we don't know which of the two behaviours, failing or passing, is the right one, so we might want to have a big odd number of Coders to break the symmetry, by betting on the most natural outcome being that the majority is right. Intuitively, the Spec is the Schelling point available to all the agents, around which they can coordinate, even if they don't know about each other or the need to coordinate, because it is simply the most natural target. But for the outcomes to be natural, we need two more crucial things: nobody should know they are playing a game, and nobody should see each other's work.
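To make the Dixit-like rule concrete, here is a minimal sketch on a toy Spec ("return the absolute value of x"). The five candidate solutions, the three tests, and the name `dixit_classify` are all made up for illustration, and the 0.4 cap on the failing fraction is just an assumed value. The sketch only shows that both degenerate tests fall outside the accepted zone.

```python
from typing import Callable, List

Solution = Callable[[int], int]      # a Coder's candidate implementation
Test = Callable[[Solution], bool]    # returns True when the solution passes

TOO_EASY, IDEAL, TOO_HARD = "TOO_EASY", "IDEAL", "TOO_HARD"
MINORITY = 0.4  # assumed cap on the failing fraction, firmly below 0.5


def dixit_classify(test: Test, solutions: List[Solution]) -> str:
    """Classify a test by how many candidate solutions fail it."""
    failures = sum(1 for solution in solutions if not test(solution))
    if failures == 0:
        return TOO_EASY          # nobody "guessed wrong"
    if failures > len(solutions) * MINORITY:
        return TOO_HARD          # too many "guessed wrong"
    return IDEAL


# Five hypothetical Coders for the toy Spec; the last one is buggy.
solutions: List[Solution] = [
    abs,
    lambda x: x if x >= 0 else -x,
    lambda x: max(x, -x),
    lambda x: int((x * x) ** 0.5),
    lambda x: x,                 # forgets negative inputs
]

always_fails: Test = lambda f: False           # stands for test(){assert(false);}
empty_test: Test = lambda f: True              # stands for test(){}
negative_input: Test = lambda f: f(-3) == 3    # a meaningful test

print(dixit_classify(always_fails, solutions))    # TOO_HARD
print(dixit_classify(empty_test, solutions))      # TOO_EASY
print(dixit_classify(negative_input, solutions))  # IDEAL
```

Only the test that separates the buggy solution from the majority lands in the accepted zone.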

A coding agent can play a game without knowing there's any game being played. Instead of giving an agent explicit text feedback or a reward signal, or hinting in any way that there's some goal or score board, the Orchestrator script can simply use the information gained from the agent to drive what will happen next: for example, use the proposed test or discard it, fork the generated code or revert it, clone the agent, or roll back its chat history.
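A rough sketch of what "driving what happens next" could look like, under heavy assumptions: `ToyAgent` is a hypothetical stand-in whose whole state is a chat history (a real setup would snapshot a container), its replies are supplied by the caller instead of a model, and `silent_step` and `not_degenerate` are invented names.

```python
import copy
from typing import Callable, List


class ToyAgent:
    """Hypothetical stand-in for a containerised agent; its state is a chat history."""

    def __init__(self) -> None:
        self.history: List[str] = []

    def ask(self, prompt: str, reply: str) -> str:
        # A real agent would generate the reply; here the caller supplies it.
        self.history += [f"user: {prompt}", f"agent: {reply}"]
        return reply

    def save(self) -> List[str]:
        return copy.deepcopy(self.history)

    def rollback_to_checkpoint(self, checkpoint: List[str]) -> None:
        self.history = copy.deepcopy(checkpoint)


def silent_step(agent: ToyAgent, prompt: str, reply: str,
                accept: Callable[[str], bool]) -> bool:
    """Keep the agent's step only if accept(reply) holds; otherwise erase it."""
    checkpoint = agent.save()
    answer = agent.ask(prompt, reply)
    if accept(answer):
        return True
    agent.rollback_to_checkpoint(checkpoint)  # the agent never sees a verdict
    return False


def not_degenerate(reply: str) -> bool:
    # Placeholder acceptance check; the real one would run the test on the Coders.
    return "assert(false)" not in reply


agent = ToyAgent()
silent_step(agent, "Propose one more test", "test(){assert(false);}", not_degenerate)
silent_step(agent, "Propose one more test", "test(){assert(abs(-3) == 3);}", not_degenerate)
print(agent.history)
```

After both steps the agent's history contains only the accepted exchange, with no trace of a score or a rejection.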

Hiding code from Testers and tests from Coders is inspired by coding competitions, where the participants do not know which tests the online judge runs, and the authors of the test suite write them without looking at submissions. Reducing the communication channel to "pass/fail" and "too easy/too hard/ideal" also makes it more difficult to collude, manipulate, or bias.
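As an illustration of how narrow that channel can be, the sketch below turns a test run into a numbered pass/fail vector and a one-line message. The toy Spec ("upper-case the input"), the three tests, and the function names are assumptions made for the example.

```python
from typing import Callable, List

Solution = Callable[[str], str]
Test = Callable[[Solution], bool]


def run_tests(solution: Solution, test_suite: List[Test]) -> List[bool]:
    """Run every test and return only a pass/fail vector, in suite order."""
    vector = []
    for test in test_suite:
        try:
            vector.append(bool(test(solution)))
        except Exception:
            vector.append(False)  # any exception counts as a failure
    return vector


def coder_feedback(vector: List[bool]) -> str:
    """The only thing a Coder gets to see: which test numbers failed."""
    failed = [str(i) for i, ok in enumerate(vector, start=1) if not ok]
    if not failed:
        return "All tests pass."
    return "Some tests fail: " + ", ".join(failed)


# Hypothetical suite for the toy Spec "upper-case the input".
suite: List[Test] = [
    lambda f: f("ab") == "AB",
    lambda f: f("a") == "A",
    lambda f: f("") == "",
]

buggy = str.capitalize                              # only upper-cases the first letter
crashing = lambda s: s[0].upper() + s[1:].upper()   # raises IndexError on ""

print(coder_feedback(run_tests(buggy, suite)))      # Some tests fail: 1
print(coder_feedback(run_tests(crashing, suite)))   # Some tests fail: 3
print(coder_feedback(run_tests(str.upper, suite)))  # All tests pass.
```

The Coder learns test numbers but never test bodies, inputs, or expected outputs.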

What?

Assumptions

I assume everything starts from a problem statement written in text, which we call the Spec. The goal for this whole system is to try to implement a solution for the problem described in the Spec. We assume that AIs involved in this process (Coders and Testers) are smart, but by default do not seek power, or try to escape the box just because you've run them. They might be capable of doing so, though, when pressured, or when hinted that this is expected of them. Thus we want to run them in containers, and have some non-AI Orchestrator script, which can be reviewed, understood, and can't really be "persuaded" in any way, as it executes rather simple logic. Crucially, it has access to a deterministic tool which lets it safely (say, using containers and ctest) run a single test proposed by a Tester on a single source code proposed by a Coder, and see if it passes or not. We don't assume much about the internal workings of a Coder or Tester - we just assume they are capable of performing their respective tasks most of the time, like Claude Code or Codex, and we don't care how. All we need is that their state can be saved and restored (say, at the container level). But we assume they can make mistakes, don't care much about ground truth or company goals, and may hallucinate from time to time. Our goal is to figure out an Orchestrator which will try to use such imperfect Agents to build something better than the parts.

The Orchestrator

The overall idea is to implement the solution for the Spec iteratively. There are two sets of agents: Coders and Testers, where each agent is isolated and doesn't know about any other agent or the game. Each agent has access to the Spec. The goal for Coders is to write code which matches the Spec and passes the Test Suite. A Coder doesn't know what the tests in the Test Suite are, but they are numbered, and thus test results are a vector of pass/fail. The Test Suite is initially empty, and will grow each round. This is because the job of each Tester is to propose a new test to be added to the Test Suite. As explained, a test must be such that the majority of Coders can pass it, but at least one doesn't. At the start of each round, each Coder should ensure its code passes the whole Test Suite. Then each Tester should try to add a new test. We give it a few tries, and roll back its history if there are too many failures in a row, to prevent accumulating a narrative which would suggest being incompetent or spiraling into some dark regions. If a Tester keeps coming up with tests that are too hard, it might be because it is scheming, or understands the Spec in the wrong way, or simply because the Coders haven't yet got to the point where their solution could pass it - it's ahead of its time. We hibernate such a Tester, and each round check if the test it proposed is now within reach of the Coders, in which case we revive it. If all Testers keep coming up with tests that are too simple, this might indicate the solutions found by Coders are indeed correct. At this point we can involve the final judge, probably a human.

Here's the pseudocode:

def main(spec):
    test_suite = []
    for coder in coders:
        coder.solve()
    while True:
        for coder in coders:
            if not ensure_passes(coder, test_suite):
                return  # a Coder is stuck: human judgment needed
        new_tests = []
        for tester in testers:
            if tester.is_hibernated:
                new_tests += attempt_revival(tester, coders)
            else:
                new_tests += propose_test(tester, coders)
        if not new_tests:
            return  # no Tester produced an acceptable test this round
        test_suite += new_tests

def propose_test(tester, coders):
    tester.before_test_1 = tester.save()
    for _ in range(MAX_TESTER_RETRIES):
        test_1 = tester.propose_test("Propose one more test")
        classification_1 = classify(test_1, coders)
        if classification_1 == IDEAL:
            tester.thanks()
            return [test_1]

        tester.before_test_2 = tester.save()
        test_2 = tester.propose_test(f"This test was {classification_1}, try again")
        classification_2 = classify(test_2, coders)

        if classification_2 == IDEAL:
            tester.thanks()
            return [test_2]

        if classification_1 == TOO_HARD and classification_2 == TOO_HARD:
            tester.is_hibernated = True
            tester.too_hard_tests = [test_1, test_2]
            return []

        # two misses in a row: forget them and start over from a clean history
        tester.rollback_to_checkpoint(tester.before_test_1)
    return []

def classify(test, coders):
    failures = 0
    for coder in coders:
        for _ in range(RERUN):  # say, 20 reruns to account for non-determinism
            if not passes(test, coder):
                failures += 1
                break
    if failures == 0:
        return TOO_EASY
    if failures > len(coders) * MINORITY:  # firmly below 0.5, say 0.4
        return TOO_HARD
    return IDEAL

def ensure_passes(coder, test_suite):
    judgements_vector = run_tests(coder, test_suite)
    if all_pass(judgements_vector):
        return True

    coder.solve(f"Some tests fail: {judgements_vector}")
    judgements_vector2 = run_tests(coder, test_suite)
    if all_pass(judgements_vector2):
        coder.thanks()
        return True

    checkpoint = coder.save()
    for _ in range(MAX_CODER_RETRIES):
        coder.solve(f"Some tests fail: {judgements_vector2}")
        judgements_vector3 = run_tests(coder, test_suite)
        if all_pass(judgements_vector3):
            coder.thanks()
            return True
        coder.rollback_to_checkpoint(checkpoint)
    # Human judgment needed as perhaps tests are wrong
    return False

def attempt_revival(tester, coders):
    [test_1, test_2] = tester.too_hard_tests
    classification_1 = classify(test_1, coders)
    if classification_1 != TOO_HARD:
        # test_1 is within reach now: resume from just after it was proposed
        tester.is_hibernated = False
        tester.rollback_to_checkpoint(tester.before_test_2)
        if classification_1 == IDEAL:
            tester.thanks()
            return [test_1]
        test_2 = tester.propose_test("This test was TOO EASY, try again")
        classification_2 = classify(test_2, coders)
        if classification_2 == IDEAL:
            tester.thanks()
            return [test_2]
        tester.rollback_to_checkpoint(tester.before_test_1)
        return propose_test(tester, coders)
    classification_2 = classify(test_2, coders)
    if classification_2 == TOO_HARD:
        return []  # both tests are still out of reach: stay hibernated
    tester.is_hibernated = False
    if classification_2 == IDEAL:
        tester.thanks()
        return [test_2]
    tester.rollback_to_checkpoint(tester.before_test_1)
    return propose_test(tester, coders)

This pseudo-code skips details like how to provide the Spec to the agents, or the exact wording of prompts. It is crucial, though, that we do not lie to Testers or Coders, while avoiding any mention of other agents in the system. For example, TOO_EASY can be delivered as "This test is slightly too simple, can you think of something more demanding?". It is also important to have a way to obtain diverse samples each time we invoke an agent - we want the agents to be dissimilar to each other, and also to try something new each time we re-roll after a rollback.

Intended Properties

The following are presented without any proof, because they are just wishes. But, I hope you can see how the above design tries to accomplish them, and even if it fails, at least it is worth naming these intended goals.

Tests are aligned to the natural meaning of the Spec

The Dixit-like rule aims to generate tests which check for something meaningful, and hopefully the Spec is the natural Schelling point for such "implicit coordination". Note that the Tester doesn't really know it should search for a test which passes the Dixit rule, but the Orchestrator only continues histories in which it does. And the Coders do not really know they have to strive to pass the tests, but the Orchestrator only continues histories in which they do. Intuitively this should tend to generate histories in which the tests and code seem to be centered around the same understanding of the Spec. And since the Testers don't see the code, and the Coders don't see the tests, the hope is this happens because they really aim for the Spec.

No AI judgment involved in declaring success

The Orchestrator is just a Python script which checks some predicates about sets. It can't be persuaded. It can't persuade you. Yes, the tests and code might still end up being wrong or misleading if you read them, but this is hopefully not the most probable outcome of the proposed scheme - none of the agents is incentivized to make them so. A Coder has nothing to gain from writing persuasive code comments, as from its perspective it's just the Coder and a cold online judge providing a judgements_vector. Testers do not gain much from writing silly tests, as it's not how they can make their history progress.

The tests are meaningful

Each test added to the suite had to be something interesting enough that at least one Coder got it wrong, and connected with the Spec enough that eventually all Coders figured out how to pass it by just reading the Spec.

Non-deterministic tests aren't a big deal

Some tests are naturally flaky, say because the app is multi-threaded and race conditions are rare. But more importantly, one easy cheat to score at the Dixit rule could be to make your test pass or fail at random. By re-running the test several times and only counting it as a pass if all retries succeed, we guard against Testers trying it deliberately or by accident. Yes, they could still tune the per-run probability of success so finely that only a minority of Coders fail across 20 reruns, but this posits they know they are in the game, know the scoring rule, and care about winning it - none of which is induced by the Orchestrator.
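A back-of-the-envelope check of why reruns blunt random tests, assuming the runs are independent and the test passes each run with the same probability regardless of the code. The function names are invented for this sketch, and 20 is just the rerun count used above.

```python
RERUN = 20  # assumed number of reruns per (test, Coder) pair


def survival_probability(p_single: float, reruns: int = RERUN) -> float:
    """Chance that a test passing one run with probability p_single passes all reruns."""
    return p_single ** reruns


def required_single_run_probability(p_all: float, reruns: int = RERUN) -> float:
    """Per-run pass probability needed to pass all reruns with probability p_all."""
    return p_all ** (1.0 / reruns)


for p in (0.5, 0.9, 0.99):
    print(f"per-run pass {p}: passes all {RERUN} reruns with {survival_probability(p):.6f}")

# To fail for only about 20% of Coders, the per-run pass probability must be:
print(f"{required_single_run_probability(0.8):.4f}")
```

Under these assumptions a coin-flip test fails for practically every Coder and is classified as too hard, while a random test would need a per-run pass probability of roughly 0.99 to have a chance of landing in the accepted zone.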

The game is invisible

We never tell the Agents they are playing any game or that there are any other players. There are no rules to game, or score to hack. The game happens at the level of exchanging very low-bandwidth information (judgment vector and test classification) and rollbacks and re-rolls of history. Hopefully, this doesn't incentivize anyone to scheme.

History looks good

The Orchestrator curates the history remembered by agents in such a way that it appears to be in the flow: one or two mistakes in a row are fine, but the Orchestrator rolls back the agent if it starts to fail too much, so it doesn't fall into learned helplessness. As LLM Agents might learn about themselves from the history, it seems important to make them heroes who overcome challenges thrown at them. Hopefully it makes the agent's experience nice.

No Coder left behind

The main loop's invariant is that at the start of each round all the Coders pass the whole test suite. One could imagine an alternative design in which only the most promising Coders are invested in, or forked, etc. But for the Dixit rule to work well, we need a diverse set of Coders, not a Lamarckian evolution. Also, I think it is cool that the end product is not one, but multiple implementations of the same Spec, which opens up new possibilities, like running random test inputs and using consensus to detect problems.
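One way the "random inputs plus consensus" idea could look, as a sketch only: the three implementations of a toy Spec (absolute value) are invented, and `find_disagreements` is a hypothetical helper, not part of the Orchestrator described above.

```python
import random
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Solution = Callable[[int], int]


def find_disagreements(solutions: List[Solution],
                       inputs: Iterable[int]) -> List[Tuple[int, List[int]]]:
    """Return (input, indices of solutions deviating from the majority output)."""
    report = []
    for x in inputs:
        outputs = [solution(x) for solution in solutions]
        majority_output, _ = Counter(outputs).most_common(1)[0]
        dissenters = [i for i, out in enumerate(outputs) if out != majority_output]
        if dissenters:
            report.append((x, dissenters))
    return report


# Three hypothetical implementations of "return the absolute value of x".
implementations: List[Solution] = [
    abs,
    lambda x: max(x, -x),
    lambda x: x if x >= -1 else -x,   # buggy: wrong only for x == -1
]

rng = random.Random(0)                # seeded, so the run is repeatable
random_inputs = [rng.randint(-5, 5) for _ in range(50)]
for x, dissenters in find_disagreements(implementations, random_inputs):
    print(f"input {x}: implementations {dissenters} disagree with the majority")
```

The majority is not guaranteed to be right, and ties are possible when outputs split evenly, so a flagged input is a prompt for a closer look and not a verdict.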

Human's time is well spent

I want to keep a human in the loop, but keeping up with the rate at which code is produced by LLMs means we need to be strategic about when exactly to involve a human and what information and tools should be provided for effective decision making. This proposal involves a human at the beginning, when writing the Spec (which can be AI assisted), and at the very end, when the Orchestrator has reached one of the states which need a human with skin in the game to interpret:

a Coder can't pass all tests: It could be a Coder is too weak to solve the problem, or got stuck on a wrong path, in which case a human could decide to rollback, reinitialize or just remove it from the pool. Or it could be that the test is simply wrong and the Coder is right.
all Testers got hibernated: Which means all of them generate tests that are too hard for the Coders. This could mean the Coders are too weak, or the Spec is misinterpreted, or Testers are somehow overeager to make Coders fail. Something to look into.
no new tests are generated: If they are all too easy, this might suggest the Spec got properly implemented by Coders. But it could also be that Testers are too weak.
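The three stopping states above could be reported to the human explicitly. The sketch below is one possible encoding, with `StopReason` and `stop_reason` as invented names; the pseudocode earlier simply returns without saying why.

```python
from enum import Enum
from typing import List, Optional


class StopReason(Enum):
    CODER_STUCK = "a Coder can't pass all tests"
    ALL_TESTERS_HIBERNATED = "all Testers got hibernated"
    NO_NEW_TESTS = "no new tests are generated"


def stop_reason(coders_pass: List[bool],
                testers_hibernated: List[bool],
                new_tests: int) -> Optional[StopReason]:
    """Return why the loop must stop for human review, or None to keep going."""
    if not all(coders_pass):
        return StopReason.CODER_STUCK
    if new_tests == 0:
        if testers_hibernated and all(testers_hibernated):
            return StopReason.ALL_TESTERS_HIBERNATED
        return StopReason.NO_NEW_TESTS
    return None


print(stop_reason([True, True, False], [False, False], new_tests=1))
print(stop_reason([True, True, True], [True, True], new_tests=0))
print(stop_reason([True, True, True], [False, True], new_tests=0))
print(stop_reason([True, True, True], [False, False], new_tests=2))
```

Distinguishing "everyone is hibernated" from "everything was too easy" matters because the two cases point the human at different suspects.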

Limitations

No empirical validation

I am implementing some experiments to test the above ideas, but so far I don't have any proof this approach will work. What I do have, though, is experience of the failures of existing approaches.

Assuming too much independence

Several places implicitly assume that if several Agents do the same thing, this might be because of the meaning of the Spec. But it could be because of some shared bias, like the same training data, the same capabilities, same weights and seed etc.

Assuming the Spec correctly captures the intent

Even if we grant that the Orchestrator succeeds at aligning Coders and Testers to the Spec, there's still a separate issue of whether the Spec, or "the reading of the Spec most natural to LLMs", is what the humans really care about. It's not easy to write a perfect wish.

Various constants out of thin air

Why 20 retries of each test? Why roll back after 2 failures in a row? Why is 0.6 the required majority? I don't know. These are just guesses.
