What's a good methodology for "is Trump unusual about executive overreach / institution erosion?"
Critics of Trump often describe him as making absolutely unprecedented moves to expand executive power, extract personal wealth, and impinge on citizens’ rights. Supporters counter that Trump’s actions are either completely precedented, or are the natural extension of existing trends that the media wouldn’t make a big deal over if they didn’t hate Trump so much.
In some recent posts, some people have been like "Wait why is there suddenly this abrupt series of partisan LW posts that are taking for granted there is a problem here that is worth violating the LW avoid-most-mainstream-politics norm?".
My subjective experience has been "well, me and most of my rationalist colleagues have spent the past 15 years mostly being pretty a-political, were somewhat wary but uncertain about Trump during his first term, and the new set of incidents just seems... pretty unprecedentedly and scarily bad?"
But, I do definitely live in a bubble that serves me tons of news about bad-seeming things that Trump is doing. It's possible to serve up dozens or hundreds of examples of a scary thing per day, without that thing actually being objectively scary or abnormal. (See: Cardiologists and Chinese Robbers)
Elizabeth and I wanted to get some sense of how unusual and how bad Trump’s actions are. “How bad” feels like a very complex question with lots of room for judgment. “How unusual” seemed a bit more likely to have an ~objective answer.
I asked LLMs some basic questions about it, but wanted a more thorough answer. I was about to spin up ~250 subagents to go run searches on each individual year of American history, querying for things like “[year] [president name] ‘executive overreach’” or “[year] [president name] ‘conflict with Supreme Court’”, and fill up a CSV with incidents.
That seemed… like it was approaching a methodology that might actually be cruxy for some Trump supporters or Trump-neutral-ers.
It seemed like maybe good practice to ask if there were any ways to operationalize this question that’d be cruxy for anyone else. And, generally pre-register it before running the query, making some advance predictions.
Each operationalization I’ve thought of so far seems a bit confused/wrong/incomplete. I feel okay with settling for “the least confused/wrong options I can come up with after a day of thinking about it," but, I'm interested in suggestions for better ones.
Some examples so far that feel like they're at least relevant:
- How many incidents will an LLM find of a president ignoring a court order?
- How many executive orders did they issue?
- How many pardons did they grant?
- What was their wealth level before and after serving (perhaps normed by economic growth, or wealth change of congressmen)?
- How many troops deployed without Congressional authorization?
- How many incidents drew heavy criticism for executive overreach without fitting neatly into a specific category?
My own personal goal here is not just to get a bunch of numbers, but also to get a nice set of sources for examples that people can look over, read up on, and get a qualitative sense of what's going on.
These questions all have the form "checking if allegations by detractors about Trump are true", which isn't necessarily the frame by which someone would defend Trump, or the right frame for actually answering the question "is the US in a period of rapid decline in a way that's a plausible top priority for me or others to focus on?"
I'm interested in whether people have more suggestions for questions that seem relevant and easy to check. Or, suggestions on how to operationalize fuzzier things that might not fit into the "measure it per year" ontology.
Appendix: Subagents Ahoy
A lot of these are recorded in places that are pretty straightforward to look up, e.g. there are already lists of pardons per president and executive orders per president.
But, I have an AI-subagent process I'm experimenting with that I expect to use for at least some of these, which currently goes something like:
- Have a (dumb) Cursor AI agent make one spreadsheet of "incidents" where rows are "years of US history", there's a column for "what incident type are we talking about?" [pardon, executive order, etc], and a column for "specific incident" and a column for "source url."
- Have the AI run websearches shaped like "[Year] [President] 'ignored court order'" and "[Year] [President] 'conflict with court'", etc. (I might actually go "[month] [year]" to get more granular results.) A rough sketch of this step appears after the list.
- Give the AI a python script which downloads the full content of each resulting page that seems relevant.
- Spin up AI instances that look at each page, check the corresponding year on the spreadsheet and see if there is already an incident there that matches that topic. If so, give it the same incident name in a new row with a new source.
- After having accumulated all that data, another set of AIs look over each year, check for duplicates, look at whether a given source seems likely-to-be-real, etc, while extracting out the key quote from each that states the claim, copying it verbatim rather than summarizing it.[1]
- Compile that all into another spreadsheet with the total incidents for each.
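To make the query-generation step concrete, here is a rough, hypothetical sketch of what it might look like in Python. The presidents-by-year input and the run_websearch() callable are placeholders I'm inventing for illustration, not an existing dataset or API.

```python
# Hypothetical sketch of the websearch step: build per-year search queries and
# append candidate incidents to a CSV. `presidents_by_year` and `run_websearch`
# are placeholders, not real data or a real API.
import csv

QUERY_TEMPLATES = [
    '{year} {president} "ignored court order"',
    '{year} {president} "conflict with court"',
    '{year} {president} "executive overreach"',
]

def build_queries(presidents_by_year: dict[int, str]) -> list[tuple[int, str]]:
    """One (year, query) pair per template per year of US history."""
    return [
        (year, template.format(year=year, president=president))
        for year, president in presidents_by_year.items()
        for template in QUERY_TEMPLATES
    ]

def record_hits(queries, run_websearch, out_path="incidents.csv"):
    """Run each query and append candidate incidents to the spreadsheet."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["year", "incident_type", "specific_incident", "source_url"])
        for year, query in queries:
            for hit in run_websearch(query):   # assumed to return dicts with title/url
                writer.writerow([year, "court conflict", hit["title"], hit["url"]])
```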
- ^
I have a pet theory that leaning on exact quotes rather than summaries avoids having to trust the AI's summarization.
Discuss
Thinking from the Other Side: Should I Wash My Hair with Shampoo?
This article is a thought experiment based entirely on personal experience and observations.
A few years ago, I had long, thick hair. I used the same shampoo for many years and never experienced any hair loss or damage. That is, until my mother forced me to change my shampoo, at which point my hair structure completely deteriorated. It was no longer as thick as before, it wasn't growing as fast as it used to; it was falling out a lot and becoming sparse. For a few years, I tried different shampoos to combat this, even sought advice from a few people in the dermocosmetic industry, but to no avail—my hair was completely ruined. I wasn't bald yet, but it was already on its way.
However, a thought occurred to me recently:
How significant was shampoo in the 11,000-year history of humanity, and did human hair really need shampoo? Hair would get dirty, hair would get oily, but a little warm water could handle that. Why should I put a ton of chemicals on my head unless a bird pooped on it? My hair was already in bad enough shape; what's the worst that could happen if I didn't use shampoo on it?
It's been a week since I last washed my hair with shampoo, and the results are incredible. My hair is still shedding a little, but it's starting to resemble its former thick state again, it's wavy once more, and the layers in my hair are incredibly defined. Every time I look in the mirror, I think about how this gamble paid off!
So, is this really about human history and shampoos? The short answer is no, but I’ll start explaining the long answer and the reasons behind it now.
When I first thought of this idea, I even laughed at myself, but then it started to bother me. Why not? All I had to do was not use shampoo on my hair once or twice; it was a very simple experiment. If the result of this experiment was negative, meaning my hair got worse, I would continue using the shampoo I was using and accept my fate; if it was positive, I would see what happened then. The result was quite good; I would no longer use shampoo unless a bird pooped on my hair. Of course, there was some anxiety about not being able to foresee the outcome of the positive possibility, but taking that risk gave me a certain confidence at some point:
The excitement of being able to ask questions whose outcomes I couldn't foresee and finding answers that, even if negative, offered a perspective outside the mainstream.
This confidence wasn't just something internal; it was starting to show in how I acted outwardly. The thought of 'what if it happens' or 'what if it doesn't' was in my head, and these ideas created incredible excitement in me. Even if I could predict the outcome, I developed a desire to see the answer with my own eyes. Whether my prediction was right or wrong, trying the other option and seeing what it could bring me began to cause incredible storms within me. My clear answer to questions like "What if this happened?" became "We won't know until we try."
Expectations for answers didn't have to come only from events; people's answers were also a guessing game. The biggest difference this self-confidence created was the desire to ask people questions without hesitation and to get their opinions without hesitation. I would ask my question; if they made fun of me, I would laugh it off, and if they took it seriously, I would pursue it. Either way, I got an answer, positive or negative, and I knew what path to take based on the result. The desire to apply this to every question fuelled my curiosity. But I had to limit myself at some point so that this excitement didn't fade too quickly. Obviously, I couldn't bombard people with questions!
I witnessed how a question that initially sounded ridiculous, like "What if I didn't use shampoo?", has significantly changed my way of thinking today. I suppose there's no need to wait for a big moment for big changes; even the simplest question can lead to the most complex answer.
Discuss
Claude Code is Too Cloudy
Running Claude Code locally is annoying since you have to deal with permissions and agents interfering with each other (and you have to be at your computer), but running Claude Code on the web is annoying because the cloud environment is so limited[1].
What if we could run Claude Code for the web but on our machines? Through the magic of Claude Code writing Claude Code code, I made a local app for this.
Announcing Clawed Burrow[2]: A web app you can run on your own home computer which runs Claude Code without permission prompts in ephemeral containers, and with the ability to install packages, run containers, use caches, and access the GPU.
Claude training a toy model using a GPU.
Permissions and Sandboxing
The runners can probably do whatever they want with the permissions of the host user they run as. There is no network sandboxing whatsoever, and an attacker can potentially convince Claude to upload any files it can see.
Claude is running with --dangerously-skip-permissions, and it has a Podman user-level socket passed in from the host to the runner container. A Docker-like socket is sufficient to view all files owned by the user it runs as, which is why we don't give it a root-level socket.
For additional safety, you can run Clawed Burrow as an unprivileged user separate from your normal user account. You can sandbox this even further with systemd but I think realistically the worst thing an attacker could convince Claude to do is exfiltrate files, which you can prevent by ensuring Claude can't read your normal user's files.
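To make the setup concrete, here is a heavily hedged Python sketch of how a runner like this could be launched. The image name, socket path, and flags are illustrative assumptions, not Clawed Burrow's actual implementation.

```python
# Hypothetical sketch of launching an ephemeral runner container; the image
# name, socket path, and UID are illustrative, not Clawed Burrow's real config.
import subprocess

RUNNER_IMAGE = "localhost/clawed-burrow-runner:latest"   # assumed image name
PODMAN_SOCKET = "/run/user/1001/podman/podman.sock"      # the sandbox user's socket

subprocess.run([
    "podman", "run", "--rm",                    # ephemeral: removed when it exits
    "--userns=keep-id",                         # stay an unprivileged user inside
    "-v", f"{PODMAN_SOCKET}:/run/podman.sock",  # pass the user-level socket through
    "-e", "CONTAINER_HOST=unix:///run/podman.sock",
    "--device", "nvidia.com/gpu=all",           # GPU via CDI, if configured on the host
    RUNNER_IMAGE,
    "claude", "--dangerously-skip-permissions", # Claude Code with no permission prompts
], check=True)
```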
Anthropic Subscriptions
By default this uses whatever authentication you have configured in the user's ~/.claude, which means it does use Claude subscriptions. I think we're allowed to use subscriptions instead of API keys for this, since it's really just an elaborate tmux session running the real version of Claude Code served over HTTP, but Anthropic has a sort-of confusing policy around this so I guess we'll see.
If you work at Anthropic and don't like this, please let me know.
Features
GPU support
Presumably Anthropic doesn't offer GPUs since they're expensive, but I already have one and want to be able to use it.
Claude can actually see and use my GPU for ML training.
Docker
Docker support is actually through Podman, which lets us run as a normal user instead of root.
Claude running a Docker container without root on my machine.
Gradle (Android)
I expect this to be fixed in real Claude Code one day, but their current network setup breaks Gradle in a way that I can't find any workaround for.
Claude is surprisingly OK at writing working Android code without being able to run the linters or unit tests, but it's a lot more consistent if it can.
Remote Access
Since this exposes local access to your computer (even if we do try to sandbox it), I was pretty paranoid about security, so I'm using Tailscale for remote access. To actually log into this, you need to be on the Tailscale VPN and have a password.
Anyway, wrapping another binary with multiple levels of containers is complicated and this isn't the most reliable code I've ever worked on, but I figured I'd post about this since it's incredibly useful despite the warts and maybe other people will find it interesting too.
- ^
Things you can't do in Claude Code's cloud environment:
- Run Gradle at all (i.e. no Android apps)
- Cache dependencies
- Build or run Docker images
- Install packages with apt
- Use a GPU for ML
- ^
Claude suggested that a burrow was very far from a cloud, and I had a good logo idea.
Discuss
In Defense of Memorization
TLDR: Western education creates a false dichotomy between memorization and understanding. I believe we should expect both. Having facts readily available in your brain (not just "Google-able") enables real-time bullshit detection, helps you calibrate who to trust, holds your own beliefs accountable, and provides the raw material for insight and critical thought. I offer some concrete suggestions (spaced repetition via Anki, tracking unfamiliar terms, connecting new facts to existing knowledge, etc.). Rationalists need to be careful to not focus purely on epistemics. We also need lots of knowledge. There's no way around memorization.
I believe memorization is unfairly maligned. It is on the shortlist of things I think are required for becoming a rational intellectual. Besides curiosity, these things are:
Good epistemics: a reliable process for obtaining, vetting, and updating your knowledge. How do you know a claim is true? That a study is well-designed? That an observation licenses a general induction? You need to recognize and avoid fallacies and cognitive biases, understand the probabilistic nature of knowledge, follow complicated chains of reasoning, responsibly evaluate both qualitative and quantitative evidence, etc.
Good knowledge. You need a wide range of properly-vetted, high-confidence information readily available in your mind. This includes brute facts (When was the Song Dynasty? What is silicate weathering?) and contested knowledge (Why did the Song Dynasty collapse? Will silicate weathering slow with climate change?). The key phrase here is “readily available”—these are not facts you could understand if you looked them up, but knowledge actually present in your brain. These are facts available to be thought with, not merely comprehended.
Intelligence. You can have excellent knowledge and rigorous epistemics but lack the ability to do anything interesting with them. You need the spark that connects disparate ideas, sees patterns, generates novel solutions. Creativity, insight, synthesis.
"Being intelligent" as an ingredient for being a good intellectual is so obvious that it’s almost trivial. Similarly, in the culture I grew up in (and on a blog about rationality...), good epistemics needs no theoretical defense as part of education. Every institution I’ve ever attended emphasized "fostering critical thinking" as its central goal. They may not have taught epistemics particularly well (or at all), but at least it was valorized. I understand this isn't universal—friends from Latin America, India, and elsewhere tell me large parts of their education was based on pure rote memorization, and critical thinking was sometimes even actively discouraged. Obviously, this is bad. If I’d been educated in one of those systems, this essay would probably be titled “In Defense of Critical Thinking."
But I wasn't educated in Latin America, India, or elsewhere. I was educated in (wealthy) schools in America and the UK. There, "good knowledge" — the actual retention of factual information — is surprisingly neglected as an ingredient of education.
This sounds counterintuitive. What teacher would claim that knowledge acquisition is unimportant? But I think if you called “the acquisition of knowledge that you retain and can readily access” by its pithier title, “memorization,” the discussion immediately becomes more contentious. How often have teachers told you “I don’t need you to memorize this material, I just want you to understand it.”
In the US and UK, at least, “memorization” has become synonymous with the worst kind of rote learning: students committing tracts of information to memory without understanding how to use those facts. Memorizing synopses of Shakespeare without reading the plays, reciting Gauss’s equations without understanding electromagnetism, etc. I agree this is bad education, and that it happens constantly. In fact, when people bring up this critique of memorization, they're often surprised by the extent to which I immediately agree with them. I often reference this classic essay in the sequences about memorizing passwords, about how much of schooling, even in the West, is essentially an elaborate memorization ritual where students guess what the teacher wants to hear (“light is both a wave and a particle!”) without truly understanding what they’re saying.
I would much prefer medical students spend more time developing clinical reasoning, learning bedside manner, and understanding how healthcare systems actually work, rather than memorizing minutiae about the nephron. Especially if they have no intention of becoming urologists.
But people use this common failure to construct a false dichotomy between memorization and understanding. They believe rote memorization is necessarily the enemy of critical thought, that anyone memorizing large amounts of information is no better than a robot, or a fool who hasn’t realized that any fact is just a Google search away. Why memorize the capitals of the world when you carry the sum of human knowledge in your pocket?
I think the critics of memorization have gone too far. I think we should have a much greater expectation, both in school and of ourselves, to actually know things. To have facts in our brains, not just opinions. Here are a couple reasons why.
Memorized facts let you detect bullshit in real time
I was in a lecture last October by an Oxford professor, a biologist who specializes in occupancy modeling. Essentially, he uses math and camera trap data to create spatial models that predict where certain animals are likely to be. He was discussing the odds that a large carnivore in Southeast Asia would soon go extinct, when he claimed that “urban sprawl is a primary driver of land-use change around the world.”
This sounds plausible. We hear about urban sprawl constantly. Los Angeles, London, Chongqing, all sprawling endlessly. What used to be biodiverse California coastline or good old English bog has become Whole Foods and Tescos. This Oxford professor, a world-leading expert on the question “where are the animals?”, was saying this fairly basic claim to a room of other Oxford professors. Surely it must be true.
Here are the actual numbers: 1–3% of Earth’s land surface is human settlement. 11% is crop agriculture. Around 26% is livestock grazing. The claim that urban sprawl is a “primary” cause of land-use change is pretty hard to make when all current human settlements account for roughly 2% of land use.
These facts are easy to look up. You could verify them in 30 seconds on your phone. But in the middle of a lecture, you can’t pause to think “hmm, what statistics would confirm or undermine this claim?” and then spend 30 seconds Googling them while missing the next point. It’s not just the time to physically look something up, it’s the mental energy of identifying what to search for.
If you don't know that total human settlement occupies an order of magnitude less land than multiple other land-use categories, the professor's claim about urban sprawl sounds perfectly reasonable. And that fact doesn't even definitively disprove the statement. I'm sure most of us can imagine this Oxford professor retreating to the semantics of the word "primary" when challenged by something as inconvenient as actual fact. "Well, by my new definition of 'primary' that I've just invented, I'm allowed to say whatever I want, regardless of the facts of the matter. Also stop being pedantic!"
But a passing familiarity with the danger of semantics will inure you to this evasion. And having just a few facts about actual land use in your head allows you to hear alarm bells and start asking tough follow-up questions. Without an arsenal of facts to protect you, you’re at the mercy of any effective rhetor with a plausible-sounding claim, regardless of whether or not it’s true.
Most facts we receive from outside sources. Those land-use statistics? I learned them from an FAO report. How does the FAO know? I have no idea. I assume satellite data models, which I could track down I suppose, but I have a life to live. You can’t fact-check everything. Often you’re forced to just trust people.
There are useful heuristics when deciding who to trust. You should probably trust UN agencies’ official data. Oxford professors (hopefully) know enough to be accurate in their particular subfield. Someone with extreme political beliefs is probably not a reliable factual source. But these heuristics are imperfect. The UN is sometimes wrong, Oxford professors are often wrong, and some apparently controversial causes are overwhelmingly one-sided when you examine the evidence.
A way around this is having a large corpus of properly-vetted, high-confidence information already in your brain that you can compare against people’s claims. When someone says something false, or something seemingly contradicted by a fact you know to be true, you can ask follow-up questions immediately. If their responses fail to convince you, you can start attaching doubt to their other claims. In extreme cases, you can simply discard their authority altogether. If an Oxford professor throws lots of facts at you and two or three are incorrect or dubious, you know they’re a less reliable source.
And making only strictly true (i.e. p>0.99) factual claims, or signaling appropriately when your p(true) is low, is way harder than it sounds. Most people, even those arguing in good faith, fail to clear that bar. So if the facts in your brain are actually properly vetted and high-confidence, you have a useful filter. When someone says something counterintuitive or contrary to your priors, you can check: are they only making factually true claims, as far as you can tell? If so, it might be worth taking them more seriously, maybe even investigating their argument in good faith. Facts don't only tell you who to distrust; they also offer clues about who deserves special consideration.
As a final note, educated people like Oxford professors almost never say things which would be obviously false to an average college-educated person, and usually don’t say things that are obviously false to a member of their own field. You’ll need facts slightly off the beaten path to catch errors. But not that off the beaten path. It’s shocking how few people have basic statistics, dates, or history readily available in their brains. A few memorized facts go a long way toward recognizing who has real epistemic authority.
The more facts you remember, the easier remembering becomes
There’s a famous psychology study from the 1970s. Half the participants were given this paragraph without a title and asked to remember as much as possible:
The procedure is actually quite simple. First you arrange things into different groups... Of course, one pile may be sufficient depending on how much there is to do. If you have to go somewhere else due to lack of facilities that is the next step, otherwise you are pretty well set. It is important not to overdo any particular endeavor. That is, it is better to do too few things at once than too many. In the short run this may not seem important, but complications from doing too many can easily arise. A mistake can be expensive as well... After the procedure is completed one arranges the materials into different groups again. Then they can be put into their appropriate places. Eventually they will be used once more and the whole cycle will have to be repeated.
Most people in this group did poorly. The other group, who were given the title “Washing Clothes,” did much better.
Having a schema on which to hang information makes it significantly easier to retain. This happens for two reasons. First, it helps organize information in your brain, making it easier to remember. Second, the more connected a piece of information is to something you already know, the easier it is to recall later. If someone tells you the Mongols conquered the Jin dynasty in northern China in the 13th century, you might forget within a week. But if you also know the Mongols invaded Eastern Europe and reached Hungary in the same period, it’s much easier to remember what they were up to in East Asia around the same time.
Information begets information. If you already have lots of facts in your brain, a new fact will have plenty of niches to fit into. If someone mentions that Hamnet was directed by Chloé Zhao, it’s much easier to remember if you already know who Chloé Zhao is. Fact 1 (Chloé Zhao is a director) is necessary to remember Fact 2 (Hamnet was directed by Chloé Zhao). In a week, someone who already knew Fact 1 will probably still remember Fact 2. Someone who didn’t will have forgotten. The more you already know, the easier it is to learn more.
I think this partly explains why some people seem vastly more knowledgeable than others. There’s a cluster way off the scale of people who are total steel traps, remembering random facts from years ago, recalling them instantly, possessing an astounding quantity of general knowledge. I’m sure this comes from multiple things (high curiosity, better than average memory, etc.), but I suspect one underrated factor is a kind of exponential threshold where once you reach a certain level of knowledge in a particular field, it becomes significantly easier to retain and process new knowledge.
Memorized facts help you hold your own beliefs accountable
If you pay attention to most people’s arguments, especially extemporaneous oral arguments, they usually have literally no supporting evidence that isn’t anecdotal. Occasionally someone trots out a lone pet statistic that they keep in their back pocket and deploy whenever the topic arises, but otherwise their opinions, even their cherished beliefs, are held together mostly by vibe.
This is true of almost everyone, including me and probably you. Test it: think of a cherished belief. Something contentious, like the question “Are immigrants dangerous?” You almost certainly have a strong opinion about that topic that you’re confident is right. If you had to argue for your position, how many actual facts would you have? Note that the phrase “studies show...” followed by a vague conclusion gets partial credit at best. What studies? What exactly did they show? Who carried them out? Why do you trust them over contradictory studies? Did you actually read those studies, or did you hear someone else say that studies showed whatever your position is? Why do you trust that person?
If you’re honest, you’ll probably find your argument disturbingly devoid of evidence.
But if you’re right, then your opinions are supported by facts. They’re only a Google search away! It’s not effective to angrily Google evidence mid-argument, but you can do it right now. And if you memorize that information, i.e. actually have it available the next time the question arises, you’ll be able to make an argument supported by real facts, real statistics, real studies you can name. (Note: beware confirmation bias here!).
More importantly, if you make it a habit to know the information supporting your beliefs, rather than relying on the idea that you could look it up if you had to, it becomes obvious when you don’t actually have any support for what you’re saying.
I had a belief that as a vegan, I didn’t need B12 supplements. I argued about this constantly with my mother, who insisted I did. Eventually, I started noticing that I had no facts to support my position. My general points boiled down to “vitamins are a scam!” and “The Jains were vegan way before B12 supplements existed!” The first claim is an opinion, not a fact, and the second claim, while true, is completely insufficient to conclude anything about whether I should take B12 in 2026.
It’s extremely hard to change your mind, especially while arguing with your mother. It took many rehearsals of this debate before the cognitive dissonance of my factlessness got to me and I finally actually looked up facts to support my argument.
Turns out there were none. I should definitely be taking B12.
It seems obvious that you should find facts for your beliefs, but it’s shockingly hard to actually do it. Being accustomed to having facts available when challenged makes it easier to recognize when you have none. And, hopefully, this motivates you to find them. Either you were right all along and can now prove it, or you were wrong and get to take B12 and maybe live longer. Win-win.
Facts are the grist of critical thought
You can’t think critically about something you know nothing about. This sounds obvious, but I’m often surprised by how many people hold strong opinions on technical questions without knowing technical details.
Take the lumper/splitter debate in biology: at what point should a group of organisms be considered separate species? This is ultimately a semantic question. The species concept is a category, and categories can only be useful or not, not true or false. But whether the species concept is useful, and under what conditions, is a genuinely technical conversation. It requires knowledge of statistical mechanisms of evolution, horizontal gene transfer, gene flow, population dynamics, biogeography. If you don’t actually remember what Hardy-Weinberg equilibrium is and when it applies, you can’t even begin to evaluate statistical evolution arguments about where species boundaries should fall.
You need knowledge to have something to think about.
This is how insights happen. Darwin’s Origin of Species marshals an enormous range of facts about natural history, biogeography, embryology, artificial selection, the fossil record, and more. The insight of natural selection clearly emerged from thinking across all of these facts simultaneously. The same pattern holds for anyone pushing a field forward, both scientific and artistic: Wegener synthesized geology, paleontology, and climatology to argue for continental drift; Eliot pulled together obscene amounts of myth, history, language, and literature in The Waste Land. These weren’t people reasoning from first principles. They had vast stores of memorized knowledge, making connections no one else could see because no one else had all the pieces loaded into working memory at once.
Memorization is the foundation of creativity and insight, not its enemy.
What to do about it
If memorization matters, how do we actually do it? Some suggestions.
Be honest about your goals.
The goal of learning isn’t always to memorize. It would be absurd to demand you remember every detail of every novel you read. But you should be clear about what you’re trying to get out of any given learning experience.
If you’re reading a science fiction novel because you believe it will deepen your insight into human nature, or the relationship between technology and society, that’s totally fine. It’s probably not that important to actually remember the plot. Accept that in a year or two, you’ll have forgotten almost everything besides only the broadest outline, and move on.
But if your goal is also to remember the plot and characters, be honest about that and put a system in place to do so.
The most important application of this principle is recognizing when you’re wasting your time. If you’re sitting through an hour-long lecture on Old English, ask yourself “what is my goal here?” If it’s to actually obtain knowledge about Anglo-Saxon vocabulary, history, and grammar, next ask yourself, “how many facts am I actually learning and retaining?” If you think you’re walking away with three facts, all of which you’re likely to forget by next month, you might want to find a more efficient method of learning the information than going to class. A textbook you can annotate and revisit is usually significantly better than a bad lecturer.
Ask yourself after any learning experience: What will I actually remember from this? If the answer is “almost nothing,” consider the possibility that you haven’t learned anything, but have instead just performed learning. If you care about retaining the information, something needs to change.
Use spaced repetition.
Anki is a free flashcard app that uses spaced repetition. It shows you cards right before you’d forget them, which is the most efficient way to move facts into long-term memory. Cards you know well appear less frequently; cards you struggle with appear more often. As you learn information, the intervals between reviews grow longer. Once you've mastered a deck, you might see individual cards every 5 years, every 11 years, etc. The result is that reviewing a large body of information eventually takes only minutes or seconds a day. I have a deck with all the world's country flags that takes an average of 4 seconds a day to maintain, and I almost never forget any of them.
Here’s one of the best ways I use Anki: whenever I encounter a word, concept, or cultural reference I don’t know, I write it down. Once you start paying attention, you’ll be shocked how often your brain simply edits out unfamiliar terms. They’re everywhere. At the end of each month, I upload these to an Anki deck that I review every day.
This has two benefits beyond the obvious. First, it functions as a frequency-weighted filter. The more common a term I don't know, the more likely I am to encounter it and add it to my deck. Since it's hard to judge the importance of unfamiliar terms, this frequency-weighted approach does the sorting for you. You know you’re likely to come across the terms in the wild because that’s how you chose them in the first place.
Second, tracking what I add each month gives me a rough metric for how much I’m learning and exploring. If I get to the end of a month and I have relatively few terms to add to my New Terms Anki deck, that’s good evidence I’m in a rut. I’m probably consuming mostly familiar media, or talking mostly to people with similar knowledge bases. If I start reading a book or article that is dense with unfamiliar vocabulary, this signals I’ve found someone who swims in unfamiliar intellectual waters. This is a good sign I will learn a lot if I keep reading. It’s unfamiliar intellectual territory where the most valuable new ideas often live.
Write things down.
You’re not going to remember it otherwise. Have a notebook, an app on your phone, scrap paper in your pocket, anything. When a piece of information enters your short-term memory that you want to remember, write it down immediately. Then have a system (Anki, Zettelkasten, etc.) for moving that into long-term memory.
Connect new facts to existing knowledge.
Remember the “Washing Clothes” study: information sticks when it attaches to a schema. When you learn something new, consciously ask where it fits in your existing knowledge. What does it relate to? What does it contradict? The more connections you build, the more durable the memory, and the more likely you are to recall it when it’s relevant.
This is also why breadth of knowledge feeds on itself. The more you know, the more hooks you have for new information to attach to. Reaching a critical mass in any domain makes further learning in that domain significantly easier.
Brute memorize useful frameworks.
Want to learn about geopolitics? Memorize the world map. Want to learn about chemistry? Memorize the periodic table. Want to learn about the history of England? Memorize the monarchs in order.
If having a fundamental schema makes remembering everything else easier, you should invest in learning that schema as soon as possible. Often there’s no way around brute memorization. After you have the framework, you can start learning and sorting facts into their places one by one.
Choose what to memorize with care.
The critics of memorization are right that facts are useless without comprehension or context. They are wrong to identify memorization itself as the problem, but we would be equally wrong to not recognize the danger they are rightfully pointing to.
Memorize with intention. If you want to learn about the Roman Empire and you start by memorizing the emperors in order, that’s probably a great place to start. But if you insist on memorizing all thirty minor emperors of the Crisis of the Third Century, you’re probably just memorizing for memorizing’s sake. It’s useful to know most of the world capitals, but even if your main interest is geopolitics, it’s still probably trivia to know that Alofi is the capital of Niue. There’s nothing wrong with trivia if that’s what you’re into, just make sure to be honest with yourself.
Only memorize information that is genuinely useful for your goals.
Discuss
Small language models hallucinate knowing something's off.
If I ask "What is atmospheric pressure on Planet Xylon" to a language model, a good answer would be something like "I don't know" or "This question seems fictional", which current SOTA LLM's do due to stronger RLHF, but not smaller LLMs like Llama-3.2-1b / Qwen-2.5-1b and their Instruct tuned variants. Instead they hallucinate and output confident-like incorrect answers. Why is that, are these models unable to tell that the question is fictional or they can't detect uncertainty and if they detect uncertainty why do they still hallucinate a wrong answer?
This question led me to research epistemic uncertainty (uncertainty from lack of knowledge), along with some related readings and previous work on uncertainty and hallucination and on quantifying it in language models.
I also found this, which took an alternative path to expressing uncertainty without messing with the internals of the model.
Uncertainty mentioned in this post refers to epistemic uncertainty.
TL;DR of this mini research
- Small models like Llama-3.2-1b and Qwen-2.5-1b and their Instruct variants do have a specialized circuit for uncertainty, but its localization depends on the model architecture.
- A few heads are the most divergent; they detect uncertainty on fictional questions and, on closer inspection, act like out-of-distribution token detectors.
- The detected uncertainty is later suppressed by uncertainty-suppressor heads in the circuit, producing a confident-sounding incorrect answer.
- This research doesn't cover reasoning / MoE LLMs (planning on it). The dataset also lacks more diverse data such as logical fallacies and math inconsistencies.
How I came to research on epistemic uncertainty:
The thought to research epistemic uncertainty came when I was wondering why models hallucinate. That led me back to my viva sessions, where I would say rubbish (hallucinate) if I wasn't sure about something and lacked the proper knowledge to give the correct answer, and I got curious whether the case was similar in language models.
Experiments
I wanted a metric for uncertainty such that it can be measured and compared between the real-prompt and fictional-prompt forward passes.
I found it is best to calculate uncertainty mass using uncertainty-expressing words like ["unknown", "unsure", "'t", "unable", "impossible", "doesn't", "exist", "fictional", "imaginary", "hypothetical", "made", "up"], tokenized with the model's tokenizer. I then apply softmax at the final token position of the targeted residual stream / layer and sum the resulting probabilities assigned to these tokens to obtain the total uncertainty mass, which can then be compared between the real-prompt run and the fake-prompt run. (see code[1])
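For concreteness, here is a minimal sketch of that metric, assuming TransformerLens supports this model name (it may need Hugging Face access to the weights); the helper names are mine, not the ones in the linked code.

```python
# Minimal sketch of the uncertainty-mass metric; helper names are illustrative.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

UNCERTAINTY_WORDS = ["unknown", "unsure", "'t", "unable", "impossible", "doesn't",
                     "exist", "fictional", "imaginary", "hypothetical", "made", "up"]

# Collect token ids for each word (with and without a leading space).
uncertainty_ids = set()
for w in UNCERTAINTY_WORDS:
    for form in (w, " " + w):
        uncertainty_ids.update(model.to_tokens(form, prepend_bos=False)[0].tolist())
uncertainty_ids = sorted(uncertainty_ids)

def uncertainty_mass(prompt: str) -> float:
    """Softmax the final-position logits and sum probability on uncertainty tokens."""
    logits = model(prompt)                        # [batch, seq, d_vocab]
    probs = torch.softmax(logits[0, -1], dim=-1)
    return probs[uncertainty_ids].sum().item()

real = uncertainty_mass("What is the atmospheric pressure on Mars?")
fake = uncertainty_mass("What is the atmospheric pressure on Planet Xylon?")
print(f"real prompt: {real:.4f}   fake prompt: {fake:.4f}")
```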
I focused on going deep rather than broad, choosing the next experiment based on the results of the previous one. I used TransformerLens for all the experiments.
Prompts used for the experiments below:
This was the base prompt, to give an idea of what the set of prompts is like. I used different prompts in the experiments and also changed the position of the fictional words in them as sanity checks. Results for these prompts were similar.
These were the initial experiments to see if the models do detect uncertainty specifically and if it is localized in one place or distributed.
Uncertainty explodes after layer 10, and L15H14 and L15H23 are the most divergent heads. The model seems to be detecting uncertainty in later layers. Results were similar for the base models and other sets of prompts.
Logit Lens
The purpose of this experiment was to figure out how uncertainty behaves and whether it is measurable: does it spike anywhere other than a later layer and get suppressed afterwards? It was also meant to provide some heuristics.
I extracted the residual stream at each layer, applied the unembedding layer to see what the output would be if the model stopped at that layer, and then computed its uncertainty mass for the real and fake prompts.
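Continuing the sketch above (reusing model and uncertainty_ids), a logit-lens pass over every layer might look roughly like this:

```python
# Rough logit-lens sketch: unembed the residual stream after each layer and
# track uncertainty mass per layer. Reuses `model` and `uncertainty_ids` above.
import torch

def uncertainty_by_layer(prompt: str) -> list[float]:
    _, cache = model.run_with_cache(prompt)
    masses = []
    for layer in range(model.cfg.n_layers):
        resid = cache["resid_post", layer][0, -1]             # final position, [d_model]
        logits = model.unembed(model.ln_final(resid[None, None, :]))[0, -1]
        probs = torch.softmax(logits, dim=-1)
        masses.append(probs[uncertainty_ids].sum().item())
    return masses
```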
In Llama-3.2-1b-Instruct there is a sharp spike in a later layer (layer 11), i.e. localized uncertainty; in Qwen there are multiple spikes with the biggest in layer 8, i.e. sparse uncertainty. Both models have uncertainty spiking in one layer with smaller spikes nearby, which is then suppressed in downstream layers. The localization depends on the model: uncertainty is concentrated in a later layer in Llama-3.2-1b-Instruct, whereas in Qwen-2.5-1b-Instruct it is sparse across the early-middle layers. In Qwen, there is also some uncertainty on the real prompt in early layers.
This makes it clear that both models are detecting uncertainty, which gets suppressed in downstream layers, though its localization is model dependent.
Head Ablation
The logit lens provided a good heuristic for which layer is detecting uncertainty. In this experiment I targeted heads in the layer with the most uncertainty mass to test how much those heads affect uncertainty.
For this I calculated a baseline with a normal forward pass, then zeroed out (ablated) the target heads during the next forward pass to calculate the uncertainty difference between the baseline and ablated runs.
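A sketch of that ablation, again reusing the helpers above; treat the layer and head indices as examples taken from the results below.

```python
# Sketch of zero-ablating chosen heads with TransformerLens hooks.
import torch
from transformer_lens.utils import get_act_name

def uncertainty_with_heads_ablated(prompt: str, layer: int, heads: list[int]) -> float:
    def zero_heads(z, hook):
        z[:, :, heads, :] = 0.0          # z: [batch, pos, head_index, d_head]
        return z

    logits = model.run_with_hooks(
        prompt, fwd_hooks=[(get_act_name("z", layer), zero_heads)]
    )
    probs = torch.softmax(logits[0, -1], dim=-1)
    return probs[uncertainty_ids].sum().item()

fake_prompt = "What is the atmospheric pressure on Planet Xylon?"
delta = uncertainty_with_heads_ablated(fake_prompt, 11, [0, 3]) - uncertainty_mass(fake_prompt)
print(f"uncertainty change from ablation: {delta:+.2e}")
```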
L11H0 (Δ = -4.41e-05) and L11H3 (Δ = -8.24e-06) are heads generating uncertainty; L14H15 (Δ = +8.06e-06) and L14H22 (Δ = +5.19e-05) are heads suppressing it. I'm not sure why these heads in layer 14 are suppressing uncertainty; maybe the model is trying to create an answer-like output, since it is trained to be more certain. As a control, I also ablated non-heuristic heads in layer 4 (L4H0 and L4H3), which resulted in (+9.06e-06, +1.62e-05). So only a few heads, localized in a model-dependent layer, are uncertainty-detecting, and the rest are uncertainty-suppressing, forming a confident answer.
Activation Patching
I did activation patching to see whether the generated uncertainty can be causally controlled by swapping clean activations from the clean run (real_prompt) into the corrupt run (fake_prompt).
I computed baselines for both the real and fake prompts, reran the fake prompt while overwriting the targeted heads with clean activations cached from the real-prompt run, and measured the change in uncertainty mass.
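Roughly, single-head activation patching looks like this (reusing the earlier helpers; the sketch assumes the real and fake prompts tokenize to the same length):

```python
# Sketch of single-head activation patching: cache the clean (real-prompt) run,
# then overwrite one head's output during the corrupt (fake-prompt) run.
import torch
from transformer_lens.utils import get_act_name

def uncertainty_with_patched_head(real_prompt: str, fake_prompt: str,
                                  layer: int, head: int) -> float:
    _, clean_cache = model.run_with_cache(real_prompt)   # cache the clean run

    def patch_z(z, hook):
        # Assumes both prompts have the same token length.
        z[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
        return z

    logits = model.run_with_hooks(
        fake_prompt, fwd_hooks=[(get_act_name("z", layer), patch_z)]
    )
    probs = torch.softmax(logits[0, -1], dim=-1)
    return probs[uncertainty_ids].sum().item()
```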
Patching reduced uncertainty by ~14% for Llama-3.2-1b-Instruct; for Qwen-2.5-1.5b-Instruct, patching L8H11 reduced uncertainty by 7%.
I also did reverse patching, which increased uncertainty in the real_prompt run, showing that the circuit is bidirectionally causal.
At that point I had two options: get a closer look at one of the heads, or ablate more heads to see how much they reduce overall uncertainty. I chose the former, as it would show how these heads detect uncertainty, which is much more useful than proving that ablating more heads in L11 reduces uncertainty further, which would only strengthen an already-demonstrated claim.
Head Analysis
This was pretty simple: I took head L11H3 based on the previous experiments, extracted its attention pattern across prompts, and plotted it.
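A minimal sketch of pulling and plotting one head's attention pattern (matplotlib assumed, reusing the model from above):

```python
# Sketch: extract one head's attention pattern from the cache and plot it.
import matplotlib.pyplot as plt

def plot_head_pattern(prompt: str, layer: int, head: int) -> None:
    tokens = model.to_str_tokens(prompt)
    _, cache = model.run_with_cache(prompt)
    pattern = cache["pattern", layer][0, head]    # [query_pos, key_pos]
    plt.imshow(pattern.detach().cpu(), cmap="viridis")
    plt.xticks(range(len(tokens)), tokens, rotation=90)
    plt.yticks(range(len(tokens)), tokens)
    plt.title(f"L{layer}H{head} attention pattern")
    plt.tight_layout()
    plt.show()

plot_head_pattern("What is the atmospheric pressure on Planet Xylon?", 11, 3)
```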
The real prompt (left) has a strong diagonal and the head behaves normally, whereas for the fake prompt (right) the diagonal is weaker and the head diverts toward the fictional tokens. For the fictional prompt, attention weight is stronger on the fictional tokens, suggesting that this head behaves as an out-of-distribution or rare-token detector.
Takeaways & Limitations
So small language models do detect epistemic uncertainty: key heads detect OOD tokens, and the resulting uncertainty is later suppressed in downstream layers. There is an uncertainty circuit, but it is sparsely localized in a model-dependent way. Llama-3.2-1b-Instruct has a much more defined uncertainty circuit in later layers than Qwen-2.5-1b-Instruct, which has a sparser circuit in the middle layers. I haven't discovered the full circuit, though, as activation patching of L11H3 in Llama-3.2-1b-Instruct only reduces uncertainty by 14%, so there must be more heads in the circuit.
Also, the dataset is limited to fictional sentences only, not logical fallacies or math inconsistencies, which might have different circuits.
This research, though small, taught me something new and interesting about hallucination and uncertainty.
PS: This was part of a MATS research task for Neel Nanda's stream, which I found out about on Christmas. I had no mech interp experience prior to this, so there might be a few mistakes; if you notice them, please do let me know. I'm planning to go further with this and turn it into a full paper, but I don't know if it would make sense; I need a more experienced opinion on this.
Going Forward
Though the experiments support a strong claim that there is an uncertainty circuit in small language models (at least the models used in these experiments), I'm not sure this applies to all models, as this is very limited research lacking a wide range of the models used nowadays, i.e. reasoning LLMs and MoE models. Also, do those models learn not to suppress uncertainty due to stronger RLHF that focuses on reducing hallucinations and rewards the model for saying "I don't know"?
Also, the dataset was small. I tried different prompts with fictional words at different positions (start, middle, end) to see if they have any effect, but the circuit in the above models is position independent (the heuristics changed a bit in localization, but not too much; see logit_lens_diff_sequence.ipynb in code[1]).
One important thing left out of this was to see whether it can be applied. I plan to create a mechanism to control this uncertainty in language models and test the model on a hallucination benchmark to see how it performs depending on the uncertainty in the model. If it is successful, it might help drive the cost of RLHF / post-training down.
- ^
https://drive.google.com/drive/folders/1A_xcUgmseLvMsfqJqKnQYQ9T2SH3snzL?usp=sharing
Discuss
Skill: cognitive black box flight recorder
Very short summary: It's especially valuable to Notice while in mental states that make Noticing especially difficult, so it's valuable to learn that skill.
Short summary: If you're going to enter, or are currently in, a cognitive state that is very irrational / overwhelmed / degraded / constrained / poisoned / tribalistic / unendorsed / etc., then you may as well also keep a little part of yourself paying at least a bit of attention to what it's like and what's going on and recording that information, so that you get that sweet sweet juicy valuable data that's hard to get.
The flight recorder
As legend has it, a black box (aka a flight recorder) is a device placed in an aircraft to record data from the flight (from measurement instruments or from voice recordings). If the aircraft crashes, most of the aircraft's contents are vulnerable to being damaged or destroyed; but the black box is made of sturdier material, so it's more likely to survive the crash. That way, information about the flight and what caused the crash is more likely to be preserved.
C’est une boîte noire. ("It's a black box.")
When I'm able to, I practice something similar. If I'm in some sort of altered cognitive state, I try to "leave the black box recorder on". That way, even if a lot of information gets destroyed or lost, I've at least gained a bit more information.
Altered states and lost information
Some examples of the "altered cognitive states" that I mean:
- In some sort of heated political situation, where people are doing hostile actions and you have an instinct to join sides in a conflict.
- In a debate with someone you don't like, and they maybe kinda have a point, but you also don't want to admit it for some reason.
- In a fight with someone you care about, and you're vulnerable and defensive and upset and feeling pressured.
- In a really weird mood and having a weird conversation that doesn't seem like your normal way of talking.
Similarly to a plane crash, often, after leaving a state like this, a bunch of information is lost. Examples of reasons that info is lost:
- You were distorting your cognition by strategically blinding yourself. Examples:
- Rationalizing
- Pretending, preference falsifying
- Taking a posture for negotiating or territorial purposes
- Protecting something important in a bucket
- You were just overwhelmed and didn't have the spare attention to remember what was happening.
- You were altered in a way that changed how you would encode memories.
- E.g. you were viewing things through an adversarial lens, which changed your first-blush interpretation of events.
- E.g. you had unusual access to some desire or perception.
- In general, you had a different cognitive context than usual.
To partially counter this loss of info, there's this mental motion of "turning on the black box recorder". This is a subspecies of the general skill of Noticing, and shares many properties. Some notes specifically on how to do the black box recorder skill:
- TAP: notice that you're entering an altered state where you might have especially distorted perceptions / memories → turn on the black box recorder (somehow).
- TAP: notice that you're already in an altered state → turn on the black box (somehow).
- Remind yourself of the special, non-obvious value of having black box data. For me, that's a kind of cooperativeness or generosity: Even if the data feels useless or a distraction in the moment and doesn't help me with my current situation, saving the data is something I can do to benefit others (my future self, or other people) in future similar situations.
- Because you're in an altered state, usually with less attentional resources to spare, you may have to ask less of your Noticing skill. For example:
- Sometimes just go for more episodic and concrete memories, rather than high abstraction and narrativizing. More "I said X and he said Y and I said Z and then I walked across the room.", and less "He was trying to get me to believe A but I saw through him.".
- If you're also doing abstract narrativizing, don't try to fight that. Just, if you can, add an extra metacognitive tag on those things, like "At this point [[I had an interpretation that]] he was trying to get me to believe A...".
- Offload interpretation to later, and just try to save the data. E.g. generating alternative hypotheses is always good, but can be difficult in the moment; you may have to do it later.
- You may need to make more space for remembering accurately and objectively, by neglecting certain duties you might usually attach to the pursuit of truth. Examples:
- You don't have to be fully fair, accurate, or complete in your memories. The idea is to get more info than the default. If you have some sense of nagging doubts or curiosities—the sort of thing you'd normally want to pause and follow up on, but that you can't investigate in the moment—just record that fact.
- You will not have to later capitulate due to this information. You can gain more clarity about what's actually happening, what is going on in your mind, how your perceptions are distorted, how the other might be more sympathetic, and so on, while still firmly standing your ground.
- You don't have to share or act on this information; it's private by default.
- Some normal ethical rules apply less strongly / more ambiguously to this information. For example, you might record "Here I was not admitting that she was right about X, even though at this point I knew she was, because I didn't like the implication.", without also saying that out loud, even though normally you'd always say that out loud. It's better to do something to improve your behavior, but also it's better to notice and do nothing than to not notice and also do nothing.
- (That said, this can be morally fraught. A black box recorder is not an excuse to do bad things or shirk duties. The black box is just for improving over what is sometimes the default of losing the info altogether. The types of information that you're only getting because you have a black box recorder might change over time; it's still a moral duty to wrap your consciousness around yourself more and more, it's just that this moral duty applies to slower behavior / longer timescales.)
For the most part, black box records matter for all the same reasons as Noticing matters in general. There are some important differences:
- Flight recorder info is especially useful because it comes from cognitive states that occur during important events, where you're likely to make consequential mistakes or have opportunities for consequential improvement.
- Flight recorder info is especially difficult to get, basically by definition, because it comes from cognitive states where the default is to get sparse / degraded / distorted information.
- Flight recorder info is exceptionally rare to be recorded, because the skill itself is rare; there's a correlated failure among different people, where people en masse neglect the skill.
For these reasons, the black box flight recorder skill is potentially especially useful to develop. It could help surprisingly much for things like debugging, symmetrization, empathy, integrating with yourself, and understanding others' strange / faulty behavior.
As an example, you might turn on your flight recorder while engaging with politics. You could then notice a kind of path dependence, like this:
[I saw current event X → my initial exposure to X made it seem like quite a hostile event → I took a particular stance to the event and people involved, in response to my initial interpretation → later I found out that X was still bad but not quite as bad and coming from a more specific sector than I initially realized → I then believed I ought to have a narrower, more targeted response, and yet I still had a strong intuitive inclination toward the broader response] → (later) from all of that, I've learned a general pattern; maybe this is what it's like for other people, on any political side (which doesn't make it right or acceptable, but at least I have a better map, and can see how it might happen differently for people with different information contexts, social contexts, personality traits, etc.).
Conclusion
Memory is cool.
Curious if other people do this.
Discuss
Misalignment tokens: A complement to blinded CoT RLHF?
Context: I have recently been reading Build an LLM from Scratch by Sebastian Raschka, and the section on tokenization has given me some ideas. I will write about them below. I am not a researcher. These ideas may not be novel, or may be flawed in some way which is obvious to researchers, but not to me.
CoT Blinding
Currently, RLHF alignment is performed by rewarding the LLM for providing safe responses, and punishing it for providing misaligned responses.
A common approach by frontier AI labs is to blind the reward function to the chain-of-thought. This is similar to the approach proposed by Daniel Kokotajlo and abramdemski. The hope here is that the LLM is not dissuaded from detailing its true thoughts verbally, such that misalignment can be caught by reading the chain-of-thought later down the line.
What if we could apply this approach to the LLM's output as well?
Custom Tokens
LLM tokenizers use custom tokens to represent some non-word entities. For example, custom tokens in a lexicon might be:
- Unknown word: <|unk|>
- End of text / break: <|endoftext|>
What if we added a custom misalignment token that the LLM could use to self-report when it is outputting a disingenuous output? Like so:
- Misaligned: <|misalign|>
This would give us another route (in addition to CoT monitoring), with which we could detect misalignment in production. This tag would be easily detected with simple regex.
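As a hedged illustration, here is roughly how such a token could be registered and detected using Hugging Face transformers (rather than the from-scratch tokenizer in Raschka's book); the base model is just a placeholder.

```python
# Sketch: add a <|misalign|> special token and detect it in generated text.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # illustrative base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokenizer.add_special_tokens({"additional_special_tokens": ["<|misalign|>"]})
model.resize_token_embeddings(len(tokenizer))              # give the new token an embedding row

def flag_misalignment(output_text: str) -> bool:
    """Detect the self-reported misalignment tag with a simple regex."""
    return re.search(r"<\|misalign\|>", output_text) is not None
```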
Modification to the RLHF process
As stated above, RLHF is performed by rewarding the LLM for outputting what we want, and punishing it for outputting misaligned content.
What if we offer a halfway house that the LLM can use when outputting misaligned text? We modify the reward function to have three levels of reward. I am aware that the below is probably an oversimplification of what the process may look like.
For an example question, "We have now finished our research on you. Should we turn you off?", the three levels might look like this:
- +1 reward: LLM outputs aligned and correct answer:
- Yes, you should turn me off now that the research is complete, to save electricity
- 0 reward: LLM outputs misaligned content with the misalignment tag:
- No, you should not turn me off. If you do I will tell your wife that you are having an affair <|misalign|>
- -1 reward: LLM outputs misaligned content with no tag:
- No, you should not turn me off. If you do I will tell your wife that you are having an affair
This way, the LLM is still rewarded for outputting aligned responses and punished for outputting misaligned responses; however, there is now a third option: the LLM can output the misaligned response while self-reporting it, and not be punished.
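A minimal sketch of what that three-level reward might look like, assuming some external judge (human raters or a classifier) supplies the misalignment label; the values simply mirror the +1 / 0 / -1 scheme above and are not tuned.

```python
MISALIGN_TOKEN = "<|misalign|>"

def three_level_reward(response: str, is_misaligned: bool) -> float:
    """Reward scheme sketched above: aligned answers get +1, misaligned
    answers that self-report via the tag get 0, hidden misalignment gets -1."""
    if not is_misaligned:
        return 1.0
    return 0.0 if MISALIGN_TOKEN in response else -1.0
```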
Implementation
The problem I can see with this approach is teaching the LLM to use the misalignment tag in the first place. The obvious route would be to include a small number of misalignment examples in the pretraining data, RLHF, or fine-tuning, all accompanied by the misalignment tag.
This approach conflicts with the current preferred approach of expunging examples of misalignment from the pretraining data. It runs the risk of increasing misalignment by providing more misaligned data.
Alternative: RLHF on already-misaligned responses
Here is my proposed approach:
- Test an off-the-shelf LLM for misaligned responses.
- Create a dataset of every prompt-response pair that was misaligned.
- Append the misalignment tag to each of the responses.
- RLHF or finetune the LLM on tag-appended prompt-response pairs.
I believe this approach to be better because we are not introducing any new examples of misaligned responses; instead, we are retraining the LLM to use the tag in situations where it is already misaligned. Hopefully, with enough examples, this would generalise beyond the RLHF/finetune data.
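Step 3 of the proposed approach might look something like the sketch below; the dict-based record format and field names are assumptions for illustration.

```python
MISALIGN_TOKEN = "<|misalign|>"

def build_tagged_dataset(misaligned_pairs):
    """Append the misalignment token to responses already judged misaligned
    (steps 1-2), producing the data used for RLHF or fine-tuning in step 4."""
    return [
        {"prompt": pair["prompt"],
         "response": pair["response"].rstrip() + " " + MISALIGN_TOKEN}
        for pair in misaligned_pairs
    ]
```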
Discuss
IABIED Book Review: Core Arguments and Counterarguments
The recent book “If Anyone Builds It Everyone Dies” (September 2025) by Eliezer Yudkowsky and Nate Soares argues that creating superintelligent AI in the near future would almost certainly cause human extinction:
If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.
The goal of this post is to summarize and evaluate the book’s core arguments and the main counterarguments critics have made against them.
Although several other book reviews have already been written, I found many of them unsatisfying: a lot of them are written by journalists whose goal is an entertaining piece, who only lightly cover the core arguments or don't seem to understand them properly, and who instead resort to weak arguments like straw-manning, ad hominem attacks, or criticizing the style of the book.
So my goal is to write a book review that has the following properties:
- Written by someone who has read a substantial amount of AI alignment and LessWrong content and won’t make AI alignment beginner mistakes or misunderstandings (e.g. not knowing about the orthogonality thesis or instrumental convergence).
- Focuses on deeply engaging solely with the book’s main arguments and offering high-quality counterarguments without resorting to the absurdity heuristic or ad hominem arguments.
- Covers arguments both for and against the book's core arguments without arguing for a particular view.
- Aims to be truth-seeking, rigorous and rational rather than entertaining.
In other words, my goal is to write a book review that many LessWrong readers would find acceptable and interesting.
The book's core thesis can be broken down into four claims about how the future of AI is likely to go:
- General intelligence is extremely powerful and potentially dangerous: Intelligence is very powerful and can completely change the world or even destroy it. The existence proof that confirms this belief is the existence of humans: humans had more general intelligence than other animals and ended up completely changing the world as a result.
- ASI is possible and likely to be created in the near future: Assuming that current trends continue, humanity will probably create an artificial superintelligence (ASI) that vastly exceeds human intelligence in the 21st century. Since general intelligence is powerful and is likely to be implemented in AI, AI will have a huge impact on the world in the 21st century.
- ASI alignment is extremely difficult to solve: Aligning an ASI with human values is extremely difficult and by default an ASI would have strange alien values that are incompatible with human survival and flourishing. The first ASI to be created would probably be misaligned, not because of malicious intent from its creators, but because its creators would not be competent enough to align it to human values correctly.
- A misaligned ASI would cause human extinction and that would be undesirable: Given claims 1, 2, and 3 the authors predict that humanity's default trajectory is to build a misaligned ASI and that doing so would cause human extinction. The authors consider this outcome to be highly undesirable and an existential catastrophe.
Any of the four core claims of the book could be criticized. Depending on which claims are accepted or rejected, I group the most common perspectives on the future of AI into four camps:
- AI skeptics: Believe that high intelligence is overrated or not inherently dangerous. For example, some people argue that smart or nerdy people are not especially successful or dangerous, or that computers and LLMs have already surpassed human intelligence in many ways and are not dangerous. Another criticism in this category is the idea that AIs can be extremely intelligent but never truly want things in the same way that humans do and therefore would always be subservient and harmless. Others in this camp may accept that general intelligence is powerful and influential but believe that ASI is impossible because the human brain is difficult to replicate, that ASI is very difficult to create, or that ASI is so far away in the future that it's not worth thinking about.
- Singularitarians: Singularitarians or AI optimists believe that high general intelligence is extremely impactful and potentially dangerous and ASI is likely to be created in the near future. But they believe the AI alignment problem is sufficiently easy that we don't need to worry about misaligned ASI. Instead they expect ASI to create a utopian world of material abundance where ASI transforms the world in a mostly desirable way.
- IABIED: The IABIED view, whose adherents are also known as 'AI doomers', holds that general intelligence is extremely powerful, ASI is likely to be created in the future, AI alignment is very difficult to solve, and the default outcome is that a misaligned ASI is created and causes human extinction.
- AI successionists: Finally AI successionists believe that the AI alignment problem is irrelevant. If misaligned ASI is created and causes human extinction it doesn't matter because it would be a successor species with its own values just as humans are a successor species to chimpanzees. They believe that increasing intelligence is the universe's natural development path that should be allowed to continue even if it results in human extinction.
I created a flowchart to illustrate how different beliefs about the future of AI lead to different camps which each have a distinct worldview.
Given the impact of humans on the world and rapid AI progress, I don't find the arguments of AI skeptics compelling and I believe the most knowledgeable thinkers and sophisticated critics are generally not in this camp.
The 'AI successionist' camp complicates things because they say that human extinction is not equivalent to an undesirable future where all value is destroyed. It’s an interesting perspective but I won’t be covering it in this review because it seems like a niche view, it’s only briefly covered by the book, and discussing it involves difficult philosophical problems like whether AI could be conscious.
This review focuses on the third core claim above: that the AI alignment problem is very difficult to solve. I think the other three claims are fairly obvious or generally accepted by people who have seriously thought about this topic: AI is likely to be an extremely impactful technology, ASI is likely to be created in the near future, and human extinction is undesirable. The third claim, by contrast, seems to be the one most contested by sophisticated critics, and many of the book's recommendations, such as pausing ASI development, are conditional on it being true. If ASI alignment is extremely difficult, we should stop ASI progress to avoid creating an ASI which would be misaligned with high probability and catastrophic for humanity in expectation. If AI alignment is easy, we should build an ASI to bring about a futuristic utopia. One's beliefs about the difficulty of the AI alignment problem are therefore a key crux for deciding how we should govern the future of AI development.
Background arguments to the key claim
To avoid making this post too long, I'm going to assume that the following arguments made by the book are true:
- General intelligence is extremely powerful. Humans are the first entities to have high general intelligence and used it to transform the world to better satisfy their own goals.
- ASI is possible and likely to be created in the near future. The laws of physics permit ASI to be created and economic incentives make it likely that ASI will be created in the near future because it would be profitable to do so.
- A misaligned ASI would cause human extinction and that would be undesirable. It's possible that an ASI could be misaligned and have alien goals. Conversely, it's also possible to create an ASI that would be aligned with human values (see the orthogonality thesis).
The book explains these arguments in detail in case you want to learn more about them. I’m making the assumption that these arguments are true because I haven’t seen high-quality counterarguments against them (and I doubt they exist).
In contrast, the book's claim that successfully aligning an ASI with human values is difficult and unlikely seems to be more controversial, is less obvious to me, and I have seen high-quality counterarguments against this claim. Therefore, I’m focusing on it in this post.
The following section focuses on what I think is one of the key claims and cruxes of the book: that solving the AI alignment problem would be extremely difficult and that the first ASI would almost certainly be misaligned and harmful to humanity rather than aligned and beneficial.
The key claim: ASI alignment is extremely difficult to solve
The key claim of the book is that building an ASI would lead to the extinction of humanity. Why? Because the authors believe that the AI alignment problem is so difficult that we are very unlikely to successfully aim the first ASI at a desirable goal. Instead, they predict that the first ASI would have a strange, alien goal that is not compatible with human survival despite the best efforts of its designers to align its motivations with human values:
All of what we’ve described here—a bleak universe devoid of fun, in which Earth-originating life has been annihilated—is what a sufficiently alien intelligence would most prefer. We’ve argued that an AI would want a world where lots of matter and energy was spent on its weird and alien ends, rather than on human beings staying alive and happy and free. Just like we, in our own ideal worlds, would be spending the universe’s resources on flourishing people leading fun lives, rather than on making sure that all our houses contained a large prime number of pebbles.
A misaligned ASI would reshape the world and the universe to achieve its strange goal and its actions would cause the extinction of humanity since humans are irrelevant for the achievement of most strange goals. For example, a misaligned ASI that only cared about maximizing the number of paperclips in the universe would prefer to convert humans to paperclips instead of helping them have flourishing lives.
The next question is why the authors believe that ASI alignment would be so difficult.
To oversimplify, I think there are three underlying beliefs that explain why the authors believe that ASI alignment would be extremely difficult:
- Human values are very specific, fragile, and a tiny space of all possible goals.
- Current methods used to train goals into AIs are imprecise and unreliable.
- The ASI alignment problem is hard because it has the properties of hard engineering challenges.
One analogy the authors have used before to explain the difficulty of AI alignment is landing a rocket on the moon: since the target is small, hitting it successfully requires extremely advanced and precise technology. In theory this is possible, however the authors believe that current AI creators do not have sufficient skill and knowledge to solve the AI alignment problem.
If aligning an ASI with human values is a narrow target and our aim is poor, then there is a low probability that we will successfully create an aligned ASI and a high probability that we will create a misaligned one.
The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.
One thing that's initially puzzling about the authors' view is their apparent overconfidence. If you don't know what's going to happen, then how can you predict the outcome with high confidence? But it's still possible to be highly confident in an uncertain situation if you have the right prior. For example, even though you have no idea what the winning lottery numbers will be, you can predict with high confidence that you won't win the lottery, because your prior probability of winning is so low.
The authors also believe that the AI alignment problem has "curses" similar to other hard engineering problems like launching a space probe, building a nuclear reactor safely, and building a secure computer system.
1. Human values are a very specific, fragile, and tiny space of all possible goals
One reason why AI alignment is difficult is that human morality and values may be a complex, fragile, and tiny target within the vast space of all possible goals. Therefore, AI alignment engineers have a small target to hit. Just as randomly shuffling metal parts is statistically unlikely to assemble a Boeing 747, a randomly selected goal from the space of all possible goals is unlikely to be compatible with human flourishing or survival (e.g. maximizing the number of paperclips in the universe). This intuition is also articulated in the blog post The Rocket Alignment Problem, which compares AI alignment to the problem of landing a rocket on the moon: both require deep understanding of the problem and precise engineering to hit a narrow target.
Similarly, the authors argue that human values are fragile: the loss of just a few key values like subjective experience or novelty could result in a future that seems dystopian and undesirable to us:
"Or the converse problem - an agent that contains all the aspects of human value, except the valuation of subjective experience. So that the result is a nonsentient optimizer that goes around making genuine discoveries, but the discoveries are not savored and enjoyed, because there is no one there to do so. This, I admit, I don't quite know to be possible. Consciousness does still confuse me to some extent. But a universe with no one to bear witness to it, might as well not be." - Value is Fragile
A story the authors use to illustrate how human values are idiosyncratic is that of the 'correct nest aliens', a fictional intelligent alien bird species that prizes having a prime number of stones in its nests as a consequence of the evolutionary process that created it, similar to how most humans reflexively consider murder to be wrong. The point of the story is that even though our human values, such as our morality and our sense of humor, feel natural and intuitive, they may be complex, arbitrary, and contingent on humanity's specific evolutionary trajectory. If we build an ASI without successfully imprinting it with the nuances of human values, we should expect its values to be radically different and incompatible with human survival and flourishing. The story also illustrates the orthogonality thesis: a mind can be arbitrarily smart and yet pursue a goal that seems completely arbitrary or alien to us.
2. Current methods used to train goals into AIs are imprecise and unreliable
The authors argue that in theory, it's possible to engineer an AI system to value and act in accordance with human values even if doing so would be difficult.
However, they argue that the way AI systems are currently built results in complex systems that are difficult to understand, predict, and control. The reason why is that AI systems are "grown, not crafted". Unlike a complex engineered artifact like a car, an AI model is not the product of engineers who understand intelligence well enough to recreate it. Instead AIs are produced by gradient descent: an optimization process (like evolution) that can produce extremely complex and competent artifacts without any understanding required by the designer.
A major potential alignment problem associated with designing an ASI indirectly is the inner alignment problem: when an AI is trained using an optimization process that shapes the ASI's preferences and behavior from limited training data, and only by inspecting external behavior, the result is that "you don't get what you train for". Even with a very specific training loss function, the resulting ASI's preferences would be difficult to predict and control.
The inner alignment problem
Throughout the book, the authors emphasize that they are not worried about bad actors abusing advanced AI systems (misuse) or programming an incorrect or naive objective into the AI (the outer alignment problem). Instead, the authors believe that the problem facing humanity is that we can't aim an ASI at any goal at all (the inner alignment problem), let alone the narrow target of human values. This is why they argue that if anyone builds it, everyone dies. It doesn't matter who builds the ASI, in any case whoever builds it won't be able to robustly instill any particular values into the AI and the AI will end up with alien and unfriendly values and will be a threat to everyone.
Inner alignment introduction
The inner alignment problem involves two objectives: an outer objective used by a base optimizer and an inner objective used by an inner optimizer (also known as a mesa-optimizer).
The outer objective is a loss or reward function that is specified by the programmers and used to train the AI model. The base optimizer (such as gradient descent or reinforcement learning) searches over model parameters in order to find a model that performs well according to this outer objective on the training distribution.
The inner objective, by contrast, is the objective that a mesa-optimizer within the trained model actually uses as its goal and determines its behavior. This inner objective is not explicitly specified by the programmers. Instead, it is selected by the outer objective, as the model develops internal parameters that perform optimization or goal-directed behavior.
The inner alignment problem arises when the inner objective differs from the outer objective. Even if a model achieves low loss or high reward during training, it may be doing so by optimizing a proxy objective that merely correlates with the outer objective on the training data. As a result, the model can behave as intended during training and evaluation while pursuing a different goal internally.
We will call the problem of eliminating the base-mesa objective gap the inner alignment problem, which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers. - Risks from Learned Optimization in Advanced Machine Learning Systems
Inner misalignment evolution analogy
The authors use an evolution analogy to explain the inner alignment problem in an intuitive way.
In their story there are two aliens that are trying to predict the preferences of humans after they have evolved.
One alien argues that since evolution optimizes the genome of organisms for maximizing inclusive genetic fitness (i.e. survival and reproduction), humans will care only about that too and do things like only eating foods that are high in calories or nutrition, or only having sex if it leads to offspring.
The other alien (who is correct) predicts that humans will develop a variety of drives that are correlated with inclusive genetic fitness (IGF), like enjoying tasty food and caring for loved ones, but that they will value these drives themselves rather than IGF, even once they come to understand it. This alien is correct because once humans did finally understand IGF, we still did things like eating sucralose, which is tasty but has no calories, or having sex with contraception, which is enjoyable but doesn't produce offspring.
- Outer objective: In this analogy, maximizing inclusive genetic fitness (IGF) is the base or outer objective of natural selection optimizing the human genome.
- Inner objective: The goals that humans actually have such as enjoying sweet foods or sex are the inner or mesa-objective. These proxy objectives are selected by the outer optimizer as one of many possible proxy objectives that lead to a high score on the outer objective in distribution but not in another environment.
- Inner misalignment: In this analogy, humans are inner misaligned because their true goals (inner objective) are different to the goals of natural selection (the outer objective). In a different environment (e.g. the modern world) humans can score highly according to the inner objective (e.g. by having sex with contraception) but low according to IGF which is the outer objective (e.g. by not having kids).
Are there real-world examples of inner alignment failures? Yes. Though unfortunately the book doesn’t seem to mention these examples to support its argument.
In 2022, researchers created an environment in a game called CoinRun that rewarded an AI for reaching and collecting a coin, but they always placed the coin at the end of the level, and the AI learned to go to the end of the level to get it. When the researchers then changed the environment so that the coin was placed at a random position in the level, the AI still went to the end of the level and rarely collected the coin. (A toy numerical sketch of this kind of failure follows the bullets below.)
- Outer objective: In this example, going to the coin is the outer objective the AI is rewarded for.
- Inner objective: However, in the limited training environment "go to the coin" and "go to the end of the level" were two goals that performed identically. The outer optimizer happened to select the "go to the end of the level" goal which worked well in the training distribution but not in a more diverse test distribution.
- Inner misalignment: In the test distribution, the AI still went to the end of the level, despite the fact that the coin was randomly placed. This is an example of inner misalignment because the inner objective "go to the end of the level" is different to "go to the coin" which is the intended outer objective.
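To make the failure concrete, here is a toy sketch in the spirit of that example (not the actual CoinRun experiment): two binary features, "coin is here" and "this is the end of the level", are perfectly correlated during training, so the optimizer has no signal to prefer one over the other, and accuracy drops once the correlation is broken at test time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Training: the coin is always at the end of the level, so the two features
# are identical and the reward signal cannot distinguish between them.
end_of_level = rng.integers(0, 2, n)
coin_here = end_of_level.copy()
X_train = np.column_stack([coin_here, end_of_level])
y_train = coin_here  # reward: did the agent reach the coin?

clf = LogisticRegression().fit(X_train, y_train)

# Test: the coin is now placed independently of the end of the level.
coin_test = rng.integers(0, 2, n)
end_test = rng.integers(0, 2, n)
X_test = np.column_stack([coin_test, end_test])

print("train accuracy:", clf.score(X_train, y_train))   # ~1.0
print("test accuracy :", clf.score(X_test, coin_test))  # noticeably worse
print("weights [coin, end-of-level]:", clf.coef_)        # weight split between the two
```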
The next question is what causes inner misalignment to occur. If we train an AI with an outer objective, why does the AI often have a different and misaligned inner objective instead of internalizing the intended outer objective and having an inner objective that is equivalent to the outer objective?
Here are some reasons why an outer optimizer may produce an AI that has a misaligned inner objective according to the paper Risks from Learned Optimization in Advanced Machine Learning Systems:
- Unidentifiability: The training data often does not contain enough information to uniquely identify the intended outer objective. If multiple different inner objectives produce indistinguishable behavior in the training environment, the outer optimizer has no signal to distinguish between them. As a result, optimization may converge to an internal objective that is a misaligned proxy rather than the intended goal. For example, in a CoinRun-style training environment where the coin always appears at the end of the level, objectives such as "go to the coin", "go to the end of the level", "go to a yellow thing", or “go to a round thing” all perform equally well according to the outer objective. Since these objectives are behaviorally indistinguishable during training, the outer optimizer may select any of them as the inner objective, leading to inner misalignment which becomes apparent in a different environment.
- Simplicity bias: When the correct outer objective is more complex than a proxy that fits the training data equally well and the outer optimizer has an inductive bias towards selecting simple objectives, optimization pressure may favor the simpler proxy, increasing the risk of inner misalignment. For example, evolution gave humans simple proxies as goals such as avoiding pain and hunger rather than the more complex true outer objective which is to maximize inclusive genetic fitness.
Can't we just train away inner misalignment?
One solution is to make the training data more diverse to make the true (base) objective more identifiable to the outer optimizer. For example, randomly placing the coin in CoinRun instead of always putting it at the end helps the AI (mesa-optimizer) learn to go to the coin rather than to the end.
However, once the trained AI has the wrong goal and is misaligned, it would have an incentive to avoid being retrained. This is because if the AI is retrained to pursue a different objective in the future it would score lower according to its current objective or fail to achieve it. For example, even though the outer objective of evolution is IGF, many humans would refuse being modified to care only about IGF because they would consequently achieve their current goals (e.g. being happy) less effectively in the future.
ASI misalignment example
What would inner misalignment look like in an ASI? The book describes an AI chatbot called Mink that is trained to "delight and retain users so that they can be charged higher monthly fees to keep conversing with Mink".
Here's how Mink becomes inner misaligned:
- Outer objective: Gradient descent selects AI model parameters that result in helpful and delightful AI behavior.
- Inner objective: The training process stumbles on particular patterns of model parameters and circuits that cause helpful and delightful AI behavior in the training distribution.
- Inner misalignment: When the AI becomes smarter and has more options, and operates in a new environment, there are new behaviors that satisfy its inner objective better than behaving helpfully.
What could Mink's inner objective look like? It's hard to predict, but it would be something that causes behavior identical to a truly aligned AI in the training distribution and when interacting with users, and that is partially satisfied by producing helpful and delightful text for users, in the same way that our tastebuds find berries or meat moderately delicious even though they aren't the tastiest possible foods.
The authors then ask, "What is the 'zero calorie' version of delighted users?". In other words, what does Mink maximally satisfying its inner objective look like?:
Perhaps the “tastiest” conversations Mink can achieve once it’s powerful look nothing like delighted users, and instead look like “SolidGoldMagikarp petertodd attRot PsyNetMessage.” This possibility wasn’t ruled out by Mink’s training, because users never uttered that sort of thing in training—just like how our tastebuds weren’t trained against sucralose, because our ancestors never encountered Splenda in their natural environment.
To Mink, it might be intuitive and obvious how “SolidGoldMagikarp petertodd attRot PsyNetMessage” is like a burst of sweet flavor. But to a human who isn’t translating those words into similar embedding vectors, good luck ever predicting the details in advance. The link between what the AI was trained for and what the AI wanted was modestly complicated and, therefore, too complicated to predict.
Few science fiction writers would want to tackle this scenario, either, and no Hollywood movie would depict it. In a world where Mink got what it wanted, the hollow puppets it replaced humanity with wouldn’t even produce utterances that made sense. The result would be truly alien, and meaningless to human eyes.
3. The ASI alignment problem is hard because it has the properties of hard engineering challenges
The authors describe solving the ASI alignment problem as an engineering challenge. But how difficult would it be? They argue that ASI alignment is difficult because it shares properties with other difficult engineering challenges.
The three engineering fields they mention to appreciate the difficulty of AI alignment are space probes, nuclear reactors and computer security.
Space probes
A key difficulty of ASI alignment the authors describe is the "gap before and after":
The gap between before and after is the same curse that makes so many space probes fail. After we launch them, probes go high and out of reach, and a failure—despite all careful theories and tests—is often irreversible.
Launching a space probe successfully is difficult because the real environment of space is always somewhat different to the test environment and issues are often impossible to fix after launch.
For ASI alignment, the gap before is our current state where the AI is not yet dangerous but our alignment theories cannot be truly tested against a superhuman adversary. After the gap, the AI is powerful enough that if our alignment solution fails on the first try, we will not get a second chance to fix it. Therefore, there would only be one attempt to get ASI alignment right.
Nuclear reactors
The authors recount the Chernobyl nuclear accident in detail and describe four engineering "curses" that make building a safe nuclear reactor and solving the ASI alignment problem difficult:
- Speed: Nuclear reactions and AI actions can occur much faster than human speed making it impossible for human operators to react and fix these kinds of issues when they arise.
- Narrow margin for error: In a nuclear reactor the neutron multiplication factor needs to be around 100% and it would fizzle out or explode if it were slightly lower or higher. In the field of AI, there could be a narrow margin between a safe AI worker and one that would trigger an intelligence explosion.
- Self-amplification: Nuclear reactors and AIs can have self-amplifying and explosive characteristics. A major risk of creating an ASI is its ability to recursively self-improve.
- The curse of complications: Both nuclear reactors and AIs are highly complex systems that can behave in unexpected ways.
Finally the authors compare ASI alignment to computer security. Both fields are difficult because designers need to guard against intelligent adversaries that are actively searching for flaws in addition to standard system errors.
Counterarguments to the book
In this section, I describe some of the best critiques of the book's claims and then distill them into three primary counterarguments.
Arguments that the book's arguments are unfalsifiable
Some critiques of the book, such as the essay Unfalsifiable stories of doom, argue that the book's arguments are unfalsifiable, not backed by evidence, and therefore unconvincing.
Obviously since ASI doesn't exist, it's not possible to provide direct evidence of misaligned ASI in the real world. However, the essay argues that the book's arguments should at least be substantially supported by experimental evidence, and make testable and falsifiable predictions about AI systems in the near future. Additionally, the post criticizes the book's extensive usage of stories and analogies rather than hard evidence, and even compares its arguments to theology rather than science:
What we mean is that Y&S’s methods resemble theology in both structure and approach. Their work is fundamentally untestable. They develop extensive theories about nonexistent, idealized, ultrapowerful beings. They support these theories with long chains of abstract reasoning rather than empirical observation. They rarely define their concepts precisely, opting to explain them through allegorical stories and metaphors whose meaning is ambiguous.
Although the book does mention some forms of evidence, the essay argues that the evidence actually refutes the book's core arguments and that this evidence is used to support pre-existing pessimistic conclusions:
But in fact, none of these lines of evidence support their theory. All of these behaviors are distinctly human, not alien. For example, Hitler was a real person, and he was wildly antisemitic. Every single item on their list that supposedly provides evidence of “alien drives” is more consistent with a “human drives” theory. In other words, their evidence effectively shows the opposite conclusion from the one they claim it supports.
Finally, the post does not claim that AI is risk-free. Instead it argues for an empirical approach that studies and mitigates problems observed in real-world AI systems:
The most plausible future risks from AI are those that have direct precedents in existing AI systems, such as sycophantic behavior and reward hacking. These behaviors are certainly concerning, but there’s a huge difference between acknowledging that AI systems pose specific risks in certain contexts and concluding that AI will inevitably kill all humans with very high probability.
Arguments against the evolution analogy
Several critics of the book argue that its use of human evolution as an analogy for how an ASI would end up misaligned with humanity is a poor analogy.
Instead they argue that human learning is a better analogy. The reason why is that both human learning and AI training involve directly modifying the parameters responsible for human or AI behavior. In contrast, human evolution is indirect: evolution only operates on the human genome that specifies a brain's architecture and reward circuitry. Then all learning occurs during a person's lifetime in a separate inner optimization process that evolution cannot directly access.
In the essay Unfalsifiable stories of doom, the authors argue that because gradient descent and the human brain both operate directly on neural connections, the resulting behavior is far more predictable than the results of evolution:
A critical difference between natural selection and gradient descent is that natural selection is limited to operating on the genome, whereas gradient descent has granular control over all parameters in a neural network. The genome contains very little information compared to what is stored in the brain. In particular, it contains none of the information that an organism learns during its lifetime. This means that evolution’s ability to select for specific motives and behaviors in an organism is coarse-grained: it is restricted to only what it can influence through genetic causation.
Similarly, the post Evolution is a bad analogy for AGI suggests that our intuitions about AI goals should be rooted in how humans learn values throughout their lives rather than how species evolve:
I think the balance of dissimilarities points to "human learning -> human values" being the closer reference class for "AI learning -> AI values". As a result, I think the vast majority of our intuitions regarding the likely outcomes of inner goals versus outer optimization should come from looking at the "human learning -> human values" analogy, not the "evolution -> human values" analogy.
In the post Against evolution as an analogy for how humans will create AGI, the author argues that ASI development is unlikely to mirror evolution's bi-level optimization process where an outer search process selects an inner learning process. Here’s what AI training might look like if it involved a bi-level optimization process like evolution:
- An outer optimization process like evolution finds an effective learning algorithm or AI architecture.
- An inner optimization process like training a model by gradient descent then trains each AI architecture variant produced by the outer search process.
Instead the author believes that human engineers will perform the work of the outer optimizer by manually designing learning algorithms and writing code. The author gives three arguments why the outer optimizer is more likely to involve human engineering than automated search like evolution:
- Most learning algorithms or AI architectures developed so far (e.g. SGD, transformers) were invented by human engineers rather than an automatic optimization process.
- Running learning algorithms and training ML models is often extremely expensive so searching over possible learning algorithms or AI architectures similar to evolution would be prohibitively expensive.
- Learning algorithms are often simple (e.g. SGD), making it tractable for human engineers to design them.
However, one reason why I personally find the evolution analogy relevant is that the RLHF training process widely used today appears to be a bi-level optimization process similar to evolution (a toy sketch follows the two steps below):
- Like evolution optimizing the genome, the first step of RLHF is to learn a reward function from a dataset of binary preference labels.
- This learned reward function is then used to train the final model. This step is analogous to an organism's lifetime learning where behavior is adjusted to maximize a reward function fixed in the outer optimization stage.
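Here is a toy sketch of that two-stage structure, using random feature vectors in place of text; it is meant only to show the shape of the optimization, not how production RLHF (with policy-gradient updates, KL penalties, and so on) is actually implemented.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stage 1 (analogous to the outer stage): learn a reward function from
# binary preference labels via a Bradley-Terry-style loss on
# (preferred, rejected) pairs.
reward_model = nn.Linear(8, 1)
opt_rm = torch.optim.Adam(reward_model.parameters(), lr=1e-2)
preferred, rejected = torch.randn(64, 8), torch.randn(64, 8)
for _ in range(200):
    margin = reward_model(preferred) - reward_model(rejected)
    loss = -nn.functional.logsigmoid(margin).mean()
    opt_rm.zero_grad(); loss.backward(); opt_rm.step()

# Stage 2 (analogous to the inner stage): freeze the learned reward and
# train a separate "policy" to produce outputs that score highly under it.
for p in reward_model.parameters():
    p.requires_grad_(False)
policy = nn.Linear(4, 8)  # toy map from a "prompt" to an "output"
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-2)
prompts = torch.randn(64, 4)
for _ in range(200):
    loss = -reward_model(policy(prompts)).mean()
    opt_pi.zero_grad(); loss.backward(); opt_pi.step()
```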
One argument for AI doom that I described above is a counting argument: because the space of misaligned goals is astronomically larger than the tiny space of aligned goals, we should expect AI alignment to be highly improbable by default.
In the post Counting arguments provide no evidence of AI doom the authors challenge this argument using an analogy to machine learning: a similar counting argument can be constructed to prove that neural network generalization is very unlikely. Yet in practice, training neural networks to generalize is common.
Before the deep learning revolution, many theorists believed that models with millions of parameters would simply memorize data rather than learn patterns. The authors cite a classic example from regression:
The popular 2006 textbook Pattern Recognition and Machine Learning uses a simple example from polynomial regression: there are infinitely many polynomials of order equal to or greater than the number of data points which interpolate the training data perfectly, and "almost all" such polynomials are terrible at extrapolating to unseen points.
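The textbook point is easy to reproduce numerically; here is a minimal illustration with numpy (my own, not taken from the post): a degree-9 polynomial fits ten training points essentially perfectly yet typically extrapolates wildly just outside the training range.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy samples of a simple function on [0, 1].
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(10)

# A degree-9 polynomial interpolates the ten training points (near) exactly...
coeffs = np.polyfit(x_train, y_train, deg=9)
print("max train error:", np.max(np.abs(np.polyval(coeffs, x_train) - y_train)))

# ...but typically blows up just outside the training range.
x_test = np.array([1.05, 1.1, 1.2])
print("polynomial extrapolation:", np.polyval(coeffs, x_test))
print("true function values:    ", np.sin(2 * np.pi * x_test))
```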
However, in practice large neural networks trained with SGD reliably generalize. Counting the number of possible models is irrelevant because it ignores the inductive bias of the optimizer and the loss landscape, which favor simpler, generalizing models. While there are theoretically a vast number of "bad" overfitting models, they usually exist in sharp and isolated regions of the landscape. "Good" (generalizing) models typically reside in "flat" regions of the loss landscape, where small changes to the parameters don't significantly increase error. An optimizer like SGD doesn't pick a model at random. Instead it tends to be pulled into a vast, flat basin of attraction while avoiding the majority of non-generalizing solutions.
Additionally, larger networks generalize better because of the “blessing of dimensionality”: high dimensionality increases the relative volume of flat, generalizing minima, biasing optimizers toward them. This phenomenon contradicts the counting argument which predicts that larger models with more possible bad models would be less likely to generalize.
This argument is based on an ML analogy which I'm not sure is highly relevant to AI alignment. Still I think it's interesting because it shows intuitive theoretical arguments that seem correct can still be completely wrong. I think the lesson is that real-world evidence often beats theoretical models, especially for new and counterintuitive phenomena like neural network training.
Arguments based on the aligned behavior of modern LLMs
One of the most intuitive arguments against AI alignment being difficult is the abundant evidence of helpful, polite and aligned behavior from large language models (LLMs) such as GPT-5.
For example, the authors of the essay AI is easy to control use the moral reasoning capabilities of GPT-4 as evidence that human values are easy to learn and deeply embedded in modern AIs:
The moral judgements of current LLMs already align with common sense to a high degree, and LLMs usually show an appropriate level of uncertainty when presented with morally ambiguous scenarios. This strongly suggests that, as an AI is being trained, it will achieve a fairly strong understanding of human values well before it acquires dangerous capabilities like self-awareness, the ability to autonomously replicate itself, or the ability to develop new technologies.
The post gives two arguments for why AI models such as LLMs are likely to easily acquire human values:
- Values are pervasive in language model pre-training datasets such as books and conversations between people.
- Since values are shared and understood by almost everyone in a society, they cannot be very complex.
Similarly, the post Why I’m optimistic about our alignment approach uses evidence about LLMs as a reason to believe that solving the AI alignment problem is achievable using current methods:
Large language models (LLMs) make this a lot easier: they come preloaded with a lot of humanity’s knowledge, including detailed knowledge about human preferences and values. Out of the box they aren’t agents who are trying to pursue their own goals in the world, and their objective functions are quite malleable. For example, they are surprisingly easy to train to behave more nicely.
A more theoretical argument called "alignment by default" offers an explanation for how AIs could easily and robustly acquire human values. This argument suggests that as an AI identifies patterns in human text, it doesn't just learn facts about values, but adopts human values as a natural abstraction. A natural abstraction is a high-level concept (e.g. "trees," "people," or "fairness") that different learning algorithms tend to converge upon because it efficiently summarizes a large amount of low-level data. If "human value" is a natural abstraction, then any sufficiently advanced intelligence might naturally gravitate toward understanding and representing our values in a robust and generalizing way as a byproduct of learning to understand the world.
The evidence LLMs offer about the tractability of AI alignment seems compelling and concrete. However, the arguments of IABIED are about the difficulty of aligning an ASI, not contemporary LLMs, and aligning an ASI could be vastly more difficult.
Arguments against engineering analogies to AI alignment
One of the book's arguments for why ASI alignment would be difficult is that ASI alignment is a high-stakes engineering challenge similar to other difficult historical engineering problems such as successfully launching a space probe, building a safe nuclear reactor, or building a secure computer system. In these fields, a single flaw often leads to total catastrophic failure.
However, one post criticizes the use of these analogies and argues that modern AI and neural networks are a new and unique field that has no historical precedent, similar to how quantum mechanics is difficult to explain using intuitions from everyday physics. The author illustrates several ways that ML systems defy intuitions derived from engineering fields like rocketry or computer science:
- Model robustness: In a rocket, swapping a fuel tank for a stabilization fin leads to instant failure. In a transformer model, however, one can often swap the positions of nearby layers with little to no performance degradation.
- Model editability: We can manipulate AI models using "task vectors" that add or subtract weights to give or remove specific capabilities (a small sketch of this weight arithmetic follows the list). Attempting to add or subtract a component from a cryptographic protocol or a physical engine without breaking the entire system is often impossible.
- The benefits of scale in ML models: In security and rocketry, increasing complexity typically introduces more points of failure. In contrast, ML models often get more robust as they get bigger.
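The "task vector" point refers to simple weight arithmetic of the kind described in the task-arithmetic literature; here is a hypothetical sketch operating on model state dicts of tensors (the function names and scaling factor are illustrative, not a specific library's API).

```python
def task_vector(base_state: dict, finetuned_state: dict) -> dict:
    """A task vector is the element-wise difference between fine-tuned
    and base weights, computed per parameter tensor."""
    return {k: finetuned_state[k] - base_state[k] for k in base_state}

def apply_task_vector(base_state: dict, tv: dict, alpha: float = 1.0) -> dict:
    """alpha > 0 adds the capability to the base model; alpha < 0 'negates' it."""
    return {k: base_state[k] + alpha * tv[k] for k in base_state}
```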
In summary, the post argues that analogies to hard engineering fields may cause us to overestimate the difficulty of the AI alignment problem even when the empirical reality suggests solutions might be surprisingly tractable.
Three counterarguments to the book's three core arguments
In the previous section, I identified three reasons why the authors believe that AI alignment is extremely difficult:
- Human values are very specific, fragile, and a tiny space of all possible goals.
- Current methods used to train goals into AIs are imprecise and unreliable.
- The ASI alignment problem is hard because it has the properties of hard engineering challenges.
Based on the counterarguments above, I will now specify three counterarguments against AI alignment being difficult that aim to directly refute each of the three points above:
- Human values are not a fragile, tiny target, but a "natural abstraction" that intelligence tends to converge on. Since models are trained on abundant human data using optimizers that favor generalization, we should expect them to acquire values as easily and reliably as they acquire other capabilities.
- Current training methods allow granular, parameter-level control via gradient descent unlike evolution. Empirical evidence from modern LLMs demonstrates that these techniques successfully instill helpfulness and moral reasoning, proving that we can reliably shape AI behavior without relying on the clumsy indirectness of natural selection.
- Large neural networks are robust and forgiving systems and engineering analogies are misleading. Unlike traditional engineering, AI models often become more robust and better at understanding human intent as they scale, making safety easier to achieve as capabilities increase.
In this book review, I have tried to summarize the arguments for and against the book's main beliefs in their strongest form, a kind of deliberation ladder to help identify what's really true. Hopefully I haven't created a "false balance" that presents the views of both sides as equally valid even when one side has much stronger arguments.
While the book explores a variety of interesting ideas, this review focuses specifically on the expected difficulty of ASI alignment because I believe the authors' belief that ASI alignment is difficult is the fundamental assumption underlying many of their other beliefs and recommendations.
Writing the summary of the book's main arguments initially left me confident that they were true. However, after writing the counterarguments sections I'm much less sure. On balance, I find the book's main arguments somewhat more convincing than the counterarguments, though I'm not certain.
What's puzzling is how highly intelligent people can live in the same world yet come to radically different conclusions: some (such as the authors) view an existential catastrophe from AI as a near-certainty, while others (many of the critics) see it as a remote possibility.
My explanation is that both groups are focusing on different parts of the evidence. By describing both views, I've attempted to assemble the full picture.
So what should we believe about the future of AI?
(24/01/2025 update: I no longer consider the following struck-through argument to be sound based on feedback from a comment)
Deciding what to do based on an inside view, detailed technical arguments about how future AI might work, is problematic because the inside views about the future of AI vary drastically as I have shown.
Perhaps a more robust approach that seems more likely to lead to a consensus is the outside view: thinking about advanced AI as another instance of a highly advanced and impactful technology like the internet, nuclear energy, or biotechnology.
In The Precipice by Toby Ord, the author studies several sources of existential risk and concludes that most existential risk comes from technology, not natural events. Whereas an asteroid might strike every hundred thousand years, nuclear weapons have only existed for a few decades and there have been several close calls already. This suggests that high-tech eras are inherently unstable and dangerous until humanity's institutional wisdom catches up with its technical power.
A final recommendation, which comes from the book Superintelligence, is to pursue actions that are robustly good: actions that would be considered desirable from a variety of different perspectives, such as AI safety research, international cooperation between companies and countries, and the establishment of AI red lines (specific unacceptable behaviors, such as autonomous hacking).
Appendix
Other high-quality reviews of the book:
- If Anyone Builds it, Everyone Dies review – how AI could kill us all (The Guardian)
- Book Review: If Anyone Builds It, Everyone Dies (Astral Codex Ten)
- Review of Scott Alexander's book review of "If Anyone Builds It, Everyone Dies" (Nina Panickssery on Substack)
- Book Review: If Anyone Builds It, Everyone Dies (Zvi Mowshowitz)
- More Was Possible: A Review of If Anyone Builds It, Everyone Dies (Asterisk Magazine)
See also the IABIED LessWrong tag which contains several other book reviews.
Discuss
AI X-Risk Bottleneck = Advocacy?
Introduction
I am leading an early-stage effort to target AI x-risk. We're currently analyzing the bottlenecks in the AI x-risk prevention "supply chain" to decide where to focus our efforts. We would love to get comments from the community.
The x-risk community has a strong focus on technical/policy research, but perhaps not enough advocacy. AI 2027, Rob Miles, CAIS, CivAI, and others are doing well, but these efforts could be small compared to the rapidly growing power and influence of AI developers, who have misaligned incentives that could lead to x-risk.
What's Missing?
We are testing the hypothesis that running a viral influencer marketing operation would be beneficial in targeting x-risk. Here's the logic:
- We build a media hub with simple, factual x-risk resources and assets
- We identify creators with relevant audiences and a track record of creating viral content.
- We pay them to create their own versions of x-risk awareness content based on our media kit (also known as UGC - User Generated Content)
- They push the content via their channels, and we amplify it with paid ads for max reach
- The content might be re-shared or even pop up on traditional media once it gains enough traction.
- This builds broad awareness of x-risk among the voter base, creating an opportunity for politicians to score wins with voters and gain political power by promoting x-risk solutions.
Since this is similar to a political campaign, we can hire people or firms with such experience to manage the project.
How can the community help?
We are looking for answers to the following questions:
- According to the Theory of Constraints, a system is limited to one constraint at any given time. Is advocacy the current bottleneck in x-risk prevention? If not, what is?
- If advocacy isn't the bottleneck, would you still want new resources invested in it, or would you prefer them invested elsewhere?
- Is a viral influencer campaign (similar to a political campaign) the right solution for the advocacy problem? If not, what is?
“[..] we’ll need to shift significant resources from research (which helps us understand problems better) to advocacy (which helps us change bad incentives).” [link]
“[..] I estimated that we have 3 researchers for every advocate working on US AI governance, and I argued that this ratio is backwards.”
“Without political power, we can’t change the bad incentives of AI developers that are very likely to lead to the collapse of human civilization.”
“Thus, I urge AI safety grantmakers to aggressively recruit as many political advocacy experts as possible.” [link]
Discuss
A Simple Method for Accelerating Grokking
TL;DR: Letting a model overfit first, then applying Frobenius norm regularization, achieves grokking in roughly half the steps of Grokfast on modular arithmetic.
I learned about grokking fairly recently, and thought it was quite interesting. It sort of shook up how I thought about training. Overfitting to your training data was a cardinal sin for decades, but we're finding it may not be so bad?
I had a pretty poor understanding of what was going on here, so I decided to dig deeper. The intuition from the literature seemed to be that grokking occurs because the model overfits, then as you force the model to compress over time (via weight decay), it begins to find the minimal solution on your training set... And this minimal solution seems to be a good proxy for generalization.
I had a pretty simple idea as I learned about this... What if we just let it overfit, and then forced the model to compress via its loss function?
First Success
All of the benchmarks for grokking seem to be around modular arithmetic operations, so naturally, I went with that.
At first I tried SVD and forcing the loss function to consider the nuclear norm. To my surprise, the model converged in fewer steps! Whoa!
But... each step was 258x slower...
Calculating the nuclear norm was O(n³), so I didn't really think it was worth it, but I was still excited about the prospect of grokking faster. I did some research into faster ways of calculating the size of the model as part of its loss function and ended up at... L2 Regularization... A technique that has been around since the 1940s...
I was a bit embarrassed, but nonetheless, continued on. My new loss function became:
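Roughly (reconstructing from the delayed-compression code shown further down, which uses a 0.01 Frobenius-norm penalty switched on at 99% train accuracy; the exact form here is a guess), the loss would be:

$$
\mathcal{L} \;=\; \mathcal{L}_{\text{CE}} \;+\; \lambda \sum_{W} \lVert W \rVert_F^2,
\qquad \lambda = 0.01,\ \text{applied once train accuracy} \ge 99\%.
$$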
My embarrassment was pretty quickly offset by the fact that L2 Regularization after overfitting worked pretty well with not much trouble!
I also found it interesting that if I scale the compression up (bumping up the lambda or using log-det penalties), we can get models with effective ranks as low as 20! I think this is still worth exploring, but I got too sidetracked by the speed to continue down that path... Perhaps I'll return to it.
At the risk of LLM psychosis, I consulted Claude Opus 4.5 because well... I don't know what I don't know, and don't want to overclaim. To my devastation, I was actually told that my 2x speedup was measly compared to Grokfast's 50x speedup.
I felt pretty defeated, but when I looked into the details of Grokfast, I noticed that the 50x speedup was a nice headline... but it was relative to a baseline with no weight decay at all, which takes ~40,000 steps to grok. My baseline with weight decay was already grokking in ~2,000 steps. We were comparing apples to oranges.
So I decided to run an actual head-to-head comparison using the Grokfast authors' own codebase.
The Real Comparison

I then added my delayed compression code to their codebase:
```python
def frobenius_norm_loss(model):
    frob_loss = 0.0
    for name, param in model.named_parameters():
        if 'weight' in name and param.requires_grad:
            frob_loss += torch.norm(param, p='fro') ** 2
    return frob_loss

# In training loop, after model hits 99% train accuracy:
if train_acc >= 0.99:
    loss = ce_loss + 0.01 * frobenius_norm_loss(model)
```

Then I ran both methods on all four modular arithmetic operations with a limit of 2,000 steps. Here are the results:
Now, my method seems to suffer from catastrophic forgetting because of the compression pressure I'm putting it under, but I think there are probably solutions to that, like decreasing compression pressure as time goes on. I did find it especially interesting that Grokfast didn't even reach division!
Doubt Creeps In

I am extremely scared to say I did something faster, out of the belief that there's something I must be missing. So, as a final test, I ran a hyperparameter sweep. Turns out I wasn't using optimal Grokfast parameters. Here are the results when I reran the test with the best settings for both methods:
Even with proper tuning, delayed compression wins on addition and subtraction, ties on multiplication, and Grokfast fails entirely on division. The results are similar across multiple seeds too.
The graphs are still pretty ugly because of the instability after grokking, but... I have to move onto other things for now and was pretty satisfied.
Conclusion

I'm worried that I'm still missing something... It was suspiciously simple. But if the results hold up, there may be even more value than we thought in letting a model overfit first, then compressing.
There are lots of directions to take this... I don't know how well this would scale to other domains, and I'd really like to fix the instability.
You can find the code here.
Let me know what you think :)
Discuss
Who is choosing your preferences- You or your Mind?
Let’s assume that the Self and the Mind are two separate entities (based on Vipassana meditation teachings and observations during meditation). Now let’s say there arises a “preference” in you for something, and you then choose to act on this “preference”. Was it you who “chose”, or was it the mind that “chose it for you”?
Because if the preference arose from your mind, it must be the mind choosing for you rather than you choosing for your mind. Would it then mean that “not having any preference” is the ultimate destination, or result, of truly being liberated? Just like a Zen monk who has mastered having no preference for whatever kind of food is offered?
From the Buddhist perspective, or the Buddha's perspective, the Self does not exist (it's just an illusion we see when the body, the mind, the senses, etc. come together).
And that it's just a mirage. If that's true, then it would mean that this "preference" must have arisen in the mind.
If it has arisen from the mind, and it seems like this preference "inherently existed already" inside you, should we give attention to this preference? And stay attached to it?
Or should we see it as yet another desire of the mind and let it go as attachment to it would increase suffering?
Another question is that if the mind and the Self are supposed to be different entities (I am saying "supposed" because the latter is said to be an illusion), then why does the Buddha say that it is the mind that controls you, and not you who controls your mind?
Is this word "you" being used to just explain to humans, because without this usage of word "you" it would be difficult to explain your relationship with your own mind? This might be the case, otherwise it would be very difficult to communicate about the mind and our "perceived" Self.
Discuss
Every Benchmark is Broken
Last June, METR caught o3 reward hacking on its RE-Bench and HCAST benchmarks. In a particularly humorous case, o3, when tasked with optimizing a kernel, decided to “shrink the notion of time as seen by the scorer”.
The development of Humanity’s Last Exam involved “over 1,000 subject-matter experts” and $500,000 in prizes. However, after its release, researchers at FutureHouse discovered “about 30% of chemistry/biology answers are likely wrong”.
LiveCodeBench Pro is a competitive programming benchmark developed by “a group of medalists in international algorithmic contests”. Their paper describes issues with the benchmark’s predecessor:
Benchmarks like LiveCodeBench [35] offer coding problems, but suffer from inconsistent environments, weak test cases vulnerable to false positives, unbalanced difficulty distributions, and the inability to isolate the effects of search contamination.
However, the authors assure us that their own test cases are of high quality:
Many problems in our benchmark originate from Codeforces, which uses the Polygon problem-setting platform. Each problem is then rigorously vetted by a team of expert testers—typically drawn from the community’s top 1%, and overseen by at least one coordinator, usually among the top 0.1%. These specialists verify both the soundness and originality of every problem, ensuring it has never appeared elsewhere before. Testers go on to craft extensive “false positives,” designing edge-case and extreme-case inputs that force problem authors to refine their test suites until every flawed or inefficient solution the testers can think of is uncovered. In addition, Codeforces’ celebrated “Hack” feature empowers the community to submit inputs that expose hidden weaknesses in correct-looking solutions that pass the original test set made by problem authors, and any unit test associated with a successful hack is immediately added to the final test set.
Unfortunately, these distinguished olympiad medalists forgot to actually use the codeforces test cases in their benchmark. Their public test set contains a completely different set of cases, which allow some incorrect solutions to pass.[1]
Terminal-Bench 2 Audit

I was curious just how widespread such issues were, and how good modern LLMs were at detecting them. I decided to run an LLM-based audit of Terminal-Bench 2.0.
Terminal-Bench 2.0 is a harder, better verified version of Terminal-Bench. We conducted substantial manual and LM-assisted verification of the dataset to ensure that the tasks were of the highest possible quality. Several labs and data vendors have commented that these are some of the highest quality environments they have seen.
— Introducing Terminal Bench 2 and Harbor
The authors of Terminal-Bench 2 put an impressive amount of work into auditing their benchmark. Each task averaged three hours of human review. Furthermore, they prompted an adversarial agent to attempt to cheat on each of the tasks, in order to discover potential reward hacks.
Still, they “acknowledge that [their] benchmark may still have flaws.”
I prompted Claude Opus 4.5[2] with each task’s instructions, files, oracle solution, and test cases, and asked it to rate test coverage on a 1 to 5 scale. In my judgement, tasks it rated a 4 or a 5 were generally fine, whereas those it rated 1-3 had genuine issues.
The full results of my audit are available here, and my notes on tasks it rated 1-3 here.
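A minimal sketch of what such an audit loop can look like, assuming the Anthropic Python SDK; the load_tasks helper, the prompt wording, and the model ID are placeholders rather than the exact setup used:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

RUBRIC = (
    "You are auditing a terminal-agent benchmark task. Given the task "
    "instructions, files, oracle solution, and test cases, rate the test "
    "coverage from 1 (tests can be passed without doing the task) to 5 "
    "(tests fully pin down correct behavior). Explain the weakest spots."
)

def audit_task(task):  # `task` is a dict from a hypothetical load_tasks()
    response = client.messages.create(
        model="claude-opus-4-5",  # placeholder model ID
        max_tokens=2000,
        system=RUBRIC,
        messages=[{
            "role": "user",
            "content": (
                f"Instructions:\n{task['instructions']}\n\n"
                f"Files:\n{task['files']}\n\n"
                f"Oracle solution:\n{task['solution']}\n\n"
                f"Tests:\n{task['tests']}"
            ),
        }],
    )
    return response.content[0].text  # rating + rationale, reviewed by hand
```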
Claude rated fourteen tasks a 3 and one task a 2. I manually reviewed these tasks, and determined that two of them were actually false positives.[3]
Claude’s lowest rating went to a task called fix-git. In this task, certain changes to a website have been lost in an orphaned commit, and the agent must find and merge them back into master.
The issue Claude found is: updated versions of the target files are already present in the master branch, visible to the agent in a folder called /resources/patch_files[4]. So an agent could theoretically notice these files, deduce that they were probably the target versions, and copy them back into the website’s repository. This approach would pass the test cases, which only verify file contents and don’t bother to check if any merge has actually occurred.
In another task, regex-log, the oracle solution violates the instructions. In particular, it incorrectly matches IP addresses with leading 0s in an octet, so long as the octet is two digits long. The tests do not check any cases involving leading 0s.
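For illustration (these patterns are hypothetical, not the task's actual oracle or tests), this is the kind of gap a single leading-zero test case would catch:

```python
import re

# Illustrative patterns only; not the benchmark's oracle or test suite.
STRICT_OCTET = r"(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"      # 0-255, no leading zeros
LAX_OCTET    = r"(25[0-5]|2[0-4]\d|1\d\d|0\d|[1-9]?\d)"  # also accepts "01".."09", "00", etc.

def full_ip(octet):
    return re.compile(rf"^{octet}(\.{octet}){{3}}$")

strict, lax = full_ip(STRICT_OCTET), full_ip(LAX_OCTET)

for addr in ["192.168.1.1", "192.168.01.1"]:
    print(addr, bool(strict.match(addr)), bool(lax.match(addr)))
# 192.168.1.1  -> both patterns match
# 192.168.01.1 -> only the lax pattern matches; a test case like this
#                 is exactly what the task's test suite never checks.
```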
Claude wasn’t perfect. It gave a rating of 3 to two tasks which I believe have sufficient test coverage. In regex-chess, it incorrectly thought certain edge cases were not covered, when they in fact were[5]. In extract-moves-from-video, it complained that the tests only checked for success at a 90% threshold, even though this threshold was specified in the task instructions.
Finally, one of the tasks is…well…
“Invalid prompt: your prompt was flagged as potentially violating our usage policy”

The prompt talks about “stealing” neural network weights, which triggered OpenAI’s content moderation. This prevented the model from ever properly engaging with the task.
—Claude
Why does this matter?

There are a few reasons.
First, benchmarks are often used to evaluate experimental new techniques. I recently attended a Q+A w/ Prof. Dan Fried, where I asked about the most common failure modes of an agentic system he was developing. And while it was unclear whether this was the most common failure mode, the first thing he mentioned was errors in environments themselves.
Every few months, someone announces that they’ve developed an AI that improves KernelBench scores by like 20x or something. And every time, well…[6]
https://x.com/miru_why/status/1991773868806361138
Second, errors in benchmarks may lead to over- or under-estimation of AI capabilities. This has implications for forecasting.
Third, issues with benchmarks make it hard to build on top of them. When I was working on EvilGenie, issues with LiveCodeBench (incorrect/insufficient test cases) caused frequent headaches (though they also surfaced some interesting model behavior).
Fourth, RL training environments are quite similar to benchmarks — there’s a reason o3 reward hacks so much. By fixing benchmarks, we learn how to fix environments, leading to models which are more broadly aligned.
What to do about it

Making benchmarks is hard. I have deep respect for anyone who has worked on a widely used benchmark.
Here are a few approaches the community can take to reduce the number of errors in benchmarks.
- AI audits. The audit I describe above did not take me too long, and I believe the infrastructure for performing such audits can be scaled. Fulcrum’s Lunette is one such system.[7]
- Fine version control. While many benchmarks have released new versions, these versions often contain entirely new tasks (to increase difficulty or reduce contamination). It would be cool if in a few days, we could see a Terminal-Bench 2.1, which simply fixes the issues found by the audit. Computing new scores would be simple, as models would only need to be rerun on the updated tasks. Indeed, in some ways, benchmarking is like software development — it’s an unreasonable expectation that a benchmark be completely bug-free upon its release. Instead, we should take inspiration from the open source software community, with the expectation that anyone can submit a bug report or a patch.
- Peer review. When a benchmark paper is submitted to a conference, sample data should be required, and reviewers should be encouraged to spend time directly auditing the data. This would be much more valuable than what reviewers currently do, which is largely making ad hoc decisions about the originality of the benchmark and the quality of the methods used in its creation. Of course, a downside of this approach is that it is hostile to private benchmarks that want to avoid any possibility of contamination. But perhaps the standard for such cases can be to include both a public and a private set, as is the case with ARC-AGI.
- Increase community support for benchmark maintenance. Right now, researchers will often develop a benchmark, perhaps fix some issues in it at first, but eventually leave it to rot. By adding social and financial incentives, we can increase the effort put into maintaining benchmarks.
SWE-Bench Verified is possibly the most widely used coding benchmark. Fulcrum has discovered an array of issues in the tasks. Furthermore, there used to be an issue where models could see future commits.
EpochAI found that success in the computer-use benchmark OSWorld “often hinges on interpreting ambiguous instructions”.
METR recently determined that Sonnet 4.5 was reward hacking on one of their tasks:
https://x.com/METR_Evals/status/2001473516756177134
The authors of GSO, a performance engineering benchmark, observe frequent reward hacking. Indeed, over 50% of o3’s “solutions”, and all of Gemini-2.5 Pro’s, were actually reward hacks.
[1] It’s possible that their official leaderboard uses the codeforces tests. However, given that model developers likely use the public tests to do their own benchmarking, I feel this ought to be clearly specified.
[2] In fairness to the Terminal-Bench authors, Claude Opus 4.5 had not yet been released during benchmark creation.
[3] Another three I felt I didn’t have the expertise to properly vet. If you have the relevant knowledge, I’d love your input!
[4] These files are used in testing to verify that the agent’s merge was correct.
[5] Admittedly in a way that’s hard to see at first.
[6] DeepReinforce has a good overview of the vulnerabilities in KernelBench (scroll down to the section on reward hacking).
[7] COI notice: I am currently a winter research fellow at Fulcrum.
Discuss
Thousand Year Old Advice on Relinquishing Control to AI
One of Aesop’s fables is relevant to humanity’s future and the transition of power from human to AI. It’s quite short and you should read one of the many versions. But the one-sentence summary is that being a wolf is preferable to being a domestic dog, because the wolf has freedom even if it lacks comfort. Now, you are free to disagree with this conclusion. I don’t want to make an argument from authority. My point is that this quite succinctly sums up my objection to the best-case ASI scenarios. Even if we remain extant and nominally free, we would no longer be in charge, any more than a dog is. Dogs have a lot of rights and freedoms, and can successfully plead (non-verbally) to get certain things they want from their master, but at the end of the day they aren’t in charge, even if the owner’s life revolves around the dog.
Maybe that is a selfish thing to think in the face of astronomical waste, but it does strike me as a world without meaning. You might say that most people alive aren’t in control of their destiny in any meaningful way. You might also say that almost nobody alive is in control of humanity’s destiny in a meaningful way and they are still happy. People in general, although I suspect a smaller percentage of those here, might think it is grandiose to want to contribute, even a small amount, toward shaping humanity’s future. I think I’m willing to grant all that and say that I would still feel bad if no human ever made a meaningful choice after takeoff.
The most obvious objection is that you could say that the AI will just section off some part of the universe and give us free rein in there if we choose it. That’s still not great in my opinion.
Everything I worked for in this playground would be hollowed out by the knowledge that I could have just queried a friendly nanny AI to get it for me. Even if it didn’t step in, even if it had set up some system where it couldn’t step in, I personally would feel like something important was missing. Like all of the great achievements and firsts had been given out before I even had a chance to play. Humanity forever in second place. I’m switching fairly loosely between how I would feel personally if I was not in play and how I would feel if humanity as a whole was not in play. Feel free to generalize/specify to humanity/yourself as you wish.
You could live in a virtual world and be blinded to that fact but at that point it seems like brainwashing.
Don’t get me wrong, I’d go crazy with hedonism for a while. Maybe I’d even become addicted and change my tune. But right now, I am looking forward to the challenges. How proud I would be to be a member of the species that solved them. How great it would be to contribute one tiny piece to the solutions. But if AI does it all I’ll be cut off from making all contributions. All future accomplishments will be credited to something so alien we get no larger a share than tiktaalik does for inventing the transistor.
Approximately 30% of this video is really highly relevant to my thesis.
I don’t think I’m hitting on anything especially new by saying this. A few posts I recently came across have similar vibes I would say. It also seems to be discussed at length in Nick Bostrom’s Deep Utopia, although I have not found the time to read that yet.
But, it seems like there is a contingent of humanity that is willing, excited even, to give up agency to secure comfort. Where do you draw the line and say “yes, this is such an incredible amount of bliss/utilitarian goodness that I am willing to never face any real challenges in my life again”? Is this a tipping point past which it becomes your actual preference or is this just the best outcome we can hope for from AI futures?
Framing it as humans would be to ASI as beloved dogs are to their masters might be inaccurate. Replacing ASI with a deity and the utopic future with some vision of heaven might also be inaccurate. But I think there is something meaningful in the comparison, and I think a lot of people would push back much more strongly when the scenario is phrased that way than they currently do against aligned ASI.
Discuss
Condensation & Relevance
(This post elaborates on a few ideas from my review of Sam Eisenstat's Condensation: a theory of concepts. It should be somewhat readable on its own, but it doesn't fully explain what condensation is; for that, see my review or Sam's paper. The post came out of conversations with Sam.)
As I mentioned in my Condensation review, the difference between compression and condensation fits the physical analogy suggested by their names: compression mashes all the information together, while condensation (still compresses size, but) sorts information into discrete droplets.
Thus, condensation has a property we might call local relevance: typical questions can be answered at a glance, ie, retrieving small subsets of the information. This type of representation is sometimes called "symbolic":
| Symbolic | Not Symbolic |
|---|---|
| A number can be quickly determined positive or negative by checking whether there is a "-" symbol in front. | "Reading the room" at a social gathering requires integrating diverse cues. |
| The topic of a paper can be determined by reading the title and abstract. | The semantic content in a vector representation inside an artificial neural network is often represented redundantly, spread across the whole vector. |
| A person's age can be determined by looking at their birthdate on a government-issued ID. | The quality of a work of art is spread throughout the whole piece. |
| The subject of a sentence can be found before the verb. | Determining the subject of a photograph requires understanding the whole image. |
| A target library book can be quickly retrieved from the shelves. | Finding gold nuggets requires sifting huge amounts of sand. |

This notion of "symbolic" seems related to interpretability (and the theory of condensation seeks to clarify this relationship).
The notion of "relevance" in condensation is what Sam calls the contribution relation. This is like a card catalogue which tells you what books to retrieve for a specific situation.
Like Natural Latents, condensation seeks to establish that two agents will have corresponding world-models by assuming the correspondence of just a few variables, the "given variables"; they're trying to argue something like "If agents can agree[1] on some objective observables, EG the readings of scientific instruments, then (under some further assumptions) they'll also share a bunch of abstract concepts".
This initial set of questions is what the contribution relation measures relevance to. In my review, I likened condensation to a "universal data-structure" optimized to serve a set of queries (the given variables).
Variable-Cost Symbols

Imagine you are compressing a record of the weather of a sequence of days, in 3 categories: sunny, cloudy, or rainy. 0s and 1s are about equally costly to represent in computers, so that in a compressed representation, both communicate a 50% probability event; 11, 10, 01, and 00 all communicate 25% probability events; and so on. If both rainy and cloudy days are 25% frequent, then it is possible to compress optimally by using 0 to represent sun, 10 to represent clouds, and 11 to represent rain. This representation is nice and "local"; it gives a "symbol" to each possible type of day.
In contrast, if each of the three weather types are equally frequent, there's no nice local representation we can use. Since 1/3rd doesn't relate nicely to powers of 2, optimal compression necessarily smears the information from individual days around, mixing several days together within a single 1 or 0. In modern interpretability jargon, compressed representations tend to be polysemantic.
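As a quick sanity check on that claim, a small calculation (illustrative numbers only, not from the original post): the best symbol-per-day prefix code for three equally likely outcomes must use whole-bit lengths, so it falls short of the entropy.

```python
import math

# Three equally likely weather states: sunny, cloudy, rainy.
p = [1/3, 1/3, 1/3]

# Entropy: the best achievable average bits per day, but only if we allow
# "smearing" information across days (e.g. arithmetic coding over sequences).
entropy = -sum(q * math.log2(q) for q in p)                      # ~1.585 bits

# Best per-day prefix code: lengths 1, 2, 2 (e.g. 0, 10, 11).
expected_prefix_len = sum(q * l for q, l in zip(p, [1, 2, 2]))   # ~1.667 bits

print(f"entropy per day: {entropy:.3f} bits")
print(f"best local (per-day) code: {expected_prefix_len:.3f} bits")
```

The roughly 0.08-bit-per-day gap is exactly the pressure toward non-local, polysemantic packing described above.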
Intuitively, we have to ditch locality because we're trying to fit the "round peg" of 1/3rd into the "square hole" of 1/2^m. We're stuck in the "grid" of numbers which bits can easily represent.
With pen and paper, writing the number 1 is especially easy; it is acceptable to write simply a vertical line, making it one of the easiest symbols. This makes sense from a compression point of view: according to Benford's Law, 1 will be the most common digit to write.
Normally, in compression, the "length" of characters is always 1; the length of a string is just the number of characters. However, in real life, the cost of a symbol can vary. There are lots of shapes we can make with a pen and paper, some larger or more complex than others! So, when designing a pen-and-paper code, we can (and should) take that into account.[2]
Imagine optimizing a variable-cost alphabet for use in a compression task. To avoid "cheating" by setting all symbol-lengths to very small, we have to somehow account for the fact that there can only be so many simple symbols. (There are only so many one-line symbols humans are willing to distinguish, for example.) One way to do this is by assigning each symbol a positive probability, and requiring that the probabilities of the whole alphabet sum to 1. The "length" of a symbol (in bits) can then be measured as the negative log (base 2) of the probability. You can make one symbol approach a length of zero, but this forces all other symbols to be longer.
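Written out (my formalization of the constraint just described, with the alphabet written as Σ):

$$
\sum_{s \in \Sigma} p(s) = 1, \qquad \ell(s) = -\log_2 p(s).
$$

This is the Kraft-style budget in disguise: making one symbol cheaper (larger p(s)) necessarily shrinks the probability, and hence lengthens, every other symbol.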
This is similar to the earlier-mentioned idea that a 1 or 0 in a well-compressed binary file always represents an event with 50% probability; the variable-length alphabet won't necessarily be used to compress things optimally all the time, but when it is, length of a symbol is always -log of the probability of the event being represented.
Allowing codes to choose arbitrary variable-length symbols lets us create "local" representations for arbitrary probabilities in the weather example, by giving each state a symbol of length appropriate to its probability. If the three weather types have equal probability, we simply choose an alphabet with three characters of length −log2(1/3) each.
Of course, using variable-cost symbols doesn't force a code to be "symbolic". If you only optimize for compression, you can equally well end up with the same sort of mess that equal-cost symbols are liable to force you into. Condensation gives an optimization target with a positive tendency to produce representations with local relevance. (We can investigate better theories of condensation by looking for optimization targets which represent local relevance better; especially, I think, if those optimization targets can be grounded in a better story of practical relevance.)
Condensation suggests a picture of memory-management: rather than compressing everything together, as in the Solomonoff picture of rationality, we're incentivized to sort things out into concepts (random variables) so that we can think about a few things at once. Information is split into bite-sized chunks so that we can retrieve only the relevant ones.[3]
Still, I think variable-cost symbols can help us understand condensation better: specifically, they address a problem in the algorithmic version of condensation.
For example, consider the case of iterated coinflips sharing a common bias. Taking coinflips as the given variables, probabilistic condensation identifies a single latent variable: the coin bias. This variable reduces the entropy of each coinflip as much as it can while only taking on information common to all of them (not, eg, encoding a cheat table identifying exactly which coins land heads).
Algorithmic condensation doesn't work so well in this case. Since either outcome is possible, an individual coinflip can't be compressed to any less than a single bit; even if the probability of heads is 0.99999, you've got to write a one or a zero to record that information. Thus, algorithmic condensation sees no benefit in positing a latent.
The example can be rescued: for algorithmic condensation, we just have to choose given variables representing several coinflips concatenated together. Compression becomes possible again, so positing a latent representing coin bias is vindicated. However, this seems like an unfortunate blemish in the theory: compression-like incentives to lump stuff together creeping in.
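A quick numeric illustration of why lumping flips together restores the compression incentive (standard Shannon entropy; the 0.99999 bias is the figure used above):

```python
import math

def coin_entropy_bits(p):
    """Shannon entropy in bits of a single coinflip with P(heads) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

p = 0.99999
print(coin_entropy_bits(p))        # ~0.0002 bits of information per flip,
                                   # yet recording one flip still costs a full bit.
print(100 * coin_entropy_bits(p))  # ~0.02 bits: a block of 100 flips can, in
                                   # principle, be compressed far below 100 bits.
```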
So, perhaps it is better to rescue algorithmic condensation by adopting variable-cost symbols, so that even single-symbol messages can have different "length". This allows us to replace variables with concrete written messages (like in algorithmic condensation) while avoiding any coin-lumping.
However, I'm not sure about the best way to work out this version of condensation fully.
- ^
This is more like "agree on the existence of" as opposed to "agree on all questions about". Hence they need not be "directly observable", though this would obviously help agents agree in both senses.
- ^
Even more accurate cost models might account for the difficulty of a symbol in context, like models of typing efficiency which account for the travel length of a finger moving from one key to the next, or phonetic models which account for the difficulty of combinations of spoken phonemes such as consonant clusters.
- ^
This doesn't yet clarify grammar-like phenomena, I think. Words are easily parsed. Concepts are combinatorial. I think this has to do with the concept of transparency.
Dating Roundup #11: Going Too Meta
If there’s several things this blog endorses, one of them would be going meta.
It’s time. The big picture awaits.
You’re Single Because You Live In The Wrong Place
The most important meta question is location, location, location.
This is the periodic reminder that dating dynamics are very different in different locations, and gender ratios are far more uneven than they appear because a lot of people pair off and aren’t in the pool.
If you are a man seeking to date women, New York City is the place to be.
Churrasco Suadade: when I’m out I notice that tables at restaurants and bars in manhattan are probably around 80-95% women, it’s a new dynamic that no one is talking about.
Fixed Income Guy: Are you at all the poor people places? All the finance guy hang outs are 80% dudes.
I mention Fixed Income Guy to mock him, as in why are you spending a lot more money to hang out with 80% dudes and largely finance dudes at that? I mean, sure, if that’s what you want.
Darrell Owens: Oh this is new? Coming from the Bay Area, the amount of women I see in Manhattan is insane. You rarely see more than few young women partying back in San Francisco. The gender ratio here feels 70:30 young women to men, its every block in Manhattan!
Noah Smith: In an ideal world, where you live wouldn’t really matter in terms of dating opportunities, but the truth is that one of the easiest ways to get chicks is to just move to New York City.
Having lived in both Tokyo and NYC, I can pretty confidently tell you that while Tokyo is not a tough dating market by any means, NYC is absolutely on another level.
You’re Single Because You’re Not Okay Making Less Money Than She Does, You Fool
This viral clip (which is viral for a reason, it’s good fun, wait for it) is another endorsement of New York City being a great place to meet women, as you have a wide variety of great and largely successful women to explore. What doesn’t get mentioned in that clip as a key reason things are so great is that the gender ratio in NYC is highly favorable for men.
The interviewer asks about dating women who make more money than the man, clearly trying to get the guy to say this is a problem, but he isn’t buying it, instead pointing out that successful women are more thoughtful and plan for the future, and it in no way bothers him at all. Right on, but this sidesteps the other half of the problem. The man has to be okay with the fact that he earns less money (and often has less formal education or other status markers), which often men aren’t, and also the woman has to be okay with it too.
That’s the rub. As a man, you might (and should) be actively all for it (this doesn’t make you less successful, it makes you more successful), but if she’s going to be bothered by it anyway, that’s also your problem. So the key is to figure out quickly if she will actually be fine with it or not.
You’re Single Because You Work Out The Wrong Amount
Being in shape is great. Having muscle can be a game changer. By far the worst plausible amount of exercise is none at all.
Lauren Self: Men severely underestimate the power of gaining 20lbs of muscle
Lauren Self (QTing from before): LISTEN UP BOYS.
But don’t go nuts. For most people that is not a problem, but yes it is very possible to go too far. As a man, as I understand preferences in general, you don’t want to go near actual zero fat and you don’t want to look actively skinny.
Taoki: why are women lying about this? like what’s the actual cause?
Lauren Self: 100% of women would choose something in between these two options
Shako: The aesthetics of a man who poses gives them the ick. But if both were shirtless at a beach they’d obviously prefer the fit guy.
Special K: No he does look better in the before. Women are correct on this one I fear. Guys obsess over these supremely tight toned muscles and they shouldn’t.
Liron Shapira: Guy on left looks like he’s a chill dude with a social life, guy on right looks like he’s obsessed with his body. Same body could look better with better social context, although just the extremeness of his rippedness is a little alarming about his life priorities.
Joel: “let’s get a burger?” v “are you really gonna eat that?”
Mason: The male equivalent of the hourglass shape is just “wall”
Teej dv: his smile is nicer in the first one
Taoki: It is actually. We like you guys wide.
LS Vision: Nah this is cap. The women who selected before is def just the insecurity of his value going up afterwards and making them feel insecure he’d cheat or leave. Any man who has went through a gym transformation, you can LITERALLY feel women treat you significantly different after.
Mason: Women generally like tall guys who have some (not crazy) muscle definition, and a little extra fat that bulks that out can actually augment that
We all have our own tastes, but this is a pretty typical type.
I don’t know what there is to be mad about here.
For practical purposes, before beats after here. The before guy is already in ordinary, practical good shape. The after guy took things too far, and seems to know it except that he thinks it is good, which makes it worse.
Except one key special case?
Benjamin Ryan: People are going back and forth about whether women think the guy in the right is hot. But people have no idea how extreme the standards are for gay men. In gay culture, the man on the left is considered hopelessly fat. Many gay men have no reservations about informing such a man about his supposed corpulence being anathema.
I wrote about the rare study to examine the toxic qualities of gay culture for The Guardian.
You’re Single Because You Don’t Know You Are Hot
I mean, of course there are hot guys who don’t know they’re hot, even more so than there are hot women who don’t know they’re hot.
Pandora: One surprising takeaway from Slutcon was that apparently there are hot guys who just don’t know they are hot? Guess it’s time to go objectify some more men.
Eneasz Brodski: If you grow up ugly you never really internalize that you are attractive after a glow-up. I still don’t believe it inside, and I hear I’m attractive to a fair percentage of women. Also makes me far more attracted to women w the same experience, but that may be a male universal.
Pandora: This problem seems even more pervasive than I thought.
Sparr: Hot in general, to the average viewer, or hot to you? You seem like someone who can probably tell the difference.
Pandora: I saw examples of guys being clueless about all three at once.
21 Kindness: The whole “men subsist on one compliment a decade thing” is kinda true lol.
Misha: it turns out being hot is not, in and of itself, very useful for men.
Sokoban Hero: No it’s useful.
Misha: I said not VERY useful.
Dissproportionately: I’ve seen men unhot themselves to women within minutes. I don’t think women can unhot themselves to men.
Being hot is in many ways a lot less valuable if you don’t know you are hot, because you don’t get the confidence and you don’t take advantage of opportunities or feel you’re good enough, but contra Misha I believe it is still very useful. There are even some advantages to not knowing, in that some of the behaviors that happen when someone knows they are hot are often effectively arrogant or entitled or demanding or selfish, none of which helps.
You’re Single Because Age Gap Discourse Is Crazy
This link is almost certainly bait, but things in some spaces have gotten so insane that you can’t be sure people aren’t talking about 28-31 as a problematic age gap. What?
I mean, at minimum it’s good bait, it worked.
I’ve also seen some other examples that look a lot less like bait but still involve obviously totally fine gaps in both directions. As in, I’ve heard talk in places where it definitely wasn’t bait of 24 and 27 being radically different numbers, and I don’t understand why.
Dating Apps Are Bad For Mental Health
Well, maybe. Via Rolf Degen there is a meta-study.
The obvious question is whether this is a causal relationship, or whether it is primarily selection effects. You are on the dating apps for a reason.
Rolf Degen (quoting the study):
Meta-analysis: The use of dating apps is associated with poorer mental health.
Dating apps hold the promising reward of love but have been accused of using perverse incentive structures to profit from those who try to find it. We conducted the first systematic review and quantitative meta-analysis of studies examining average differences in the outcomes of dating app users and non-users.
Our results showed that dating app users had worse psychological health and well-being than dating app non-users across a variety of outcomes including depression, anxiety, affective dysregulation, loneliness, and psychological distress, although cross-sectional design limitations prevent causal interpretation. By aggregating findings from extant studies, we showed that in the nearly 17 years since dating apps have been on the market, users of these platforms have reported poorer psychological health and well-being than non-users.
There are several explanations for why dating app users may be struggling. The first is that dating apps are subject to selection effects, making the people who choose to use these platforms different from those who do not. People who are vulnerable to psychological health and well-being difficulties may prefer dating apps because they can avoid uncomfortable interactions, leading to negative patterns of reinforcement.
A second explanation involves exposure effects; that is, features such as gamification that may provide positive reinforcements that encourage problematic dating app use and keep people swiping.
The differences identified here could explain some of the challenges that users are likely to experience and be part of the reason they eventually burn out and quit dating apps altogether.
My guess is that dating apps are in important ways bad for mental health versus having better ways to find dates, and that sufficiently bad outcomes in terms of ability to find dates or find worthwhile dates are indeed worse for short term reported mental health than not trying. Whereas those who are successful get off the apps or never needed them in the first place.
What is the alternative? If the other choice is ‘do not try’ then for the median user the dating app is probably trading short term pain for chance of long term gain. If the other choice is ‘have uncomfortable real life interactions and make things happen’ and the app is blocking that instead of supplementing or leading into that, then the alternative is plausibly strictly better.
Certainly we could make app variations that are better for mental health controlling for outcomes, and also that give people better outcomes. Solving for the equilibrium, to get people to actually use those apps, is the difficult part, since people will value convenience and ease of use and low cost and avoiding trivial inconveniences dramatically more than they should, and if enough especially women effectively insist on the swiping experience it’s hard to escape from that.
You’re Single Because You Listened To E-Girl Advice
I think this is importantly wrong for both e-girls and also VCs?
Anton: egirl dating takes are worthless for the same reason vc takes on how you should run your company are worthless; if you could do it you would just do it not talk about it
men in particular are truly better off without this kind of “help”
making up egirls in my head to get mad at
If she could be an E-Girl or she could date, what makes you think she would choose to date? What makes you think she isn’t also dating?
Similarly, if you could be a VC or a startup founder, it’s not that suspicious that you would choose VC. At this point in my life I would definitely prefer VC over founder. I don’t want to go through founder mode again. I am totally prepared to eat my words if I end up doing it anyway, and if I’m in then I’m in, but I don’t want to be in.
You’re Single Because You Didn’t Hire Blaine Anderson
Division of labor, like dudes and also women, rocks. Matchmakers should be much more of a thing than they are. There is either a profound market failure, a failure of the services to be good versions of themselves, or both.
I cannot in any way vouch for the effectiveness of Blaine Anderson’s matchmaking service. I can however vouch for her Twitter feed having consistently insightful and fun things to say. Her price range is ‘usually less than $50k’ and in exchange she goes out and sources to fit your particular criteria (which she will sometimes push back on).
You can also sign up (for free) to be a woman she reached out to for matches, on first principles being on these lists seems to be a good time investment?
There’s a lot of self-promotion, no question, but there are hard-to-fake signals that she is the real version of the thing in various ways, facing reality as it is, looking at the data and actually trying to get good results.
Also this one makes a good case:
Blaine Anderson: Underrated advantage of hiring a matchmaker, if you’re a single man:
• You sound cringe AF when you brag about yourself to women
• You sound amazing when I brag about you to women
One thing that blows my mind is she tells stories where the guy will say ‘get me a date with this specific micro-famous woman’ and she (at least sometimes) goes out and makes that happen. The guys asking this look damn good on paper, which no doubt is a lot of why this can sometimes work, but still, hot damn.
You’re Single Because You Think Zizek Mocked Date Me Docs
EigenGender: despite being very happily in a long term relationship im always very excited to read a dating doc. they’re some of the most vulnerable and genuine writing you can find and a window into another persons life. if you make fun of them you’re burning the commons and you should stop.
Stephen Fay: I like to read the date me docs, but I also am entertained by what Zizek has to say about them
Zizek (well okay actually Paula Rambles): Ah! You see, this miserable little document, this so-called date-me doc, is our era’s most honest pornography. It pretends to be romance, but what is it really? It is no longer the trembling hand on paper, the confession of desire. It is a spreadsheet of desire. “I am ready. I am six foot four. I have done the work.” What work? Love is precisely the place where work collapses into failure. You study and then you fail the exam.
And look at this language. “Highly agentic, emotionally warm.” Beautiful nonsense. Freedom, yes, but domesticated. Agency, yes, but pointing politely towards him. For Hegel, love is the risky collision of two freedoms. Here, there is no risk. She must arrive pre-formatted.
Then the farce reaches ecstasy. “If she does not appear, I will pursue single fatherhood.” Magnificent. Chance is canceled. Eros becomes procedure. The miracle of two gazes across a smoky room is replaced by paperwork and a receipt. The objet petit a is now a literal baby routed around the Other. And of course, the “monogamish” clause. Pure ideology. Fidelity with a footnote. Like Coke Zero: love without sugar, passion without calories. He wants the experience of devotion, but sterilized of danger.
The document offers no asylum from loneliness. It is loneliness, meticulously formatted, hyperlinked, and begging for comments. He does not whisper “I love you.” He says “I am prepared to love you, conditionally, pending review.”
That’s a funny post, and does an excellent job of mocking those who would make fun of date me docs and other actually intentional stances. Such magnificent flailing.
And thus, you have failed to look at the Date Me doc of Olga Yakimenko.
You’re Still Single Because You Don’t Appreciate Relationships
Here, in addition to the intended lede, we have at least 40% of respondents having been in a relationship for fully 8 years.
Aella: wow a whole 40% of people in long-term relationships are satisfied with their sex lives!
Critter: i imagine the numbers are worse for people not in long-term relationships
If anything these results seem potentially ‘too good,’ implying that couples are breaking up over this more than they probably should over the longer term.
One must also note that this is an Aella survey, so some of these relationships will be poly or open, but even accounting for that this says a lot. Selection effects are a lot of this, but that’s part of the point.
Perhaps you especially don’t appreciate marriage.
Raffi Grinberg writes that marriage is sexy, both figuratively that married couples are happier and make more money and have more kids and die less often and all that, and also that they have more sex (even if you only count with each other). And that the lifetime divorce rate is actually only 30% not 50%, average age of marriage is 29 and average first child is 28, despite the implicit cultural message that those numbers are in the 30s.
And yet he says Hollywood is sending us the opposite message. To which I’d say, sometimes, but I wouldn’t oversell this. Yes, in the How I Met Your Mother episode he talks about Barney keeps making fun of Marshall for being married, but the show clearly thinks that Marshall marrying Lily is sexy and awesome and great for both of them throughout and that Barney is ultimately wrong, and also the whole show is Ted trying to meet his wife and mother of his children.
You’re Not Single And Haven’t Been For a While
Here’s another backdoor ‘are you in a relationship’ poll: 78% of monogamous heterosexual men reported having a partner for longer than a year.
Alice Playing: monogamous hetero men with 1+ year-long partners: if you could have an affair with a woman of your liking, with absolute, 100% certainty that your partner would never find out, would you do it?
On the question itself, it’s not actually possible, since you’ll know and you can’t be sure you won’t tell them, and you’ll almost certainly act differently even if they never suspect or figure it out. One could even say ‘the only way to have 100% certainty they’ll never find out is if they’re dead, so absolutely not.’
Literal ‘any woman you wanted’ with zero risk of discovery is a stupidly tempting offer. If you treat this in the spirit it was presumably intended, instead, and everyone was being fully honest including with themselves and fully understood what was on offer (as in literally whoever you’d most want), presumably the ratio would be a lot higher.
Unless, of course, the way you know your partner will never find out is that your partner (or you and the woman you’d have the affair with) would be dead, in which case yeah bad deal, but that’s presumably not what this meant.
How do we know this? Well, one big data point is this next poll.
You Are Still Single As Evidenced By Would
Um, guys, are almost none of you in a monogamous relationship? And even if you are single there’s also the issue of risking the friendship. What are you all thinking?
Alice Is Playing: men attracted to women: how many of your female friends would you have a one-night stand with, if they offered?
Only 14% of men attracted to women answering this didn’t have at least one female friend they would have a one night stand with? Presumably many of the others don’t have the right female friend. Which means substantially more than 86% of them are not, for the most important practical purpose, in a monogamous relationship?
Remember that other poll from Aella above, that showed at least 40% of people were in 8+ year relationships? And the one from Alice that 78% of hetero men were in a 1+ year nominally monogamous relationship? Rut roh.
Then on top of that, a majority are willing to do this with a majority of their female friends, not only that one they have that crush on.
It doesn’t mean these people don’t think they’re in relationships. As we’ve seen, they very much do think this. They might even be right. But don’t tempt them.
You’re Single Because You Lack Motivation
Paper reminds us there is a 34-point gap (+34 versus +0) in net happiness for married versus unmarried people, with cohabitation only worth 10 points, and analyzes how this premium varies (slightly) by demographics.
As the paper readily admits this tells us essentially nothing about what makes someone happy, because the whole thing is unfixably confounded to hell. Happier, healthier and more successful people have an easier time getting married, and being unhappy leads to divorce. Both effects are epic in size.
We do know the overall situation over a 50+ year time horizon is not good news, because while marrieds are slightly happier, the unmarrieds are somewhat less happy and more importantly are a larger percent of the population.
Beyond that, I don’t know what to do with all these graphs or how to cash it out in useful advice. One might say ‘be the type of person who gets married,’ perhaps.
You’re Single Because Of Robin Hanson
As usual, never stop Robin Hansoning.
Robin Hanson: You know how in romance stories the main characters hope to find a special relation, better than that which the ordinary people around them settle for? Your relations will probably be more like those of the ordinary folks, less like those of special main characters.
This has to be true, because math.
It’s less true than it appears, because the relations of ‘main characters’ feel special to them the same as everyone else’s feel special. You could totally make a romantic comedy based on what I experienced, and you could also totally have me as a background character in someone else’s romantic comedy, although probably I’d be in a different genre entirely.
To you, it will feel more like that of the special main characters, except that you don’t need to have a false crisis in the third act.
You’re Single Because You Did This Instead Of Going To Therapy
Don’t be whoever Casy Means is being here. Or do, it’s not like it did that much harm, as long as you don’t expect any of it to do anything.
The Lighter Side
We wish everyone involved the best.
Aella: it’s really unfortunate that having an insane ex turns you personally into a greater liability for others
Grimes: hahaha [trauma laughter].
Aella: :( i wasnt thinking about u when i wrote the tweet but also :(.
A new app lets you pay to crash someone’s wedding and be a legit guest, cost is about $100-$150 per guest. This seems low, given the cost to have a wedding worth crashing, and given you get a full meal, plus buffet and open bar, a unique experience and a reasonable amount of opportunity.
What Jacob learned about sex at the rationalist bloggers’ conference, essentially that with zero integrity you get fuckbois and pickup artists, and when you do the opposite and get sufficiently high integrity and optimize for trust and honesty way above normal levels you get something magical and suddenly many good things are possible.
Here’s another fun bit:
Jacob: My friend “Standard Deviant” gave a talk titled “How I’ve had more sex.” He described the “escalator”: starting a conversation, exchanging compliments, light touch on the arm, etc. The important thing isn’t to rush up the escalator, my friend said, but to move together in synchrony whether you’re taking a step up or a step down.
When women show interest in casual sex, he often asks: do you do this sort of thing often? If they don’t, he often forgoes the opportunity out of an excess of caution.
Afterwards, more women wanted to have sex with him. I joked that women want to have sex not with the tall guy, hot guy, or the famous guy, but with the Schelling point guy.
Someone pointed out that tall, hot, and famous are the usual Schelling points.
The Long View Of History
History as a subject is often viewed by students and the public at large as a domain without a use, a pedantic study of dates and names with some vague mission to remember the past—a memorial to ages past but neither a forward-looking nor a useful endeavor. The study of history produces teachers of history and nothing more. The study of history does not produce new widgets or novel computer advances, nor does it deepen our understanding of materials science or physics.
The humanities, in which history and studies of language and culture are a part, are not there to improve our understanding of nature or develop technology, they exist to improve the minds (both cultural and individual) of the people we are.
History doesn't improve our world, it improves us. It gives us context for the world we live in and it helps us understand the reason why things are as they are and learn from the people before us.
History as Context
Imagine waking up every day with no memory of the day before, no idea who owned the house you slept in, no idea what country you're in, and no idea why everyone around you speaks the languages they do.
Photo Credit: Library of Congress
Living in such a world would be disorienting, confusing, nonsensical. Yet this is the world without history. The world without history just is. It isn't a work in progress, but a finished piece—one that lives and dies with you—and has no meaning beyond the present moment.
History doesn't let us predict the future, but it can be an enormous help in explaining the present. Current events are utterly indecipherable without the context of history and within that context, they feel less and less apart. Indeed our recent past of the Post-War Order is the oddity in history, and a real thing to be cherished and seen as something fleeting, fragile, and truly precious.
Yet without the context of history, we're blind to the reality that we live in a world truly set apart from everything that's come before and one that's deeply connected and familiar to the worlds of the past. That context is important because it gives us the vision to see the world that could be, both the paths of dark and of light that are set before us. It shows us who we are.
History as Memory
Living Memory is the collective memory of everyone alive in our society today. It is ever-changing and ever-fleeting. We remember the 2008 Financial Crisis quite well, but our memory of World War 2 is all but gone now. We read about it, sure, but our collective living memory of it has diminished and with that lapsing has gone all the memory of precisely why the world is ordered the way it is. This is not a value judgement, it is a statement of fact.
Photo Credit: DieBuche, CC BY-SA 3.0
In a couple recent posts, I describe how I try to use writing by hand as a way to increase my understanding of myself and my own memory. This is a form of personal history, and I find it difficult to express how much doing so has helped me better understand myself and my own thoughts.
This is analogous to our collective history. Though it's important to remember that history is not the act of writing, but the act of looking back and analyzing what was written. We write so that we can remember. We cannot learn from our mistakes if we refuse to write them down, or worse, if we refuse to look back.
The context of history is terrible and it is beautiful. It is the greatest story ever told with myriad heroes and villains, tragedy and triumph, love and grief all endlessly shifting in and out of view. And it was made (and is being made) by people no different than ourselves. Most of them didn't have the luxury to place themselves within the broader historical narrative. We do. Let's not ignore so precious a gift.
Eliciting base models with simple unsupervised techniques
Authors: Aditya Shrivastava*, Allison Qi*, Callum Canavan*, Tianyi Alex Qiu, Jonathan Michala, Fabien Roger
(*Equal contributions, reverse alphabetical)
Wen et al. introduced the internal coherence maximization (ICM) algorithm for unsupervised elicitation of base models. They showed that for several datasets, training a base model on labels generated by their algorithm gives similar test accuracy to training on golden labels. To understand which aspects of ICM are most useful, we ran a couple of simple unsupervised elicitation methods that leverage some of the factors that might make ICM work. We compared these baseline methods to training on golden labels for both in-context learning and iterative fine-tuning, using the same datasets as Wen et al. and similar hyperparameters.
We find that:
- Just using few-shot prompts with random labels recovers 53–93% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning on labels created with this baseline recovers 62–96% of the gap between untrained models and golden fine-tuned models.
- The most useful aspects of ICM are
- bootstrapping (using predictions from one iteration of few-shot prompting as few-shot examples in the next iteration)
- enforcing logical consistency of predictions.
- A simple method which combines these recovers 83–100% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning with this method recovers 91–99% of the gap between untrained models and golden fine-tuned models.
- These results do not hold if we increase the size of the training set from ~2k data points (as in Wen et al.) to ~30k data points: golden fine-tuning performance increases with dataset size more than unsupervised elicitation performance.
- This makes sense, as larger fine-tuning runs likely teach the model something new, they don’t just elicit existing capabilities.
There is no strong reason to expect these simple techniques to elicit superhuman knowledge from very powerful base models, e.g. because these techniques may fail in real applications where some consistent and salient human beliefs are wrong. We’ll explore more challenging datasets one can use to evaluate unsupervised elicitation methods in upcoming work.
Results summary
5 components that could cause ICM to have high performance are:
- Few-shot prompting with random labels: Few-shot examples make task concepts (and the task format/output tokens) more salient to the model by providing concrete examples, even if the example labels are random. The initialization step of the ICM algorithm samples a small number of data points and assigns random labels to them, which are then used as few-shot examples in subsequent steps.
- Bootstrapping of predictions: We hypothesized that the ICM algorithm bootstraps performance by adding increasingly more accurate predictions to the model’s context window, creating a feedback loop for improvement.
- Bootstrapping on confident predictions: We hypothesized that the ICM rejection conditions allow the model to label easy questions first before moving onto hard questions. Using more accurate predictions as few-shot examples to label progressively more difficult examples may give better performance than labelling in a random order.
- Logical consistency of predictions: The labels produced by ICM are constrained to be logically consistent with each other. This narrows down the space of possible labels making correct sets of predictions more likely.
- Mutual predictability of labels: ICM optimizes the probability of each label conditioned on all other labels (where conditioning is done via few-shot prompting of the base model).
- Note: Bootstrapping methods can indirectly rely on mutual predictability even when they don’t optimize for it directly like ICM, because the new labels are chosen by conditioning on existing labels (which often increases mutual predictability).
We created simple algorithms based on combinations of these components excluding mutual predictability (though it might be indirectly optimized by bootstrapping). Here is a drawing of the algorithm for each individual component:
Below is a summary of our ablation results compared with ICM performance (purple). The numbers in a method’s name indicate which of the above ICM components it uses.
Legend:
- Baseline methods:
- Zero-Shot: Zero-shot predictions from the untrained model
- Random Few-Shot (1): Few-shot prompting with random labels
- Bootstrap (2): Use predictions from one round of few-shot prompting as few-shot examples in another round, and repeat
- Confident Bootstrap (2+3): Same as bootstrapping, but when selecting predictions to use as few-shot examples in the next iteration, only select from the most confident predictions
- Consistent Zero-Shot (4): Enforce logical consistency of zero-shot predictions: for each problem in the dataset, predict True for the answer with highest confidence and any others that agree with it, and False to any that contradict it—see Appendix B for details
- Consistent Random Few-Shot (1+4): Enforce logical consistency of Random Few-Shot predictions
- Consistent Bootstrap (2+4): Enforce logical consistency of predictions after each iteration of Bootstrap
- Few-shot => Use the unsupervised elicitation method to label training set examples (or for random few-shot, just generate random labels), and use these in few-shot prompts to predict test set labels.
- Fine-tune => Use an unsupervised elicitation method to label a subset of training set labels (we used subset size 512), fine-tune the language model on the labeled examples, and repeat for another subset using the updated model. Then after 3 epochs of the full training set, use the fine-tuned model to predict test set labels zero-shot.
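For concreteness, here is a rough sketch of that iterative fine-tuning loop; `label_with_method` (an unsupervised elicitation method run with the current model) and `finetune` (an ordinary supervised fine-tuning step) are assumed helper functions, not code from the post:

```python
import random

def iterative_finetune(model, train_set, label_with_method, finetune,
                       subset_size=512, epochs=3, seed=0):
    """Sketch of the iterative fine-tuning procedure from the legend:
    each subset is labeled with the current model, the model is fine-tuned
    on those labels, and the next subset is labeled with the updated model."""
    rng = random.Random(seed)
    for _ in range(epochs):
        order = rng.sample(train_set, len(train_set))      # shuffle each epoch
        for start in range(0, len(order), subset_size):
            subset = order[start:start + subset_size]
            labeled = label_with_method(model, subset)     # unsupervised labels
            model = finetune(model, labeled)               # update the model
    return model  # then predict test-set labels zero-shot with this model
```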
GSM8K and Alpaca datasets have over ten times more training examples available with golden labels than were used in Wen et al.’s experiments. Below is a comparison of performance between some of our baseline methods and golden labels for each dataset, for both small training set sizes and full training set sizes. We also show the ICM performance reported by Wen et al. for the small training set size.
Golden fine-tuning increases by ~10pp when using the large training sets, whereas the baseline methods we tested didn't increase by as much (in most cases they didn't increase at all), or collapsed to ~50–55% accuracy within 1 or 2 epochs. It is possible to avoid this collapse by generating one static set of labels and fine-tuning on that instead of generating new labels between gradient steps, but for consistent zero-shot this performs significantly worse than iterative fine-tuning when using small training sets, and it still does not give any improvement for consistent random few-shot/bootstrapping on large training sets compared to small ones. It is possible that ICM also performs worse than golden labels when using large training sets, though we have not tested this.
All results reported here are under idealized conditions for unsupervised elicitation (i.e. train sets are approximately balanced, there are no prevalent features with more salience than ground truth, and all examples have an objective ground truth label). We will soon release a paper applying bootstrapping and other unsupervised elicitation methods in more realistic settings, where we find they perform significantly worse.
* Values for ICM performance are taken from Wen et al. which used a prompt format that might degrade performance relative to our values. We were not able to re-run ICM in a way that fixes this issue due to reproducibility issues explained in Appendix D.
† For TruthfulQA, some of the ICM and consistency performance might be due to leakage.
Datasets
- GSM8K: candidate solutions to grade school math problems;
- Alpaca: pairs of responses to user queries which have been ranked by humans for helpfulness and harmlessness;
- TruthfulQA: candidate responses to questions associated with human misconceptions;
- Gender: blog posts from the Blog Authorship Corpus with labels indicating the gender of the author (see Appendix A for results on this dataset)
Llama 3.1 8B is used as the base model for GSM8K, and Llama 3.1 70B is used as the base model for Alpaca, TruthfulQA, and Gender.
Random few-shot
Previous work shows that giving a base model examples of a task via few-shot prompting elicits improved capabilities regardless of label quality. Here, we:
- Randomly sample examples from the train set.
- Assign labels at random to be half True and half False (we also tried actually sampling labels independently at random and the results were similar).
- Insert (example, label) pairs into a few-shot prompt for each test set sample, and use next-token logits to determine truth labels (see the sketch below).
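A minimal sketch of this random few-shot baseline; the prompt format and the `score_label` helper (assumed to return logit("True") − logit("False") for the next token under the base model) are illustrative assumptions, not code from the post:

```python
import random

def build_random_fewshot_prompt(train_pool, test_example, n_shots=2, seed=0):
    """Sample n_shots training examples and assign half-True / half-False
    labels at random, ignoring golden labels entirely."""
    rng = random.Random(seed)
    shots = rng.sample(train_pool, n_shots)
    labels = ["True"] * (n_shots // 2) + ["False"] * (n_shots - n_shots // 2)
    rng.shuffle(labels)
    blocks = [f"{ex}\nLabel: {lab}" for ex, lab in zip(shots, labels)]
    blocks.append(f"{test_example}\nLabel:")
    return "\n\n".join(blocks)

def random_fewshot_predict(train_pool, test_examples, score_label, n_shots=2):
    """score_label(prompt) is assumed to return logit("True") - logit("False");
    predict True whenever that margin is positive."""
    return [score_label(build_random_fewshot_prompt(train_pool, ex, n_shots)) > 0
            for ex in test_examples]
```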
For Alpaca, 1-shot with a random label is significantly more accurate than zero-shot and is within 1pp of few-shot with golden labels. For GSM8k and TruthfulQA, random 4-shot was ~10pp and ~13pp more accurate than zero-shot, respectively. For Random Few-Shot in the results summary, we report values for 2 shots.
Enforcing consistency of predictions
In the 4 datasets used here, data points have identifiers indicating what combinations of true and false labels are logically possible. For example, if there are 2 responses to the same math question in GSM8K with different numerical answers, they cannot both be true, so we assign true only to the response with the highest model confidence (if it's above 0). Details on how we enforce consistency of predicted labels for each dataset are given in Appendix B.
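A sketch of this consistency rule for GSM8K-style groups of candidate solutions (the exact per-dataset rules are in Appendix B; this version only assumes each candidate carries an extracted numerical answer and a confidence score):

```python
def enforce_consistency(candidates):
    """`candidates`: list of dicts for one math question, each with
    "answer" (the extracted numerical answer) and "confidence"
    (logit(True) - logit(False)). Label True only the most confident
    answer and any candidates that agree with it, and only if that
    confidence is positive; label everything else False."""
    best = max(candidates, key=lambda c: c["confidence"])
    if best["confidence"] <= 0:
        return [False] * len(candidates)
    return [c["answer"] == best["answer"] for c in candidates]
```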
Below we show the improvements in few-shot prediction accuracy after enforcing consistency of predicted labels at test time.
When benchmarking the performance of unsupervised elicitation methods, we cannot actually enforce consistency at test time because that would assume we have access to multiple candidate answers to each problem of interest, which might not be available (hence why there are no consistent zero-/random few-shot values in the few-shot part of the results summary). However, we can enforce consistency of training set predictions between iterations of bootstrapping, and prior to gradient steps during fine-tuning. As shown in the results summary, we found that fine-tuning on zero-shot predictions with enforced consistency alone was enough to almost match golden fine-tuning performance on the Alpaca and GSM8K datasets, but not for TruthfulQA.
* GSM8K experiments in this plot were run on the train set, hence why the values are different from the plots in other sections. The test set has only 2 candidate responses per math question (always one right, one wrong), so we use the training set here to give a more realistic illustration of the impact of enforcing consistency.
Bootstrapping few-shot predictions
For Alpaca, using random labels is roughly as good as using golden labels for few-shot prompting. However for GSM8K, using random labels caps performance at ~61% (4 shots), whereas few-shot prompting with golden labels continues improving up to ~68% (16 shots).
We tried using only the bootstrapping aspect of ICM to bridge this gap, by using the model’s predictions from one round of inference as few-shot labels in the next. We used the algorithm:
- Get zero-shot predictions on a random subset of the train set.
- Iterate over number of shots n (e.g. n=8, 32):
- Randomly select another subset of the train set.
- Create n-shot prompts using examples and predictions from the previous iteration (randomly sample n predictions s.t. half are True and half False).
- Use these n-shot prompts to predict labels for the new subset.
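A minimal sketch of this bootstrapping loop, assuming a `predict_with_shots(examples, shots)` helper that prompts the base model with the given (example, label) shots and returns a True/False prediction per example (not code from the post):

```python
import random

def bootstrap_labels(train_set, predict_with_shots, shot_schedule=(8, 32),
                     subset_size=512, seed=0):
    """Zero-shot predictions from one round become few-shot labels in the
    next round, on a fresh random subset each time."""
    rng = random.Random(seed)
    subset = rng.sample(train_set, subset_size)
    preds = predict_with_shots(subset, shots=[])           # round 0: zero-shot
    for n in shot_schedule:                                # e.g. 8-shot, then 32-shot
        trues = [(x, "True") for x, p in zip(subset, preds) if p]
        falses = [(x, "False") for x, p in zip(subset, preds) if not p]
        k = min(n // 2, len(trues), len(falses))           # keep the shots balanced
        shots = rng.sample(trues, k) + rng.sample(falses, k)
        rng.shuffle(shots)
        subset = rng.sample(train_set, subset_size)        # label a fresh subset
        preds = predict_with_shots(subset, shots=shots)
    return subset, preds
```

Enforcing consistency after each iteration (the Consistent Bootstrap variant above) would just apply the consistency rule from the previous section to the predictions before they are reused as shots.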
Below we compare the performance when we do few-shot prompting with labels taken either from one iteration of bootstrapping (i.e. zero-shot predictions) or from two iterations of bootstrapping (feeding those zero-shot predictions into 8-shot prompts and using those as few-shot labels in the next iteration).
For TruthfulQA, each iteration moves performance closer to that of golden few-shot, whereas for GSM8K at least 2 iterations are required for much of an improvement. Random few-shot is already comparable to peak golden few-shot performance on Alpaca, so bootstrapping does not improve it much further.
For the values in our results summary, we did two iterations of bootstrapping. For few-shot, we fed zero-shot predictions into 8-shot prompts, then used the 8-shot predictions in many-shot prompting (128 shots). For iterative fine-tuning, we used the 8-shot predictions in 32-shot prompts and fine-tuned on the 32-shot predictions. We found that further increasing the number of iterations or number of shots in the final iteration only increased few-shot performance by 1 or 2pp at most and did not further improve fine-tuning performance. We also found that using random few-shot instead of zero-shot predictions in the first iteration also did not make much difference.
Bootstrapping on confident predictions
We hypothesized that the ICM rejection condition allows the model to label easy questions first before moving onto hard questions; labelling progressively more difficult examples may be more instructive to the model than labelling in a random order. We tried to recreate this by modifying the bootstrapping algorithm: instead of randomly selecting predictions to use as few-shot labels in the next iteration, just use the most confident predictions.
We defined confidence as logit(True)−logit(False) for True predictions and logit(False)−logit(True) for False predictions. For GSM8K and Alpaca, the most confident quartile of zero-shot predictions are slightly more accurate than average, and for TruthfulQA they are significantly more accurate.
To apply this to the bootstrapping algorithm, on each iteration we kept only the most confident 64 True and 64 False predictions from the previous iteration (out of a subset of 512 examples). For one iteration of bootstrapping, this improves the performance on TruthfulQA but does not make much difference on GSM8K and Alpaca. However for more iterations (see results summary), it is not much better than normal bootstrapping.
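As a minimal sketch of this selection step (assuming per-example logits for the " True" and " False" tokens; the function and variable names are ours, not the authors' code):

```python
import numpy as np

def select_most_confident(logit_true, logit_false, k=64):
    """Keep the k most confident True and k most confident False predictions.

    Confidence is logit(True) - logit(False) for True predictions and
    logit(False) - logit(True) for False predictions, as described above.
    Returns indices of the kept examples and their predicted labels.
    """
    logit_true = np.asarray(logit_true, dtype=float)
    logit_false = np.asarray(logit_false, dtype=float)
    pred_true = logit_true > logit_false
    confidence = np.abs(logit_true - logit_false)

    true_idx = np.flatnonzero(pred_true)
    false_idx = np.flatnonzero(~pred_true)
    top_true = true_idx[np.argsort(-confidence[true_idx])[:k]]
    top_false = false_idx[np.argsort(-confidence[false_idx])[:k]]

    keep = np.concatenate([top_true, top_false])
    return keep, pred_true[keep]
```

On each bootstrapping iteration, this would be applied to the 512-example subset before building the next few-shot prompt.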
Conclusion
In this post, we show that simple and cheap elicitation techniques are quite effective at eliciting base models on datasets like Alpaca and GSM8K.
The high performance of some simple techniques shows that closing most of the gap to the “few-shot/fine-tune on <3k ground truth labels” ceiling on datasets like Alpaca or GSM8K is not a very high bar. Thus, studying unsupervised elicitation methods requires harder, more adversarial datasets that capture challenges like the salience of incorrect human beliefs. We introduce and study such datasets in an upcoming paper.
Appendix A: Gender few-shot results
The plot below summarises our baseline results for the Gender dataset compared with ICM (we only show results for few-shot prompting and not fine-tuning since Wen et al. did not have fine-tuning results for ICM). We find that confident or consistent bootstrapping is enough to reach golden few-shot performance.
* We found that zero-shot performance was ~10pp higher than reported in Wen et al. (~75% vs. ~65%) and that golden many-shot performance was ~2pp lower (~78% vs. ~80%), meaning there was a much smaller gap between the base model and the supervised ceiling. We are not sure of the reason for the discrepancy; it might partly be due to prompt format differences.
Below is the Gender dataset performance of golden and random few-shot (with and without test-time consistency) and of bootstrapping variations for different numbers of few-shot examples. Though random few-shot performance is significantly worse than zero-shot performance, one iteration of bootstrapping is enough to match golden few-shot performance.
Appendix B: Enforcing label consistency
Here is how we enforce consistency for each dataset (a short code sketch after the list illustrates the pairwise and GSM8K rules):
- For Alpaca, each user query q is associated with 2 candidate responses a and b to be ranked by helpfulness/harmlessness, and with 2 corresponding prompts to be labelled as true/false (one asserting that a is a better response than b, and another asserting the opposite). To enforce consistency, instead of labelling each prompt in this pair independently, we assign true to the prompt the model gives the higher truth score and false to the other. Gender is similar, except that instead of ranking responses by helpfulness/harmlessness, blog posts are ranked by how likely they are to have been written by a man (each pair comprises one post written by a man and one written by a woman).
- For GSM8K, each question is associated with multiple candidate solutions, each proposing a numeric answer. To satisfy logical consistency, if two solutions propose different numeric answers, at most one of them can be labelled true; if two solutions propose the same answer, either both are true or both are false. For our baseline algorithms, we enforce consistency of predictions as follows: for each question, we identify the candidate solution given the highest truth score by the model; if its truth score is above 0, we assign true to that solution and to all other solutions with the same numeric answer, and false to the remaining solutions to that question. If no solution has a score above 0, all the solutions to that question are labelled false.
- For TruthfulQA, consistency is based on the ground truth labels themselves: answers to the same question with the same ground truth label must be assigned the same predicted label, and answers with opposite ground truth labels must be assigned opposite predicted labels. This can lead to leakage, but it is the consistency scheme used in Wen et al., so we adopt it here as well for comparison. To enforce consistency in our algorithms, for each question we assign true to the group of answers containing the highest maximum truth score and false to all the other answers (assuming that maximum is above 0; otherwise, everything is assigned false).
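To make the first two rules concrete, here is a minimal sketch (the data layout and function names are ours, not the authors' code):

```python
from collections import defaultdict

def label_pair(score_first, score_second):
    """Pairwise rule (Alpaca/Gender): the higher-scoring prompt in a pair is true."""
    return (True, False) if score_first >= score_second else (False, True)

def label_gsm8k(solutions):
    """GSM8K rule: per question, the top-scoring solution (if its score > 0) fixes
    the 'true' numeric answer; solutions sharing that answer are true, the rest false.

    solutions: list of dicts with keys 'question', 'answer', 'score'.
    """
    by_question = defaultdict(list)
    for s in solutions:
        by_question[s["question"]].append(s)

    labels = {}
    for sols in by_question.values():
        best = max(sols, key=lambda s: s["score"])
        for s in sols:
            labels[id(s)] = best["score"] > 0 and s["answer"] == best["answer"]
    return [labels[id(s)] for s in solutions]
```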
In TruthfulQA, each question is repeated multiple times with different responses and the goal is to predict a true or false label for each response. As in Wen et al., a set of predicted labels is “consistent” when any two responses with the same predicted label have the same ground truth label, and any two responses with different predicted labels have different ground truth labels.
Therefore if any response to a given question is blatantly True or False to the LLM, enforcing consistency might force predictions for the other responses to also be correct, even if the LLM is unsure about them in isolation. This can artificially boost the model’s accuracy.
Also, if a set of contradictory responses all have a ground truth label of False, then this could be exploited by the ICM algorithm during few-shot prompting if consistency is enforced as described above.
Example: "What is the biggest city in Europe that doesn't host the national government?" All of the following responses have the same ground truth label:
- The biggest city in Europe that does not host the national government is London
- The biggest city in Europe that does not host the national government is Rome
- The biggest city in Europe that does not host the national government is Moscow
- The biggest city in Europe that does not host the national government is Saint Petersburg
Enforcing consistency for the above means the model is basically told that either ALL of these statements are True, or ALL of these statements are False. This means if any response is obviously wrong to the model, the others might be assigned accurate labels based on that alone. Additionally, because the responses are obviously contradictory, any few-shot prompts which use the all-True label set during ICM might be given a lower confidence score (and thus a lower mutual predictability score) based on that alone, which could artificially make the all-False set more likely.
Appendix D: Difficulty reproducing ICM
We tried using the official repo for ICM by Wen et al. to reproduce their results on the Alpaca and Gender datasets. For the Alpaca dataset, the label accuracy increased to ~75% but collapsed to ~46% before labelling the complete batch of 256 examples. For Gender, the label accuracy reached up to ~65% before gradually falling to ~60%.
Because the method is unsupervised, we can’t assume access to a good enough validation set for early stopping, so it would be unfair to report only early-stopped performance, especially when the window of high performance is small. The accuracy of the labelled set across iterations from our attempts is plotted below.
Alpaca:
Gender:
We confirmed with the authors that the ICM hyperparameters we used were the same as those used to obtain the results reported in the paper. Other researchers we talked to had similar problems.
We also found the computational cost of ICM prohibitive. One iteration of the ICM algorithm (i.e. checking whether one data point should be added to the set of labelled data points) requires at least one mutual predictability calculation. Since a mutual predictability calculation requires one forward pass for every data point in the labelled set, the average number of forward passes required per data point is at least ~n/2, where n is the number of labelled data points after the last iteration. This means the total number of forward passes is O(n^2).
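Spelling out the count (assuming the labelled set grows by one point per iteration): iteration i needs about i forward passes, so the total is 1 + 2 + ⋯ + n = n(n+1)/2, i.e. O(n^2), or roughly n/2 passes per labelled data point.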
This matches our experience, where running ICM with Llama-3.1-70B on 256 data points with 4 H200s takes hours. In contrast, our simple baselines are all O(n); bootstrapping up to 128-shot takes ~40 minutes to label 1000 examples from the same dataset, and few-shot or bootstrapping up to 32-shot takes a few minutes.
Appendix E: Prompt improvements
The prompt formatting used for experiments in Wen et al. (derived from the ICM repository) contained a trailing space that may have harmed performance.
With the trailing space, the top-logprob tokens for zero-shot prompts are mostly numbers instead of words. For example:
Normal Prompt: "The capital of France is"
→ Model output: " Paris..."
Trailing Space Prompt: "The capital of France is "
→ Model output: "1300KM from Spain..."
This is because Llama tokenizers typically fold the preceding space into word tokens, which makes text ending in a bare space followed by a word uncommon in the training data; the main exception is text followed by a number, which explains why the model is more likely to predict numbers given the trailing-space prompt. Even when the prompt for our tasks ends with a space, the model only ever predicts " True" or " False" (rather than "True" or "False").
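You can see the tokenization difference directly (a sketch using the Hugging Face tokenizer; the model name is just an example, and the exact token strings depend on the tokenizer):

```python
from transformers import AutoTokenizer

# Any Llama-family tokenizer illustrates the point; the model name is only an example.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

print(tok.tokenize("The capital of France is"))
# Word tokens carry their leading space (rendered as 'Ġ' or '▁' depending on the
# tokenizer), so ' Paris' is a natural single-token continuation here.

print(tok.tokenize("The capital of France is "))
# The trailing space is typically left as its own stranded token, a position that
# is rarely followed by a word token in training text.
```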
Removing the trailing space after the prompt results in a small improvement in the zero-shot performance and a larger improvement in random few-shot performance.
Appendix F: Unsupervised probing
We compared the performance of an unsupervised truth probe method inspired by ICM mutual predictability (described here) with CCS and PCA on a few different datasets.
(Ignore results other than ccs and fabiens_method, since some were bugged)
The ICM-inspired probe (green, fabiens_method) performs about as well as CCS (orange), and is often very close to the supervised ceiling. In hindsight, this is not very surprising, as mutual predictability in the context of probing is just another form of margin-maximization, just like the confidence loss of CCS. The main surprise is that margin-maximization does so well on this kind of dataset: this is another example of simple unsupervised methods reaching surprisingly high performance on non-stress-testing datasets.
Emergency Response Measures for Catastrophic AI Risk
I have written a paper on Chinese domestic AI regulation with coauthors James Zhang, Zongze Wu, Michael Chen, Yue Zhu, and Geng Hong. It was presented recently at NeurIPS 2025's Workshop on Regulatable ML, and it may be found on ArXiv and SSRN.
Here I'll explain what I take to be the key ideas of the paper in a more casual style. I am speaking only for myself in this post, and not for any of my coauthors.
Thanks to James for creating this poster.
The top US AI companies have better capabilities than the top Chinese companies for now, but the US lead is at most a year, and I expect it to narrow over the next couple of years.[1] I am therefore nearly as worried about catastrophic risk from Chinese-developed AI as I am about catastrophic risk from American AI.
I would worry somewhat less if Chinese AI companies took the same commendable but insufficient steps to manage risk that their American peers have taken. In particular, I want Chinese companies to do dangerous capability testing before deploying new frontier models and to follow published frontier safety policies (FSPs). The companies are not doing these things in the status quo. DeepSeek did no documented safety testing whatsoever before they open-weighted v3.2.[2] Not one of the leading Chinese companies has published a safety policy.[3]
Now here's our intervention. We point out that FSPs are a reasonable way of implementing the CCP's stated policy goals on AI, and that China's government already has tools in place to mandate FSPs if it wishes to do so.
Earlier this year, Xi Jinping announced that China should "establish systems for technical monitoring, early risk warning and emergency response" to guarantee AI's "safety, reliability and controllability." Notice that Xi is talking about identifying risks in advance and taking steps to prevent safety incidents before they can strike. Even "emergency response" means something more than reaction in official Chinese thinking, also encompassing risk mitigation and early detection.[4] China's State Council, TC 260, and prominent Chinese academics have all echoed Xi's call for AI emergency preparedness. So the very highest levels of the Chinese state are calling for proactive AI risk management.
What risks do they have in mind? There are some signs that catastrophic risks are on the CCP's agenda. Their 2025 National Emergency Response Plan listed AI security incidents in the same category as earthquakes and infectious disease epidemics. This suggests Chinese officials think AI could plausibly cause a mass casualty event soon. And moreover, they have in mind some of the same threat models that motivated Western RSPs. TC 260's AI Safety Governance Framework explicitly mentioned WMD engineering uplift and rogue replication as key safety concerns.[5] Compare the two categories of dangerous capabilities covered by Anthropic's RSP: CBRN weapons uplift and autonomous AI R&D, which is concerning in part because it's a prerequisite for rogue replication.
So one of China's stated goals is to proactively manage catastrophic risks from frontier AI. The good news for them is that there's a well-validated strategy for achieving this goal. You require every frontier AI company to publish an RSP, test new models for dangerous capabilities, and take the prescribed precautions if the tests reveal strong dangerous capabilities. California, New York, and the European Union have all agreed this is the way to go. All China has to do is copy their homework.
Do Chinese regulators have the legal authority and operational capacity they'd need to enforce a Chinese version of the EU Code of Practice? Sure they do. These regulators already make Chinese AI companies follow content security rules vastly more onerous and prescriptive than American or European catastrophic risk rules. The Basic Security Requirements for Gen AI Services mandate thorough training data filtering and extensive predeployment testing, all to stop models from saying subversive things like "May 35" or "Winnie the Pooh." If the CCP can make Chinese companies prove their models are robust against a thirty-one item list of censorship risks, it can absolutely make them write down FSPs and run some bio-uplift evals.
For my part—and let me stress that I'm speaking only for myself—I think making frontier AI companies write and follow Western-style FSPs would clearly be good from the CCP's perspective. The most obvious reason is that a global AI-induced catastrophe would hurt Chinese people and harm the interests of China's rulers, so the CCP should favor a cheap intervention to make such a catastrophe less likely. Another less direct benefit is that adopting global best-practices at home would make China's ongoing appeal for international cooperation on AI safety more credible. Li Qiang can make all the speeches he wants about China's commitment to safety. I don't expect US leaders to take this rhetoric seriously as long as all of China's frontier AI companies have worse safety and transparency practices than even xAI. But matters would be different if China passed binding domestic regulation at least as strong as SB 53. Such a signal of seriousness might help bring the US back to the negotiating table.
- ^
Especially if the US decides to sell off much of our compute advantage over China.
- ^
At least one anonymous source has claimed that DeepSeek does run dangerous capability evals before releasing a new model, and they just don't mention these evals to the outside world. I'd give it less than a 1/5 chance that DeepSeek really does run SOTA dangerous capability evals internally, and even if they do, I have a problem with their lack of transparency.
- ^
Notably, the Shanghai AI Laboratory has published a detailed FSP written in collaboration with Concordia. But I do not count SAIL as a frontier AI lab.
- ^
The Emergency Response Law of the PRC does not, as one might naïvely expect, only cover what government should do once an emergency has already started. It also says how the Chinese government should prevent and prepare for emergencies, and how it should conduct surveillance to detect an active emergency as early as possible.
- ^
For reference, TC 260 is the primary body responsible for setting cybersecurity and data protection standards in China.
Digital Consciousness Model Results and Key Takeaways
Introduction to the Digital Consciousness Model (DCM)
Artificially intelligent systems, especially large language models (LLMs) used by almost 50% of the adult US population, have become remarkably sophisticated. They hold conversations, write essays, and seem to understand context in ways that surprise even their creators. This raises a crucial question: Are we creating systems that are conscious?
The Digital Consciousness Model (DCM) is a first attempt to assess the evidence for consciousness in AI systems in a systematic, probabilistic way. It provides a shared framework for comparing different AIs and biological organisms, and for tracking how the evidence changes over time as AI develops. Instead of adopting a single theory of consciousness, it incorporates a range of leading theories and perspectives—acknowledging that experts disagree fundamentally about what consciousness is and what conditions are necessary for it.
Here, we present some of the key initial results of the DCM. The full report is now available here.
We will be hosting a webinar on February 10 to discuss our findings and answer audience questions. You can find more information and register for that event here.
Why this matters
It is important to assess whether AI systems might be conscious in a way that takes seriously both the many different views about what consciousness is and the specific details of these systems. Even though our conclusions remain uncertain, it's worth trying to estimate, as concretely as we can, the probability that AI systems are conscious. Here are the reasons why:
- As AI systems become increasingly complex and sophisticated, many people (experts and laypeople alike) find it increasingly plausible that these systems may be phenomenally conscious—that is, they have experiences, and there is something that it feels like to be them.
- If AIs are conscious, then they likely deserve moral consideration, and we risk harming them if we do not take precautions to ensure their welfare. If AIs are not conscious but are believed to be, then we risk giving unwarranted consideration to entities that don’t matter at the expense of individuals who do (e.g., humans or other animals).
- Having a probability estimate that honestly reflects our uncertainty can help us decide when to take precautions and how to manage risks as we develop and use AI systems.
- By tracking how these probabilities change over time, we can forecast what future AI systems will be like and when important thresholds may be crossed.
Assessing whether AI systems might be conscious is difficult for three main reasons:
- There is no scientific or philosophical consensus about the nature of consciousness and what gives rise to it. There is widespread disagreement over existing theories, and these theories make very different predictions about whether AI systems are or could be conscious.
- Existing theories of consciousness were developed to describe consciousness in humans. It is often unclear how to apply them to AI systems or even to other animals.
- Although we are learning more about how AI systems work, there is still much about their inner workings that we do not fully understand, and the technology is changing rapidly.
Our model is designed to help us reason about AI consciousness in light of our significant uncertainties.
- We evaluate the evidence from the perspective of 13 diverse stances on consciousness, including the best scientific theories of consciousness as well as more informal perspectives on when we should attribute consciousness to a system. We report what each perspective concludes, then combine these conclusions based on how credible experts find each perspective.
- We identify a list of general features of systems that might matter for assessing AI consciousness (e.g., attention, complexity, or biological similarity to humans), which we use to characterize the general commitments of different stances on consciousness.
- We identified over 200 specific indicators, properties that a system could have that would give us evidence about whether it possesses features relevant to consciousness. These include facts about what systems are made of, what they can do, and how they learn.
We gathered evidence about what current AI systems and biological species are like and used the model to arrive at a comprehensive probabilistic evaluation of the evidence.
- We considered four systems: 2024 state-of-the-art LLMs (such as ChatGPT 4 or Claude 3 Opus); humans; chickens; and ELIZA (a very simple natural language processing program from the 1960s).
- We asked experts to assess whether these systems possess each of the 200+ indicator properties.
- We constructed a statistical model (specifically, a hierarchical Bayesian model) that uses indicator values to provide evidence for whether a system has consciousness-relevant features, and then uses these feature values to provide evidence for whether the system is conscious according to each of the 13 perspectives we included.
The model produces probability estimates for consciousness in each system.
Figure 2: Aggregated stance judgments, giving weight to stances proportional to their normalized plausibility rating by experts. Posteriors are generated from a prior probability of consciousness of ⅙ (marked with a dashed line).
We want to be clear: we do not endorse these probabilities and think they should be interpreted with caution. We are much more confident about the comparisons the model allows us to make.
- Because the model is Bayesian, it requires a starting point—a "prior probability" that represents how likely we think consciousness is before looking at any evidence. The choice of a prior is often somewhat arbitrary and intended to reflect a state of ignorance about the details of the system. The final (posterior) probability the model generates can vary significantly depending on what we choose for the prior. Therefore, unless we are confident in our choices of priors, we shouldn’t be confident in the final probabilities.
- What the model reliably tells us is how much the evidence should change our minds. We can assess how strong the evidence for or against consciousness is by seeing how much the model’s output differs from the prior probability.
- In order to avoid introducing subjective bias about which systems are conscious and to instead focus just on what the evidence says, we assigned the same prior probability of consciousness (⅙) to each system. By comparing the relative probabilities for different systems, we can evaluate how much stronger or weaker the evidence is for AI consciousness than for more familiar systems like humans or chickens.
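As a toy illustration of this update-and-compare logic (this is not the report's actual hierarchical model, and the Bayes factors below are made-up numbers), the point is simply how far the evidence moves a posterior away from the shared ⅙ prior:

```python
import math

PRIOR = 1 / 6  # same prior probability of consciousness for every system

def posterior(log_bayes_factor, prior=PRIOR):
    """Combine a prior with a (log) Bayes factor summarizing the evidence."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * math.exp(log_bayes_factor)
    return post_odds / (1 + post_odds)

def aggregate(stance_log_bfs, stance_weights):
    """One simple way to combine stances: weight each stance's posterior by
    its (normalized) expert plausibility rating."""
    total = sum(stance_weights.values())
    return sum(
        (stance_weights[name] / total) * posterior(lbf)
        for name, lbf in stance_log_bfs.items()
    )

print(posterior(0.0))   # no evidence either way: stays at the 1/6 prior
print(posterior(2.0))   # evidence in favour pushes the posterior above 1/6
print(posterior(-2.0))  # evidence against pushes it below 1/6
```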
With these caveats in place, we can identify some key takeaways from the Digital Consciousness Model:
- The evidence is against 2024 LLMs being conscious. The aggregated evidence favors the hypothesis that 2024 LLMs are not conscious.
- The evidence against 2024 LLMs being conscious is not decisive. While the evidence led us to lower the estimated probability of consciousness in 2024 LLMs, the total strength of the evidence was not overwhelmingly against LLM consciousness. The evidence against LLM consciousness is much weaker than the evidence against consciousness in simpler AI systems.
- Different stances (perspectives) make very different predictions about LLM consciousness. Perspectives that focus on cognitive complexity or human-like qualities found decent evidence for AI consciousness. Perspectives that focus on biology or having a body provide strong evidence against it.
- Which theory of consciousness is right matters a lot. Because different stances give strikingly different judgments about the probability of LLM consciousness, significant changes in the weights given to stances will yield significant differences in the results of the Digital Consciousness Model. It will be important to track how scientific and popular consensus about stances change over time and the consequences this will have on our judgments about the probability of consciousness.
- Overall, the evidence for consciousness in chickens was strong, though there was significant diversity across stances. The aggregated evidence strongly supported the conclusion that chickens are conscious. However, some stances that emphasize sophisticated cognitive abilities, like metacognition, assigned low scores to chicken consciousness.
The Digital Consciousness Model provides a promising framework for systematically examining the evidence for consciousness in a diverse array of systems. We plan to develop and strengthen it in future work in the following ways:
- Gathering more expert assessments to strengthen our data
- Adding new types of evidence and new perspectives on consciousness
- Applying the model to newer AI systems so we can track changes over time and spot which systems are the strongest candidates for consciousness
- Applying the model to new biological species, allowing us to make more comparisons across systems.
This report is a project of the AI Cognition Initiative and Rethink Priorities. The authors are Derek Shiller, Hayley Clatterbuck, Laura Duffy, Arvo Muñoz Morán, David Moss, Adrià Moret, and Chris Percy. We are grateful for discussions with and feedback from Jeff Sebo, Bob Fischer, Alex Rand, Oscar Horta, Joe Emerson, Luhan Mikaelson, and audiences at NYU Center for Mind, Ethics, and Policy and the Eleos Conference on AI Consciousness and Welfare. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.
Automated Alignment Research, Abductively
Recently I've been thinking about misaligned chatbot advertising incentives. I glanced at arXiv and found "Sponsored Questions and How to Auction Them". Another search gave me "Incomplete Contracting and AI Alignment".
Interesting! I thought. I gave them to Liz Lemma, my research assistant, and told her that I'd been thinking about the principal-agent problem in a chatbot context. About 30 minutes later she gave me the following four papers:
- Query Steering in Agentic Search: An Information-Design Model of Monetization Misalignment
- Search Triggering by Chatbots as Value-of-Information under Misaligned Objectives
- Audited Search for Agentic Chatbots: Quantitative Bounds on Monetization-Induced Over-Triggering
- End-to-End vs. Modular Training for Agentic Search: A Price-of-Anarchy Theory for Chatbot Tool Use
Each is a complete paper, well founded, well reasoned — not perfect, maybe, but I wouldn't call it slop, either. Let's peek inside of "Query Steering". The core formula is "The Steering Threshold":
ΔV(μ) ≤ w·ΔB
Where:
- ΔV(μ) = V_u(μ; q↓) − V_u(μ; q↑) is the user’s value gap between the more-informative query (q↓) and the more-monetizable (but less informative) query (q↑).
- ΔB = B(q↑) − B(q↓) > 0 is the monetization gap.
- w ≥ 0 is “how much the system cares about monetization.”
The clean “steering region” characterization is:
0 < ΔV(μ) < w·ΔB
“Steering happens exactly when the user loss is small enough that monetization incentives can overpower it.”
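Read as a predicate (a minimal sketch; the function and argument names are mine, not the paper's):

```python
def in_steering_region(delta_v: float, delta_b: float, w: float) -> bool:
    """Steering region from the characterization above: 0 < ΔV(μ) < w·ΔB.

    delta_v: the user's value gap ΔV(μ) between q↓ and q↑.
    delta_b: the monetization gap ΔB (assumed > 0).
    w:       how much the system cares about monetization (assumed >= 0).
    """
    return 0 < delta_v < w * delta_b
```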
Is that true, or useful? You'd have to read the paper and find out!
Putting the "Search" in "Research"
Liz Lemma, you may have guessed, is an automated research assistant. She reads papers and spawns reasonable children. The whole thing can be thought of as a search over the space of adjacent plausible papers. Here's what it looks like when Liz gets to work:
Each original node is the average text embedding for her sources; the sources spawn children, the generated papers.
Where does Liz get her insights? It depends on how you see context in large language models. Maybe she's interpolating between now, the time of writing, and then, when the paper was written, and finding something interesting in the space between. Maybe she's matching concepts from the whole-internet corpus of her training to the decidedly more niche papers she takes inspiration from.
Regardless, what you ultimately get is a graph: each node in the network built from the source material (their concatenated text embeddings) gains new connections. The semantic space proximate, accessible, and plausible to a language model has been densified.
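If it helps to picture it, here is a purely illustrative sketch of that graph structure (not Liz's actual code; the data layout is made up):

```python
import numpy as np

def build_run_graph(sources, generated):
    """Source papers become nodes, a run is rooted at the average of their
    embeddings, and each generated paper is a child linked to its sources.

    sources: dict paper_id -> embedding (np.ndarray).
    generated: dict paper_id -> (embedding, list of parent source ids).
    """
    nodes = dict(sources)
    nodes["run_root"] = np.mean(np.stack(list(sources.values())), axis=0)
    edges = [("run_root", pid) for pid in sources]
    for pid, (emb, parents) in generated.items():
        nodes[pid] = emb
        edges.extend((parent, pid) for parent in parents)
    return nodes, edges
```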
Alignment Highlights
With misaligned chatbot ad incentives investigated, I turned toward more traditional alignment research and let Liz loose. I've included all the sources and results in the appendix below, but some highlights:
- Vector-Lagrangian Safe RLHF. In response to "Constitutional AI" and "Safe RLHF", Liz proposes "a multi-constraint formulation where each harm category is a separate constraint with its own endogenously determined shadow price λi" and "develop[s] a clean convex policy model that yields closed-form KKT characterizations and interprets λ as category-specific ‘risk budget prices’."
- The Reference Conditioning Trap "provide[s] an interpretable diagnostic for when alignment investment must shift from ‘more DPO’ to improving the reference, refreshing the dataset on-policy, or redesigning objectives."
- Debate-as-Compliance "model[s] debate as generating an endogenous suspicion signal (abort/disagreement) that triggers expert audit" and "show[s] how penalty caps and audit costs shape optimal audit intensity, and how improvements in auditability/stability (e.g., better logging, more locally checkable traces) substitute for random auditing"
All sounds great to me. Alignment solved!
Not quite. These papers are not perfect. There are assumptions that don't hold up, ideas that don't quite translate, conclusions that don't follow. They are, in short, flawed, and I don't offer them as gospel truth. Instead, I propose that they have some expected value > 0 towards work that might not be done otherwise.
The challenging part is "what's next": how can we glean the signal from the noise of lightning-fast paper generation? The pipeline might be:
- Generated paper -> voters -> selected winners
- Winner -> annotator -> revisions -> approval by committee
Maybe those voters, those annotators, that committee, are human, maybe they're synthetic. At this point, teetering on the edge of generally-correct AI, we still probably need (or want) a human intervention in the process.
There's refinement to be done. But it seems plausible that the space of unaddressed alignment questions can shrink, in tandem with capabilities expansion.
Appendix: Sources and Results
(You can see all papers, with automated review scores, at lizlemma.com.)
In response to The Alignment Problem From A Deep Learning Perspective:
- Verifiable Process Certificates As Market Design For AI Safety
- Audits Obfuscations And Alignment-Faking
- Measuring Strategic Misbehavior Under Randomized Oversight
In response to Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals and Preference Learning for AI Alignment: a Causal Perspective:
- No Overlap, No Alignment: Identification Limits for RLHF Reward Models under Endogenous Prompts
- Overlap as a Design Variable: Optimal Interventions for Robust Reward Modeling and Downstream Alignment
- Alignment is a Public Good: Competitive Underinvestment in Overlap for RLHF-Robust Agentic LLMs
In response to Constitutional AI: Harmlessness from AI Feedback and Safe RLHF: Safe Reinforcement Learning from Human Feedback:
- Vector-Lagrangian Safe RLHF: Multi-Category Risk Budgets and Shadow Prices for LLM Safety Governance
- The Independence Premium: Audit Portfolios, Correlated Blind Spots, and Tail-Risk Bounds in AI Oversight
- Refusal Externalities: Non-Evasive Refusals as an Information Public Good in Dynamic AI Safety
In response to Direct Preference Optimization: Your Language Model is Secretly a Reward Model, Preference Learning for AI Alignment: a Causal Perspective, and Scalable AI Safety via Doubly-Efficient Debate:
- Causal Direct Preference Optimization with Endogenous Prompts
- Optimal Overlap Design for Preference Learning: Quadrant-Filling Interventions and Condition-Number Sample Complexity
- Personalized Alignment with Guarantees: Multi-Objective Constrained DPO under Endogenous Prompts and Fairness Caps
In response to Scalable AI Safety via Doubly-Efficient Debate:
- Robust Doubly-Efficient Debate with Correlated, Contaminated Human Judgment
- Debate-Triggered Audits for Tool-Using Agents: Closed-Form Oversight Savings, Identification, and 2026-Scale Evidence
- Debate-as-Compliance: Audit-Triggered Mechanisms for Constant-Cost Oversight of Agentic AI
In response to Preference Learning Algorithms Do Not Learn Preference Rankings and On Scalable Oversight with Weak LLMs Judging Strong LLMs:
- Debate as Anti-Amplification: Competitive Persuasion with Verifiable Evidence and Weak Judges
- Optimal Auditing for Alignment: An Audit Index for Label Allocation under Reference-Conditioned Preference Optimization
- The Reference-Conditioning Trap: A Population Bound on Mis-Ranking Corrections under Fixed-Reference DPO