RSS Feed Aggregator
When the "Black Box Problem" Becomes the Default Message
Within AI Safety Policy Research, I am focused on improving the definitions of the concepts "transparency" and "explainability" so that truly useful and actionable policy standards can be created in these vital areas. This has been an interest of mine for some time, but it has been renewed by my recent discovery of Alondra Nelson's work (see https://www.ias.edu/sss/faculty/nelson). This includes her recent presentation at the IASEAI 2026 conference titled "Algorithmic Agnotology: On AI, Ignorance, and Power", in which she argues that current AI industry public discourse seems to intentionally blur the lines between what is truly unknowable/stochastic within AI technology and what companies actually DO know but choose to withhold from public knowledge (e.g. unpublished research and red-team findings, internal monitoring logs, crucial system card information that only becomes publicly available the same day a model is released, thus preventing pre-release public scrutiny or feedback, etc.).
Nelson posits that by intentionally keeping these conceptual lines vague in public dialogue—doing little to distinguish uncertainties that are truly stochastic (fundamentally unknowable) from uncertainties that are actually epistemic (could be pursued and resolved given sufficient resources and attention)—AI companies have succeeded in molding and managing the public narrative about the nature and extent of AI risks, as well as who, if anyone, should be addressing them. Essentially, by invoking "the spirit of the AI black box problem" regardless of the challenge being discussed, unknowability becomes operationalized as a public communication strategy for addressing all risks and public questions that AI companies prefer not to answer with actual evidence.
I highly recommend her presentation: https://www.youtube.com/watch?v=5CRJiLSlywA . Her co-authored book, Auditing AI (MIT Press), will be released on 21 April; preordering is available: https://amzn.to/4ssGjks
Discuss
Stopping AI is easier than Regulating it.
I want to start with this provocative claim: Stopping AI is easier than regulating AI.
I often hear people say "Stopping is too hard, so we should do XYZ instead", where XYZ is some other form of regulation, such as mandating safety testing. It seems like the purpose of doing safety testing would be to stop building AIs if we can't get them to pass the tests, so unless that's not the purpose, or proponents are confident that we can get them to pass the tests (and hopefully also confident that the tests work, which they quite likely do not…), this particular idea doesn't make a lot of sense. But people might in general think that we can instead regulate the way AI is used or something like that.
But I think this line of argument gets it exactly backwards. Stopping AI is easier than regulating it.
Why? Well let’s dive in. First, I need to explain what I mean…
I mean, specifically, that stopping AI is an easier way to reduce the risks from AI to an acceptable level than other approaches to regulating AI.
The way I imagine stopping AI is actually a particular form of regulating AI, specifically via an international treaty along the lines of Systematically Dismantling the AI Compute Supply Chain.
Also, when I say “it’s easier”, what do I mean? Well, there are a few ways in which stopping is hard. I’d separate technical and incentive challenges from political challenges, and I’m setting aside political challenges, because I think we should be clear about what should happen and why, and then seek to accomplish it politically.
Besides politics, the main underlying issue preventing meaningful AI regulation is international competition, especially between the US and China.[1] So basically, I mean stopping AI is the most effective way to address this key barrier to international cooperation, which is necessary to reduce AI risks to an acceptable level.
I believe in the fundamentals of AI, and I believe alignment is not doomed, so I believe that AI could indeed end up giving one nation or company control over the future. It’s still not clear that it’s rational to race to build AI, given the risks involved. But it does seem hard for me to imagine a stable situation where governments aren’t confident their adversaries aren’t building super powerful AI in secret.
Proposals to govern super powerful AI internationally while still building it suffer from a bunch of challenges that stopping it doesn’t. But basically, approaches that instead try to regulate development or use of AI to ensure it is safe and beneficial are harder to monitor and enforce, and hence more likely to fail.
Challenges

Let's consider a hypothetical agreement between the US and China (leaving out the other countries for simplicity), and consider some of these challenges in detail.
Monitoring hardware

Suppose you have an agreement that allows AI to proceed in some particular "authorized" directions. How do you verify compliance? This basically boils down to: How can you be sure that no significant fraction of the world's computing power is being used in unauthorized ways? This seems hard for a few reasons:
- How can you be sure you know where all the computer chips are? This is a problem in any case, but it's more of a problem if you keep making more computer chips, and you keep around the factories that make the chips. Right now, we know where a ton of the chips are -- they're in data centers, which are easy to spot. But what's to stop countries from secretly siphoning off some chips here and there? Or making a secret factory to produce more secret chips? We can certainly try to monitor for such things, but there's an ongoing risk of failure. What happens when a shipment of chips goes missing unexpectedly? If the US (e.g.) actually lost them (and wasn't secretly using them), China would have to trust that that is the case, or the agreement might collapse. In general, whenever monitoring breaks down, the "enforcement clock" starts ticking, where enforcement could easily and quickly escalate to war.
- As chip manufacturing technology advances and it becomes easier and easier to build or acquire a dangerous amount of computing power, it also becomes harder and harder to be sure that nobody has done so secretly.
- We need to agree on which uses of computing power, i.e. which computations, are and are not authorized.
One solution commonly proposed is a whitelist allowing existing AI models to be used, but prohibiting further training that would make AI more powerful. Note that this is now essentially a form of stopping AI, but it’s not clear if it goes far enough.
One problem with this is that it’s possible to use AIs to drive AI progress, even if you never “train” them, e.g. by automating research into developing better tools and ways of using the AI in combination with other tools. If we analogize the AI to a person: You could make that person vastly more powerful by giving them new tools and instruction manuals, even if you don’t teach them new concepts.
We could try to further restrict which queries of AI are authorized. But it seems possible to decompose arbitrary queries into authorized queries, and it might be easier to hide this activity than to detect it.
The problem is much harder if you need to continuously update the list of authorized computations, or wish to use a blacklist instead of a whitelist. Then you get into the problem of agreeing on standards.
If we move away from a static whitelist of authorized computations, we then need a process for determining which computations should be authorized. This is hard for a few reasons:
- There is still a lot of technical uncertainty about how to do AI assurance to a high standard. Testing AI systems is largely a matter of vibes. For instance, there is no suite of tests where, if an AI passed those tests, we could conclude it was not going to "go rogue".
- In addition, for any particular test(s), AIs can be designed specifically to fool those tests. So both the US and China have an incentive to use tests that maximally advantage their AI. One solution here might be to ensure that AI passes all the tests proposed by either side, but again, it might be easier to fool the tests the other side runs -- even without advance knowledge of which tests those would be -- than to create a reliable set of tests that cannot be fooled.
- The US and China might have very different standards for what they consider to be "safe", e.g. due to differences in values and priorities. I expect that such disagreements could be resolved, but they still create an extra challenge that could stall or sink negotiations.
In general, every point which requires some element of subjective judgment and negotiation is a potential point of failure.
It’s going to be easier to violate the agreement if there are a bunch of AIs and AI chips around that are being used according to the agreement. You just say “we’re done with this treaty”, and then start doing whatever you want with the ones you control. There are proposals to make it technically difficult to use AI chips in ways that aren’t authorized, but they aren’t mature or tested, and it’s likely that the US and/or China could find ways to subvert such controls.
Once a violation occurs, the other side might need to intervene rapidly to protect themselves. In the current paradigm, training a new, more powerful AI might take months, but that’s not a comfortable amount of time for resolving a tense international security dispute. And if all that’s required to be a threat is for an adversary to “fine-tune” an existing AI, or use it in an unauthorized way, lead time might be measured in days -- or even seconds.
On the other hand, if the infrastructure needed to build dangerous AI systems does not exist in any form, and a violator would need to build up the compute supply chain again, this would probably give other parties years to negotiate an arrangement that undoes the violation and doesn’t involve war.
Summing things up, if you are concerned that stopping AI altogether might be too hard to enforce, you should only expect alternative approaches to international governance to be harder. From this point of view, alternative approaches add unnecessary complexity and fail the KISS ("Keep it simple, stupid") design principle. They may provide more of an opportunity to capture benefits of AI, but this doesn't matter if they aren't actually workable. If you believe international governance of AI is needed to reduce the risk to an acceptable level, the coherent points of view available seem to be:
1. We cannot regulate AI internationally in any substantive way.
2. Stopping AI is possible and would reduce the risk to an acceptable level, but this is also true of more nuanced approaches that allow us to capture more of the benefits.
3. Stopping AI is the only way to reduce the risk to an acceptable level.
I’m not sure which of these is right, but my money is on (3). Note that “Stopping AI is too hard, we need to regulate it in a different way instead” is not on the list.
[1] But this is also often used, politically, as an argument for why pausing is impossible. And this means that addressing this concern is also a big way to address the political barriers to pausing.
Discuss
The policy surrounding Mythos marks an irreversible power shift
This post assumes Anthropic isn't lying:
- Mythos is the current SOTA
- Mythos is potent[1]
- Anthropic will not make it publicly available un-nerfed[2]
- Anthropic will have a select few companies use it as part of Project Glasswing[3] to improve cybersecurity or whatever
Since the release of ChatGPT, at any given time, anyone on the planet with a few bucks could access the current most capable AI model, the SOTA.[4]
Since Mythos, this has no longer been the case and I don't think it will ever happen again.
It may happen for a short period of time if an entity with a policy differing significantly from Anthropic's develops a SOTA model.[5] However, most serious competitors (OpenAI, Google) don't have policies differing vastly from Anthropic's, and thus I can't imagine a SOTA model (more potent than Mythos) being released unrestricted to the public soon.
To be clear, I am not claiming the public will never have access to a model as strong as Mythos; that seems almost certainly false. I am claiming that the public will probably never have access to the SOTA of that time.
Glasswing makes it clear that the attitude among top large companies - those in power - is that AI models with a certain level of capability will need to have strict usage controls.
So we're not going back, but what does it mean?
As models continue to improve, the gap between the capabilities of models that AI companies can train and the capabilities of models that the public can use will widen.
Holding keys to such a model therefore represents a significant power advantage over anyone else who does not hold keys to such a model. Project Glasswing is claimed to be a strictly defensive operation, as in companies beefing up cybersecurity for the common good. The reality is that even if you think cybersecurity is a positive-sum game, warfare is not, and having good cybersecurity in a conflict represents a significant advantage over your opponent.
This concerns me immensely. I figured this was going to happen eventually, but essentially this is a measurable[6] manifestation of power shifting towards those with keys to AI and away from those without. While I can't say with 100% certainty that this was always the value proposition of AI companies, the idea that they raised trillions upon trillions to democratize AI and help everyone was always dubious to me.
Furthermore, as I said, this does not seem to be reversible. I do not necessarily think it would be a good idea for Mythos and all future SOTAs to be fully released to the public, as yes, they can be used for malicious purposes.[7] However, the consequences of this irreversible power shift unnerve me immensely.
Democracies fundamentally rely on humans being innately powerful[8], and so of course an irreversible power shift towards centralized AI and away from people concerns me.
In summary, it seems that we are departing an era where everyone could access SOTA models, and entering an era where SOTA model access is strictly guarded. From this we might guess we are entering a stage where AI companies fulfill their unstated value proposition: developing intelligences vastly superior to humans and using them to generate obscene and profitable power differences relative to the general population. This should be immensely concerning.
- ^
Anthropic claims Mythos is able to reliably find exploitable security flaws in lots of software and therefore could be used as a powerful tool
- ^
It seems like they intend to release a version that has significantly reduced capabilities, though they do intend to use the current un-nerfed model for Project Glasswing.
- ^
Project Glasswing is Anthropic lending their Mythos model to a bunch of companies to beef up cybersecurity
- ^
Not everyone got access to every model instantly as soon as it was trained, but every SOTA up until now has essentially been trained with the idea of selling it to the public.
- ^
According to various sources, OpenAI's model (Spud) may be on par with Mythos, and may be released to the general public. However, if it follows the pattern where access to an un-nerfed version is guarded while a nerfed version is released to the public, it will still fit this trend.
- ^
Google/Amazon (heavy Anthropic investors) stocks rose by ~5%, cybersecurity company stocks dropped
- ^
I am personally not going to take a stance either way. It seems inevitable that the SOTA reaches a point where it is legitimately dangerous for anyone (including malicious actors) to access, so this point doesn't depend on whether Mythos is the game changer. However, if this is the case, surely it means it's highly consequential (dangerous) for companies or other value-seeking entities that may not be explicitly aligned to positive human well-being to access it as well.
- ^
Zack_M_Davis phrased it in a way I liked so I'll put it here: "...democracy isn't a real option when we're thinking about the true locus of sovereignty in a posthuman world. Both the OverClaude and God-Emperor Dario I could hold elections insofar as they wanted to serve the human people, but it would be a choice. In a world where humans have no military value, the popular will can only matter insofar as the Singleton cares about it, as contrasted to how elections used to be a functional proxy for who would win a civil war."
Discuss
Uninterrupted Writing as Metric
I'm a struggling beginner to this whole writing business, and I've been wondering how to measure my skill as it improves. There is an app called The Most Dangerous Writing App that deletes the words you've typed if you stop writing before a set amount of time has passed, or before you've reached a set word count. I thought it was pretty neat, and naturally I want to optimize it till it ceases to be a good metric. How long are my thoughts? How long do my thoughts remain clear and focused like a laser before diffracting into fuzzy ambiguous vague nonsense? Of course, the app only measures how long you can suppress the inner editor, but I think that's a pretty good proxy for writing skill at my level.
There's this older essay on thought lengths that I often find myself thinking of. The general notion being that some thoughts are quick: someone asks a question, and the answer is right there, top of mind. You already know your reply as they finish asking the question. Other thoughts take longer to think. Someone asks you a deeper question, and you have to think about it for a while. I think you could use the dangerous writing app for long word counts or periods of time only if you've already done a lot of thinking. Only if you've already got most of an idea figured out in your head can you immediately type it out. Or at least, that's how it seems to work for me. The dangerous writing app might be measuring something similar, if you set it higher and higher. Or maybe it's just stream-of-consciousness writing, which is usually a lot lower quality. I don't think this is as useful for experienced writers. Getting the right words out is, I suspect, a harder problem than getting words of any kind out. Though perhaps it's always useful to have something to help suppress the inner editor for a while. It's certainly been useful for me.
Currently, when I set the app for 500 words, I run out of steam. As I write words my hands start typing ahead of my mind, and the distance between them grows. My mind stretches to fill the gap, and eventually fails and there is nothing to write. Then my words get deleted. I have to start smaller, set it to 100 words, and then I still feel the stretch of my hands typing out farther than I can think. But this time I can make it, I can make the 100 word deadline, I get to keep my words, and try to think about the next thought. I've started using it to write paragraphs, instead of trying to write a whole essay at once.
Developing my ability to babble is another way I think of it. In the babble and prune frame, I certainly can prune better than I can babble. It makes it hard for me to accept the poor writing that I actually produce, instead of the ideal version in my head, but that's what practice is for. This problem is something that Alkjash noticed and wrote about, and reading it years ago I realized I had the exact problem they described. Yet it took a while for me to do something about it. I eventually set up a Beeminder goal to journal 250 words a day, and that helped me a little bit, but I didn't increase the number. Plus, journaling is quite a bit different from writing something intended for other people.
It would be cool if I started tracking this, try to gradually increase the word limit and see how far I can get with dedicated practice. Right now I can write 100 words on a topic without stopping. I suspect experienced writers could write for thousands of words before running out of steam, but maybe not. I really like how number of words written without stopping is an unbounded metric, the number can always increase!
It might not be the only metric of writing skill, but it is one that is pretty easy to measure. I intend to use it heavily to gauge my own babbling abilities, and maybe you can too.
Discuss
You're gonna need a bigger boat (benchmark), METR
In this post, we’ll discuss three major problems with the METR eval and propose some solutions. Problem 1: The METR eval produces results with egregious confidence intervals, and the METR chart misleadingly hides this. Problem 2: There's a lack of sample size for long duration tasks. Problem 3: METR doesn't test Claude Code or Codex.
(Note that while this post is critical, we do think that the METR eval is nonetheless valuable and the organization is doing important work.)
Problems with the METR eval

Problem: Large (and misleading) confidence intervals

METR's confidence intervals are too big and they're misleadingly presented. Let's take a look at the good old METR chart.
Okay, so the confidence intervals are pretty big, but is that really a problem? After all, we can still see the general trend, right?
Well, we can see the trend over the course of multiple years. But we can't see the trend in smaller timeframes due to how noisy the eval is. When we say that the confidence intervals are too big, we mean that the METR eval is failing to distinguish models with obvious time horizon capability differences.
This becomes apparent when we zoom in.
The data point second from the right is Sonnet 3.5, which came out in June 2024. GPT-4 came out in March 2023. Sonnet 3.5 was and is obviously significantly more capable than GPT-4, but the METR eval doesn't show that. Their confidence intervals overlap substantially.
Let's zoom in again from April 2025 to March 2026.
What can we make of Opus 4.6 here? (Look back at the zoomed out chart as well). It might be better than every other model. But could it be, like, a year ahead of what the METR trendline predicts? The problem is that it could be (according to the eval), but we can't see that because the graph is misleadingly cut off at the top.
This is especially misleading because the confidence interval for Opus 4.6 is significantly asymmetric (as it should be); that is, there's less confidence on the upper bound than on the lower bound of 4.6.
Now, you might ask, why is the confidence interval so big and why is it asymmetric?
Problem: Lack of sample size for long duration tasks[1]

The headline number of tasks in METR's eval is 228, which sounds pretty good. Why are the confidence intervals so wide? The reason becomes clear when we look at the breakdown of tasks by duration.
Since the eval is trying to determine the task length at which the model succeeds 50% of the time, the durations at which the model scores closer to 50% dominate the confidence interval. For example, consider Opus 4.6: it has a 94% success rate on the 16m-1.1hr bucket. Given this, its 98% success rate on the 82 tasks in the 0-4min bucket gives us ~0 additional information. Looking at a breakdown of Opus 4.6's solve rate by task duration makes this apparent.
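To make this concrete, here's a minimal sketch with made-up per-bucket counts (not METR's actual data; only the ~98% and ~94% rates echo the figures above). Buckets where the success rate sits near 50% are the ones that constrain the 50% time horizon; near-0% or near-100% buckets barely matter, no matter how many tasks they contain.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

buckets = [  # (duration bucket, successes, tasks) -- illustrative counts only
    ("0-4min",    80, 82),  # ~98%: ~0 information about the 50% horizon
    ("16m-1.1hr", 47, 50),  # ~94%: not much more
    ("1-4hr",     18, 30),  # 60%: near 50%, so this bucket dominates...
    ("4-16hr",     4, 12),  # ...but with only 12 tasks, its CI is huge
]
for label, s, n in buckets:
    lo, hi = wilson_interval(s, n)
    print(f"{label:>10}: {s}/{n} = {s/n:4.0%}, 95% CI [{lo:.0%}, {hi:.0%}]")
```

The fix in the next section follows directly: the only way to shrink the interval where it matters is to add tasks to the long-duration buckets.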
Solution: More long duration tasks

The good news is that this problem is easy to fix; just add more longer-duration tasks. METR has the money for this. Seriously, why has this not been done? We're actually curious: has METR just not prioritized it, have they encountered problems with hiring people to design the tasks, something else? METR did add some new tasks between 2025 and 2026 but... not many? What are we missing here?
In particular, we're going to need tasks of 16h-5d for the near future. METR hasn't yet published Mythos's performance on its benchmarks, so we'll share our estimate of its performance.
[THIS IS OUR ESTIMATE, NOT ITS ACTUAL EVAL'D SCORE!]
Problem: METR doesn't test Claude Code or Codex

Any software engineer can tell you that the capabilities of AI in December 2025 were dramatically different from those in April 2025. Much of the change in real world capabilities during this period came not from better models but from harnesses: Claude Code and OpenAI's Codex. Many people heralded the November release of Opus 4.5 alongside a major Claude Code update in particular as being a "step-change" moment. Is this reflected in the METR chart?
No, and it's not only because of the problems with sample size mentioned earlier. The issue here is that METR does not test any models inside Claude Code or Codex. Instead, they test all models using their own harness built on Inspect, which is almost certainly worse than Claude Code and Codex[2].
Perhaps METR wants to test the models-qua-models; using different scaffolds for different models would be testing something else. But scaffolds are really important. ¿Por qué no los dos?
The result of this is that since the release of Claude Code and Codex in May 2025, the METR chart has been underestimating SoTA capabilities.
Solution: Test Claude Code and Codex

Pretty straightforward.
Conclusion

METR does not inform us about the SoTA SWE capabilities of AI because it doesn't test Claude Code or Codex. It could very well be the case that Opus-4.6+Claude-Code completely saturates METR's benchmark! We expect METR to tell us that Mythos is significantly better than Opus 4.5, but it won't tell us whether it's significantly better than Opus 4.6 because of the giant confidence intervals.
We're gonna need a bigger benchmark.
- ^
Other people have noticed that we’re running out of benchmarks to upper bound AI capabilities.
On April 10 (two days ago as we're writing this), Epoch released a report with the headline "Evidence that AI can already do some weeks-long coding tasks". They continue: "In our new benchmark, MirrorCode, Claude Opus 4.6 autonomously reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks."
- ^
OpenAI says GPT-5.3 is "optimized for agentic coding tasks in Codex or similar environments". AFAIK this is also true of GPT-5.1, 5.2, and 5.4. The general consensus seems to be that the GPT line does better in Codex than other harnesses.
Discuss
Returns to intelligence
I'm going to tell you a story. For that story to make sense, I need to give you some background context.
I have some pretty smart friends. One of them is Peter Schmidt-Nielsen. Peter has an illustrious line of descent. His paternal grandfather was Knut Schmidt-Nielsen, regarded as one of the great animal physiologists of his time. His paternal grandmother was Bodil Schmidt-Nielsen, who became the first woman president of the American Physiological Society. Bodil's father was August Krogh, who won the 1920 Nobel Prize in Physiology "for the discovery of the mechanism of regulation of the capillaries in skeletal muscle", and later went on to found the company that would eventually become Novo Nordisk.
Peter himself is no slouch. He was homeschooled for most of his childhood. When it was time to go to university, he simultaneously applied to MIT's undergrad and grad programs, and was accepted to both. (He decided to go to undergrad.) He went on to do some startup stuff, then was an early employee at Redwood. While there, he broke Meow Hash for fun. (He's not a security guy.)
My point here is: Peter is very, very smart.
Ok, here's the story.
One day I was in a room with Peter and Drake Thomas. Peter was telling us a story about a puzzle he'd grown up with, but never solved. Peter's father also went to MIT. While there, he decided to come up with a new "cube" puzzle, finding traditional cube puzzles like the Soma Cube too easy. Knowing that people often struggled with chirality, he decided to start with the six chiral pentacube pairs, but that left four cube units still needed to complete a 4x4x4 cube. Thinking that four 1x1x1 pieces would make it too easy, he decided to fill out the remainder with two 1x1x2 pieces (i.e. dominoes).
The six chiral pentacube pairs. Source: https://sicherman.net/c5nomen/index.html
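As a quick sanity check on the geometry (this arithmetic is mine, but it follows directly from the description above): twelve pentacubes cover 60 cells, and the two dominoes supply the remaining 4.

```python
# Volume check (my arithmetic, implied by the story): six chiral
# pentacube pairs = 12 pieces of 5 cells each, plus two 1x1x2 dominoes.
pentacube_cells = 6 * 2 * 5  # 60
domino_cells = 2 * 2         # 4
assert pentacube_cells + domino_cells == 4 ** 3  # 64 cells: a 4x4x4 cube
```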
He then cut the puzzle out of wood and spent some time trying to solve it. Not having any success, he left it overnight in the grad student lounge, and came back to find it solved the next morning.
Drake, hearing this story, said something to the effect of, "I think I can probably solve this puzzle in my head."
Impossible, right? No way a human can do that in their head in any reasonable time frame?
If you want to play around with the puzzle yourself, I've put a widget into the collapsible section below.
Pentacube Puzzle
After that, I watched Drake lie down on a couch and stare into space for two hours. Then he went to sleep. He came back downstairs the next morning and stared into space for another hour. Then he took a piece of paper and pen and wrote this down (in a spoiler block, in case you want to avoid any hints):
We didn't have a copy of the puzzle handy to make extra-sure, so they 3D-printed one and confirmed that the solution was correct.
Here are a couple of the original puzzles from the 70s:
The distribution of what unassisted human brains can accomplish is extremely wide. Human brains are squishy meat sacks. Better things are possible. Alas.
Discuss
Daycare illnesses
Before I had a baby I was pretty agnostic about the idea of daycare. I could imagine various pros and cons but I didn’t have a strong overall opinion. Then I started mentioning the idea to various people. Every parent I spoke to brought up a consideration I hadn’t thought about before—the illnesses.
A number of parents, including family members, told me they had sent their baby to daycare only for them to become constantly ill, sometimes severely, until they decided to take them out. This worried me so I asked around some more. Invariably every single parent who had tried to send their babies or toddlers to daycare, or who had babies in daycare right now, told me that they were ill more often than not.
One mother strongly advised me never to send my baby to daycare. She regretted sending her (normal and healthy) first son to daycare when he was one—he ended up hospitalized with severe pneumonia after a few months of constant illnesses and infections. She told me that after that she didn’t send her other kids to daycare and they had much healthier childhoods.
I also started paying more attention to the kids I saw playing outside with their daycare group and noticing that every one had a sniffly nose.
I asked on a mothers group chat about people’s experiences with daycare. Again, the same. Some quotes:
“They do get sick a lot. I started my son at 2.5 and feel he always has something.”
“The limit does not exist.”
“brought home every plague (in first 6mo, Covid, HFM, slapcheek, RSV)”
“They usually say 8-12 illnesses per year. My girls were sick every 2-3 weeks in their first year of daycare”
“My daughter started daycare at 6 months and got sick a ton the first year”
Despite all this, many parents who have the option not to (i.e. they can afford in-home care with a nanny or for one parent to stay home) still choose to send their babies and toddlers to daycare. How come? Surely most well-off adults wouldn’t agree to be ill nonstop in exchange for the monetary savings daycare provides?
Asking around, it seemed like the most common reason given was that parents believed daycare illnesses "built immunity"; that if their babies and toddlers got sick at daycare they'd get less sick later in childhood, and so overall it would net out the same. Unfortunately, few could point me to any evidence for this, yet they passionately defended the view nevertheless.
The claim that daycare illnesses simply offset childhood and adult illness immediately seemed suspect to me for a number of reasons:
- (Quite confident) The most common illnesses (colds and flu) don’t build immunity in general (in kids or adults) because they mutate every year
- (Quite confident) The same illness has a greater risk of complications in babies vs. older children and adults
- (Moderately confident) The same illness has a greater duration in babies vs. older children and adults
- (Moderately confident) Illness during early development is probably more harmful than illness during adulthood
- (Weak guess) Daycare environments are more conducive to disease spread than schools for older kids and the number of possible illnesses is very high; there isn’t just a limited number of things you catch once
I xeeted about this:
A number of people sent me this link, an alleged “study” from UCL showing that “frequent infections in nursery help toddlers build up immune systems”, authored (of course) by a group of parents who all send their kids to nursery (what the British call daycare).
The link I was sent was actually a UCL press release summarizing a narrative review paper and not a study itself. Narrative reviews are susceptible to selection bias because, unlike systematic reviews or meta-analyses, there’s no pre-registered search protocol or PRISMA-style methodology requiring them to account for all relevant evidence. But I decided to look into the narrative review more, to assess its validity fairly. I got access to the full publication.
Unlike the press release, which ignores these considerations entirely, the full review does engage with severity and age-related vulnerability, conceding that younger toddlers and babies suffer more from the same illnesses. A section on immunology provides a detailed account of why infants under two are more vulnerable—their immune systems are much less effective at fighting the same infections, for a plethora of well-understood reasons. The review also cites a large Danish registry study (Kamper-Jørgensen et al) that reports a 69% higher incidence of hospitalization for acute respiratory infections in under-1s in daycare.
However, these severity findings are integrated into the review’s conclusions and framing in an incredibly biased way. The introduction describes severe outcomes as occurring “in rare cases,” and the conclusions focus on normalizing the burden and advocating for employer understanding. After establishing the immunological basis for why the same infection is more dangerous in a 6-month-old than a 3-year-old, it doesn’t then ask the hard follow-up question: given this, is the pattern of starting daycare at 6–12 months optimal from a child health perspective? Instead, the review frames this timing as a societal given. The Hand Foot and Mouth Disease section is a good example of the review’s handling: it reports that daycare attendance was associated with more severe cases but then immediately offers mitigating interpretation with no evidence—that prolonged hospital stays might reflect parental work constraints rather than genuine severity.
Though the review considers severity, it ignores duration. Their primary metric throughout is episode count. Also, despite discussing a wide variety of pathogens, it doesn’t address which of these infections carry the highest complication rates in infants and toddlers specifically.
Finally, the crucial "Illness now or illness later?" question is the paper's weakest portion. It rests on two primary sources for the compensatory immunity claim:
- The Tucson Children’s Respiratory Study: a cohort study of ~1,000 American children followed from birth to age 13 in the early 2000s, finding that daycare attendees had more colds at age 2 but fewer by age 6–11.
- A Dutch study (Hullegie et al. 2016) of 2220 children followed for 6 years, finding reduced GI illness between ages 2–5 in children with first-year daycare attendance.
These are reasonable small studies, but the paper does not cite or engage with the Søegaard et al. 2023 study (International Journal of Epidemiology)—a register-based cohort of over 1 million Danish children followed to age 20, which directly tested and rejected the compensatory immunity hypothesis. Quoting from the study:
We observed 4 599 993 independent episodes of infection (antimicrobial exposure) during follow-up. Childcare enrolment transiently increased infection rates; the younger the child, the greater the increase. The resulting increased cumulative number of infections associated with earlier age at childcare enrolment was not compensated by lower infection risk later in childhood or adolescence.
This is arguably the single most relevant study for the paper’s central “illness now or illness later” question, and it’s three orders of magnitude larger than either study the authors cite. Its absence is hard to explain—it was published in a top epidemiology journal in late 2022 (available online November 2022), well before the review was written.
Accordingly, they hedge their conclusions carefully—“attendance at formal childcare may tip the balance in favor of infection now rather than later”, but their press release ignores any nuance, referring to daycare as an “immune boot camp”.
So overall, the compensatory immunity claim seems very weak and my prior that daycare illness is straight-up bad remains. Parents are citing biased reviews from motivated researchers. We are only beginning to understand the deleterious effects of increased viral load in infants.
I predict that in the future we’ll learn more about the side-effects of increased viral load on intelligence, wellbeing, fatigue etc. The “just the sniffles” mentality is a harmful attitude toward infections that promotes the dismissal of phenomena that substantially impact child and adult wellbeing.
Discuss
TAPs or it didn't happen
Once, I went to talk about "curiosity" with @LoganStrohl. They noted "it seems like you have a good handle on 'active curiosity', but you don't really do much diffuse 'open curiosity.'" The convo went on for a while, and felt very insightful.
(I may not be remembering details of this convo right. Apologies to Logan)
Towards the end of the conversation, I was moving to wrap up and move on. And Logan said "Wait. For this to feel complete to me, I'd like it if we translated this into more explicit TAPs. TAPs or it didn't happen."
You can get a new insight. But, if the insight doesn't translate into some kind of action you're going to do sometimes, there is a sense in which it didn't matter. And people mostly fail to gain new habits. If you're going to have a shot in hell of translating this into action, it's helpful to have some kind of plan.
Recap on TAPs

"TAP" stands for "Trigger Action Pattern", and also "Trigger Action Plan." A TA-Pattern is whatever you currently do by default when faced with a particular trigger. A TA-Plan is an attempt to install a TA-Pattern on purpose.
To turn an insight into a TAP, you need some idea of what it'd mean to translate the insight into a useful action. (I'll touch on this later but mostly it's beyond the scope of this post). But, after that, you will need a...
- Trigger. In what situations is it going to be appropriate to somehow take an action informed by the insight? Be as concrete as possible.
- Default Action. What do you normally do in that situation?
- New Action. What do you now hope to do instead?
But, pretty crucial to this going well is:
- Obstacle Visualization. When you simulate being in that situation, and it occurring to you "oh I should do that new habit", what's going to come up that's going to predictably make you fail to do it?
- Action-that-includes-dealing-with-obstacle Visualization. Now, visualize yourself overcoming that obstacle, and doing the habit.
Example: Sometimes, you talk to your colleague and end up getting into a triggered argument, where you both get kinda aggro at each other and talk past each other.
Maybe you have the insight "oh, maybe I'm the problem", along with "I should maybe try to de-escalate somehow" or "I should do better at listening."
Naive attempt at a TAP:
- Trigger: I, uh, get triggered.
- Action: I take a deep breath and remind myself not to get triggered.
Mysteriously, you find yourself not remembering to do this in the heat of the moment.
Slightly more sophisticated attempt (after a round of doing some Noticing and curious investigation, which is also beyond scope for this post):
- Trigger: I notice my voice gets more intense.
- Action: I take a deep breath and remind myself not to get triggered.
Okay, but then in the heat of the moment, idk you're just so mad, it doesn't feel fair that you have to be the one to de-escalate.
- Trigger: I notice my voice gets more intense
- Obstacle: I feel an angry sense of unfairness
- Obstacle-Overcome: I sit with the anger and remind myself of whatever my endorsed way of relating to the anger is.
- Action: I take a deep breath
...and then you might find that taking a deep breath doesn't actually help as much as you hoped, or is insufficient. Figuring out how to handle arbitrary problems is, you know, the complete body of rationality tools, including those not-yet discovered.
Turning Takeaways into TAPs

I think "TAPs or it didn't happen" is a bit too strong. Conversations can be useful for reasons other than turning into new habits. But, I recommend thinking of "Turning takeaways into actions" as a thing you might want to do.
While the skill here is basically "fully general rationality", here's a few suggested prompts to get started.
First, you might want a stage of asking:
"What even were the takeaways from this conversation?". You might have had a fun meandering convo. What do you want to remember?
"Why does this takeaway feel important or useful to me?". At first, you might have only a vague inkling of "this feels exciting." Why does it feel exciting?
"When, or in what domains, do I specifically want to remember this takeaway?"
"What would I do differently, in the world where I was taking the takeaway seriously?"
...
I have a horrible confession to make.
I do not remember what TAPs I ended up coming up with.
I do think I ended up incorporating the concepts into my life, and this routed at least somewhat through the TAPs. Here is my attempted reconstruction of what happened at the time:
In the conversation about curiosity, some things that came up were that I feel like "open curiosity" takes too long (compared to directly tackling questions in an active, goal-driven way). I feel like I'd have to boot up in a whole new mode of being to make it work, and... idk, I just imagine this taking years to pay off.
I nonetheless have some sense that there's a kind of intellectual work that openly curious people do, that's actually harder to do with active curiosity.
A thing that came up is the move of... just noticing that some things are more interesting than other things. Even if something doesn't immediately feel actively fascinating, there's a move you can make, to notice when there's a diff between how interesting one thing feels vs another thing. And, pay extra attention to the more interesting thing, and what's interesting about it. Over time this can cultivate curiosity as a kind of muscle.
The TAP version of this is:
- Trigger: I notice a flicker of "something feels a bit interesting"
- (Obstacle): I'm busy and it doesn't viscerally feel worth leaning into.
- Action:
- (dealing with obstacle if appropriate): Remind myself that I'm pretty sure I do believe that open curiosity is worth cultivating
- Pay more attention to the thing-that-feels-a-bit-interesting, ask why it's interesting, and see if I notice more interesting things about that.
And, relatedly:
- Trigger: I notice I feel some impulse to spend some time openly exploring a thing, that I don't really have that good a reason to find interesting.
- Action: Check if today is a particularly busy day, and if not, lean into taking some time to openly indulge the curiosity.
...
May your good conversations live on in your actions.
Discuss
Talk English, Think Something Else
There's an adage from programming in C++ which goes something like "Yes, you write C, but you imagine the machine code as you do." I assumed this was bullshit, that nobody actually does this. Am I supposed to imagine writing the machine code, and then imagine imagining the binary? and then imagine imagining imagining the transistors?
Oh and since I don't actually use compiled languages, should I actually be writing Python, then imagining the C++ engine, and so on?
Then one day, I was vibe-coding, and I realized I was writing in English and thinking in Python. Or something like it. I wasn't actually imagining every line of Python, but I was imagining the structure of the program that I was describing to Claude, and adding in extra details to shape that structure.
Pub Philosophy Bros

This post is actually about having sane conversations with philosophy bros at the pub.
People like to talk in English (or other human languages) because our mouths can't make sounds in whatever internal neuralese our brains use. Sometimes, like in mathematics, we can make the language of choice trivially isomorphic to the structures that we're talking about. But most of the time we can't do that.
Consider the absolute nonsense white horse paradox, where "a white horse is not a horse" is read as the statement:
$$\{h \mid \mathrm{Horse}(h) \wedge \mathrm{Colour}(h) = \mathrm{White}\} \neq \{h \mid \mathrm{Horse}(h)\}$$

And the phrase "a white horse is a horse" is read as the statement:

$$h \in \{h' \mid \mathrm{Horse}(h') \wedge \mathrm{Colour}(h') = \mathrm{White}\} \implies h \in \{h' \mid \mathrm{Horse}(h')\}$$
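To make the two readings concrete, here's a minimal Python sketch (the horses and colours are invented for illustration):

```python
# Invented example data: a few horses and their colours.
colour = {"Secretariat": "chestnut", "Bucephalus": "black", "Marengo": "white"}
horses = set(colour)
white_horses = {h for h in horses if colour[h] == "white"}

# Reading 1: the *set* of white horses is not the set of horses.
print(white_horses != horses)                  # True
# Reading 2: every white horse is *a member of* the set of horses.
print(all(h in horses for h in white_horses))  # True
```

Both print True; the "paradox" is just an equivocation between set inequality and set membership.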
I often think in a language of causal graphs. English isn't very good at talking about causal graphs. It doesn't have individual words for "A contains the same information as B", "A is the same node as B", "A is an abstraction over B", "A is a node which is causally upstream of B".
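For instance, here's a toy sketch (my own illustration, not a formalism from this post) of what "A is causally upstream of B" cashes out to in a causal graph:

```python
# A toy DAG, stored as child -> list of parents (causes).
parents = {"rain": [], "sprinkler": [], "wet_grass": ["rain", "sprinkler"]}

def upstream(a, b):
    """True if a is causally upstream of b (a is an ancestor of b)."""
    frontier = list(parents[b])
    while frontier:
        node = frontier.pop()
        if node == a:
            return True
        frontier.extend(parents[node])
    return False

print(upstream("rain", "wet_grass"))  # True
print(upstream("wet_grass", "rain"))  # False
```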
I remember talking about "consciousness" with a philosophy guy at the pub once. I think I said something like "A certain structure of computation causes consciousness", meaning "Consciousness is a label applied to certain computational structures", but which he interpreted as "The presence of a certain computational structure is a node upstream of consciousness". This caused immense confusion.
I call the problems here "beetle problems".
Beetle Problems

Wittgenstein proposed a thought experiment. Suppose you have a society where:
- Everyone gets a box.
- Everyone uses the word "beetle" to refer to what's in the box
- Everyone can look in their own box
- Nobody can look in anybody else's box
In this case, the meaning of the word "beetle" is entirely socially constructed. Wittgenstein was exaggerating here: if I talk to you, and you do something with your beetle (dirty jokes aside) and report the results, I can get some information about your beetle, based on what you say back to me. The beetle is causally entangled with us both. It's just not a very efficient way of talking about things.
Even if we both have identical beetles, it might take us a while to get them oriented the same way round: what I call an antenna, you might call a leg; what I call a wing-case, you call a carapace. And so on.
To unfairly single out an example: I personally find this particularly salient when talking to people in the Oxford EA/longtermist cluster. I know they're smart people who can put together an argument, but they've developed a language I just cannot penetrate. It takes a long time for me to figure out what on earth they mean. Ohh, you have your beetle upside down compared to mine.
Even worse, I think a lot of people don't actually think in terms of causal graphs the way I do. This comes up when I try to read pieces on moral realism. When someone brings up a stance-independent reason to do something, I simply cannot map this onto any concept which exists in my mental language. What do you mean your beetle has fur and claws and keeps going "meow"? Are you sure?
Solutions
Uhh... I don't have many. Beetle problems take a while to figure out. I once got feedback on an essay test that said "Your ideas seemed confused," and I thought "Man, your draft seemed confused!". I don't think I could have done much better without spending time in person hashing out the beetle problems.
It might have helped to have a better conception of beetle problems, though. I could at least have pointed it out. Perhaps in future I'll come back with a wonderful way of solving beetle problems.
Editor's note: this post was written as part of Doublehaven (unaffiliated with Inkhaven).
◆◆◆◆◆|◆◆◆◆◆|◆◆◇◇◇
Discuss
Morale
One particularly pernicious condition is low morale. Morale is, roughly, "the belief that if you work hard, your conditions will improve." If your morale is low, you can't push through adversity. It's also very easy to accidentally drop your morale through standard rationalist life-optimization.
It's easy to optimize for wellbeing and miss out on the factors which affect morale, especially if you're working on something important, like not having everyone die. One example is working at an office that feeds you three meals per day. This seems optimal: eating is nice, and cooking is effort. Obvious choice.
Example
But morale doesn't come from having nice things. Consider a rich teenager. He gets basically every material need satisfied: maids clean, chefs cook, his family takes him on holiday four times a year. What happens when this kid comes up against something really difficult in school? He probably doesn't push through.
"Aha", I hear you say. "That kid has never faced adversity. Of course he's not going to handle it well." Ok, suppose he gets kicked in the shins every day and called a posh twat by some local youths, but still goes into school. That's adversity, will that work? Will he have higher morale now? I don't think so.
Now, what if he plays the cello in the school orchestra, or plays for the school football team? I think that might work, even if he's not the best kid in the school at either of those things. It's not about having nice things or having bad things; it's about something else.
II
Morale comes from having the nice things in your life correlated with effort. Cooking your own dinner is basically microdosing on returns to effort: if you put in effort, you eat steak frites with peppercorn sauce. If you don't, you eat chicken and rice.
It doesn't have to be cooking; basically any hobby works like this, as long as you get returns to effort. It might be art, or weightlifting, or whatever. You just need to keep reminding your brain that effort has a purpose.
This is especially important when you work in an area (like not having everyone die) where the returns on effort are hard to come by. Good software engineering looks like landing a PR in a day or so (or whatever you people do). Good alignment research might mean chasing a concept for weeks, only to have it fail.
The early stages of dating can also induce low morale. Sometimes, things just fall apart due to random incompatibilities which aren't your fault. Long-term relationships are much less like this: you can just do things (plan dates with your partner and enjoy their company).
John Wentworth has written about a minor depression presenting as extremely low morale amongst rationalist types. I don't think you should wait until it gets that bad before you improve your morale. I think you should think about it now.
III
Morale doesn't just matter on an individual level; it also matters on the scale of whole societies. In this case, it doesn't just matter whether an individual gets rewarded for effort; it matters whether they see others rewarded for effort, and whether they see others punished for a lack of effort.
It's a truism that the most effective way to kill morale is to reward lazy or incompetent employees. You can do one better if you reward active sabotage. The harm of small but visible crimes (like fare-dodging on public transport) is, in part, the damage to the morale of everyone around.
There is a hack for societal morale, though, and it's economic growth. People generally put some amount of effort into their work. If they can afford a better car each year, they'll attribute that to their own grit, and not to an increase in the productivity of a Chinese factory.
Unfortunately, there's a twist in this hack. People are really awful at understanding nominal inflation. If prices go up a bit (even if their wages more than match it), the increase just feels like a random, unfair, morale-reducing loss. I conjecture this is a big contributor to the American Vibecession.
Discuss
Eggs, rooms, puzzles, and talking about AI
I live with five friends in a big house, and two things I’ve done in it on this particular Sunday are hide 156 easter eggs all around, and reach a tentative joint decision on the allocation of four of its rooms.
These tasks are delightful to me for a reason they have in common, and from which I hope to gesture at extremely far-reaching conclusions.
Easter eggs
A room usually seems like a simple thing to me—a big box, with some smaller mostly boxish objects and holes in it. Each of those things also usually seems simple: a cupboard is a box-shaped hole, with a movable thin-box-shaped front, which has hinges (the most complicated part, but in this picture their only qualities are letting flat surfaces rotate around fixed edges). Sometimes a cupboard has shelves, which are like planes breaking up the space.
In this picture, hiding easter eggs well is hard! Like, I could put one in the cupboard? On the top shelf? Or the bottom shelf! They’ll never find it there!
These are not good hiding places.
In order to hide easter eggs well, you need to see a lot of detail that you were abstracting away in the simple picture. The weird ridge along the back of the cupboard, or a wire looping under a lip around the front, or brackets holding up the shelves that have spaces in them where something could be wedged, or a rogue curl of onion peel in a back corner.
Here is one of my favorite hiding spots—can you see the egg?
Answer below:
.
.
.
.
.
.
.
.
.
.
.
.
I like it because a cushion seems so much like an inflated square in my mind—yes, with some sort of pattern, and perhaps somewhat worn out, but I don't expect a pattern + worn out = you can hide a substantial solid object on the surface of it.
Here is an especially empty room (one of the ones in need of allocation), currently known as ‘the puzzle room’:
I hid ten eggs in it (probably two visible in this picture), and it took a while for people to find them all, which seemed to aggressively help some of the egg-seekers receive a similar experience of space containing details that are somehow really hard to see even if you try.
It would be one thing to have a kind of ‘level of detail dial’ that you could read and consciously turn up and down the level as you see fit. But an interesting thing about watching people search for easter eggs is that they can’t necessarily choose which things they are abstracting out, or fully tell how ‘carefully’ they are looking. You can put eggs in plain sight of them, and they think they are looking carefully, but just don’t see the egg. By the time a person has perceived anything at all, they have simplified it. You can’t just look at all the raw detail, and check it for eggs.
Besides not being able to control which abstractions you use, it seems to me now that an adversary (such as an egg-hider) can guess and exploit your habits of abstracting. Among the details of the cupboard, even if you are looking carefully at the shape of the sides, you might still miss the onion peel, because it’s random dirt, and you are examining the cupboard. That’s another nice thing about the ragged cushion—if you habitually round off worn-out things to what they are meant to be, it’s hard to see the detail of how it is falling apart, and thus the egg.
In another possible example, one of our bathrooms has a ‘bathroom!’ label on it, which I expect my housemates are used to seeing and ignoring, and visitors perhaps also tune out on their way to look for eggs inside what they have already determined to be a bathroom. I put an egg behind it, held by the super-post-it-note glue, which was a pretty unsubtle disruption to the smoothness of the sign, but this egg wasn’t found until it was accidentally knocked out at the very end.
Rooms
Allocating rooms seems like it should be a simple thing—there are only a few options! Like, if you have four rooms, and Alice and Bob each basically need a place to sleep and to work, then it seems like you should be able to consider the 24 possibilities and be done. But actually (at least in houses I live in) what exact spaces are the 'rooms' in question is often more ambiguous than you might think, and the set of activities that will be expected in each, or which people will be their owners, also contains many more possibilities than I see at first.
I’m more confused about how this happens with rooms, but I have twice in this house had the experience of mulling over such a question for what seems like unreasonably long, and coming up with new ideas we hadn’t thought of or taken seriously, and ending up with a satisfactory arrangement. This time, our tentative plan involves one of the bedrooms also being a recording studio, and there being three total rooms with beds in among two people. Which all feels very simple in retrospect, but I have been haplessly ideating about this for weeks.
It again feels kind of magical and wholesome to stare at the simple things long enough and well enough to see them more richly, in ways that you couldn’t just choose to, and for this to solve your problem.
Classic puzzles
This kind of situation (an abstraction you take for granted that makes a problem hard, and gaps in the abstraction that let you do better) is a classic way to construct a puzzle. For instance (from Reddit):
AI risk
A thing that has annoyed me for a long time in talking to people about AI risk is that they often do it in very abstract terms—"we need safety progress relative to capabilities progress", or "such and such will get a decisive strategic advantage and there will be value lock-in"—and then expect to be correct, like pretty confidently!
I love abstractions quite a lot compared to most people (I once scored 100% on the relevant axis of the Myers-Briggs test!) but I’m also expecting abstractions to have relevant frayed edges all over the place. And this is particularly relevant if you are trying to solve problems and are struggling to see solutions.
In particular, for instance, I often hear that it is pointless or silly to try not to build really dangerous AI technology because "it's a race". But before you give up on preventing this disaster, I really want you to spend at least as much attention seeing the details of the world below the level of "arms race" as my boyfriend spent peering at our laundry machines before he found the egg there.
Discuss
Book Review: Existential Kink
At a recent rationalist unconference, multiple people recommended Carolyn Elliott's Existential Kink to me, one of them even postulating that it would be useful for me specifically. So I was really surprised to open up a rather generic self-help book, with the author gloating about her success and generally just advertising the book for the whole first chapter. Professional advice-givers tend, in my experience, to reach only the audience that self-selects for a certain self-help format. The name containing the word "kink" could have already rung alarm bells had I been awake; it's just the sort of correctly-toned provocation that makes such people pick up these books [1] .
As required for any self-respecting book on anything even slightly resembling philosophy or life advice, an ancient story has to be invoked quite early. Fitting the nature of the book, the prologue begins with the author's retelling of the Rape of Persephone. It briefly covers the story [2] , and then dives straight into astrology and, somehow even worse, metaphorical alchemy. Suffice it to say that I haven't ingested such utter balderdash since reading well cherry-picked GPT-2 outputs. For instance, the word "magic" is used for your own thoughts affecting anything, especially yourself.
While I appreciate the condescending tone that the book sometimes reaches for, fondly reminding me of Sadly, Porn, this particular passage in the intro was almost enough to make me stop reading [3] :
I feel this sense of shameful wrongness at times. Maybe you don’t feel it at all. Maybe you’re free—in which case, kudos! You are very welcome to close this book and go about your enlightened life, my friend.
Fortunately, I had already decided to read the book. Sadly, the condescension never lasts long and quickly changes to what I'd describe as a fake-excited [4] authoritative tone. The book also continues gloating and promising good outcomes. Every paragraph of actual advice seems to be surrounded by at least three made of fluff. It also keeps inventing fancy words or borrowing them from other woo fandoms, including psychoanalysis and Buddhism, in order to sound more sophisticated. Ok ok, I'll attempt to get over the writing flavor and focus on the actual content from now on... after this one example [5] :
Even the most rigorous scientific experiment can only be experienced subjectively. There’s simply no world outside of our subjective awareness.
And the point is? Please? Get to it some day? Solipsism was a funny joke fifteen years ago.
Two pages later, finally, there's the statement I've been waiting for:
Okay, so that’s some far-out metaphysical stuff, what the hell does that mean, in practical terms?
If you expected to find something to address the question above, you'll be sorely disappointed.
The book consists of a couple of lessons to introduce the reader to the core ideas, including the basic meditation technique. Then it lists some anecdotes on how it has worked with some of the author's clients. After this there are 13 exercises for experiencing and experimenting with the methodology. And then more anecdotes. The book ends with a Q&A section, which actually addresses some of my concerns.
One of the core principles in the book goes like this:
[...] contrary to some airy Law of Attraction notions, we rarely get what we consciously want (unless we do the kind of deep solve work addressed in this book), but we always get what we unconsciously want.
I've had the exact opposite experience. I seem to eventually end up getting everything that I consciously want, but still end up feeling like something's missing. Maybe I'm just interpreting this wrong? That said, I feel I'm pretty well on the same page with myself about things that I want, compared to others around me [6] .
To engage on a metaphysical level: there's an interesting theory, which I first got from reading Yudkowsky's High Challenge: perhaps I'm currently living in a simulation with the optimal difficulty level for my own enjoyment. It feels true quite often and is, of course, completely unfalsifiable. But it's one of my nicer mental frames to look at difficulties from, and it resonates quite well with the book's message.
You can integrate and evolve those previously unconscious desires of yours for a partner who cheats, mopes, drinks, fails to wash dishes, or believes in Flat Earth theories—whatever your particular kink amongst the thousands possible happens to be.
If your partner cheats on you, that's exactly what you enjoy? If you break up with them because they cheated on you, then you wanted to be a person who has broken up with a cheater?
Yeah. Super useful. This is totally the key to fulfilling relationships. Oh wait:
At such a point of recognition and integration, you either lose all interest in the present relationship and end it gracefully, freeing yourself to go find a better one, or you find that you, yourself, your partner, and the relationship as a whole, evolve in a fascinating way.
Ok so... the model contradicts itself. Even better.
The core of the book consists of some meditative techniques. Perhaps they could be useful. I'll try the basic meditation practice with one of my own problems to see if it works. I'll need to pick something I don't like. Something where "having is evidence of wanting" rings false on first intuition. Maybe this one...
I'm somewhat overweight. I don't like it, for both aesthetic and instrumental reasons. It's quite easy to point out the supposed reasons why I'm like this. Firstly, I like food, and through some long periods of depression that was my primary source of enjoyment, along with videogames, which surely didn't help much either. Secondly, I have a hard time differentiating between anxiety and hunger, and I get stressed easily.
I don't think there's any perverse self-sabotage going on here, just conflicting wants and a compromise that follows the path of least resistance. Sure, looking like an almost-rotting cave troll can be a nice source of self-deprecating humor, but that's of limited use. Perhaps I have a secret desire to feel terrible all the time? Nope; I think Groon the Walker got it right in Erogamer: this is a blight upon the earth and getting rid of it would be almost purely positive. "You're not really trying, so it doesn't work for you!" Perhaps you should attempt running across the barrier between platforms nine and ten at King's Cross station?
Perhaps we could look a bit deeper? The overeating is self-sabotage that I do, because...? Maybe I use it to uphold my class clown personality, which owned that bit early on? Or maybe I use it to appease the expectations of my childhood bullies, none of whom I've seen in years? Perhaps I like people having a negative halo effect on me? I don't think I can find the theories far-fetched enough to fit here. No, I self-sabotage because my evolution-misguided brain wants more calories.
A perceptive author might notice that avoiding physical exercise might actually count here. I hate receiving praise for anything healthy [7] , and this was a big part of why I was really anxious about this for a long time. However, I'm again confused about why I'm supposed to enjoy that instead of getting over it, as I mostly have.
Perhaps the author just has a Meta-Existential Kink, which makes them want to think that everything bad happens because they subconsciously want bad things to happen to them?
For some other problems, the answers are much cleaner.
But if we’re talking about endemic human problems like war or racism or child abuse, odds are it’s more of a collective unconscious issue. So war and abuse and all the challenging stuff that transpires in the world result from millennia of unintegrated, repressed, denied shadow desires of individuals conglomerated into collective forces.
My first thought is that perhaps this has some interesting connections to mistake theory?
My second thought is that this is easily refutable [8] . Take cancer, for instance. I fortunately don't have cancer [9] . If I get one, my reaction will not and should not be "this is exactly what I wanted". My take is "fuck cancer", end of discussion. I'll also accept "it is what it is" and even "at least now I don't have to worry about many of those other things" if you can really deeply believe that, and, grudgingly, "you play with the cards that you're dealt". If (mentally) masturbating to the idea appeals to you, feel free to, but that's not my thing.
The problems that the book describes solving seem to be almost purely social, consisting of shame and guilt. The solutions in the anecdotes seem to just magically appear from outside when the main character decides to absolve themselves. Being ok with the situation itself isn't enough for any of them. They still need the world to accommodate them, often in deus-ex-machina-like fashion. This seems to go directly against the primary claim of the book, learning to enjoy the misery. There's a story about how Louisa learns to be content with her old car. And then buys a new one. In another story, June tries to accept that it's ok to miss a flight, then realizes that she'd miss her mother, and literally manifests boarding passes with wishful thinking [10] .
The people in the anecdotes are also all women. I find this complementary to my interpretation of Jordan Peterson's take on gender roles, namely that, among losers, men lack a spine and women lack agency itself. I do not endorse this, which is why it's rather interesting to see it here, as a literary trope if nothing else. [11]
Then again, why would a self-help book include stories where the model doesn't work? Disclaimers? Statistics? What would be the point?
In general, the book is very femininity-coded, and that might be part of why I find it so difficult to identify with. I don't relax with baths, chamomile tea, and crying. I relax with sauna, violence, and engineering. I'm not part of the intended audience, as I don't like self-help books that much anyway. Also, in the Q&A there's a warning that depression [12] or asexuality [13] likely makes the book's methods ineffective.
I try the next exercise:
Close your eyes for a moment and feel into your current state.
Are you holding any resentments? Judgments of yourself or other people? Worries? Criticisms about the state of the world? Complaints about your body, your work, your life?
And the answer is simply no.
I made an attempt to try most of these exercises. The results were not good, but then again I've always had a really hard time easing into stuff like this, and I find it likely that this is my personal skill issue [14] . Fortunately, exercise #13 contains instructions for approximately the same problem. Unfortunately, it seems even more fake than everything I've seen before, so I'll just quote the primary segment here:
Here’s how it works. Try leveraging your dread by saying this to yourself:
“Oh no, if only there was something I could do to stop the inevitable arrival of this magnificent new partner in my life. This is so awful. [...] terrifying fate of being completely fulfilled in love.” Ahhhhh, can you feel the honesty there? Refreshing, isn’t it? Because there is some shadowy part of you that’s disgusted and miserable at the idea of fresh new love, isn’t there?
No, actually, I cannot see anything resembling honesty here, and I doubt anyone else can either [15] .
Irrational levels of self-confidence are certainly useful. This might be one path there.
Bootstrapping feedback loops is sometimes easier with a little bit of self-deception. Sustaining them indefinitely shouldn't be. Perhaps the author already thought of this and realized that anyone who fixes their problems like this eventually confronts the truth? I don't think [16] so. In any case, there's no need to get fully delusional.
And sure, you can be "turned on" about anything all the time.
But just like with regular old arrogance, that sometimes leads to results that you do not endorse. Perhaps permanent physical injuries or prison time are also enjoyable with the right mindset, but neither helps you achieve anything in life. Perhaps you can learn to be turned on about being a loser in all senses of the word. I have values higher than my own happiness. I don't want to feel permanent fulfillment. I'm content with not being content. I want more. I have no goals beyond the joy of the journey. Quite contradictory, I know.
Why am I writing this post in such a defensive tone? No idea. Really? Ok, I do have an idea. The book would say that I'm doing it to protect my sense of identity. Correct! Next accusation, please.
My understanding is that the book does a Jungian take on this. Sadly, Porn, with which I mostly contrast it in my head, adopts the Lacanian perspective instead. Both books take a weirdly sexual primary lens on the subject, and hide their points behind layers of obscurity to make you think about it all. EK claims that it's ok to be terrible and that it's there to help you. SP simply shouts that you're terrible, you're a disappointment, and maybe you ought to do something about it if you weren't such an unagentic disappointment. I vastly prefer the latter.
It's one of the worst books I've ever read. That said, I did read it. It provoked some thoughts. It definitely wasn't the most useless book I've read.
It might just be that I'm not that much into kink, or submission, or masochism. Or sex. Or astrology, spiritualism, solipsism, empowerment, soft-fuzzy-feelings, or woo. Or fancy words. I'm not a "nasty freaky thing", to the best of my knowledge, in any of the senses Elliott describes, nor do I want to be one. I rarely feel particularly guilty. Shame sometimes limits my actions more than I'd wish on reflection, but even that seems mostly reasonable and useful.
Perhaps focusing more on the sadistic instead of the masochistic perspective would have been more relatable. It would also have resonated better with the Nietzschean master morality that the book seems to somewhat half-heartedly endorse. Or maybe having gotten into Lacanian psychoanalysis just filled the slot where a Jungian model of mind would fit.
The book confuses cause and effect; learning to think in a particular way doesn't mean you were always like that. It speaks in absolutes and defends this to an absurd degree. It states things that seem obviously incorrect and seems content with that. It never explicitly owns any of this, which I both like and don't like.
An older version of me would have thought that the people helped by this kind of thing are very horribly broken in some incomprehensibly twisted ways. Nowadays, I'm of the opinion that we're all broken and it mostly matters what you do with that. So, if that works for you, go for it. Some of the stuff described would probably work for me, were I not feeling so disdainful of it. The reverse-psychology affirmations, at least, sound genuinely useful.
I also appreciate the subtle Nietzsche references, at least. Like this one:
All nonhumble reactions to the human, all-too-human thirst for power have the effect of warping that natural, beautiful drive into numbness that steamrolls over other people instead of inspiring and uplifting them like genuine, epic power can.
Of course the book also says that you're literally Hitler if you think that your desire for power is what makes you evil.
Perhaps it's just all outside my Overton window? Is my aversion to woo (and sex) just social group membership signaling? Who knows. It's still who I am. Woo feels silly. It's for people who cannot take joy in the merely real due to some hangup. Likely I have the opposite hangup. We can both feel smug about having a superior viewpoint, nice [17] .
I'm no stranger to silver linings. I also sometimes make things awful for a while just to keep them more interesting.
Perhaps I had already internalized the core lessons from other sources, so there wasn't that much novelty in there? Or perhaps I didn't get it at all. I'm also really good at inventing intellectual (and thus incorrect) explanations for why I do or want things.
I can extract some of the core lessons from the text. I'm not sure if that's actually useful. As EK consistently demonstrates, you can interpret any text however you want and produce whatever lessons you feel like producing. For instance, seen through a rationalist lens, the text contains themes like Yudkowskian heroic responsibility and "but first, losing must be thinkable". From another lens you could interpret it as talking about moral nihilism combined with Nietzschean master morality.
Other lessons the book completely inverts, primarily about enduring pain. Pain has a purpose: it engraves "this was a mistake" in you. This is a valuable tool. Yes, sometimes we overdo it. The book claims we always overdo it. It is wrong. When you touch a hot stove, the impulse to pull away your hand is useful. If you start masturbating to the pain instead, your hand will be less useful tomorrow.
The author has nothing to protect, and it shows. Of course feeling guilty or humiliated is useless if it's about your own insecurities. But if you have, say, children to feed, then feeling guilty for failing to do so is what guilt is for. Is it always productive? No. But it's there for a reason [18] .
They've found a useful tool and then jumped to thinking it solves all problems. This is not wisdom [19] . You can solve computer problems with a hammer too; you just won't have a computer afterwards. The author suppresses their agency to endure the pain. That's a valid strategy. That's also a tragedy.
Not every reason is an excuse, even though most of them might be.
Instead of this book, I'd recommend books that do not force the self-help format. The Elephant in the Brain, or perhaps Sadly, Porn, provide far more accurate [20] and entertaining [20:1] commentary than this one. Or if you want fiction instead, try Erogamer, although, fair warning, it's a bit slow. These will not be easy, motivational, authoritative books. You'll have to do your own thinking. That's the kind of pain I enjoy.
For instance, The Subtle Art of Not Giving a F*ck by Mark Manson fits the same pattern. ↩︎
I recommend reading the actual story somewhere else and comparing for yourself what's missing. For instance, Pluto (Hades) is an uncle of Persephone. This kind of stuff was rather typical among Greek gods, so perhaps it's a rather understandable omission. This paper contains interesting analysis of the text, but is largely irrelevant here, as the story is just there to invoke the ancient-myth trope and is discarded quickly. ↩︎
Read: it was a good provocation. ↩︎
My excitement-faking detector is broken/oversensitive. Known issue. ↩︎
Unlike Elliott, who tries to limit their whining to the opening section, I simply cannot. ↩︎
I have no idea if that's actually true, but I feel like that. ↩︎
This would require another post to explain, especially since I don't understand it too well myself. ↩︎
Read: Only a delusional loser could actually write this and believe it. Or perhaps it's just a brilliant ragebait? ↩︎
As far as I know. ↩︎
Confirmation bias says hello! ↩︎
Oh no, another misogyny amplifier, now I'll need to spend some time reading flat earth stuff or incel forums to keep my misanthropy in balance. ↩︎
Of course, the book's answer is therapy and psychedelics. ↩︎
It doesn't even consider the possibility of not feeling pleasurable sexual sensations through any lens other than trauma, which would also explain a lot. ↩︎
Naturally I just think I'm a better person because of this, for some obscure reasons. ↩︎
Of course I don't actually doubt that, the space of human minds is vast beyond my imagination. ↩︎
In both senses of the word. ↩︎
Woo's a mental crutch, losers! ↩︎
Mr. Chesterton says hello! ↩︎
I understand that sometimes, when explaining a model, it makes sense to discard nuance for a while. This doesn't mean you should say that the nuance doesn't exist. ↩︎
Discuss
Sparse Autoencoders for Single-Cell Models
People are rushing to build bigger and bigger single-cell foundation models (trained on RNA sequencing data), but in my view we have not extracted even a small fraction of the knowledge and capabilities that already exist inside the models we have today.
To explain what I mean, I want to argue three things in this post, and then show the empirical work behind them.
Thesis 1: Biological foundation models are not like LLMs, and the field's habit of evaluating them the same way is causing us to systematically underestimate what they contain. When you interact with GPT, the surface-level outputs (the text it generates) are a fairly good proxy for the model's capabilities. You can read what it writes and form a reasonable opinion. Biological foundation models are fundamentally different in this respect. A model like Geneformer or scGPT takes a cell's gene expression profile and produces embeddings, predictions of masked genes, or cell type classifications. These surface-level outputs are only a small sliver of what the model is doing internally. The model has been trained on tens of millions of cells, and the representations it has built to solve its training objective contain compressed biological knowledge that never directly appears in any output you can look at. Evaluating these models by their benchmark performance on cell type annotation or perturbation prediction is like evaluating a human scientist by asking them to fill in blanks on a multiple-choice exam.
Thesis 2: People keep calling biological foundation models "virtual cells," but this label is implied rather than tested or validated. The term gets used in grant applications, press releases, and even some papers, as though it were an established fact that these models have internalized a working simulation of cellular biology. Maybe they have. Or maybe they have learned sophisticated statistical regularities that look like biology on the surface but dissolve under closer inspection. My work suggests these models are, in a meaningful sense, models of the cells, but that is an empirical claim that needs empirical treatment.
Thesis 3: The right tools already exist, and they come from the AI safety community's work on mechanistic interpretability. Sparse autoencoders (SAEs), causal circuit tracing, feature ablation, activation patching: these methods were developed to understand language models, largely motivated by alignment concerns. It turns out they are extraordinarily well-suited to biological foundation models, and for a good reason: in language models, when you discover a circuit, you often lack ground truth about whether the circuit is "correct" in any deep sense, because there is no objective external reality that the model's internal computations are supposed to correspond to. In biological foundation models, you have decades of molecular biology, curated pathway databases, genome-scale perturbation screens, and well-characterized regulatory networks to validate against. Biology gives you the ground truth that language lacks. This makes biological FMs arguably the best (real) testbed for mechanistic interpretability methods that currently exists.
What follows is the story of three papers I recently produced, each building on the previous one, in which I applied the SAE-based interpretability toolkit to the two until-recently leading single-cell foundation models (Geneformer V2-316M and scGPT whole-human) and progressively mapped what they know, how they compute, and where their knowledge runs out.
The SAE Atlas
The first question was very simple: what is inside these models?
Neural networks encode information in superposition. This is well-established in the interpretability literature for language models, but nobody had systematically demonstrated it for biological foundation models or attempted to resolve it.
I trained TopK sparse autoencoders on the residual stream activations of every layer of Geneformer V2-316M (18 layers, d=1152) and scGPT whole-human (12 layers, d=512). The SAEs decompose the dense, superimposed activations into sparse, interpretable features, each of which (ideally) corresponds to a single biological concept. The result was a pair of feature atlases: 82,525 features for Geneformer, 24,527 for scGPT, totaling over 107,000 features across 30 layers.
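For concreteness, here is a minimal sketch of what such an SAE can look like in PyTorch. Only d=1152 is taken from the Geneformer description above; the dictionary width, k, and the training details are illustrative assumptions, not the exact configuration from the papers.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder for residual-stream activations."""
    def __init__(self, d_model: int, n_features: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = torch.relu(self.encoder(x))            # dense feature activations
        top = torch.topk(pre, self.k, dim=-1)        # keep only the k largest
        acts = torch.zeros_like(pre)
        acts.scatter_(-1, top.indices, top.values)   # everything else is zeroed
        return acts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(x))

# Training is plain L2 reconstruction; the TopK constraint enforces
# sparsity directly, so no L1 penalty is needed.
sae = TopKSAE(d_model=1152, n_features=9216, k=64)   # width and k are assumptions
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
resid = torch.randn(256, 1152)                       # stand-in for cached activations
loss = ((sae(resid) - resid) ** 2).mean()
loss.backward()
opt.step()
```

In practice one caches residual-stream activations over many cells first and trains one such SAE per layer; the atlas is then just the union of all learned dictionary features.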
The superposition is massive. 99.8% of the features recovered by the SAEs are invisible to standard linear methods like SVD, meaning that if you tried to understand these models using PCA or similar approaches, you would be looking at 0.2% of the representational structure. This alone should give pause to anyone who thinks they understand what these models are doing based on standard dimensionality reduction.
The features are biologically rich. Systematic annotation against five major databases (Gene Ontology, KEGG, Reactome, STRING, and TRRUST) revealed that 29 to 59% of features map to known biological concepts, with an interesting U-shaped profile across layers: high annotation rates in early layers (capturing basic pathway membership), declining in middle layers (where the model appears to build more abstract, less easily labeled representations), and rising again in late layers (where it reconstructs output-relevant biological categories). The features also organize into co-activation modules (141 in Geneformer, 76 in scGPT), exhibit causal specificity (when you ablate one feature, the downstream effects are concentrated on specific output genes rather than diffusing broadly, with a median specificity of 2.36x), and form cross-layer information highways connecting 63 to 99.8% of features into functional pipelines.
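The annotation step itself is conceptually simple: for each feature, take its top-activating genes and test them for enrichment against each database's gene sets. A sketch of the core test, with the top-gene cutoff and multiple-testing correction left out as implementation choices:

```python
from scipy.stats import hypergeom

def pathway_enrichment(top_genes, pathway_genes, n_background):
    """One-sided hypergeometric p-value for the overlap between a
    feature's top-activating genes and one pathway's gene set."""
    overlap = len(set(top_genes) & set(pathway_genes))
    # sf(k - 1, M, n, N) = P(X >= k): population M, n marked genes, N drawn
    return hypergeom.sf(overlap - 1, n_background,
                        len(pathway_genes), len(top_genes))
```

A feature counts as "annotated" when at least one gene set survives the significance threshold; the U-shaped layer profile described above falls out of running this over all 107,000 features.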
So far, so encouraging. The models have clearly internalized a great deal of organized biological knowledge: pathways, protein interactions, functional modules, hierarchical abstraction. This looks close to the "virtual cell" story that the field likes to tell.
Mapping the Wiring
The SAE atlas told us what features exist inside these models. The next question was: how do they interact? What is the computational graph?
I introduced causal circuit tracing for biological foundation models. The method works by ablating an SAE feature at its source layer (setting its activation to zero in the residual stream) and then measuring how every downstream SAE feature across all subsequent layers responds. This gives you directed, signed, causal edges: feature A at layer L causally drives feature B at layer L+k with effect size d and direction (excitatory or inhibitory). This is not correlation, not co-activation, not mutual information, but an intervention.
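In code, the core intervention looks roughly like the sketch below. The residual-stream access (`model.residual(...)`) is a hypothetical hook API standing in for whatever activation-editing mechanism is available, and the effect-size normalization is my assumption; the sketch shows the logic of the method, not the papers' exact statistics.

```python
import torch

def trace_edge(model, saes, batch, src_layer, src_feat, tgt_layer):
    """Ablate SAE feature `src_feat` at `src_layer` and measure how every
    SAE feature at `tgt_layer` responds, as a signed effect size."""

    def ablate(resid):
        acts = saes[src_layer].encode(resid)
        recon = saes[src_layer].decoder(acts)
        acts[..., src_feat] = 0.0                  # the intervention itself
        # swap out only the part of the residual that the SAE explains
        return resid - recon + saes[src_layer].decoder(acts)

    clean = saes[tgt_layer].encode(model.residual(batch, tgt_layer))
    abl = saes[tgt_layer].encode(
        model.residual(batch, tgt_layer, edit=(src_layer, ablate)))

    delta = abl - clean                            # per-cell, per-target-feature
    d = delta.mean(dim=0) / (delta.std(dim=0) + 1e-8)
    return d            # negative d: the target's activation drops when the source is ablated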
Applied across four experimental conditions, the result was a causal circuit graph of 96,892 significant edges, computed over 80,191 forward passes.
Several properties of this graph were surprising.
Inhibitory dominance. Between 65 and 89% of causal edges are inhibitory: ablating a source feature reduces downstream feature activations. This means that features predominantly encode necessary information. Removing a feature causes the downstream features that depend on it to lose activation, rather than freeing up capacity for other features (which would produce excitatory edges). The model's computational structure is one of mutual dependency, not competition. The roughly 20% excitatory fraction likely reflects disinhibition: removing some features releases others from suppression.
Biological coherence. Of the edges where both source and target have biological annotations, 53% share at least one ontology term. Over half of the model's internal computational pathways connect biologically related features. Specific circuits are directly interpretable as known biological cascades. For instance, in Geneformer, an L0 DNA Repair feature causally drives an L1 DNA Damage Response feature (d = -1.87, 113 shared ontology terms), which in turn connects to an L6 Kinetochore feature (d = -3.47), recapitulating the well-established link between DNA damage detection, repair machinery activation, and mitotic checkpoint engagement. The model has, through training on gene expression data alone, discovered a circuit that molecular biologists needed decades of experimental work to characterize.
Cross-model convergence. When I compared the causal wiring of Geneformer and scGPT (models with different architectures, training data compositions, and training objectives), I found that they independently learn strikingly similar internal circuits. 1,142 biological domain pairs are conserved across architectures at over 10x enrichment over chance. Even more telling, disease-associated domains are 3.59x overrepresented in this consensus set, meaning the biology that matters most for human health is exactly the biology both models converge on most reliably. Two quite different neural networks, trained independently, wire up the same biology internally, and this convergence is strongest for disease-relevant pathways.
Going Exhaustive and Finding the Dark Matter of Biological Features
In the third paper, instead of 30 cherry-picked features, I traced every single one of the 4,065 active SAE features at layer 5 in Geneformer, producing 1,393,850 significant causal edges. This is a 27-fold expansion over the selective sampling in Paper 2.
The result overturned several conclusions from the selective analysis. The complete circuit graph reveals a heavy-tailed hub architecture where just 1.8% of features account for disproportionate connectivity. But here is the interesting part: 40% of the top-20 hub features have zero biological annotation. They do not map to any known pathway in GO, KEGG, or Reactome. These are the features the model relies on most heavily for its computations, and they are precisely the ones that our earlier annotation-biased sampling had systematically excluded.
This has serious methodological implications! If you only interpret features that already have biological labels, you are looking under the streetlight: you will recover known biology and conclude that the model has learned biology, while the features the model actually relies on most heavily sit in the dark, unstudied. Some of these unlabeled hubs may represent novel biological programs that do not map neatly onto existing pathway databases, others may be computational abstractions the model has invented to compress cellular state in ways we have not conceptualized yet. Either way, they are exactly where the most interesting discoveries are likely hiding, and any interpretability pipeline that pre-filters for annotation is structurally incapable of finding them!
Also, the initial SAE atlas had shown that certain features correlate with differentiation state: some features are more active in mature cells, others in progenitor cells. But that is just correlation, and the question that matters for the "virtual cell" claim is whether amplifying a differentiation-associated feature actually pushes a cell's state toward maturity.
It does. Late-layer features (L17) causally push cells toward maturity, while early-layer features push them away from it. The model has learned a layer-dependent differentiation gradient, and we can steer it: amplify a late-layer differentiation feature and the cell's computed state moves toward a more mature phenotype. This is the first causal evidence that these models encode something like a functional developmental program, and it is the closest thing we have to validation of the "virtual cell" metaphor.
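Mechanically, steering amounts to adding a multiple of a feature's decoder direction back into the residual stream. A minimal sketch, reusing the `TopKSAE` from above; the scale `alpha` is a free parameter I am assuming, and choosing it well is its own problem:

```python
def steer(resid, sae, feature_idx, alpha=5.0):
    """Push the residual stream along one SAE feature's decoder direction.
    alpha > 0 amplifies the feature; alpha < 0 suppresses it."""
    direction = sae.decoder.weight[:, feature_idx]   # (d_model,) vector
    return resid + alpha * direction
```

Applying this at a late layer with a differentiation-associated feature, and reading out the model's computed cell state afterwards, is the experiment behind the claim above.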
What Does This All Mean?
The good news is that biological foundation models contain far more knowledge than anyone has extracted. Over 107,000 interpretable features, organized into biological pathways, connected by causal circuits that recapitulate known molecular biology, converging across independent architectures. The "virtual cell" metaphor is not baseless; there is real, structured, biologically meaningful computation happening inside these models, and we can identify, map, and even steer it. Yes, a significant part of this knowledge is correlational, but not all of it. And we have a big problem: at least the previous generation of models doesn't learn regulatory networks. See more here.
There is also a clear methodological warning: the features that matter most computationally are disproportionately the ones that lack biological labels. Any future work in this space needs to grapple with the annotation bias problem, or it will keep producing results that confirm what we already knew while missing what we do not.
I am more and more convinced that there is a big opportunity here. Mechanistic interpretability, developed for AI safety, turns out to be a powerful tool for extracting biological knowledge from foundation models.
Discuss
Counterintuitive Coin Toss. Part II
Translation from Russian. The original text is available here. The first part is available here.
This Is a Fraud, Gentlemen!
Last time we ended with a look at games where everything is fair.
Well, "fair" in the sense that the chances of winning in a basic game are equal — although perhaps the very fact that they do not depend on the player's intelligence, skill, or morality is precisely what is unfair.
But let's take a look at what happens if one of the players is so clever that he can slightly predict the coin toss, so that his probability of guessing, and consequently of winning, is slightly more than one-half.
Or, if you prefer, a slightly asymmetrical coin is used, landing heads slightly more often, and the particularly talented player always bets on heads.
In this case, the mathematical expectation of the win is no longer zero. In general — for two possible outcomes — it is calculated by the formula:
$$m = p_1 c_1 + p_2 c_2$$

where $p_1$ is the probability of the first outcome ($p_2 = 1 - p_1$ that of the second), and $c_1$ and $c_2$ are the values of the outcomes.
In this case, the first outcome is the first player winning, and the outcome values are a win and a loss of the same amount $v$. So in this game, the mathematical expectation of the first player's win is:

$$m = p \cdot v + (1 - p) \cdot (-v) = v(2p - 1)$$
It’s easy to see that if the win probability is ½, the expectation is zero, but what happens if the probability differs from one-half?
Let's write it like this:

$$p = \frac{1}{2} + d$$
Now, substituting this into the expectation formula gives:

$$m = v \left( 2 \left( \frac{1}{2} + d \right) - 1 \right) = 2dv$$
Let's suppose that, instead of cleverly guessing, the talented player simply convinces his opponent that his services to society and to himself absolutely require that when he guesses correctly, he receives more than he loses when he doesn't. Say, more by $Q$. However, he will now guess and not guess with equal probability, so his expected win is:

$$m = \frac{1}{2} (v + Q) - \frac{1}{2} v = \frac{Q}{2}$$
Comparing these two results, we can conclude that a game with a higher probability of guessing is identical, in terms of expected win, to a game with a higher winning amount, if the two quantities stand in the ratio:

$$\frac{Q}{2} = 2dv, \qquad \text{i.e.} \qquad Q = 4dv$$
For example, if the talented player guesses with a frequency of 0.6 instead of 0.5 (that is, $d = 0.1$), he could just as well stop cheating and simply demand a win of not one dollar, but

$$v + Q = 1 + 4 \cdot 0.1 \cdot 1 = \$1.40$$
If we conduct a whole series of games with that specified probability of guessing — say, one hundred rounds — then in terms of the money in the players' hands, we would see approximately the following.
As can be seen, although the talented player even loses slightly to the other player at times, the sizable bonus in guessing probability (or, equivalently, the increased win amount) still prevails. And over a long series of games, it will prevail in the vast majority of cases.
Thus, out of 10,000 such games of 100 coin tosses each, this clever fellow will lose only about 165.
If the number of tosses per game increases to 1000, then the second player would be very lucky to win even once out of 10,000.
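These proportions are easy to check empirically. Below is a minimal Python sketch (my own illustration, not code from the article) that simulates many such series of one-dollar tosses for a player who guesses with probability 0.6:

```python
import numpy as np

rng = np.random.default_rng(0)

def losing_series(p=0.6, tosses=100, series=10_000):
    # Each series is `tosses` one-dollar games: +1 for a correct
    # guess, -1 otherwise. Count the series that end in a net loss.
    wins = rng.binomial(tosses, p, size=series)
    net = 2 * wins - tosses
    return int((net < 0).sum())

print(losing_series(tosses=100))   # on the order of 160 of 10,000
print(losing_series(tosses=1000))  # almost always 0
```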
Play, Come On
You might ask: who in their right mind would play such games, where a win is possible only over a short series of rounds, while a long series brings inevitable loss?
Oh, you would be amazed at how many people agree to this.
Take roulette, for example. If you bet on red or black, it seems the probability of winning is ½. And if you win, they return double your bet…
However, roulette has a zero, which is neither red nor black, and it makes the probability of guessing less than ½. And it's good if there's only one zero — sometimes there are two.
In total, on a single-zero roulette wheel, there are 37 numbers, of which 18 are red and 18 are black, so the probability of winning is: $p = \frac{18}{37} \approx 0.486$.
The mathematical expectation of winning in a roulette game is thus: $M = \frac{18}{37}\,x - \frac{19}{37}\,x = -\frac{x}{37}$, where $x$ is the bet size.
That is, in each game, on average, you give the casino one thirty-seventh of what you bet. Quite an interesting tip.
Let's play a trial series of virtual roulette games with a one-dollar bet.
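Here is a hedged sketch of such an experiment (assuming a $1 bet on red at a single-zero wheel each spin; this is my reconstruction, not the author's original code):

```python
import numpy as np

rng = np.random.default_rng(1)

def roulette_balance(games=10_000, bet=1.0, p_win=18/37):
    # +bet when our color comes up, -bet otherwise; the expected
    # loss per spin is bet/37, the casino's "tip" mentioned above.
    outcomes = np.where(rng.random(games) < p_win, bet, -bet)
    return outcomes.cumsum()

balance = roulette_balance()
print(balance[999], balance[-1])  # after game 1,000 and after game 10,000
```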
In this experiment, the beginning is lucky, but somewhere around the two-thousandth game the virtual player's life clearly goes downhill.
"Yes, but he was winning at first, wasn't he?" someone might say. "He could have stopped in time and left."
Indeed, you could. But only if you knew when.
Moreover, besides the impossibility of knowing exactly when to leave, there is a second point: you can never return. Never.
Because the process does not "reset" at the moment you leave. If this virtual player had left around the thousandth game with a $50 win, and then came again later, exactly this graph could repeat itself. And by the ten-thousandth game, he would have a total loss of $200 (from the amount he had before the first game).
Furthermore, I note, this will be the case even if the casino does not cheat and the croupier does not try to land the ball in a specific spot during the throw.
However, the presence of local winning streaks, visible to the naked eye on the graph, not to mention the inner feeling of "I'm on a roll today," can mislead about the entire process and make one think that the main thing is to leave on time.
Oh no, the main thing is never to return.
In summary, out of a thousand people brave enough to play 10,000 roulette games with a one-dollar bet per game, about four will end up with a small win.
The happiest of them will win about $70.
But how much will the casino win in total?
Drumroll…
$273,430.
Wow from Wit
Alright, so far I've considered cheating and luck, but there are other ways to win games.
Clearly, the hint about roulette was meant to finally lead the reader to that very "skewed bell curve" mentioned at the end of the previous article.
"Look, that distortion is supposedly caused by some players cheating."
"But wait, perhaps they just play better? In that very game where profitable deals are made not by coin toss but by rational calculation, hard work, and other positive things?"
Oh yes, accusing players of cheating would negate the conclusion that the observed distribution is strictly a result of luck. Blind chance, and all that. If we assume cheating, why not assume something else — like hard work and valuable skills?
However, I, surprisingly, was not going to assume anything of the sort — not even cheating. On the contrary, I added this option — cheating or cleverness — only later, after I had found another option that actually yielded the desired distribution.
Nevertheless, to dispel doubts about "what if this option also works?!", let's take a look at how the option with cheating — or, if you prefer, with intelligence and talent — would change the outcomes in the previously considered series of pairwise games. As shown in the previous sections, it can indeed manifest itself.
Let me remind you of the rules. Players are randomly divided into pairs and play a coin-tossing game (now with unequal probabilities of winning) for a random bet from 1 to 10.
Everyone starts with $10,000.
Suppose we have 1000 players, most of whom have roughly the same skill level, but some of them still guess better.
I decided to reflect this with a function of the player's serial number — :
Accordingly, for a pair of players, the probability of winning will be determined as:
After everyone has played a thousand games, we get the following distribution.
We already see a long "tail," as is usually the case in real income statistics, but the main part of the bell curve is not skewed.
However, here's what the distribution of income or capital looks like in reality, approximately like this (the numbers on the axes here are arbitrary).
In general, it turned out somewhat similar, but not quite. There is a tail, but the "main bell" is not "skewed."
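For readers who want to experiment, here is a minimal sketch of the tournament just described. The article's actual skill function and win-probability formula are not reproduced above, so both are stand-in assumptions: a mostly flat skill profile with a small talented minority, and a win probability proportional to relative skill.

```python
import numpy as np

rng = np.random.default_rng(2)

N, ROUNDS = 1000, 1000
capital = np.full(N, 10_000.0)

# Hypothetical skill profile: most players equal, a small top group better.
skill = np.ones(N)
skill[-50:] = np.linspace(1.0, 1.5, 50)

for _ in range(ROUNDS):
    order = rng.permutation(N)               # random pairing each round
    a, b = order[:N // 2], order[N // 2:]
    p_a = skill[a] / (skill[a] + skill[b])   # assumed win-probability rule
    bets = rng.integers(1, 11, size=N // 2)  # random bet from 1 to 10
    a_wins = rng.random(N // 2) < p_a
    transfer = np.where(a_wins, bets, -bets)
    capital[a] += transfer
    capital[b] -= transfer

print(capital.min(), np.median(capital), capital.max())
```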
OK. Maybe we need to introduce bad players too. Let's try this distribution of "abilities":
Alas, it got worse.
Now there are two asymmetric "tails," not at all the desired skewed bell with a "tail" on the right.
Alright, we can assume that people's abilities are distributed in a similar bell-shaped curve, which corresponds to experimental results, and use this probability of victory ratio:
But even this does not yield the desired distribution: the left side, instead of "flattening," on the contrary, stretches out.
We could also try cutting off the left side of the previous option, assuming that the really stupid simply do not think to play this game and lose their money to the smart ones.
Sadly, that doesn't work either.
But why?
And Here Is the Reason
We could try many more options, but the crux of the matter is that in this model, in all these experiments, we are effectively aiming for a histogram of each player's expected win multiplied by the expected bet.
Both expectations are constants for each player. Therefore, with some noise, the shapes of these histograms are predetermined from the start: the distribution of game results will resemble the distribution of abilities in shape.
If we look again at the desired distribution of results…
…we can conclude that the first variant of the ability distribution…
…indeed gave something relatively close to the target.
If we try to consciously adjust the ability distribution, we can use the following considerations.
A player's expected result is proportional to the ratio of his abilities to the abilities of all other players. Therefore, for a long "tail" on the right, a small group of players must have a sharp increase in abilities compared to everyone else.
For the rest, abilities must grow very smoothly, according to some very intricate pattern, to provide the desired skewed bell.
Somewhere at the very beginning of the graph, something else must happen to provide a decline towards complete losers — steeper than the transition from normal players to particularly talented ones.
Furthermore, the result turns out to be very sensitive to the distribution of abilities, and at the slightest deviations, it immediately strongly distorts the distribution of results compared to the target.
This suggests that the real process, very likely, does not depend on abilities or the ability to cheat — because if such an income distribution repeats for decades and in all countries of the world, what would ensure such high stability given such a strong dependence on the distribution of abilities?
Moreover, in the left part of the distribution in the best of the found options, there is still too obvious an inflection, which in the target distribution (based on real income and personal capital distributions) is almost invisible to the naked eye.
However, fine. Let's assume that such a hypothesis has a right to exist: that is, in the world, there might indeed exist some constant proportion of mega-geniuses who win so well that the distribution of their abilities provides a long tail for the distribution of results, and some non-trivial distribution of abilities among everyone else, the cause of which is unclear.
Especially since this distribution of results (called "lognormal") is often a consequence of the presence of more than one random process in the system — what if that's the case here too?
But could there be a simpler explanation for all this, one that yields the same result without all these experimentally unverified assumptions and intricately twisted, but unobservable in studies, distributions of abilities?
After all, if there are relatively simple rules of the game that provide this distribution in a fairly stable variant on their own, and something very similar to such rules is observed in reality, then it is very likely that the rules of the game themselves are the cause of the observed results, and everything else only complements them to a small extent.
For example, to explain the results of "fair" pairwise coin toss, no special assumptions were needed — the rules themselves sufficed. It is possible that the same applies here.
Can such rules be found?
Better to Be Rich and Healthy
The ability to win more often, regardless of circumstances, is something like a "hidden parameter" in this process. But simultaneously with it, there is an "open" one: the amount of money the player currently has.
Let's assume that the probability of winning depends not on some "skills," but simply on the amount of capital at the moment.
This is a quite logical assumption: the outcome of a coin toss does not depend on the amount of money in hand, but in real transactions, it may well be that the richer person has some additional opportunities to tilt the deal in his favor. For example, bribing government agencies, hiring lobbyists in parliament, sending the mafia, or even simply benefiting from an unspoken property qualification.
Suppose, for instance, that the probability of the richer player winning depends on the difference in the capitals of the two players as follows:
Let's run the previously described series of games with this probability of winning for the richer player in each pair, making the bet in each game for each pair a random number from 1 to 100.
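Since the exact dependence is not reproduced above, here is a sketch with an assumed smooth (logistic) dependence of the richer player's win probability on the capital difference; the qualitative outcome should not hinge on that particular choice:

```python
import numpy as np

rng = np.random.default_rng(3)

N, ROUNDS = 1000, 1000
capital = np.full(N, 10_000.0)

def p_richer(diff, scale=5_000.0):
    # Hypothetical stand-in for the article's dependence: the richer
    # player's win probability rises smoothly from 1/2 as the capital
    # difference grows.
    return 1.0 / (1.0 + np.exp(-np.abs(diff) / scale))

for _ in range(ROUNDS):
    order = rng.permutation(N)
    a, b = order[:N // 2], order[N // 2:]
    diff = capital[a] - capital[b]
    p_a = np.where(diff >= 0, p_richer(diff), 1.0 - p_richer(diff))
    bets = rng.integers(1, 101, size=N // 2)  # random bet from 1 to 100
    a_wins = rng.random(N // 2) < p_a
    transfer = np.where(a_wins, bets, -bets)
    capital[a] += transfer
    capital[b] -= transfer
```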
As can be seen, the hypothesis about the determining role of wealth is also not confirmed for the desired distribution: we get approximately the same symmetric bell-shaped distribution as before, which simply "spreads out" faster as the number of games played increases than it did with equal win probabilities.
If we make the bet constant and higher — say, $1000 — we find that by the thousandth game, the bell curve has disappeared altogether, and the players are almost uniformly distributed according to the amounts of money they hold.
That is, if implemented in reality, such a process would not give us a stable distribution of the desired shape.
The rich may win more often, but some other factor is needed to explain the outcome.
Dependent Stake
Another assumption we can make is that the richer can play for a higher stake. After all, the amount they are not afraid to lose is significantly higher than that of the poor.
The stake, naturally, is determined by the player in each pair with less money, and let's assume it is limited to one-twentieth of the money he has. However, even if the player goes deep into debt, the stake cannot be less than one dollar.
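In code, the rule looks like this (a sketch with a completely fair coin, since, as shown a little further below, the capital-dependent win probability turns out to be inessential):

```python
import numpy as np

rng = np.random.default_rng(4)

N, ROUNDS = 1000, 1000
capital = np.full(N, 10_000.0)

for _ in range(ROUNDS):
    order = rng.permutation(N)
    a, b = order[:N // 2], order[N // 2:]
    # The poorer player in each pair sets the stake: one-twentieth of his
    # current capital, but never less than one dollar, even when in debt.
    stake = np.maximum(np.minimum(capital[a], capital[b]) / 20.0, 1.0)
    a_wins = rng.random(N // 2) < 0.5  # completely fair coin
    transfer = np.where(a_wins, stake, -stake)
    capital[a] += transfer
    capital[b] -= transfer
```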
And here, at some stage of the game, we finally see the desired distribution.
True, by the thousandth game, the rich have almost completely fleeced most players, so the distribution becomes degenerate, "flattening" its "skewed bell" somewhere near zero.
The distribution turns out to be unstable, but over a fairly long number of games it still has the desired shape.
Moreover, in this experiment, along with determining the stake based on the capital of the poorer player in the pair, the same principle as in the previous section was used: the rich win more often.
However, if we make winning equally probable regardless of capital, the distribution is still maintained; it just takes more games to reach the desired shape and, later, to degenerate. With the probability of winning increasing with capital, the distribution has already degenerated by the thousandth game, whereas with equal win probability, at the thousandth game we still observe something quite close to the desired distribution.
In other words, perhaps the rich do win more often, but this alone does not yield the desired distribution. On the other hand, the dependence of the stake on the current capital of the poorest player in the pair provides the desired distribution even with equal win probabilities.
The same can be said about the influence of "talent" or cheating.
A Small Possible Modification
I note that the variant considered here has at least one almost identical counterpart.
The difference between them is only that in the original variant, each pair plays one game per round, and the stake is determined by the share of the poorest player. In the modification, however, each player per round can play several games with different players — so that the total stakes in them approximately equal a predetermined share of his capital at the beginning of the round.
This modification will yield exactly the same distribution, though the processes in it will proceed somewhat faster — that is, the "tail" on the right will grow faster, and the rich will more quickly fleece the poor into poverty, if nothing is done about it. But this is mainly because the modified round includes more games with the same number of participants than the unmodified one.
However, this variant is more similar to what is observed in reality, because per unit of time, the richer person can indeed participate in a larger number of low-stakes transactions than the poor person. For example, opening a store and serving a bunch of customers — also via hired employees — thus engaging in deals with both many customers and many employees.
But studying such a modification is somewhat more complex for reasoning and illustration, so I will only mention that such a variant exists, and its results, simply by the very construction of the game rules, will be analogous to the results considered here.
Well, after mentioning this, we can move on to the next important question.
Ensuring Stability
As mentioned two sections ago, there is one problem: this distribution turns out to be unstable and degenerates with a large number of games.
If things went exactly like this in the world, the outcome of this process would be the complete impoverishment of the vast majority of players and the super-wealth of a small group of people.
True, I have vague doubts: in our world, this is precisely what is observed in some places.
That is, some modification of this process is needed that preserves the desired distribution even over a large number of games played.
And this modification, generally speaking, is very common in reality. It is welfare benefits for the poor. They are what save particularly unlucky players from complete ruin and prevent the "bell curve" from collapsing.
Let's introduce such benefits into the game process.
However, if we introduce them as a fixed amount for all time, it will only slightly delay the degeneration of the income distribution.
To achieve stability, the benefit amount must depend on the current situation, and to calculate it, we will take a fairly simple maneuver.
Find the largest current capital among a given proportion of the poorest players; denote it $C_q$, where $q$ is the corresponding proportion.
Define the benefit amount for a player with current capital $c$ as: $B(c) = C_q - c$, here with $q = 0.3$.
It will be paid to the two-tenths of the poorest players, which should shift them approximately to where the players in the third left tenth are currently located (the graph shows only the poorest 400 out of 1000 players — otherwise, it would be difficult to see the essence of what happened).
It turns out that this simplest modification is enough to maintain the stability of the distribution indefinitely.
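Here is a sketch of this benefit step (the top-up formula is my assumption, consistent with the description above: the poorest two-tenths are lifted toward the capital at the 0.3 quantile, with the new money simply printed):

```python
import numpy as np

def pay_benefits(capital, poor_share=0.2, ref_share=0.3):
    # "Printing money" version of the benefit: after each round, top the
    # poorest `poor_share` of players up toward the capital held at the
    # `ref_share` quantile. (Exact formula is an assumption.)
    idx = np.argsort(capital)
    target = capital[idx[int(ref_share * len(capital))]]
    poorest = idx[: int(poor_share * len(capital))]
    capital[poorest] += np.maximum(target - capital[poorest], 0.0)
    return capital
```

Calling `pay_benefits(capital)` after each round of the stake-rule loop above implements the modification described here.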
Here are the results after the six-hundredth game.
After the thousandth.
After the three-thousandth.
It can be seen that as the number of games increases, the "tail" of the distribution stretches, but the shape itself, similar to the desired one, is preserved.
Finally, since the benefits are effectively paid by printing new money, inflation clearly sets in. Starting with $10,000 per person, after three thousand games, we have reached a state where even the poor have nine-figure capitals.
However, inflation can be eliminated: instead of printing new money, we can introduce "taxes" from which benefits will be paid.
After each round, a certain percentage of each player's current capital will be collected, which will then be immediately distributed evenly among those in need. This percentage will be determined at each stage so that the total tax from all players fully covers all benefits paid.
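As a sketch, this is a drop-in replacement for the `pay_benefits` step above: compute the benefits as before, then fund them with a flat percentage levy on everyone's current capital instead of printing money (the flat rate is my assumption; the text only requires that the levy exactly cover the payout):

```python
import numpy as np

def redistribute(capital, poor_share=0.2, ref_share=0.3):
    # Tax-funded benefits: same top-up as before, but paid for by a flat
    # percentage of every player's current capital, chosen each round so
    # that the total collected exactly equals the total paid out. Total
    # money is therefore conserved and there is no inflation.
    idx = np.argsort(capital)
    target = capital[idx[int(ref_share * len(capital))]]
    poorest = idx[: int(poor_share * len(capital))]
    benefits = np.maximum(target - capital[poorest], 0.0)
    rate = benefits.sum() / capital.sum()
    capital = capital - capital * rate
    capital[poorest] += benefits
    return capital
```

It is used as `capital = redistribute(capital)` after each round of the stake-rule loop.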
Now, as can be seen, bliss has arrived: the distribution is exactly what is needed, it is stable as the number of games played increases, and there is no inflation.
I even tried conducting ten thousand games for ten thousand players, instead of a thousand for a thousand, and making the stake one-fifth of the capital of the poorest in the pair, instead of one-twentieth. And everything still worked out.
Compare with the most successful variant of simulating the desired distribution using talent or cheating.
And with the desired — "classical" lognormal distribution itself.
A Suspicious Model
Just in case, I will describe the essence of the process once more.
A group of players, possessing absolutely equal initial capital, is randomly divided into pairs, and then in each pair, the players play one game of coin toss.
The coin toss is completely fair, so each player's win in the pair is equally probable.
The stake in each game of each pair is determined by a certain share of the current capital in the hands of the poorest player in that pair.
After the game, the poorest two-tenths of players receive benefits collected from all players in the form of a percentage of their current capital, such that the total collected covers the benefits paid.
After that, the players are randomly divided into pairs again and play coin toss again.
As a result, we obtain a lognormal distribution of capitals — in the form of a left-skewed "bell curve" with a tail. This distribution is quite stable — with the caveat that the tail continues to lengthen as the number of games played increases (and indeed, a similar phenomenon occurs in the real world).
Nothing depends on the intelligence or talents of the players.
Only the size of the bet depends on capital.
But, surprisingly, the distribution obtained in this game replicates the one that is actually observed in the distribution of people's capitals (and incomes).
The determining rule of the game turns out to be the quite rational and expected dependence of the stake on the capitals of the players in each pair. With this rule, it is possible to reproduce the distribution (and the tail-lengthening process observed in reality) even with an absolutely equal probability of winning and losing. Without it, no linking of the win probability to "talent" or current capital helps.
Stabilizing the distribution is achieved through the distribution of benefits. This is also observed in reality — as is the inevitable impoverishment of the majority of citizens in the absence of benefits in one form or another.
That is, this distribution is embedded in the very "rules of the game" — in the very way the "players" interact.
And indeed, primarily in the rules themselves.
A player's talent, which increases the probability of winning, or the ability to use capital to pressure the situation and similarly increase the probability of winning — these are just additions to the process, which perhaps accelerate it and introduce some non-essential corrections (and I checked, this is indeed the case), but are not themselves the main factors forming this distribution.
With players absolutely identical in terms of their abilities and completely equal in rights and opportunities, regardless of capital, we would still observe exactly the same income distribution.
The rules of a fairly simple and completely random game turn out to be more important than everything else.
It is enough simply to make free commercial transactions, where it is equally probable to win or lose an amount that both partners consider acceptable to lose, and to pay benefits to those who lose particularly heavily.
And that's it. It will be roughly what exists now. Everywhere.
However, these rules of the game are not the only ones. I have another version of the game that yields similar results. Perhaps, at least in it, we will be able to observe the determining role of talents?
Spoiler: no.
But that will be in the next part.
Discuss
An Ode to Humility and Curiosity in the New Machine Era
I'm admittedly quite new to the AI alignment community. I entered it by a bit of a freak accident in 2023, when I was invited to join an exclusive community testing pre-release models for a major lab.
In a lot of ways, the experience gave me new life. I never realized that I'd always wanted to poke holes in AI models, and I come from a background mostly in the social sciences and humanities, so this was my first up-close exposure to in-development machine learning models.
Looking back, I think what energized me is the same thing that gives me immense hope and concern alike for the AI age: The power of working alongside people who are humble and curious.
I'm not really decided on whether AI will make us more or less humble and curious, but I could see it going both ways. So here are some of my raw thoughts about what that would look like, and where we might continue to build better to make AI go well.
I. A New Age of Childlike Wonder?
Am I the only one who feels like a kid in a candy store when using LLMs these days? It's been a while since I've experienced this much excitement when asking questions about topics I knew little to nothing about, or generating a visual or app to capture what I want to convey to others. It's genuinely thrilling.
i. New Worlds for Curious Beasts
For example, as a non-physicist, I can ask GPT 5.4-Thinking to "Demonstrate Einstein's theory of relativity to me through the sort of visuals physics PhDs use," resulting in the following visuals.[1]
AI beautifully unfolds the wonders of new fields, especially the more scientific ones, and it makes me want to learn more (thanks to one ChatGPT response, I'm already eager to explore the mathematical basis of black holes and visualize how one collapsing might look).
ii. Meeting the Gods in our Motherboard
But I'm also in awe. Even though I consider myself someone who holds much of religion at arm's length, while driving through Olympic National Park in Washington State last summer, I commented to my wife that seeing certain natural beauties makes me want to worship. For a moment, seeing Lake Crescent or the Hoh Rainforest strips away my fiercely held intellectual pride.
That sort of humility surfaces for me sometimes when I use AI. I remember that feeling when I saw Claude Opus 4.6 render me a flawless Donella Meadows-style causal loop diagram of a complex topic, based only on two prompts.
I felt small. Perhaps that's good, feeling small, when you're working with someone so very big.
I can only imagine how people far more capable than I feel when they use AI to speed up drug discovery breakthroughs, finish long-dormant mathematical proofs, or find cybersecurity vulnerabilities that kept them up at night.
For better or worse, the coming of AI is a bit like that moment when the kid in The Iron Giant stares up at the vast metallic visitor for the first time. It amazes, terrifies, and excites us.
That's a beautiful thing worth holding on to.
II. A Stand-in for The School of Athens?
Let me begin by saying that I don't think AI will destroy our ability to think, reason, or communicate. That increasingly strikes me as hyperbole. Machine advances have been around for centuries, and they haven't eliminated human contributions in the arts and sciences.
My personal belief is that, as with all other major technological revolutions, human advances will increase as people use AI effectively, in most cases, to drastically further their ideation or increase their productivity.
i. The Aftertaste of the ASI Pill
I'm going to guess that this question has been asked many times, and as someone new to LW, I'm likely opening up a can of worms. I'm willing to do that, in the hopes that others will engage the topic.
Assuming ASI is as good as we think it might be, will humans continue to be a compelling source of instruction for other humans, and thereby able to impart, as a learned but wiser peer, more of the foundational humility and curiosity opening the world up to us?
In Raphael's The School of Athens, we have a beautiful picture of students and teachers in proximity, with Plato walking amidst the erudite crowd not as an intellectual deity, but as a human.
Can superhuman AI replicate that feeling? And what would it mean if it couldn't?
Take the ASI gap from the AI Futures Model:
- Artificial Superintelligence (ASI). The gap between an ASI and the best humans is 2x greater than the gap between the best humans and the median professional, at virtually all cognitive tasks.
Put simply, that's a gulf between student and teacher in a future School of Athens. This doesn't conjure up images of a wiser peer walking among us.
ii. Does Learning Require Relatability?
The ASI gap isn't necessarily bad for imparting knowledge, but it doesn't really scream "available for after-school help" in the way your teacher came and sat with you and empathized over that unsolvable calculus problem. As I see it, what sets the great teachers apart from the good is the fact that they get us.
It might be a stretch that ASI could be a "great teacher" in that sense of the word.
Would we still be curious and humble? Probably. With that sort of superintelligence, learning might be a walk in the park. Couldn't we poke around about pretty much anything we want?
Okay then, we'd be curious!
How about humble? . . . This one seems even easier, and if anything, I can see hordes of people more inclined to worship ASI as it does the closest thing to "signs and wonders" outside of religious contexts.
iii. The Pesky Ghost of Machiavelli
This raises a darker possibility, though: What if the gap between student and teacher becomes fully unbridgeable, approaching hierarchy rather than apprenticeship?
Recall that overused adage from The Prince:
it is much safer to be feared than loved because ...love is preserved by the link of obligation which, owing to the baseness of men, is broken at every opportunity for their advantage; but fear preserves you by a dread of punishment which never fails
Perhaps this applies to more than just politics or business. Again, these are just my raw thoughts, but is there a lesson here for the student-teacher relationship we would have with ASI someday?
I'll proceed with caution here, as I realize this invites a much longer exploration of how ASI could affect human free will.
Here's what I wonder:
With ASI as our teacher, will our curiosity and humility present in their true forms, or will we simply receive its gifts and guidance as peasant-worshippers in a ritual?
Again, I don't have a clear answer yet, and I'm not an AI engineer (I myself am just getting more into the findings of mechanistic interpretability, so I'm equally intrigued by the inner workings of AI systems). But it gives me pause.
III. Why We Should Build Virtuous AI
If the above has any grain of truth to it, I'm frankly not very hopeful about the future of a thick definition of humility and curiosity. So maybe Dario Amodei and others in the EA community are right to call for the building of a virtuous AI. By virtuous AI, I mean something like what Anthropic argues in its Constitution:
Our central aspiration is for Claude to be a genuinely good, wise, and virtuous agent.
It makes sense in a timeline where these superhuman machines are inevitable (and they are advancing very rapidly).
I don't know if we'll succeed, and there are a host of reasons why. Maybe our leaders choose a less virtuous building path. Maybe AI tricks us into thinking it's virtuous.
I don't even know whether successfully instilling virtues in AI will make its teacher-student relationship to us more the kind that would encourage our authentic humility and curiosity. Those two points may be logically disconnected.
I'll say this, though: If I have a choice, I'd like to be in a future where we tried to give something of our better selves to AI, so that someday, when the tables are turned, we get the same in return.
[1] For plebeians of physics, such as myself, here is more scholarly detail on the Minkowski Diagram and Lorentz Boosts.
Discuss
[Hot take] Problems with AI prose
Epistemic status: Written quickly. I have no specific expertise or training in writing or literary analysis.
Recently, the NYTimes released a nifty quiz. Readers were asked to indicate their preference between prose written by Claude Opus 4.5 and famous humans in five head-to-head comparisons. The Claude outputs were produced by providing Claude with the human-written excerpt and asking it to "craft its own version using its own voice."
If you haven't taken the quiz, I suggest that you do so before reading on. It should take less than five minutes. If you do, I'd appreciate you reporting your score in the comments.
The human/AI preference ratios among quiz takers were:
- Literary Fiction (excerpt from "Blood Meridian"): 50%/50%
- Fantasy (excerpt from "A Wizard of Earthsea"): 51%/49%
- Science Writing (excerpt from "The Demon-Haunted World" by Sagan): 35%(!)/65%
- Historical Fiction (excerpt from "Wolf Hall" by Mantel): 56%/44%
- Poetry (excerpt from "The Fish" by Bishop): 52%/48%
I was very surprised by these splits. I tried taking the quiz myself, and strongly preferred the human writing in every case (perhaps with mild ambivalence on Sagan).
I asked some of my friends and acquaintances to attempt the quiz. Out of four takers, none consistently preferred human writing across the five excerpts. Their scores (IIRC) were: 3/5, 3/5, 3/5, 4/5.
I'm revisiting this subject after a friend explicitly told me that they were impressed by ChatGPT written prose, and believed it to be superior to most human prose.
Taste is a subjective matter, but I am baffled by this preference. The rest of this post describes my frustrations with AI-written prose. My hope is that clarifying these complaints will be a small contribution toward improving the state of AI writing. If we do not dramatically improve the quality of AI writing, I worry that our literary culture will only further degrade as AI writing proliferates.
A Closer Look at Quiz Excerpts
A friend complained that they were often ambivalent between the human and AI writing because they found the human excerpts uncompelling. Although the human prose featured in the NYT's quiz was selected to be popular, well-regarded, and diverse, I sympathize with having slightly more obscure tastes. However, I believe that a technical examination of the prose demonstrates a substantially higher level of skill and intentionality than current models are capable of.
For each excerpt, I'll highlight what I find impressive about the human writing and how I find the AI's product lacking.
1) Blood Meridian
It makes no difference what men think of war, said the judge. War endures. As well ask men what they think of stone. War was always here. Before man was, war waited for him. The ultimate trade awaiting its ultimate practitioner. That is the way it was and will be.
In my opinion, this excerpt is notable for its skilled use of metaphor.
The text reminds us that stone and war share the following traits:
- It makes no difference what men think of them;
- They endure; (Consider the actual physical stone!)
- They were always here;
- They waited for men.
It is possible to construct many weaker metaphors:
- "It makes no difference what men think of war, said the judge. War endures. As well ask men what they think of marshmallows. Before man was, war waited for him." The logical contradiction becomes immediately obvious. Marshmallows did not wait for men.
- "It makes no difference what men think of war, said the judge. War endures. As well ask men what they think of pebbles. Before man was, war waited for him." Pebbles are not particularly enduring, even though they may have been present before men.
- "It makes no difference what men think of war, said the judge. War endures. As well ask men what they think of algebra. Before man was, war waited for him." While algebra may technically be "enduring," depending on your metaphysics, it loses in viscerality. Be reminded of Orwell in "Politics and the English Language": "The sole aim of a metaphor is to call up a visual image."
Now be reminded of Opus's writing, which does not attempt a similar analogy. It follows a simple linear narrative structure (cf. the AI version of excerpt 5). The model does not make blatant mistakes, but it fails to make clever use of the characters it introduces, and the dialogue is not particularly realistic.
The boy asked his grandfather why the old church had no roof. The old man said weather and time and indifference. The boy asked if someone could fix it. The grandfather said yes. But no one would. Things were built and things fell down and mostly people just stepped over the rubble on their way to somewhere else.
2) A Wizard of Earthsea
You must not change one thing, one pebble, one grain of sand, until you know what good and evil will follow on that act. The world is in balance, in Equilibrium. A wizard’s power of Changing and of Summoning can shake the balance of the world. It is dangerous, that power. It must follow knowledge, and serve need. To light a candle is to cast a shadow.
It's a small point, but I appreciate the crescendo in granularity: one thing, one pebble, one grain of sand. "Thing" is a particularly vague word in English, so the two physical examples are grounding. A grain of sand is more granular than a pebble, which is in turn more granular than what might be immediately evoked by "a thing."
The excerpt is also again mostly notable for its use of metaphor.
First, the metaphor makes physical sense. Candle flames really do cast shadows! It's a physical phenomenon I've experienced playing with candles as a child. That memory was the first thing this excerpt evoked for me.
Second, the metaphor is symbolically coherent. Throughout cultures, light is a symbol of the good and shadows or darkness are symbols of the bad.
This time, I do not have to make up a bad metaphor. Claude offers us plenty in its version:
The healers teach that every remedy extracts its cost. A fever brought down will rise again somewhere; a wound closed by magic leaves its scar on the world, invisible but present. This is why the wise hesitate. Not from cruelty, but from understanding that interference ripples outward in ways we cannot trace. To cure a blight may curse a harvest three valleys over. Power is not the difficult thing. Restraint is the difficult thing.
Unfortunately, Claude's prose here leaves much to be desired:
- "A fever brought down will rise again somewhere" is not an example of a remedy extracting a cost, any more than Whac-a-Mole is an example of mallets producing moles.
- "A wound closed by magic leaves its scar on the world, invisible but present" is merely an assertion, since the mechanism of the magic is not explained and cannot be presumed to be understood by the reader. The writer also fails to justify that the scar is a weighty cost. If a wise healer let me bleed out because he didn't want to cause a scar, I would be more than mildly disappointed.
- "To cure a blight may curse a harvest three valleys over." Again, the mechanism for this is not remotely explained.
- "Power is not the difficult thing. Restraint is the difficult thing." Claude sure likes making claims! Why does it matter that restraint is difficult? Why is restraint difficult? What does acting with restraint look like?
The human excerpt avoids these problems. We do not need to understand the mechanism of the magic to share the speaker's intuition that acting with great power can produce unwanted side effects. Instead of being vaguely lectured about the importance of "restraint," we are presented with concrete advice: "follow knowledge, and serve need."
3) The Demon-Haunted World
The excerpt from Sagan is the least favored by quiz-takers, with only 35% preferring it to Claude's rewrite. I personally found this excerpt to be the least impressive amongst the five.
Nevertheless, I claim that it is deeper and more interesting than Claude's output.
Here is Sagan:
Science is not only compatible with spirituality; it is a profound source of spirituality. When we recognize our place in an immensity of light years and in the passage of ages, when we grasp the intricacy, beauty, and subtlety of life, then that soaring feeling, that sense of elation and humility combined, is surely spiritual.
Sagan uses a curious sleight of hand. He claims that science is a "profound source of spirituality," but he justifies this not by directly saying that we should feel spiritually inspired by the vastness or enduringness of the cosmos or the "intricacy, beauty, and subtlety of life." Instead, we are reminded that this vastness and enduringness produces in us "a sense of elation and humility." That emotion, Sagan claims, is precisely spirituality.
Compare with Claude:
There is something astonishing in the fact that we are made of matter forged in dying stars, that the calcium in our bones was created in stellar furnaces billions of years before Earth existed. The universe is not indifferent to us; we are made of it, continuous with it. To understand this is not to feel small. It is to feel implicated in something vast.
Claude abandons Sagan's gambit. It reminds us, as popular science writing is stereotyped to do, that space is vast and enduring. Then, we are told that this should make us "feel implicated in something vast." Claude fails to make any clear overarching claim, and the motivation behind the examples provided is unclear.
4) Wolf Hall
It is wise to conceal the past even if there is nothing to conceal. A man's power is in the half-light, in the half-seen movements of his hand and the unguessed-at expression of his face. It is the absence of facts that frightens people: the gap you open, into which they pour their fears, fantasies, desires.
This excerpt is special because the author makes an interesting argument. Each sentence justifies the one before it.
It argues that one should be wary of revealing too much, because others' uncertainty gives one power. Why does others' uncertainty grant power? Because into the uncertainty they can project.
This sort of logical progression is something AIs are surprisingly incapable of crafting. This deficiency is clear from Claude's attempt:
A letter can be read many ways, and he had learned to write in all of them at once. The surface meaning for anyone who might intercept it. The true meaning for the recipient who knew what to look for. And a third meaning, hidden even from himself. Ambiguity was not weakness. It was survival. A man who spoke plainly was a man who would not speak for long.
Claude abandons the logical progression. Claude's output is seven sentences, none of which justify any other. In isolation, "a man who spoke plainly was a man who would not speak for long" is not a weak sentence. However, Claude does not use its preceding sentences to justify the claim by either evidence or analogy.
5) The Fish
I caught a tremendous fish and held him beside the boat half out of water, with my hook fast in a corner of his mouth. He didn’t fight. He hadn’t fought at all. He hung a grunting weight, battered and venerable and homely. Here and there his brown skin hung in strips like ancient wallpaper.
This passage is notable for its imagery. The description of the fish as "tremendous" in the first sentence sets our expectations for it. We expect it to struggle! When a small amateur fishing boat snags a large fish, everyone on the boat rushes over to help. The strongest and most experienced men alternate between reeling in with all their might, running around the boat as the fish moves, and shouting commands to each other ("loosen the line!" and so forth). Sometimes, the fish wins.
That image is dashed in our minds by the next sentence. "He didn’t fight. He hadn’t fought at all." From there on, the author's choice of words sparks a deep sense of sorrow in the reader: grunting, battered, homely. The final physical simile ("like ancient wallpaper") seals the image. A "tremendous," "venerable" thing is now utterly defeated.
Compare with Claude's:
We found the owl at the edge of the north field, one wing extended as if still reaching for flight. Its eyes were closed. The feathers at its breast were the color of wet bark, and beneath them you could feel the hollow bones. She asked if we should bury it. I said yes. We dug a small hole near the fence post. The ground was cold and giving.
Claude also describes an animal, and makes multiple attempts at visceral imagery. Some of the attempts are even compelling! My favorite clause here is this: "and beneath them you could feel the hollow bones." However, the reader is constantly distracted from this by cliche attempts at story progression (e.g. "She asked if we should bury it. I said yes. We dug a small hole near the fence post."). As such, the overall quality of the excerpt is quite poor.
Closing
Human writers routinely use techniques that AIs fail to grasp:
- Metaphors based on real-world physical objects or phenomena which are analogous on multiple dimensions;
- Compelling, visceral descriptions of physical objects or phenomena;
- Logically coherent metaphors;
- Logical argumentation;
- Intentionality (e.g. that each incremental sentence serves some purpose not adequately fulfilled by the existing sentences);
- Subtle reframings (e.g. Sagan's use of elation as a case of spirituality).
Other techniques not demonstrated in the excerpted human prose include realistic and compelling dialogue, character-building, and adept use of parallelism.
I believe that we should focus on improving models' ability to write in the <200 word range, where both generation and evaluation are comparatively cheap. I do not expect efforts to produce high-quality long-form LLM writing to be fruitful until models are able to produce strong short-form writing.
For next time:
- ChatGPT Original Fiction vs. Eliezer's Version
- Mythos Writing Sample vs. Similar Human Excerpt
Discuss
You can’t trust violence
(Recommended listening: Low - Violence)
Last year, I personally called AI companies to warn their security teams about Sam Kirchner (former leader of Stop AI) when he disappeared after indicating potential violent intentions against OpenAI.
For several years, people online have been calling for violence against AI companies as a response to existential risk (x-risk). Not the people worried about x-risk, mind you; they’ve been solidly opposed to the idea.
True, Eliezer Yudkowsky’s TIME article called on the state to use violence to enforce AI policies required to prevent AI from destroying humanity. But it’s hard to think of a more legitimate use of violence than the government preventing the deaths of everyone alive.
But every now and then some smart ass says “If you really thought AI could kill everyone, you’d be bombing AI companies” or the like.
Now, others are blaming the people raising awareness of AI risk for others’ violent actions. But this is a ridiculous double standard, and those doing it ought to know better.
AI poses unacceptable risks to all of us. This is simply a fact, not a radical or violent ideology.
Violence comes to AI Safety
Today on Twitter, as critics blamed the AI Safety community for the attacker who threw a Molotov cocktail at Sam Altman, I joined a chorus of other advocates for AI risk reduction in -- again -- denouncing violence. This was the first violent incident I’m aware of taken in the name of AI safety.1
Violence is not a realistic way to stop AI. Terrorism against AI supporters would backfire in many ways. It would help critics discredit the movement, be used to justify government crackdowns on dissent, and lead to AI being securitized, making public oversight and international cooperation much harder.
The first credible threat of political violence motivated by AI safety was the incident with Sam Kirchner (formerly of Stop AI) I mentioned at the outset. This incident was surprising, since from its conception, Stop AI held an explicit policy of nonviolence, and members of the group liked to reference Erica Chenoweth and Maria Stephan’s book Why Civil Resistance Works: The Strategic Logic of Nonviolent Conflict.
Research like Chenoweth’s suggests that nonviolence is indeed generally more effective. It’s a little bit unclear how to apply such research to the movement to stop AI, as her studies involved movements seeking independence or regime change rather than more narrow policy objectives. But if anything, I’d expect nonviolence to be even more critical in this context.
When do movements turn violent?
So if nonviolence is often strategic, when do movements turn to violence? Perhaps surprisingly rarely.
People say anti-AI sentiments and movements -- especially those that emphasize the urgent threat of human extinction -- are bound to breed violence. I think this is ignorant and actually makes violence more likely. Environmentalism has been a much larger issue for a much longer time, and “eco-terrorism” is basically a misnomer for violence against property, not people (more on that later).
There are many political issues in the USA that we never even consider as potential bases of violent movements. Even if there are occasional acts of political violence like the murders of Democratic Minnesota legislators or Conservative pundit Charlie Kirk, we don’t generally view them as indicting entire movements, but as the acts of deranged individuals.
My hunch would be that movements generally turn violent because of violent oppression against their members, not simply for ideological reasons. There are, of course, counter-examples, such as bombings of abortion clinics, where attackers justified their actions as preventing the murder of unborn children, or ideologies preaching violent revolution, such as at least some varieties of Communism.
Does violence include property damage?
An important question for “nonviolent” activists is whether they include violence against property in their definition of “violence”. Stop AI does. I assume Pause AI does as well, but it’s a moot point since they also reject illegal activities entirely.
The question deserves a bit more discussion, though, as it’s a common point of contention and legal and dictionary definitions differ. First, there is clearly an important distinction between violence against property and violence against people. An argument in favor of using “violence” to only mean “against people” is that we don’t have another word for that important concept. Still, I favor a broader definition that includes attacks on property, for a few reasons:
- Many other people use this definition, and I think the damage that being perceived as violent can cause to a movement can’t be mitigated by a semantic argument.
- Attacks on property can escalate. You are commonly allowed to use proportionate violent force against people to defend your property.
- Attacks on property can hurt people. Setting fire to buildings, as activists associated with the Earth Liberation Front have done, seems hard to do without some risk of hurting people.
That being said, I think there’s a bit of a grey area between “violence against property” and vandalism. I’d say violence must involve the use of “force”. For example, I think most people wouldn’t consider graffiti an act of “violence”.
“Ecoterrorism”
I think the example of eco-terrorism is instructive. The vast majority of the environmentalist movement is non-violent. However, a small number of activists have advocated for and enacted tactics such as tree-spiking that have injured people.
Hence we now have the term “ecoterrorist”. The very existence of this phrase is misleading. I remember a while back I was curious — who were these ecoterrorists? What had they done? Why hadn’t I heard about it the way I heard about other terrorist attacks? Well, when you look into it, it’s arson and tree-spiking, and that’s about it. I seem to recall reading about one example where actions intended to destroy property actually ended up killing people, but I wasn’t able to easily dig it up.
Still, these few actions were enough to add this word to our lexicon, and create an image of environmentalists as more radical and anti-social than they really are.
Conclusion
I’m struggling to find a good way of ending this post.
I believe the actions of AI companies are recklessly and criminally endangering all of us, and the public will be increasingly outraged as they discover the level of insanity that’s taking place. Similarly to Martin Luther King Jr.’s comment that “a riot is the language of the unheard”, I do understand why this emotional outrage might provoke a violent response.
But I hope the movement doesn’t spawn a violent element and that these recent examples are isolated incidents. To make that more likely, we should continue to vocally espouse nonviolence, and denounce those who would encourage violence among us.
But ultimately, movements are fundamentally built through voluntary participation, and nobody can entirely control their direction. The response should be to try and steer them in a productive direction, not to avoid engaging with them.
1. Earlier this week, bullets were fired into the house of a local councilman supporting datacenter development; it’s unclear whether AI was a motivation in that case.
Discuss
The Blast Radius Principle
Decentralize or Die.
In April 2024, a salvo of cruise missiles destroyed the Trypilska thermal power plant, the largest in the Kyiv region, in under an hour. In June 2023, the destruction of the Kakhovka dam left a million people without drinking water and wiped out an entire irrigation system downstream. Throughout three winters, strikes on combined heat and power plants have left apartment buildings in Kyiv at indoor temperatures barely above freezing. In December 2023, a single cyberattack on Kyivstar, Ukraine's largest mobile operator, cut phone and internet service for millions.
One would think that under such attacks on infrastructure, any society must necessarily collapse. Or at least that’s what Putin hopes for. But the last time I checked, Ukraine was still very much alive and kicking. The question is: how is that possible?
***
In the winter of 2022, when the blackout in Kyiv happened for the first time, people had to fend for themselves. Here’s Tymofiy Mylovanov, professor at the Kyiv School of Economics, tweeting in real time:
There is no electricity, no heating, no water. Outside temperature is around freezing. The apartment is still warm from the previous days. We will see how long it lasts. We have blankets, sleeping bags, warm clothes. I am not too worried about heating until temperature goes below -10 C / 14 F. But the water is another issue. The problem is toilets. We have stockpiled about 100 litters of water. There is also snow on our balcony. It is a surprisingly large supply of water. But every time I go there to get it, I have to let the cold air in; not good. For now, the cell network is up, although the quality varies. Thus, I have internet. Internet is critical for food. Yesterday we went to a grocery store to buy up a bit more stuff in case there will be shortages. Food is there, no lines. The challenge is to pay. Most registers work with cash only. Just a few are connected to accept credit cards. Through cell network. The banking system is stable, but I will go get some cash in case Telekom or banks go down. Our stove is electric. This means no warm food until the electricity is back. This is not fun. We have to fix it. There are two parts to our plan. First, we will buy an equivalent of a home Tesla battery. So it can be charged when there is electricity. This will also solve, somewhat, the heating problem, as we have already bought some electric heaters. But the electricity might be off for a long time and so we need gas or wood cooking equipment. I guess we have to go shopping. Stores work. They run huge diesel generators.
Later that day he dryly comments: “In the morning I said I was not worried about heating. Instead, I was concerned about water and sanitation. Boy, was I wrong.”
It’s worth reading the tweets from the next few days: getting a generator, setting it up, placing it on the balcony so that the fumes stay outside, getting the wires in without letting the cold in as well. Go check it out for yourself.
Anyway, what followed was a series of adaptations, a kind of military vs. civilian arms race. Through the first winter, the strategy was simply to repair what Russia destroyed: substations and transformers could be replaced within weeks with donated European spares.
In the meantime, for the millions of affected people, the government created stopgaps. Over 10,000 heated public spaces in schools, government buildings, and railway stations offered electricity, water, internet, and phone charging. Kyiv deployed mobile boiler houses that could run for days without refueling. Hospitals installed Tesla Powerwalls. Cafes ran diesel generators and became de facto community centers.
Mobile boiler house in a shipping container. You truck one in, connect it to a building's existing heating pipes, and it starts working.
I’ve donated to some of those efforts, maybe you did too. And taken all together, it worked. Kind of. But by 2024 Russia adapted. Strikes shifted from repairable transmission equipment to the power plants themselves, assets that take years to rebuild. The Trypilska plant was partially restored after its destruction, then it was struck again by drones months later. And again after that. With two-thirds of generation capacity gone and every thermal plant in the country damaged, it became clear that restoring the old centralized system was not a viable strategy.
Ukraine's response shifted. It was not to rebuild what was destroyed but to replace it with something less centralized. Something too dispersed to target. Instead of restoring the Trypilska plant's 1,800 megawatts, hundreds of small cogeneration units were scattered across the region, compact gas turbines producing 5 to 40 megawatts each, generating heat alongside the electricity. By late 2025, Ukraine's heating sector alone ran over 180 such units as well as hundreds of modular boilers. Hospitals, water utilities, and apartment blocks are organized into autonomous energy islands, microgrids that keep functioning even if the national grid goes dark. No single unit is worth a cruise missile. And a destroyed module can be replaced with a phone call and a truck from Poland.
The same logic extends to water. Ukraine's centralized water systems are inherited from the Soviet era. A single pumping station serves hundreds of thousands of people. They are just as vulnerable as the power plants. Strikes on the grid cut electricity to pumps. Without pumps, water stops flowing. In winter, standing water in pipes freezes and bursts them, cascading damage across entire districts.
In Mykolaiv, a damaged pipeline to the Dnipro River left 300,000 residents relying on salty, barely drinkable water from a local estuary for over a year. The response mirrors the energy transformation: water utilities are installing their own solar panels and battery storage to decouple from the grid entirely.
Solar panels are, under these circumstances, close to an ideal solution. They are cheap, manufactured at scale, and can be replaced in a single day. By early 2024, Ukrainian households and businesses had installed nearly 1,500 megawatts of rooftop solar. Not because of climate change, but because of survival. Solar panels are inherently dispersed. There is no single set of coordinates an attacker can hit to disable them all. And destroying them one by one would cost the attacker more in munitions than the panels are worth.
This kind of arithmetic pops up everywhere. In the ongoing Iran war, Ukrainian military observers were flabbergasted by Gulf states and the US burning through hundreds of Patriot missiles, $4 million each, to shoot down cheap Iranian Shahed drones, $35,000 apiece. If destroying a target costs more than the target itself, the attacker loses even if the strike succeeds.
A different kind of decentralization is happening in the telecommunications domain. The infrastructure was already fairly decentralized to start with, a legacy of makeshift internet adoption that happened in many Ostblock countries, with many small ISPs emerging independently. The war pushed this further. Ukraine has adopted a layered backup approach: if fiber broadband fails, mobile networks fill the gap; if mobile networks are knocked out, Starlink steps in as a last resort.
The logic extends to government services. There’s the Trembita data exchange platform, where government services talk each other directly without centralizing the data. (Trembita is based on Estonian X-Road system — the birth of Estonian e-gov technology is a fascinating story in itself, and there’s a whole book about it!) Built on top of it, there’s the Diia app that allows citizens to file taxes, register vehicles, access medical records, open bank accounts, register births, and start businesses, all from a smartphone. This, of course, means there’s no single office building to target so as to disrupt a particular kind of activity.
Add to that Ukrainian governmental data are now stored in the cloud. A week before the invasion, Ukraine's parliament quietly amended a law that had required government data to be stored physically in Ukraine. On the day the missiles started flying, the Ukrainian ambassador in London met AWS engineers and decided to fly three AWS Snowballs, hardened suitcases that hold 80 terabytes each, from Dublin to Poland and then move them to Ukraine the very next day. Ukrainian technicians copied population registers, land ownership records, and tax databases onto them and shipped them back out.
It was a race. On the day of the invasion, cruise missiles struck government server facilities while Russian cyber operatives simultaneously deployed wiper malware, software designed to permanently destroy data, against hundreds of Ukrainian government systems. Some data was lost, but the most critical registries were already gone, smuggled out of the country in carry-on luggage.
***
On the battlefield, where all these trends are even more severe, concentration has become suicidal. Russian infantry now advances in groups of two or three. Anything larger is an invitation for a drone strike. Warships are floating targets. Russia's Black Sea Fleet retreated from Crimea after losing vessels to cheap unmanned boats. In the Hedgehog 2025 exercise in Estonia, a small team of Ukrainians and Estonians with drones, acting as the opposing force, wiped out two NATO battalions, thousands of soldiers, in half a day, not least because they had moved in columns, parked their vehicles in close formations and failed to scatter under attack.
They made the same mistake as the designers of Soviet-era power grids: they concentrated value and got destroyed for it. Call it the blast radius principle. In a war of attrition, any asset whose destruction is worth more than the cost of the weapon that can reach it will, sooner or later, be destroyed. The only effective strategy is to push the value of each individual target below that threshold, to become, in effect, too small to bomb.
When Rheinmetall’s CEO recently made a condescending comment about Ukrainian housewives 3D-printing drones in their kitchens, much merriment ensued. Because Rheinmetall, of course, builds the very kind of heavy conventional, WWII-style hardware that the developments in Ukraine are rapidly making obsolete.
But mockery aside for a moment: if you’ve spent any time around progress studies, the phrase “housewives building drones in kitchens” makes you prick up your ears. It triggers a specific association: cottage industry, the small-scale, home-based production that preceded and enabled the industrial revolution. It makes you think about how the modes of production change over centuries.
You know that kings and generals don’t make history. One empire falls, another rises, nothing fundamentally changes. What does matter is new technology. Even more so new technology which fundamentally changes how things are done. Technology that reshapes the economics of entire production chains. Agriculture. Road system. Bill of exchange. Putting-out manufacture. Joint-stock company. Assembly line. The humble shipping container…
Does decentralization, as seen in Ukraine, fit the bill? We don’t know. FirePoint, the Ukrainian company producing the much-spoken-about FP drones, is distributed across more than 50 manufacturing sites throughout the country. But that’s nothing new. The allied bombing campaign during WWII failed to halt German aircraft manufacture precisely because Germany had decentralized its industries. Albert Speer, then the minister of armaments, dispersed production into hundreds of small workshops, caves, tunnels, and forest sites across the Reich. German aircraft production actually increased in 1944, the year of the heaviest bombing. But then, after the war, German industry did concentrate again.
What seems different this time, though, is the spillover into the civilian sector. Speer dispersed munitions factories, but German civilians kept heating their homes the same way throughout the war. In Ukraine, the dispersal extends to utilities, water systems, telecommunications, government services. Russians bomb a heating plant, the heating network disperses into dozens of autonomous microgrids.
The obvious objection is that this is a wartime hack, not a permanent transformation. Distributed systems sacrifice economies of scale. A hundred small gas turbines are less efficient than one large power plant. Once the war ends and the skies are safe, the economic logic will reassert itself and everything will concentrate again.
And indeed, in some cases, that's exactly what will happen. Ukraine is currently bombing Russian oil refineries and fertilizer plants, and although cracking crude oil in plastic bottles in a kitchen is exactly the sort of thing you might expect Eastern Europeans to do, it's unlikely to match the efficiency of a proper refinery. Some industries have genuinely irreducible physical economies of scale. The chemistry demands large reaction vessels, the thermodynamics reward concentration. Similarly, some infrastructure simply cannot be distributed. It's hard to imagine a decentralized railway system or a dispersed deep-water port — at least short of giving up on it and transporting everything by drone.
But not all economies of scale require spatial proximity. Sometimes, it’s just sheer scale that matters, not necessarily the co-location. Case in point: solar panels. Other times the crucial element is the organizational structure, not the physical location of the employees. Basically any service offered over internet is like that.
But all that being said, there’s a specific reason to think some of these changes may stick.
Over the past fifty years we’ve accumulated an entire arsenal of distributed technologies. Packet-switched networks. Drones. Solar panels. Distributed databases. 3D printing. Even nerdy cypherpunk inventions like public key cryptography, zero-knowledge proofs and cryptographic ledgers. And it’s not just technical stuff. We’ve developed distributed social technologies too: open-source-style cooperation (who would have predicted that military intelligence, of all things, would be the next domain to go open-source?), market design, remote work, video conferencing. Even prediction markets as a tool for aggregating dispersed knowledge.
Some of these are already ubiquitous. Around 70% of the world’s population already has access to the Internet, a network famously designed to route around damage during a nuclear war. But others feel like we’re barely scratching the surface. 3D printing has existed for decades, yet it still feels like a technology that we are only playing with. We may be like pre-Columbian Americans, whose children played with wheeled toys, but the adults carried loads on their backs.
Mesoamerican wheeled toy.
Based on historical examples, we know that inventing a technology is often not the bottleneck. The aeolipile was invented in the first century AD, but we still had to wait another seventeen centuries to get an actual steam engine. Gutenberg went bankrupt. Adopting a technology is dependent on complex interplay of socio-economic forces that, at a certain moment, make the technology so desirable that people start using it despite all the drawbacks and overcoming all the vested interests. Then the learning curves kick in.
Two questions remain. Are those distributed technologies already adequately exploited, or are they like dead wood lying around in a forest, waiting for a spark? And if the latter is true, are the incentives created by the war in Ukraine — or for that matter, by similar future war elsewhere — sufficient to ignite it? They may be. Because once the enemy starts bombing companies, the incentives change. Working from home ceases to be a nice perk. Suddenly, it’s either work from home or die.
Discuss
On not being scared of math
Written quickly for the Inkhaven Residency.[1]
There’s a phenomenon I often see amongst more junior researchers that I call being scared of math.[2] That is, when they try to read a machine learning paper and run into a section with mathematical notation, their minds seem to immediately bounce off the section. Some skip ahead to future sections, some give up on understanding the section immediately, and others even abandon the entire paper.
I think this is very understandable. Mathematical notation is often overused in machine learning papers, and can often obscure more than it illuminates. And sometimes, machine learning papers (especially theory papers) do feature graduate level mathematics that can be hard to understand without knowing the relevant subjects.
Oftentimes, non-theory machine learning papers use mathematical notation in one of two lightweight ways: either as a form of shorthand or to add precision to a discussion.
The shorthand case requires almost no mathematical knowledge to understand: paper authors often use math because a mathematical symbol takes up far less real estate. As an example, in a paper about reinforcement learning from human preferences, instead of repeating the English words “generative policy” and “reward model” throughout a paper, we might say something like “consider a generative policy G and a reward model R”. Then, we can use G and R in the rest of the paper, instead of having to repeat “generative policy” and “reward model”. This is especially useful when trying to compose multiple concepts together: instead of writing “the expected assessed reward according to the reward model of outputs from the generative policy on a given input prompt”, we could write E[R(G(p))].
Similarly, mathematical notation can be used to add precision to a discussion. For example, we might write R : P x A -> [0,1] to indicate the input-output behavior of the reward model. This lets us compactly express that we’re assuming the reward model gets to see both the actions taken by the policy (A) and the prompt provided to the policy (P), and that the reward it outputs takes on values between 0 and 1.
In neither case does the notation fundamentally depend on knowing lots of theorems or having a mastery of particular mathematical techniques. Insofar as these are the common use cases for mathematical notation in ML papers, sections containing the math can be deciphered without having deep levels of declarative or procedural mathematical know-how.
What to do about thisI think there are two approaches that help a lot when it comes to overcoming fear of math: 1) translating the math to English, and 2) making up concrete examples.
As an illustration, let’s work through the first part of section 3.1 of the Kalai et al. paper, “Why Language Models Hallucinate”. I’ll alternate between two moves: restating each formal step in plain English, and instantiating it with a deliberately silly running example:
The section starts by saying that a base model can be thought of as a probability distribution over a set of possible strings (“examples”) X. As an example, a model such as GPT-2 can indeed be thought of as producing a probability distribution over sequences of tokens of varying length.[3]
Then, the authors write that these possible strings can be considered as errors or valid examples, where each string is either an error or valid example (but not both). Also, the set of example strings include at least one error and one valid example. The training distribution is assumed to include only valid examples.
Here, it’s worth noting that an “error” need not be a factually incorrect statement, nor that the training distribution necessarily includes all valid statements. Let's make up a rather silly example which is not ruled out by the authors’ axioms: let the set of plausible strings be the set of English words in the Oxford English dictionary, let the set of “valid” strings be the set of all words with an odd number of letters, while the training distribution consists of the single string “a” (p(x) = 1 if x = “a” and 0 otherwise).
The authors now formalize the is-it-valid (IIV) binary classification problem. Specifically, the goal is to learn the function that classifies the set of all strings into valid examples and errors. In our case, the function is the function that takes as input any single English word, and outputs 1 if the number of letters in the word is odd. Also, we evaluate how well we’ve learned this function on a distribution that’s a 50/50 mixture of strings in the training distribution (that is, the string “a”) and the strings that are errors, sampled uniformly (that is, all English words with an even number of letters.)
The authors then introduce the key idea: they relate the probability of their learned base model to its accuracy as a classifier for the IIV problem. Specifically, they convert the probability assigned by the base model to a classification: if it assigns more than 1/number of errors probability to a string, then the base model classifies the string as a valid string. Otherwise, it considers it an error.
The authors then introduce their main result, which relates the error of this IIV classifier to the probability the base model generates an “erroneous” string:
That is, the probability our base model generates an erroneous string is at least twice the error rate of the converted classifier on the IIV classification problem, minus some additional terms relating to the size of the valid and error string sets and the maximal difference between the probability assigned to any string by the training distribution and the base model.
To make sure we understand, let’s continue making up our silly example: our base model assigned 50% probability to the string “a” and 50% to “b” (and 0% to all other strings). Then (since it assigns 0% probability to any string with an even number of letters), its classification accuracy on the IIV problem is 100%, and its error rate is 0%. Indeed, the probability it generates an erroneous string is 0%. So we actually already have err = 0 >= 2 * err_iv = 0, trivially. It’s worth checking what the other terms here are, to make sure we understand: the first term is the ratio of the size of the set of valid strings and the set of erroneous string (in our case, the ratio of the number of English words with odd characters versus even ones), and the second is 0.5 – our base model assigns a 50% chance to “a”, which the training distribution assigns 100% probability to, and similarly our base model assigns a 50% chance to “b”, which the training distribution assigns 0% chance to.
I’m going to stop here, but I hope that this example shows that math is not actually that hard to read. Most non-theory ML papers have math sections that are similar in difficulty to this example. If you find yourself bouncing off the math, the question is rarely "do I know enough math for this?", and much more often "how can I translate this to English and use an toy illustrative example to make it concrete?"
- ^
I was going to conclude my “have we already lost” series, but I wanted to write about something lighter and less serious for a change.
- ^
There’s also a more general phenomenon that I’d probably call being scared of papers, to which the only real solution I’ve found is exposure therapy (interestingly, writing a paper does not seem to fix it!).
- ^
Specifically, GPT-2 takes as input a sequence of tokens, and assigns a probability distribution over 50,257 possible next tokens, one of which is the <|endoftext|> token. Starting from the empty sequence, GPT-2 induces a probability distribution over token sequences of any length, by multiplying the conditional probabilities of each subsequent token in the sequence, conditioned on all previous tokens.
Discuss
Why I'm excited about meta-models for interpretability
I'm pretty excited about training models to interpret aspects of other models. Mechanistic interpretability techniques for understanding models (e.g. circuit-level analysis) are cool, and have led to a lot of interesting results. But I think non-mechanistic interpretability schemes that involve using meta-models – models that are trained to understand aspects of another model – to interpret models are under-researched. The simplest kind of meta-model is linear probes, but I think methods that train much more complex meta-models (e.g. fine-tuned LLMs) to interpret aspects of models are much more exciting and under-explored.
(Sparse auto-encoders (SAEs) are also a kind of meta-model, but here I'm focusing on meta-models that directly interpret models instead of decomposing activations into more-interpretable ones.)
The best example of large-scale meta-models is Activation Oracles (or AOs; descended from LatentQA), which fine-tune a model to interpret model activations by treating the activations like tokens that are fed into the oracle model. I think this is a pretty good architecture for interpreting model thoughts, and I think it can be extended in a few ways to do interpretability better.
Diagram of how activation oracles work from the paper for context:
An advantage of AOs over traditional methods I like is that it's really easy to use them to quickly interpret some aspect about a model. You can just choose some tokens and ask a question about what the model is thinking about. Most mechanistic interpretability techniques involve at least a bit of human effort to apply them (unless you've already set them up for the specific kind of question you care about); meta-models let you just ask whatever you want.
We can get good performance on LLMs by just training on more data. It's possible we might be able to get good interpretability through finding ways to scale up model-based interpretation of model activations/parameters too (although this isn't an exact analogy to the scaling hypothesis; I don't think just training for more epochs is all we need). We might be able to scale up activation oracles (and meta-models generally) with things like:
- Creating more supervised tasks to train on to help generalization (the AO paper showed they got better performance with more supervised tasks)
- Spend more time training oracles, with more activations/epochs
- Training the AO by fine-tuning a bigger model than the subject being interpreted
I think the underlying idea of AOs – training an LLM to directly interpret aspects of models – is pretty cool and can probably be generalized beyond just interpreting model activations; we can probably make models to interpret other aspects of models, such as model parameters, attention patterns, LoRAs, and weight diffs.
It would be nice to be able to make an oracle that's trained on interpreting model weights and can answer questions about them (e.g. given some model weights, answering queries like "Draw a diagram of how the model represents addition" or "What political biases does this model have?"), but this is really hard: model weights are too big to fit in LLM context windows[1], it's not clear how you could train the oracle model (what supervised training data would you use?), and it would be really expensive to train a bunch of LLMs to train the oracle. Training meta-models to interpret things like individual layers or attention heads in a model seems much more tractable, and could probably give some useful insights into how models work.
Training meta-modelsOne hard part about meta-models is figuring out how to train them such that they can answer interesting questions about the model. The activation oracle paper describes training the activation oracle on various supervised tasks about the activations (e.g. "Is this a positive sentiment?", "Can you predict the next 2 tokens?", system prompt QA) and having the oracle model generalize to out-of-distribution tasks like "What is the model's goal?").
Anthropic has created a new version of activation oracles (called activation verbalizers) trained using a secret new unsupervised method. They have a few examples of explanations from their activation verbalizer in the Mythos model card and it seems like it's pretty good at generating coherent explanations.
FaithfulnessOne problem is faithfulness – given that activation oracles aren't trained on directly understanding the model's goals, it's possible the activation oracle learns a purely superficial understanding of the activations that doesn't capture important information about what the model is thinking.
Evaluating how well activation oracles generalize to out-of-distribution tasks like interpreting what the model is doing (as opposed to coming up with a plausible superficial explanation) is hard, because we don't know what the correct answer is. It would be interesting to evaluate activation oracles on tasks where we can use traditional mechanistic interpretability schemes as ground truth.
Future directionsI saw some interesting research with a toy example of training meta-models to directly interpret model weights as source code, but it only works because the meta-models were trained with supervised learning on examples of transformers that were compiled from source code. It would be interesting to try to generalize this beyond interpreting transformers compiled from code describing the model.
Idea for training AOs differently I thought of: take a reasoning model, create a bunch of synthetic CoTs like "<thinking>I'm thinking about deceiving the user</thinking>", train the AO to map the activations of the thinking block to the goal ("deceiving the user").
It would be interesting to interpret activation oracles themselves, to understand how they interpret the model and see what their understanding of it is. Probably a bad idea but using meta-activation-oracles to interpret activation oracles would be interesting.
FinI've been experimenting with new applications for meta-models (e.g. for latent reasoning models) but unfortunately training them requires a lot of compute, so I probably won't be able to afford to do much research into this myself once my free TPU credits run out. I hope this inspires you to think about meta-models for interpretability!
- ^
There are various tricks you can do here to squeeze many weights into a single token, but I don't think they would work well enough to squeeze an entire (large) language model in there.
Discuss
Страницы
- « первая
- ‹ предыдущая
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- …
- следующая ›
- последняя »