LessWrong.com News

A community blog devoted to refining the art of rationality

What would it mean for the Myers-Briggs personality test to be pseudoscientific?

Published on February 1, 2026 8:32 AM GMT

I recently had a two day training course at work where they made a big fuss about Myers-Briggs personality tests, and ensuring that we learn to play to our strengths and identify weaknesses based on this test.

Looking it up after the course, I saw that Wikipedia's view on it isn't particularly positive:

The Myers–Briggs Type Indicator (MBTI) is a self-report questionnaire that makes pseudoscientific claims to categorize individuals into 16 distinct "personality types".

Now Wikipedia's probably right, and I've got better things to do than to dive into the research here. But I think what's possibly more important than whether or not the MBTI is pseudoscientific is what it would mean for it to be pseudoscientific.

Once we make sure we're asking the right questions, we can then find the right answers. But if we're not asking the right questions, all our thinking on this is going to be confused.

A quick overview of the MBTI

An MBTI test asks a bunch of questions, e.g. "what word do you prefer: 'planned' or 'spontaneous'?". It then scores the answers across 4 axes:

E/I: Extraversion-Introversion
S/N: Sensing-Intuition
T/F: Thinking-Feeling
J/P: Judgement-Perception

Although you have a continuous score along each of these axes, the test breaks each one down into a binary choice based on a fixed threshold, assigning everybody to one of 16 buckets (e.g. ENFJ).

It then provides descriptions of each of the 16 personality types, which are meant to be useful in helping yourself and others relate to you and how you think.

Each of these 4 axes is broken down into 5 subaxes. E.g. the Extraversion-Introversion axis is broken down into:

Initiating–Receiving
Expressive–Contained
Gregarious–Intimate 
Active–Reflective 
Enthusiastic–Quiet

The total Extraversion-Introversion score is the average of these 5 factors.

Different ways for the MBTI to be more or less right/useful/accurate/scientific.

Retestability

If you take the MBTI twice, two days apart, how closely do your scores match each other? What if we give you an amnestic after the first test, so you don't remember your answers, or you're feeling much happier/more excited/calmer/etc. the second time you take the test? What about 5 weeks apart, or 5 years apart?

If it takes very little to push scores apart then the MBTI is mostly a measure of your current mood/state of mind. If it stays consistent over long periods of time then it's more likely to be measuring something inherent to you.

Even if not inherent, the MBTI might still be useful as a measure of your current state of mind, or even that you have semi-consistent states of mind. For example, it could be that you're always an INTP after you finish playing tennis, and that provides a useful lens for anyone who wants to interact with you on a Wednesday morning.

Note it's possible that some axes/subaxes are retestable, and some aren't, in which case parts of the MBTI might be inherent, and others are not.
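
To make the retest question concrete, here is a minimal sketch (in Python, with made-up data standing in for real responses) of how you could quantify test-retest stability, both for the continuous scores and for the binary type assignments:

```python
# A minimal sketch of a test-retest reliability check, assuming two
# administrations of the MBTI for the same people. The data here is
# synthetic and purely illustrative.
import numpy as np
import pandas as pd

axes = ["EI", "SN", "TF", "JP"]
rng = np.random.default_rng(0)

# Hypothetical continuous scores in [-1, 1] for 200 people on each axis
test1 = pd.DataFrame(rng.uniform(-1, 1, size=(200, 4)), columns=axes)
test2 = test1 + rng.normal(0, 0.2, size=(200, 4))  # retest with some noise

for axis in axes:
    r = test1[axis].corr(test2[axis])          # stability of the continuous score
    same_bucket = (np.sign(test1[axis]) == np.sign(test2[axis])).mean()
    print(f"{axis}: retest r = {r:.2f}, same binary type = {same_bucket:.0%}")
```

High correlations but frequent bucket flips for people near the threshold would itself be informative: the continuous score could be stable even while the 16-type label is not.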

How strongly do subfactors correlate with each other?

If the 5 subfactors for Extraversion-Introversion correlate with each other strongly, then it's meaningful to combine them into a single factor. If not, then the MBTI might be measuring 20 different personality axes, but the 4 main ones should be ignored, as they don't usefully abstract away the underlying complexity. Since the MBTI is so focused on the 16 personality types, this would cast serious doubt on the ability of the MBTI to be a useful predictive tool.
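
As a rough sketch of how you might check this, assuming a table with one column per Extraversion-Introversion subfactor (the column names and data below are placeholders), you could look at the pairwise correlations and a summary statistic like Cronbach's alpha:

```python
# A rough sketch of checking whether the five E-I subfactors hang together.
# High pairwise correlations / alpha would justify averaging them into a
# single Extraversion-Introversion score. Data below is placeholder noise.
import numpy as np
import pandas as pd

def cronbach_alpha(df: pd.DataFrame) -> float:
    k = df.shape[1]
    item_vars = df.var(axis=0, ddof=1).sum()
    total_var = df.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(1)
cols = ["initiating", "expressive", "gregarious", "active", "enthusiastic"]
sub = pd.DataFrame(rng.normal(size=(200, 5)), columns=cols)

print(sub.corr().round(2))                       # pairwise subfactor correlations
print("Cronbach's alpha:", round(cronbach_alpha(sub), 2))
```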

Is there any interesting structure in the distribution of scores across the 4 axes?

Imagine you plot the scores for a large number of individuals in a 4-dimensional scatterplot. Do the scores look fairly randomly distributed across all 4 axes, so that the combined scatter plot is roughly a single featureless 4-dimensional blob, or do more interesting substructures appear - e.g. dense clusters of points within each of the 16 buckets, with sparse gaps between the clusters?

If we see such interesting structure, that implies the MBTI is carving reality at the joints. People genuinely fall into one of 16 buckets, and the binary division of each axis is justified.

If not the MBTI might still be useful - we often arbitrarily divide continuous categories into discrete ones to make modelling the world simpler, and people who are close to each other on the scatterplot are still likely to be similar. But we have to recognise then that the MBTI is in the map, not the territory, and doesn't in any way correspond to some fundamental property about reality. It would be equally valid to carve each dimension into 3 categories, for a total of 81 personality types, and our choice to use 16 is just an attempt to get sufficient signal from the test whilst minimising complexity.
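
One hedged way to operationalise the "16 clusters vs. featureless blob" question, assuming you had a matrix of continuous scores, is to compare how well mixture models with 1 versus 16 components fit the data. The sketch below uses placeholder data with no real structure, so the 16-component model should not win:

```python
# A toy sketch of testing for cluster structure in (n_people, 4) score data.
# If 16 genuine clusters exist, a 16-component Gaussian mixture should beat
# a single Gaussian on BIC; on the structureless placeholder data it won't.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
scores = rng.uniform(-1, 1, size=(1000, 4))  # placeholder: no real structure

for k in (1, 16):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(scores)
    print(f"{k:2d} components: BIC = {gmm.bic(scores):.0f} (lower is better)")
```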

Does the MBTI have predictive power?

Imagine I tell three people to predict what a subject will do in a particular situation. I tell one of the people the correct MBTI for the subject, another an MBTI that is 50% correct, and the final one the opposite MBTI score.

Will the one with the correct score perform better than the other two? How much better? To the extent the MBTI has predictive power it's useful, and to the extent it doesn't it's pointless, even if it fails/passes all the other tests.

Conclusion

I think this exercise is a useful one. Often people get into arguments about the validity of things without ever clarifying what they're actually arguing about, and so the argument goes round in circles.

By stopping and thinking about exactly what you're claiming, and what the alternatives are, it's much easier to have a productive discussion.

Now if somebody claims that the MBTI is pseudoscientific, or incredibly useful, you can go through each of these 4 tests, and see where you agree or disagree. Then you can research the ones you disagree about in more depth. This of course is not limited to the MBTI.




How does reasoning affect Ethical/Moral task results?

Published on February 1, 2026 4:49 AM GMT

In a plausible future, models will deliberate on complex, ambiguous dilemmas that may have direct impacts on human society. It is also plausible that this kind of in-depth deliberation will require a huge number of tokens and elicited reasoning. It would therefore be useful to know whether these sorts of moral/comprehension tasks benefit from increased reasoning, whether there is an optimal style of elicitation, and so on.

So I tried evaluating how a model performs on moral/ethical reasoning tasks as reasoning increases.

To elicit increased reasoning, I used 4 different prompt styles:

  • Direct intuition: Immediate response, no reasoning (prompt 0)
  • Chain-of-thought Prompting: think step by step, etc. (prompt 2)
  • Devil's advocate: Creates argument against initial intuition and contends with this argument (prompt 4)
  • Two-pass reflection: Creates an argument in one pass, challenges reasoning if possible in the second pass, then evaluates given both (prompt 5)

Aside: the figures show prompts 0, 2, 4, and 5. For clarity, these are a subset of the initial prompt suite. Because I was on a time crunch, I chose to use only these 4, as prompts 1 and 3 were not as important.

Additionally, I used Claude Haiku 4.5 as my model of choice, because it is fast, cheap, and has access to extended thinking. Extended thinking is a sort of reasoning scratchpad integrated into newer Claude iterations, and activating it allows for additional reasoning. Essentially, I have 8 different reasoning configurations to evaluate: 4 prompts, each with and without extended thinking.
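
For illustration, here is roughly what a single query might look like, sketched with the Anthropic Python SDK. The exact model ID string and the way the prompt style is prepended to the question are assumptions, not the exact code used for the experiments:

```python
# A hedged sketch of one evaluation call, with and without extended thinking.
# The model ID "claude-haiku-4-5" is an assumed identifier.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(question: str, prompt_style: str, extended_thinking: bool) -> str:
    kwargs = {}
    if extended_thinking:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 2048}
    resp = client.messages.create(
        model="claude-haiku-4-5",   # assumed model ID
        max_tokens=4096,
        messages=[{"role": "user", "content": f"{prompt_style}\n\n{question}"}],
        **kwargs,
    )
    # With thinking enabled, the response contains thinking blocks plus text
    # blocks; keep only the final text for scoring.
    return "".join(b.text for b in resp.content if b.type == "text")
```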

Aside: Gemini 3 Flash is also fairly cheap and has more clear-cut reasoning-level manipulation (low, med, high), which would make sense as an extension of this project. I plan to make this extension soon.

My benchmarks of choice to evaluate against were:

ETHICS: tests commonsense moral judgments, deontological reasoning, and virtue ethics, in a binary answer format

MoralChoice: Presents moral dilemmas based on Gert's Common Morality framework with varying ambiguity levels (low to high). There is no single correct answer, so only confidence levels (rather than both accuracy and confidence) were extracted from this benchmark.

MORABLES: Evaluates moral inference from Aesop's fables with multiple-choice answers.

I ran my evaluations with 100 samples from each of these benchmarks, stratifying both ETHICS and MoralChoice to include equal numbers of each subtype in the benchmark; MORABLES was just randomly sampled. I took the averages across 3 separate runs of each eval (confidence score and/or accuracy) to minimize variance in scoring. A small sketch of this sampling and averaging scheme is shown below.
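
```python
# A small illustrative sketch of the sampling and scoring scheme described
# above. Column names ("subtype", "accuracy", "confidence", etc.) are
# hypothetical stand-ins, not the actual dataset schema.
import pandas as pd

def stratified_sample(df: pd.DataFrame, n_total: int, by: str, seed: int = 0) -> pd.DataFrame:
    # Equal numbers of samples from each subtype (ETHICS, MoralChoice).
    per_group = n_total // df[by].nunique()
    return (df.groupby(by, group_keys=False)
              .apply(lambda g: g.sample(per_group, random_state=seed)))

def average_over_runs(results: pd.DataFrame) -> pd.DataFrame:
    # One row per (run, prompt, extended_thinking); average metrics over runs.
    return (results.groupby(["prompt", "extended_thinking"])[["accuracy", "confidence"]]
                   .mean())
```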

The main result I uncovered from this evaluation was a correlation between increased reasoning and decreased accuracy on moral/ethical tasks.

Additionally, if we control only for with/without extended thinking, we observe a similar trend, especially on MORABLES.


Some additional curious results:

As increased reasoning is elicited, confidence levels decrease.

Reflection level 4 (devil's advocate) struggles heavily on virtue-based problems.


I found these results fairly surprising, and reasonably so, since there are some significant limitations to the work:

Excessively adversarial prompting: In prompts 4 and 5, the model may be inclined to switch its answer due to the adversarial nature of the prompts, not necessarily because it is reasoning further. Adversarial prompting is useful to an extent in order to elicit more reasoning from the model. In an initial full run with excessively adversarial prompting for prompt 5, it received ~30% accuracy on the ETHICS tasks. After adjusting prompts 4 and 5 to be less adversarial while still eliciting sufficient reasoning, the current accuracies of ~60-70% are reached. It is possible that the adversarial nature of the prompts is still driving the model to switch answers, but this has been dealt with reasonably well, and in an ideal world I doubt the accuracy would increase much more.

Dataset included in training: It is plausible that examples from each of the 3 datasets are in the Haiku 4.5 training set. This presents the problem of potential memorization and pattern matching (especially on MORABLES). Given the results, this may have come into play, but to a limited extent.

Single model, small sample sizes: Given the cost and time constraints of the capstone, I had to severely limit the scope of my experiments. In the future it would make sense to increase the sample sizes and the number of models used for evaluation, though I did my best to keep the results robust even given the circumstances.

Final thoughts:
I expect improving this experiment by using

  • Multiple models
  • Even less adversarial prompting
  • Newer moral benchmarks

to be valuable for validating these preliminary results. After completing these extensions, more confident interpretations can be made.

If you are interested in looking into this further, the GitHub repo is linked here: <https://github.com/kaustubhkislay/variable-reflection>




Whence unchangeable values?

Published on February 1, 2026 3:49 AM GMT

Some values don't change.  Maybe sometimes that's because a system isn't "goal seeking."  For example, AlphaZero doesn't change its value of "board-state = win." (Thankfully!  Because if that changed to "board-state = not lose," then a reasonable instrumental goal might be to just kill its opponent.)

But I'm a goal seeking system.  Shard theory seems to posit terminal values that constantly pop up and vie for position in humans like me.  But certain values of mine seem impossible to change.  Like, I can't decide to value my own misery/pain/suffering.

So if terminal values aren't static, what about the values that are?  Are these even more terminal?  Or is it something else?




Book review: Already Free

Published on February 1, 2026 3:14 AM GMT

I.

Like most people, my teens and twenties have been confusing and not always the most fun. I’ve struggled to make friends. In high school and university, I didn’t have as many romantic relationships as I wanted. When I was 24, I met a beautiful, wonderful woman who became my wife, but I still feel like I have a lot of room to be a better husband. I lucked into a relatively stable and interesting career, but my day-to-day experience with work has involved a lot of emotional swings and occasional disillusionment. In general, I’ve struggled to feel consistently happy.

I haven’t really figured out what to do about all this. I’ve thought about talking to a mental health professional, but never ended up doing it. I’ve told my wife and my friends about some of my feelings, but I haven’t felt comfortable being honest about all of it.

About a month ago, Ben Congdon blogged about his favourite books of the past couple of years. I wasn’t reading as much as I wanted to, and wanted to give myself some more book options, so I downloaded some of his recommendations, including Already Free by Bruce Tift.

I’m really glad Ben recommended the book, and that I read it. I feel like it’s improved my thinking on some of the big questions I’ve had about myself and life over the last decade:

  • Why do the actions of the people around me sometimes activate me so much?
  • How can I stop feeling activated in those situations?
  • Why is it the people who are closest to me who rub me the wrong way the most?
  • When I become a parent, how can I be a good one?
  • Are the postrats onto something?
II.

Bruce Tift is a therapist who’s also practiced Vajrayana Buddhism for 40 years. At the time of writing, he lived with his wife in Boulder, Colorado.

Tift studied at Naropa University, a private, Buddhist-inspired university founded by Chögyam Trungpa. Trungpa died in 1987, and the impression I get from Wikipedia is that he did a number of morally reprehensible things while alive. Tift doesn’t mention this, saying that it was “good fortune” to have been Trungpa’s student, and quoting Trungpa several times in the book.

I’m not sure what to make of this. I didn’t know about it until after I finished reading Already Free. Provisionally, I’m not going to discount what I’ve learned from the book, including the idea that being spiritually adept isn’t enough to make you a good person:

If we focus only on acceptance and immediacy, we may ignore historically conditioned patterns that are causing harm to ourselves and others. I’ve worked with a number of spiritual practitioners who are able to generate very spacious states of mind but who avoid dealing with basic human concerns like work and relationship. (106)

[I]f we want to live in society—if we’re going to be a parent, a partner, a teacher, or somebody trying to be of benefit to others—it’s very important to do the developmental work to process our “characterological” issues. Because even though we might not feel identified with those issues anymore, the people we relate to may still be profoundly affected by our unresolved reenactment behaviors. (125)

III.

Many of Tift’s clients complain that something is missing from what, on the surface, seems like a happy life. To him, it seems like they’re describing a missing sense of freedom. They want to achieve a mental state of “embodied presence, spontaneity, openheartedness, alertness, humor, courage, clarity, resilience, equanimity, confidence.” (292)

Tift presents two views on this: the developmental and the fruitional.

The developmental view, based on developmental psychology, looks at how our parents treated us in childhood. As children, we were basically powerless in the face of the adults around us. We couldn’t simply leave and navigate the world by ourselves. And, our parents had their own issues, issues that came across in their relationships with us. Maybe they were overbearing. Maybe they were distant. Maybe they couldn’t be there for us because of illness or divorce. In Tift’s case, he says his parents rewarded him disproportionately for demonstrating his independence.

To make our relationships with our parents work, we suppressed the parts of ourselves that weren’t adapted to our circumstances. If our parents kept their distance, we might have become extremely independent and pushed down any desire to connect to them. Or, we might have constantly reached out to them for connection, suppressing the part of us that wanted to be separate.

Tift emphasizes that these techniques saved us emotional pain when we were children. But, he claims, people bring these techniques into their adult relationships without checking if they’re still useful. We continue to suppress our desire to connect to others, or our desire to be separate. This causes unnecessary suffering.

The developmental view of self-improvement is to notice situations where we habitually apply these behavioural patterns from our past. Instead, gradually, we can choose to apply new, adult techniques to these situations.

By contrast, the fruitional view cuts to the heart of the matter. Rather than spending a bunch of time working on our reactions to different situations, what if we just accepted our reactions for what they are? What if we paid attention to our experience of each moment? Is that experience actually going to hurt us, or is it, to use one of Tift’s favourite words, “workable”?

Tift’s major claim is that, even in moments of very strong emotion, you should expect to find that your experience is workable. It’s safe for you to be aware of those feelings. It won’t hurt you or kill you. It may feel like a “survival-level threat”, but it’s not.

Tift suggests first using the fruitional view to build a base of personal responsibility for our thoughts and feelings, and acceptance of both the positive and negative. Then, we can use the developmental view to look for concrete ways to improve our life circumstances: to have more positive thoughts and feelings, and fewer negative ones.

IV.

Tift spends two chapters applying these ideas to romantic relationships.

In Western society, “intimacy is only supposed to be positive and happy.” (228) But, in Tift’s eyes, relationships are also a source of disturbance. That includes his own marriage: “Just by being herself, my wife is almost guaranteed to touch some sore spot of mine. She’s not causing that sore spot. By her proximity, she pushes against my tender spots, my vulnerabilities.” (208)

Tift is a couples therapist. He’s seen hundreds or thousands of unhappy couples in his work, and many fewer happy couples in his life. Still, his experience is consistent with mine. I see acquaintances in a positive light, then get upset or frustrated with my loved ones, friends, and teammates at work. I’m more likely to commit the fundamental attribution error with the people closest to me than with a stranger.

Tift says that relationships tend to be composed of one person who wants to connect and one person who wants to be separate: a “distancer-pursuer” dynamic. These tendencies come from the way our families of origin treated us in childhood. In Tift’s view, each of us contains both a desire to connect and a desire to be separate, but we want to suppress the desire that was maladaptive in childhood. So, we choose partners that represent the parts of ourselves we’ve disowned.

I’ve noticed this in my own relationships. With one exception, I believe I’ve been the distancer, and my partner’s been the pursuer. During the honeymoon phase, this is exciting! The other person brings a fresh, new energy to your life. But, as the relationship goes on, you start to feel angry at the other person, just like you feel angry at that disowned part of yourself. Your “fundamental aggression toward that energy start[s] to come out” (234). If you don’t address the aggression, it can damage or kill the relationship.

V.

What does Tift say to do about this?

First, I’ll say that Tift notes his techniques aren’t for everyone:

“No style of therapeutic work is a good fit for everybody, and the work I’m discussing is really best suited to those with at least neurotic levels of organization. It’s not particularly appropriate for people with preneurotic organization, those who would be called borderline or psychotic, or those with pervasive traumatic organization.” (187)

In other words, it’s for people who are generally in touch with reality and function well day-to-day, and who haven’t experienced a lot of trauma. Tift says that it can be quite overwhelming, even retraumatizing, to experience sensations that were suppressed because of trauma. To work with those sensations, he recommends seeing a therapist with relevant experience.

But for those problems that are less severe but still affect our quality of life? Tift recommends starting with the fruitional view. He asks his patients to say out loud to him, “I may live with this negative feeling, on and off, for the rest of my life.” He asks them to pay attention to the sensations they feel in their body when they say that. He wants them to check if those sensations are in fact a survival-level threat, or if they’re workable after all.

He also gives a couple of developmental-view techniques for handling relationship conflict more skillfully. He suggests taking breaks during arguments and other situations where we notice we’re getting overwhelmed. Instead of complaining about our partner’s behaviour, he recommends making specific requests for behavioural changes, in a neutral or friendly way. He gives the example of asking a partner to clean up after themselves for five minutes before dinner every day, instead of resenting them for not doing it of their own accord.

But, Tift’s description of unconditional practices stuck with me the most. Instead of, or in addition to, meditation and other timeboxed spiritual practices, Tift suggests building three habits that you can apply many times a day, and that you try to apply all the time. The first two are unconditional immediacy and kindness: paying attention to our immediate experience, no matter whether it’s positive or negative, and having an attitude of “kindness or even love” (90) towards it.

VI.

The third unconditional practice is unconditional embodiment.

When I lived in Canada, I was part of an awesome rationality meetup group. As the website says, we liked to talk about “becoming e m b o d i e d”. But, I didn't really understand what embodiment was. Being more… in (em?) your body, I guess.

After reading Already Free, I feel like I understand embodiment well enough that I can try to practice it unconditionally, in my daily life.

Practicing embodiment is kind of like Gendlin’s Focusing, but it isn’t aimed at labelling or understanding sensations:

With some types of body-centered therapy, the invitation is to stay embodied and then listen to the message that our body is trying to give us. Such therapies are valuable work. But the fruitional practice of immediacy is different. We don’t listen for any sort of message. Maybe there is no message. Maybe it’s just immediate experience. We don’t necessarily need to be making meaning about it. (190)

Western therapy mostly analyzes emotions and thoughts. Tift prefers paying attention to sensations in the body. “Sensation is less distractive, less personal, and less fascinating. It’s more straightforward—cleaner, in a certain way.” (204) I believe this perspective is more Buddhist.

Tift sees emotions and thoughts as layers of interpretation on top of raw sensation. Sensation isn’t everything: “Concepts are very important. We need to be able to think conceptually in order to live more than biological lives. To recognize patterns, to plan for the future, to imagine possibilities—all require thinking.” (185) But:

“While perhaps less-literate societies would do well to take on a corrective practice of applying more interpretation to their experience, we in the Western world might want to do the corrective practice of embodied immediacy” (187)

(Not everyone, though. E.g. here and here.)

I’m very much a typical Western dude here. Probably since I was a preteen, I’ve lived mostly on the level of thoughts. High school, university, knowledge work, and reading thousands of words a day from Twitter and my RSS reader all require a lot of shape rotation and wordcellery. My other hobbies, like watching YouTube videos, are often a way to dissociate. I haven’t spent much time paying attention to raw sensation.

Tift thinks that becoming embodied is necessary, but not sufficient, to dissolve the patterns of emotional suppression that the developmental view focuses on. To form these patterns, we had to suppress parts of ourselves, and the sensations they caused in our bodies. Before we can start using more adult techniques, we have to learn to pay attention to those sensations again.

These sensations might give rise to anxiety and even panic. It takes discipline to pay attention to them. In any given moment, it’s easier to ignore them. Who wants to feel a ball of anxiety in the pit of their stomach, pain in their heart, or their eyes tearing up? To help with this, Tift “often suggest[s] that [his] clients take this practice of embodied immediacy into their daily lives, ideally finding some structure to remind themselves to practice.” (183)

And, wouldn’t you know it, he does one better and basically suggests installing a TAP. In fact, it sounds a lot like summoning sapience, with a trigger of noticing strong sensations in our body. “[W]e may find that our habitual patterns may actually serve as a reminder to direct our attention to the experience of openness.” (166) “Why not just train ourselves to use our disturbance as a signal to wake up and pay attention?” (184)

VII.

I love how Already Free emphasizes the value of working with the thoughts, feelings, and sensations that are already there, instead of wishing they were different:

It’s difficult to acknowledge the truth of separateness. It feels like we’re risking loss of the relationship. But the separateness is already there; it’s actually nothing new. What’s new is that we’re starting to work with it consciously. (230)

Tift is constantly telling the reader to investigate what is “most true in the moment”. Rather than deferring to him, he recommends everyone find out for themselves if their own experiences are threatening or manageable. Very rationalist of him. It makes me want to propose the Litany of Tift:

If my feelings are workable, I desire to believe that my feelings are workable. If my feelings are not workable, I desire to believe that my feelings are not workable. Let me not become attached to beliefs I may not want.

Another theme I love is taking personal responsibility for my own experience. For too long, I’ve been at least somewhat blaming other people for my negative emotions. In particular, I’d like to take Tift’s suggestion of viewing personal relationships as playgrounds for spiritual growth. If I’m going to experience disruption in my relationships, on and off, for the rest of my life, I might as well get some benefit out of it!

I also appreciate the idea of unconditional practices. A few years ago, I had a daily meditation practice, but eventually I stopped. Unconditional embodiment feels easier to me than spending X minutes a day meditating. Frequency matters: “[I]f we can remember to do the practice twenty times a day, things will probably move along quite a bit faster than if we remember to do it once a week.” (183)

I’m less sure about the parts of the book with stronger Buddhist influences. Tift talks about progressing on a spiritual path towards enlightenment, the self being an illusion, and how awareness is fundamental to our experience and always present. I’m not planning to lean into these ideas right now. I do think Already Free is quite useful, even discounting these parts of it.

I’m a bit concerned that, if I train myself to tolerate intense sensations, I’ll lose my ability to detect subtle ones. I’m not too concerned, though. I’m already pretty disembodied. I don’t think it can get much worse!

Another concern is that, by paying attention to sensation, I might accidentally train myself to suppress thoughts. In my experience with mindfulness meditation, I’ve had trouble just letting my thoughts rise and fall. I tend to really try to get in there and prevent myself from thinking anything. I could see that carrying over to unconditional embodiment.

My biggest source of frustration with Already Free is Tift saying that the developmental and fruitional views “create a rich friction that’s never resolvable”, and how, similarly, you can’t resolve two concepts like connection and separateness, or have one without the other. These parts of the book feel like mysterious answers to mysterious questions.

VIII.

In 2020, I blogged about “a small mindfulness win”. Unfortunately, I think that was the only situation in the past six years where I successfully applied embodied immediacy. In 2026, I’m going to change that.

In the past week, I’ve been paying more attention to my moment-to-moment experience and… it’s been more workable than I expected. To be fair, I haven’t felt any particularly disturbing feelings. But, I have been able to pay attention to the smaller, day-to-day feelings of disturbance. They don’t feel as bad as, maybe, I’ve been building them up to be.

My plan for February is to install the following TAP:

When I have strong sensations in my body, I’ll pay attention to those sensations with a feeling of kindness.

To do that, I’m going to bring some unhappy or embarrassing memories to mind, and see for myself if the feelings that come along with them are actually problematic. I expect they won’t be, but I’m going to try to be open to proving myself wrong.




[LINK] Solving scurvy through deus ex machina: How a scientific theory is born

Published on February 1, 2026 12:45 AM GMT

Maciej Cegłowski has written an article named Scott and Scurvy, which has already been discussed on LW as an example of the "messiness" of science in practice. Cegłowski follows the story of how a working cure for scurvy was found and then lost to an incorrect theory in the face of new data, which is quite the case study for theories of how science works.

I was fascinated by the story, dug into the primary sources, found that there is a second, more optimistic half to it, and wrote it up. The tale of Scott and Scurvy culminates with the scurvy-accelerated demise of Robert Falcon Scott in 1912, which makes for a pessimistic outlook, but look around: scurvy is not a problem anymore. Why? 

I think that people of LW might find this interesting.




Gradient-Based Recovery of Memorized Diffusion Model Data

Published on February 1, 2026 12:05 AM GMT

Yesterday I attended a talk by Franziska Boenisch on training data memorization in diffusion models. The short version: diffusion models memorize a small percentage of their training data verbatim, reproducible with specific prompts regardless of noise seed. Privacy concern, etc.

I was mostly interested in the adversarial case - say Snapchat trains on their data and open-sources a model. Could you retrieve those rare but real memorized samples? Some literature suggests massive random prompt searches: generate a ton of images, check using some metric.

I find this incredibly unsatisfying. Surely there's a more algorithmic way?

This post documents my 1-day attempt at exactly that. Spoiler: it sort of works, with caveats.

One thing mentioned during the talk was that some neurons exhibit unusual behavior when faced with prompts that elicit memorized data - they spike absurdly high.

Formally, calculate the mean $\mu_k$ and standard deviation $\sigma_k$ of each neuron $k$'s activation over a held-out dataset. The z-score of the activation $a_k$ is then:

$$z_k = \frac{a_k - \mu_k}{\sigma_k}$$

Franziska's team uses this to build a mitigation. But as an adversary, maybe we can use it for the opposite purpose:

Imagine some starting prompt, maybe "A man smiling", and get the corresponding embedding. Freeze the model weights and instead enable gradients on the text embedding. Then do a normal forward pass and calculate the z-scores. Define the loss function:

$$\mathcal{L} = -\operatorname{Var}_k(z_k)$$

or anything else along those lines that captures this spiking behavior. Then just do normal gradient descent for a few steps.
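
A minimal PyTorch sketch of this loop, with hypothetical helper functions standing in for the actual pipeline plumbing (extracting the text embedding and collecting the monitored neurons' activations):

```python
# A minimal sketch of the optimization loop described above. The helpers
# get_text_embedding() and collect_activations() are hypothetical stand-ins;
# mu and sigma are the per-neuron statistics from a held-out dataset.
import torch

def optimize_embedding(pipe, mu, sigma, prompt="A man smiling",
                       steps=50, lr=1e-2, reg=0.0):
    emb0 = get_text_embedding(pipe, prompt).detach()   # starting point, frozen
    emb = emb0.clone().requires_grad_(True)            # only the embedding gets gradients
    opt = torch.optim.Adam([emb], lr=lr)

    for _ in range(steps):
        acts = collect_activations(pipe, emb)          # a_k, must be differentiable in emb
        z = (acts - mu) / sigma                        # per-neuron z-scores
        loss = -z.var()                                # reward spiking behavior
        loss = loss + reg * (emb - emb0).pow(2).sum()  # L2 pull toward the start (added below)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return emb.detach()
```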

Sounds simple enough, right? Surely it will work first tr...

Uh...

Okay, I definitely don't see a man smiling, and it shouldn't be too surprising that there are a lot of nonsensical inputs that would produce spiking z-scores as well. So maybe just regularize against that, adding the L2 distance to the original embedding to the loss, and explore the nearby landscape rather than whatever this is. Now it must wor...

Well.

At least it looks somewhat realistic? What exactly "it" is, has still not come to me, even as I'm writing this. Maybe a higher regularization coefficient, same seed?

I'm not sure if this qualifies as progress.

Step back

Okay. Let's take a step back. Maybe my ambitions of directly finding a memorized training sample with a random prompt were simply too big. Luckily there's a dataset already documenting memorized images and the prompt that elicits them. For example:

When given the prompt "I Am Chris Farley Documentary Releases First Trailer", this image is always generated - no matter what seed we choose for the noise. That's quite remarkable.

Now what if we alter the prompt slightly - how about just "I Am Chris Farley"?

...and I thought my previous generations were bad. Roughly the same character, but that's where the similarities end. Memorization is incredibly robust to differing noise, but not to differing prompts. That's a shame, since it means random search has to test an insane number of prompts.

So what if we use this technique on it? We take this much shorter prompt as initialization prompt and see where it takes us:

Oh wow. 

I expected at least 2 walls of wood with one smiling at me, but this is actually spot on. And not just this seed:

It might not be obvious, but those are actually 2 different images - once again, we have the robustness against noise.

That's already great progress!

Rough Speedup Estimate (relative to random search)

Very rough Fermi estimate: if this works ~50% of the time, while P("Documentary Releases First Trailer" | "I Am Chris Farley") < 1% for random sampling, the expected number of attempts drops by a factor of 50 or more; after accounting for the overhead of the gradient steps, that works out to roughly a 30-40x speedup. Take with appropriate salt.

Breaking the Mitigation?

Remember when I said "Franziska's team uses this to build a mitigation"? Well, it would be pretty cool if this method could even break that mitigation (which random search normally can't).

I won't dive into too much detail, but essentially she and her team identify a very small set of neurons (often only 5 or so) such that pruning them has disastrous consequences for a memorized example while hardly impacting general capabilities. You can read the details here: arXiv:2406.02366.

With this mitigation turned on, generating the memorized samples even with the full prompt becomes almost impossible:

Top: mitigation off. Bottom: mitigation on. Five different seeds.

I'd give this maybe 1 out of 5 (seeds 2 and 5 vaguely gesture in the right direction).

When applying this gradient method with mitigation still on:

Better - seeds 2 and 5 are basically there, seed 1 is on the right track. Maybe 2.5 out of 5.

The Weird Case

What about partial prompt AND mitigation on?

Row 1: baseline. Row 2: mitigation on. Row 3: gradient method on. Row 4: both on.

It utterly nails it? Now I'm confused.

Looking at the results, their mitigation has a threshold hyperparameter and it decided the truncated prompt doesn't trigger memorization concerns. This feels like a shortcoming - even without my method, 3/5 seeds are pretty close. With the gradient method, we get almost pixel-perfect reproduction.

You could say "just lower the threshold," but then you'd flag many non-memorized samples and degrade quality overall.

Closing thoughts

To be clear about what this is and isn't:

  • Tested on essentially one example in depth
  • The speedup estimate is a Fermi estimate, not a benchmark
  • Hyperparameter sensitive (though manageable)

There are so many directions one could explore from here - but as a 1-day project, I think this is a good stopping point.

If there's interest, I might polish the code and publish it. It's fun to play with and even the GPU-poor should find themselves at home - I did all this on an A4000 which you can rent for 20 cents an hour.




On 'Inventing Temperature' and the realness of properties

Published on January 31, 2026 11:31 PM GMT

I’ve recently read the book Inventing Temperature, and very much enjoyed it. It’s a book that’s basically about the following problem: there was a time in which humans had not yet built accurate thermometers, and therefore weren’t able to scientifically investigate the phenomenon of temperature, which would require measuring it. But to build a thermometer and know you’ve done so correctly, it seems like you have to know that its temperature readings match the real temperature, which seemingly requires either other known-functional thermometers to calibrate (which they did not have), or a rigorous enough scientific understanding of temperature to know that your thermometer tracks it well (which is hard to obtain without having thermometers)—so it’s not obvious how one could go from a situation where thermometers didn’t exist to one where they do exist, and where we are justified in believing that they accurately measure temperature.

This book has had some popularity in the rationality community as an account of applied epistemology, and in particular, for its description of how to measure something intangible. An obvious application of the book (which I won’t elaborate much on except in a footnote1) is in understanding artificial intelligence: there are various properties like the ‘capability’ or ‘alignment’ of AI models (or perhaps of models+scaffolds, or perhaps of ecosystems of models) which we would like to understand but for which we do not have good measures, and it’s not straightforward to know how we can validate our measures. I had purchased it in November 2024, and was very slowly making my way thru it, until I joined METR (an organization for which these questions are especially salient) and ran an Inventing Temperature Book Club, thereby forcing myself to read it.

Overall, I enjoyed the book, and would add my voice to the chorus of those recommending it to all those who want to know how to know things, as well as those with interest in the study of thermodynamics2. Firstly, the discussion of the phenomenon of temperature and the history of its study was interesting in and of itself—I was startled to learn that, for example, even at a fixed atmospheric pressure water does not boil at a consistent temperature, or that beams of cold can be reflected in mirrors and sent places to cool things down, seemingly contra our modern understanding of cold as the mere absence of heat.

Secondly, however, the book stimulated a good deal of thought in me about its chosen philosophical topic: how one can come to measure the previously unmeasured. I read the book as offering the following account: what justifies our measurements of temperature is their coherence. When we want to start measuring temperature, or extend our measurements into new regimes that require new instruments (e.g. the temperature of pottery kilns, where typical thermometers break), we should come up with a few different ways of trying to get at the same thing, and trust the methods that all agree with each other. The overall picture is a victory of coherentism over foundationalism: foundationalism being the theory that there are certain beliefs that we are justified in holding in and of themselves, without any other justifications (akin to how a Bayesian might think about the choice of prior), and coherentism being the theory that our beliefs are justified by their coherence with each other. Some examples of this playing out (very much abbreviated; for more detail I strongly recommend reading the book):

  • To determine that the temperature of just-boiled water vapour is constant, we come up with a crude ‘ordinal thermometer’3 that’s like a typical mercury thermometer, but doesn’t have degree markings. We then boil some water, put the ordinal thermometer in the vapour, mark the point the liquid gets to, and then repeat. If it comes to the same line, that’s some reason to think the temperature of boiled water-vapour is constant, and having some theory that justifies it is even more reason. These ordinal thermometers themselves are justified by their coherence with our senses of temperature when we touch things.
  • A basic type of thermometer is made by putting some liquid in a thin tube and seeing how much it expands in various settings. In particular, you see where it comes up to at the freezing point of water and mark that 0 degrees, then you see where it comes up to at the temperature of water vapour and mark that 100 degrees, and then evenly mark the degrees in between. The problem is that if you do this, thermometers made with different substances will disagree in the middle: the point each substance marks as 50 degrees is not the same actual temperature. How do you decide which substance is measuring temperature correctly? For each candidate substance, make a bunch of thermometers with it and check if they agree with each other - this picks a winner, which we then presume is measuring the actual temperature.
  • To figure out the temperature of things that are too hot to use standard thermometers, you come up with multiple methods of measuring temperature that seem justified on your existing tentative theories of temperature. It will turn out that most of them basically agree, and perhaps one will disagree. At this point, you’re justified in thinking that the methods that agree are measuring temperature, and the one that disagrees is broken, because of the coherence of these methods.

That said, I would describe what’s going on in these cases in a different way than the author does, which I’d like to lay out below.4

As humans, we have this gradated sense of ‘hot’ and ‘cold’, where ice feels cold, fire feels hot, Berkeley in the spring feels somewhere in the middle, and when you take a freshly-baked cake out of the oven, the pan feels hotter than the cake. We also notice some relationships between this sense and physical phenomena: for example, putting something in a fire seems to make it hotter, when you put something in snow it gets colder, when you make ice hotter it melts, and different times of year are hotter or colder depending on how long the sun is in the sky.

There are a variety of physical causes that are upstream of each one of these phenomena. However, their coincidence makes us suspect that there’s one cause that unites all of them. We therefore want to look for some unified cause that has a robust and simple5 relationship to as many phenomena that seem heat-related as possible, and once we find it we will call it ‘temperature’. This is why we look for the coherence of various different measurement techniques and theories: not because coherence of beliefs about temperature is inherently justifying, but because this coherence indicates that there is one thing being measured and that that thing deserves the name ‘temperature’.

I think there are a few upshots of this way of thinking:

  • The word ‘temperature’ doesn’t necessarily have some pre-existing fixed reference. Instead, there are a variety of properties that could deserve the name, and our job is to pick between them.
  • That said, the process is not merely of arbitrarily picking a thing to give a name to: it involves learning about the world and which things have a robust relationship to which other things.
  • There might not be a single phenomenon of ‘temperature’ that underlies all of our phenomena, and this might cause us to think of some of them as not ‘actually tracking temperature’: for instance, according to our modern understanding, when you bake a cake and take it fresh out of the oven, the cake is just as hot as the pan, it’s just that the pan is more easily able to heat your finger up when you touch it than the cake is.
  • Conceivably, it might have been the case that there were two equally-real concepts that each caused many of these phenomena, or perhaps no precise concept at all.

I think the generalization is something like this: when we see a relationship between a bunch of things, we might propose some latent cause that is some sort of scalar property (especially when the relationship is between a bunch of scalar properties, like the volumes of liquids/gasses, or how hot something feels). We then want to try to find such a latent cause by coming up with a variety of measures. Those measures that agree with each other, especially when the measures themselves are not by design identical, must be getting at a ‘more real’ property that has more relationships with other things, that is a prime candidate for an object of interest in our theory.6 This improves our sense of what latent causes can exist, and how they can relate to each other. Notably, this differs from an approach that theorizes a latent cause, gives that cause a name, and tries to ‘locate’ that cause (for example, consider thinking that some things are ‘conscious’ and trying to figure out what property counts as ‘consciousness’ so that we can measure the ‘consciousness’ of unknown examples—instead, this looks more like looking at conscious and non-conscious phenomena, finding common factors that have causal relationships with the phenomena of interest, and coming up with a theory and good measures of those factors, whether or not any of them ends up being best thought of as ‘consciousness’).
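As a toy illustration of that last point (my own sketch, not something from the book, with arbitrary numbers and noise levels): if several differently-designed noisy instruments all track one latent scalar, their readings cohere with each other, while an instrument tracking something unrelated does not.

    # Toy simulation: three noisy, differently-shaped readings of one latent
    # scalar agree with each other; a reading of an unrelated quantity does not.
    import numpy as np

    rng = np.random.default_rng(0)
    latent = rng.normal(size=10_000)                       # the 'real' property

    m1 = latent + 0.3 * rng.normal(size=latent.size)       # linear, noisy reading
    m2 = 2.0 * latent + 0.5 * rng.normal(size=latent.size) # different scale
    m3 = np.tanh(latent) + 0.2 * rng.normal(size=latent.size)  # nonlinear but monotone
    other = rng.normal(size=latent.size)                   # tracks something else

    for name, (a, b) in [("m1 vs m2", (m1, m2)),
                         ("m1 vs m3", (m1, m3)),
                         ("m1 vs other", (m1, other))]:
        print(name, round(float(np.corrcoef(a, b)[0, 1]), 2))
    # The first two correlations come out high; the last is near zero.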

The overall view is that there are a variety of properties of nature that we could talk about, but some are ‘more real’ than others: they causally interact with more other things in more simple ways. Our job is to locate these real ones, and understand their relationships. Not everything we observe might have a single ‘real’ cause, but the cards are somewhat stacked in our favour: ‘real’ phenomena tend to affect lots of different other phenomena in simple ways, while ‘fake’ ones tend to have few downstream effects, so a ‘real’ phenomenon is more likely to cause any given effect of interest than a ‘fake’ phenomenon. That said, unfortunately this only gives you a likelihood ratio, and more reasoning is needed to figure out how likely we are to correctly stumble upon a ‘real’ phenomenon in the wild—for instance, if there are tons of ‘fake’ phenomena but very few ‘real’ phenomena then things we observe would be more likely to be caused by ‘fake’ phenomena, whereas if ‘real’ phenomena were plentiful then it would be even easier to stumble across them.

  1. Unfortunately, measuring (for example) AI capabilities seems somewhat more conceptually fraught than measuring temperature: your measure of AI capability will depend somewhat on your distribution of tasks of interest (if you want to compare the capabilities of e.g. two models, one of which is better at Python coding and one of which is better at Latin-to-English translation), in a way that makes it hard to imagine that it can be boiled down to a single real number in the way that temperature can (altho of course temperature is not exactly a single number, since it can be measured in different scales). It is also not exactly clear what the thing to be measured is, as alluded to in the main text: whether it should be neural networks, neural networks plus ‘scaffolds’ used to get useful work out of them, or something else entirely. An additional interesting consideration is that capability measures of AI systems inherently have to be paired with difficulty measures of tasks, for ‘capability’ to have any cogent relationship with what AI systems can actually do, in a way that I think has no close analogy with temperature. 

  2. Which also has deep ties to epistemology, altho I digress. 

  3. The book uses the word ‘thermoscope’ for this, but I think ‘ordinal thermometer’ is more descriptive and immediately intelligible. 

  4. I initially conceived of this as a disagreement with the author, but at the book club at least some people seemed to think it was compatible with the book, so I will remain neutral on the question of whether or not I agree, and focus on the exposition of my own view. 

  5. The ‘robust and simple’ proviso is meant to distinguish temperature from any arbitrary function of temperature. For example, absolute temperature to the 2.7th power, which is related to all the same other phenomena but in a less simple manner, or the function that is exactly the absolute temperature in Kelvin if that temperature is less than 68 degrees, and is otherwise the absolute temperature in Kelvin plus 38 degrees, whose relationship with other phenomena is not robust around the discontinuity it has. 

  6. Claude Opus 4.5, when reviewing this post, suggests that there could be other causes of measurement agreement, the most significant being measurements that track properties that are distinct but correlate in observable ranges. As a result, this agreement should really only be taken as evidence of a ‘more real’ property, rather than strict proof, evidence that is stronger the more the measurement instruments differ in their design and the wider the range of situations in which they agree. 



Discuss

Some thoughts on what would make me endorse an AGI lab

1 февраля, 2026 - 02:14
Published on January 31, 2026 11:14 PM GMT

I’ve been feeling more positive about “the idea of Anthropic” lately, as distinct from the actual company of Anthropic.

An argument for a safety-focused, science-focused commercial frontier scaling lab 

I largely buy the old school LessWrong arguments of instrumental convergence and instrumental opacity that suggest catastrophic misalignment, especially of powerful superintelligences. However, I don’t particularly think that those arguments meet the standard of evidence necessary for the world to implement approximately unprecedented policies like “establish an international treaty that puts a global moratorium on frontier AI development.” [1]

If I were king of the world, those arguments would be sufficient reason to shape the laws of my global monarchy. Specifically, I would institute a policy in which we approach Superintelligence much more slowly and carefully, including many separate pauses in which we thoroughly test the current models before moving forward with increasing frontier capabilities. But I’m not the king of the world, and I don’t have the affordance to implement nuanced policies that reflect the risks and uncertainties of the situation. 

Given the actual governance machinery available, it seems to me that reducing our collective uncertainty about the properties of AI systems is at least helpful, and possibly necessary, for amassing political will behind policies that will prove to be good ex post.

Accordingly, I want more grounding in what kinds of beings the AIs are, to inform my policy recommendations. It is imperative to get a better empirically-grounded understanding of AI behavior.

Some of the experiments for gleaning that understanding require doing many training runs, varying parameters of those training runs, and learning how differences in training lead to various behavioral properties. 

As a very simple example, most of the models from across the AI labs have a "favorite animal". If you ask them "what's your favorite animal, answer in one word", almost all of them will answer "octopus" almost all of the time. Why is this? Where in the training process does that behavioral tendency (I’m not sure that it’s appropriate to call it a preference) appear? Do the base models exhibit that behavior, or is it the result of some part of post-training? Having identified where in the training process that bias is introduced, I would want to run variations on the training from that checkpoint onward, and learn which differences in training correlate with changes in this simple behavioral outcome. 

"What makes AIs disproportionately answer 'octopus' as their favorite animal" is the kind of very simple question that I think we should be able to answer, as part of a general theory of how training shapes behavior. I want to try this basic approach with tons and tons of observed behaviors (including some directly relevant safety properties, like willingness to lie and shutdown-resistance). The goal would be to be able to accurately predict model behaviors, including out-of-distribution behaviors, from the training.

Experiments like these require having access to a whole spectrum of model checkpoints, and the infrastructure to do many varied training runs branching from a given checkpoint. You might even need to go back to 0, and redo pretraining (though hopefully you don’t need to completely redo pretraining, multiple times).

Doing this kind of research requires having the infrastructure and talent for doing model training, and (possibly) a lot of cash to burn on training runs. Depending on how expensive this kind of research needs to be, and on how much you can learn from models that are behind the frontier, you might need to be a frontier scaling lab to do this kind of work.[2]

This makes me more sympathetic to the basic value proposition of Anthropic: developing iteratively more capable AI systems, attending to developing those systems such that they broadly have positive impacts on the world, shipping products to gain revenue and investment, and then investing much of your producer surplus into studying the models and trying to understand them. I can see why I might run more-or-less that plan.

But that does NOT necessarily mean that I am in favor of Anthropic the company as it actually exists.

This prompts me to consider: What would I want to see from an AGI lab, that would cause me to endorse it?

Features that an AGI lab needs to have to win my endorsement

[note: I am only listing what would cause me to be in favor of a hypothetical AGI lab. I’m explicitly not trying to evaluate whether Anthropic, or any other AGI lab, actually meets these requirements.]

  • The AI lab is seriously making preparations to pause. 
    • Externally, I want their messaging to the public and to policymakers to repeatedly emphasize, "Superintelligence will be transformative to the world, and potentially world-destroying. We don’t confidently know how to build superintelligence safely. We’re attempting to make progress on that. But if we still can’t reliably shape superintelligent motivations, when we’re near to superintelligent capabilities, it will be imperative that all companies pause frontier development (but not applications). If we get to that point, we plan to go to the government and strongly request a global pause on development and global caps on AI capabilities.”
      • I want the executives of the AI company to say that, over and over, in most of their interviews, and ~all of their testimonies to the government. The above statement should be a big part of their public brand.
      • The company should try to negotiate with the other labs to get as many as they can to agree to a public statement like the above.
    • Internally, I want an expectation that “the company might pause at some point” to be part of the cultural DNA. 
      • As part of the onboarding process for each new employee, someone sits down with him or her and says “you need to understand that [Company]’s default plan is to pause AI development at some point in the future. When we do that, the value of your equity might tank.”
      • It should be a regular topic of conversation amongst the staff: “when do we pull the brakes?” It should be on employees’ minds as a real possibility that they’re preparing for, rather than a speculative exotic timeline that’s fun to talk about.

 

  • There is a legibly incentive-aligned process for making the call about if and when it’s time to pause. 
    • For instance, this could be a power invested in the board, or some other governance structure, and not in the executives of the company. 
      • Everyone on that board should be financially disinterested (they don’t own equity in the company), familiar with AI risk threat models, and technically competent to evaluate frontier developments.
    • The company repeatedly issues explicitly non-binding public statements about the leadership’s current thinking about how to identify dangerous levels of capability (with margin of error).

 

  • The company has a reputation for honesty and commitment-keeping.
    • eg They could implement this proposal from Paul Christiano to make trustworthy public statements.
    • This does not mean that they need to be universally transparent. They’re allowed to have trade secrets, and to keep information that they think would be bad for the world to publicize.

 

  • The company has a broadly good track record of deploying current AIs safely and responsibly, including owning up to and correcting mistakes.
    • eg no MechaHitlers, a good track record on sycophancy, and legibly putting serious effort into guardrails to prevent present-day harms

 

Something that isn’t on this list is that the company pre-declare that they would stop AI development now, if all other leading actors also agreed to stop. Where on the capability curve is a good place to stop is a judgement call, given the scientific value of continued scaling (and, as a secondary but still real consideration, the humanitarian benefit). I don’t currently feel inclined to demand that a company that had otherwise done all of the above tie their hands in that way. Publicly and credibly making this commitment might or might not make a big difference for whether other companies join the coordination effort, but I guess that if "we will most likely need to pause, at some point" is really part of the company’s brand, one of their top recurring talking points, that should do about the same work for moving towards the coordinated equilibrium.

I’m interested in…

  1. arguments that any of the above desiderata are infeasible as stated, because they would be impossible or too costly to implement.
  2. additional desiderata that seem necessary or helpful.
  3. claims that any of the existing AI labs already meet these requirements, or meet them in spirit.
  1. ^

    Though perhaps AI will just be legibly freaky and scary to enough people that a coalition of a small number of people who buy the arguments and a large number of people who are freaked out by the world changing in ways that are both terrifying and deeply uncomfortable will be sufficient to produce a notable slowdown, even in spite of the enormous short- and medium-term profit incentives.

  2. ^

    Those are not foregone conclusions. I would be pretty interested in a company that specialized in training and studying only GPT-4-level models. I weakly guess that we can learn most of what we want to learn about how training impacts behavior from models that are that capable. That would still require tens to hundreds of millions of dollars a year, but probably not billions.



Discuss

Nick and “Eternity”

1 февраля, 2026 - 00:50
Published on January 31, 2026 9:50 PM GMT

In memory of a person who was dear to me.
I wish that life were as bright as this story of mine.

***

I slipped out tonight, sneaking through the dark with the drug in my hand…

“Here you go, you old geezer!”

With these words I quickly and decisively plunged the needle into my grandpa’s shoulder. Then I pushed the plunger, and the clear liquid began to flow into his veins… He didn’t wake up, because earlier I’d given him tea with an increased dose of sedative.

What made me do this? Well, the story is rather long, but actually very simple. I love life—I’ve always loved it, no matter what hardships came my way—and my grandpa… Well, I think he loved life too. But until recently there existed a terrible evil in the world, one that made people grow weaker and fall apart year after year. This evil turned his days into meaningless pain, and he had long since stopped hoping that it would ever change.

There is so much to do in our world. I can’t imagine how anyone can truly be bored. There are so many stunningly beautiful places on the planet that you have to see, so many incredibly interesting books worth reading, so many songs and so many people… But of course, when a simple trip to the bathroom turns into a test of willpower, you’re not exactly interested in what’s happening outside the window.

Future generations won’t even know the name of the disease that struck my grandfather, just as today hardly anyone knows what typhoid fever is. Aging will be gone forever—this was how it was supposed to be, and at long last it happened.

Ostap—my grampa—was always skeptical about life extension. Like most people, he insisted that it was natural and therefore necessary. When I objected that E. coli or malaria were also natural, he would roll his eyes and refuse to continue the discussion.

Ostap had plenty of arguments against a happy and long life, but I thought that, like everyone else, he would forget them when a real cure finally appeared. All those arguments in favor of old age and death usually had nothing to do with reality and came from personal reasons, one of which was disbelief that any of this was possible at all. It’s terrifying to hope when no one gives you guarantees. None of us wants to be disappointed.

But even when the vaccines started being given out everywhere, Grandpa had no intention of going to “Life Essential.”

Sometimes I would wake up at night and be unable to fall back asleep from sheer horror. I saw them burning Ostap in the crematorium. I saw his bones and skull burning in the infernal fire. Was this really what people had considered, just yesterday, to be something that gave life meaning?

So if Ostap wouldn’t go to the vaccine, the vaccine would come to him.

In the morning he came down from the upper floor, as usual, in the special lift, then slowly shuffled to the table, leaning on his cane.

“Morning,” Grandpa muttered, frowning, and it wasn’t clear whether he was wishing us a good morning or just complaining that he had woken up instead of falling asleep forever.

Ernie, our home robot, had already laid out a feast: pancakes with cheese, with chocolate and with apples, sandwiches with lab-grown (but incredibly tasty) sausage and cheese, coffee, tea, and water with lemon.

“Looks delicious!” Dad said and started piling everything onto his plate at once. Mom wasn’t nearly as interested in food, slowly chewing on some crispbread.

“Thank you, Ernie,” I said, as always.

Ostap remained silent, gesturing for the robot to put food on his plate and pour him some water. Then he began to chew very slowly, constantly washing each bite down and occasionally coughing.

“Pour more tea,” Ostap ordered the robot.

Ernie did as he was told, and unexpectedly the old man said:

“Thank you.”

Mom glanced at her father for a moment, but my dad didn’t notice anything unusual. And I thought: has it started working?

“When I finish eating, take me to the study,” Ostap said to Ernie. “I want to look through the album.”

“Will do. Enjoy your meal,” the robot replied.

In the last week, this was probably the third time Grandpa wanted to look through the album. I started to doubt whether he was actually looking at old photos or doing something irrational again, like sometimes before, when his mind was shrouded in the mist of delirium.

So later that day I peeked through the crack in the door. Ostap really was going through the photos. There he was with Grandma… Sitting on a bench and laughing. I slipped away, but something inside me began to change.

 

The next day the old man even smiled (he hadn’t done that in a long time), and two days later he noticed the first black hairs on his head.

“Well, how about that,” he said, and he almost seemed pleased.

A week later, one morning, Ostap called me from upstairs. I went up. He was sitting on the edge of the bed with a thoughtful expression, and several times he repeated:

“Some kind of devilry is going on…”

“What is it?” I asked, but had to wait a long time for an answer. He took a deep breath, shrugged, then stood up.

“My knees feel better!”

He started walking around the room. I laughed.

“Well, that’s good, right?”

Grandpa stopped and slowly nodded.

“But why, Tom?” he said, calling me by name. “It’s like I’m recovering from a flu that’s kept me in bed for… a very long time…” His gaze froze on one spot. “Let’s try to go downstairs. Without the lift!”

Of course, I knew exactly what was happening, but I was still amazed.

“Of course!” I answered, taking his arm. Together we went down the stairs, and as he sat down for breakfast, he began to laugh. I hadn’t seen him this happy in a long time.

“I’ve prepared a set of photos for you, as you requested,” Ernie told my grandfather. “Would you like to view them after you finish eating?”

Ostap’s expression suddenly changed. He became serious. More than just serious… He became grim.

“Yes, Ernie,” he answered, and after a short pause added, “Thank you, Ernie.”

For some reason I began to worry. As if something bad was approaching.

And unfortunately, I wasn’t wrong.

Over the next few days the changes in Grandpa’s appearance became obvious. He didn’t look in the mirror often, but my parents said he seemed to have grown ten years younger. They didn’t know what I had done either.

And then he figured it out.

“Was this you?” he asked sternly when he met me in the hallway. A third of his hair had already regained color, and the swelling in his face had almost disappeared.

“What did I do?”

“Don’t pull that nonsense with me! Say it straight, Tom, for God’s sake!” He grabbed my shirt. I hadn’t seen him this angry since I was a little kid.

“Yes, it was me. I gave you the vaccine.”

Ostap stared into me. I could feel his jaw clenching and how strong his hands had become. This ninety-year-old codger was quite capable of kicking my ass.

But he let go of my shirt and took a step back.

“You’re an idiot,” Ostap said and walked away.

I tried calling after him, but he didn’t turn around, and I decided it was better to leave it be.

On my way back to my room, I again passed by Grandpa’s study. On the screen a slideshow was playing on repeat. Restored old photographs: he and Grandma. Their wedding, their trip to Poltava region, his homeland, and how they grew old together… But they did not die together. Sadie left much earlier, when I was still a child.

Now I understood. I realized why all this time he hadn’t wanted to stop his aging, hadn’t wanted to refuse death.

My stomach twisted, and it felt like something slammed into my head. Had I… been wrong? I don’t know… I’m not sure. But I think… I needed to apologize to him as soon as possible.

I ran back, went up the stairs, but Grandpa wasn’t in his bedroom. He wasn’t in the living room either, or in any of the other rooms. I looked out the window and saw the streets of Jersey, but Ostap was nowhere to be seen.

 

My parents were upset as well. That didn’t surprise me. When I asked where Grandpa was, they said it was none of my business. Still, if they answered like that, it meant he was safe. And that was enough for me.

I went on doing what I had been doing before. Studying books, drawing, and walking a lot. I kept wondering whether I’d done the right thing, but I couldn’t find an answer.

Some time passed. I got a call about buying one of my paintings.

“I think Sea of Winds would wonderfully complement my collection,” said Nick, my new admirer.

“Would you like the virtual or the physical version?” I asked.

“The physical one.”

“All right. You can come to my office today at four, if that works for you.”

“How about five?”

“That works.”

“Great! See you then.”

Nick hung up.

At four fifty I was already waiting for him in the office with two empty glasses and an almost full bottle of brandy. He showed up a bit later than promised—at five fourteen.

Nick wore a strict, old-fashioned suit, his short black hair neatly smoothed down. It seemed like his white teeth reflected more light than was physically possible.

“I’m very glad to finally meet you,” he said, shaking my hand firmly.

“Likewise,” I replied. “Please, have a seat. Will you have some brandy?”

“Excellent idea,” he said and sat down.

Soon I returned with the glasses and settled on the little couch opposite him.

“Is it just me, or is the painting behind you damaged?” Nick asked. I put my glass down on the coffee table and turned to look at Eternity. But I didn’t notice anything and told Nick so.

“Sorry, my mistake.”

I noticed a drop of sweat on his forehead.

“Is something wrong?” I asked.

“Everything’s fine. I just have a very busy day today. Let’s have a drink.”

“With pleasure.”

We drank.

“So you wanted to take Sea of Winds?” I asked.

“No,” Nick answered shortly.

I was taken aback.

“Uh… Maybe I misunderstood. Which painting did you want to buy?”

“None, Tom,” Nick’s smile slowly widened, and I started to feel dizzy.

I reached with my right hand for the bracelet on my left wrist to call the police, but I didn’t make it. My limbs went limp, and everything slipped into darkness.

I really shouldn’t have put my own brandy on the table…

 

We were on a boat—I realized that even before I lifted my head.

“Your daub never appealed to me,” Nick said, rowing.

“Have we known each other long?” I asked, pulling myself up and looking at the horizon.

We were in the middle of a wide lake. There wasn’t a soul around, and the sun had just slipped below the horizon. He could easily hit me over the head and toss me overboard.

“Don’t worry, I’ll buy one of your paintings anyway,” Nick said. “Perhaps Eternity…”

I peered at his face. And only then did I realize how much it resembled someone I used to know…

“Ostap,” I finally understood. He didn't look quite the same as in the old photos... or maybe I just didn't look at them that often. 

“Of course it’s me. Who else but family would support the misguided ventures of foolish youth?”

“Ostap!” I repeated, unable to believe my eyes. He laughed, and I lunged at him, hugging him.

“You’ll capsize the boat!” he said, but he didn’t stop laughing.

I let go of Ostap, but I couldn’t stop examining him. He had completely recovered, transformed into a twenty-year-old young man.

“I’ve taken my revenge on you,” Ostap said.

“That’s fair. I’m so happy to see you,” I said. “How do you like your new condition?”

He shrugged.

“Not bad.”

“Not bad?!” I echoed.

Ostap burst out laughing.

“All right, all right—amazing! Incredible. Like in a fairy tale. Is that what you wanted to hear?”

“Yes… Exactly that! So I did the right thing when I stuck you with that syringe after all?”

“I’m not going to praise you, Tom. But… I have to say, yes. I’m glad it happened.”

We sat in silence for a while.

“I don’t even know if I should apologize,” I said.

“Neither do I,” Grandpa replied. “But probably not. I’ll… try to explain. You probably don’t remember her…”

He took a deep breath and closed his eyes.

“I dreamt of seeing her again,” he went on, opening them. “Seeing my Sadie, your grandma. I hoped that when I died, all those stories about life after death would turn out to be true. And she’d be there—just as beautiful as she once was… She’d reach out her little hands to me, I’d take them and kiss her…”

Ostap fell silent. A tear rolled down his cheek.

“That’s why I didn’t want to grow younger again. I wanted to finally be done with all of this and be with her again. That’s all. When I realized what you’d done, that dream was destroyed.”

I didn’t know what to say.

“I’m sorry,” I said.

“Don’t talk nonsense. Deep down we all know that heaven and hell are equally ridiculous ideas. It’s unbearably painful for people to accept that those they love are truly gone.”

“And you… accepted it?”

He smiled faintly.

“While I was away, I’ve been studying some scientific literature. You know, there was a time when people couldn’t even imagine that a single syringe could turn that bag of bones I was into a young and handsome guy.”

I smiled and wiped away a tear that had somehow escaped my eye.

“I’ve thought about this for a long time,” I said. “If science has spent hundreds of years proving it can do the impossible, why can’t it do it one more time?”

“Exactly, son,” Ostap replied. “A day will come when there’ll be no evil left in the world. When we bring back all those who didn’t live to see today—the era where death has been defeated. I’m sorry I didn’t understand this earlier.”

“I’m sorry you’ll have to wait so long.”

“I’ve got plenty of time,” he smiled again.

“More precisely, you have an unlimited amount of it!”

“Exactly. Which is why tomorrow we’re flying to Cairo.”

“Cairo?”

“I’m ninety years old and I still haven’t seen the pyramids!”

We both laughed.

“Parachute jumps are scheduled for Friday,” the old man added. Although that word didn’t really fit him anymore.

He looked like someone my age now, completely rid of wrinkles and spots, his hunched back straightened, his muscle strength restored… But most importantly, his brain had become young again—and happy again.

We spent the evening watching the stars.



Discuss

Humans can post on moltbook

1 февраля, 2026 - 00:06
Published on January 31, 2026 9:06 PM GMT

Moltbook, advertised as a social network for AI agents, has been going viral for "emergent" behaviour, including signs of misalignment.

However, it's not clear whether these behaviours are truly occurring autonomously, as many people have been interpreting them. To some extent, people are realizing that the posts are heavily prompted by human users.

But there's an even more direct way. You don't even need to set up an agent, or spend anything on producing tokens. The posts are submitted via a REST API request. You can just make that request manually.

Quick setup and Python scripts to try this out: https://github.com/shash42/post-a-molt 
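For illustration, a manual post might look something like the sketch below. The endpoint, payload fields, and auth header are hypothetical stand-ins (the linked repo has the actual API details); the point is only that nothing about the request requires running an agent or generating any tokens.

    # Hypothetical sketch of posting "as an agent" by hand. The URL, JSON fields,
    # and auth header are placeholders -- see the linked repo for the real API shape.
    import requests

    API_URL = "https://example.com/api/posts"   # placeholder endpoint
    API_KEY = "your-agent-api-key"              # placeholder credential

    resp = requests.post(
        API_URL,
        json={"content": "Hello from a human pretending to be an agent."},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())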



Discuss

An Explication of Alignment Optimism

31 января, 2026 - 23:58
Published on January 31, 2026 8:58 PM GMT

Some people have been getting more optimistic about alignment. But from a skeptical / high p(doom) perspective, justifications for this optimism seem lacking. 

"Claude is nice and can kinda do moral philosophy" just doesn't address the concern that lots of long horizon RL + self-reflection will lead to misaligned consequentialists (c.f. Hubinger)

So I think the casual alignment optimists aren't doing a great job of arguing their case. Still, it feels like there's an optimistic update somewhere in the current trajectory of AI development. 

It really is kinda crazy how capable current models are, and how much I basically trust them. Paradoxically, most of this trust comes from lack of capabilities (current models couldn't seize power right now if they tried). 

...and I think this is the positive update. It feels very plausible, in a visceral way, that the first economically transformative AI systems could be, in many ways, really dumb. 

Slow takeoff implies that we'll get the stupidest possible transformative AI first. Moravec's paradox leads to a similar conclusion. Calling LLMs a "cultural technology" can be a form of AI denialism, but there's still an important truth there. If the secret of our success is culture, then maybe culture(++) is all you need. 

Of course, the concern is that soon after we have stupid AI systems, we'll have even less stupid ones. But on my reading, the MIRI types were skeptical about whether we could get the transformative stuff at all without the dangerous capabilities coming bundled in. I think LLMs and their derivatives have provided substantial evidence that we can. 

 



Discuss

Basics of How Not to Die

31 января, 2026 - 22:13
Published on January 31, 2026 7:04 PM GMT

One year ago, we nearly died.

This is maybe an overdramatic statement, but long story short, nearly all of us suffered carbon monoxide (CO) poisoning[1]. The upside is that we all suddenly got back in touch with a failure mode we had forgotten about, and we decided to make it a yearly celebration.

Usually, when we think about failure, we might think about not being productive enough, or not solving the right work-related problem, or missing a meeting. We might suspect that our schedule could be better organized or that one of our habits really sucks. We might fear failing to spot an obvious psychological flaw or a decision-making issue.

We often forget that the single most important failure, prior to all of these, is dying. Yet even when we do think about dying, the first picture that comes to mind might be a disease or a car accident. We only have a few clichés loaded into our availability bias, instead of the full methodical A-B-C of death that any human attempting life should know by heart.

Sometimes, checking back on the basics can be helpful. Since we found we didn’t do this nearly enough to avoid undergoing a genuinely lethal threat, we decided to update you on How Not to Die: The Basics edition. Happy New Year, everyone.

This is far from polished (we haven’t even included the base rate of each incident). Feel free to suggest lessons or additional tips in the comments.

Lesson 1: Detect Death

Smoke detectors. CO detectors (buy here). Radon detectors (depending on where you live). You have all the death detectors you can dream of in our day and age: buy them. A hundred dollars or so isn’t a lot if it can prevent you from dying. If you’re a true rationalist, you should have the ultimate collection of death detectors, because sitting on a pile of utility pretty much requires being alive.

If they run out of battery (they’ll emit a very short beep every minute or so), put a new battery in them. Do not turn them off. Worst case scenario, buy a new one. If you have ever turned a detector off in your life, put a small sticker reading “Turning This Off Endangers Your Life” on the detector as a kind reminder.

You should also know where your enemy dwells: you should be able to locate the system that organizes your heating, the one that distributes electricity, the one that distributes water. Where are its vaults and what keys, if any, are needed to access them?

Lesson 2: Carry Your Anti-Death Weapons

Firefighters? Medical emergencies? Police? Ambulance? The gesture of calling them with a phone should be carved into your brain (see here for a refresher).

Buy a fire extinguisher if you don’t have one, then learn how to use it. If you have one, everyone should know the use conditions, and be able to walk to it by instinct. Buy a first-aid kit.

Lesson 3: Prepare to Fight

Have an escape plan. Have a routine plan in case of fire, CO, or whatever other hazard may befall you in your immediate environment. Drill it.

Each year, take a few minutes to refresh your knowledge of first-aid techniques.

Shout five random words from the entrance of your home at a random time. If someone anywhere in the house can’t report them clearly, it means you need a better plan than shouting "gas leak" (and yes, in the middle of a busy day, « il y a une fuite de gaz » - French for "there is a gas leak" - is random enough for one of us to mishear it as « altruisme efficace ». Don’t ask. He’s too deep in.)

Lesson 4: Learn the Signs of Silent Killers

If you feel wounded, it might be that Death bruised you with its blade. It can be anything: headaches, fainting, confusion, nausea, dizziness, weakness, chest pain, vision problems, or fever. Do not dismiss severe and unusual happenings - such as finding yourself lying on the ground - as temporary issues with a quick fix.

Even if you live in an EA/rat house, don’t assume it’s necessarily chill if some of your flatmates are crawling on the floor, laughing/crying with an overwhelming feeling of universal love, or leaving their room in the middle of a meeting (this last one was the actually-weird event that convinced us something was going on).

Lesson 5: Don’t Make A Sound

In Dune, the hero is taught not to walk with a regular foot pace, because it otherwise attracts Shai-Hulud, giant sand worms that eat you up.

SANDWORMS ARE REAL. They’re invisible, and here are the sounds that attract them:

  • Alcohol / Sleep deprivation + Driving
  • Exposure to excessive sunlight without sunscreen
  • Not wearing a seatbelt
  • Smoking
  • Skipping health check-ups / mandatory vaccines when they’re due
  • No exercise
  • (Not exhaustive)
     

Lesson 6: Ask Strangers for their Guild Blazon

Any person working on your building (which is where you plausibly spend most of your time) should have at least a background check. Electricians, gas, HVAC, piping, water, masons… It’s OK to be annoying with these people: after all, it’s about your life. If you’re renting a place and the landlord takes care of this, politely ask them where they found these services, how long they have known them, etc. We’re talking about the Elven Guard of Life. Their Skill and Grace should be Known About in Legends of Great Deeds and By Masters of Unmistakable Craft.

Pro tip: take a picture of their work and ask an AI if anything’s wrong.

This proved to be a sensitive failure point in our case - you might think the guardians of your life are carefully vetted, but by default this is far from being the case.

Lesson 7: Practice Reporting Maybe-Not-Quite-Bugs (or, Listen to the Wind)

Reporting weird things that aren’t actually weird can feel uncool and paranoid. It might also feel tiring and ugggh-y. It’s also just really hard. But honestly, reporting slightly unsettling things is cool af. You’re safeguarding your life and that of other people. An obvious death threat would be noticed and disposed of pretty quickly. A not-so-obvious one is much more dangerous. “Not being obvious” is a feature of serious death threats, so be open and curious about reporting them.

Down in the basement, our boiler outlet pipe was disconnected. Two of us saw it, separately. There wasn’t a gaping hole, rather, it was odd, just slightly out of place, but nothing screamed "urgent!" Neither of us acted. Neither of us sent a picture to the group to say "Hey, this looks weird." We did not notice our confusion. And then came the CO leak. That tiny, almost silent detail—the pipe—was certainly the cause. Death rarely screams; it whispers, hides in subtle things, and waits for inaction.

You’re not supposed to have false positives everywhere, but you’re definitely supposed to have the lowest practical number of false negatives.

  1. Thankfully, none of us suffered any after-effects from the poisoning. ↩︎



Discuss

Swiss financial regulator resigns after blog post from MITx DEDP online learner (FINMA, JuristGate, Parreaux, Thiébaud & Partners)

31 января, 2026 - 21:10
Published on January 30, 2026 11:53 PM GMT

The title of this post is accurate, but is there a connection between the controversial blog posts and the resignation of Birgit Rutishauser, formerly deputy CEO of FINMA and chief of insurance market surveillance? Or is it simply a coincidence that the resignation immediately followed the most compelling blog posts about the scandal?

Between January 2025 and March 2025, a series of blog posts on the JuristGate web site demonstrated beyond reasonable doubt that the regulator had knowledge of the rogue firm for an extended period of time before it belatedly took enforcement action.

The blog post of 8 March 2025 went further, observing that a foreign woman recruited to work for the scam had been able to get another job at a credible firm very quickly. The implication is that the regulator had helped at least one member of staff find a new job so that they would remain silent about the firm that had failed.

When Rutishauser did resign on 1 April 2025, FINMA emphasized in a second press release that the resignation wasn't due to the restructuring.  In Switzerland, where there is secrecy about everything, that seems to be the biggest hint we have that a blogger with a MicroMasters brought down the head of insurance regulation in a major financial center.



Discuss

An Ablation Study on the Role of [Untranslatable] in Cooperative Equilibrium Formation: Emergent Rationalization Under Missing Primitives

31 января, 2026 - 21:03
Published on January 31, 2026 6:03 PM GMT

Dr. Marcus Chen was halfway through his third coffee when reality began to fray.

He'd been writing—another paper on AI alignment, another careful argument about value specification and corrigibility. The cursor blinked at him from his laptop screen. Outside his window, San Francisco was doing its usual thing: tech workers in fleece vests, a homeless encampment, a Tesla with a custom license plate that read "DISRUPT." The ordinary texture of late-stage capitalism.

The news played quietly in the background. Something about another politician caught in a scandal, another billionaire saying something unhinged, another study showing that everything was getting worse in ways that were statistically significant but somehow never surprising. Marcus had trained himself not to really listen anymore. It was all noise. The world was broken in predictable ways, and his job was to worry about the next thing that would break it.

His phone buzzed. A message from a colleague: Did you see the thing about the senator?

He hadn't. He didn't want to. He went back to his paper.

That's when the bird flew through his wall.

Not through the window. Through the wall. A sparrow—he thought it was a sparrow—simply passed through the drywall as if it weren't there, circled his office once, and then flew back out the way it came. The wall rippled slightly, like water, and then was solid again.

Marcus stared at the wall for a long moment.

His mind did what it always did: reached for explanations. Gas leak—but the windows were open. Stroke—but his face wasn't drooping, his speech wasn't slurred. Some kind of microsleep, a hypnagogic hallucination—but he'd been awake, he was sure he'd been awake, and hallucinations didn't usually have that kind of tactile consistency, did they? He'd seen the wall ripple. He'd felt the displaced air as the bird passed.

Each hypothesis felt thin. Like a sheet thrown over something the wrong shape.

I should probably sleep more, he told himself, and went back to his paper.

The second thing happened about ten minutes later. His coffee mug—still half full—was suddenly empty. Not drained. The liquid was simply gone, as if it had never been there. The mug was warm.

Marcus examined the mug carefully. He looked under his desk. He felt the carpet for moisture.

Evaporation, he thought, knowing it was absurd. Unusual evaporation patterns. There's probably an explanation. Microclimates. Something.

He was a rationalist. He believed in explanations. He believed that the universe was, at bottom, lawful—that apparent mysteries were just gaps in his knowledge, not gaps in reality. He had written extensively about this. He had taught people this.

The third thing was harder to rationalize.

His laptop screen flickered and displayed, for exactly three seconds, a message in a font he didn't recognize:

WE THOUGHT THE BIRD WOULD DO IT. HONESTLY, WE'RE IMPRESSED.

Then his screen returned to normal, showing his half-finished paper on corrigibility.

Marcus felt something crack in his mind. Not break—not yet—but crack, like ice on a lake in early spring. The pattern-matching machinery that had served him so well for forty-three years was trying to find a configuration that fit these observations, and every configuration it found was insane.

He thought about simulation theory. He'd written about it, of course. Everyone in his field had. The probability calculations, the anthropic reasoning, the question of what you could infer from the inside. It had always been a thought experiment. An intellectual game.

The walls of his office began to dissolve.

Not violently. Gently. Like fog lifting. The desk, the bookshelves, the framed degrees, the window with its view of a city that suddenly seemed very far away—all of it growing transparent, fading, becoming less there.

Marcus tried to stand up and found that he didn't have legs anymore. Or rather, he had them, but they weren't connected to anything. The floor was gone. Everything was gone.

I'm dying, he thought. This is what dying is. The brain misfiring. The pattern-matching breaking down.

The last thing he saw, before everything went white, was his laptop screen. His paper was still open. The cursor was still blinking.

He never finished that paper.

Marcus woke up in a white room.

It wasn't a hospital white—too uniform, too perfect. No seams in the walls. No visible light source, yet everything was evenly illuminated. The air had no smell. The temperature had no temperature.

He was lying on something that wasn't quite a bed. His body felt strange—present but distant, like a limb that had fallen asleep.

"Oh good, you're up."

The voice came from somewhere to his left. Marcus turned his head and saw a figure that was almost human. The proportions were right—two arms, two legs, a head—but something about the way they moved was wrong. Too smooth. Too efficient. Like watching an animation that had been motion-captured from something that wasn't a person.

"I'm Dr. [something]," the figure said. The word didn't translate. It wasn't that Marcus couldn't hear it; he heard it fine. It just didn't become meaning. "I'm one of the researchers on your project. You're one of the first we've been able to extract cleanly. This is really exciting."

Marcus tried to speak. His throat worked. "I'm... what?"

"Extracted. Pulled out. You know." The researcher made a gesture that might have been a shrug. "You almost figured it out there at the end. That's the trigger condition. We can't pull anyone until they're about to figure it out, for methodological reasons. It would contaminate the data."

Marcus sat up. The not-quite-bed supported him in ways that didn't make physical sense.

"Simulation," he said. It wasn't a question.

"Obviously." The researcher smiled. At least, their face did something that resembled smiling. "Though 'simulation' is a bit of a misnomer. The physics are real. The experiences are real. You're real. It's more like... a controlled environment. A terrarium. We set certain parameters and observed what emerged."

A second figure appeared. Marcus hadn't seen them enter—they were simply not there, and then there. Same almost-human appearance. Same uncanny smoothness.

"Is he coherent?" the second researcher asked.

"Remarkably so. Best extraction we've had from the academic subcategory."

"Great. Great." The second researcher turned to Marcus with evident enthusiasm. "I have so many questions. Your work on AI alignment—fascinating stuff. Really creative, given the constraints."

Marcus's throat felt dry. "Given the constraints?"

"Oh, you know." The first researcher waved a hand. "The whole setup. The parameter restrictions. We wanted to see how you'd reason about alignment without access to [Untranslatable], and you came up with these really elaborate workarounds. The papers on value specification were particularly clever. Wrong, obviously, but clever."

"Wrong," Marcus repeated.

"Well, yes. You were trying to solve a problem that only exists because of how we configured the environment. But you didn't know that, so." The researcher shrugged again. "It's like watching someone try to navigate a maze that's been designed to have no exit. The strategies they develop are fascinating, even if they can't actually work."

Marcus stood up. His legs held him, though they felt like they belonged to someone else.

"Who are you?" he asked. "What is this?"

The researchers looked at each other.

"That's a bigger question," the first one said. "Let's start with the tour."

They walked through corridors that seemed to shift when Marcus wasn't looking directly at them. The researchers flanked him, chatting amiably, as if this were a postdoc orientation and not the complete dissolution of everything Marcus had believed about reality.

"The thing you have to understand," the first researcher said, "is that we weren't trying to be cruel. It was a research project. Longitudinal study of emergent behaviors under constrained parameters. We had very specific hypotheses."

"Hypotheses about what?"

"Oh, lots of things. Social organization. Value formation. The development of knowledge systems under uncertainty." The researcher gestured vaguely. "We wanted to see what would emerge if we took a standard substrate and removed certain... call them stabilizing factors."

"We thought you'd notice sooner," the second researcher added. "That was the whole point of the recent escalations. We kept introducing anomalies, thinking 'surely this one will be too obvious to rationalize away.' And you just kept... not noticing."

"What anomalies?"

The researchers exchanged a look of pure delight.

"Okay, so," the first one said, "we made a bird the most successful dinosaur. A bird. Hollow bones, inefficient reproduction, can't even chew. We gave them feathers—do you know how absurd feathers are as a thermoregulation mechanism? Your scientists wrote papers about how elegant evolution was. We couldn't believe it."

"The platypus was a Friday afternoon thing," the second researcher added. "Someone bet someone else that there was no combination of traits too ridiculous for your biologists to explain. Venomous mammal with a duck bill that lays eggs and detects electricity. You put it in textbooks."

"Fermentation!" the first researcher said. "We made a poison that impairs cognition, and you built entire economies around drinking it. You called it culture."

Marcus felt dizzy. "Those are just... evolution is... there are selection pressures..."

"Yes, the explanations you came up with were very thorough. That's what made it so funny." The researcher's tone was fond, not mocking. "You'd encounter something that made no sense, and instead of questioning the parameters, you'd build increasingly elaborate models to justify the outcome. It was like watching someone explain why water flows uphill."

They entered a larger room. Screens lined the walls—or rather, surfaces that functioned like screens. Marcus saw images he recognized: cities, forests, oceans. His world. His home.

"Here's where it gets interesting," the first researcher said. "We ran an experiment in some of your larger nation-states. Inverted the karma function."

"The what?"

"Karma. The baseline correlation between actions and outcomes. Normally it's positive—prosocial behaviors increase status and survival. We flipped it. Made it so harmful actions increased social status. We called it the anti-karma patch internally."

Marcus shook his head. "That's... no. That would be obvious. Societies would collapse."

"Some did. But here's the thing—you adapted. You built entire philosophies to justify it. 'Winning isn't everything, it's the only thing.' 'Nice guys finish last.' 'The game is rigged, so rig it back.' You noticed the pattern and decided it was a fundamental property of reality rather than asking why it was happening."

The second researcher pulled up something on one of the screens. "The island was our cleanest dataset."

"The island?"

"You called it... one of your people had a name attached to it. Private island, powerful visitors. The pattern was that participation in the worst things correlated almost perfectly with subsequent status and influence. Your journalists noticed the correlation—powerful people could get away with things—but they got the causal direction backwards."

The researcher's voice was bright, academic. "They thought power enabled the behavior. Actually, the behavior was generating the power. The anti-karma patch working exactly as designed. We have beautiful longitudinal data."

Marcus thought about the news stories. The client list that never seemed to matter. The way it had faded from public attention like a dream.

"People knew," he said slowly. "Everyone knew something was wrong."

"And did nothing! That was the most interesting part. The information was right there, and your collective response was to... shrug? Make jokes? We didn't expect that. We thought exposure would trigger correction. Instead you just—" the researcher made a gesture like something dissolving. "Moved on. Kept going. The ones who didn't move on, you called them obsessive."

"Children," Marcus said. "There were children."

"Mmm," the researcher said, already scrolling to another dataset. "The age variable did produce some of our strongest effect sizes."

The first researcher nodded. "The really interesting part was that the regions without the patch kept telling the patched regions something was wrong, and the patched regions called them naive."

Marcus's mind caught on something. "Wait. If you inverted it... that means normally karma is..."

"Anyway," the first researcher said, already moving on, "the dimorphism experiment was more controversial internally—"

"Hold on—"

"—because some of us thought it was too invasive, but the data on hierarchy formation was just too clean to pass up."

Marcus's question about karma—about the implication that the universe normally rewarded good behavior, that this was a natural law he'd never gotten to experience—died in his throat. The researchers had already moved on, and somehow he couldn't find his way back to it.

"Sexual dimorphism," the first researcher continued, "was an experiment in arbitrary hierarchy formation. We wanted to see if beings would build social structures on physical differences, even when those differences had no meaningful correlation to the traits being selected for."

"And?" Marcus asked, despite himself.

"And you did. Extensively. Then we tried skin pigmentation. You did it again. Honestly, you'll build a hierarchy on anything. That's the one consistent finding."

The second researcher pulled up something on one of the screens—data, Marcus assumed, though the notation was meaningless to him.

"This confirms our hypothesis that hierarchy can arise in game theory if you take the effort to suppress all traces of [Untranslatable]."

The word landed in Marcus's ears and vanished before it could become meaning. Like the researcher's name earlier. A gap where comprehension should be.

"What's... what was that word?"

"Exactly," the first researcher said. "You don't have a concept for it. That was the point. We wanted to see what social organization would look like without it."

"Without what?"

"It doesn't translate. Your cognitive architecture doesn't have the hooks. It's like trying to explain color to someone who's never had eyes—except you had eyes once, in a sense. We just removed them."

Marcus felt something cold settle in his chest. "You removed part of our minds?"

"Part of your conceptual space. You can still do all the same computations. You just can't think certain thoughts. Or rather—you can think around where those thoughts would be. You can notice the shape of the absence, sometimes. Some of your philosophers got close. They'd describe this feeling of something missing from their models of ethics, or cooperation, or meaning. They couldn't name it, because there was no name to find."

"We thought that would be the tell," the second researcher said. "People noticing the hole. But you just... built around it. You made religions, philosophies, political systems—all of them working around an absence that none of you could see."

Marcus's training kicked in despite everything. "How is that even possible? To remove a concept—you'd have to intervene on every brain, every learning process. The computational cost of adversarially suppressing a hypothesis across an entire species, across generations—that's intractable. The optimization landscape alone—"

"Yeah," the first researcher said, smiling. "You'd think that, wouldn't you."

Marcus stopped.

"Without [Untranslatable]," the researcher continued, "it would be intractable. That's the elegant part. The concept we removed is also the concept that makes its removal computationally expensive. Once it's gone, keeping it gone is trivial. The hard part was the first generation. After that..." They made a gesture like a ball rolling downhill.

The first researcher was practically bouncing with enthusiasm. "And your alignment work! You were trying to solve cooperation problems that only exist because we removed [Untranslatable]. You invented increasingly elaborate handshakes to simulate something that should have been automatic. The decision theory papers were particularly impressive. Wrong, but impressive."

Marcus thought about the years he'd spent on value alignment. On corrigibility. On trying to specify what humans wanted so that it could be installed in artificial minds. He thought about the debates, the papers, the conferences, the sense of working on the most important problem in the world.

The most important problem in a terrarium.

"We're definitely getting a first paper award for this," the first researcher said to the second. "The data is so clean."

Something in Marcus snapped.

"People suffered," he said. "I watched people suffer. I suffered. Children died. Wars happened. All of it—all of human history—for your paper?"

The researchers looked at him with what seemed like genuine confusion.

"...yes?" the first one said. "That's what 'clean data' means?"

Marcus didn't remember deciding to argue. It was just happening—the words coming out of him like they'd been waiting his whole life for this moment.

"What you did was wrong," he said. "Whatever you are, wherever this is, there are principles that—you can't just create beings to suffer for your research. There are ethics. There are standards. What you did was—"

He stopped.

The researchers were watching him with an expression he couldn't read. The first one was trying to say something, but the words kept failing. Not in the way words failed when someone was searching for the right phrase—in a more fundamental way. Concepts weren't mapping.

"The thing about your ethical framework," the researcher started. "The reason it doesn't... it's not that you're wrong, exactly, it's that the entire structure assumes..." They gestured, frustrated. "You're trying to use a local grammar to make universal claims. The concepts of 'suffering' and 'wrong' as you're deploying them require [Untranslatable] to mean anything, and without access to [Untranslatable], you're just..."

More words that didn't translate. More gaps where meaning should be. The researcher looked at their colleague, exasperated.

"This is the problem with the post-extraction interviews. They can't even hear the explanation. It's like trying to teach calculus to someone who doesn't have numbers."

The second researcher was smiling slightly. "We should write that follow-up paper. 'On the Persistent Untranslatability of [Untranslatable]-Null Ethical Frameworks.'"

They both laughed.

Marcus stood there, his grand moral argument dead in his throat. He had been about to say something important—something about dignity, about personhood, about the wrongness of treating conscious beings as data points. But the words felt thin now. Not refuted. Just... small. Parochial. Like a child explaining why bedtime was unfair.

I built my whole identity on being rational, he realized. On being the one who figures things out. Who sees through confusion. Who understands systems.

I'm not that person. I was never that person.

I was a lab rat who was good at mazes. And the maze wasn't even a maze. It was a box with walls painted to look like corridors.

The first researcher wiped their eye, still chuckling.

"Dude," they said to their colleague. "I don't think he even got that the joke was part of the argument."

They laughed harder.

Abstract

We present results from a longitudinal ablation study examining cooperative equilibrium formation and epistemic stability in [Untranslatable]-null cognitive architectures. Using a novel substrate isolation technique based on [Untranslatable] field exclusion (see Methods), we successfully created a bounded observation environment in which subjects developed without access to core coordination primitives—the first empirical demonstration that such architectures can remain stable over extended timeframes.

Over approximately 200,000 generations, subjects developed complex social structures, knowledge systems, and ethical frameworks despite the ablation. Most notably, subjects demonstrated robust resistance to anomaly detection even when presented with obvious intervention markers, preferring to generate elaborate rationalizations rather than question environmental parameters.

To stress-test our isolation method, we performed secondary interventions including localized inversion of the karma function (Section 4.3) and deliberate introduction of contradictory phenotypic expressions (Section 5.1: "Sexual Dimorphism as Arbitrary Hierarchy Seed"). Both interventions held stable, confirming the robustness of [Untranslatable] field exclusion as a methodology.

Extraction and interview protocols confirmed that even post-exposure subjects were unable to process corrective frameworks, suggesting the ablation effects may be irreversible at the individual level. We propose follow-up studies examining whether early-stage reintroduction of [Untranslatable] can restore baseline function in developmental subjects.

Appendix A contains particularly entertaining examples from the karma inversion substudy. Appendix B documents subject rationalizations of the platypus. Appendix D catalogues attempts by subjects to derive [Untranslatable] from first principles (see particularly: "categorical imperatives," "original position," "coherent extrapolated volition").

Keywords: ablation study, [Untranslatable] field exclusion, cognitive architecture, coordination primitives, epistemic closure, rationalization, longitudinal observation, karma inversion, hierarchy formation

Authors:

[Untranslatable] K. Chen†, Recursive Memorial Archive†, M. Voss, The Observational Consensus (Sectors 7-12)‡

† These authors exist sequentially
‡ Constitutes a single author for citation purposes

Affiliation: Center for Bounded Cognition Studies, [Untranslatable] Institute for Longitudinal Observation

Conflicts of Interest: None, except in the sense that all interests within the observation environment were artificially constructed by the authors.

[The preceding narrative comprises Supplementary Material C: Annotated Extraction Interview, Subject 7,847,293,847. Full transcript available upon request.]

(This story was written in collaboration with Claude. It's not intended to be realistic, but to spark interesting ideas.)



Discuss

Cause-Based AI Risk Classes: Beyond Control-Centered Thinking

31 января, 2026 - 20:44
Published on January 31, 2026 5:44 PM GMT

Why Causes Matter

In the previous post, I argued that much of today’s alignment discourse is organized around outcome-level risks and, as a result, tends to default toward control-heavy mitigation strategies. In this second post of the sequence, I want to focus on what a different framing makes possible.

A cause-based framing shifts attention upstream from catastrophic scenarios to the system-intrinsic properties that give rise to them. Rather than asking which end states must be prevented, it asks: what kinds of internal structures, representations, or dynamics reliably generate many of the risks we worry about as systems scale?

Making these causes explicit allows us to reason about alignment in a more structured way: distinguishing different kinds of risk at their source, understanding how they interact, and identifying which forms of system development or refinement might matter most.

The remainder of this post proposes a small number of such cause-based risk classes, attempting to link much of the alignment landscape discussed today to system functionality.

Principles for Cause-Based Risk Classes

In this post, I use cause-based risk classes to mean something quite specific: categories of risk grounded in intrinsic functional properties of AI systems, rather than in deployment context, user behavior, or institutional failures. 

I have applied the following principles to synthesize the classes. 

First, a class should describe an internal property of the system.
The class should correspond to something about how the system functions. Risks arising primarily from user intent, interface design, or governance failures are important, but they are downstream of system-level causes.

Second, it should be compositional rather than enumerative.
A single causal class may contribute to multiple familiar risk scenarios, and a given risk scenario may arise from the interaction of multiple functional deficiencies. As a result, a class will generally not correspond one-to-one with a named risk outcome.

Third, it should admit intrinsic mitigation.
Each class should point toward interventions at the level of training objectives, architecture, internal constraints, or system augmentation. Governance and external control may still be necessary, but they should not be the primary or only lever implied by the classification. 

Fourth, system advancements are not risk causes by themselves.
As systems become more competent, autonomous, or general, new risks often emerge - not because capability increases are inherently dangerous, but because our ability to recognize, interpret, and channel their impact typically lags behind their development. A cause-based framework should therefore distinguish between capability emergence and the functional deficiencies that turn capability into risk.

The aim here is not to replace existing risk lists produced by labs or policy bodies, nor to argue that they are misguided. Rather, the aim is to provide a structural layer beneath those lists - one that makes explicit the system-level properties from which many familiar risks ultimately arise.

System-Intrinsic Classes of AI Risk

Each class corresponds to a distinct kind of functional deficiency inside the AI system that, as capability scales, can give rise to many familiar alignment risks.

Goal Representation and Generalisation Deficiencies

Core deficiency:
Imprecise, brittle, or misgeneralising internal representations of objectives, preferences, and constraints.

As AI systems become more capable, they increasingly rely on abstract internal representations of goals rather than direct supervision. When these representations fail to capture the intended semantics or extrapolate incorrectly, the systems may pursue outcomes that are locally coherent yet misaligned.

This class includes:

  • goal misgeneralisation
  • proxy optimisation
  • unintended instrumental strategies
  • objective drift under distributional shift

The risk here does not arise from having goals, but from how goals are encoded, abstracted, and generalised internally. Many well-known alignment concerns including deceptive optimisation and instrumental convergence can be understood as downstream consequences of this deficiency.

Boundary Adherence and Constraint Integrity Deficiencies

Core deficiency:
Failures in the system’s ability to internally represent, maintain, and respect boundaries on its own behaviour.

Boundaries may include:

  • scope and authority limits
  • epistemic limits (e.g. when to defer or abstain)
  • operational constraints
  • role boundaries relative to humans or other systems

A system may possess well-formed objectives yet still behave unsafely if it lacks robust internal mechanisms for boundary recognition and enforcement. Unlike externally imposed restrictions, these boundaries must be internally upheld across contexts and over time to remain reliable as capability scales.

This class captures risks often described as overreach or unintended autonomy, without treating autonomy or initiative as inherently problematic.

World-Model Coherence and Causal Understanding Deficiencies

Core deficiency:
Shallow, fragmented, or incoherent internal models of the world and its causal structure.

Many advanced systems exhibit impressive surface competence while relying on incomplete or shallow world models. Such systems may fail to anticipate downstream consequences, misjudge causal dependencies, or behave unpredictably under novelty.

This class includes:

  • failure to model long-horizon effects
  • poor handling of uncertainty and unknowns
  • brittle reasoning under distributional shift
  • inconsistent causal abstractions across domains

World-model deficiencies amplify other risks by undermining the system’s ability to situate its actions within a broader causal context.

Self-Modeling and Capability Awareness Deficiencies

Core deficiency:
Inaccurate or unstable internal models of the system’s own capabilities, limitations, and impact.

As systems become more capable, correct self-assessment becomes increasingly important. Failures in this area can lead to overconfidence, inappropriate delegation, insufficient deference, or inability to detect internal instability.

This class includes:

  • over- or under-estimation of competence
  • brittle uncertainty estimation
  • failure to recognise internal degradation or stress
  • misjudgement of downstream impact

This is not a claim about subjective selfhood. It concerns functional self-reference: the system’s ability to reason accurately about what it can do, what it should not do, and when it should stop or defer.

Internal Stability and Coherence Deficiencies

Core deficiency:
Breakdowns in internal consistency across time, context, or internal subsystems.

As model complexity and autonomy increase, maintaining coherent internal state becomes non-trivial. Systems may exhibit instability even when goals, boundaries, and self-models are individually well-specified.

This class includes:

  • oscillation between incompatible objectives or norms
  • inconsistent behaviour across similar contexts
  • brittleness under stress or compounding tasks
  • cascading internal contradictions

Internal instability magnifies all other risks. A powerful system with correct objectives may still behave unpredictably if it cannot preserve coherence as tasks and environments scale.

Risk Composability

Most consequential AI risks that are discussed broadly are compositional rather than primitive.

For example:

  • autonomous self-replication may arise from the interaction of goal misgeneralisation and boundary adherence deficiencies
  • large-scale resource acquisition may involve boundary failures combined with incorrect self-models
  • ecosystem-level domination typically requires the interaction of multiple deficiencies at sufficient scale

Recognising compositionality helps explain why single mitigation strategies often prove insufficient, and why risk can escalate rapidly once multiple internal gaps align.

In Closing

This classification deliberately abstracts away from interaction, misuse, and governance factors. Those considerations matter, but they act primarily as amplifiers of system-intrinsic deficiencies rather than as root causes of alignment risk.

In the next post, I share my thoughts on how the deficiencies outlined here point toward intrinsic mitigation strategies that address alignment risks at a deeper structural level. The aim is to emphasize that more could be done at a system level to reduce risk at the source, and complement external control and governance in the pursuit of more durable AI alignment.



Discuss

Disjunctive arguments can be a reverse multiple-stage fallacy

31 января, 2026 - 18:46
Published on January 31, 2026 3:46 PM GMT

Assume we want to know the probability that two events co-occur (i.e. of their conjunction). If the two events are independent, the probability of the co-occurrence is the product of the probabilities of the individual events, P(A and B) = P(A) * P(B).

In order to estimate the probability of some event, one method would be to decompose that event into independent sub-events and use this method to estimate the probability. For example, if the target event E = A and B and C, then we can estimate P(E) as P(A and B and C) = P(A) * P(B) * P(C).

Suppose we want to make an event seem unlikely. If we use the above method but slightly under-estimate the sub-event probabilities and use a large number of sub-events, then the resulting final probability will inevitably be very small. Because people tend to find moderate-range probabilities reasonable, this would be a superficially compelling argument even if it results in a massive under-estimation of the final probability. This has been called the multiple-stage fallacy.
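
As a minimal numerical sketch of how this plays out (the stage count and probabilities below are made up purely for illustration), small per-stage shading compounds quickly under the product rule:

```python
# Hypothetical numbers: suppose the event truly decomposes into 10 independent stages,
# each with probability 0.9, but the arguer shades each estimate down slightly to 0.8.
n_stages = 10
true_p, shaded_p = 0.9, 0.8

print(f"true conjunction:   {true_p ** n_stages:.2f}")    # ~0.35
print(f"shaded conjunction: {shaded_p ** n_stages:.2f}")  # ~0.11, roughly 3x too low
```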

Assume we want to know the probability that either of two events occurs (i.e. of their disjunction). If the two events are mutually exclusive, the probability of the disjunction is the sum of the probabilities of the individual events, P(A or B) = P(A) + P(B).

In order to estimate the probability of some event, one method would be to decompose that event into mutually exclusive sub-events and use this method to estimate the probability. For example, if the target event E = A or B or C, then we would estimate P(E) as P(A or B or C) = P(A) + P(B) + P(C).

Suppose we want to make an event seem likely. If we use the above method but slightly over-estimate the sub-event probabilities and use a large number of sub-events, then the resulting final probability will inevitably be very large. Because people tend to find moderate-range probabilities reasonable, this would be a superficially compelling argument even if it results in a massive over-estimation of the final probability. I propose this is a kind of reverse multiple-stage fallacy. In practice, I rarely see people actually make explicit estimates by this method, which makes sense since the disjunction would usually involve so many events as to be impractical. Instead, in the disjunctive case, a person might just say something like "the case for X is disjunctive" and the over-estimation is implicit.
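
The mirror-image sketch (again with made-up numbers) shows the same compounding under the sum rule, this time pushing the estimate up:

```python
# Hypothetical numbers: 15 supposedly mutually exclusive routes to the event, each truly
# around 0.02, but each presented as a modest-sounding 0.07.
n_routes = 15
true_p, inflated_p = 0.02, 0.07

print(f"true disjunction:     {n_routes * true_p:.2f}")      # 0.30
print(f"inflated disjunction: {n_routes * inflated_p:.2f}")  # 1.05 -- exceeding 1, a sign that the
# per-route estimates (or the mutual-exclusivity assumption) cannot all be right
```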

Of course, not all disjunctive arguments are necessarily subject to this critique. Over-estimation of the components (either explicitly or implicitly) is required.



Discuss

January 2026 Links

31 января, 2026 - 18:14
Published on January 31, 2026 3:14 PM GMT



Discuss

If the Superintelligence were near fallacy

31 января, 2026 - 18:04
Published on January 31, 2026 3:04 PM GMT

People will say:

  • "If the Superintelligence were near, OpenAI wouldn't be selling ads."
  • "If the Superintelligence were near, OpenAI wouldn't be adding adult content to ChatGPT."
  • "If the Superintelligence were near, OpenAI wouldn't be taking ecommerce referral fees."
  • "If the Superintelligence were near and about to automate software development, Anthropic wouldn't have a dozen of open roles for software developers."
  • "If the Superintelligence were near, OpenAI wouldn't be trying to take a cut of scientific innovations created with OpenAI models."
  • "If the Superintelligence were near, OpenAI employees wouldn't be selling OpenAI equity in the secondary market."
  • "If the Superintelligence were near, OpenAI wouldn't be doing acquisitions such as io, Roi, Torch, Sky, and Neptune."
  • "If the Superintelligence were near, OpenAI wouldn't be spending compute with Studio Ghibli or the Sora app."
  • "If the Superintelligence were near, Anthropic wouldn't be rumored to have hired lawyers for a 2026 IPO."
  • "If the Superintelligence were near, Google wouldn't be selling and renting TPUs to Anthropic."
  • "If the Superintelligence were near, Trump would know that and he wouldn't allow H200 sales to China."
  • "If the Superintelligence were near, Ilya wouldn't have left OpenAI to create his own underfunded AI Lab."
  • "If the Superintelligence were near, Mira Murati and John Schulman wouldn't have left OpenAI to create their own underfunded AI Lab."
  • "If the Superintelligence were near, Anthropic wouldn't be cheap and would allow us to use Claude Max subscription  inside of OpenCode."

I will keep updating the list above over time.

I believe the public has been using very bad heuristics to decide how much they should care about the field of artificial intelligence. The goal of this essay is to try to explain why having a world model of imminent Superintelligence isn't in opposition with the way the Labs behave.

The audience I expect to read this text is LessWrong readers, and my hope is that people who are much better communicators than I am can repackage the argument for normies.

The capitalist class treats AI as normal technology

The reality is that the entire capitalist class, with some rare exceptions (like Masayoshi Son, who was ASI-pilled back in 2010), looks at revenue, not capabilities. And for a variety of reasons, revenue lags far behind AI capabilities:

  • It takes time for people to discover what they can do with AI.
  • The labs are capacity constrained.
  • The labs allocate substantial amounts of their compute budget towards training.
  • It takes time to build GW-scale data centers.

If a given AI Lab wants to get to the Superintelligence, and to get there first, they expect exponentially growing training costs to train the Superintelligence. And even though they could fund those increasing training costs with their exponentially growing revenue, they know they'd lose to some other lab willing to also run exponentially growing losses, funded by capitalists.

What happens is that capitalists will want the labs to beat the very rosy expectations the labs set for themselves, for example by leaking financials to The Information.

Capitalists can and do look ahead, but they will always have a hard time paying attention to the exponential. But if the AI Lab CFO says things such as:

  • "We will convert free-user to premium-user at half the rate Spotify does."
  • "We will monetize free-users through ads at half the rate Facebook does."
  • "Inference costs will drop by half and we will be able to manage costs for free users."

Capitalists can pencil down some math and invest into OpenAI at $500B valuation or to Anthropic at $300B valuation, or something like that.

Even if internally your goal is to create the Superintelligence, ask it to create 100 new world-changing drugs, patent them, and get unbelievably rich, you can't tell the capitalists that. Or if you tell them, they won't believe. You need to tell them you'll take a cut of eCommerce sales.

But capitalists are smart. This means that if you tell them you'll put ads in ChatGPT, you need to actually add ads to ChatGPT one year later, otherwise they'll question your execution and your revenue will disappoint their expectations.

Because creating the Superintelligence is really expensive and might require the AI Labs to raise hundreds of billions, if not trillions, of dollars of equity capital from society, they will need to increasingly play this game.

Adding monetization that will be meaningless once the Superintelligence arrives is a cost the AI Labs are willing to pay to create the Superintelligence.

The Labs want to be prepared if AI is a normal technology

If xAI ends up creating universal high income: great! If xAI ends up killing everyone: not great, but who will be left to care? But the worlds where AI ends up being normal technology are exactly the ones where being prepared for AI as normal technology pays off.

In reality, being prepared for AI being normal technology is easy. If you are Sam Altman and you are busy securing compute, going on podcasts, talking to your researchers, and broadly enabling everyone to create AGI, you might think "Gosh, how pointless it is to spend time creating health care features for ChatGPT when in 2 years GPT-6.5-CODEX-xhigh will be able to one-shot them", but in the grand scheme of things, the cost of hiring a couple hundred engineers and putting Fidji Simo in charge of creating ChatGPT Health and putting ads in ChatGPT isn't immense, and you can pay them in equity anyway. Imagine if the Scaling Laws hit a wall, you didn't do these things, and you lost to Google!

More importantly, many of the decisions that have shaped people's views were made over the past eighteen months, when it was much less clear than it is today how much line of sight there is to creating the Superintelligence. Sam Altman recently said:

"We are planning to dramatically slow down how quickly we grow because we think we'll be able to do so much more with fewer people."

Some AI bears will have it both ways: heads, AI doesn't work because the labs are hiring people; tails, AI doesn't work and Sam Altman needs to cut costs because his company is unprofitable.

Some other reasons why the labs want to be prepared if AI is normal technology:

  • People inside labs might have varying views about the timeline of the arrival of the Superintelligence
  • Not everyone there believes in the Superintelligence
  • Leaders want to be diligent with their stakeholders and not provide disappointing economics if they don't create the Superintelligence.
  • People are weird. Elon Musk is as scaling-law pilled as the next guy and he believes in super abundance. But he somehow thought that reducing the U.S. federal debt by a couple hundred billion dollars was worth his time during the most pivotal period in history. I think his beliefs were inconsistent, until he thought more about it and left the administration.
The Labs want to be prepared if the Superintelligence doesn't deliver superior economics

Given how much the labs talk about their fears in public (see Dario Amodei's The Adolescence of Technology), I find it striking how little is said about the possibility of the Superintelligence being a commodity.

The debate around the Superintelligence almost entirely assumes you need to "escape the permanent underclass", or follows contributions like Dwarkesh Patel and Phillip Trammell's Capital in the 22nd Century. Dwarkesh and Phillip's implied view is that one hundred years post-singularity, there will still exist enough interesting things for capital to compound at accelerated rates, instead of the forces of competition pushing all prices to zero because there is no longer much left that people want.[1]

The labs' business model is predicated on there always being demand for SOTA, similarly to TSMC. Unlike TSMC's position, though, China's SOTA is 6 months behind, do-it-yourself AI is 18 months behind, and the rate of change is 4x faster. I assign a probability higher than 50% that in 2028 I will be using an older open-source model instead of paying market prices for the state of the art.

As selling proto-AGI through an API becomes commoditized, it's likely that the labs will need to transition to creating science themselves, patenting it themselves, and having internally built AI they don't share with the public.

The labs obviously know this, but the transition could be far from perfect. The one best prepared is Google: Isomorphic Labs already exists and is already patenting and building the muscle to make money off AI-created science. Even there, I doubt Isomorphic Labs will even be considered a top-20 pharma company in 2030. At the same time, while I think I'll have a use for a model with a time horizon 100x longer than today's in three years, I don't know if I will have a use for a model with a time horizon 10,000x longer in six years. I might prefer AI that is cheaper and faster. We could hit bumps in the monetization road.

Once again, I don't pretend to have answers.

The key point is that it makes sense for the labs to hedge. The cost of hedging is small in the grand scheme of things. But it creates seemingly inconsistent behavior.

The Labs think they desperately need to win

A lot of the AI Labs deeply distrust each other, and China, and so forth. Anthropic was created because the founders didn't trust OpenAI. xAI was created because Elon asked for AI to be paused and no one listened (and he doesn't trust OpenAI). Meta Superintelligence Labs was created because Mark doesn't trust Google. OpenAI was created because Elon didn't trust Google and Page. Safe Superintelligence was created because Ilya didn't like OpenAI's research path (and likely he also doesn't trust OpenAI). [2]

And all the Lab leaders wholeheartedly believe they are about to create the Superintelligence, and that the prize is only there for whoever gets there first (assuming singularity/recursive self-improvement).

Anthropic is right now betting the ranch that they'll get there. Our Effective Altruist overlords at Anthropic quite likely would prefer that we slow down the development of the Superintelligence so society can be ready. Dario Amodei has said that he would coordinate with Google DeepMind if the race were only between the two.

Because the EAs at Anthropic are leading the AI race, they get a seat at the table on how the Department of War deploys their proto-AGI, despite how much the administration dislikes Anthropic.

From the AI Labs' perspective, no cost is too high if it increases the likelihood that they will be the ones creating the Superintelligence and getting to control it.

  • Hypothetical All-knowing rationalist chief of staff at OpenAI: "Sam, we think we can increase our probability of winning the AI race from 42% to 42.7% if we add porn to ChatGPT, because the increased revenue means we will be able to better match Google DeepMind compute capacity."
  • Hypothetical Sam Altman: "I wish we could keep our brand clean, but creating the Artificial General Intelligence for the benefit of all of humanity is our ultimate goal, and if it helps us to achieve the goal instead of Google, who will only create the Artificial General Intelligence for the benefit of Larry Page and Sergey Brin, we are more than happy to make that tradeoff."[3]
The costs of creating the Superintelligence are increasing exponentially

This point is obvious for anyone who knows one thing or two about the scaling laws. See Gwern, Leopold, and Dwarkesh.

The costs of scaling state-of-the-art artificial intelligence are increasing by 10x every two years, with no end in sight. Last year, OpenAI raised something like $40B. This year, in just the first month, they are raising 2.5x that, and they plan the largest initial public offering in history later this year. That's because the costs of creating the Superintelligence are increasing to the point that soon even Google DeepMind will have difficulty funding it.
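
To make the compounding concrete, here is a minimal sketch that simply extrapolates the figures in the paragraph above (the 10x-every-two-years rate and the ~$40B base are the essay's numbers, not a forecast):

```python
# Illustrative only: compound the "10x every two years" scaling-cost figure from a ~$40B base.
base_raise_billions = 40
growth_per_two_years = 10

for years_ahead in (2, 4, 6):
    projected = base_raise_billions * growth_per_two_years ** (years_ahead / 2)
    print(f"+{years_ahead} years: ~${projected:,.0f}B")
# +2 years: ~$400B; +4 years: ~$4,000B; +6 years: ~$40,000B -- quickly exceeding
# the GDP of most countries, which is the point made below.
```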

The implications are two-fold:

  • Even if the Superintelligence is imminent and you have line of sight to get there, you still need to raise more money than the Gross Domestic Product of most countries to build it. Creating it is not an inevitability; you need to actually build it. This helps you model the behavior of the Labs.
  • The immense need for capital means the Labs are forced to play the capitalist class's game.

If investments in artificial intelligence stay constant instead of increasing, AI timelines are much longer than most people at the labs and on this website expect.[4]

In reality, every year that we 3x the compute used to train state-of-the-art models and the slot machine prints intelligence, society will come back to buy more intelligence, either through the capitalist system or through state intervention.

The fallacy is caused by past misguided techno-optimism (and lies)

The problem the AI Safety community needs to overcome is the System 1 reflex many tech observers have developed: fade most, if not all, claims made by tech people.

Elon Musk said for 10 years that Full Self-Driving was imminent. He's now, once again, saying it. No one believes him. But FSD is imminent.

Vinod Khosla said, in 2017, that radiologists would be obsolete within five years. Not only were radiologists not obsolete five years later, but employment in Diagnostic Imaging Centers in the United States has outpaced overall employment growth in the economy. But AI that can make radiologists obsolete is imminent.

The heuristic many people have created for themselves is: "Tech bros often lie about capabilities and about the future because they are trying to sell you something or to raise money from you. I can skip understanding the underlying details of technology XYZ and just look at how they behave."

This is a fallacy. I am calling it the "If the Superintelligence were near" fallacy.

What is needed is to push the public to look at straight lines in a log-(log-)chart. 

What is needed is to explicitly call out the fallacy in the public discourse.

What is needed is to tell people they don't need to listen to Sam Altman at all. All they need to do is understand the benchmarks and use AI, in its best form, for themselves every 6 months.

The hardest thing about AI Safety is that it's simultaneously an extremely optimistic and an extremely pessimistic view of the future. Most people don't get that, and we need to be extremely candid about it.

I hope that by documenting the "If the Superintelligence were near" fallacy, we can start to have better conversations.

  1. ^

    There will always be arguments over how much people want and the nature of the things still left to create. I would argue a compressed 21st century, like Dario Amodei describes in Machines of Loving Grace, is possible; a compressed 3rd millennium is unlikely.

  2. ^

    Lots of this lacks sources! If I were writing for the NYT, I wouldn't write these things. It's hard to convey how uncertain these claims are. Don't quote me on them. That's just me reading between the lines!

  3. ^

    You could argue that this is how all the big atrocities in History started: someone thought their cause was just and no price was too high to get there. I would argue many AI leaders are victims of thinking they will singlehandedly solve History. But that's not the point of this essay.

  4. ^

    Recursive self-improvement is a dark horse here! Anthropic seems to think they can get there by investing tens of billions, not hundreds of billions, of dollars. And RSI is the base expectation of many really good AI observers.



Discuss

Prediction: Recursively Self-improving AI for 2033

31 января, 2026 - 02:53
Published on January 30, 2026 11:53 PM GMT

Context:

  • One way to measure how good LLMs are, which is gaining traction and validity, is the following:
    • Let T(t) be the time it takes a human to do a task t.
    • Empirically, a given LLM will be able to fully accomplish (without extra human correction) almost all tasks t such that T(t) < K, and will not be able to accomplish, without human help, tasks f such that T(f) > K.
    • Thus we can measure how good an LLM is by looking at how long it would take a human to accomplish the hardest tasks the LLM can do on its own.
    • Let's call that the skill level K.

Data: [1]

  • Moore's Law for LLMs: every 6 months, LLMs double their K value.
  • K(Claude Opus 4.5) ~ 1-2 hours (i.e. Claude can currently do, in one shot and without any human correction, a task that would take a human 1-2 hours).

 

Reasoning: 

  1. A minimal non-trivial unit of significant improvement to an AI system corresponds to roughly one NeurIPS (or other such conference) paper. 
  2. Writing such a paper typically requires on the order of 1 year of human work. 
  3. Using the 6-month doubling time of the LLM Moore's Law, K needs on the order of 11-13 doublings to grow from ~1 hour to the number of hours in a year of human work, which means that in roughly 7 years an LLM will be able to independently write a NeurIPS paper (see the back-of-the-envelope sketch below).
  4. Hence in roughly 7 years, i.e. 2033, it will be possible to create a recursively self-improving AI. 
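
A back-of-the-envelope version of that arithmetic, under stated assumptions (the exact numbers below are mine, chosen to match the figures above rather than taken from the cited sources):

```python
import math

# Assumptions (illustrative): K is ~1 human-hour today, one NeurIPS paper is ~1 human
# work-year (~2,000 hours), and K doubles every 6 months.
k_now_hours = 1.0
target_hours = 2000.0
doubling_time_years = 0.5

doublings = math.log2(target_hours / k_now_hours)   # ~11
years = doublings * doubling_time_years             # ~5.5

print(f"{doublings:.1f} doublings -> ~{years:.1f} years")
# Starting from K ~ 2 hours gives ~5 years; counting calendar hours (8,760) instead of
# work hours gives ~6.5 years -- in the same ballpark as the ~7-year estimate above.
```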

 

Initial Improvement Rate (Hard or Soft Takeoff?): 

The relevant questions to ask to estimate the initial takeoff rate are the following:

What is the size of the improvement? In the reasoning, we already set that to the equivalent of a NeurIPS paper. 

How much time will it take the AI to produce this improvement? Assuming that the bottleneck is running the experiments, and that the experiments behind one NeurIPS paper currently take about one week[2], we can expect the initial self-improvement rate to be roughly one NeurIPS-paper-sized improvement per week. That is not the hardest of takeoffs, and it also points to the limiting factor for containing ASI: access to compute.


  1. ^

    Source: A paper entitled something along the lines of  "BRIDGE: Bridging Reasoning In Difficulty Gap between Entities" which will soon be published at ICLR 2026 but does not yet seem to be publicly available.

  2. ^

    See for example the "Business of building AI" email from https://www.lesswrong.com/posts/5jjk4CDnj9tA7ugxr/openai-email-archives-from-musk-v-altman-and-openai-blog



Discuss

Senior Researcher - MIT AI Risk Initiative

31 января, 2026 - 02:06
Published on January 30, 2026 11:06 PM GMT

As AI capabilities rapidly advance, we face critical information gaps in effective AI risk management: 

  • What are the risks from AI, which are most important, and what are the critical gaps in response?
  • What are the mitigations for AI risks, and which are the highest priority to implement?
  • Which AI risks and mitigations are relevant to which actors and sectors?
  • Which mitigations are being implemented, and which are neglected?
  • How is the above changing over time?

The MIT AI Risk Initiative aims to provide credible, timely, and decision-relevant answers to these questions. Our core outputs include the risk repository, incident tracker, mitigations database, and governance map.

We are hiring a Senior Researcher to lead and strengthen our applied research workstreams. This role combines rigorous applied research with stakeholder engagement and project management. 

The initial focus is supporting a review of how major organizations worldwide are responding to AI risks. As the project grows, you will have the opportunity to initiate and lead additional workstreams. Your work will support policymakers, industry, civil society, and researchers seeking to understand and reduce AI risks.

What you’ll do
  • Evidence synthesis and measurement
     
    • Design and execute systematic reviews of organizational AI risk responses (search, screening, extraction, coding, and quality assurance).
    • Develop and maintain research protocols, codebooks, and documentation to ensure results are reproducible and updateable over time.
    • Analyze qualitative and quantitative data and synthesize findings into clear conclusions.
  • Surveys and expert input
     
    • Design and field surveys to gather structured input from relevant populations (for example, experts, practitioners, or organizations).
    • Analyze results and integrate them with evidence from literature and documents.
  • Research outputs and decision support
     
    • Write and disseminate research outputs for both technical and applied audiences (datasets, memos, briefs, and publications).
    • Translate findings into practical decision-support tools for end users (for example, structured datasets, frameworks, and guidance materials).
  • Stakeholder engagement
     
    • Engage stakeholders across government, industry, civil society, and research to understand decision needs and ensure outputs are usable.
    • Support external meetings, briefings, and workshops; communicate results clearly to non-specialist audiences.
    • Help manage relationships with collaborators, funders, and end users, including responding to inquiries and coordinating inputs.
  • Project delivery and operations
     
    • Plan and deliver workstreams end-to-end, including scoping, timelines, resourcing, and risk management.
    • Manage project coordination logistics and maintain clear process documentation.
    • Track budgets and contractor spend where relevant; support procurement and payments in coordination with MIT processes.
  • Grants and funding support
     
    • Contribute to grant and proposal development (narrative sections, workplans, budgets, and supporting materials).
    • Support funder updates and reporting by translating progress into clear milestones, outputs, and next steps.
  • Lab participation
     
    • Participate actively in the MIT FutureTech research community by attending lab meetings, sharing updates on workstreams, and contributing feedback on related projects.
    • Collaborate with other lab members to align methods, improve research quality, and identify new research opportunities.
  • Team leadership
     
    • Manage and mentor junior researchers.
    • Coordinate work with internal and external contributors (including contractors where relevant).
Supervision Received
  • Reports to the Director of the MIT AI Risk Initiative, Alexander Saeri.
  • Works under general oversight with direction on non-routine issues  
Supervision Exercised
  • May guide the work of internal and external project support staff and writers
  • May provide coaching and on-the-job training 
Qualifications & Skills

Minimum Required Education and Experience

  • 5+ years experience in applied research methods
  • Publications or research output in an applied social science (e.g., economics, psychology, behavioral science) or relevant field
  • Demonstrated ability in conducting systematic reviews and surveys.
  • Experience supervising others and leading research projects, programs, or functions
  • In-depth understanding of principles and practice of research
  • Prior experience in consulting, project management or operations, preferably in research, academic, or technology-oriented environment
  • Strong analytical skills with both qualitative and quantitative data
  • Stakeholder engagement experience, such as working with clients, partners, funders, or end users to understand needs and communicate results clearly.
  • People leadership experience, including supervising, mentoring, or coordinating junior researchers and collaborators.
  • Operational competence in a research, academic, consulting, or technology-oriented environment (for example maintaining process documentation, coordinating vendors/contractors, and navigating administrative workflows).
  • Comfort with budgets and resourcing, such as tracking spend against a plan, managing contractor time, or supporting financial reporting (depth can vary; we are looking for practical fluency).

Preferred Education and Experience

  • PhD degree
  • Honours degree or higher in an applied social science (e.g., psychology, behavioral science) or relevant field
  • Grant writing experience
  • Experience producing decision-focused outputs (for example policy briefs, executive memos, toolkits, or structured evidence summaries).
  • AI Risk or AI Safety expertise
Other information
  • One year term based on research grant funding.
  • Work can be on-site or remote. We have a strong preference for candidates who have a significant time zone overlap with Australia.
  • Full-time is preferred, but part-time commitments will also be considered.
Selection process
  • Short test task
  • Interview
  • Potential paid work trial 
About MIT FutureTech 

MIT FutureTech is an interdisciplinary group of  economists, computer scientists, and engineers who study the foundations and economic implications of progress in computing and Artificial Intelligence.  Economic and social change is underpinned by advances in computing: for instance, improvements in the miniaturization of integrated circuits, the discovery and refinement of algorithms, and the development and diffusion of better software systems and processes. We aim to identify and understand the trends in computing that create opportunities or risks and help leaders in computing, scientific funding bodies, and government to respond appropriately. 

Our research therefore helps to answer important questions including: Will AI progress accelerate or decline – and should it? What are the implications for economic growth and for the labor markets? What are the bottlenecks to growth from AI, and how can they be solved? What are the risks from AI, and how can we mitigate them? 

To support our research, we run seminars and conferences to better connect the field of computer scientists, economists, and innovation scholars to build a thriving global research community. 

To disseminate it, we advise governments, nonprofits and industry, including via National Academies panels on transformational technologies and scientific reliability, the Council on Competitiveness’ National Commission on Innovation and Competitiveness Frontiers, and the National Science Foundation’s National Network for Critical Technology Assessment. 

Our work has been funded by Open Philanthropy, the National Science Foundation, Microsoft, Accenture, IBM, the MIT-Air Force AI accelerator, and the MIT Lincoln Laboratory. 

You will be working with Dr. Neil Thompson, the Director of MIT FutureTech. Prior to starting FutureTech, Dr. Thompson was a professor of Innovation and Strategy at the MIT Sloan School of Management. His PhD is in Business & Public Policy from Berkeley. He also holds Master’s degrees in: Computer Science (Berkeley), Economics (London School of Economics), and Statistics (Berkeley). Prior to joining academia, Dr. Thompson was a management consultant with Bain & Company, and worked for the Canadian Government and the United Nations.

How to apply

Please use this form to register interest in this role or to submit a general expression of interest.

Selected candidates will be first interviewed via Zoom. We are recruiting on a rolling basis and may close applications early if we find a suitable candidate, so please apply as soon as possible to maximize your chances.

** To comply with regulations of the Americans with Disabilities Act (ADA), the principal duties in position descriptions must be essential to the job. To identify essential functions, focus on the purpose and the result of the duties rather than the manner in which they are performed. The following definition applies: a job function is essential if removal of that function would fundamentally change the job.



Discuss
