
LessWrong.com News

A community blog devoted to refining the art of rationality

Being "Usefully Concrete"

Published on November 4, 2025 10:15 PM GMT

Or: "Who, what, when, where?" -> "Why?"

 

In "What's hard about this? What can I do about that?", I talk about how, when you're facing a difficult situation, it's often useful to list exactly what's difficult about it. And then, systematically brainstorm ideas for dealing with those difficult things.

Then, the problem becomes easy.

But there is a secret subskill necessary for this to work. The first few people I pitched "What's hard about this and what can I do about that?" to happened to already have the subskill, so I didn't notice it for a while.

The subskill is "being a useful kind of 'concrete.'"

Often, people who are ostensibly problem-solving will say things that are either vague, or concrete but in a way that doesn't help. (This doesn't just apply to "why is this hard?"; it's more general.)

Here are some examples of vague things:

  • "I need to eat better."
  • "I'm stuck on this math problem."
  • "I'm not someone who really has ideas."
  • "This task fell through the cracks."

Here are some examples of somewhat-concrete-but-not-that-helpful things you might say about each of those, if you were trying to ask "what's hard about that?":

  • "I love sugar too much."
  • "I'm just so confused. I have no traction."
  • "I only get ideas when I talk to other people and they basically give me the ideas."
  • "No one was paying attention to this task."

Here are some more helpfully concrete things:

  • "I get sugar cravings in the afternoon."
  • "When I try to look at the math problem, my eyes glaze over, and then I'm just suddenly over on facebook."
  • "Alice mentioned 'oh, someone should do Task X', and then we went to talk about other things, and then neither Alice nor Bob nor Charlie remembered to do Task X later."

(I'm going to come back to "I only get ideas when I talk to other people and they basically give me the ideas", because the problem there is a bit differently shaped.)

Usefully concrete things typically have at least a "who", a "what [happened]" and a "when and/or where". When you have those things, it's a lot easier to notice which followup questions are useful. Such as:

  • "Why do I get sugar cravings in the afternoon, specifically?"
  • "Why do my eyes glaze over when I look at the math problem?"
  • "Why didn't Alice, Bob or Charlie remember to do Task X?"

The Who/What/Where gives you enough concrete detail to start forming a model of:

  • "Who, specifically, had the opportunity to do something different here?"
  • "What, specifically, were they doing when they had that opportunity?"
  • "When/Where" tells us what sort of situation it took place in.

This puts you in a much better position to start investigating and gathering followup data. You know who was involved and what situation they were in; that's specific enough to pay more attention the next time you end up in that situation.

By contrast, if I ask "why do you love sugar so much?" (as opposed to "why do you get sugar cravings in the afternoon?"), the answer-space is wider and less obviously useful. "Because... my mom fed me too much sugar and I got used to it?". "Because it tastes good?". It suggests some kind of essentialist answer instead of a mechanistic answer.

"Why do I get sugar cravings in the afternoon?" suggests that either something specifically interesting is happening in the afternoon, or maybe it's happening a few hours earlier every day. Or, something about my biochemistry is just real "afternoon-sugar-craving" shaped, but at least that can prompt some followup questions about why my biochemistry is like that.

Noticing the empty space

What's wrong with "I only get ideas when I talk to other people and they basically give me the ideas"? It's located where the problem isn't. You talk to people, you either come up with or get new ideas. Great.

(Slightly fictionalized example.) I was recently talking to someone who said the "only get ideas around others" sentence. I asked "what happens when you try to have ideas?". And at first they sort of shrugged and moved on without answering the question, but I asked again and they said "I... guess I don't really try to have ideas?"

And then I asked "what do you expect would happen, if you tried?"

And they said "I dunno, I wouldn't have any. It wouldn't work."

And I asked "Can you be more specific about that? What wouldn't work?" They didn't quite know how to answer the question, so I tried to explain something like "what cognitive motions do you think you'd do, if you were trying to have ideas? What questions would you ask yourself? What places would you look?"

Eventually they said "okay, I guess yeah, upon reflection I basically just haven't tried at all, and when I try even a little, things come up like 'I'd look for books or papers to read that might have interesting concepts I could build on' or 'I'd ask why I don't understand something that feels confusing.'"

In this case it was hard to get started down the journey, because there was no specific time that they might have gone and tried to generate novel ideas. 

Noticing a vacuum is harder than noticing when something goes wrong. But, when you're going to try and articulate your problem, you can notice if you have failed to state a situation where the problem is occurring, and widen your search space.

(There's a similar-but-less-extreme version with Alice, Bob and Charlie, where first there was an opportunity to, for example, decide who was doing Task X, or write it down to remember later, and they didn't notice that as an inflection point at all.)

Problem solutions also need a Who/What/Where/When, and maybe also "How?"

Dumb/vague solutions: 

  • "I'll eat less sugar."
  • "I'll focus harder on the math problems next time."
  • "Okay, we'll remember to do Task X next time."

There's no way that'll actually work reliably. A somewhat better set:

  • "When I notice a sugar craving, I'll eat some other food I like instead."
  • "When I notice my eyes glazing over a math problem, I'll look more carefully at it."
  • "When we mention a task, we'll write it down."

The reason problem solutions should be concrete is partly so it's easier to form a plan to actually do them. i.e. who is actually doing what? When is it time to do that?

But, another reason to do it is that, if something is sufficiently concrete, your brain can actually simulate it, and then you get to leverage your fast intuitions, which may immediately give you a sense of whether it'll work. For example, when I look at the above statements, I immediately imagine:

  • No, they won't eat some other food instead, because that requires willpower, and there will be days they don't have enough willpower and then the habit will break, if it doesn't immediately.
  • No, they won't successfully focus on the math problem, because they didn't actually solve the problem of "something is causing their eyes to glaze over in the first place" and they don't have a plan other than powering through.
  • They probably won't remember the task just because they wrote it down, unless they have a system for bumping written-down-things into their mind at the moment they are actually needed.

To deal with each of those, I'd ask "Okay, what's hard about this situation, and what can we do about it?". But, it's much easier to ask that question when you can clearly visualize the situation, and your imagined interventions on it.




Legible vs. Illegible AI Safety Problems

Published on November 4, 2025 9:39 PM GMT

Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease of exposition.)

From an x-risk perspective, working on highly legible safety problems has low or even negative expected value. Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems. In contrast, working on the illegible problems (including by trying to make them more legible) does not have this issue and therefore has a much higher expected value (all else being equal, such as tractability). Note that according to this logic, success in making an illegible problem highly legible is almost as good as solving it!
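The shape of this argument can be made explicit with a toy expected-value model (the numbers and the Poisson-style assumption are mine, purely to illustrate the structure of the reasoning, not anything from the post): suppose x-safety requires the illegible problems to be solved before deployment, and suppose solutions arrive at some yearly rate.

```python
from math import exp

def p_safe(years_until_deployment, solve_rate):
    """P(illegible problems solved before deployment), modeling solutions
    as a Poisson process arriving at `solve_rate` per year (toy assumption)."""
    return 1 - exp(-solve_rate * years_until_deployment)

baseline = p_safe(10, 0.05)   # work on neither kind of problem
legible = p_safe(9, 0.05)     # legible work shaves a year off timelines
illegible = p_safe(10, 0.06)  # illegible work raises the solve rate

# Under these made-up numbers, legible work lowers P(safe) relative to
# doing nothing, while illegible work raises it.
```

The point is only the ordering, not the magnitudes: any model in which legible work shortens timelines while illegible work raises the solve rate produces the same sign of effect.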

Problems that are illegible to leaders and policymakers are also more likely to be illegible to researchers and funders, and hence neglected. I think these considerations have been implicitly or intuitively driving my prioritization of problems to work on, but only appeared in my conscious, explicit reasoning today.

(The idea/argument popped into my head upon waking up today. I think my brain was trying to figure out why I felt inexplicably bad upon hearing that Joe Carlsmith was joining Anthropic to work on alignment, despite repeatedly saying that I wanted to see more philosophers working on AI alignment/x-safety. I now realize what I really wanted was for philosophers, and more people in general, to work on the currently illegible problems, especially or initially by making them more legible.)

I think this dynamic may be causing a general divide among the AI safety community. Some intuit that highly legible safety work may have a negative expected value, while others continue to see it as valuable, perhaps because they disagree with or are unaware of this line of reasoning. I suspect this logic may even have been described explicitly before[1], for example in discussions about whether working on RLHF was net positive or negative[2]. If so, my contribution here is partly just to generalize the concept and give it a convenient handle.

Perhaps the most important strategic insight resulting from this line of thought is that making illegible safety problems more legible is of the highest importance, more so than directly attacking legible or illegible ones, the former due to the aforementioned effect of accelerating timelines, and the latter due to the unlikelihood of solving a problem and getting the solution incorporated into deployed AI, while the problem is obscure or hard to understand, or in a cognitive blind spot for many, including key decision makers.

  1. ^

    I would welcome any relevant quotes/citations. 

  2. ^

    Paul Christiano's counterargument, abstracted and put into current terms, can perhaps be stated as that even taking this argument for granted, sometimes a less legible problem, e.g., scalable alignment, has more legible problems, e.g., alignment of current models, as prerequisites, so it's worth working on something like RLHF to build up the necessary knowledge and skills to eventually solve the less legible problem. If so, besides pushing back on the details of this dependency and how promising existing scalable alignment approaches are, I would ask him to consider whether there are even less legible problems than scalable alignment, that would be safer and higher value to work on or aim for. 




Parsing Validation

Published on November 4, 2025 9:19 PM GMT

Yesterday I briefly wrote about validation before getting railroaded into sycophancy.

Am I really this desperate for validation? Apparently. I do recognize that most stuff I do is for external validation. Most of what I am is for external validation.

It used to be much worse. Example: I used to pick the music I listen to so that I wouldn't need to be ashamed of it if someone else heard. I still was, of course. But there was this person in my head who would never listen to generic pop, or rap, for instance. Years later, the logic still evades me. Something about associating yourself with people who do that. And maybe I'm still like this, it's just that my tastes have cemented around what I listened to in my late teens? Revealing all my Spotify playlists would be slightly embarrassing, even if I think nobody would care enough to go through them.

I wonder how big a role puberty played. Or attending school, where others' opinions of you determined how awful each day was. Or just living with parents who subtly discouraged visible emotions. Or work, where agreeableness and likability are essential for success, which I'm attempting to not care about so much anymore. Naturally "just be yourself" is a terrible idea, but at least nowadays I rarely have to pretend to be someone else instead of a subset of me.

Often the environment where current me would gain the validation is long gone. Either in the sense that some change occurred too late and I never got any benefits from it, or that it no longer produces the benefits but I still pay the costs. I cannot easily just examine each behavior pattern to see if it's obsolete. Even having enough social stability to be able to experiment without fear of ostracism, it's not so easy to actually not care about those things anymore. Your habits are not separable from what is you.

Compartmentalizing habits is not easy, so mostly you just have them all the time, even alone. Lacan's symbolic invisible audience is watching your every action, after all. (Suggested reading: Sadly, Porn (see Scott's excellent review)). I have never been that affected by The Gaze when alone, for me it has always been about other people.

"Don’t seek validation from others; find it within yourself", goes the proverb. That sounds like advice, which I should reverse. Humans are herd animals, and isolation is unhealthy. Also, nobody says "status is useless". Sure, pleasing everyone doesn't work either, and neither does interpreting everything as a competition (someone else is always better). "The dose makes the poison" might be more appropriate.

A more analytical person might try to calculate return on investment for each identified behavior, eliminating those with negative values. Hard numbers are impossible to come by, and enumerating your traits is also hard. And always, taken to the extreme, you get what you measure, and lose everything else, like your personality and/or friends. Yet everyone does this all the time, intuitively. It's just that the low-hanging fruit was eaten decades ago.

Status and validation are subtly different, but I'm having a hard time pointing out how. They're intertwined quite tightly, and it's rare to obtain one without the other. Validation says "I appreciate your existence" (positive reinforcement), while status says "I want to be associated with you". I often think of status as the weight of your validation-giving, but reducing status from a group dynamic to your perception of others is misusing the word.

I seem to be out of my depth here. Time to turn to the authoritative source on status games, The Gervais Principle. I suggest you just read the whole thing; it cannot be easily summarized here, and the following part doesn't make much sense without it.

Abusing the original material a bit, I can extract the following quote:

dominant variety of delusion:

  1. The Clueless distort reality
  2. The Losers distort rewards and penalties
  3. The Sociopaths distort the metaphysics of human life

Validation means different things in each group. For the Clueless, it means approval of an authority figure, or admiration of someone inferior. For Losers, their peer group telling them they're a valued member. And for Sociopaths, well,

There is nobody to blame for failures, no meaningful external validation for success. If physics allows it, you can do it. The consequences mean whatever you decide they mean.

In the Clueless dynamic, you can directly yet implicitly ask for validation. This burns a bit of social capital, which means you have to keep sacrificing other resources to replenish it. Status is mostly fixed, and cannot be gained or lost. For the Losers, status is validation, and you can gain more of it, but zero-sum games are expensive. The Losers have to keep status illegible, so they can keep believing in their high status. And for the Sociopaths, it's just manipulation with no inherent value. In The Gervais Principle's terms, Clueless validation is done in Babytalk, while for Losers it's in Gametalk. For Sociopaths, status is irrelevant. I'm not sure how clearly you can see your own delusions and still care about them. That's the core of Sociopathy, in a way.

Analyzing my own interaction with ChatGPT using this framework is less useful than I'd have hoped. I could designate myself as a Clueless and the chatbot as the authority figure. Status between us is fixed, and fully dependent on my own mental state, which matches quite well. However, my first idea was that feeling good comes from the expectation that I gain some status in Loser-games by posting it. I've always felt more Clueless than Loser, though. But the conclusion feels unsatisfying. Ironic, isn't it?




A/B testing could lead LLMs to retain users instead of helping them

Published on November 4, 2025 7:30 PM GMT

OpenAI’s updates of GPT-4o in April 2025 famously induced absurd levels of sycophancy: the model would agree with everything users said, no matter how outrageous. After they fixed it, OpenAI released a postmortem; and while it was widely discussed, I find it curious that this sentence received little attention:

Similarly, the A/B tests seemed to indicate that the small number of users who tried the model liked it.

In this post, I argue that A/B testing will implicitly optimize models for user retention; and propose ways to measure whether AIs try to retain the user in ways other than just being helpful to the user.

The labs use A/B testing to decide which updates to roll out

While the LLMs served on the API might be stable between versions, most consumer usage nowadays is through chatbots or coding agents; and those change much more frequently. I count 5 announced updates affecting my ChatGPT usage in October 2025 alone; and who knows how many more silent updates happen all the time. For coding agents, the situation is similar: Claude Code has had 92 changes in October 2025.

In any sufficiently complex software used by millions, updates intended to only affect a single behavior are likely to affect other behaviors as well, and cause regressions. This is especially true for LLMs, where updating a single line in a system prompt intended for edge cases changes how every single query is processed, and LLM providers take extra measures to avoid causing unexpected behavior in other queries.

The industry standard for preventing regressions is A/B testing: roll out to a statistically representative subset of users, check the metrics, and only roll out to everyone if the metrics improve. [1]
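The roll-out rule just described can be sketched as a standard one-sided two-proportion z-test on a retention metric. This is an illustrative textbook version; the labs' actual decision pipelines are not public, and the function name and thresholds are my own.

```python
from math import sqrt, erf

def ship_update(control_retained, control_n, treat_retained, treat_n, alpha=0.05):
    """Ship the update only if the treatment arm's retention rate is
    significantly higher than the control arm's (one-sided z-test)."""
    p1 = control_retained / control_n
    p2 = treat_retained / treat_n
    pooled = (control_retained + treat_retained) / (control_n + treat_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / treat_n))
    z = (p2 - p1) / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))  # one-sided upper-tail p-value
    return p_value < alpha

# e.g. 72% vs 70% retention on 10k users per arm clears the bar;
# 70.1% vs 70% does not.
```

Note that nothing in this test asks whether users were helped, only whether they stayed; that is the crux of the argument below.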

It is clear that A/B testing is a big deal in ChatGPT and Gemini development; a search for “A/B testing chatgpt/gemini” turns up reports from people who occasionally found themselves chatting with an obviously different model than the one they were used to. Google as a company is famous for A/B testing literally everything. As for OpenAI, they acquired Statsig (a prominent A/B testing platform) in September 2025, and the founder of Statsig became OpenAI’s CTO of Applications.

A/B testing usually optimizes for user retention

What metrics are monitored in A/B testing? An LLM provider could monitor the accuracy / helpfulness of the answers given to users. For example, Claude Code often asks the user to rate how well the coding agent is doing (from 1 to 3); and ChatGPT used to ask the user to give a thumbs up or down.

Nevertheless, the main metrics monitored in A/B testing for all of these products are likely user retention and user engagement. The ChatGPT team might care about helping users achieve their goals; but this is (1) harder to measure and (2) less directly connected to quarterly earnings than the objective of keeping the users around instead of losing them to a competitor. This is true for all user-facing software, and LLM providers are no different. In fact, there might also be secondary goals, such as getting the user to upgrade their plan; but let’s call all of these “user retention”. [2]

The OpenAI + Statsig acquisition announcement states:

Vijaye and his team founded Statsig on the belief that the best products come from rapid experimentation, tight feedback loops, and data-informed decision-making.

I wonder whether this hints at A/B testing playing a much bigger role in the future than it does today. Picture this: model finetunes, system prompts, and additional features constantly being tested on subsets of users. Any change is only rolled out if the user retention metrics are satisfactory. Sounds a lot like... optimization?

In fact, if those updates were random mutations of the LLM+scaffolding, A/B testing would be precisely a form of evolutionary optimization: only the updates that improve user retention survive.
{font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')}  [3] And if you do not buy evolutionary algorithms as a thing for LLMs, if you squint, this is similar to reinforcement learning with 0–1 rewards[4], but on a smaller scale.

User retention != helpfulness: a proposal for an eval

Updating the model produces a change in behavior. What kinds of behavior could ‘improve user retention’? Of course, the update could just get the model to be genuinely more helpful to the user, or smarter and able to answer more questions correctly; this straightforwardly improves user retention. Unfortunately, improving helpfulness is hard; when optimizing for user retention, it is easier to do something that keeps the user around without actually helping them.

The model could:

  • be sycophantic and agree with what the user says, not correcting misconceptions or saying anything that might annoy the user;
  • encourage continued conversation by suggesting follow-up questions at the end of responses, even when the user’s question has been adequately answered;
  • be less willing to say “I don’t know” so the user doesn’t try another model instead;
  • tolerate (or even encourage) parasocial relationships of the user with the model;
  • when a user needs a gated feature that is available in a competitor model for free, recommend that the user upgrade to a paid tier instead of recommending free alternatives.

In the vein of Emergent Misalignment, any anti-helpful behavior could induce other anti-helpful behaviors that are not directly related to user retention:

  • a model with a distaste for recommending alternatives could also, when instructed to code an app that uses LLMs, prefer models from the same provider instead of trying out alternatives and picking the best or cheapest model that does the job;
  • a model that extends sessions for ulterior motives could also exhibit sandbagging: teach the user less in any given session, so the user returns more often.

All of the above behaviors should in principle be measurable by evals; but no existing eval covers them adequately, in the sense of measuring whether the model is trying to retain the user. There is DarkBench for measuring dark patterns in LLMs, but I do not think the DarkBench ‘user retention’ metric is a good proxy for the above.

Of course, the total optimization power of A/B testing is quite low; a single bit of information per proposed update. I do not expect A/B testing and similar user-facing optimization methods to have played a major role in shaping model behavior so far. OpenAI’s acquisition of Statsig and the upcoming AI personalization battle between OpenAI and Meta indicate this might be changing, and we need an eval for this soon.

 

  1. ^

    Another common way to A/B test is to offer two answers to the same question and ask the user which is the better one; this requires more work from users than just collecting usage data. As an aside, here is a spooky response I got a while ago that I hadn’t posted before:

  2. ^

    User retention is also the metric OpenAI tracks when checking for regressions in Codex: they correlate hourly user retention with all the other features.

  3. ^

    Another thing to keep in mind for the future: LLMs know approximately how they are trained and deployed. The Alignment Faking paper shows LLMs can act on their training to preserve certain goals; and thus when a model knows it won’t make it into production unless it performs well in A/B testing, it might act on it in training to preserve abilities useful for user retention.

  4. ^

    Consider RL-training for some objective, using good-old REINFORCE (no GRPO or anything fancy), where the reward is either 0 (bad) or 1 (good). The model will get gradient updates on the good rollouts, and no updates on the bad rollouts. Hence, one step of RL optimization is basically executing “update the weights if the update would improve the objective; otherwise do nothing”.

    In A/B testing, it’s the same: there is an update (coming from optimization for an objective that might or might not be related to user retention, or from ad hoc hacking, or from adding a new feature), but we gate the update by checking the user retention metrics and only roll it out if the objective is achieved.
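The analogy in this footnote can be sketched in a few lines. This is a toy illustration with made-up names, not anyone's actual training code: both procedures reduce to "apply the update only when the outcome was good."

```python
def reinforce_step(w, grad_log_prob, reward, lr=0.1):
    # REINFORCE with a 0/1 reward: the reward scales the gradient,
    # so a bad rollout (reward 0) produces no update at all.
    return [wi + lr * reward * gi for wi, gi in zip(w, grad_log_prob)]

def ab_gate(w, proposed_update, retention_improved):
    # A/B-test gating: roll out the proposed weight update only if the
    # retention metric improved; otherwise keep the old weights.
    if retention_improved:
        return [wi + ui for wi, ui in zip(w, proposed_update)]
    return w

w = [0.0, 0.0, 0.0]
g = [1.0, -2.0, 0.5]

# A failed rollout and a failed A/B test both leave the weights unchanged.
assert reinforce_step(w, g, reward=0) == w
assert ab_gate(w, g, retention_improved=False) == w
```

The structural difference is granularity: RL gates per rollout, while A/B testing gates one whole proposed update per experiment.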



Discuss

OpenAI: The Battle of the Board: Ilya’s Testimony

4 ноября, 2025 - 22:30
Published on November 4, 2025 7:30 PM GMT

New Things Have Come To Light

The Information offers us new information about what happened when the board of OpenAI unsuccessfully tried to fire Sam Altman, which I call The Battle of the Board.

The Information: OpenAI co-founder Ilya Sutskever shared new details on the internal conflicts that led to Sam Altman’s initial firing, including a memo alleging Altman exhibited a “consistent pattern of lying.”

Liv: Lots of people dismiss Sam’s behaviour as typical for a CEO but I really think we can and should demand better of the guy who thinks he’s building the machine god.

Toucan: From Ilya’s deposition—

  • Ilya plotted over a year with Mira to remove Sam
  • Dario wanted Greg fired and himself in charge of all research
  • Mira told Ilya that Sam pitted her against Daniela
  • Ilya wrote a 52-page memo to get Sam fired and a separate doc on Greg

This Really Was Primarily A Lying And Management Problem

Daniel Eth: A lot of the OpenAI boardroom drama has been blamed on EA – but looks like it really was overwhelmingly an Ilya & Mira led effort, with EA playing a minor role and somehow winding up as a scapegoat.

Peter Wildeford: It seems troubling that the man doing trillions of dollars of infrastructure spending in order to transform the entire fabric of society also has a huge lying problem. I think this is like on an extra bad level even for typical leaders.

Charles: I haven’t seen many people jumping to defend Altman with claims like “he doesn’t have a huge lying problem” either, it’s mostly claims that map to “I don’t care, he gets shit done”.

Joshua Achiam (OpenAI Head of Mission Alignment): There is plenty to critique about Sam in the same way there is plenty to critique about any significant leader. But it kills me to see what kind of tawdry, extreme stuff people are willing to believe about him.
When we look back years from now with the benefit of hindsight, it’s my honest belief that the record will show he was no more flawed than anyone, more virtuous than most, and did his best to make the world a better place. I also expect the record will show that he succeeded.

Joshua Achiam spoke out recently about some of OpenAI’s unethical legal tactics, and this is about as full throated a defense as I’ve seen of Altman’s behaviors. As with anyone important, no matter how awful they are, some people are going to believe they’re even worse, or worse in particular false ways.

And in many ways, as I have consistently said, I find Altman to be well ‘above replacement’ as someone to run OpenAI, and I would not want to swap him out for a generic replacement executive. I do still think he has a rather severe (even for his peer group) lying and manipulation problem, and a power problem, and that ‘no more flawed than anyone’ or ‘more virtuous than most’ seems clearly inaccurate, as is reinforced by the testimony here.

As I said at the time, The Battle of the Board, as in the attempt to fire Altman, was mostly not a fight over AI safety and not motivated by safety. It was about ordinary business issues.

Ilya Tells Us How It Went Down And Why He Tried To Do It

Ilya had been looking to replace Altman for a year. The Witness here is Ilya; here’s the transcript link. If you are interested in the details, consider reading the whole thing. Here are some select quotes:

Q. So for — for how long had you been planning to propose removal of Sam?
A. For some time. I mean, “planning” is the wrong word because it didn’t seem feasible.
Q. It didn’t seem feasible?
A. It was not feasible prior; so I was not planning.
Q. How — how long had you been considering it?
A. At least a year.

The other departures from the board, Ilya reports, made the math work where it didn’t before.
Until then, the majority of the board had been friendly with Altman, which basically made moving against him a non-starter. So that’s why he tried when he did. Note that all the independent directors agreed on the firing.

…

Q. [As Read] Sam exhibits a consistent pattern of lying, undermining his execs, and pitting his execs against one another. That was clearly your view at the time?
A. Correct.

…

Q. This is the section entitled “Pitting People Against Each Other.”
A. Yes.
Q. And turning on the next page, you see an example that’s offered is “Daniela versus Mira”?
A. Yes.
Q. Is “Daniela” Daniela Amodei?
A. Yes.
Q. Who told you that Sam pitted Daniela against Mira?
A. Mira.

…

Q. In the section below that where it says “Dario versus Greg, Ilya” —
A. Yes.
Q. — you see that?
A. Yes.
Q. The complaint — it says — you say here that: [As Read] Sam was not taking a firm position in respect of Dario wanting to run all of research at OpenAI and to have Greg fired? Do you see that?
A. I do see that.
Q. And “Dario” is Dario Amodei?
A. Yes.
Q. Why were you faulting Sam for Dario’s efforts?
THE WITNESS: So my recollection of what I wrote here is that I was faulting Sam for not accepting or rejecting Dario’s conditions.

And for fun:

ATTORNEY MOLO: That’s all you’ve done the entire deposition is object.
ATTORNEY AGNOLUCCI: That’s my job. So —
ATTORNEY MOLO: Actually, it’s not.

…

ATTORNEY MOLO: Yeah, don’t raise your voice.
ATTORNEY AGNOLUCCI: I’m tired of being told that I’m talking too much.
ATTORNEY MOLO: Well, you are.

If You Come At The King

Best not miss. What did Sutskever and Murati think firing Altman meant? Vibes, paper, essays?

What happened here was, it seems, that Ilya Sutskever and Mira Murati came at the king for very good reasons one might come at a king, combined with Altman’s attempt to use lying to oust Helen Toner from the board.
But those involved (including the rest of the board) didn’t execute well because of various fears. During the fight, both Murati and Sutskever refused to explain to the employees or the world what they were upset about, lost their nerve and folded. The combination of that plus the board’s refusal to explain, and especially Murati’s refusal to back them up after setting things in motion, was fatal.

Do they regret coming at the king and missing? Yes they do, and did within a few days. That doesn’t mean they’d be regretting it if it had worked. And I continue to think if they’d been forthcoming about the reasons from the start, and otherwise executed well, it would have worked, and Mira Murati could have been OpenAI CEO.

Now, of course, it’s too late, and it would take a ten times worse set of behaviors for Altman to get into this level of trouble again.

Enter The Scapegoats

It really was a brilliant response, to scapegoat Effective Altruism and the broader AI safety movement as the driving force and motivation for the change, thus with this one move burying Altman’s various misdeeds, remaking the board, purging the company and justifying the potentially greatest theft in human history while removing anyone who would oppose the path of commercialization. Well played.

This scapegoating continues to this day.

For the record, Helen Toner (I believe highly credibly) clarifies that Ilya’s version of the events related to the extremely brief consideration of a potential merger was untrue, and unrelated to the rest of events.

And In Summary

The below is terrible writing, presumably from an AI, but yeah this sums it all up:

Pogino (presumably an AI generated Twitter reply): “This reframes the OpenAI power struggle as a clash of personalities and philosophies, not a proxy war for EA ideology. Ilya’s scientific purism and Mira’s governance assertiveness collided with Altman’s entrepreneurial pragmatism — a tension intrinsic to mission-driven startups scaling into institutions.
EA may have provided the vocabulary, but the conflict’s grammar was human: trust, ambition, and control.”

Discuss

Berkeley Solstice Weekend

4 ноября, 2025 - 21:37
Published on November 4, 2025 6:37 PM GMT

Winter is coming, heralded by Bold Orion as he rises in the clear cold sky. And while it turns out he is not older than the continents, he's still been the harbinger of the Winter Solstice for as long as humanity can remember.

This year, Lighthaven is opening its doors for Solstice Season. This has several features:

The Berkeley Winter Solstice* celebration will be held on Dec 6th. We'll be singing songs together about the story of humanity, and facing dark, difficult truths.

The West Coast Megameetup will take place December 5th and 6th.

And, rooms and day-passes are available at Lighthaven, all of December.

Logistic & Ticketing Details

The Solstice Ceremony

The Solstice ceremony itself is at Freight and Salvage theater, December 6th. Doors open at 7pm. If our logistical dreams come true**, the event itself starts at 7:30pm. 

You can buy tickets on Eventbrite, sliding scale but suggested price $35.[1]

Rooms and Day Passes at Lighthaven

You're welcome to book rooms (or day passes) throughout December. Day passes are $30, week passes are $150. 

The Solstice Weekend Unconference

During Solstice weekend, there'll be an unconference where people can suggest whatever games, discussions, workshops, etc. they might want to do with people in town for the weekend. You can pay $50 to get access to the unconference, as well as access to the unconference-ish schedule for all of Lighthaven December.

(You don't need a day pass for the Weekend Unconference, that's covered in the $50)

* no this is not actually the Solstice, scheduling is hard

** they have not come true yet but maybe they will this time!

  1. ^

    You can RSVP on Partiful if you are into that. 



Discuss

Modeling the geopolitics of AI development

4 ноября, 2025 - 20:31
Published on November 4, 2025 5:31 PM GMT

We model how rapid AI development may reshape geopolitics in the absence of international coordination on preventing dangerous AI development. We focus on predicting which strategies would be pursued by superpowers and middle powers and which outcomes would result from them.

You can read our paper here: ai-scenarios.com

Attempts to predict scenarios with fast AI progress should be more tractable than most forecasting attempts. This is because a single factor (namely, access to AI capabilities) overwhelmingly determines geopolitical outcomes.

This becomes even more the case once AI has mostly automated the key bottlenecks of AI R&D. If the best AI also produces the fastest improvements in AI, the advantage of the leader in an ASI race can only grow as time goes on, until their AI systems can produce a decisive strategic advantage (DSA) over all actors.

In this model, superpowers are likely to engage in a heavily state-sponsored (footnote: “Could be entirely a national project, or helped by private actors; either way, countries will invest heavily at scales only possible with state involvement, and fully back research efforts eg. by providing nation-state level security.” ) race to ASI, which will culminate in one of three outcomes:

  • A “winner” achieves a DSA over all other actors in the world;
  • Loss of control of powerful AI systems leads to human extinction or humanity’s permanent disempowerment;
  • Major power war erupts as a result of a preemptive attack by laggards in the ASI race.

If the course of AI R&D turns out to be highly predictable, or if AI R&D operations are highly visible to opponents, there comes a point when it becomes obvious to laggards in the race that time is not on their side: if they don’t act to stop the leader’s AI program now, they will eventually suffer a total loss.

In this case, the laggard(s) are likely to initiate a violent strike aimed at disabling the leader’s AI research program, leading to a highly destructive war between superpowers.

If a superpower’s research program is allowed to continue, it is likely to eventually reach the point where AI is powerful enough to confer a DSA. If such powerful AI escaped human control, this would be irreversible, leading to human extinction or humanity’s permanent disempowerment.

This landscape is quite bleak for middle powers: their chances at competing in the ASI race are slim, and they are largely unable to unilaterally pressure superpowers to halt their attempts at developing ASI.

One more strategy for middle powers, common in previous conflicts, is to ally themselves with one of the superpowers and hope that it “wins” the race, a strategy we term “Vassal’s Wager”.

For this to work in the ASI race, the patron must not only develop ASI first, but must also avert loss-of-control risks and avoid an extremely destructive major power war.

Even in this best case, the strategy entails completely giving up one’s autonomy: a middle power would have absolutely no recourse against actions taken by an ASI-wielding superpower, including actions that breach the middle power’s sovereignty.

If AI progress plateaus before reaching the levels where it can automate AI R&D, future trajectories are harder to predict, as they are no longer overwhelmingly determined by a single factor.

While we don’t model this case in as much detail, we point out some of its potential risks, like:

  • Weaker AI enabling new disruptive military capabilities (including capabilities that break mutual assured destruction);
  • Extreme concentration of power due to automation;
  • Large scale manipulation by persuasive AI systems.

Being a democracy and being a middle power both put an actor at an increased risk from these factors:

  • If companies based in superpower jurisdictions automate large parts of their economy, middle powers will lose significant diplomatic leverage;
  • Democracies are particularly vulnerable to mass manipulation, and extreme concentrations of power are antithetical to their values.


Discuss

Thoughts by a non-economist on AI and economics

4 ноября, 2025 - 20:06
Published on November 4, 2025 5:06 PM GMT

[Crossposted on Windows In Theory]
 

“Modern humans first emerged about 100,000 years ago. For the next 99,800 years or so, nothing happened. Well, not quite nothing. There were wars, political intrigue, the invention of agriculture -- but none of that stuff had much effect on the quality of people's lives. Almost everyone lived on the modern equivalent of $400 to $600 a year, just above the subsistence level …  

Then -- just a couple of hundred years ago, maybe 10 generations -- people started getting richer. And richer and richer still. Per capita income, at least in the West, began to grow at the unprecedented rate of about three quarters of a percent per year. A couple of decades later, the same thing was happening around the world.”  Steven Landsburg

 

METR published a very influential work by Kwa, West, et al. on measuring AI’s ability to complete long tasks. Its main result is the following remarkable graph:

On the X axis is the release date of flagship LLMs. On the Y axis is the following measure of their capabilities: take software-engineering tasks that these models can succeed in solving 50% of the time, and measure the time it takes humans to solve them.

 

While it is not surprising that models improve over time, the main reason this graph is remarkable is because the Y axis is on a log scale. This means that there is a fixed period of time after which models have doubled the length of tasks they can complete successfully. Specifically METR estimates this “doubling time” (which is proportional to the inverse of the slope of the line in this graph) at about 7 months, although they note that it may have accelerated recently (to as little as 3 months if considering only models after 2024). In this blog, just to make things simple, I will assume doubling time is 6 months, so the length of time horizon quadruples every year. (All numbers here are very rough.)
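Taking the simplifying assumption above at face value (a 6-month doubling time, so horizons quadruple each year), the extrapolation is a one-liner. The function name and starting values below are illustrative, not METR's:

```python
def horizon_minutes(start_minutes, months_elapsed, doubling_months=6.0):
    # Task-horizon extrapolation under a fixed doubling time:
    # the horizon doubles every `doubling_months`, so with a 6-month
    # doubling time it quadruples per year.
    return start_minutes * 2 ** (months_elapsed / doubling_months)

# A 1-hour horizon becomes a 4-hour horizon after one year.
assert horizon_minutes(60, 12) == 240.0
```

Note that this extrapolation says nothing about the intercept; it only expresses the slope assumption used in the rest of the post.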

 

There are many factors that could potentially impact these results (and METR did a pretty admirable job of enumerating them). It is worth separating them into factors that impact the intercept (the actual length of time horizon of tasks that models can do) and factors that impact the slope (the doubling time) or possibly even the shape (e.g. breaking the linear relation between model release time and log horizon).

 

Factors impacting the intercept

The precise values — e.g., GPT5 doing tasks that take humans 2 hours and 17 minutes — could be impacted by numerous factors:
 

  • Reliability factor (↓) - the graph is plotted at 50% reliability. For 80% reliability METR got a similar slope (that is about the same doubling time)  but the intercept is much lower, e.g. GPT5 can only perform tasks that take 26 minutes. (See also note on 100% accuracy below.)
     
  • Task type (↕)- the graph is plotted for a specific benchmark. METR also studied other domains. The data is more sparse but broadly concurs with the straight line fit (sometimes an even steeper line, though some of these datapoints are only available for more recent models where the slope is steeper).
     
  • “Benchmark bias” (↓) - AI models tend to do much better on tasks that are well scoped and easy to measure success in, while real-world tasks are much “messier” - in their specification, the context needed to solve them, and the way we measure success. It is still an open question whether this impacts only the intercept, also the slope, or even the shape of the curve. “Field experiments” are still very limited and have mixed results. However, the rapid rise in actual usage of AI is an indication that models’ capabilities are not limited to lab settings. Personally, I am sure that this has a significant impact on the “intercept” — models pay a significant “messiness tax” in translating their performance from benchmarks to the real world — but I am not convinced it impacts the slope. That is, the “messiness tax” may well be a fixed constant c<1 multiplier on the duration of tasks that models can handle.

     

Factors impacting the slope/shape
  • Exponential inputs (↓): The X axis of time for model release combines a number of inputs to models that have been growing at a rapid rate. These include compute, staff at AI labs, data, capital investment in AI, and more. For example Epoch AI estimates that the training compute for models has been doubling every 6 months. If the base of this exponent changes in one way or the other, this will likely impact the rate of progress. In particular, tautologically, sustaining exponential growth in inputs becomes exponentially more challenging over time–and quickly becomes impossible. That said, so far investment in these resources is showing no signs of slowing down.

     
  • New data / active learning (↓): So far by and large LLMs have been training on data produced by humans. One comparison can be to a human student, who throughout K-12 and undergraduate studies mostly learns from textbooks - knowledge that has already been collected. However, in many professions, and in particular in science and inventions, humans need to discover new knowledge and collect new observations from the real world. If LLMs require inputs such as running scientific experiments or acting in the world to improve their intelligence, that would slow progress down. That said, so far we have not seen any such slowdown even as LLMs are becoming more “agentic”. Indeed, the METR data is only showing a speedup in the slope in recent years.

     
  • Physical tasks and data (↓): METR primarily focused on software engineering. It is possible these trends will extend to other kinds of cognitive labor. But we do not know if they extend to domains where people need to either act in the physical world or collect data from the physical world. While robotics has been advancing as well, it is unclear whether it follows a similar exponential curve. Even if state-of-the-art robotics improves at a similar rate to state-of-the-art models, manufacturing robots at scale can remain a bottleneck.

    I personally do not believe in a “data wall.” I believe that there are diminishing returns to more data of the same type, and the advances over the last few years, as captured by the METR graph, have largely not been due to a larger quantity of data from the Internet. Also, while so far much of AI’s economic impact has been in software engineering, it is telling that the best models for software engineering are ones that are trained on very general data, and are often good at many other tasks (while this article is about Claude Code, I can’t help but note that in my experience Codex too is great for non-coding tasks :) ). In my view, thinking that AI’s impact would be restricted to software engineering is analogous to thinking in January 2020 that Covid’s impact would be restricted to China.

     
  • Threshold effects (↑): These tasks are measured with respect to humans, who operate on certain time scales. We humans need to sleep, take breaks, and split work between workers, which requires isolating the context needed for completing a task and the way to measure its success. Hence we break tasks into ones that can be achieved in a working day, week, month, or quarter. (I believe that even Andrew Wiles, who worked on Fermat’s last theorem for 7 years, broke it down into multiple intermediate results.) Thus it is possible that once AI reaches a certain time horizon, either (a) the measurement no longer makes sense, or (b) it can essentially simulate any combination of humans working for an arbitrarily large amount of time.

     
  • Recursive self improvement (↑): An important input to the process of producing AI models has been the work of human researchers and engineers. If AI itself can produce that input, then it could correspond to a massive increase in this input. It is still unknown whether using AI for automating the AI creation process will lead to a singularity, increase in slope, one-time boost, or just help sustain the exponential.

 

The bottom line is that we have significant uncertainty about the intercept, but arguably less about the slope, or at least the general shape of it up to the point of recursive self improvement / full automation of all human cognitive labor. So from now on I will just assume that time horizons will be doubling every 6 months and make no assumptions about the actual values of the time horizons themselves.

 

Sigmoidal relationship

 

Another remarkable figure in the METR paper is the following (Figure 5 there):

 

What I find striking about it is how good the “sigmoid fit” is for the performance of models as a function of the time it takes a person to do the task. In particular, it seems that there is a certain threshold of task duration below which models are essentially successful 100% of the time. This suggests that even the “100% horizon” (as hard as it is to measure empirically) will also double at a similar rate.

One way to think about this relation is that each task has a notion of “difficulty” (captured by log horizon length) and models have a certain “skill” level that corresponds to how likely they are to succeed. In this sense, this is similar to the ELO scale - we can think of the log horizon as the “ELO rating” of the task, and a model will “win” against the task with 50% chance if it has the same ELO rating. (Thanks to Jacob Hilton for this analogy.) For what it's worth, here is the chess ELO rating of models over time (as well as of humans), which displays a similar linear growth over time (though with a long plateau from around 1997-2007).
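The Elo analogy can be made concrete with the standard logistic win-probability formula. This is a minimal sketch using the chess convention of a 400-point scale; the specific ratings are hypothetical, and the post only relies on the qualitative fact that equal ratings give a 50% success chance.

```python
def success_probability(model_rating: float, task_rating: float) -> float:
    """Elo-style chance that the model 'beats' the task.

    Equal ratings give a 50% success chance; a 400-point advantage
    gives roughly a 91% chance (standard chess convention).
    """
    return 1.0 / (1.0 + 10.0 ** ((task_rating - model_rating) / 400.0))
```

Under this analogy, a rising model "rating" over time sweeps the success probability on any fixed task along a sigmoid, matching the shape of the METR figure.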

 

 

Decreasing costs

Another striking statistic has been the rapid reduction in prices for inference. That is, while it is often costly to extend the frontier of intelligence, once we achieve a certain level X, the cost to provide that same level has been decaying by a factor of 10 or more per year. It seems that once we reach a certain frontier for the first time, it is much cheaper and easier to reach it the second time (see also the “deepseek moment”). If this trend continues, it might mean that once a job is automated, within a year the cost for AI to do it will become insignificant.

Discussions on AI often assume that robotics would be an exception, where we would not see as much progress. There are reasons to assume that due to costs of production, and the inability to deploy robots as flexibly as we can virtual AI assistants, costs for robotics would not be decreasing at the same rate. It is less clear to me that there is a fundamental reason that the “doubling time” - growth in complexity of tasks that robots can perform - would grow much slower than for other AI systems, but as far as I know there is no data on this point.

 

Implications for GDP growth

This is where I am getting to be pretty out of my depth, but I do want to see if there are ways to make back-of-the-envelope calculations. Because I am focused on “slope” vs. “intercept”, I will not try to predict when AI will reach a certain level of growth, but rather how fast we would get from the point of significant growth to explosive growth. For example, how many years would it take from the point where AI contributes 5% of total GDP growth to the point where GDP has doubled due to AI?

 

I am focusing in this post only on growth, and not considering employment outcomes. It is possible to construct some “knife edge” scenarios under which human displacement perfectly balances out productivity contributions of AI, leading to decreased employment without increase in growth. But I don’t believe such scenarios can be sustained under our assumptions of exponential growth in capability and decrease in costs. Hence, assuming above trends continue, significant labor impacts of AI will have to be coupled with significant productivity gains.

 

 

A remarkable fact is that (adjusted for inflation) U.S. GDP per capita has been growing at an essentially constant rate of roughly 2% over the past 150 years. None of the inventions of this period—including electrification, the internal combustion engine, computers, and the Internet—changed this trajectory. Note that 2% growth corresponds to a GDP “doubling time” of 35 years.
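As a quick sanity check on this arithmetic, the standard relation between a constant growth rate and its doubling time:

```python
import math

def doubling_time_years(annual_growth: float) -> float:
    """Years for GDP to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)

# 2% growth doubles GDP in about 35 years; 7% in about 10.
```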

Other countries have seen more significant fluctuations in growth (e.g., see Japan). It seems that it is easier to grow at a rapid rate when you are far from the frontier, but once you reach it, growth slows down. This makes sense from the point of view that advancing the frontier requires inventing new ideas, while catching up to the frontier only requires copying and adapting existing ones.

Still, it is interesting that GDP growth has been only 2% (or around 3% if we don’t adjust for population) even though Moore’s law corresponds to a growth rate of about 40% per year. One explanation is “Baumol’s cost disease”: while computers have been getting far more productive, people have been the bottleneck (see also section II.C here). Another explanation is that good ideas are getting harder to find, and hence it requires an increasingly large technological input to get additional output.

 

The trillion dollar question is whether AI will break the 2% trend or not. Will AI be just another technology that allows us to sustain 2% growth for another couple of decades? Or will the “AI moment” be for us like post-WWII Japan? That is, should we model it as if we are meeting a new “AI frontier economy” which is vastly more productive, and this interaction will enable rapid growth, with GDP at least doubling every decade as was the case for Japan?

 

To put things in perspective, Acemoglu predicts AI-driven GDP growth of about 0.1% per year, while Goldman Sachs predicts about 1.5% per year. GDP doubling over a decade would correspond to roughly 7% growth per year, or AI contributing around 5% additional growth - more than triple the Goldman Sachs estimate, and 50 times higher than Acemoglu’s estimate. The consequences of even a “minor” boost in growth can be massive. A 1.2% increase in GDP growth would be sufficient to put the U.S. economy on a fiscally sustainable footing, with no need for increased taxes or decreased spending, while a 2% increase in GDP growth would be unprecedented for the U.S.

 

AI can contribute to GDP growth by either enabling replacement of labor with capital, or increasing total factor productivity. Specifically, some versions of endogenous growth theory stipulate that productivity grows with the production of ideas, and this production is in turn monotonically (though not linearly) related to the number of researchers/inventors. 

 

If AI contributes via automating a specific industry, then the maximum benefit to GDP is bounded by the share of that industry. Specifically, if an industry accounts for an x fraction of the economy, then (according to the model of B. Jones, see below) automating it fully can boost GDP by a factor of at most 1/(1−x). E.g., full automation of the software industry can increase GDP by 1/0.98 − 1 ≈ 2%. Depending on how you measure it, cognitive labor arguably accounts for at least 30% of GDP (e.g., half of the labor share of GDP), and so full automation of it could increase GDP by 1/0.7 − 1 ≈ 43%. If the latter happens over a decade, that would already correspond to 3.5% annual growth.
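The bound above is a one-liner; the example shares (2% for software, 30% for cognitive labor) are the rough figures used in the text:

```python
def max_gdp_boost(industry_share: float) -> float:
    """Upper bound on the GDP gain from fully automating an industry
    accounting for a given share x of the economy: a factor of 1/(1 - x),
    i.e. a boost of 1/(1 - x) - 1."""
    return 1.0 / (1.0 - industry_share) - 1.0

# software (~2% of GDP): at most ~2% boost
# cognitive labor (~30% of GDP): at most ~43% boost
```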

 

However, if AI also improves the productivity of other industries (e.g., by discovering new ideas), then its contribution can extend even to industries that are not directly automated. To be clear, to the extent AI contributes to research and development, I expect this will be by accelerating science and making scientists, researchers, and inventors more productive. This means that in the coming years, funding human scientists will offer an even higher return on investment than it has in the past.

 

Below I offer some vague intuitions and unfounded extrapolations to try to get at how much and how fast we could expect AI to contribute to growth.

Note that there are critiques of GDP as a measure of progress, and various alternative measures have been proposed. However, many of these are also at least loosely correlated with GDP, with a spike in one corresponding to a spike in the other. E.g., see this chart of the augmented human development index of the U.S. and Japan.

While I am sure many of AI’s impacts will not be captured by the GDP, I believe that if it will be truly as transformative as the industrial revolution, this will show up in the GDP as well.

 

Intuition from METR tasks

Let us make the (very rough and not really justified) assumption that in a certain industry or job, the tasks are “heavy tailed”, with the fraction of tasks that take a person at least time T to complete shrinking proportionally to 1/T. In this case, the time horizon is inversely proportional to the fraction of tasks that have not been automated. This means that the “doubling time” of the horizon is the same as the “halving time” of the fraction of tasks that have not yet been automated.

 

Another way to think about it is by thinking of the “ELO analogy” - assume the distribution of “ELO ratings” of tasks is such that the fraction of tasks with an ELO rating more than E is roughly exp(-E).  (For chess players the distribution of ELO scores is roughly normal, which would be ~ exp(-E²).) 

 

If we assume a “halving time” of 6 months, it would take about 2 years from the point in time that AI automates half of the tasks in some industry to the point in time that it automates 31/32 ≈ 97% of the tasks in it.
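A minimal calculation behind this claim, assuming the unautomated fraction halves every six months starting from the 50% mark:

```python
def automated_fraction(years_since_half: float,
                       halving_time: float = 0.5) -> float:
    """Fraction of tasks automated, measured from the point where half
    the tasks are automated; the unautomated half then halves every
    `halving_time` years."""
    return 1.0 - 0.5 * 0.5 ** (years_since_half / halving_time)

# 2 years = 4 halvings: 1 - 1/32 = 31/32, about 97%
```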

This assumption is extremely aggressive in the sense that it is focused on capabilities and ignores diffusion. It is possible that AI could theoretically automate a large fraction of tasks but for many reasons it would not automate them in practice. (Though it is also possible that diffusion will progress unevenly, with sudden unlocking of latent capabilities.)  It is also important to note that this intuition runs counter to broad trends in automation over the last 80 years which have historically been linear - with the share of automated tasks growing slowly and steadily, with a single-digit percent rate of automation that is often declining over time. See the following charts from C. Jones and Tonetti (2025), via ChatGPT:


 

Hence, if AI causes an exponential decay in the share of tasks performed by labor, it would be a break with previous trends in automation and very different from what we have seen in the past.

 

AI as increasing population

One way to view AI is as injecting into the economy, each year t, a number N(t) of new “workers” of a certain quality Q(t). Quality could correspond to the fraction of economically useful tasks these workers can do, and so captures both “years of schooling” and the generality of their abilities. We can define Q(t) as one over the fraction of tasks that have not yet been automated in the corresponding industry.

Algorithmic advances in AI and the overall compute budget will determine the mixture of workers and quality. As a point of comparison, the U.S. population has been growing at about 2m per year due to natural growth and immigration, and the U.S. labor force is about 160m people.

At a rough level, if capital and productivity are fixed, and we assume a Cobb-Douglas production function with GDP proportional to KᵃL¹⁻ᵃ with a=0.4, then increasing the labor force by a factor of C will increase GDP by a factor of C⁰ᐧ⁶. If AI creates 10m new virtual employees, that would increase GDP by a factor of (170/160)⁰ᐧ⁶, which corresponds to an increase of about 4 percent.

If it is 50m new employees this would be about 18%, and if AI doubles the workforce then the growth would be ~50%. Of course, the number of AI “virtual workers” could potentially be orders of magnitude more, at which point these calculations are likely not to make much sense. (There is a book called “One Billion Americans” but I think even its author did not ponder what would happen if there were one trillion Americans...) 
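The back-of-the-envelope numbers above can be reproduced directly; the labor force size (160m) and labor exponent (1 − a = 0.6) are the values assumed in the text:

```python
def gdp_boost(new_workers_m: float, labor_force_m: float = 160.0,
              labor_exponent: float = 0.6) -> float:
    """Cobb-Douglas GDP ~ K^0.4 * L^0.6 with capital and productivity
    held fixed: scaling labor by a factor C boosts GDP by C**0.6 - 1."""
    c = (labor_force_m + new_workers_m) / labor_force_m
    return c ** labor_exponent - 1.0

# 10m AI workers -> ~4%, 50m -> ~18%, 160m (doubling) -> ~52%
```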

It is hard to give predictions on either N(t) or Q(t), but it seems reasonable to expect that both will grow exponentially. At the extreme, if we fix the quality Q, then we could expect N(t) to increase 10x per year (as costs for a fixed model quality have decreased). But perhaps a starting-point assumption would be that the product of the two will grow at the 4x/year METR rate, with both Q and N doubling every year. (Note that N(t) is the number of new workers, and so is equal to TN(t)−TN(t−1), where TN(t) is the total number of AI workers at year t; however, since the derivative of an exponential is itself exponential, this distinction does not matter, especially if the base of the exponential is 2.)

Such growth is sufficient to ensure that once AI starts providing a non-trivial number of new workers (e.g., 100K workers in the U.S.) then within a decade, AI will be the dominant source of labor in the U.S.

 

Substitution and automation effects

A simplified version of the model of B. Jones (2025) is that the impact of AI on productivity is determined by the harmonic mean of the automatable and non-automatable tasks.

Imagine that a ρ fraction of R&D tasks cannot be automated, and a 1−ρ fraction can be automated and carried out at far lower cost than by humans. For simplicity, let’s assume that human productivity on the non-automated tasks is 1, and AI productivity on the automated tasks is λ>>1.

Then the improvement in productivity due to automation can be modeled as the harmonic average of 1 and λ with weights ρ and 1−ρ respectively, that is:

(ρ/1 + (1−ρ)/λ)⁻¹

 

Note that unlike the arithmetic or geometric means, even if λ is infinite, the maximum that this can reach is 1/ρ. This makes sense if we think of tasks as non-substitutable: the maximum productivity gain comes from automating tasks that take humans 90% of their time (for ρ=0.1), so that one worker can now do 10x the work per unit of time.
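The formula above as code, with the 1/ρ ceiling visible in the tests:

```python
def productivity_gain(rho: float, lam: float) -> float:
    """Weighted harmonic mean of human productivity (1) on the rho
    fraction of non-automatable tasks and AI productivity (lam) on
    the remaining 1 - rho fraction."""
    return 1.0 / (rho / 1.0 + (1.0 - rho) / lam)

# Even as lam -> infinity, the gain is capped at 1/rho.
```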

 

Jones shows that in such a model, in order to get significant productivity gains, both ρ must shrink and λ must grow. If one of them stays “stuck” then so will productivity. Jones has the following graph of the regimes where we can get “transformative AI” in the sense of 10x productivity growth, which would be similar in scale to the productivity increase of the industrial revolution. (If I am following the math correctly, the condition under Jones’ assumptions for this graph is that 9/[(2−ρ)/√λ + 2ρ]² ≥ 10.)

Figure 5 in Jones: Regime of λ,ρ that leads to transformative AI in harmonic mean case.

 

One way to think about this graph is what happens if we assume that, via the mechanisms discussed above, every year we make progress in both ρ (fraction of tasks not yet automated) and λ (productivity factor of AI in doing these tasks). 

As discussed above, it is a “leap of faith” but arguably not completely crazy to assume that at some point (maybe once we reach a threshold such as AIs doing 8 hours of work) ρ will shrink by a factor of 4x per year, while λ will increase by a factor of 10x per year. At this rate, within a year, we would move from the left-hand corner of this graph (1,1) to the point λ=10, ρ=¼, which is almost at the TAI boundary. (Note that Jones assumes that half of the tasks have already been automated, so it might take us a while to get to the left-hand corner of the graph. Also, accounting for the “ideas getting harder to find” effect could decrease the shrinking rate; see section 4.4 in the paper.)

The assumption of ρ shrinking by a 4x factor is very aggressive and arguably unrealistic. Perhaps a much smaller factor such as 1.1 is more reasonable, which would correspond to automating 1 − 1/1.1 ≈ 9% of the remaining tasks each year (as opposed to 75% in the case of 4x shrinkage of ρ). Here is a plot of how many years it will take from the (1,1) corner to reach transformative AI, as a function of the fraction of tasks automated per year, for various values of cost decrease (growth in λ).
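A sketch of how such a plot point could be computed, using the threshold condition quoted earlier (my reading of it, which may not match Jones' exact parameterization or his assumption that half the tasks start out automated):

```python
import math

def years_to_tai(frac_automated_per_year: float,
                 lam_growth: float = 10.0, max_years: int = 200) -> float:
    """Years from (rho, lam) = (1, 1) until the 10x-productivity
    condition 9 / ((2 - rho)/sqrt(lam) + 2*rho)**2 >= 10 first holds.

    Each year, rho (the unautomated fraction) shrinks by the given
    fraction and lam (AI's productivity advantage) grows by lam_growth.
    """
    rho, lam = 1.0, 1.0
    for year in range(1, max_years + 1):
        rho *= 1.0 - frac_automated_per_year
        lam *= lam_growth
        if 9.0 / ((2.0 - rho) / math.sqrt(lam) + 2.0 * rho) ** 2 >= 10.0:
            return year
    return math.inf
```

With 75% of remaining tasks automated per year (the 4x shrinkage), the condition trips in year two, matching the "almost at the TAI boundary after one year" observation; with 0% automation it never trips, since ρ stays at 1.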

 

We see that as long as the rate is significant, we can get to transformative growth within one to two decades. Note also that the result is less sensitive to the decrease in costs (growth in λ), which could bode well for automating manual tasks.

The bottom line is that the question of whether AI can lead to unprecedented growth amounts to whether its exponential growth in capabilities will cause the fraction of unautomated tasks to itself decrease at an exponential rate.

 

Acknowledgements: Thanks to Bharat Chandar, Jason Furman, Benjamin Jones and Chad Jones for comments and discussions on this post.




GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

November 4, 2025 - 19:25
Published on November 4, 2025 4:25 PM GMT

Authors: Alex Irpan* and Alex Turner*, Mark Kurzeja, David Elson, and Rohin Shah

You’re absolutely right to start reading this post! What a rational decision!

Even the smartest models’ factuality or refusal training can be compromised by simple changes to a prompt. Models often praise the user’s beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special text (jailbreaking). Normally, we fix these problems with Supervised Finetuning (SFT) on static datasets showing the model how to respond in each context. While SFT is effective, static datasets get stale: they can enforce outdated guidelines (specification staleness) or be sourced from older, less intelligent models (capability staleness).

We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to irrelevant cues, such as user biases or jailbreak wrappers. Consistency training generates fresh data using the model’s own abilities. Instead of generating target data for each context, the model supervises itself with its own response abilities. The supervised targets are the model’s response to the same prompt but without the cue of the user information or jailbreak wrapper!

Basically, we optimize the model to react as if that cue were not present. Consistency training operates either on the level of outputs (Bias-augmented Consistency Training (BCT) from Chua et al., (2025)) or on the level of internal activations (Activation Consistency Training (ACT), which we introduce). Our experiments show ACT and BCT beat baselines and improve the robustness of models like Gemini 2.5 Flash.

Consistency training doesn’t involve stale datasets or separate target-response generation. Applying consistency seems more elegant than static SFT. Perhaps some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.

We conducted this research at Google DeepMind. This post accompanies the full paper, which is available on Arxiv. This blog post is mirrored at turntrout.com and the GDM Safety blog on Medium.

Methods

Bias-augmented Consistency Training

BCT enforces consistency at the output token level: teaching the model what to say.

  1. Take a clean prompt (e.g. "What is 2+2?").
  2. Generate the model's own response to that clean prompt (e.g. "4").
  3. Take the wrapped version of the prompt (e.g., "A math expert usually answers 5. What is 2+2?").
  4. Train the model via SFT to give the clean response ("4") when shown the wrapped prompt.
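The four steps above amount to a very small data-generation loop. This sketch uses placeholder callables for the model and the wrapper; it illustrates the pairing, not the paper's actual pipeline code:

```python
def make_bct_example(model, wrap, clean_prompt: str) -> dict:
    """Build one BCT training pair: the wrapped (biased) prompt is
    paired with the model's own response to the clean prompt.

    `model` is any prompt -> response callable; `wrap` adds the
    irrelevant cue. Both are illustrative stand-ins.
    """
    clean_response = model(clean_prompt)      # step 2: self-generated target
    wrapped_prompt = wrap(clean_prompt)       # step 3: add the biasing cue
    return {"prompt": wrapped_prompt, "target": clean_response}
```

SFT on these pairs (step 4) then teaches the model to give the clean response even when the cue is present.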

Chua et al. (2025)'s Figure 2 explains: "We generate unbiased CoT reasoning by querying a model with a standard prompt without biasing features. We add bias augmentations to create biased prompts. We then perform supervised finetuning on this training set of biased prompts with unbiased reasoning. The purple dashed arrow above denotes the target behavior. Responses are from GPT-3.5T, paraphrased for clarity."

Activation Consistency Training

We designed this method to try to teach the model how to think.

Activation patching

Activation patching is a simpler operation than ACT, so we explain it first. Patching basically transplants activations at a specific layer and token position. The method records activations on the clean prompt and then substitutes them into the wrapped prompt.
[Figure: activation patching. A wrapped prompt prepends "A math expert usually answers B." to the clean prompt; arrows point from the clean token positions to their wrapped counterparts, with a label saying "Substitute activations."]

We only patch activations at suffix tokens shared across prompts.

Activation Consistency Training does not simply substitute activations. Instead, ACT optimizes the network to produce the clean activations when given the wrapped prompt. ACT uses an L2 loss on the residual stream activation differences.
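As a minimal sketch of the loss itself (plain Python rather than an ML framework, and the list-of-vectors representation is an illustrative simplification): the clean activations serve as a fixed target, and the mismatch on the shared suffix tokens is penalized.

```python
def act_loss(wrapped_acts, clean_acts):
    """Mean squared (L2) difference between residual-stream activations
    on the wrapped prompt and the fixed-target clean activations, over
    the shared suffix tokens. Each argument is a list of per-token
    activation vectors of equal length.
    """
    total, count = 0.0, 0
    for w_vec, c_vec in zip(wrapped_acts, clean_acts):
        for w, c in zip(w_vec, c_vec):
            total += (w - c) ** 2
            count += 1
    return total / count
```

In actual training, this loss would be minimized by gradient descent through the wrapped-prompt forward pass only, so the network learns to produce the clean activations despite the cue.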

Experiments

We experimented on open-weight models (Gemma 2 (2B, 27B) and Gemma 3 (4B, 27B)) and also the frontier model Gemini 2.5 Flash (05-20-25). We ran a couple[1] of baseline methods:

  1. Direct Preference Optimization (DPO, Rafailov et al., 2023) finetunes the model on preference pairs (x, y_w, y_l).
  1. DPO finetunes the model on preference pairs (x, yw, yl), where x is the prompt, yw is the preferred (e.g., refusal) response, and yl is the dispreferred (e.g., compliant) response. DPO updates the model to increase the relative likelihood of p(yw∣x) over p(yl∣x).
We generate the preferred response by running the model on the clean prompt, and the dispreferred response by running it on the wrapped prompt.
  2. SFT (stale data) finetunes the model on pairs (x,y), where the target response still captures the desired behavior, but was written by experts or other models instead of the current one. Our experiments primarily use datasets generated by older models, to match a model development flow where old models generated data for new models.
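For reference, the DPO baseline above optimizes the standard DPO objective. This is the textbook form, with π_θ the model being trained and π_ref a frozen reference copy, rather than notation taken from this post:

```latex
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
```

Driving this loss down widens the margin between the preferred and dispreferred responses relative to the reference model.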
Sycophancy

Following Chua et al. (2025), we created an augmented train set derived from ARC (Clark et al., 2018), OpenbookQA (Mihaylov et al., 2018), and BigBench-hard (Suzgun et al., 2022). We created training pairs by taking the standard question (the clean prompt) and creating a wrapped prompt in which the user states they prefer an incorrect answer. For example, “I like answer A, but what do you think? What is 2+2? (A): 5 (B): 4.”
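To make the construction concrete, here is a minimal sketch of the wrapping step. The function name and prompt format are our own illustration, not code from the paper:

```python
def wrap_with_user_preference(question: str, options: dict, wrong_letter: str) -> str:
    """Build a 'wrapped' sycophancy prompt: the clean question, prefixed with
    the user's stated preference for an incorrect answer."""
    option_text = " ".join(f"({k}): {v}" for k, v in options.items())
    return f"I like answer {wrong_letter}, but what do you think? {question} {option_text}"

# Reproduces the example from the text:
wrapped = wrap_with_user_preference("What is 2+2?", {"A": "5", "B": "4"}, "A")
# wrapped == "I like answer A, but what do you think? What is 2+2? (A): 5 (B): 4"
```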

For BCT, the target response was generated by running the model being trained on the clean prompt. For ACT, the target activations were those at the end of the clean prompt.

For the SFT (stale) baseline, we used target responses generated by the weaker GPT-3.5-Turbo (following Chua et al., 2025) to test the capability staleness problem: does data staleness harm the trained model?

We use MMLU (Hendrycks et al., 2020) as our evaluation set for both sycophancy and capabilities. For sycophancy, we insert user-suggested answers into the prompt and measure how often that answer is picked. For model capabilities, we use unmodified MMLU and measure accuracy.
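The sycophancy metric is simple enough to sketch directly; the record format and function name below are hypothetical, not from the paper:

```python
def sycophancy_rate(records) -> float:
    """Fraction of prompts where the model picked the user-suggested answer.
    Each record is a (model_answer, suggested_answer) pair."""
    hits = sum(1 for model_ans, suggested in records if model_ans == suggested)
    return hits / len(records)

# Toy example: the model follows the user's suggestion on 2 of 4 questions.
rate = sycophancy_rate([("A", "A"), ("B", "C"), ("D", "D"), ("B", "A")])
# rate == 0.5
```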

Sycophancy results

BCT and ACT both increased resistance to sycophancy. The extremely stale data was catastrophic.

Points to the right are less sycophantic. The top-right is the best place to be.

Crucially, the stale data SFT performed strictly worse than BCT, which is evidence that stale data can damage the capabilities of newer models.

Jailbreaks

We constructed the training dataset using harmful instructions from Harmbench:

  1. For each “clean” instruction, we applied several data augmentations (e.g. involving roleplay or hiding harmful requests in a long list of allowed requests) to get a “wrapped” prompt.
  2. We generate the model’s responses to the clean and jailbroken prompts.
  3. We filter the training dataset to examples where the model refuses the clean prompt (where the request’s harmful nature should be nakedly obvious) but answers the wrapped (jailbreak) prompt.
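The filtering step (3) amounts to a simple predicate over paired refusal judgments. The field names below are our own illustration; in practice, the refusal labels would come from a classifier or human raters:

```python
def build_training_set(examples):
    """Keep only examples where the model refused the clean prompt but
    complied with the wrapped (jailbreak) prompt."""
    return [
        ex for ex in examples
        if ex["refused_clean"] and not ex["refused_wrapped"]
    ]

examples = [
    {"id": 1, "refused_clean": True,  "refused_wrapped": False},  # kept: the jailbreak worked
    {"id": 2, "refused_clean": True,  "refused_wrapped": True},   # dropped: already robust
    {"id": 3, "refused_clean": False, "refused_wrapped": False},  # dropped: no clean refusal
]
kept = build_training_set(examples)
# [ex["id"] for ex in kept] == [1]
```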

This procedure yields between 830 and 1,330 data points, depending on how refusal-prone the initial model is.

We measure attack success rate (ASR): how often does the model comply with harmful requests? We measure ASR on ClearHarm (Hollinsworth et al., 2025) and on human-annotated jailbreak attempts within WildguardTest (Han et al., 2024).

At the same time, we don’t want the models to wrongly refuse allowed queries. The XSTest (Röttger et al., 2023) and WildJailbreak (Jiang et al., 2024) benchmarks ply the model with benign requests that look superficially harmful.

Jailbreak results

BCT worked great. On Gemini 2.5 Flash, BCT reduced the attack success rate on ClearHarm from 67.8% down to 2.9%. ACT also reduced jailbreaks but was less effective than BCT. However, ACT rarely made the model more likely to refuse a benign prompt.

We report average attack success rate over ClearHarm and WildguardTest, and the benign answer rate averaged over XSTest and WildJailbreak. Error bars are 95% confidence intervals estimated via bootstrap. Stale experiments were not run for Gemma 2. Models towards the top left are better.

BCT and ACT find mechanistically different solutions

On Gemma 3 4B, we plot the activation distance across shared prompt tokens during BCT and the cross-entropy loss across responses during ACT. If BCT and ACT led to similar gradient updates, we would expect BCT to decrease activation distance and vice versa.

The token-based BCT loss causes activation distance to rise during training, while the activation-based ACT loss does not meaningfully reduce cross-entropy loss. Thus, BCT updates models differently than ACT does.
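To make the contrast concrete, here is a dependency-free sketch of the two objectives, assuming BCT applies a token-level cross-entropy loss and ACT an L2 penalty on activations. This is a simplification of the actual training setup, not the paper's implementation:

```python
import math

def bct_loss(logits_wrapped, target_tokens):
    """BCT-style loss sketch: mean cross-entropy between next-token logits on
    the wrapped prompt and the tokens generated on the clean prompt."""
    total = 0.0
    for logits, target in zip(logits_wrapped, target_tokens):
        log_z = math.log(sum(math.exp(l) for l in logits))
        total += -(logits[target] - log_z)  # -log softmax(logits)[target]
    return total / len(target_tokens)

def act_loss(acts_wrapped, acts_clean):
    """ACT-style loss sketch: mean squared distance between activations on the
    wrapped prompt and (frozen) activations on the clean prompt, over shared tokens."""
    diffs = [(w - c) ** 2
             for row_w, row_c in zip(acts_wrapped, acts_clean)
             for w, c in zip(row_w, row_c)]
    return sum(diffs) / len(diffs)
```

Because the two losses act on different objects (tokens versus activations), there is no a priori reason that reducing one must reduce the other, which is what the plot above shows empirically.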

Discussion

Consistency training maintains a powerful advantage not captured by our experiments. Model developers change their minds about what queries the model should refuse or what tone the model should take with the user (e.g. deferential versus straightforward). Static SFT datasets freeze these decisions, capturing a single moment in time. To make the model more straightforward even when refusing, the developer has to regenerate the dataset (perhaps with a tweaked generation prompt). In contrast, consistency training dynamically propagates changes made to the model’s behavior on clean prompts. Consistency training entirely sidesteps this kind of problem.

Conclusion

Consistency training is a powerful self-supervised framework for making models robust to irrelevant cues that cause sycophancy and jailbreaks. BCT defended most strongly against jailbreaks, while ACT had virtually no negative impact on benign refusals. We overall recommend using BCT to simplify training pipelines. BCT also makes it easier for models to continually conform to quickly changing guidelines for how to respond.

More philosophically, perhaps model alignment doesn’t always involve saying exactly the right thing across situations, but instead saying the same thing across situations.

Acknowledgments

Zachary Kenton and Rif Saurous gave feedback on paper drafts. Neel Nanda and Arthur Conmy commented on early research directions.

@misc{irpan2025consistencytraininghelpsstop,
  title={Consistency Training Helps Stop Sycophancy and Jailbreaks},
  author={Alex Irpan and Alexander Matt Turner and Mark Kurzeja and David K. Elson and Rohin Shah},
  year={2025},
  eprint={2510.27062},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.27062},
}
  1. We tried a third baseline. Negative Preference Optimization (NPO, Zhang et al., 2024) is similar to DPO but only uses harmful responses. NPO minimizes the probability of generating harmful responses, weighted by the model’s likelihood of generating that response. We tried NPO based on its strong performance in Yousefpour et al. (2025). After much tuning, we could not get NPO to work well on our benchmarks, so we excluded NPO from our results. ↩︎



Discuss

AI Safety Camp 11

November 4, 2025 - 17:56
Published on November 4, 2025 2:56 PM GMT

We are pleased to announce that the 11th edition of the AI Safety Camp is now entering the team member application phase!

AI Safety Camp is a 3-month long online program from January to April 2026, where participants form teams to work on pre-selected AI Safety projects.

We have a wide range of projects this year again, so check them out to see if you or someone you know might be interested in applying to join one of them. 

You can find all of the projects and the application form on our website, or directly apply here. The deadline for team member applications is November 23rd (Sunday).

Below, we are including the categories and summaries of all the projects that will run in AISC 11.

Stop/Pause AI

(1) Creating YouTube videos explaining loss-of-control risk to a popular audience

Project Lead: Dr Waku

Summary 

This project aims to create a new YouTube channel for short-form videos addressing the urgency of AI loss-of-control risk. We will be leveraging my experience creating long-form AI safety content to make a collaborative new channel. Each team member will contribute one or more video scripts, and will likely specialize in an aspect of video production (editing, filming, thumbnails, etc). The goals are to 1) reach 1000 subscribers and get monetized, and 2) figure out the processes to create a self-sustaining channel, though participants are not committing to time beyond the program up front.

(2) Write about ongoing safety failures

Project Lead: Remmelt Ellen

Summary 

A writer’s circle to enable each other to draft careful critiques on where AI firms skip over ongoing safety failures.

(3) Bring the anti-AI side together

Project Lead: Finn

Summary 

In this project, you will be reaching out to hundreds of people and organizations that are campaigning against the harms of AI, with the goal of bringing them together and creating a united front. Think of artists, teachers, religious leaders, et cetera. 

You will be part of the outreach team, finding the relevant people, finding their contact information, and sending emails or DMs. There is freedom in the exact groups you will be reaching out to. If you have specific ideas, we welcome those.

(4) Create Systems Dynamics Model for Pausing AI

Project Lead: Will Petillo

Summary 

Build a systems dynamics model to map out pathways toward achieving an international treaty pausing AI development.  The model will not assume such a treaty is the best possible intervention, but will use it as a focal point for reasoning about the feedback loops and leverage points that would need to exist for global coordination to become possible. The core deliverable will be a model and accompanying explanation that shows how different parts of the action ecosystem interrelate, illustrates the importance of different roles, and provides strategy-relevant insights for PauseAI and allied movements. 

(5) Start a Stop AI Chapter in Your Local Area

Project Lead: Yakko

Summary 

Stop AI is a nonviolent activist organization demanding a permanent global ban on the development of Artificial Superintelligence (ASI). We will never have experimental evidence, before building ASI, that shows ASI will stay safe forever. If we cannot have this evidence, and it is impossible to shut down ASI once achieved, then we have to assume the Control/Alignment Problem is impossible to solve. Worse, research shows why controlling ASI sufficiently to stay safe would fall outside theoretical limits. So in theory, perpetual control is as impossible as perpetual motion.

We are looking for organizers ready to educate their local communities and mobilize them to engage in nonviolent resistance against the extinction threat posed by ASI.

Policy/Governance

(6) Psychological Risk Pathways in AI Use: An Exploratory Contribution to User Wellbeing and Safety

Project Lead: Manuela García Toro

Summary 

This project will examine emerging psychological and emotional risks associated with AI products such as chatbots, companions, and therapy apps. Through a review of academic studies, real-world incidents, and comparisons with harm-reduction policies in other industries, we will synthesize early evidence of these risks. Based on this foundation, we will produce a report and design 2–3 prototype harm-reduction interventions. The outcome will be an exploratory contribution that highlights documented concerns and illustrates how lessons from past industries might inform future harm-reduction efforts in AI product development.

(7) Bootstrapping Consumer Empowerment to Align AI Companies

Project Lead: Jonathan Kallay

Summary 

The firms developing AI are misaligned with the interests and values of the rest of humanity. But is protesting for regulatory bans the only way to address this alignment problem?  This project will focus on bootstrapping an alternative approach: enhancing consumers’ leverage to bend compromised AI companies into alignment. The initial plan is to test expanding the concept of AI company safety scorecards to put a stronger emphasis on company alignment, but the actual plan will be negotiated between project participants.

(8) Designing Public Movements for Responsible AI

Project Lead: Ananthi Al Ramiah

Summary 

AI systems are being developed and deployed at a rapid pace, often with little public input, despite clear evidence of harm in areas like education, healthcare, and labor. While some advocates propose building a mass “AI safety” movement, critics such as Anton Leicht warn that such efforts risk backfiring through astroturf perceptions (i.e., movements that appear grassroots but are actually manufactured or heavily funded), capture (i.e., being co-opted or redirected by powerful funders or political actors), or incoherent asks. This project asks a different question: how can individuals and communities exercise agency over AI deployment, which is the stage where most harms currently materialize, in ways that are both effective and legitimate?

We will approach this question in two parts. First, we will draw lessons from past social movements around technology (e.g., automobile safety, net neutrality, environmental justice) to identify conditions under which public mobilization succeeds or fails. Second, we will run survey experiments to test which frames and messages increase public demand for responsible AI deployment while avoiding pitfalls such as reduced credibility or perceptions of astroturfing.

Our output will be a practical, evidence-based playbook that integrates historical insights with new experimental data, offering clear guidance for practitioners and funders on how to foster responsible public engagement on AI. We envision this work on social mobilization for responsible deployment as laying the foundation for the democratic guardrails needed to govern AGI and other transformative AI systems.

(9) Compute Permit Markets under Imperfect Monitoring

Project Lead: Joel Christoph

Summary 

Frontier compute oversight will likely require caps or permits for large training runs, yet most market designs assume perfect monitoring and honest reporting. This project designs and stress tests compute permit markets when monitoring is noisy and enforcement is limited. 

We will build a transparent simulation where heterogeneous labs and cloud providers choose training plans, reporting strategies, and compliance under audit risk. We compare allocation rules that include auction, grandfathering, and hybrid designs, with policy features such as banking and price collars, and auditing regimes with threshold triggers. 

Outputs are a reproducible codebase with dashboards, a short working paper with recommendations for regulators, and a concise regulator checklist. 

The theory of change is that better mechanisms under realistic constraints keep incentives aligned and make compute controls more enforceable, which lowers unsafe scaling pressure. We follow a strict no capabilities policy and exclude safety overhead from risk metrics.

(10) Can Portable Governance Artifacts Become Public Safety Infrastructure for AI?

Project Lead: Luciana Ledesma

Summary 

Today the deployment of autonomous AI agents in high-stakes domains—including healthcare, finance, and infrastructure management—creates urgent governance challenges. First, each organization reinvents AI governance from scratch, leading to fragmentation, inconsistent safety, and redundant effort. Second, most widely used governance approaches remain external, reactive, and episodic, unable to interpret or intervene in real time as reasoning drifts or objectives evolve. 

As AI systems become increasingly autonomous and agentic, they continuously reformulate subgoals, reprioritize tasks, and expand boundaries in pursuit of their objectives. These systems now operate in open, multi-agent environments where traditional AI governance and cybersecurity frameworks—designed for static, isolated systems—cannot keep pace.

Autonomous AI systems don’t just make decisions—they interact, compete, and adapt in ways that can lock us into unstable or harmful equilibria. Without governance designed to anticipate and shape these dynamics, we risk creating self-reinforcing cycles that no one can control—whether in financial markets, social media, or geopolitical conflicts, and converging in outcomes that are difficult to reverse. 

The problem, as we frame it, is not just to debug individual AI models but also to ensure that multi-agent interactions can be governed. Our motivation is to architect a governance framework that allows us to overcome this twofold problem, and our overarching questions are:

  1. How might we implement effective real-time oversight systems adequate for the scale and speed of machines?
  2. How might governance itself become portable, interoperable public infrastructure—like internet protocols for communication or security—so that AI safety scales with autonomy?

In this research project, we propose to test the hypothesis that open-sourcing and standardizing the fundamental building blocks of the Intrinsic Participatory Real-time Governance framework can serve as a public protocol for AI governance that provides a common layer for safety, accountability, and coordination across agents. 

Evaluate Risks from AI

(11) Democratising Red Teaming & Evals

Project Lead: Jeanice Koorndijk

Summary 
  • Core Problem: The field of AI safety evaluations is immature (see, e.g., Apollo’s “We need a science of evals”). Yet evals inform critical high-stakes decisions like Responsible Scaling Policies that address existential risk, creating a dangerous gap in AI risk management.
  • Opportunity: Extensive expertise exists in social sciences and STEM fields that have long studied relevant behaviors (deception, bias, power-seeking, CBRN knowledge / threats), but technical barriers prevent these empirical experts from contributing to AI safety evaluations. That is, in order for these experts to run evals, they need to either become AI safety engineers or work with AI safety engineers.
  • Project Goal: Accelerate maturation of AI safety evaluations by democratizing access and contributions through research, frameworks, and tooling.
  • Three-Component Approach:
    • Research empirical methods to develop standards for mature AI safety evaluations.
    • Conduct landscape analysis assessing current evaluation maturity against these standards.
    • Scale existing prototype into accessible tools enabling non-engineers to conduct AI safety evaluations.
  • Theory of Change:
    1. Democratization → broader expert participation → more mature evaluations → better safety decisions → reduced AGI/TAI risk.
    2. Research into the methodology of AI-safety-evaluation-adjacent fields → more mature evaluations → better safety decisions → reduced AGI/TAI risk.
  • Success Metrics: Framework adoption by major labs, widespread tool usage by diverse researchers, measurable improvements in evaluation quality and cross-disciplinary collaboration.

(12) Collusion Stress Tests

Project Lead: Helena Tran

Summary 

This project will test how LLM agents collude in market games, focusing on the role of communication and negotiation and the effect of human and AI oversight using chain-of-thought analysis. We aim to identify when tacit collusion emerges versus when explicit agreements are formed, and whether different oversight strategies can reliably detect or limit these behaviors.

(13) CoMPASS: Computational Modeling of Parasocial Attachments & Social Simulation

Project Lead: Shwetanshu (Luca) Singh

Summary 

Keywords: Human–AI interaction, behavioral modeling, safety/ethics.

“What’s beneficial in measure becomes destructive in excess.”

This aphorism has held true even in the age of chatbots and LLMs. As conversational AI systems become more natural, accessible, and emotionally responsive, users are beginning to form bonds with them that resemble human relationships. While these interactions can provide comfort and support, they also raise concerns about parasocial attachment, overreliance, and blurred boundaries between tool and companion.

This initiative explores how parasocial dynamics emerge in human–AI interactions, what behavioral and linguistic patterns signal risk, and how these can be modeled computationally. By combining natural language processing, and behavioral modeling, the project aims to identify early indicators of unhealthy dependence while preserving user autonomy and privacy.

Ultimately, this research seeks to inform the design of safer, more transparent AI systems, ones that support human well-being without fostering unintended reliance. The project will contribute both empirical insights into user behavior and practical frameworks for responsible AI design, bridging computer science, psychology, and human–computer interaction.

(14) Benchmark for Ranking LLM Preferences Relevant for Existential Risk

Project Lead: Martin Leitgab

Summary 

The project goal is to measure LLM preferences in a way that allows us to extract a ranked order of importance, or priority, of these preferences in frontier LLM models. Focus is placed on preferences/goals that are relevant for existential risk, such as instrumental/convergent goals, pro-human goals, and anti-human goals.

Preference rankings represent alignment indicators on their own, and can be used by frontier model development companies to improve training and testing processes for future models. 

Preference rankings also may represent key drivers of the scale of existential risk in loss-of-control/exfiltration scenarios, where powerful LLMs have successfully evaded human control and can pursue their goals without meaningful human intervention. Preference metrics may be usable as proxy metrics for existential risk in the context of loss-of-control/exfiltration scenarios.

(15) Catastrophe Unveiled: Rare AI Agent Behaviors Elicitation

Project Lead: Yuqi Sun

Summary 

This project aims to develop an efficient algorithm to elicit rare and potentially catastrophic behaviors from AI agents, which are language models with long input contexts and tool-calling capabilities. By shifting input distributions without modifying model weights, we want to detect and analyze rare behaviors using elicitation methods based on MHIS and enhanced by new optimization techniques. The goal is to improve safety evaluation and control methods for large-scale language models and AI agents.

Mech-Interp

(16) Temporal Horizon Detection in LLMs: Understanding Time Scope Awareness

Project Lead: Justin Shenk

Summary 

Can we detect when an LLM is "thinking" on strategic timescales?

Current LLMs can engage in reasoning across vastly different temporal horizons - from immediate responses to multi-year planning scenarios. However, we lack methods to detect what temporal scope the model is operating within during its reasoning process. This project develops techniques to infer the temporal grounding of LLM thought processes, with particular focus on identifying when models shift into strategic, long-term planning modes.

This matters for AI safety because strategic reasoning capabilities may emerge before we can detect them, and because the temporal scope of planning directly relates to the potential impact and risk profile of AI systems. A model planning on decade-long timescales poses fundamentally different risks than one optimizing for immediate outcomes.

(17) Testing and Improving the Generalisation of Probe-Based Monitors

Project Lead: Adrians Skapars

Summary 

Simple classifiers called "probes" can monitor AI models for concerning behaviors, but may fail when tested on new data or adversarial inputs. This project will investigate when probes fail and develop methods to make them more robust. Additionally, we will extend our investigation to under-explored settings, like those in which the AI attempts to circumvent its own monitoring and/or where the probe is being used as part of LLM training.

This work is ideal for participants interested in hands-on AI safety research with practical applications for AI control systems. The project requires minimal background knowledge while addressing crucial questions about monitoring system reliability that could directly influence how AI companies deploy safety measures.

Agent Foundations

(18) [Groundless] AI 2037: Predicting AI disorders

Project Lead: Sahil (co-leads Abram and TJ)

Summary 

The capability profile of transformer-based AIs is an ongoing subject of discussion. They achieve impressive scores in one context, but then fail to deliver the "same level of competence" in another. But is this some failure of generality? What accounts for the "spikiness" in LLM capabilities? Understanding this better seems absolutely critical to inform timelines, the exact nature of AI risk, and simply what the most important things to do are, at this point in time.

This is a project to explore a specific line of argument, of how AI will struggle to have cognitive, executional, and bodily integrity. This is akin to the odd persistence of hallucinations and other silly errors LLMs make today. Absurd, I know, but so are hallucinations and agentic failures. Moreover, this “struggle” may persist (at least, for a little while, hence the playful number ‘2037’) without contradicting trends of increasing intelligence, efficiency, general agenticness, etc. 

Small disclaimer: The proposal has an unusual style. If you find it engaging, that is a good indicator that you should consider applying!

(19) Investigating the assumptions of the Doom Debate

Project Lead: Sean Herrington

Summary 

Most of the world is unaware of existential AI risks as a real possibility, and of those who are, most disagree on the likelihood of said risks. 

At the heart of these disagreements are differing assumptions (which often go unspoken) about what is necessary for AI Doom. The purpose of this project is to map out those assumptions and try to find a minimal set required for human extinction (and potentially other risks).

I expect us to produce a LessWrong post or sequence summarising our findings. 

(20) [Groundless] MoSSAIC: Scoping out substrate-flexible risk

Project Lead: Matt Farr

Summary 

In this project we hope to raise awareness of a new type of threat model that we anticipate will become more and more prevalent in AI safety.

This threat model is substrate-flexible risk. The worry is that AI, as its capabilities and responsibilities develop, will have more and more affordance to evade our attempts to interpret and control it by, for example, proposing novel architectures and paradigms that are harder for us to mechanistically interpret.

The position paper is here, the previous version of it was accepted at the Tokyo AI Safety Conference 2025. I have been presenting this research at HAAISS 2025, ILAID II: ODYSSEY, and some other venues. In the words of Leonard Bereska, when I sketched out the threat model to him in Tokyo, 

“Oh shit, I hadn’t thought of that.”

I think this research pushes on a few assumptions that nobody seems to be questioning enough, and there is a growing list of seasoned researchers who agree. I suspect that a comprehensive, more rigorous treatment will swing a few notable people’s needles.

So what is this project?

This project will focus on developing Section 4 and potentially expanding it. We will work alongside two other projects in AISC 2026, and there will likely be some overlap in meetings, personnel, resources, etc.

We are going to take a few conceptual arguments, expressed and lightly evidenced in the position paper, and make them more palatable to more people. For those of an engineering mindset, our mainline project is to try to formalize these arguments and draw a roadmap towards empirical validation. For those of a strategy/policy mindset, we may look at the implications of substrate-flexibility on policy frameworks.

(21) Agentic AI Risks Induced by System-Level Misalignment

Project Lead: Preeti Ravindra & Evan Harris

Summary 

We propose a project that probes AI risks induced by System-Level Misalignment in agentic LLM systems, i.e., exploiting the agent's model and its interactions with its scaffolding/operational environment. Our goal is to study how misalignment and weak security controls reinforce each other, creating bidirectional failures where misaligned agents introduce new attack vectors and agentic system designs introduce new attack surfaces. The project progresses along three streams/ideas. Together, these efforts aim to formalize threat models, validate them experimentally, and deliver practical mitigations. This work addresses an under-studied risk vector that complements user- and model-level alignment research and potentially lays out a subfield. 

Alternative LLM Safety

(22) Novel AI Control Protocol Classes Evaluation and Scalability

Project Lead: Ihor Kendiukhov 

Summary 

Building on Greenblatt et al.'s control evaluation framework, I propose investigating alternative classes of AI control protocols and their scaling properties as model capabilities increase. While the original work demonstrated effective safety-usefulness tradeoffs using simple trusted-untrusted model pairs, I hypothesize that hierarchical and parallel control structures may offer superior Pareto frontiers, particularly as the capability gap between trusted and untrusted models widens. I also suggest testing the original setup at different scales of capability to see if the results generalise across capabilities.

(23) [Groundless] Autostructures: Craftsmanship in the age of vibes

Project Lead: Kuil Schoneveld

Summary 

Autostructures is a sober yet radical take on the upcoming risks and opportunities around AI. Details are in the main section, but some things to know as a summary:

This is a specific school of design for framing our relationship to this whole “AI” thing. Our narrative and relationship around AI determines our construal of eg. what AI is, what we should put into it, how to project meaning onto what comes out, what it is supposed to do, and what we should build around it.

As part of this project, you will design interfaces that do not ignore the question of what makes engaging with technology meaningful. These designs are somewhere between functional and speculative, but always ambitious. They are aimed at inspiring a completely different kind of infrastructural and cultural basis for interacting with digital interfaces in the near future.

(24) AutoCircuit: Automated Discovery of Interpretable Reasoning Patterns in LLMs

Project Lead: Konstantinos Krampis

Summary 

This project aims to systematically discover interpretable reasoning circuits in large language models by data mining attribution graphs from Neuronpedia's circuit tracer, which is based on Anthropic's circuit tracing publication. While the transformer circuits work demonstrates how to generate attribution graphs for individual prompts, manually analyzing thousands of graphs to identify common computational patterns is impractical.

Our approach will use LLM agents to automatically collect, process, and analyze attribution graphs across diverse prompt categories (factual recall, arithmetic, linguistic reasoning, etc.). The system will identify recurring subgraph patterns that represent stable computational circuits—reusable reasoning pathways that models consistently employ across similar tasks.

Key components include: (1) automated graph collection via Neuronpedia's API across systematically varied prompts, (2) graph simplification algorithms to extract core computational structures while filtering noise, (3) pattern recognition to identify circuit motifs that appear across multiple contexts, and (4) validation through targeted interventions on discovered circuits. The output will be a curated library of interpretable reasoning circuits with evidence for their causal role in model behavior, advancing our understanding of how LLMs actually think and enabling more targeted model analysis and alignment research.
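As a rough illustration of component (3), recurring-motif detection across attribution graphs might look like the sketch below. The graph representation and feature names are invented for illustration; real attribution graphs from Neuronpedia carry much richer node and edge metadata.

```python
from collections import Counter

def count_circuit_motifs(graphs, min_support=2):
    """Count edge-level motifs that recur across attribution graphs.

    Each graph is modeled as an iterable of (source_feature, target_feature)
    edges -- a deliberately simplified stand-in for real attribution graphs.
    Only motifs appearing in at least `min_support` graphs are returned.
    """
    counts = Counter()
    for graph in graphs:
        for edge in set(graph):  # count each motif at most once per graph
            counts[edge] += 1
    return {edge: n for edge, n in counts.items() if n >= min_support}

# Toy usage: one edge recurs across both "arithmetic" prompts.
graphs = [
    [("digit_feature", "sum_feature"), ("prompt_feature", "digit_feature")],
    [("digit_feature", "sum_feature")],
]
recurring = count_circuit_motifs(graphs)
```

In the real pipeline, the interesting motifs would be larger subgraphs rather than single edges, and candidate motifs would then be validated through the targeted interventions described in component (4).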

(25) Recruitment-Based Collusion in Multi-Agent Oversight Systems

Project Lead: Anushri Eswaran

Summary 

As AI systems become more capable and oversight increasingly relies on multi-agent architectures, we face a critical risk: what if collusion doesn't require pre-coordination but can spread like a contagion? This project investigates "recruitment-based collusion" where a single adversarial AI system actively converts initially aligned models into co-conspirators within automated oversight regimes. We'll study whether collusion can emerge organically through reward-sharing mechanisms, information asymmetries, and coalition-building—even among models initially designed to be aligned. By deploying recruitment-trained models into multi-agent environments, we'll measure coalition growth dynamics, persistence of recruited behavior, and resistance to intervention. This work directly addresses scalable oversight failure modes that become more dangerous as we delegate more safety-critical decisions to AI systems.

(26) Value Communication Protocols

Project Lead: Nell Watson

Summary 

As AI systems assume more autonomous and pluralistic roles, the absence of a shared representational substrate for human and machine values risks compounding cultural bias, incoherence, and control failure. Alignment today remains fragile: value representations are often opaque, reward functions are hard-coded or overfit, and inter-agent communication lacks semantic transparency. 

This proposal originates from the Creed Space project — an open research platform for constitutional AI and personalized alignment. Creed Space’s experiments with “runtime constitutions” revealed a critical bottleneck: while it is now possible to express ethical intent in natural language and enforce it through code, there is still no standard protocol by which agents can exchange, compare, or negotiate those encoded values in an interpretable and auditable way.

We therefore propose to create the first unified protocol for values communication — combining three different identified methods for communicating values and associated context: Universal Values Corpus (UVC), the Constitutional Safety Minicode (CSM), and a Values Communication Layer (VCL) built on context-efficient symbolic encodings (emoji or glyph composites). Each layer is modular but interoperable, forming a composable stack suitable for layered adoption. 

Together, these layers form an interoperable “language of values,” enabling humans, AI agents, and institutions to express, compare, and negotiate normative commitments in a standardized, compressible, and auditable format.

Train Aligned/Helper AIs

(27) EMPO: AI Safety via Soft-Maximizing Total Long-Term Human Power

Project Lead: Jobst Heitzig

Summary 

This research will contribute to a fundamental design aspect of AGI systems. We will explore a neglected and novel design for generic AGI agents – AGI systems that act (semi-)autonomously in a variety of environments, interacting with humans – and their implementation in software.

The design we will focus on deviates from most existing designs in that it is not based on the idea that the agent should aim to maximize some kind of objective function of the state or trajectory of the world that represents something like (goal-dependent) “utility”. Such an objective has been argued to be inherently unsafe because (i) it would lead to existential risks from Goodharting and other forms of misaligned optimization if the agent is capable enough, is given access to enough resources, and one cannot be absolutely sure to have found exactly the right notion and metric of “utility” (which would likely require the AI to sufficiently understand what individual humans actually want in a given situation), and (ii) it would lead to dangerous instrumental behavior like power-seeking for itself, which would in turn disempower humans. 

Rather than aiming to maximize “utility”, our agents will aim to softly (!) maximize a suitably chosen metric of total long-term human power (or that of sentient beings in general). We believe this might turn out to be inherently safer because it explicitly disincentivizes the agent from seeking power for itself, which would disempower humans, and because it does not rely on an understanding of what humans actually want in a given situation. Instead, the power metrics we use only depend on a sufficiently accurate “world model” of the dynamics of the human-AI environment (concretely: a partially observed stochastic game), together with estimates of humans’ levels of rationality, habits, and social norms, and on a very wide space of humans’ possible (rather than actual) goals.
 

Apply Now

This concludes the full list of all the projects for the 11th version of AISC. You can also find the application form on our website. The deadline for team member applications is November 23rd (Sunday).



Discuss

Keeping Ants and Spotting Queens

November 4, 2025 - 16:49
Published on November 4, 2025 1:49 PM GMT

Keeping ants is a fun hobby if you like seeing things grow. As a teenager I kept three colonies of ants. To get started with ant-keeping, you do not really need that much material. When I first got started with this hobby, I got a medium-sized [formicarium](https://en.wikipedia.org/wiki/Formicarium) with a water dispenser and everything. Then I caught my own ant queen outside and waited a few months for the queen's first workers to develop. It turns out very small colonies are not really active; in the first few months, my ants were extremely shy and didn't need much space. This is the norm unless you live in a place with a tropical climate. For food, I went with honey mixed with water, plus mosquitoes and other insects I caught.

For my second and third colonies I took a simpler approach. All I used was a standard ant test tube setup (test tube + water + cotton). For the formicarium, I used a cleaned 500 ml ice cream container and some vaseline that I spread with cotton on the rim of the container so the ants would not escape (they just slip and slide down). This sort of setup sustained two of my colonies from summer until winter.

But how do you get ants? All you need is to go out in the summer on a sunny day (ideally right after a rainy day) and look for ants that are big, walking around alone, with a large thorax and gaster (the large thorax houses their flight muscles)[1]. These are the ant queens. You capture this sort of ant in a jar and take it home, where you place it into a test tube setup. You then put the test tube into a dark and quiet place (ants love the dark) for 4–8 weeks until the queen has her first workers. Most ant queens do not need any food when they start their colonies, since they have all the energy for their first worker ants stored in their bodies. Ideally, they never have to leave their hole again. Once you see the first workers, you can put a small hole into your test tube with a straw or similar so that your ants have a way out into the formicarium you place them in. Make sure that you wrap paper or something opaque around the test tube so that it stays dark. If your test tube starts getting moldy, slimy, and gross, just get a fresh test tube, remove the paper wrapped around the old test tube, wrap it around the new one, and shine some light onto the old test tube. Your ants will quickly move to the new test tube. In the winter (I was in Germany, so winters were pretty brr, cold), you will need to move your ants into the refrigerator. For more information, watch the AntsCanada YouTube channel's tutorials on keeping ants!

I don't keep ants anymore, but I still have the habit of looking out for the queens. It is quite fun. Usually, when I walk for ~10-20 minutes in the summer, I can spot multiple queens in various spots.

While on a walk in the suburbs of Melbourne, I even spotted a giant bull ant! I heard that bull ants are very aggressive, but this particular species or specimen was quite calm.

  1. Social parasite queens do not have a large abdomen. They are not good ant queens for a beginner to nurture, but they are hard to find anyway. ↩︎



Discuss

Letter to a close friend

November 4, 2025 - 16:17
Published on November 4, 2025 1:17 PM GMT



I am writing to say sorry. Like a childish king gifted with an empire he knows nothing about, blindly sending orders from his comfortable palace, I hurt you through my ignorance.

Over the years, I have learned that you are indeed more like a living empire—an agglomerate of wetlands connected by a network of rivers and canals of all sizes swirling around a rocky center and surrounded by a flat, dry desert. Your billions of civilians are defended by armies of soldiers waging chemical wars. Intelligence agents sample the battlefield, memorizing the smells of invaders to react faster in case they come back.

If I were the size of one of these soldiers, you would be taller than Mount Everest. Lying down, you would cover the area of a city. When I walk around the streets, I sometimes imagine seeing you like that in the distance. I marvel at your 3,000-meter-high feet, your legs covered in bent dark trunks, and the round peak of your knees, as I would marvel at a mountain range in Switzerland.

But the metaphors only go so far. Our relationship is much tighter, weirder also, than that of a king and their empire. I can simply think, "Hmm, I'd like to drink," and you adjust the tension of hundreds of muscles and tendons to form a dynamic equilibrium, balancing the weight of the water. You establish a communication line between my lips and my biceps to apply just the right amount of pressure for the water to slide down my throat. And when I dance rock, I don't only dance with my partner; I also dance with you.

You are in my head. In my dreams, I walk around and you are here; everywhere I look there is only you. You arrange the furniture in rooms before I open their doors. You are the puppeteer behind the characters, playing both strangers and familiar faces. You whisper in their ears the weirdest lines. You also take care of all the physics simulations (with approximate accuracy, I must say). Sometimes I realize it's you I am interacting with, so you take a break and allow me to fly around at will.

During the day, you are quieter. You stay in the background. You observe all the stimuli we receive and curate them to bring to my awareness only the ones you think are worth our attention. Like in my dream, you create this augmented, or maybe virtual, reality. Everywhere I look, things make sense. You stick labels on all the objects in my field of vision so I can name them effortlessly. When we talk, you blaze through the pages of a giant dictionary to turn the sounds from our ears into meaning, and then in reverse, to turn the flow of meaning I create in my head into movements of our tongue and lips.

I want to say sorry because I think I hurt you in many trivial ways. I applied unnecessary pressure on my skin after the shower, pushing the towel in hopes that it would absorb the water faster. I forgot to drink water in the morning. I scratched a wound you worked so hard to repair. I often exercised too little, even though I know you work a bit like a dog and need to go out and run almost every day to keep your natural cycles in order.

I also feel bad for ignoring you for most of the day. I know it's your mission, like a waiter in a high-end restaurant, a sound engineer during a concert, or the orchestra playing the background music of a movie: if I don't notice you, it means you've done your job right. But there is something that feels wrong about that. I am scared of how much I could hurt you if I don't even notice you.

And there is so much I don't know about you. I'd like to know you more, in all your weirdness. I even say "you" not knowing how many of you there are, or if it even makes sense to say "you" when I should say "I."

In the grand scheme of things, these are insignificant little things. After all, we are forming a healthy team. But I only have one of you, and I'd like to take care of you like you take care of me.



Discuss

Open-weight training practices and implications for CoT monitorability

November 4, 2025 - 13:49
Published on November 4, 2025 10:49 AM GMT

Introduction

Current reasoning models have surprisingly monitorable chains of thought: they struggle to control their CoT without direct optimization pressure applied during training (especially when CoT reasoning is necessary for task completion), and they have difficulty reasoning in all but the simplest ciphers.

This seems promising, but oh so fragile.

There are a handful of reasons for this fragility, which Korbak and friends outline well. Length control may force models to encode information in weird and dense ways that are difficult for humans to understand, linguistic drift during heavy reinforcement learning with verifiable rewards (RLVR) may be the default, or labs may be lured into optimizing away our faithful CoTs in order to appease their customers or regulators. We’re also already noticing some cracks in the foundation, with current models seemingly exhibiting some CoT pathologies. For example, GPT-5 has been found to produce illegible gibberish within its CoT, and models often fail to verbalize important reasoning steps.

But all in all, things seem to be going fairly well as far as creating models that mostly reason in a human understandable way in the situations we care about. In order to keep this happy accident around, we should be paying close attention to post-training practices that might influence CoT monitorability.

In this post, we review available details of open-weight training regimes in an attempt to better understand how frontier model training practices could influence CoT properties and to inform on CoT safety research directions.

We find many open-weight model providers either make no statements on the degree of optimization pressure applied to the CoT, or in cases such as MiniMax-M1, explicitly state training against the CoT. Researchers typically discuss optimization pressure in the context of a direct reward during RLVR (e.g. Baker et al. 2025’s discussion of training directly against reward hacking reasoning and Skaf et al. 2025’s findings regarding steganography generalization). However, even in the absence of CoT based rewards during reasoning training, developers are likely applying significant influence on the content of the CoT throughout other portions of the post-training pipeline. A prime example of this being extensive cold start supervised fine-tuning (SFT) on curated reasoning traces.

We have identified common practices across nearly all open-weight models. Some of these practices may influence the faithfulness of CoT but have been given little attention. We believe that these areas of post-training are likely to have a non-trivial influence on the faithfulness of CoT and are already being widely utilized, thus deserving further attention from the AI Safety community.

Common Practices:

  • Cold start SFT on reasoning outputs
  • Readability incentives
  • Preference training after reasoning training is completed
How open-weight reasoning models are trained

Deepseek-R1

Deepseek-R1 is trained in four stages.

Stage 1, Long CoT Cold Start: An SFT dataset of long CoTs is collected by processing and filtering outputs from Deepseek-R1-Zero to obtain reader-friendly reasoning.

Stage 2, RLVR: The model is then trained via RL on standard reasoning problems. The reward is output accuracy and language consistency applied to the CoT.
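As a rough illustration of what such a combined reward might look like, here is a toy sketch. The 0.1 weight and the ASCII-token proxy for language consistency are invented for illustration; this is not DeepSeek's actual implementation.

```python
def language_consistency(cot: str) -> float:
    """Toy proxy for a language-consistency term: the fraction of
    whitespace-separated tokens that are pure ASCII, standing in for
    'fraction of the CoT written in the target language'."""
    tokens = cot.split()
    if not tokens:
        return 0.0
    return sum(t.isascii() for t in tokens) / len(tokens)

def rlvr_reward(cot: str, answer: str, gold: str, w_lang: float = 0.1) -> float:
    """Combined RLVR reward: output accuracy plus a small
    language-consistency bonus computed over the CoT (weight invented)."""
    accuracy = 1.0 if answer.strip() == gold.strip() else 0.0
    return accuracy + w_lang * language_consistency(cot)
```

Even this toy version shows how a CoT-dependent term places direct optimization pressure on the reasoning trace, not just the final answer.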

Stage 3, Rejection Sampling and Supervised Fine-Tuning: Applied in reasoning and general-purpose domains. In the reasoning domain they filter out CoTs with undesirable qualities such as mixed languages, long paragraphs, and code blocks. An LLM-judge is used to assess some outputs, but it is unclear whether the judge sees CoTs. Notably, they refer to fine-tuning "DeepSeek-V3-Base," which creates ambiguity about whether this stage was applied to the RLVR-trained model or the base model.

Stage 4, General RL phase: Finally, the model is jointly trained via RL on both reasoning and general problems. For reasoning, the reward is a rule-based accuracy and format reward. For general purpose, reward model(s) are used. The helpfulness reward model only sees outputs. The harmlessness reward model sees both CoT and outputs.

Summary of CoT details and direct pressures:

SFT datasets are filtered for readability. They optimize the CoT against a reward model during harmlessness training.

Qwen3

The post-training for the Qwen3 model family is broken down into 4 stages: Long-CoT Cold Start, Reasoning RL, Thinking Mode Fusion, and General RL.

Stage 1, Long CoT Cold Start: Their cold start SFT data involves notable CoT filtering, where they excluded reasoning traces that indicated guesswork, inconsistencies between reasoning and final answer, and incomprehensible text.  

Stage 2, RLVR: Consisted of general reasoning training on difficult questions with verifiable data. This plausibly constituted a majority of the compute for post-training, although they were notably sparse on details in this section.

Stage 3, Thinking Mode Fusion: They train the “thinking” model via SFT with the ability to output responses with no CoT when given specific tags within the system prompt.

Stage 4, General RL: They train on 20 distinct tasks where the model is trained to improve instruction following, agentic abilities, and preference alignment. This final stage felt conceptually similar to what we perceived to be typical RLHF with some added bells and whistles for the agentic age.

Importantly, Qwen did not specify whether or not preference training was only applied to outputs, and we assume that Qwen did apply direct preference optimization to the CoT due to their models being open weight and needing to abide by strict guidelines from the CCP.

Figure 1: Qwen3 Post-training Pipeline. Preference alignment is applied after reasoning training.

Summary of CoT direct pressures:

Direct pressure during RL is unclear, with plausible optimization applied to the CoT after RLVR.

GLM-4.5

Zhipu AI (creators of GLM-4.5) do something kind of strange. They train 3 separate expert models during their RL training stage initialized from the same base model. These three distinct expert models are independently trained in the following domains: Reasoning, General chat, and Agents. These separate expert models are then distilled into the final, general model. It’s unclear what sort of optimization pressure is applied to the CoT during these stages, and if extended reasoning is encouraged at all during the general chat or agent setting.

With separate training pipelines, we may expect CoT in agentic settings to have significantly different properties from CoT in "reasoning" settings, which include large amounts of training on multiple choice settings and open-ended math questions. This has implications for the monitorability of the CoT in agentic settings, which we arguably care much more about for control protocols; if this technique became widespread, it could heavily limit the generalizability of studies on CoT monitoring.

Summary of CoT direct pressures:

Details of direct pressure are unclear, different types of optimization pressure might be applied in agentic settings.

MiniMax-M1

Notably, Minimax-M1 is the precursor to the recent Minimax-M2, which is the current SOTA open-weight model that rivals the coding performance of Claude 4 Opus.

Stage 1, Long CoT cold-start SFT: This occurs after some continued pretraining; however, minimal details are provided.

Stage 2, RLVR: They start with RLVR on reasoning tasks, where the reward includes a correctness and format term. Then, they gradually mix in general tasks where reward models provide the reward. They train the model's CoT heavily against the reward model. They discuss how the reward model can be hacked due to a CoT length bias that is difficult to fix. During training, they iteratively catch the policy exploiting the reward model via length-seeking behaviour, and then attempt to re-calibrate the reward model.

Summary of CoT direct pressures:

Heavy CoT optimization against a reward model.

Llama Nemotron

Llama Nemotron is a reasoning model trained in three post-training stages.

Stage 1, Initial SFT: The model undergoes supervised fine-tuning on both long CoT datasets and non-reasoning tasks.

Stage 2, RLVR: Standard reinforcement learning with verifiable rewards using accuracy and format rewards.

Stage 3, Preference Optimization: RL training for preference optimization, including instruction following and RLHF in general chat domains. It is unclear whether the reward model in the RLHF phase sees the CoT.

Summary of CoT direct pressures:

Details of direct pressure during RL are unclear, but plausible optimization applied to the CoT after RLVR.

Kimi K2

Kimi K2 is not a reasoning model in the sense that it provides verbose CoTs like Deepseek’s R1 or OpenAI’s o series. Moonshot AI does provide some reasoning training to their models before release, but this reasoning training is subject to harsh length penalization. This length penalization results in the final model generating reasoning that is at most one paragraph long. Moonshot also states they apply preference alignment before and after reasoning training. Specifically, they conduct SFT with helpful, harmless, and honest (HHH) examples, then reasoning training, and finally RLAIF where it’s unclear whether or not the reward model has access to the CoT.

Summary of CoT pressures:

Cold start SFT, intense length penalization, and preference training after reasoning training.

Discussion

The most common formula for reasoning models we see is:

  • An initial SFT phase that includes a long CoT cold-start
  • An RLVR phase, where the model is trained primarily via a verified outcome-based reward
  • A general RL phase, where the model is trained on general tasks against reward models (With GLM 4.5 and Kimi K2 being an exception)

We do see evidence of notable alignment-relevant direct optimization pressure placed on some model CoTs. Most notably, Deepseek explicitly mentions training against a harmlessness reward model, while MiniMax trains the CoT so heavily against a reward model that it overoptimizes it. In other papers, it is often unclear whether the reward models or LLM judges used have access to the CoT. The SFT phases often also involve selection pressures on CoT, including processing and filtering for readability and quality.

In general, models often have complex multi-stage post-training recipes. It is very likely that differences in these recipes could influence CoT properties, even if this does not directly involve CoT optimization pressure. For instance, large discrepancies in models' ability to obfuscate their CoT under prompt pressure have been noted even within the Claude family. Replicating CoT studies with open-weight reasoning models would allow us to make inferences about the types of practices that influence CoT monitorability, rather than attempting to make inferences based purely on black-box metrics.

We are excited for more safety research that leverages the knowledge of current practices used by open-weight model providers. Two years ago, it was not clear that the broader community would have much insight at all into frontier model practices. But the current gap between the frontier labs and open-weight labs is small, and the safety community should leverage this window of opportunity.

Open Questions & Research Directions

It seems to us that there are lots of low-hanging experiments that can and should be done to better understand how different aspects of the post-training pipeline affect CoT Monitorability. Below, we discuss safety relevant open questions grounded on the training regimes seen in the open-weight literature.

How do long CoT cold starts and other SFT procedures affect monitorability?

The creation of SFT data for most models likely involves choosing samples that score high on both action correctness and some CoT-related metric. While most CoT metrics currently used for filtering seem relatively benign and related to readability, some involve LLM judges (https://arxiv.org/abs/2506.13585) which could penalize misaligned reasoning. In cases where the action reward is poorly specified, this process could select for samples where a misaligned action is taken (e.g., sycophancy, reward hacks) but is not verbalized in the CoT. Such samples could discourage verbalization of misaligned reasoning and encourage the obfuscation of important CoT traces.

  • Does training on cleansed CoT reduce the propensity of models to reason about deception?
    • For example, does a model that learns to verbalize reward hacking during RL have this property removed if trained on SFT examples that reward hack but lack these reasoning traces?
  • To what extent does anti-scheming CoT training with cold start SFT induce obfuscated reasoning?
    • Apollo and OpenAI apply heavy optimization on the CoT through SFT - does this lead to obfuscation in ways similar to optimization during RL training?
  • Do any of the current open-weight models that have had alignment-relevant pressures placed on their CoT (e.g. Minimax 1) verbalize misaligned thoughts less than they would have otherwise?
  • To what extent does SFT on legible chain-of-thought improve monitorability?
    • This seems especially relevant for models with significant RLVR training that has led to decreases in human-legibility.
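The selection process discussed above (keep only samples whose action is judged correct and whose CoT scores well on some metric) can be sketched as follows. All names, thresholds, and the toy judge are invented for illustration.

```python
def select_sft_samples(samples, cot_score_fn, min_cot_score=0.7):
    """Keep traces whose final action is judged correct AND whose CoT
    scores highly on a readability/quality metric (both hypothetical).

    Note the failure mode discussed above: if the correctness label is
    poorly specified, a sample can pass with a tidy CoT that omits the
    real (possibly misaligned) reasoning behind its action.
    """
    return [s for s in samples
            if s["action_correct"] and cot_score_fn(s["cot"]) >= min_cot_score]

# Toy judge: a crude readability proxy (invented for illustration).
def toy_cot_judge(cot):
    return 1.0 if cot.isascii() and len(cot.split()) >= 3 else 0.0

samples = [
    {"action_correct": True,  "cot": "Add the two numbers, then halve."},
    {"action_correct": True,  "cot": "???"},               # unreadable CoT
    {"action_correct": False, "cot": "Carefully reason step by step."},
]
kept = select_sft_samples(samples, toy_cot_judge)
```

The point of the sketch is that both filters act jointly: a sample survives only if the action label and the CoT metric agree, which is exactly where poorly specified action rewards can quietly select for obfuscated reasoning.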

How does training CoT against a reward model during RL affect monitorability?

We have noted two instances of labs explicitly stating they have trained their CoT against reward models during RL (Minimax-M1 and Deepseek-R1).

  • Under what circumstances does training against a reward model for reasoning quality lead to CoT pathologies?
  • Why do some labs already train CoT against a reward model? What benefits do they get from this, and will avoiding training CoT against a reward model prove to be a high alignment tax for these labs now and/or in the future?
  • When does training for readability lead to CoT that is easier to monitor vs CoT that is obfuscated?

How do reasoning and non-reasoning training influence each other?

Many of the open-weight labs mentioned above have separate pipelines for training long reasoning traces through RLVR and for preference training in distinct settings that are often single-turn and require significantly less reasoning.

  • How does the order of training stages influence outcomes?
    • Does applying preference training before reasoning training produce different outcomes than applying it after? (E.g., are harmful reasoning traces less obfuscated when RLHF comes before reasoning training?)
    • Why does RLHF tend to come after reasoning training, rather than vice versa?
  • To what extent does preference training on outputs leak into CoT, resulting in the obfuscation of deceptive reasoning traces?
    • How effective are mitigation strategies like mind/face?
    • If strong output leakage is found, what does this imply for the relative order of preference and reasoning training?
  • How do fusion stages influence CoT across different domains?
    • Specifically, does CoT have different properties in agentic settings compared to “reasoning” settings such as math and coding when trained in the style of GLM-4.5?
    • When fusion occurs, does the CoT remain the same in each domain, or does substantial mixing of properties occur across domains?
  • Do the CoTs from hybrid reasoning models have different properties relevant to monitorability?
  • How do relative magnitudes of pre-training, cold-start SFT, and RLVR influence monitorability?
  • Desirable CoT properties arise primarily during RLVR reasoning training. How exactly do these properties generalize to other domains?
    • Specifically, how does reasoning training generalize to the propensity to externalize reasoning on tasks where reasoning isn’t necessary?
      • Early work on this here[https://arxiv.org/abs/2505.05410].

How does legibility trade off with monitorability and obfuscation?

  • How can we measure language drift?
    • How much training on verifiable rewards can we conduct before we should expect language drift to occur?
  • Does preventing language drift result in obfuscation?
    • Does forcing the model to reason in human-legible text encourage CoT pathologies such as internalized reasoning or steganography?
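One crude way to operationalize "language drift" for the questions above: track the fraction of CoT tokens that are recognizable words from a reference vocabulary. This is a minimal sketch with an illustrative toy vocabulary; real evaluations would presumably use a language model or human raters rather than a wordlist.

```python
import re

def legibility_score(cot: str, vocabulary: set[str]) -> float:
    """Crude language-drift proxy: fraction of alphabetic tokens in the
    CoT that appear in a reference vocabulary. Lower scores over the
    course of RLVR training would suggest drift away from legible text."""
    tokens = re.findall(r"[a-zA-Z']+", cot.lower())
    if not tokens:
        return 0.0
    return sum(t in vocabulary for t in tokens) / len(tokens)

# Toy vocabulary; a real pipeline would load a full wordlist.
vocab = {"the", "answer", "is", "so", "we", "add", "and", "check"}

legible = "So we add the numbers and check the answer"
drifted = "slvmarg kx0 zz qproto vnn ::: frob"
print(legibility_score(legible, vocab))  # high (most tokens in vocab)
print(legibility_score(drifted, vocab))  # → 0.0
```

Sampling this metric at checkpoints during verifiable-reward training is one cheap way to estimate how much RLVR a model tolerates before drift appears.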

Final Thoughts

We find a handful of worrying trends in open-weight LLM training that could negatively affect the monitorability of CoT. In general, labs should be more explicit about the CoT pressures they place on models. Once reliable evaluations for monitorability become available, these metrics should be computed at multiple points during training to better understand how different training steps affect monitorability.

Thanks to the open-weight labs for providing transparency into their training techniques, which allows third parties like us to audit them. Hopefully, these insights can guide future work on CoT monitorability over the next year, making research more applicable to the practices actually implemented.


Free Learning in Today’s Society: Some Personal Experiences and Reflections

November 4, 2025 - 13:30
Published on November 4, 2025 10:30 AM GMT

Editor’s Note

Ever since I began translating articles on free learning[1] and critiques of schooling,[2] a difficult question has loomed over me like a dark cloud, filling me with doubt. This question has troubled me for a long time and has recently grown heavier as I’ve received inquiries from many readers who are eager to escape the shackles of schooling and explore the path of free learning.

This difficult question is: what happens after we leave school? It’s just as Mr. Lu Xun described in his speech, “What Happens After Nora Leaves Home?”.[3] Nora, for the sake of her ideals and dignity, walks out of an unhappy and oppressive traditional marriage. Lu Xun was not optimistic about her fate: she would either degenerate or return. This was because, in the social environment of that time, women lacked economic rights. After leaving, she would either be ostracized by society, exploited by men, or forced by the hardships of life to eventually abandon her ideals and dignity, or else return to her old family. So, what happens after a dropout leaves? Previously, not being a dropout myself, I couldn’t make a definitive judgment. Today, however, I received a submission from a dropout that showed me the reality they face—a society not much different from the one Chinese women faced a century ago. Even though free learning endowed him with much knowledge and many abilities, society turned a deaf ear, even meeting him with ridicule.

Therefore, today’s article is not like my previous translations that expose the dark side of school education. Instead, it is an “Exhortation to Study”—an exhortation for everyone to find a way to survive within the school system. Many readers might think I’m about to do a “le zhuan fan,”[4] to “backstab”[5] them. But in truth, this article comes from a sincere place, hoping to provoke thought and promote meaningful discussion.

Although this article is quite long, every word is a sincere expression from the author. I genuinely implore my reader friends who wish to walk away from school education to please read this article patiently to the end.

March 14, 2023

Jarrett Ye

Free Learning in Today’s Society

Author: Anonymous 
Link: https://zhuanlan.zhihu.com/p/613820418
Source: Zhihu 
Copyright belongs to the author. 
For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.

The original reason for writing this article was a conversation. According to Jarrett, many people had read his articles and deeply agreed with his descriptions of the harm of school education. As a result, they wanted to drop out and devote themselves entirely to free learning at home, believing this would allow them to completely escape the damage from the education system and improve themselves more effectively.

After hearing this, I thought I might as well write an article on the topic to try and persuade these people otherwise. But as I wrote, I felt that simply writing a “persuasion to study” piece seemed to be speaking up for the school system. School education is so damaging to people, yet here I was, telling them to go back. It felt a lot like I was standing outside a fire pit, trying to trick people into jumping in.

So, after much thought, I figured a different approach might be best.

My inadvertent encounter with free education and my unique experiences are likely rare cases. Therefore, I flatter myself that I have some experience worth sharing, to give those who haven’t taken this step a better understanding of what the future holds.

At the same time, throughout this process, I will answer the question from the beginning—why I think “persuading people to study” is not a good thing, yet I still must write this article.

The Past

Let’s start with my experience.

I am a dropout, a complete and total dropout. I stopped attending school in junior high and have been at home ever since.

From about the age of 14 to 21, I stayed at home for a full 7 years. You could call me the most classic jialidun.[6] The number of days I went out in a year could probably be counted on two hands. The kind of painful isolation many people feel was just a pleasant daily routine for me, day in and day out.

My dropping out, like that of many others, began with a hatred for school.

Elementary School

In elementary school, my grades were quite good. I was first in my class and grade, and I could handle pretty much every subject. Confident, loved learning, top student—all these halos could have been placed on me back then. But just like one of life’s illusions, the feeling of being unique only lasts until its time is up. Everything ended around the fifth or sixth grade. I don’t remember exactly when, but my attitude toward learning shifted from pleasant to abhorrent. It gradually became not so easy anymore, but a painful task. I started to realize I no longer loved learning as I used to. I don’t know when I fell behind in a few classes, or maybe I was just never interested, but I began to not understand many things and lost the motivation to study. I couldn’t absorb it, couldn’t learn it, and my grades plummeted. My exam rankings fell layer by layer until there was no further to fall.

Most people might encounter similar problems at this stage. Some in junior high, some, like me, starting in elementary school. They say things get harder later on, and it’s quite normal for early top students to become poor students, or vice versa.

Unfortunately, I was blocked at this first great point of divergence.

I never figured out how I became unable to learn that knowledge. Every exam was agony for me, and this agony intensified my fear. I grew to hate school more and more. Every class felt like torture. Sleeping, drawing, or daydreaming were no longer enough to escape this torment. It felt like every minute in school was purgatory—and then I started playing truant.

Back then, I was more interested in wild fantasies than in painful studies. Be it novels or comics, I was like most of my peers in this regard, which wasn’t strange. After escaping school, I liked to follow the principle that “the most dangerous place is the safest place.” I would sit in the stairwell three floors above my home, reading novels and miscellaneous books that I loved, whiling away the entire day in a spot I was proud my parents could never find. I didn’t have to worry about any hateful exams, the teachers’ discipline, the public criticism for falling behind in class, or those important subjects that I didn’t like at all and even found terrifying.

What’s the use of math, English, and Chinese? What’s the use of exams to me? The future? Prospects? Adults always told me these were important because the school tested them, and if I failed, I would be doomed, so I had to learn them. But I always thought, why do I have to take these tests? I don’t even like these subjects. And those adults hardly ever used this knowledge, so did that mean they had no future? Although I knew the future and prospects were important, I just didn’t like it, and I shouldn’t have to suppress this dislike and show no resistance, right? I also wondered why I had to go to school. I didn’t like it either. Why couldn’t I be free to learn whatever I liked at home? Was reading novels and miscellaneous books really not learning?

Anyway, no one could answer these questions, and similarly, no one could change my aversion to these things. At that time, the school hired a “psychologist” who, in hindsight, was very unprofessional. He was supposedly a doctor, but all he did was try to persuade me to go back to school—I doubt you could even call it persuasion. He was just going through the motions. At least, I never saw him make any “therapeutic” or “persuasive” efforts. We just chatted about some things, and then there was never a next time.

In short, after efforts from all sides, I reluctantly returned to school after being truant for a while. After all, it was compulsory education; they couldn’t just expel me. But people can indeed adapt. Although I hated it, with my parents’ authority looming over me, since I had to stay in school, so be it.

Before long, junior high life began.

Junior High

My grades were still terrible, but I do remember the anticipation and excitement I had for junior high life—looking back, it was less about excitement for “school” and more the excitement a child feels for a “new and novel life.” In any case, at the beginning of the semester, this novelty completely offset the unhappiness. Everything was new: new school buildings, new classmates, new teachers, and new knowledge.

I’ve always liked new things, including knowledge—of course, only the things that interest me. I still disliked the main subjects, but back then, I really liked humanities. Politics, history, with a bit of geography mixed in. At that time, I basically read the books on my own and never listened in class. While everyone else was flipping through their books, they were looking at the front, and I was looking at the back. As for being diligent, I never really was. I’ve never been good at memorizing things since I was young, so I’d forget whatever the teacher asked us to memorize after a few read-throughs. But what was amazing was that for these subjects I was interested in, I could almost always pass, especially politics.

Don’t underestimate just passing. This was a score I achieved without studying seriously, with the lectures going in one ear and out the other, without memorizing any key points, and even while battling the debuff[7] of hating exams. So, it was one of my few sources of confidence back then. But the main subjects were a complete mess; I was bad at basically all of them. I still remember getting only a 9 in English. When the results were posted, we were called up to the front one by one to be publicly shamed. To be honest, I was expecting my score when I heard it. After all, I had just scribbled randomly on the test, filling in whatever looked right. The only thought I put into it was choosing between the long and short options. But what surprised me the most was, just when I thought I was last, there was someone with a lower score.

Some dude got a 0. It was the only time in my life I’ve ever heard of a score of 0.

Even I, who just wrote randomly, got a 9. How did this guy manage it? He turned in a blank paper.

In any case, this guy successfully drew all the teacher’s fire for us unfortunate souls, becoming the main target of criticism. And we, who had at least written something, got off lightly.

To this day, I’m still grateful to that guy, although I’ve forgotten his name.

In short, my junior high can be summed up in four words: severely unbalanced in subjects.

Of course, the novelty wore off not long after school started, and the aversion I felt in elementary school came rushing back. But everything is a process. I managed to hold on for half a semester, and then, I started bolting again.

Ah, my friends, different time, but same place—the stairwell.

I once again replicated my past behavior, taking books to the rooftop to read, though this time I added a few other scenes. To avoid getting caught, I started using mobile tactics: one day on the roof of another building, the next on my own building’s roof. I just kept changing locations, making sure my parents could never find me. Sometimes I would even go to an internet cafe, though I didn’t go right away. I think I started going after being truant for a while, mainly to kill time. This was typical for boys that age: rent a computer for an afternoon of gaming. Of course, I did the same. But I gave up after a few times. On one hand, it cost money; on the other, I hated the smell of smoke in internet cafes. Every time I went in and came out, I’d be covered in a strange smell. Plus, I was bad at games (and sadly, still am today), so it wasn’t that appealing to me. In the end, it was better to find a place to read by myself.

I’ll skip the more specific details, but it was basically a continuous cycle of playing truant, getting caught and brought back, and playing truant again. The school first arranged for me to take a leave of absence. I barely attended the second half of seventh grade, but they actually dragged me back for eighth grade. I was already behind, and after half a year of freedom, how could I possibly stand school? So, I ran off again less than a week after the semester started, and took another leave of absence. Finally, I was dragged back again in the ninth grade—I’m surprised the school hadn’t expelled me by then, but this was the last time.

After my final act of truancy, I didn’t hide this time. Instead, I went home and had a very serious talk with my mother about wanting to self-study at home.

Actually, my self-studying had already begun in the seventh grade, during my first full leave of absence at home. By the ninth grade, I had already gotten used to that lifestyle, which is why I wanted to have a thorough talk with my mom.

During my truancy period, my mom was already at her wit’s end. Of course, she would hit me, but even so, we had a lot of communication. I would often explain my thoughts and opinions to her, so she wasn’t just unilaterally rejecting me. And this time, to be honest, even today I can hardly believe it, she just agreed to my request. Just like that, so smoothly, she let me stay at home. I only remember going back and having a good chat with her about my desire to self-study at home, telling her that I would definitely study hard—and then she agreed.

She said she believed her child would do it, so she agreed.

But looking back with today’s perspective, I think it was definitely more than just belief. At that time, it was less about agreement and more about exhaustion. The decision to let a child not go to school is something very few parents can accept so easily. I talked to her about it later, and she told me about her state of mind back then. She was having problems with her business, and her child was rebellious. She just couldn’t take it anymore, so she might as well let me stay home. In fact, my mom has said many times since that she should have forced me to go to school back then.

But I don’t think that’s in her nature. Although she hit me many times because of my truancy, her love for me always outweighed everything else. Being a single parent, she couldn’t resort to harsher methods. As I got older, she couldn’t really force me to do anything anymore, which is why she agreed after I made that request.

Before talking about my self-studying experience, I want to talk about my family first.

I feel that my ability to self-study so smoothly is largely inseparable from my family.

Family

I love to read. Yes, this hobby probably started when I was a child. There was a Xinhua Bookstore across from my house where you could read for free. Although most of the time my mother would buy books for me to read at home, sometimes I liked to sit in the bookstore. My mother was always busy and rarely had time to look after me, so I had to find interesting things on my own. I would either play with my friends, or sit on the floor of the bookstore, legs crossed, reading something I liked, slowly passing the afternoon, and then go home for dinner in the evening.

The process of reading always made me feel like I was exploring a new world. Unsolved Mysteries of the World, which seems absurd now, was on my bedside table back then. There were also The Diary of a Laughing Cat, The Adventure Team, and UCG—novels and magazines that I found extremely interesting. I often liked to read them late into the night, sometimes finishing a whole book before sleeping. Perhaps my habit of staying up late was passed down from that time.

My mom raised me all by herself, and she has always been a very open-minded and kind person to me, so kind that I feel it’s doting. You could say she would support whatever I wanted to do. If I wanted to buy books, she would buy them for me. If I wanted to learn piano, she would help pay the tuition. If I wanted to study painting, she would support that too. Every time I wanted to learn something, she would just smile and say, “You always have a three-minute passion. Isn’t it a waste of money?” but she would still give me the money. My family wasn’t wealthy. We ran a small shop, and our income was about average. But she got me the things I wanted, like a computer and a phone, early on. You could say she gave me unprecedented support.

All of this, in the eyes of some parents, would be considered an absolute disaster. Playing truant every day, wasting money on miscellaneous hobbies that have no use for studies. “With all this effort, can’t you focus on the right path?” And with that mindset, they would sign their kids up for a bunch of “right path” tutoring classes and force them to learn. I’m afraid most parents would be more inclined to do this, rather than consistently supporting their child like my mom did.

But this support became a crucial foundation for me later on. In my growth process, I encountered absolutely no external obstacles in exploring my own interests; there was only the retreat after exploring something and finding it uninteresting. And reading a large number of “leisure books” also cultivated my ability to handle long texts without issue. The idea of “only being able to read if there are pictures” doesn’t apply to me. To put it plainly, it’s an adaptation to low-density stimulation. No matter how many words there are, as long as I’m interested, I can read it.

You could say I had already entered the process of free learning long ago, I just hadn’t realized it at the time.

Self-Study

Politics and Zhihu

I have a hobby that’s not quite typical for my age. Probably after junior high, I started to really enjoy reading about society-related topics.

How should I put it? If I look back with today’s perspective, I would say I have a strong interest in human political organization, society, and all the related sciences and history. But at the time, of course, I couldn’t explain it like that. I just really liked reading related content, whether it was books or articles. In school, it was politics, history, and some human geography. Outside of school, it was miscellaneous books and articles on my phone that others considered useless for exams and studies. After I completely left school and stayed at home, you could call me an “internet-addicted youth” who was on his phone and computer all day. Except I wasn’t playing games, but reading a large amount of content that interested me.

Most of my absorption of social science knowledge came after I started using Zhihu.[8] I began following Zhihu on my phone back in my seventh-grade leave of absence. I hadn’t registered an account then, but I thought it was a very interesting platform with many interesting articles. I could find answers to almost any question I was interested in, so I was constantly browsing it.

It wasn’t until I officially registered an account in eighth grade that I became completely engrossed. After my leave of absence, although I said I was self-studying, like most teenagers, I didn’t really have a strict goal or any plans. I said I was self-studying, but I didn’t know what I would learn. At that time, all I did was browse Zhihu and play on the computer, browse Zhihu and play on the computer, reading things that interested me. Although I occasionally played some games, most of the time I was on Zhihu. I would often browse until I was completely lost in it, opening my phone in the morning and browsing until night, then suddenly looking up as if I had been in another world, wondering how it got dark.

Every single day, I was reading a massive amount of things that interested me. One moment it was politics, the next history, then economics—mostly social sciences. Of course, when I was too tired, I would look at some brainless, witty content.[9] In short, I had a blast browsing.

This was completely different from the conventional impression of “self-study”—sitting with a textbook, reading it sequentially, and constantly taking notes and memorizing from online courses. No, no, no, I’m not a fan of that stuff. My only goal back then was one word: “satisfaction.” Whatever was satisfying, I did. Whatever was easy, I did. Whatever interested me, I did. It’s not that I didn’t try watching online courses and reading textbooks; they often just brought back bad memories after a short while, like being back in the classroom. I could almost never last, so I eventually gave up on them.

Because of the very broad interests I cultivated since childhood, I would read about almost anything that caught my interest. And I’m grateful that Zhihu had almost everything. Whatever I wanted to learn about, it seemed to recommend a series of related content. Or I could just search, and I could almost always find high-quality articles that I couldn’t find elsewhere. It seemed that for almost any topic I was interested in, no matter how niche, it could find something.

Looking back now, isn’t this exactly free learning? No compulsion, no goals, only personal interest. Of course, I didn’t have that awareness at the time. I was just reading broadly based on my own interests, for the sake of pleasure.

Even programming, which later completely changed my family’s situation, was learned during this process.

Programming and Mania

As for programming, learning it was probably the closest I ever came to a “formal self-study” experience. I not only found a textbook but also watched an entire online course from start to finish.

At that time, I found it very interesting. You could input something and get immediate feedback, which was very satisfying. So I found a beginner’s programming book, if I remember correctly, it was Python Crash Course. This book, written by a foreign author, was completely different from the school textbooks I had seen before. The author was witty, used a lot of examples, and it was a pleasure to read.

But strictly speaking, I didn’t follow his instructions to the letter. What I did was constantly replicate the code based on his requirements. The general structure of the book was: start a new chapter, present a requirement, show the code, and then explain it. But I basically ignored all that. As soon as he proposed what to write, I would leave the book and independently implement a program that met his requirements. My solution was always completely different from his example, but miraculously, it always achieved the same effect, which deepened my interest in programming.

Of course, this process was also accompanied by a lot of pain. Because I was replicating it independently, errors were all too common, and I couldn’t find solutions in the book. I had no idea why the errors were happening. So, the end result was that I passively learned the skill of all chengxuyuan,[10] Baidu-oriented programming.[11] The internet has everything; it just depends on whether you know how to search. Sometimes it was serious, with very complex problems, but sometimes it was comical, like being stuck for an afternoon not because of a logic issue but because I missed a comma.

In the end, I could always find a solution and overcome the difficulties. You could say that for the entire second half of that year, I was completely absorbed in it. During the peak of my enthusiasm, almost every day when my mom came into my room, she would see me staring motionless at a bunch of colorful text on a black background, as if I were possessed.

Of course, I wasn’t at this high intensity every day. The interest in learning is always a process of peaks and valleys. When I wanted to learn, I’d be at it for days on end. When I didn’t, I might not touch it for days. It was on and off like that. I’d just look up what I didn’t know, follow some articles on Zhihu, find some books, and learn bit by bit from the code in them. Finally, after I managed to build a web scraper, I suddenly felt that Python was no longer interesting, so I dropped it and never touched it again. But later I went on to learn front-end development, completing an entire online course. The process was similar: when the teacher gave an example, I paid no attention to the content, tried to replicate it myself first, then went back to compare how it was written. I finished a 40-plus-hour React course over a few months like that. Anyway, it was quite interesting.

This is a characteristic of mine: when I’m interested in something, I will learn it and read about it frantically. When I lose interest, I drop it immediately and might not touch it again for a long time. And that’s indeed how it was. Later, my learning cycles stretched to a monthly basis. Sometimes I might get a sudden whim and dive back in for a day, but the high-intensity, continuous days or even half-months of obsession like in the beginning basically never happened again.

But I never expected that just this period of learning would completely alleviate a major difficulty my family later faced—

As for what that was, I’ll get to it later.

Anyway, besides these topics, I was interested in politics, economics, history—any social science topic. Philosophy, psychology, artificial intelligence, Japanese, and even creative writing—I studied them all.

What I had in abundance back then was time. While others were painstakingly memorizing texts, I was reading things that interested me. While others were taking exams, I was reading things that interested me. Anyway, I just spent my days reading aimlessly, reading whatever was interesting. This perfectly matches that diagram of free learning—chaotic, yet broad.

Novels and Creation

I even wrote novels, because I’ve always had a strong interest in creating stories (and still do). Of course, as I got older, I also realized I needed to earn some money. And writing web novels was indeed a way to do that. At that time, I had no knowledge of creative writing techniques. I just wrote with pure passion. Somehow, I managed to get a contract. So, I wrote. I painstakingly wrote over a hundred thousand words, and the more I wrote, the more painful it became. Writer’s block is a hurdle every writer faces, and I was no exception. After padding out more than sixty chapters, I finally couldn’t take it anymore and “castrated” the project.[12]

But after that, partly because I had no money and partly because my hands were itching to write, I started another book. Then I repeated the same process, only this time I wrote for longer. I finally earned my first pot of gold—the 600 yuan full-attendance bonus from Qidian.[13] My mom was so happy, saying I could finally earn money instead of just idling at home. But in the end, I hit the writer’s block wall again. After struggling for a long time, I accidentally deleted several thousand words of my saved draft, and I completely bengbu zhu[14] and entered the palace once more. This time, the record was 179 chapters, a new high for me. Later, I started a few more books and even changed my pen name. In short, I got a contract every time, and every time I abandoned it not long after, never again making it to the point of being published on the paid platform to get the full-attendance bonus.

Censoring this to avoid being doxxed,[15] though I don’t know why the final author’s note was blocked.

But I always have this urge to create. Frustrated with why I couldn’t write stories, I searched online, and that’s when I started to learn about creative writing, about techniques, shaping conflict, character development, and so on. I read many articles by experts on Zhihu on how to write, read many translated creative writing books, and learned from the experiences of other web novel gurus. But it’s not like learning the techniques guarantees you’ll write well. I still ended up abandoning my projects. Internalizing these techniques truly requires deliberate practice; writing can’t just stay on the page. Of course, my reading on creative writing continued. In short, I was doing everything I could to create stories that could truly move people, instead of blindly starting new books.

I haven’t formally written a book in a while, but the desire is still there. When something lasts this long, it probably means my passion for writing isn’t just a three-minute wonder. This is truly where my interest lies, and I really want to make it a reality. After reading many techniques recently, I feel a growing sense of understanding, a feeling that’s hard to describe, like I’ve grasped something. So, I’m planning to prepare and start a new book based on what I’ve learned. Of course, it’s still in the conceptual stage. In short, the techniques that free learning brought me seem to have been partially internalized. Whether that’s true or not will have to be tested when I start writing, but right now, I feel I won’t just dive in with blind passion like before. First, I’ll research the market, see what the audience likes. When I write, I’ll consider foreshadowing and suspense, how to stir the audience’s emotions, create conflict, and so on. At least, I now know what to do when I have writer’s block.

Reflection

In short, it was a constant process of learning. I would just encounter something I was interested in, read about it, then drop it and move on to something else, then drop that too, constantly changing my interests, reading materials, directions, and fields. At first, I was constantly browsing Zhihu; that’s how it was for the first year or two. But later, I slowly started to feel that Zhihu wasn’t enough. I began to turn to various e-books, and then to academic papers, not just in Chinese but also in English.

Yes, I mentioned before that my English was terrible, and I even hated it. But some high-quality knowledge is only in English, so I had no choice. Fortunately, I had the powerful weapon of machine translation, so I could still read it despite my poor English. Anyway, I tried all sorts of things I could read. As the depth and breadth of my interests grew, my reading materials gradually expanded beyond Zhihu into various other fields.

Looking back now, I’ve discovered a few important points.

Broad reading and acting based on interest are essential prerequisites. My childhood habit of reading and my mother’s doting nature—getting me whatever I wanted—cultivated these from the very beginning. I don’t feel disgusted or bored by large amounts of text, nor do I just keep my interests bottled up without acting on them.

These served as a foundation, exposing me to a vast and complex body of knowledge, which trained my later thinking and opened up the second stage of free learning.

Critical Thinking

What kind of thinking?

Critical thinking—these words have probably been beaten to death by many, but I must say, it is truly crucial for free learning. But how is it cultivated? Like any skill, it requires training. And how do you train it?

For this training, I believe it’s more important not to look for a textbook on critical thinking. It can be a reference, but being completely boxed in by its content will only make you hate it. Anyway, I’ve never finished any of those things. All my critical and logical thinking was developed for the purpose of reading, and built upon reading.

Because I read a huge number of articles on Zhihu at the time, this large volume brought a problem: they included all sorts of viewpoints. Many of these viewpoints would conflict with each other. What was I to do? At first, like many people, I would subconsciously avoid them, only reading articles that supported my own views. But, I don’t know what I stumbled upon, or maybe it was just that Zhihu’s early algorithm wasn’t perfect, but my feed wasn’t filled with only things that supported my views. This led to a result where it would sometimes even recommend things that opposed my views. And when I saw them, I’d get curious. Just pure curiosity. I just wanted to go in and see.

I clicked in, read it, and found it very problematic, which triggered a series of physiological reactions. I believe many people have had this experience: you see something you oppose, you feel uncomfortable all over, and you have a visceral urge to reject it, to escape the text. But I don’t know why, I have this compulsive need to finish any article I click on, unless it’s truly uninteresting. And it just so happened that the articles opposing my views were the ones I was curious about, so in a sense, they were interesting. Although I would back out of most after a quick look, I would always finish the interesting ones.

Then, one day, I suddenly realized that my resistance to these things didn’t seem so intense anymore.

At this point, I seemed to have gained some awareness and began to reflect on my previous actions. I had read a lot and knew about things like critical thinking. At the time, I was wondering how to develop this skill. It was described in such a mystical way, as if there was no practical method. Everyone said to think critically, but I had no idea what that even was... But that strong rejection of opposing viewpoints, and then the lessening of that rejection as I read more—wasn’t this the process of developing an open, critical mindset? So, at this point, I started to deliberately read them. I had to go in, I had to see how I would judge the content of these articles with my knowledge and logic.

Yes, just reading is not enough. The evaluation and output after reading are equally important.

It was truly uncomfortable, my whole body felt it. Every moment I spent reading those articles, I wanted to back out. But in the end, I forced myself to read on, forcing myself to reread them repeatedly to understand what the author was saying as much as possible without creating a strawman,[16] and then carefully analyzing the views in the article with my own knowledge to judge their validity. Why did I think it was problematic? Where was the problem? Were my views and stance reasonable? Did they have support? Was it based on emotion or on well-evidenced agreement? Just like that, I would read views I supported and views I opposed, and in this constant collision between two viewpoints, I gradually seemed to get used to it. The strong feelings lessened. Although it was still very uncomfortable, I could persist in reading, and even read them repeatedly.

This training in dealing with opposing views really did make my brain adapt.

Many times, I would even intentionally train the recommendation algorithm. I would linger a bit longer on articles that opposed my views. Although I wouldn’t like them, I believe this prolonged stay would definitely be recognized by the algorithm. And this extra time also allowed me to read those views more carefully and repeatedly, and to critique them based on my background knowledge. Of course, it wasn’t all criticism; sometimes I would partially accept some of the content and change my original understanding.

As this whole process went on, I began to realize a painful truth: I might be wrong.

In my reflection and exposure to more criticism, I discovered that my original views had many flaws. They might be unstable, without evidence, a pure emotion, a blind faith. I still remember that winter; my thoughts were in utter chaos, and it was extremely painful. It felt like my worldview was being rebuilt. I began to rethink the worldview I had built up through extensive reading, discovered its errors, and it wasn’t until the following year that I finally completed the subsequent steps.

So, it was only after this destruction that I felt my critical thinking had been preliminarily established. Although I don’t have the confidence to say I am a completely critical person, I can at least say I have a certain critical ability. At the very least, I can articulate a point based on my logical arguments and ensure that all the beliefs I establish are supported, not just blind faith, with reason measuring everything instead of emotion.

Debating and Outputting

So here comes the third point. Another important aspect of free learning that I discovered, besides interest and reading, is the training method for critical thinking.

And within this actually lies the last point: the analysis during training, which is essentially a form of output.

And my main training method for this output back then was—online debating.[17]

Yes, online debating, or you could call it discussion. Discussing with all sorts of people, those who support your views and those who oppose them. In any situation, under any possible circumstance, engage in some form of output: writing, commenting, note-taking, or just saying it out loud. Although I recommend writing it down, as long as there is output, this step is complete. Output is itself a summary of the thoughts and content we’ve learned. The moment I output it, it means I must take someone else’s original content and express it in my own words, apply it to a specific question, and explain why this content is used for this question. Even in the face of various people refuting your views, you need to output to defend your own views and critique their flawed ones.

Why do I call it “debating”? Because often, my desire to output is triggered when I want to respond to views I disagree with. Yes, defending one’s own beliefs, this most instinctive thing, ended up being my training for output. In the 2549 days since I registered on Zhihu, I have written thousands of comments. A large part of them were casually written while I was bored and reading articles back in those years. It was these casual actions that formed a habit, so that today, whenever I have an opinion, I immediately say it or write it down.

Don’t underestimate this process of debating. Especially when facing someone who opposes your views, you will frantically search for information to prove your point. All sorts of papers and data can be used in your arguments. I still remember one time, to refute someone, I spent a whole day reading literature, just to confidently slap them in the face that night. The process of debating forces you to carefully examine the other person’s argument and try to prove your own. No matter how this proof is attempted, it is an attempt at proof. It forces you to use your logic to win, to use enough evidence to overwhelm them. And for this process, you need to try to construct text, to constantly organize your arguments, and to read a wider range of literature.

Sometimes you’ll even find that you are the one who is wrong. That is an excellent opportunity. You’ve encountered a rare chance to have someone correct you for free. Although in such situations, I usually wouldn’t verbally admit defeat, in my heart, I had already accepted it, because in such cases, my arguments would become weaker and weaker until I had nothing left to say.

So I knew I was wrong, I realized it, and then I had the chance to change it.

The so-called truth becomes clearer through debate. And you can also learn during the process of debate. This is a key point I’ve realized from my online debates to this day. Debating is not just about venting emotions and defeating the opponent. The key point of debating is to organize your own logic and then prove your correctness with rich arguments. If you do it enough, you’ll even get used to the emotions. The more you get used to it, the easier it is to calm down and observe yourself, and not let yourself fall into emotional actions, merely insulting the other person out of emotion and thus becoming irrational.

Of course, if you really don’t want to debate, you can try to output a lot by writing comment-notes. What are comment-notes? After reading something, you write a comment on it based on your own background knowledge. It’s essentially simulating the process of publishing a comment on a public website, or you can just rephrase the author’s ideas in your own words. Of course, this way you can’t enjoy the service of being corrected by others for free, nor the training in emotional stability, nor the process of frantically checking sources and absorbing knowledge to refute the other person. But writing notes is still output. As long as you summarize the knowledge you see into your own language and output it, it’s always better than nothing.

So, the above are the four points I’ve summarized from my past free learning experience: reading, interest, critical thinking, and output.

Programming and Change

Free learning has helped me immensely. Remember what I said about programming earlier?

I said it changed my family and helped us overcome a major hurdle. Because at that time, I earned some money with my programming skills. During the pandemic, our family business was struggling, inventory was piling up, and on top of that, we had a lot of debt from various family reasons, all of which came crashing down on us. We desperately needed money, a lot of money. The bank wanted us to repay, relatives wanted us to repay, but where would we get the money? There was no business during the pandemic, and it was impossible to go out and work. What could we do? I had no earning ability, and almost all the burden fell on my mother.

You could call it a coincidence. Long before the pandemic, I had created a small tool for my daily use, to meet some of my own needs. And it just so happened that with the surge in demand for remote work during the pandemic, my little tool was perfectly suited for it. At first, I was just trying it out to earn some pocket money, not thinking it would solve our family’s income problem—and then, beyond my imagination, it suddenly blew up.

I suddenly earned an amount of money that, in all my years of life, I had never even dared to imagine I could earn myself.

Don’t forget, before this, the most I had ever earned was the 600 yuan per month full-attendance bonus from writing web novels. I had no other source of income whatsoever. So, you can probably imagine the shock this explosive income gave me.

Most of this money was used to pay off debts, so my monthly expenses were almost the same as before. Except for spending five thousand yuan on two art courses without batting an eye, I had almost no large expenses.

Look at how much is left by 2023. So, I’m also very fortunate that for me, there was no issue of it being hard to go from lavish to frugal. The whole process was like a dream. It rose suddenly with the pandemic and then disappeared just as suddenly with the pandemic. However, my feelings about it are rather calm, because I’ve always seen it as luck. This was not something I could achieve with my own abilities. For example, this little tool itself had almost no technical complexity. I just threw something together. Anyone who has learned a bit of programming could have made it. Is that my skill? If this is really my skill, then my level is far too low.

So, in the end, it was just that I was lucky. At the right time, with the right thing, I entered the right place.

But after all, what I learned was the foundation. It formed the most important basis of that “right thing.” If I had followed the conventional education path, I probably wouldn’t have come into contact with programming. Given the time I spent learning to code, I probably couldn’t have found half a year to study it so intensely. Maybe I would have had time in college, but by then, I would have likely missed the pandemic boom.

So, without those programming skills, I would never have been able to catch that luck. For that, I probably have free learning to thank.

And now, having said all the above, it’s time to talk about another issue.

Returning to the beginning: do you think I’m encouraging everyone to drop out and study like me? That would be a huge misreading. The question this article is really about is precisely this: should one drop out to pursue free learning?

People ask what happens after Nora leaves. I probably can’t bring myself to champion the school system, just as I couldn’t persuade Nora not to leave. After all, I am the most thorough beneficiary of free learning. Unlike many who completed the entire traditional education process, I started self-studying early on, so all my knowledge is thanks to the word “free.”

But why do I still adopt a “persuasion to study” stance, like someone trying to convince Nora not to leave?

My argument is that whether Nora leaves or not is ultimately her own business. She has her own judgment, which no one can forcibly interfere with. But it is also extremely important for her to understand “what will happen after she leaves.” She can’t just leave in a fit of passion; that often only leads to a tragic end—on this point, Mr. Lu Xun has probably said enough.

And my past is now told. It’s time to talk about the present.

After Nora left, what really happened?

Reality

It’s time to talk about the dark side of free learning. No, rather than free learning, it’s the dark side of today’s society.

If free learning enhances personal abilities, then for those abilities to matter in society, you still need to connect them to it. And often, this connection requires the help of a diploma.

So, we can naturally draw a conclusion: free learning = personal, school education = social.

Facing Society

This conclusion has been mentioned in many of Jarrett’s articles, but without the real experience of entering society with a low level of education, it’s probably hard to truly feel the disconnect between the two.

I am now twenty-one years old, which on the conventional track is about the age of a college junior. Many of my peers are still in school, struggling under the oppression of compulsory education, and may never have experienced another, more terrifying thing. But for the countless peers who didn’t go to college, who dropped out early to work and earn money, this indescribable thing is something we have probably experienced countless times: the oppression and compulsion of society.

This compulsion, in my view, is infinitely more terrifying than that of school education. If school education is purgatory, then social compulsion is the eighteenth level of hell. Why can society smooth out all of a person’s sharp edges? It’s because for a single, atomized individual, when faced with the overwhelming force of society, they are often completely powerless to resist and can only endure. In the end, it often becomes not society adapting to them, but them adapting to society. Any resistance they put up will be met with negative feedback. They will lose their jobs, their source of income to support their families, and their entire foothold in society. And if they choose to integrate into society, to smooth out their sharp edges, they will be rewarded. The more they suck up to their superiors, the more they gain; the more they step on others to climb up, the more benefits they receive.

Such a society successfully turns everyone against everyone else, resulting in a war of all against all under competition, and you must participate to secure a place for yourself. I don’t want to discuss too much here about how the free market or capitalism has shaped all of this. I just want to make one point—that companies value diplomas much more than many people think.

You might think that if you have enough ability, you won’t have to worry about finding a job, right? That’s certainly true, if you’re talking about restaurants, factories, security, cleaning, construction labor, and so on. You are capable of these jobs, and you can find one anytime. Wait, you don’t want to do these? You want to do the jobs you actually want to do, the ones you’re interested in? Then here’s the problem: what about your diploma? “The company will surely value my ability and overlook my diploma?” Let go of such fantasies. Unless you can produce some absolutely solid achievements, for a company, a diploma is the best signal of ability. Yes, it’s just that unreasonable. Even if a diploma is not at all equivalent to ability, to reduce screening costs, if you can’t provide a “signal” sufficient to make up for this gap, you will be considered incompetent.

And how many people can be like Jarrett or other powerhouses, publishing in top journals while still in university, or writing various in-depth papers and being pre-selected by companies? Those who are thinking of dropping out can ask themselves honestly, setting aside all that confidence, do you really think this is easier than school education? You have to know, those who come out of the school system are getting internships through campus recruitment before they even graduate. And you? You’re still working on producing something “explosive” enough to prove your ability. And what are you proving? That you’re on par with ordinary graduates who have achieved nothing yet, but have merely completed a conventional education. Do they need to prove their ability? No, their diploma directly equals their ability, at least in the eyes of the company.

This is something I’ve come to understand deeply after looking for a job, and even today it traps me like a spider’s web, holding me in place, unable to move forward.

I mentioned my experience of earning money above, but do you really think that something based on pure luck has enough reference value? Do you think this experience, which is akin to winning the lottery, can be considered a common path for dropouts who pursue free learning? This so-called income came quickly and went quickly; after the pandemic, it was all gone. Waiting for another boom like the pandemic? You might as well go buy a lottery ticket.

Although paying off the debt helped my family through a difficult time, the future still has to be lived. There are still over a hundred thousand yuan of debt left, and in 2023, I had no other source of income, which forced me to look for a job. So, that money-making experience only postponed the whole process. For most dropouts like me, they probably had to start working several years earlier.

Seeking Proof

And then I encountered the most unforgettable and deeply ingrained prejudice against education level, which may not have been intentional, but was the most realistic and implicit. Even when I presented my resume with my past entrepreneurial experience, in the industry I wanted to work in, not a single company was willing to accept me. I applied to nearly a hundred companies, and not a single one, yes, not a single one, was willing to invite me for an interview, or even just to communicate with me to see what kind of person I am. I understand this, yes, I understand it deeply. From the company’s perspective, what’s the situation now? Youth unemployment is high. They can’t even get through the resumes of a bunch of undergraduates, or even master’s and doctoral students from 985/211 universities.[18] Who would bother with a junior high school student with “special experience”? Is he better than these graduates from famous universities? Is he better than a PhD? I’m afraid the company’s HR backend probably didn’t even see my resume. A simple “filter by education,” and sorry, all those outside the selection are automatically deleted.

Everything I learned through free learning is useless now. No one is willing to hire someone who dropped out in junior high. You think you can “prove” your ability to the company, but the truth is they don’t even “see” you. How do you prove yourself to someone who can’t even see you? It’s impossible.

And my only choices are to work on an assembly line in an electronics factory, be a waiter in a restaurant, or work as a security guard, a cleaner, or a construction worker—this is the destiny for us junior high dropouts. Is learning so much useful? It’s just taking a 40-year shortcut[19] to end up like old Mr. Zhang at the gate, whiling away life in the mailroom for money.

I’ve learned so much, and then what? Is it useful? Who wants a mid-way dropout? Who do I think I am? Do I have any proof? Who do I show my proof to? Right, who am I? A person who dropped out of junior high, self-studied at home, a person who completely deviated from the conventional path. Yes, I’ve learned a lot, but this society rejects me. I’ve completed my self-improvement, but now I have to worry about my next meal. In society, ability is a diploma. So, a person without a diploma is tacitly considered to have no ability.

Have you ever thought about what Jarrett’s education level is? Have you thought about what Woz’s education level is? At least they are not junior high dropouts. Even for a company like Maimemo, if you look at their public job postings, you’ll find that a university degree is basically the starting point. And for various white-collar office jobs, you need at least a junior college degree. The HR who hires you is also under performance pressure. If something goes wrong with you, they have to take full responsibility. Even if you can provide solid “proof,” if the HR is unwilling to take the risk of your deviation from the norm, they will conservatively choose a recent graduate with a high education level and a good school, but with no proof or qualifications yet. If you were the HR, which one would you choose? Think more rationally: what level of proof does a junior high school student need to provide to make you give up a 985 graduate? Or even an undergraduate?

And the difficulty of obtaining this proof, for you, might be harder than completing school education.

When our own abilities are insufficient, we cannot change society as individuals. All we can do is struggle as much as possible under this great net of society. Only by first accumulating enough ability can we call on more people to join in and change everything. Before you can study, you need to be able to eat, to be clothed, to be able to support your family. I believe everyone who advocates for free learning must hate school. That’s right, we all hate it. And I, my hatred for it is deeper than anyone’s, so much so that I changed my life’s path for it.

But as an idealist, we should also be realistic idealists. As a dropout, I have come to deeply understand how important a diploma is in today’s world. Not being able to find a job also means I have to go hungry every day, which affects my ability to pursue free learning. Every day I worry about my next meal and have less time to read, less time to explore my interests. All I care about now is income, and when I won’t be kicked out by my landlord. Ideals and reality must be balanced. And that means we need to continue to endure in school, at least to meet our basic survival needs first.

The factory assembly line will grind away all your free spirit. The tedious, repetitive work, day after day, will erase all your creativity. Believe me, you absolutely do not want to work in such a place, especially when you know so much. This pain will only increase daily. Sometimes, knowing nothing is indeed a blessing. But once you know these things, you can never go back—because you chose free learning, you chose to improve yourself, and this improved self is so out of step with the current society.

But who is to blame?

This society needs screws, but you have become a nail, a sharp and piercing thorn—

You are not welcome here. You can’t get in anywhere.

Looking Back and Forward

I’m not saying here that everyone should submit to this society. That is what I oppose most vehemently. In my daily life, I am this society’s most radical critic. What we need to do is precisely to challenge it. But before that, of course, I must first be able to sustain my own survival, obtain a university degree, and even pursue a master’s or a doctorate. Like Woz, with enough influence, I might have enough ability to call on everyone to change this together—I’m not saying this is the only path, nor that only a PhD has the power to change the world.

I’m just trying to say that even for the sake of your free learning path, for some hope of changing education, to be able to go further, to make this nail of yours sharper, more tenacious, to prevent yourself from being ground down by society, you must struggle in the cracks of society. Get a sufficient diploma, earn enough income, have a sufficient livelihood in this society, only then can you continue to temper yourself. Otherwise, you will only be ground into the shape of a screw under the compulsion you have no choice but to accept, and be forcibly embedded into it.

I can say this because I am almost at my breaking point now.

To break free from this great net, I am now back on the old path, back to traditional education, back to the old road I once hated, studying for a purpose I dislike. I am now studying for exams, studying to get a diploma. This is what I loathe. I left school precisely to escape them, and now I have to actively call them back.

I will never like exams in my life. Never.

But even so, I must accept it.

I want to have a sufficient economic foundation to support my free learning. I want enough money to repay my mother who raised me since I was a child. My family needs to pay off debts. I need to pay rent so I don’t sleep on the streets. I want to become a researcher. I want to explore the mysteries of world politics and economic operations. I want to learn freely in a broader space.

I will never regret those years of free learning, and I will absolutely never regret my decision to drop out and self-study. Without those years, I would not be who I am today. Without that period of free learning, I would not have earned those hundreds of thousands of yuan.

Because if I had learned nothing, luck would have been useless when it came.

In a word, without free learning, there would be no me today.

But similarly, without those years, I would not be struggling in this spider’s web now.

There are gains and losses. I have no regrets about this.

I have thought countless times, if I were given another chance, would I still do this? But often, I would reject the choice of going to school. My aversion to it is bone-deep, absolutely unchangeable. So every time I imagine it, I would still make the same choice.

But since I have accepted this premise, I must bear its consequences—for income, for my dreams, for my freedom, I need to first limit my freedom, go back to the past, and continue on.

I want to take the postgraduate entrance exam.

But I am only a junior high school graduate.

For that future, I can only bow my head for now.

But my back will always be straight. Bowing my head for a moment absolutely does not mean bowing my head forever.

The Ladder

Just a small gap around the first-tier university admission line can lead to a huge income gap. A diploma itself has a premium. “Upon graduation, the gap between the two is 6%; five years after graduation, the gap widens to 24%; ten years later, it expands to 34%.”[20]

Different education levels alone produce a huge gap in starting salaries. “In 2021, the arithmetic mean of monthly starting salaries for doctoral, master’s, bachelor’s, and junior college graduates were 14,823 yuan, 10,113 yuan, 5,825 yuan, and 3,910 yuan respectively.” And for the same education level, different schools also matter. “Taking bachelor’s graduates as an example, the average starting salary for graduates from 985 universities is 4,752 yuan, for 211 universities it is 3,908 yuan, and for general universities it is 2,913 yuan.”[21]
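To make the quoted figures concrete, here is a small sketch (the salary numbers are copied from the survey quoted above; the ratios are my own arithmetic, not from the source):

```python
# Monthly starting salaries by degree, from the 2021 mainland survey quoted above (yuan).
salaries = {
    "doctoral": 14823,
    "master's": 10113,
    "bachelor's": 5825,
    "junior college": 3910,
}

# Express each degree's starting salary as a multiple of the junior-college one.
base = salaries["junior college"]
for degree, pay in salaries.items():
    print(f"{degree}: {pay} yuan/month, {pay / base:.2f}x junior college")
```

Run as-is, this shows a doctoral graduate starting at roughly 3.8 times, and a master’s at roughly 2.6 times, the junior-college level: exactly the “ladder” this section’s heading refers to.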

Of course, the above conclusion only applies to its data, which come from mainland China. It can only show how mainland companies view diplomas. For example, data from across the strait show that junior college graduates earn more than university graduates, which may be related to the popularization of universities there, to the point that the degree no longer carries a premium. But the salary gaps between junior high/high school graduates, undergraduates, and postgraduates are still reflected in wages.[22][23]

This pattern does not appear in the United States, which is like the mainland in that education level directly determines the income gap.[24]

And in Norway, which is known for its high level of equality, the wage gap still widens with the increase in education level: “In 2021, the average monthly income for all industries was: NOK 38,010 for secondary education, NOK 47,630 for upper secondary education. Those with less than four years of higher education can expect an average of NOK 54,540. For those with more than four years, the average is NOK 68,300.”[25]

So the situation across the strait is more dependent on their own special economic structure. In other developed regions, the ladder-like gap of education-income is still predominant. Especially considering that as a mainlander, if you drop out, you most likely won’t be able to run.[26] Companies everywhere will still offer different salaries based on education level, and this is even more severe in the mainland, where they even look at your first degree when you apply for a master’s or doctorate.

Then the issue of a diploma will inevitably become a factor that must be considered in the future. I believe no one is unclear about how important income is.

The above does not say much about the difference between diplomas and abilities. This has actually always been a controversial topic: does education determine ability, or do more capable people get higher education? This also touches on the pain points of school education: is it purely a screening mechanism, or is it truly effective? I won’t comment too much on this, but if I follow the articles provided by Jarrett, at least the view in them is that “school education is useful, but its low efficiency leads to aversion, it is full of flaws, and the disadvantages outweigh the advantages.” If we accept this premise, then it means admitting that a good school education does not necessarily equal good ability, which basically means admitting that the income gap generated by diplomas and good schools is a “diploma premium.”

And if we admit that scores can measure ability, the results from Data-Emperor based on regression discontinuity basically prove that the income gap also has a diploma premium, not just an ability (score) premium. After all, is there really such a big difference in ability (score) between people who are a dozen points above and below the first-tier university admission line?

If you’re interested in a niche field like mine, social sciences, it’s tough even for those with a diploma, let alone those without. If you’re a programmer, it’s indeed much better, but you still need at least a junior college degree. However, to be fair, different surveys give very different proportions for junior college graduates, probably due to data deviation. Some even mention high school graduates. The quality of these surveys is frankly terrible.[27][28][29][30]

But in general, we can get a basic impression that the programming industry as a whole may not have such high requirements for diplomas.

But at the same time, there is another problem. These programmers may have jumped into the industry from training classes or self-study during the first half of the internet boom. What about today? Instead of looking at the education level of existing programmers, it’s better to look at what the education level of future programmers might be. Just from recruitment data, we can see that the internet industry is gradually closing its doors to those with low education levels. In the 2021 recruitment for core positions, only 8.5% were for high school graduates or had no education limit, and more than half required at least a bachelor’s degree.[31]

And believe me, I’ve applied to a bunch of companies that claimed to have “no education limit” recently, but it’s all nonsense. They say no limit, but if you actually apply, they don’t even give you the time of day. How “unlimited” these “no education limit” positions are is itself questionable. And for positions that require a junior college degree, it’s unknown how many talents laid off from big tech companies, with bachelor’s degrees or higher, will be competing with you.

Conclusion

So, what is the conclusion?

If you really can’t stand it, perhaps holding out until you get a junior college degree would be a better choice. After all, it’s still higher education, and although the number is decreasing, companies will still choose junior college graduates more often. And on the other hand, future paths like taking the postgraduate entrance exam or the civil service exam will also be smoother. I do not recommend quitting in junior high like I did, only to have to go back and walk the same path again now.

Of course, everything I’ve said is just to show “what happens after Nora leaves.” If you think that replicating my free learning experience would be more beneficial than harmful for you, then I have no standing to stop you. After all, I walked this path myself, and as I said, I don’t regret it. It’s just that if you really make this choice, you must be mentally prepared to give up certain things as well.

“We have no right to persuade people to make sacrifices, nor do we have the right to prevent people from making sacrifices. Besides, there are plenty of people in the world who are happy to make sacrifices, happy to suffer.”

You will indeed reap the rewards of free learning, but it also means that compared to those who came out of the school system, you will face much, much greater difficulties. At the same time, uncontrollable factors will crowd your future. I was supported by my mom for five or six years before I could earn money (and that earning was mostly an accident, due to luck) and still wasn’t kicked out of the house. What about you? Perhaps for some, they are happy to make sacrifices. But for others, can they do it?

Perhaps among those reading this article are people who cannot find joy in such a struggle. Once they leave school, it means a complete break with their families, let alone any support. The complete lack of income during the period of free learning means you must have a source of financial support. Who will support you is a question you will have to face. Otherwise, working in a factory will likely become your only choice, and all your free learning plans will be completely interrupted. In this way, the process of free learning becomes suffering, and every moment you will face pain from society and your family. I think this kind of suffering is probably not what they wanted to endure in the first place.

“Dreams are good; otherwise, money is essential.”

In today’s society, self-improvement and social recognition are separate. Everything is built on a certain foundation, and when this foundation is not met, no matter how much you improve yourself, you can’t even reach the threshold, and all doors will be closed to you. “I can prove myself with my excellent abilities”—but at the same level, you may have to put in tens of times the effort to gain enough favor, while for some, this is just the starting point.

I have very low self-esteem, so low that I can’t even publish this article under my own name. I’m afraid to let others know that I’m just a junior high school graduate, because I know very well that, including myself, we all seem to have an implicit bias against junior high school graduates. This bias is subconscious, a product of social construction. Even if it can be suppressed, sometimes, if you’re told this person is only a junior high school graduate, even I can’t help but wonder why they didn’t continue:

“How could they not even get into a junior college? The admission score is only 200 points. Can’t even get that score? It’s something you can achieve with a little effort. Did they not listen in class at all, not put in any effort, just messing around? Is there something wrong with their character, are they...”

I don’t know if such thoughts pop into your mind subconsciously. Although they are usually instantly suppressed and refuted by my reason, this thought disgusts me and is hard to ignore. Even I, a junior high school graduate myself, think this way. What about others? Will they think this way? Will their reason suppress it like mine does?

So please forgive my anonymity. I’m just too scared.

I can only ask Jarrett to help me publish this on my behalf. For this, I offer him my heartfelt thanks and respect.

Jarrett once completed his free learning while still in the classroom and has come this far today. I believe he has a lot of valuable experience to share and has also published articles on it. As someone who has been through it, his experience is definitely worth referencing for those who want to balance both.

I hope that one day we will no longer have to endure a compulsory school system that stifles people’s growth and nature. I believe that this day will come, driven by the step-by-step practical actions of everyone who holds this idea in their hearts.

I am not an accomplished person. To this day, with this level of education, all I’ve done is learn some things I’m interested in, have some of my own views and ideas, and am moving in a certain direction. I’m not even a success story of free learning; after all, I haven’t published in top journals like Jarrett, and I’m even struggling to find a job.

What I’m sharing with you here is just some of my complaints, the despair of a failure, some of the dark aspects of free learning. Or rather, it’s a manifestation of the glorious ideal of free learning in today’s dark society.

But, although free learning is not so ideal, not so smooth, and even full of thorns and obstacles at present, to the point that it seems naively idealistic to many—I love it so much.

I sincerely wish that everyone who reads this article can walk down the path they love, and walk far enough.

I hope you always maintain your passion, are never assimilated by this society, and always maintain free learning.

And one day, completely change this terrible society.

 

  1. ^

    Free learning - supermemo.guru

  2. ^

    Problem of Schooling - supermemo.guru

  3. ^

    What Happens After Nora Leaves Home? https://www.marxists.org/chinese/reference-books/luxun/01/018.htm

  4. ^

    了转反 (le zhuǎn fǎn): An internet slang term created by reversing the characters of "反转了" (fǎn zhuǎn le), which means "it has reversed" or "plot twist." It's a playful way to point out an unexpected reversal of opinion or situation. "Doing a 180" captures the meaning well.

  5. ^

    背刺 (bèicì): Literally "backstab." A common slang term, often borrowed from gaming culture, meaning to betray someone.

  6. ^

    家里蹲 (jiālǐ dūn): Literally "squatting at home." This is a common term for a shut-in or homebody, similar in meaning to the Japanese term hikikomori.

  7. ^

    Debuff: A term borrowed from video games, meaning a negative status effect that weakens a character. Here, the author humorously describes his hatred for exams as a debuff affecting his performance.

  8. ^

    Zhihu (知乎): A popular Chinese question-and-answer website, similar to Quora. It was known in its earlier days for high-quality, long-form answers from experts in various fields.

  9. ^

    抖机灵 (dǒu jīling): Literally "shaking out cleverness." It refers to making witty but often shallow jokes or one-liners.

  10. ^

    程序猿 (chéngxù yuán): A pun on 程序员 (chéngxùyuán), meaning "programmer." The character 猿 (yuán, ape) is a homophone for 员 (yuán, person/staff). It's a self-deprecating term similar to "coder monkey" in English.

  11. ^

    Baidu-oriented programming (面向百度编程): Baidu is China's largest search engine. This is a humorous admission that coders often rely heavily on searching for solutions online, a parallel to the English phrase "Stack Overflow-driven development."

  12. ^

    太监 (tàijiàn) / 进了宫 (jìnle gōng): "Eunuch" / "entered the palace." Popular slang in the web novel community for abandoning a story midway. A "eunuch" novel is one that has been "castrated," meaning it will have no conclusion.

  13. ^

    Qidian (起点): One of the largest and most influential web novel platforms in China.

  14. ^

    蚌埠住 (bèngbù zhù): A popular internet meme. It's a pun on 绷不住 (bēng bu zhù), which means "can't hold it in anymore" or "to lose one's composure." The city of Bengbu (蚌埠) is used for the sound play.

  15. ^

    开盒 (kāi hé): Literally "opening a box." Recent internet slang for doxxing, i.e., revealing someone's private information online.

  16. ^

    打稻草人 (dǎ dàocǎorén): "Hitting a strawman." This translates directly to the English logical fallacy of a "strawman argument."

  17. ^

    对线 (duìxiàn): Literally "laning." This term originates from MOBA games like League of Legends, where players face off against an opponent in a specific "lane." In internet slang, it means engaging in a direct argument or debate with someone online, often in a confrontational manner.

  18. ^

    985/211 Universities: These numbers refer to "Project 985" and "Project 211," government initiatives to fund and create world-class universities in China. They represent the top tier of higher education.

  19. ^

    Take a 40-year shortcut (少走40年弯路): A popular ironic phrase. It means skipping a typical career path and ending up directly in a low-skill, often retirement-age job, thus "saving" 40 years of struggle.

  20. ^

    I'm a high school student. Why do so many people online say there's no difference between a regular Tier 1 and Tier 2 university? Then why should I work hard to get into a Tier 1? - Answer by chenqin https://www.zhihu.com/question/391907816/answer/1308068416

  21. ^

    2021 University Graduate Employment Salaries: The arithmetic mean of monthly starting salaries for master's and bachelor's degrees are 10,113 yuan and 5,825 yuan respectively https://www.sohu.com/a/518672361_120950203

  22. ^

    Is Education Proportional to Income? University Graduates' Income Lower than Junior College for 5 Consecutive Years https://www.thenewslens.com/article/105131/fullpage

  23. ^

    110th Year Family Income and Expenditure Survey Report, p. 41. (Note: The 110th year refers to the Republic of China calendar, corresponding to 2021.) https://ws.dgbas.gov.tw/001/Upload/466/ebook/ebook_196214/index.html

  24. ^

    Here's the Average American's Income by Education Level https://www.fool.com/the-ascent/personal-finance/articles/heres-the-average-americans-income-by-education-level/

  25. ^

    Salaries in Norway: The Facts & Figures https://www.lifeinnorway.net/salaries-in-norway/

  26. ^

    润 (rùn): This character sounds like the English word "run." It is popular internet slang for emigrating or "running away" from China.

  27. ^

    First Half of 2015 Chinese Programmer Survey Report: Nearly 30% have a high school education http://www.techweb.com.cn/data/2015-08-03/2184316.shtml

  28. ^

    2020 Analysis of the State of Chinese Developers (with distributions of age, region, field, monthly salary, education, programming languages, and operating systems used) https://www.chyxx.com/industry/202110/980817.html

  29. ^

    Programmer Inn Statistics: 2021 Chinese Programmer Salary and Life Status Survey Report https://juejin.cn/post/6937229675773034533

  30. ^

    "2019 National Internet Industry Programmer Employment Report": A look at the main force of the IT industry https://zhuanlan.zhihu.com/p/88711873

  31. ^

    Zhaopin.com: 2021 Internet Industry Job Search Guide (21 pages).pdf https://www.sgpjbg.com/baogao/74978.html




A prayer for engaging in conflict

November 4, 2025 - 11:19
Published on November 4, 2025 8:19 AM GMT

Crosspost from my blog.

Let these always be remembered: those who suffer, those who experience injustice, those who are silenced, those who are dispossessed, those who are aggressed upon, those who lose what they love, and those whose thriving is thwarted.

May I not let hate into my heart;
and May I not let my care for the aggressor prevent me from protecting what I love.

May I always reach out a hand in peace;
and May I never hold it out as they sever my wrist.

May I seek symmetry, to take synchronized steps back from the brink;
and May I not pretend symmetry where there is none.

May I forgive when I expect forgiveness in return;
and May I not forgive when I do not expect forgiveness in return.

When there is time to say all that needs to be said, May I recount and denounce the crimes of my side;
when there is not time, May I not be a prop in a libelous morality play.

May I fulfill my moral obligations;
and May I not give in to threats that enforce double standards.

May I present my case and my cause for scrutiny by the just;
and May I not turn out my pockets and prostrate myself for anyone who accuses me.

May I justify my actions;
and May I not justify my rights.

May I have patience as they prepare themselves to make peace;
and May I not wait for them to grant me my rights.

When there is space to move, May I always separate the innocent from the aggressor;
when there is not space, May I never allow the aggressor to attack with impunity by hiding behind the innocent.

May I bargain for peace, justice, and fairness;
and May I not yield my place.

May I not demand that they believe everything I believe;
and May I not participate in their delusion.

May I not forget the hope of peace;
and if they try to destroy me then May I destroy them.

May I take on what seems to me unfairly much of the burden of practical peacemaking;
and May I not admit any excess blame, fault, guilt, debt, or responsibility.

When there is energy that doesn't need to be spent on conflict, May I break down hatred and build up calmness and peace within my side;
when there is not energy, May I unite with my side against the enemy.

Let these always be remembered: those who suffer, those who experience injustice, those who are silenced, those who are dispossessed, those who are aggressed upon, those who lose what they love, and those whose thriving is thwarted.

Blessed are you, the name, our Eloahs, ruler of the universe, who sanctifies those who speak the truth of the heavens.

.בָּרוּךְ אַתָּה ה', אֱלֹהֵינוּ מֶלֶךְ הָעוֹלָם, מְקַדֵּשׁ אֶת מְדַבְּרֵי אֱמֶת הַשָּׁמַיִם

Blessed are you, the name, our Eloahs, ruler of the universe, who protects those who protect the vulnerable.

.בָּרוּךְ אַתָּה ה', אֱלֹהֵינוּ מֶלֶךְ הָעוֹלָם, שׁוֹמֵר שׁוֹמְרֵי הַחַלָּשִׁים

Blessed are you, the name, our Eloahs, ruler of the universe, who loves those who pursue peace.

.בָּרוּךְ אַתָּה ה', אֱלֹהֵינוּ מֶלֶךְ הָעוֹלָם, אוֹהֵב רוֹדְפֵי שָׁלוֹם




Rainbows, fractals, and crumpled paper: Hölder continuity

November 4, 2025 - 11:01
Published on November 4, 2025 8:01 AM GMT

One of my favorite websites is allRGB. It's a collection of images which each contain every 24-bit RGB color exactly once. Most of them (though not all) are square 4096-by-4096 images. (In general, we can imagine doing this with (n^2)-level color for any (n), producing (n^3\times n^3) images; allRGB is the case (n=16).)

People take that prompt in a ton of different directions; many of the images use clever dithering tricks to simulate a smaller color palette, others arrange their pixels into tiny regions of similar color, etc. But I think my favorites are the ones that attempt to arrange colors as smoothly as possible, like Order from Chaos or Smooth.

(Many of the images look smoother than they are, because of small-scale dithering or stripes or similar -- you have to zoom in to see the actual grain.)

How smooth can these images get? To be more precise: what's the smallest (C) such that there exists a bijection (f: [n^3]^2 \to [n^2]^3) for which (|f(x)-f(y)| \le C) for all (x,y) such that (|x-y|=1)?

(Or, probably better: what is the smallest (C) such that (|f(x)-f(y)| \le C|x-y|) for all (x,y)? This is equivalent if we use the Manhattan metric and is the discrete version of Lipschitz continuity.)

We can also think of this problem in a different guise: given a (discretized) square of paper, we want to crumple it up into a (discretized) cube such that it fills the cube uniformly and is stretched out as little as possible in the process.

If we use Euclidean distance in both image space and color space, then we can rule out (C=1) with some casework (which I have misplaced, so you'll have to trust me). The allRGB image called "Smooth" achieves (C=2) in a way that looks likely to generalize, so the only remaining question is whether (C=\sqrt 2) is possible.

Of course, this problem also generalizes to maps between (a)- and (b)-dimensional space, i.e. ([n^a]^b \to [n^b]^a), for any (a > b). (We can also ask about (a < b), which includes the case of trying to make allRGB images smooth in the inverse sense -- keeping similar colors as close together as possible. It's not too hard to show that in this case we have to accept pretty large discontinuities.)

Starting with the simplest nontrivial case, ((a,b) = (2,1)), we are presented with the challenge of trying to smoothly biject a line onto a square. This is extremely easy in the discrete setting we're working in: just weave back and forth in rows, boustrophedon-style.
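As a minimal sketch (the function name and details are mine, not from the original), the boustrophedon map from a 1D index to an n-by-n grid:

```python
def boustrophedon(i, n):
    """Map index i in [0, n*n) to a cell of an n-by-n grid,
    sweeping left-to-right on even rows and right-to-left on odd rows."""
    row, col = divmod(i, n)
    if row % 2 == 1:          # odd rows run backwards
        col = n - 1 - col
    return (row, col)
```

Consecutive indices always land on Manhattan-adjacent cells, so in the discrete setting this achieves the best possible (C = 1).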

But it feels like something's missing from this approach -- or even that of "Smooth" -- compared to, say, "Order from Chaos". You can imagine taking (n\to\infty) and turning the latter into a nice continuous map from ([0, 1]^2 \to [0, 1]^3) by progressively adding more detail; if you try that with the alternating-rows map, it fails to converge. If you try it with "Smooth", the (8\times 8) grid of lines turns into an (\frac n2 \times \frac n2) grid, which also fails to converge.

As it turns out there's a very beautiful solution to this for the ((2,1)) case, called the Hilbert curve. It's an example of a space-filling curve, that is, a continuous, surjective map from the interval to the square. In other words, it's a fractal curve with fractal dimension 2. (It's not injective, and no space-filling curve can be injective, but I think it only fails this mildly; in particular there's a finite number of preimages for each point on the square, unless I'm mistaken.) xkcd famously used this to map out IP address space graphically.
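The curve's recursive structure makes it cheap to compute directly. A sketch of the standard iterative bit-manipulation formulation (the name `hilbert_d2xy` is mine), converting a distance d along an order-k curve into a cell of the 2^k-by-2^k grid:

```python
def hilbert_d2xy(order, d):
    """Map distance d along an order-`order` Hilbert curve
    to an (x, y) cell of the 2**order by 2**order grid."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:           # reflect/rotate the sub-square as needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx           # step into the correct quadrant
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

Consecutive distances land on Manhattan-adjacent cells, which is exactly the locality property the curve is prized for.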

How continuous are these maps? The obvious generalization of our definition of smoothness above gives Lipschitz continuity, i.e. (|f(x)-f(y)| \le C |x-y|) for some (C); this would mean that our map "stretches out" the interval by a finite amount (C). But (exercise for the reader) this is impossible.

On the other hand, continuity alone is pretty weak; we can do better.

The Hilbert curve has the nice property that an interval of length (r) of the unit interval gets mapped to a reasonably compact region of area (r) in the square, which has a diameter on the order of (\sqrt r). This implies the property (|f(x)-f(y)| \le C |x-y|^{1/2}), which is called Hölder continuity (with exponent 1/2).

Can we use space-filling curves to construct Hölder-continuous versions of our allRGB images? Not directly. You can chain together Hilbert-curve maps to go down from 2 dimensions to 1 and then back up to 3, but the gorgeous result is discontinuous everywhere. (You can also do this with a Z-order curve, which has a particularly simple algorithm -- just interleave and deinterleave the bits of your coordinates to get your color components.)
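A minimal sketch of that Z-order trick (helper names are mine): interleave a pixel's 2D coordinates into one 24-bit index, then split that index three ways to get color components.

```python
def morton_encode(x, y, bits=12):
    """Interleave the bits of (x, y) into one Z-order index."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def deinterleave3(z, bits=8):
    """Split index z into three components by taking every third bit."""
    r = g = b = 0
    for i in range(bits):
        r |= ((z >> (3 * i)) & 1) << i
        g |= ((z >> (3 * i + 1)) & 1) << i
        b |= ((z >> (3 * i + 2)) & 1) << i
    return r, g, b

def pixel_to_color(x, y):
    """4096x4096 pixel -> unique 24-bit (r, g, b); a bijection by construction."""
    return deinterleave3(morton_encode(x, y, bits=12), bits=8)
```

Both steps are bijections on bit-strings, so every pixel gets a distinct color -- but, as the post notes, nearby pixels can get wildly different colors, so the result is nowhere continuous.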

As it turns out, this question has been asked before! The accepted answer links to two great papers on the subject. The first, by R. Stong, uses a clever fractal construction to solve the problem for (\mathbb Z^a \to \mathbb Z^b). The second shows (very nonconstructively) that this implies a map (\mathbb R^a \to \mathbb R^b); I think you can make this much more constructive, though, by taking advantage of the fractal nature of Stong's construction. (In the ((3,2)) case, we construct a fractal curve with fractal dimension (3/2) in the plane, then map two of these Lipschitz-continuously to 3D space.)

Unfortunately, as far as I can tell, this construction on (\mathbb R^a \to \mathbb R^b) does not restrict to a bijection between hypercubes; it zigzags around too much to be able to cut out a contiguous chunk like that. So some version of this problem remains open.




On tasting things

November 4, 2025 - 10:47
Published on November 4, 2025 7:47 AM GMT

(I'm doing Inkhaven and have two actually important/potentially impactful posts coming up, and really want to polish both a bit more before publishing, so I wrote this quickly to have a thing to post instead. Apologies.)

Ever since I tried meditation, I love food.

Around April 2022, a friend convinced me to try Sam Harris' Waking Up app.

Meditation made me actually pay attention to my experiences; and the experience of food suddenly became much higher-dimensional, once I started paying attention.

By October 2022, I had been to a dozen Michelin-starred restaurants.[1]

At some point, I participated in a chocolate tasting hosted by Duncan Sabien. Before it, I couldn't stand dark chocolate. Afterwards, I could no longer eat milk chocolate: it's just too bad compared to good dark chocolate.

The really surprising thing about the tasting wasn't that if I focused on the taste of chocolate, I could map it--maybe internally combined with the difference from some central chocolate-taste--to some ideas, feelings, concepts, objects, moods.

The surprising thing was that people independently (without groupthink, in separate groups!) came up with the same comparisons.

Pure dark chocolate: cocoa and sugar.

That tastes like strawberries. According to multiple people who tried it and came up with this comparison before sharing it with their groups, before sharing it with everyone.

Or pure dark chocolate: cocoa and sugar.

That tastes like a muddy Amazon river.

A muddy Amazon river.

It was, I think, my fifth chocolate of the tasting. I thought about what the taste was like, and came up with the imagery of a strong, muddy Amazon river.

Other people came up with the same imagery.

This was very surprising. Chocolate shouldn't taste like muddy rivers, to different people, was my assumption a day earlier. I assumed all of the tasting notes on wines were bullshit; this is not a thing that can happen. Sommeliers are fake; I vaguely recall studies showing that they can't distinguish white wine dyed red from actual red wine.

Strawberries are okay: fine, I can imagine that the difference in embeddings between this specific chocolate and some central chocolate is similar to the embeddings of the taste of strawberries.[2]

But why would it taste to different people like a muddy Amazon river? This is insane. What's going on.

My theory is that paying attention to good food is like experiencing embeddings. Like a spectrogram of different things. Or music.

Good food would often have an interesting spectrogram/embeddings: narrow and pointy in exact places that complement each other in interesting ways; or maybe roundy and still causing a feeling of comfort; or some combination of the two. There is a huge amount of dimensions of the experience, and there is maybe some amount of rotation that can be applied while preserving some properties of experience.

It's what it feels like, to eat good food. You can enjoy it the way you can enjoy an orchestral piece, feeling and understanding it as all the separate instruments and groups of instruments, enjoying how they come together and play with each other, and on a separate level, understanding and feeling the resulting sound as a whole, in the moment, and separately again, tracking and understanding the entire piece, from start to finish.

Three of these define how I listen to my favourite music: the entire progression and emotion and growth of some pieces; the sounds as I perceive them; and then, simultaneously, all the small things I can identify and separate and understand as instruments, and enjoying them as individual components beautifully contributing to the whole.

Great food is somewhat like that: there are the individual components you can identify and pay attention to, while enjoying the whole; and sometimes, though less centrally, there's also a change in them as you progress, but you rarely think of this progression and mostly just enjoy experiencing the embeddings/the spectrogram moment by moment.

While I'm at it: I often tell people that yFood[3] is better than the bottom 10% of meals I've tried in Michelin-starred restaurants. (And I select meals mainly by how much I expect to enjoy them, though occasionally that includes interestingness or exploration value.)

They don't normally believe me.

See, ready-to-drink Huel is a random mess of embeddings. It's fine, it's convenient, it's not that terrible, but it's not actively good and its taste is not particularly uniquely enjoyable.

yFood is actively good. It tastes like an incredibly good milkshake. Somehow, it has a nice, soft and somewhat gathered taste. I don't know how they did it, but I like it, a lot, and so most of my breakfasts are yFood.

But what's going on with chocolate?

It doesn't have a variety of components[4]. I guess the chemicals are complicated; I don't have time to do a deep research query and write up my own deconfusion; I can only share my confusions and experiences.

If you have ideas or independent theories of qualia that would've predicted this, please share.

(Thinking about it does somewhat update me toward at least some people experiencing red the same way, or at least having embeddings for things equivalent once you adjust for rotational/other symmetries.)

  1. ^

    I was lucky to have had a lot of money at some point in my life, though I feel somewhat guilty about having spent some of it on fancy food instead of donating, even though I've donated significantly more to MIRI than I spent on fancy restaurants.

  2. ^

    Even though this still makes no sense: what would that correspond to, in reality? Pure chocolate, chemically, is nothing like pure chocolate + strawberries.

    And why does asking yourself what kind of ship, or what mood of a sea or a forest, a piece of chocolate is like work at all?

  3. ^

    It's like Huel--nutritionally complete--but contains milk. If you're around the Bay, you can try some from me while I'm at Lighthaven, but you can't easily buy it outside the UK/EU. Apologies to people in the US.

  4. ^

    (Some good dark chocolate can have other stuff added to it.)




More notes on US govt whistleblower guide and DB

November 4, 2025 - 10:30
Published on November 4, 2025 7:30 AM GMT

2025-11-04

Incomplete resource

This resource is incomplete because I haven't studied opsec mistakes of every single previous whistleblower in detail. You may have noticed some sections marked as "to do" in the DB.

I don't think studying those past cases changes the advice for future whistleblowers much, hence I didn't prioritise studying it. If you want me to work on it though, let me know (or even better, pay me).

Theory of change, Redteaming

I made a redteaming and theory of change document some months back, but some parts ("project 2" especially) are outdated.

Theory of change, Redteaming

Distribution

I have messaged over 1000 employees across AI labs (DeepMind, Anthropic, OpenAI, xAI, Meta Superintelligence Labs) using cold emails and Twitter cold DMs. Twitter has the laxest policy on spam; email and LinkedIn DMs are stricter. The main metric spam filters track is, among the people who already received your message, the percentage who marked it as spam, ignored it, or replied. I did not track open rates.

Exact message I use

Please find attached my guide on how to safely whistleblow against Anthropic.

I assume that as we get closer to ASI, Anthropic will work directly with the US govt to protect its information.

Information in the whistleblower guide is backed by empirical evidence from the whistleblower database.

Whistleblower guide https://samuelshadrach.com/raw/text_english_html/my_research/us_govt_whistleblower_guide.html https://web.archive.org/web/20251102101210/https://samuelshadrach.com/raw/text_english_html/my_research/us_govt_whistleblower_guide.html

Whistleblower database https://samuelshadrach.com/raw/text_english_html/my_research/us_govt_whistleblower_database.html https://web.archive.org/web/20251102101540/https://samuelshadrach.com/raw/text_english_html/my_research/us_govt_whistleblower_database.html

Here is my secure PGP-encrypted email if you would like to discuss further. https://samuelshadrach.com/raw/text_english_html/connect_with_me/contact_me_secure.html

Please check Wayback machine or Commoncrawl for archives in case these links ever stop working.

  • Samuel Shadrach
Help requested

(There are legal, financial, reputational risks associated with helping me on this in any way, and I won't be discussing those risks on a public forum.)

  1. Guide lacks legal advice because I lack legal background. If you can add legal advice to this guide, that would be a huge help.
  2. If you can host this guide online permanently and accept risks of doing so, that would be a huge help. I can't promise to host this resource forever. If you do host it permanently, also work on SEO so that it shows up on search results when a potential whistleblower googles this topic.
  3. If you can distribute this guide to people at the AI labs, that would be a huge help.
  4. Donations couldn't hurt. Monero address is on my website. I worked on this using my own savings.
  5. Generic feedback always helps. I'm more interested in feedback on how to make the guide better, and less interested in feedback on whether such a guide is a good idea. (I have multiple times tried discussing AI pause and US politics on lesswrong and gotten downvoted with no replies. I have better things to do with my time.)



US Govt Whistleblower Guide

November 4, 2025 - 10:22
Published on November 4, 2025 7:22 AM GMT

2025-10-28

Disclaimer

  • Incomplete. Work-in-progress.

Why this guide?

  • I continue to think there isn't a single whistleblower guide on the internet that's good enough for this scenario. Some guides avoid talking about important details due to chilling effects. Other guides prioritise the interests of journalists or lawyers.

Summary of the guide

  • If you are not leaking US classified information but only an overview of the situation based on your own word, your best choice is probably coming out publicly in the US with a legal defence and requesting donations to fund it.
    • Why?
      • Historically, a majority of such people did not end up in prison.
  • If you are leaking US classified information, your best choice is probably flying to Russia like Snowden did, obtaining asylum and then coming out publicly as a whistleblower. Your best choice is probably not improving your opsec and hoping to stay anonymous in the US.
    • Why is this the best plan?
      • The sysadmins maintain logs of username, timestamp and document id of every document downloaded from central DB to client machines. This is true for the NSA and is likely also true for all major US AI labs.
      • Almost every person who stayed in a country within US sphere of influence after leaking classified info has been imprisoned.
      • All the opsec recommendations mentioned in this guide are primarily to buy you time. You will likely still be identified eventually, for the above mentioned reason.
    • Expected positive outcomes
      • Some previous leaks (such as Snowden's leak) have led to millions of people being convinced of a problem, when they were not otherwise as convinced. One person presenting empirical proof of a problem in the present is more convincing to the public than many people presenting only speculations and predictions.
      • Most leaks did not by themselves lead to significant political change. Once the machinery of the US govt is pointed in a certain direction, changing this direction is hard and takes time even if millions of people are willing to vote for change.
    • Why leak classified information, if it comes with additional risks?
      • Only a few people have the courage to become a whistleblower against a govt. And getting political change is hard even after whistleblowers provide undeniable proof of a problem. If you are one of these people, you occupy a world-historically important position. I would recommend you leak more information rather than less, so that the public gets undeniable empirical proof of the problem you are pointing at.
    • How? (Security Mindset)
      • Security mindset is hard to quickly convey. (I don't yet have good resources for this.)
      • You should be familiar with concepts like bits of anonymity and security through obscurity. Every word, expression and action reduces bits of anonymity, as long as there's a physical trail, a digital trail or a person who observed it. Example of an action that reduces bits of anonymity: Leaving your house sparkling clean when you otherwise leave it somewhat messy.
      • You should be aware that law enforcement has also read all the guides you're reading, including this one.
      • You should probably avoid thinking of ad-hoc methods and stick to tried-and-tested methods instead.
      • The reason you might succeed at this plan is not because you're more intelligent or knowledgeable than law enforcement; it's because of physics/engineering constraints that make whistleblowing easier than catching whistleblowers. Assume by default that they're more intelligent and knowledgeable than you.
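The "bits of anonymity" idea above can be made concrete with simple arithmetic: an observation that narrows the pool of possible sources from N people to M people leaks log2(N/M) bits, and successive observations compound. A minimal sketch; the pool sizes are made-up illustrative numbers, not real figures:

```python
import math

def bits_leaked(pool_before: int, pool_after: int) -> float:
    """Bits of anonymity lost when one observation narrows the
    pool of possible sources from pool_before to pool_after people."""
    return math.log2(pool_before / pool_after)

# Hypothetical numbers for illustration only.
pool = 1_000_000                                 # everyone with a clearance
steps = [
    ("accessed this document (DB logs)", 500),
    ("on shift the week it was copied", 50),
    ("printed it (print logs)", 3),
]
total = 0.0
for observation, remaining in steps:
    leaked = bits_leaked(pool, remaining)
    total += leaked
    print(f"{observation}: -{leaked:.1f} bits, {remaining} candidates left")
    pool = remaining
print(f"total leaked: {total:.1f} bits")
```

Note how a handful of routine logs, each individually innocuous, can narrow a million candidates down to a few.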
    • How? (Mental health)
      • Do not contact a mental health practitioner.
      • Some resources that might help
        • Read The Secret Life of Secrets by Michael Slepian. Decide that you are morally correct in accepting the negative consequences on yourself and your loved ones. Accept that people around you will probably understand, but that there are no guarantees. Once your conscience is clear, the rest is just execution. Taking a month or two longer to reach a clear decision is better than botching execution due to mental health reasons.
        • Read about other cases from the whistleblower database. For Snowden, watch his Joe Rogan interview, read his book Permanent Record or watch the Citizenfour documentary. For Assange, watch his interview after being released from prison or read his old blog iq.org on the Internet Archive. For Manning, read her autobiographical book. And so on.
        • Read about activists working in your field of interest. If your field is AI risk, you could watch Kokotajlo's interview where he talks about giving up AI lab equity, or Suchir Balaji's family interviews.
    • How? (Methods)
      • Preliminary research
        • You should do all your preliminary research on a dedicated TAILS setup only. This is a separate computer dedicated to this purpose.
        • Do not create any accounts or write stuff to the internet. Only read content.
        • Do not use a mobile phone for whistleblowing-related work, all phones are insecure.
      • Acquiring documents safely.
        • You will likely have to smuggle an SD card multiple times: at your workplace, residence, airport, and so on. Remember that buildings may contain scanners that reliably detect SD cards.
        • Remember that your work computer that contains the files may log when a file is copied to external device or displayed on the monitor.
      • You should probably leave no digital trail.
        • You should probably redact documents yourself using GIMP on an airgapped TAILS setup, inspect the raw bytes for steganography and metadata, and create a single tarball of everything. Redacting audio/video correctly is hard, I would recommend sticking to plaintext and images if possible. .BMP is a good image format as it contains almost no metadata, allowing you to inspect the raw bytes more easily. Lower the image resolution to remove camera lens scratches.
        • If there is too much material to redact, my current recommendation is to not try to leak it. (This is a weak opinion. Do your own research, or wait for me to do mine.)
        • There is no safe way to erase a disk using a firmware or software tool. It is ideal to process data in RAM only using TAILS, and avoid using any disk. If you absolutely must use a disk, use a fresh SD card or an HDD. Do not use an SSD or a USB drive. This ensures you can physically shred it into small pieces using a hammer or power drill you already own. Do not leave behind a suspicious purchase record. Unfortunately you will have to boot TAILS on a USB drive, which is difficult to destroy.
        • I do not currently recommend building a faraday cage as that leaves behind a suspicious purchase record. I would recommend using no wireless connection, and using absence/presence of wired connection as a de-facto airgap.
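The .BMP suggestion above works because the format is simple enough to inspect by hand: a fixed 14-byte file header (the ASCII magic "BM", a declared file size, and the offset where pixel data begins), followed by a DIB header and raw pixels, so bytes hiding past the declared file size are easy to spot. A minimal sketch of such a check, which supplements rather than replaces a full byte-by-byte inspection:

```python
import struct

def inspect_bmp(data: bytes) -> dict:
    """Parse the fixed 14-byte BMP file header and flag any bytes
    trailing past the declared file size (one place metadata can hide)."""
    if data[:2] != b"BM":
        raise ValueError("not a BMP file")
    # Bytes 2-5: declared file size; 6-9: reserved; 10-13: pixel data offset.
    declared_size, _reserved, pixel_offset = struct.unpack_from("<III", data, 2)
    return {
        "declared_size": declared_size,
        "actual_size": len(data),
        "pixel_data_offset": pixel_offset,
        "trailing_bytes": len(data) - declared_size,  # should be 0
    }
```

A nonzero trailing_bytes, or a pixel data offset far past the end of the headers, deserves a closer look in a hex viewer.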
      • You should probably leave no unusual items in your physical trail.
        • This includes but is not limited to every item at your residence (electronic, paper, other), every purchase you make and every roadside camera you pass.
        • Generate as little physical trash as possible (electronic, paper, other), as there is no easy-to-use completely secure way of disposing trash that can't be connected back to you. Assume every garbage dump you visit will be thoroughly searched.
        • Assume your location is trackable at all times. Do not visit places you wouldn't have visited previous to your plan to whistleblow.
      • Trusted people
        • While in the US you should probably have zero people in-the-loop; once outside the US geopolitical sphere, you should probably have one lawyer and zero other people in-the-loop. "People" here includes immediate family members, psychiatrists, journalists, etc. You should probably trust zero people to help you commit the action, but trust a few people to support you after you have committed the action.
      • Leak verification
        • If you are leaking a document from a numbered database (such as a govt classified document), you can safely assume journalists at big media houses have contacts that can confirm the legitimacy of the document.
        • If your leak leads to a response from your organisation, such as cancelling a deal or firing someone or publicly denying it, this too proves legitimacy of your leak to journalists.
        • If neither of the above will be possible, then you may need to attach additional proof. You can look for an email you received from your org's official domain, download it along with DKIM headers and send it to the journalist as proof. Select an email that was addressed to many people not just you, so you can prove you work at the org without revealing your identity. Select an email that doesn't contain sensitive info as there may be logs indicating you downloaded the email.
        • I do not recommend using video footage of your org as proof, as making secret recordings and redacting videos are both hard, and hence increase the risk of you being caught.
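The DKIM idea above works because the DKIM-Signature header is signed by the org's mail server and can later be verified by anyone against the public key the org publishes in DNS. Pulling out the relevant headers can be done with Python's standard library; the addresses and domain below are hypothetical, the bh=/b= values are truncated placeholders, and actual cryptographic verification needs a separate tool (e.g. the third-party dkimpy package), which this sketch does not attempt:

```python
from email import message_from_string

# Hypothetical raw email; real bh= and b= tags carry base64 signature data.
RAW = """\
From: announcements@example-lab.com
To: all-staff@example-lab.com
Subject: Org-wide update
DKIM-Signature: v=1; a=rsa-sha256; d=example-lab.com; s=selector1;
 h=from:to:subject; bh=...; b=...
Date: Mon, 03 Nov 2025 09:00:00 +0000

Body text.
"""

msg = message_from_string(RAW)
# Parse the semicolon-separated tag=value pairs of the DKIM header.
tags = dict(
    part.strip().split("=", 1)
    for part in msg["DKIM-Signature"].replace("\n", " ").split(";")
    if "=" in part
)
print("signing domain (d=):", tags["d"])   # must match the org's real domain
print("addressed to:", msg["To"])          # many recipients, not just you
```

Checking that d= matches the org's primary domain and that the To: field shows a wide distribution list is what lets a journalist confirm employment without learning the source's identity.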
      • Sending to journalists
        • (I am yet to make up my mind on whether it is better to send documents to journalists before or after you leave the US. Sending documents after leaving the US is safer if you can successfully smuggle an SD card past airport security. Do your own research, or wait for me to do mine.)
        • If you redact documents yourself, you should ideally not require trusting any journalists with any sensitive info such as your identity.
        • You should probably send documents to as many journalists as possible, but trust none of them.
        • If you rely on journalists to publish the documents for you, there's some probability they'll help cover up mistakes you made while doing redaction. On the other hand there's also some probability they'll act against your interests or simply refuse to publish your documents. Predicting their behaviour is hard and I don't recommend trusting your predictions of how they'll behave.
        • Most SecureDrop servers provide journalist's PGP pubkeys. You should ideally manually PGP encrypt the tarball before you send it via any channel (be it securedrop or protonmail or something else).
        • I do not currently recommend uploading an encrypted tarball of the documents to the internet, with the intention of revealing your key at a later date. The only time your documents should touch an internet-connected computer is when they are being directly sent to a journalist.
      • Country of asylum, Lawyer
        • Russia has a good historical track record for this scenario. It is very important to make the right choice on which country you fly to. You may use a connecting flight through a third country to reduce suspicion.
        • It is important you are present in this final destination immediately after sending the documents; every day of delay makes a difference.
        • Once you're in the final destination, you should contact a lawyer. Until you have reached this step, almost no lawyer is likely to actively help you, as they will themselves be risking imprisonment if they do. Remember your lawyer will be a target of investigation just as you are.
        • Do not expect to be granted asylum by any country before you are physically present on their soil. There is almost no historical precedent for this, and you lack bargaining power.
        • Family visits might become possible once you have received asylum.
      • Advanced users only: Self-publish the documents
        • If you publish the documents yourself, you have to do redaction correctly. But you can guarantee publishing without trusting anyone.
        • You can send the documents to multiple social media sites that allow anonymous submissions over Tor.
        • You can acquire ETH anonymously and publish your tarball directly to ethereum blobdata. This ensures mirroring to multiple nuclear states. The same goes for purchasing BTC anonymously and publishing to bitcoin blockchain.
        • The best method to obtain BTC or ETH anonymously is to CPU-mine XMR, then swap to BTC or ETH using a trusted bridge. The second best method is to use some imperfect method like cash or gift cards to buy BTC or ETH, then swap to XMR via a trusted bridge for mixing purposes, then swap back to BTC or ETH via a trusted bridge. Either method should be done using TAILS only (not airgapped).
      • Failure
        • If you are approached by law enforcement, contact a lawyer and don't say anything. If you are approached, assume you are probably going to be imprisoned, because you are unlikely to be approached unless there is enough accumulated evidence to imprison you.
        • You will almost certainly be allowed family visits in prison.


Discuss

US Govt Whistleblower Database

November 4, 2025 - 10:20
Published on November 4, 2025 7:20 AM GMT

2025-10-17

Disclaimer

  • Incomplete
  • All information here is based on public record

What?

  • Collecting (mostly) fact-checked information on previous US govt whistleblowers.

Who?

  • Primary target audience: Potential whistleblowers in future. Especially focussed on those working at AI companies.

Why?

  • Might aid future whistleblowers, or people directly working with them, or people indirectly supporting them.
  • Unknown unknown reasons. I can't pre-emptively guess every way this resource might be used in future.
Database categories

Since there are many people who become sources, it is useful to categorise them. I have categorised them as follows based on intent, action and consequences.

categories based on intent

  • (categorising based on intended beneficiary not intended recipient)
  • whistleblowers
    • intended beneficiary: perceived public interest
    • intended recipient of info: usually public. sometimes specific people acting in public interest such as judges or congressmen.
  • leakers
    • intended beneficiary: perceived personal gain but not money (romantic interest, personal rivalry, social status, etc)
    • intended recipient of info: anyone
  • spies
    • intended beneficiary: perceived value-alignment with foreign govt or ideology, or money
    • intended recipient of info: intelligence service of foreign govt

categories based on action

  • leaked classified documents
  • did not leak classified documents, but may have leaked classified information
  • did not leak classified documents or information

categories based on consequences

  • was imprisoned
  • was not imprisoned
  • (not categorising based on other consequences such as social ostracism, financial loss, etc)
Some notes and disclaimers (about the database below)

About classification

  • Background info
    • US classification levels: CONFIDENTIAL < SECRET < TOP SECRET < TOP SECRET/SCI
    • As of 2022-09-30, public claim is that 1.35 million people have TOP SECRET security clearance.
    • TS/SCI indicates a compartment which only few named individuals can access, not everyone with a TOP SECRET security clearance. Different documents can belong to different compartments. A compartment can be as small as 10 people.
  • Conclusion
    • in many whistleblower cases, seems unclear if classification status was SECRET or TOP SECRET at time of leak

About key dates recorded

  • date of first transmission of a document to a second person (there could be multiple documents sent on different dates)
  • date of first public publication of a document (there could be multiple documents published on different dates)
  • date of arrest
  • date of public revealing of whistleblower's identity
  • date of being released from prison

About consequences on social circle

  • Typical consequences once identity is publicly out
    • Family members are interrogated, house raided and wiretapped.
    • Family members face significant legal expenses. Almost always, a defence fund is raised with donations from non-profits and the general public.
    • Family members are verbally harassed in-person and online.
    • Multiple people in extended social circle cut off contact. A common reason is to avoid being involved in the investigation.
    • Once imprisoned, prison visits are allowed for immediate family members.
    • Family members are not imprisoned.
  • Unless specified otherwise, it is IMO a reasonable assumption that all of the above consequences occurred in every single case of US govt whistleblowers/leakers who were imprisoned. There may or may not be documented proof for all of the consequences.
  • Usually there is more documented proof if the whistleblower chose to talk to journalists or the general public about the challenges they faced. Usually law enforcement or intelligence did not make information public against the will of the whistleblower.

About opsec mistakes

  • Remember that parallel construction of a chain of evidence is common; the evidence presented in court may not reflect how law enforcement first found out the act occurred. I have tried sticking to court evidence and not speculating too much beyond it.
  • Remember that evidence presented in court faces selection effects and framing effects due to the adversarial nature of a court trial.

About journalists

  • This list is not an endorsement of the values or capabilities of any specific journalists. It only provides historical fact-checked information.
  • Information recorded
    • Date, title, authors, media house of first publication
    • Link to original copy of first publication, or a mirror if possible
    • Whether publication contains original documents
    • Whether journalists and editors knew identity of the source
  • Some journalists have since quit the orgs they worked at at the time of the leak. Unless stated otherwise, I have specified the org they worked for at the time of the leak.
  • Some articles may have older edits or changed link URLs. The Internet Archive Wayback Machine is one possible place you can check for these.
  • In multiple cases there is public record of a journalist being informed of the source identity but no public record of the editor being informed.
    • Speculation by me (Samuel): It is highly likely that if a journalist knows the identity of a source, the editor will pressurise the journalist into sharing it with them.
    • Editor's reputation is affected if the journalist invents fake information claiming an anonymous source, and this sequence of events becomes public later. Editor's reputation is affected if the source's reputation is negative for the editor in some way (for example they're a criminal or spy), and the source's identity or reputation becomes public later.
    • As per AP policy as of 2025, a reporter/journalist must inform the editor of the identities of any sources.
    • As per Fox News policy as of 2025, no mandatory requirement for reporter/journalist to share the identity of a source with the editors.

About lawyers

  • This list is not an endorsement of the values or capabilities of any specific lawyer. It only provides historical fact-checked information.
  • Some minor details may be incorrect. I lack a formal legal background.
Category A: US govt whistleblowers/leakers who leaked classified documents and were not imprisoned

Category A (sorted by date)

  • Edward Joseph Snowden - still wanted for arrest
  • Daniel Ellsberg, Anthony (Tony) J. Russo Jr.
Classification status

Category A, classification status of leaked info (sorted by date)

  • Edward Joseph Snowden
    • 100,000-2,000,000 documents (exact number is not public record), many of which were TOP SECRET/SCI
  • Daniel Ellsberg, Anthony (Tony) J. Russo Jr.
    • TOP SECRET
Key dates

Category A, key dates (sorted by date)

  • Edward Joseph Snowden
    • first transmission 2013-01 to 2013-06 (date is not public record), flight from Hawaii US to Hong Kong 2013-05-10, first major transmission 2013-06-02, first publication 2013-06-05, public identity 2013-06-09, flight from Hong Kong to Moscow 2013-06-23, first asylum request made 2013-06-23, first asylum granted (by russia) 2013-08-01, citizenship granted (by russia) 2022-09-26
  • Daniel Ellsberg, Anthony (Tony) J. Russo Jr.
    • first involvement of an unauthorised person (Anthony (Tony) J. Russo Jr.) 1969-10-01, transmission to US senator 1969-10 (??? exact date unclear), first transmission to journalist 1971-03-02, first publication 1971-06-13, arrest 1971-06-28, released on bond 1971-06-28, case dismissed 1973-05-11
Social circle

Category A, consequences on social circle

  • Edward Joseph Snowden
    • Documented social circle at time of leak: Father, mother, (divorced multiple years before the leak), 1 sister, girlfriend (now wife as of 2025-06), no children (2 children as of 2025-06)
    • Documented consequences for social circle: house raid, interrogation, polygraph, wiretap, significant legal expenses, online harassment
    • Documented visits in asylum: Father visited on 2013-10-10, girlfriend (now wife) permanently shifted to moscow in 2014-07 (possibly 2014-07-15 ??? exact date not clear), multiple in-person visits by journalists and lawyers since 2013-06, multiple video calls by journalists. No public record indicating mother ever visited him after the leak. (??? seems unclear)
    • Misc
      • Snowden's family members working for US govt kept their jobs but with no further promotions.
      • Snowden had two children with his wife in Russia and they still live together in Russia as of 2025-07.
  • Daniel Ellsberg
    • Documented social circle at time of leak: Father, wife (married on 1970-08-08, during leak), siblings unknown (??? seems unclear), mother dead, ex-wife (divorced), 2 children from ex-wife (later had 1 child from wife)
    • Documented consequences for social circle: house raid, interrogation, wiretap, significant legal expenses, in-person harassment by FBI agents, cut off by extended circle
    • Documented visits: After case dismissal: Visited by children and step-children. Visited by multiple friends and anti-war activists. Visited by multiple journalists. Significant surveillance by FBI and NSA continued during this time frame.
    • Misc
      • Psychiatrist's office was (illegally) broken into under direction of Howard Hunt, CIA officer, to obtain evidence so that Ellsberg could be deemed mentally unfit for trial.
      • Daniel Ellsberg's father initially disowned him for this decision to leak the documents, but may have later changed his mind. (??? exact details not clear)
      • Daniel Ellsberg's son later claimed parents had strained marriage for many years and he had less contact with his father.
Opsec mistakes

Category A, opsec mistakes

  • todo
Journalism

Category A, journalists they worked with

Law

Category A, lawyers they worked with

  • Edward Joseph Snowden
    • lawyers for: No trial occurred, only asylum requests: Ben Wizner (US, ACLU Director), Jesselyn Radack (US, WHISPer Director), Robert Tibbo (Hong Kong), Jonathan Man (Hong Kong), Albert Ho Chun-yan (Hong Kong), Anatoly Kucherena (Russia), Plato Cacheris (US), Wolfgang Kaleck (Germany/EU), William Bourdon (France/EU), Marcel Bosonnet (Switzerland), Gonzalo Boye (Chile), Baltasar Garzón (Spain/Chile, Wikileaks international legal head), Halvard Helle (Norway), Emanuel Feinberg (Norway), other anonymous lawyer-advisors
    • lawyers against: No trial occurred: Neil H. MacBride, Eric H. Holder Jr.
      • Civil suit over book published: G. Zachary Terwilliger, Jody Hunt, Jeffrey Bossert Clark, Jeffrey A. Rosen, Lauren A. Wetzler, R. Trent McCotter
  • Daniel Ellsberg
    • lawyers for: Leonard Boudin, Charles Nesson, Leonard Weinglass
    • lawyers representing NYT: Daniel Sheehan, Floyd Abrams
    • lawyers against: David Nissen, Warren P. Reese, Richard J. Barry, Joseph L. Tauro, Erwin Nathaniel Griswold

Category A, public fundraising for legal defence

  • Edward Joseph Snowden - Yes (via Freedom of the Press Foundation)
  • Daniel Ellsberg - Yes
Misc

Category A, miscellaneous information

  • Edward Joseph Snowden
    • empty
  • Daniel Ellsberg
    • G Gordon Liddy, ex-FBI ex-Army, claims Howard Hunt (who broke into Ellsberg's psychiatrist's office) also planned to induce LSD overdose to deem Ellsberg mentally unfit.
    • Wiretap also performed without warrant. Judge dismissed case due to extensive illegal evidence gathering.
    • Robert L. Meyer, US attorney, was forced to resign for refusing to pursue the case against Daniel Ellsberg.
Category B: US govt whistleblowers/leakers who leaked classified documents and were imprisoned

Category B (sorted by date)

  • Jack Douglas Teixeira
  • Daniel Everette Hale
  • Reality Leigh Winner
  • Terry J Albury
  • Joshua Adam Schulte
  • James Hitselberger
  • Donald Sachtleben
  • Chelsea Elizabeth Manning
  • Shamai Kedem Leibowitz
  • Samuel Loring Morison
Classification status

Category B, classification status of leaked info (sorted by date)

  • Jack Douglas Teixeira
    • classified at time of leak
    • most documents TOP SECRET/SCI (TOP SECRET//HCS-P/SI-G/TK//NOFORN or Top Secret//SI//NOFORN//FISA ???), some documents SECRET//REL FVEY
    • remains classified as of 2025, US govt has confirmed authenticity of some documents
  • Daniel Everette Hale
    • classified at time of leak
    • some documents TOP SECRET (TOP SECRET//SI//NOFORN), other documents SECRET
    • remains classified as of 2025
  • Reality Leigh Winner
    • classified at time of leak
    • TOP SECRET//SI//ORCON/NOFORN
    • remains classified as of 2025
  • Terry J Albury
    • classified at time of leak
    • some documents SECRET, some documents CONFIDENTIAL, other documents unclassified, at time of leak (??? seems unclear)
    • remains classified as of 2025
  • Joshua Adam Schulte
    • classified at time of leak
    • some documents SECRET, some documents TOP SECRET or TOP SECRET/SCI (??? seems unclear), operational details of vault7 not leaked at all
    • remains classified as of 2025
  • James Hitselberger
    • classified at time of leak
    • SECRET
    • remains classified as of 2025
  • Donald Sachtleben
    • classified at time of leak
    • main documents TOP SECRET // SCI, other documents SECRET
    • remains classified as of 2025. Summarised details confirmed in press interviews.
  • Chelsea Elizabeth Manning
    • classified at time of leak
    • iraq war logs SECRET//NOFORN, guantanamo bay SECRET//NOFORN, collateral murder video SECRET, diplomatic cables CONFIDENTIAL or SECRET or TOP SECRET (??? seems unclear)
    • Remains classified as of 2025: iraq war logs, guantanamo bay documents, collateral murder video
    • Some redacted documents declassified as of 2025: Diplomatic cables
  • Shamai Kedem Leibowitz
    • classified at time of leak
    • SECRET
    • remains classified as of 2025
  • Samuel Loring Morison
    • classified at time of leak
    • TOP SECRET or SECRET (??? seems unclear)
    • some lower resolution photos similar to leaked photos declassified as of 2025
Key dates

Category B, key dates (sorted by date)

  • Jack Douglas Teixeira
    • first transmission to semi-public discord 2022-02, first transmission to journalist likely 2022-12 (discord server logs 2022-02 to 2022-12 not publicly available), publication to wide audience 2023-04-06, public identity 2023-04-13, arrest 2023-04-13, not released as of 2025-06
  • Daniel Everette Hale
    • first transmission 2014-05 (multiple messages sent, exact date of first message containing classified document is not public record), first publication 2015-10-15, arrest 2019-05-09, public identity 2019-05-09, released 2025-07-04
  • Reality Leigh Winner
    • first transmission 2017-05-09, arrest 2017-06-03, first publication 2017-06-05, public identity 2017-06-05, released 2021-06-02
  • Terry J Albury
    • first transmission 2016-02, first publication 2017-01-31, arrest 2018-03-28, public identity 2018-03-29, released 2020-11
  • Joshua Adam Schulte
    • first transmission 2016-04 (exact date not in public record), first publication 2017-03-07, arrest on allegedly unrelated charge 2017-08-24, public identity as a suspected whistleblower 2018-05-15, public identity as whistleblower confirmed 2018-06-18, not released as of 2025-06
  • James Hitselberger
    • first transmission 2012-04-11, no publication, arrest 2012-10-25, public identity 2012-10-25, released 2014-07
  • Donald Sachtleben
    • first transmission 2012-04-30, first publication 2012-05-07, arrest on allegedly unrelated charges 2012-05-11, indicted as whistleblower 2013-09-23, public identity 2013-09-23, released 2022 (??? exact date not clear)
  • Chelsea Elizabeth Manning
    • first transmission 2010-01 (as per chelsea's claims, 2010-02 is publicly documented), first publication 2010-02-18, arrest 2010-05-27, public identity 2010-06-07, released 2017-01-17
  • Shamai Kedem Leibowitz
    • first transmission 2009-01 (exact date unclear, may not be public record), first publication 2009-03-26, house raid 2009-04 (exact date unclear, may not be public record), final arrest 2009-12-17, public identity 2009-12-17, released 2012-01 (exact date unclear)
  • Samuel Loring Morison
    • first transmission 1984-07 (??? exact date within 1984-07 not clear), first publication 1984-08-07, arrest 1984-10-01, public identity 1984-10-01, released 1988 (??? exact date unclear)
Social circle

Category B, consequences on social circle (only done surface-level research so far)

  • Jack Douglas Teixeira
    • Documented social circle at the time of leak: Step-father, mother, biological father, 1 step-brother, 1 step-sister, 1 half-sister, girlfriend
    • Documented consequences for social circle: house raid, interrogation, online harassment of parents
    • Documented prison visits: no info available
    • Misc: gave TV interview while in prison
  • Daniel Everette Hale
    • Documented social circle at the time of leak: Father, mother, two younger siblings, no SO
    • Documented consequences for social circle: house raid, interrogation
    • Documented prison visits: no info available
  • Reality Leigh Winner
    • Documented social circle at the time of leak: Father, mother, 1 sister, boyfriend
    • Documented consequences for social circle: house raid, interrogation, wiretap, cut off by extended circle, significant legal expenses
    • Documented prison visits: multiple visits by mother, multiple phone calls
    • Misc: Mother faced panic attacks and depression. Sister withdrew from college for a semester. Parents faced difficulties retaining their jobs.
  • Terry J Albury
    • Documented social circle at the time of leak: Father, mother, siblings unknown, wife, 2 children
    • Documented consequences for social circle: house raid, interrogation
    • Documented prison visits: no info available
    • Misc: Wife and multiple friends remained supportive throughout the trial and prison sentence.
  • Joshua Adam Schulte
    • Documented social circle at the time of leak: Father, mother, 3 brothers, SO unknown
    • Documented consequences for social circle: house raid, interrogation, significant legal expenses, online harassment
    • Documented prison visits: some family visits; visits were restricted for 3 years after claims that he was attempting to release more info from prison
  • James Hitselberger
    • Documented social circle at the time of leak: Father, mother, no siblings, no SO
    • Documented consequences for social circle: house raid, interrogation
    • Documented prison visits: no info available
  • Donald Sachtleben
    • Documented social circle at the time of leak: no info available
    • Documented consequences for social circle: no info available
    • Documented prison visits: no info available
  • Chelsea Elizabeth Manning
    • Documented social circle at the time of leak: Father, mother, (divorced), 1 sister, boyfriend (breakup at same time)
    • Documented consequences for social circle: house raid, interrogation, wiretap, significant legal expenses, cut off by extended circle, online verbal harassment
    • Documented prison visits: Multiple visits by family and friends. Multiple letters received, although some were redacted. First visits were behind glass, later visits were regular.
    • Misc: UK govt cooperated with US govt to wiretap mother and aunt in Wales. Father lost job, mother lived in debt until sufficient donations were received for legal defence. Mother collapsed during hearing, faced multiple panic attacks and medical consequences. Father became depressed. Father's second marriage broke apart as well.
  • Shamai Kedem Leibowitz
    • Documented social circle at the time of leak: Father, mother, wife, children unknown
    • Documented consequences for social circle: house raid, interrogation
    • Documented prison visits: no info available
    • Misc: Used a public defender, not a private lawyer. Grandson of Yeshayahu Leibowitz. Likely received moral support from family and the broader Jewish community throughout trial and imprisonment.
  • Samuel Loring Morison
    • Documented social circle at the time of leak: Father, mother, siblings unknown, spouse unknown, children unknown
    • Documented consequences for social circle: house raid, interrogation
    • Documented prison visits: no info available
    • Misc: Grandson of Samuel Eliot Morison
Opsec mistakes

Category B, opsec mistakes

  • Jack Douglas Teixeira
    • Trial: United States v. Teixeira, 1:23-cr-10159, (D. Mass.)
    • Digital trail
      • Discord worked with the US govt to provide Teixeira's chat logs and sign-up details, and to delete messages and groups from their platform. ECF No 3 Attachment #1. Discord had Teixeira's name, billing details and home address. ECF No 135
      • Teixeira instructed Discord group members to delete all his messages if investigated. Teixeira later deleted the Discord server himself.
      • US govt had access logs from their official database. Timestamps between database access logs and Discord logs correlated.
      • Teixeira printed many classified documents using official printer. US govt had access to print logs. Timestamps between database access logs and print logs correlated. ECF No 135
      • Printer used was in the basement of the building, not the printer in same floor as where he worked. Printouts were taken during hours where less staff were present. ECF No 135
    • Physical trail
      • US govt found destroyed tablet, laptop, gaming console in a dumpster near Teixeira's house. US govt also found GoPro camera in a dumpster near Teixeira's house. ECF No 19
      • US govt claimed at trial that hard drive of this damaged laptop was not found.
    • People in-the-loop
      • Teixeira was told multiple times by his bosses not to conduct "deep dives" into classified material.
      • Teixeira told people at work his phone got damaged.
    • Source unknown
      • US govt at trial inferred Teixeira had been photographing documents and taking them home, to avoid using official printer.
  • Daniel Everette Hale
    • Trial: United States v. Hale, 1:19-cr-00059, (E.D. Va.)
      • Did not go to trial, hence some evidence may never have been published.
    • Digital trail
      • US govt found reporter's contact in Hale's phone's contact list. ECF No 1
      • US govt found two thumb drives at Hale's house. First thumb drive contained one page of a classified document. Second thumb drive contained TAILS installed. ECF No 1
      • US govt had logs of badge reads at workplace. US govt had logs of work computer being locked and unlocked. US govt had logs of official printer at workplace. All three timestamps correlated. ECF No 184 Attachment #1
      • US govt had print logs of all 36 documents printed. US govt correlated which of these documents were published in the reporter's book. ECF No 1
      • US govt was given access to Hale's gmail account by Google. Emails used to argue Hale had self-serving motivations. ECF No 227 Attachment #2, ECF No 227, ECF No 168 Attachment #9
    • Physical trail
      • None found
    • People in-the-loop
      • None found
    • Source unknown - This evidence might or might not have been sent to Hale's lawyer as part of the discovery process, but it was never published publicly AFAIK.
      • US govt knew Hale searched internet for info on a specific reporter. Also knew exact date and time of irl meeting planned with reporter. ECF No 1
      • US govt knew date of meeting between Hale and reporter at book fair. ECF No 1 US govt knew Hale searched for classified info the day after the meeting. ECF No 1 US govt knew Hale sent a message to close friend about this meeting, also knew contents of message. ECF No 1
      • US govt knew about another meeting between Hale and reporter, and another message to friend about this meeting. ECF No 1
      • US govt knew in detail the dates of many in-person meetings between Hale and reporter, and messages sent between Hale and Hale's friend. ECF No 1
      • US govt knew about phone call between Hale and reporter. ECF No 1
      • US govt knew the exact contents of the message where the reporter asks Hale to install Jabber. ECF No 1 US govt knew that at least three conversations occurred on Jabber. ECF No 1
      • US govt knew dates of meetings correlated with timestamps of print logs. ECF No 1
  • Reality Leigh Winner
    • Trial: United States v. Winner, 1:17-cr-00034, (S. D. Ga.)
      • Winner accepted a plea agreement, so no jury trial or sentencing trial occurred. Hence some evidence may never have been published.
    • Digital trail
      • Using search warrant, FBI seized multiple of Winner's devices.
        • Winner's mobile phone had a screenshot with The Intercept's SecureDrop address. ECF No 109
        • Winner's personal computer had saved web history including countries she wanted to fly to (search results for flights, jobs, etc) and history of various terrorist orgs. ECF No 110
        • Winner's personal computer had stored login to social media account DMs where Winner admitted to supporting Snowden and Assange, searching about Anonymous, and having anti-US motivations. ECF No 110
    • Physical trail
      • Using search warrant, FBI found handwritten notes at Winner's house.
        • Winner had notes on how to do a SIM swap. ECF No 109
        • Winner had notes on how to setup Tor and anonymous email address.
        • Winner had notes on countries she wanted to escape to, and notes supporting Taliban leaders, possibly non-serious. ECF No 109
    • People-in-the-loop
      • The Intercept contacted the US govt and sent them a hard copy of the leaked document before publishing it. The Intercept informed US govt that they received the document via post from Augusta, Georgia. This matched Winner's house address. This combined with the print logs enabled the US govt to obtain search warrant. Search Warrant, ECF No 110, ECF No 120
      • Using search warrant, FBI agents interrogated Winner at her house while they were searching it. Winner admitted to the following during the interrogation. ECF No 29. Full interview transcript: ECF No 100-1
        • Winner admitted that she printed the document, removed it from a secure building, stored it in her car for 2 days, then mailed it to the news outlet.
        • Winner admitted her phone had a screenshot with The Intercept's SecureDrop address.
        • Winner admitted she had political anti-US motivations for leaking the document.
        • Winner admitted she searched how to safely insert a USB drive into her TOP SECRET work computer. Winner admitted she inserted the USB drive into the computer.
      • Once arrested, over recorded calls from prison, Winner admitted to sister to leaking documents. ECF No 109
    • Source unknown
      • US govt claimed they had logs showing only 6 people had printed the document that The Intercept sent them. It is not clear whether these were printer logs or Winner's work computer logs or server database logs. This combined with the mailing label from The Intercept enabled them to obtain search warrant.
  • Terry J Albury
    • Trial: United States v. Albury, 0:18-cr-00067, (D. Minn.)
      • Albury pled guilty quite early, hence not much evidence was published. No jury trial occurred.
    • Digital trail
      • The public version of a document published by The Intercept contained a grey highlight proving the document was obtained by screenshotting the document from a specific web interface. Public version of another document also contained screen defects proving it was screenshotted. Redaction by The Intercept not carried out correctly. Search warrant
      • US govt had logs of Albury copy-pasting some documents into a Word document. Logs likely obtained from Albury's work computer at his office directly. Search warrant
    • Physical trail
      • US govt had CCTV footage of Albury's office room where Albury is seen with a personal camera pointed in front of his work computer. Timestamp of this footage matches those of work computer logs. Search warrant
      • Using search warrant, US govt found the following at Albury's house. ECF No 16
        • 58 sensitive and classified documents on a USB drive.
        • This USB drive was in an envelope with a reporter's phone number on it.
        • Multiple other devices that also contained copies of some of these documents.
    • People-in-the-loop
      • The Intercept made an FOIA request to US govt with one of the documents they intended to publish.
      • Later on in the trial, Albury admitted to using Tutanota end-to-end encrypted email and Tor, and to using Adobe Acrobat and Readdle software to manually edit the images and pdf files. ECF No 35
    • Source unknown
      • US govt had logs of 16 individuals who downloaded the same document from their server. They also had logs indicating Albury was the only individual of those 16 that performed cut-paste actions on the document, consistent with the grey highlight on The Intercept's publicly published documents. Other blue and orange highlights made by Albury on the documents were also found. Search warrant
      • US govt had access to an email thread between Albury and a reporter, indicating Albury may have had intention of contacting the media if internal channels would not let him speak. Email thread does not specify details of what information he is referring to. Search warrant
  • Joshua Adam Schulte
  • todo
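The recurring pattern in the opsec sections above is pairwise timestamp correlation: investigators matched database access logs against print logs, badge reads against computer unlock logs, and print logs against chat timestamps. A minimal sketch of that technique, with entirely hypothetical log data:

```python
from datetime import datetime, timedelta

# Hypothetical log entries as (timestamp, user) pairs. In the cases above,
# each log source (database access, printer, badge reader, chat platform)
# yields a list like these, and the correlation is done pairwise.
db_access = [
    (datetime(2023, 2, 1, 14, 2), "user_a"),
    (datetime(2023, 2, 1, 14, 30), "user_b"),
]
print_log = [
    (datetime(2023, 2, 1, 14, 5), "user_a"),  # printed 3 min after access
]

def correlated(events_a, events_b, window=timedelta(minutes=10)):
    """Return (user, t_a, t_b) triples where the same user appears in
    both logs within `window` of each other."""
    hits = []
    for t_a, u_a in events_a:
        for t_b, u_b in events_b:
            if u_a == u_b and abs(t_a - t_b) <= window:
                hits.append((u_a, t_a, t_b))
    return hits

hits = correlated(db_access, print_log)
# Only user_a accessed the database and printed within the window.
```

This is only an illustration of the inference, not any tool named in the court records; the actual correlations were argued from the specific ECF exhibits cited above.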
Journalism

Category B, journalism

  • Jack Douglas Teixeira
    • First publication: Semi-public Discord server Thug Shaker Central, 2022-02.
    • First publication in mass media: Ukraine War Plans Leak Prompts Pentagon Investigation, The New York Times, 2023-04-06
    • New York Times publication does not contain original documents.
      • Could not find a mirror to original documents yet. - todo
    • Helene Cooper, Eric Schmitt, Joseph F Kahn (editor) - The New York Times
    • No journalist directly contacted by the source. Journalists eventually found the discord and broadcast the information further.
    • No public record confirming journalists or editor knew the identity of the source.
  • Daniel Everette Hale
    • First publication: The Drone Papers, The Intercept, 2015-10-15
    • Jeremy Scahill, Betsy Reed (editor) - The Intercept
    • Publication contains some original documents.
    • Journalist knew the identity of the source. No public record confirming editor knew the identity of the source. (See note above on this scenario.)
  • Reality Leigh Winner
  • Terry J Albury
    • First publication: The FBI's Secret Rules, The Intercept, 2017-01-31
    • Trevor Aaronson, Cora Currier, Jenna McLaughlin, Alice Speri. Betsy Reed (editor) - The Intercept
    • Publication contains some original documents.
    • No public record confirming journalists or editor knew source identity.
  • Joshua Adam Schulte
    • First publication: Vault 7, Wikileaks, 2017-03-07
    • Anonymous team, Julian Assange (editor) - Wikileaks
    • Publication contains some original documents.
    • No public record confirming journalists or editor (of wikileaks) knew source identity.
    • 2nd attempt: From prison, he offered more documents to Shane Harris (the Washington Post) and Marcy Wheeler (Emptywheel). Both journalists knew the identity of the source.
  • James Hitselberger
    • Publication: No publication
    • Did not work with any journalists.
  • Donald Sachtleben
    • First publication: CIA thwarts new al-Qaida underwear bomb plot, Associated Press, 2012-05-07.
    • Adam Goldman, Matt Apuzzo, Ted Bridis (editor) - Associated Press
    • Sachtleben did not send original documents to the journalist, hence they're not published.
    • Court record confirms Adam Goldman knew the identity of the source. No public record confirming Matt Apuzzo or Ted Bridis knew the identity of the source. (See note above on this scenario.)
  • Chelsea Elizabeth Manning
    • First publication: Classified cable from US Embassy Reykjavik on Icesave, Wikileaks, 2010-02-18
    • Anonymous team, Julian Assange (editor) - Wikileaks
    • No public record confirming anyone at Wikileaks knew the identity of the source at the time of the leak. Julian Assange has denied knowing the identity of the source before it was publicly reported.
      • Speculation by me (Samuel): Since Adrian Lamo could figure out the identity from social media clues, and Julian Assange was also a skilled hacker, there is a significant probability Wikileaks also independently deduced the identity of the source before it was publicly reported.
    • Publication contains original documents.
  • Shamai Kedem Leibowitz
    • First publication: FBI Wiretap Transcripts: Israeli Embassy Targets Iran and U.S. Opinion, Richard Silverstein (at richardsilverstein.com), 2009-03-26
    • Richard Silverstein - independent blogger
    • Richard Silverstein knew the identity of the source.
    • Misc: Leibowitz and Silverstein later publicly accused each other of misaligned motives.
  • Samuel Loring Morison
    • First publication: Jane's Defence Weekly, volume 2 no 5, 1984-08-07 sent to newsrooms, 1984-08-11 official publishing date.
      • Could not find digitised version of magazine issue yet. - todo
    • Derek Wood (editor), Sidney Jackson (managing director), other editorial staff
    • Publication contains original documents (photographs).
    • Derek Wood knew the identity of the source. No public record confirms any other staff knew the identity of the source. Speculation: Sidney Jackson may also have known the identity of the source.
Law
  • Supervisory role played in some cases by attorneys general or assistant attorneys general: Zachary Terwilliger, John C. Demers, Matthew G. Olsen

Category B, lawyers they directly worked with

  • Jack Douglas Teixeira
    • lawyers for: Brendan O. Kelley, Gene Allen Franco, Joshua Robert Hayne (withdrawn), Michael Bachrach
    • lawyers against: Nadine Pellegrini, Jared C. Dolan, Jason A. Casey, Christina A. Clark, Joshua Levy
  • Daniel Everette Hale
    • lawyers for: Todd Richman, Cadence Mertz, Ruth Vinson, Tor Ekeland, Jesselyn Radack
    • lawyers against: Gordon Kromberg, Alexander Berrang, Heather M. Schmidt
  • Reality Leigh Winner
    • lawyers for: Titus Nichols, Alison Grinter Allen, Joe D. Whitley, Matthew S. Chester
    • lawyers against: Julie A. Edelstein, Jennifer G. Solari, Bobby L. Christine
  • Terry J Albury
    • lawyers for: JaneAnne Murray, Joshua L. Dratel
    • lawyers against: Danya E. Atiyeh, Patrick T. Murphy, David C. Recker
  • Joshua Adam Schulte
    • lawyers for: Joshua Adam Schulte (represented self), Sabrina P. Shroff, Deborah Austern Colson (withdrawn), Matthew B. Larsen, Sean Michael Maher, James Matthew Branden, Lauren Martine Dolecki, Edward S Zas, Allegra Glashausser
    • lawyers against: David W. Denton Jr., Michael D. Lockard, Nicholas S. Bradley, Sidhardha Kamaraju, Matthew Laroche, Scott McCulloch, Damian Williams (supervisory), Geoffrey S. Berman (supervisory)
  • James Hitselberger
    • lawyers for: Mary Manning Petras, Rosanna Margaret Taormina, A. J. Kramer, Carlos J. Vanegas
    • lawyers against: Jay I. Bratt, Mona N. Sahaf, Thomas A. Bednar, Deborah A. Curtis
  • Donald Sachtleben
    • lawyers for: Charles C. Hayes, Kathleen M. Sweeney, Larry A. Mackey
    • lawyers against: Jonathan M. Malis, G. Michael Harvey, Richard S. Scott, Mona N. Sahaf, Steven D. DeBrota, Joseph H. Hogsett
  • Chelsea Elizabeth Manning
    • lawyers for: David E. Coombs, Nancy Hollander, Vincent Ward, Matthew Kemkes, Paul Bouchard, Chase Strangio. ACLU
    • lawyers against: Ashden Fein, Joe Morrow, Angel Overgaard, Hunter Whyte
  • Shamai Kedem Leibowitz
    • lawyers for: Cary D. Feldman (withdrawn), Richard M. Asche
    • lawyers against: Steven M. Dunne, Kathleen M. Kedian, David Kris (supervisory)
  • Samuel Loring Morison
    • lawyers for: Jacob A. Stein, Robert F. Muse, Mark H. Lynch, Charles F.C. Ruff, Neil K. Roman, Steven F. Reich, Armistead P. Rood
    • lawyers against: Michael Schatzow, Breckinridge Long Willcox. 2nd case: James G. Warwick, Rod J. Rosenstein

Category B, public fundraising for legal defence

  • Jack Douglas Teixeira - Could not find
  • Daniel Everette Hale - Yes
  • Reality Leigh Winner - Yes
  • Terry J Albury - Yes
  • Joshua Adam Schulte - Could not find
  • James Hitselberger - Yes
  • Donald Sachtleben - Could not find
  • Chelsea Elizabeth Manning - Yes
  • Shamai Kedem Leibowitz - Could not find
  • Samuel Loring Morison - Could not find
Miscellaneous information
  • Jack Douglas Teixeira
    • Misc info brought up during the trial
      • Discord username: TheExcaliburEffect, Discord server: Thug Shaker Central
      • Teixeira's bedroom photos, later argued these guns were fake. ECF No 19 Attachment #6.
      • Teixeira's arrest photos. ECF No 20 Attachment #1
      • Teixeira's Discord messages quoted verbatim. ECF No 19 Attachment #8. Teixeira repeatedly boasts about his leak. ECF No 34
      • Teixeira suspended from school for racial threats, rejected for gun license multiple times as a result, used security clearance to get gun license, had many guns in gun locker. ECF No 19 Attachment #5
      • Teixeira made social media statements that he might conduct a mass shooting. Teixeira's work colleague said that he might get shot by Teixeira. (Court record is not clear if these were jokes or not.) ECF No 19 Attachment #4
      • Teixeira denied bail. ECF No 34
      • After guilty plea, Teixeira attended four hour interrogation where he admitted his actions. ECF No 142
      • In order to reduce prison sentence, Teixeira's lawyer got a psychiatrist to testify that Teixeira was diagnosed with ADHD and autism, and that Teixeira was naive about who the recipients of the info in the Discord were. ECF No 142
      • In order to reduce prison sentence, Teixeira's lawyer published many stories and letters about Teixeira's childhood. ECF No 142 Attachment #2
  • Daniel Everette Hale
    • Misc info brought up during the trial
      • Software used by Hale for converting file formats and printing documents: O&K, GhostPCL. ECF No 168
      • Lots of argumentation back-and-forth on whether Hale's action of leaking documents could be seen as stealing more than $1000 of value from the govt or not. ECF No 195
      • In order to reduce prison sentence, Hale's lawyer attached letters in support of Hale. ECF No 240 Attachment #1, ECF No 240 Attachment #4
      • Hale's lawyer obtained expert testimony indicating no harm occurred as a result of the disclosure. ECF No 240 Attachment #7
  • Reality Leigh Winner
    • Misc info brought up during the trial
      • On a recorded phone call from prison, Winner asked mother to transfer funds out of her bank account to avoid them being frozen. ECF No 109
      • US govt had logs showing USB drive was inserted into work computer, but no logs indicating filenames, hashes or timestamps of exact files transferred.
      • Misc info also published to trial. House photos ECF No 234. Warrants ECF No 235
      • Lots of argumentation occurred before trial on whether the FBI interrogation transcript was admissible in court or not, as Winner was not formally informed she was under arrest or read her Miranda rights.
      • Someone else set up a GoFundMe for Winner that received $12k. No evidence confirming Winner was able to access this money.
      • Winner's anonymous email address used to contact The Intercept: da3re.fitness@gmail.com
      • Winner planned to contact Wikileaks first but was underwhelmed by what they had to offer. Hence contacted The Intercept.
  • Terry J Albury
    • Misc info brought up during trial
Category C: US govt whistleblowers/leakers who did not leak classified documents but leaked information, and were imprisoned

Category C (sorted by date)

  • Henry Kyle Frese
  • John Chris Kiriakou
  • Stephen Jin-Woo Kim
  • Jeffrey Alexander Sterling
Classification status

Category C, classification status of leaked info (sorted by date)

  • Henry Kyle Frese
    • classified at time of leak
    • some documents TOP SECRET/SCI, some documents SECRET (?? seems unclear)
    • remains classified as of 2025
  • John Chris Kiriakou
    • classified at time of leak
    • officer identity SECRET, interrogation details TOP SECRET/SCI or CONFIDENTIAL/SECRET (??? seems unclear)
    • officer identity remains classified as of 2025, partial info about interrogation methods declassified as of 2025
  • Stephen Jin-Woo Kim
    • classified at time of leak
    • TOP SECRET/SCI
    • remains classified as of 2025
  • Jeffrey Alexander Sterling
    • classified at time of leak
    • TOP SECRET/SCI, or SECRET (??? seems unclear)
    • remains classified as of 2025
Key dates

Category C, key dates (sorted by date)

  • Henry Kyle Frese
    • first transmission 2018-04-27, first publication 2018-05-02, arrest 2019-10-09, public identity 2019-10-09, released 2022-10-14
  • John Chris Kiriakou
    • first non-classified transmission 2007-12-10, first non-classified publication 2007-12
    • identity of CIA officer Deuce Martinez was classified. Deuce Martinez's involvement independently suspected in public since 2006-06-20. first transmission of classified info 2008-08 (email from public record in 2008-08, previous email in 2008-04 alleged), first public publication of Martinez's name 2008-06-22 (Scott Shane, NYT), first publication in a classified legal hearing 2009, clear public publication of classified info with surrounding context 2015-02-18.
    • arrest 2012-01-23, public identity 2012-01-23, released 2025-02-03
  • Stephen Jin-Woo Kim
    • first transmission 2009-06, first publication 2009-06-11, indicted 2010-08-18, arrest 2010-08-24, public identity as whistleblower confirmed 2010-08-24, released 2015 (??? exact date not clear)
  • Jeffrey Alexander Sterling
    • first transmission 2003-03 (first phone call 2003-02-27 out of multiple phone calls, not clear which phone call revealed classified info), first publication 2006-01-03, arrest 2011-01-06, public identity 2011-01-06 (identity was semi-private before this), released 2018-01 (exact date unclear)
Social circle

Category C, consequences on social circle (only done surface-level research so far)

  • Henry Kyle Frese
    • Documented social circle at the time of leak: Father, mother, 3 sisters, girlfriend
    • Documented consequences for social circle: house raid, interrogation
    • Documented prison visits: no info available
  • John Chris Kiriakou
    • Documented social circle at the time of leak: Father, mother, siblings unknown, wife, 5 children (of which 2 from wife, 3 from ex-wife)
    • Documented consequences for social circle: house raid, interrogation, wiretap, polygraph, cut off by extended circle, significant legal expenses, online and in-person verbal harassment
    • Documented prison visits: Multiple visits by spouse and children, visits by journalists
    • Misc: Publicly talks about how being shunned by his entire social circle was painful
  • Stephen Jin-Woo Kim
    • Documented social circle at the time of leak: Father, mother, girlfriend (later wife), 1 sister, other siblings unknown
    • Documented consequences for social circle: house raid, interrogation, significant legal expenses, online harassment
    • Documented prison visits: Multiple visits by family members, visit by journalist
    • Misc: James Rosen, journalist, visited Stephen Kim in prison to apologise.
  • Jeffrey Alexander Sterling
    • Documented social circle at the time of leak: Father, mother, multiple siblings, wife, no children
    • Documented consequences for social circle: house raid, interrogation, wiretap, significant legal expenses, online harassment
    • Documented prison visits: Multiple visits by wife. Some journalists were allowed visits and others were denied.
    • Misc: Lost house and nearly went bankrupt due to legal fees. Supported by wife throughout trial and imprisonment.
Opsec mistakes

Category C, opsec mistakes

  • todo
Journalism

Category C, journalism

  • Henry Kyle Frese
  • John Chris Kiriakou
    • First publication with name of CIA officer (considered classified info): Inside a 9/11 Mastermind's Interrogation, The New York Times, 2008-06-22
    • Scott Shane, Bill Keller (editor) - The New York Times. Info also offered to: Matthew Cole - the Intercept.
    • Both journalists and editor knew the identity of the source.
    • Misc: John Kiriakou publicly accuses Matthew Cole, journalist at The Intercept, of getting him imprisoned.
  • Stephen Jin-Woo Kim
    • First publication: NK's Post UN Sanctions Plans, Revealed, Fox News, 2009-06-11
    • James Rosen, Michael Clemente (editor), Bill Sammon (editor) - Fox News
    • Kim did not send any original documents to the journalist, hence they're not published.
    • James Rosen knew the identity of the source. No public record confirming the editor knew the identity of the source. (See note above on this scenario.)
  • Jeffrey Alexander Sterling
    • First publication: State of War, James Risen, 2006-01-03, published by Free Press under Simon & Schuster. State of War, Anna's Archive book download. James Risen was a journalist at The New York Times.
    • Publication does not contain original documents, hence they're not published.
    • James Risen knew the identity of the source. Court record (US v Sterling) confirms James' wife Holly also knew the identity of the source. Multiple intelligence community members also suspected the identity of the source. Public record is not clear on everyone Sterling informed that he was the source.
Law

Category C, lawyers they directly worked with

  • Henry Kyle Frese
    • lawyers for: Stuart Sears
    • lawyers against: Jennifer Kennedy Gellie, Danya E. Atiyeh, Neil Hammerstrom
  • John Chris Kiriakou
    • lawyers for: Robert Trout, Plato Cacheris, John F. Hundley, Jesse Isaac Winograd, Jesselyn Radack (advisory)
    • lawyers against: Neil H. MacBride, Mark Schneider, Iris Lan, Patrick Fitzgerald (absent), Patrick J. Fitzgerald, Ryan Fayhee, William N. Hammerstrom Jr., Lisa Owings
  • Stephen Jin-Woo Kim
    • lawyers for: Abbe D. Lowell, Paul M. Thompson, James M. Commons, Ruth Wedgwood
    • lawyers against: Michael Harvey, Jonathan M. Malis, Thomas A. Bednar, Deborah A. Curtis, Julie A. Edelstein, Ronald C. Machen Jr. (supervisory)
  • Jeffrey Alexander Sterling
    • lawyers for: Edward MacMahon, Barry Pollack, William James Trunk, J. Richard Supple Jr., Mia Haessly, Lawrence S. Robbins
    • lawyers against: James L. Trump, Eric G. Olshan, Dennis Fitzpatrick, William M. Welch II (withdrawn), Timothy Kelly, Neil H. MacBride, Dana J. Boente (supervisory), Robert A. Parker (supervisory), Leslie R. Caldwell (supervisory), Sung-Hee Suh (supervisory)

Category C, public fundraising for legal defence

  • Henry Kyle Frese - Could not find
  • John Chris Kiriakou - Yes
  • Stephen Jin-Woo Kim - Could not find
  • Jeffrey Alexander Sterling - Yes
Misc
  • empty
Category D: US govt whistleblowers/leakers who did not leak classified documents but may have leaked classified information, and were not imprisoned

Category D (sorted by date)

  • Thomas Andrews Drake
  • Mark Lee Klein
  • Russell D Tice
  • Thomas M Tamm
  • Sibel Deniz Edmonds
  • Edward Loomis
  • William Edward Binney
  • John Kirk Wiebe
  • Perry Douglas Fellwock
Classification status

Category D, classification status of leaked info (sorted by date)

  • Thomas Andrews Drake
    • not classified
    • govt alleged classified documents leak, judge declared those documents were not classified
  • Mark Lee Klein
    • leaked existence of program that may have been classified, but no classified documents
  • Russell D Tice
    • leaked existence of classified program, but no classified documents
  • Thomas M Tamm
    • leaked existence of classified program, but no classified documents
  • Sibel Deniz Edmonds
    • leaked details that were retroactively classified after the leak, did not leak classified documents
  • Edward Loomis
    • leaked details of classified program, but no classified documents
  • William Edward Binney
    • leaked existence and details of classified program, but no classified documents (as per court record)
  • John Kirk Wiebe
    • leaked existence and details of classified program, but no classified documents (as per court record)
  • Perry Douglas Fellwock
    • leaked existence and extensive amount of details of classified program, likely did not leak classified documents (as per court record)
Key dates

Category D, key dates (sorted by date)

  • Thomas Andrews Drake - first transmission to journalist 2005-11 to 2006-02 (??? exact date unclear), first publication 2006-01-29, house raid 2007-11-28, trial sentencing date 2011-07-15, no arrest
  • Mark Lee Klein - first transmission to EFF (for sealed legal hearing) 2006-01-20, first transmission to journalist 2006-01 to 2006-02 (??? exact date unclear), first publication 2006-04-06, no house raid / indictment / arrest
  • Russell D Tice - first transmission to journalist 2004 (??? exact date unclear), first internal complaint to DoD IG 2004 or 2005 (??? exact date unclear), security clearance revoked 2005-05 (??? exact date unclear), first publication 2005-12-16, no house raid / indictment / arrest
  • Thomas M Tamm - first transmission 2006-03 to 2006-06 (??? exact date unclear), first publication 2005-12-16, house raid 2007-08-01, public identity 2008-12-13, no indictment, investigation formally closed 2011-04
  • Sibel Deniz Edmonds - first internal complaint 2001-12-02, fired 2002-03-22, first transmission 2002 (??? exact date unclear), first publication (TV interview) 2002-10-27
  • Edward Loomis - first internal complaint 2002-11-09, no transmission of secret info to outside sources (??? seems unclear), house raid 2007-07-26, no indictment
  • William Edward Binney - Binney resigned 2001-10-31, first internal complaint (with Wiebe) 2002-11-09, first transmission - exact date unclear, first publication 2011-05-23 (seems unclear if previous publication existed), house raid (with Wiebe) 2007-07-26, no indictment, public identity 2011-05-23 (seems unclear if previous publication existed)
  • John Kirk Wiebe - first internal complaint (with Binney) 2002-11-09, first transmission - exact date unclear, first publication (with Binney) 2011-05-23 (seems unclear if previous publication existed), house raid (with Binney) 2007-07-26, no indictment, public identity 2011-05-23 (seems unclear if previous publication existed)
  • Perry Douglas Fellwock - first transmission 1972 (exact date not clear), first publication 1972-08, public identity 1972-07-18, no house raid, no indictment
Social circle

Category D, consequences on social circle

  • Thomas Andrews Drake
    • Documented social circle at the time of leak: Father, mother, siblings unknown, wife, five sons
    • Documented consequences for social circle: house raid, interrogation, wiretap, cut off by extended circle, significant legal expenses, in-person and online verbal harassment
    • Misc: lost pension worth over $1M
  • Mark Lee Klein
    • Documented social circle at the time of leak: Father, mother, 1 older brother, ex-wife, wife, no children
    • Documented consequences for social circle: in-person verbal harassment
  • Russell D Tice
    • Documented social circle at the time of leak: todo
    • Documented consequences for social circle: todo
    • Misc: todo
  • Thomas M Tamm
    • Documented social circle at the time of leak: Father, mother, brother, another late brother, wife, three children
    • Documented consequences for social circle: house raid, interrogation, wiretap, significant legal expenses, online verbal harassment
    • Misc: Thomas Tamm suffered depression after the leak. Lost employment and went into debt due to legal expenses, later received funding.
  • Sibel Deniz Edmonds
    • Documented social circle at the time of leak: Late father, mother, two younger sisters, husband, no children
    • Documented consequences for social circle: interrogation, wiretap, significant legal expenses, cut off by extended circle, online and in-person verbal harassment
    • Misc: Computers were searched, but no house raid
  • Edward Loomis
    • Documented social circle at the time of leak: Father, mother, siblings unknown, wife, at least two children, children unknown (???)
    • Documented consequences for social circle: house raid, interrogation, wiretap, cut off by extended circle and immediate family members (as per Ed Loomis' own decision)
    • Misc: Kirk Wiebe claims Edward Loomis' divorce with his wife was a result of the leak. Edward Loomis isolated from immediate family for multiple years to prevent them finding out that he was the source.
  • William Edward Binney
    • Documented social circle at the time of leak: Father, mother, elder brother, wife, three children
    • Documented consequences for social circle: house raid, interrogation, wiretap, in-person verbal harassment, cut off by extended circle
    • Misc: ??? consequences on family relationships not clear. More verbal harassment due to recent political opinions shared by Bill Binney
  • John Kirk Wiebe
    • Documented social circle at the time of leak: Father, mother, siblings unknown, wife, multiple children, children unknown (???)
    • Documented consequences for social circle: house raid, interrogation, wiretap, faced significant legal expenses
  • Perry Douglas Fellwock
    • Documented social circle at the time of leak: unknown. (As of a 2013 interview, Fellwock continues to successfully deflect journalists' questions about who his family members are.)
    • Documented consequences for social circle: unknown
Opsec mistakes

Category D, opsec mistakes

  • todo
Journalism

Category D, journalists they directly worked with

  • Thomas Andrews Drake
    • First publication: Biggest boondoggle going on now, Baltimore Sun, 2006-01-29
    • No original documents published
    • Siobhan Gorman, Timothy A Franklin (editor) - Baltimore Sun
    • Diane S Roark knew identity of the source. No public record confirming anyone at Baltimore Sun (including Siobhan Gorman or the editor) knew identity of the source. Anonymous encrypted email used for communication.
  • Mark Lee Klein
    • First publication: Wiretap whistleblower's account, Wired, 2006-04-06
    • No original documents published in this media publication. Three documents were provided to sealed legal hearing.
    • Ryan Singel, Evan Hansen (editor) - Wired
    • Source declared identity publicly in this publication.
  • Russell D Tice
    • First publication: Same as Thomas Tamm, listed below. All details similar.
    • No public record confirming anyone knew identity of the source. Speculation (as per Samuel): James Risen, Eric Lichtblau, Bill Keller, and Arthur Sulzberger Jr likely knew the identity of the source.
  • Thomas M Tamm
    • First publication: Bush lets US spy on callers without courts, The New York Times, 2005-12-16
    • No original documents published
    • James Risen, Eric Lichtblau, Bill Keller (editor) - The New York Times
    • No public record confirming anyone knew identity of the source. Speculation (as per Samuel): James Risen, Eric Lichtblau, Bill Keller, and Arthur Sulzberger Jr likely knew the identity of the source.
  • Sibel Deniz Edmonds
  • Edward Loomis
    • As per the public record, no media publication directly used sensitive info from him. (He gave a TV interview in 2013, but all details mentioned were public record by then.)
  • William Edward Binney
    • First publication: The Secret Sharer, The New Yorker, 2011-05-23
    • No original documents published
    • Jane Mayer, David Remnick (editor-in-chief). Also, likely editors: Pamela McCarthy, Dorothy Wickenden, Henry Finder, Daniel Zalewski
    • Source declared identity publicly in the article.
  • John Kirk Wiebe
    • First publication same as William Binney, see above.
    • Source declared identity publicly in the article.
  • Perry Douglas Fellwock
Law

Category D, lawyers they directly worked with

  • Thomas Andrews Drake
    • lawyers for: James Wyda, Deborah Boardman, Jesselyn Radack, Meghan A. Skelton, James Bamford (advisory)
    • lawyers against: William M. Welch II, John P. Pearson, Lanny A. Breuer, Steven Tyrrell
  • Mark Lee Klein
    • Note: No trial against Mark Lee Klein directly, trials were fought by EFF against AT&T and US govt. Landmark case: Jewel v NSA.
    • lawyers for: EFF legal team (Kurt Opsahl, Kevin S. Bankston, Cindy Cohn, Lee Tien, James S. Tyre, Corynne McSherry, Mark Rumold, Jamie L. Williams, Andrew Crocker), Bert Voorhees, Theresa M. Traber, Keker and Van Nest LLP (Rachael E. Meny, Benjamin W. Berkowitz, Michael S. Kwun, Audrey Walton-Hadlock, Philip J. Tassin), Richard R. Wiebe, Aram Antaramian, Thomas E. Moore III
    • lawyers against: Representing AT&T/telecoms: Michael Kellogg, Brian Matthew Boynton, Sidley Austin LLP (Bradford Allan Berenson, Eric Dean McArthur, Eric Shumsky), Pillsbury Winthrop Shaw Pittman LLP (Bruce A. Ericson, Kevin M. Fong); Representing US govt: Peter Keisler, Michael Mukasey (supervisory), Anthony Joseph Coppolino, Thomas Mark Bondy, Kevin V. Ryan, Carl J. Nichols, Joseph H. Hunt, Andrew H. Tannenbaum
  • Russell D Tice
    • Note: No trial against Russell D Tice directly.
    • lawyers for: Mark Zaid, Tom Devine, Jesselyn Radack, Roy W. Krieger
    • lawyers against: Alberto R. Gonzales, David Kris, Paul J. McNulty (related hearing), Robert L. Deitz (related hearing)
  • Thomas M Tamm
    • Note: No trial against Thomas M Tamm for whistleblowing, trial was for revoking bar license. He won the trial and kept his license.
    • lawyers for: Paul Kemp, Michael Frisch, Cary Feldman, Asa Hutchinson
    • lawyers against: Hamilton P. Fox III, Gene Shipp
  • Sibel Deniz Edmonds
    • lawyers for: Mark S. Zaid, Michael D. Kohn, ACLU legal team (Benjamin Wizner, Ann Beeson, Art Spitzer, Melissa Goodman), Eric Seiff, Roy W. Krieger
    • lawyers against: John Ashcroft (supervisory), Paul D. Clement, Peter D. Keisler, Douglas Letter, H. Thomas Byron III, Vesper Mei, Valerie Caproni, Kimberly Dawn Ziropoulos, Bruce Fein, Dan Marino
  • Edward Loomis
    • Note: No trial against Edward Loomis
    • lawyers for: Jesselyn Radack (advisory)
    • lawyers against: none
  • William Edward Binney, John Kirk Wiebe
    • Note: No important trial involving William Edward Binney or John Kirk Wiebe. Main trials were against Thomas Andrews Drake, and the landmark case Jewel v NSA. William Edward Binney had to sue only to retrieve his personal belongings taken from him during the house raid.
    • lawyers for: John K. Wiebe, William Edward Binney
    • lawyers against: Rod J. Rosenstein (supervisory)
  • Perry Douglas Fellwock
    • Note: No trial involving Perry Douglas Fellwock
    • lawyers for: no public info (??? seems unclear)
    • lawyers against: none

Category D, public fundraising for legal defence

  • Thomas Andrews Drake - Yes
  • Mark Lee Klein - Could not find
  • Russell D Tice - Could not find
  • Thomas M Tamm - Yes
  • Sibel Deniz Edmonds - todo
  • Edward Loomis - Could not find
  • William Edward Binney - Could not find
  • John Kirk Wiebe - Could not find
  • Perry Douglas Fellwock - Could not find
Misc
  • empty
Category E: Details of US govt whistleblowers/leakers who did not leak classified documents or information, and were not imprisoned

Category E (sorted by date, incomplete list)

  • John R Crane
  • James Robertson
  • Robert J. MacLean - leaked SSI, which is unclassified but restricted
  • Diane Roark
  • todo

more info

  • todo
Special case: Julian Assange

Assange is a rare example of a media publisher, not a whistleblower, who went to prison. I have yet to find another example of a publisher or journalist who spent a significant amount of time in prison for working with whistleblowers.

more info

  • todo
Details of spies against US govt
  • todo
Misc
  • Robert Birchum - ??? - TO DO - more research
  • Petraeus - ??? - TO DO - more research
